Thomas Henson


What is an Industrial IoT Engineer with Derek Morgan

January 22, 2021 by Thomas Henson

Industrial IoT Engineer with Derek Morgan

Explore Career Paths as an Industrial IoT Engineer

IoT investments are projected to grow by 13.6% through 2022. There is a huge opportunity for developers to jump into a career in IoT, specifically as Industrial IoT Engineers. Data is at the forefront of the skills needed for IoT workflows. In this interview I sat down with Derek Morgan to discuss the role of the IoT Engineer.

Derek has quite a bit of experience in IoT and has been focusing on the manufacturing space. During this episode of Big Data Big Questions we break down the skills needed to enter the IoT engineering space and which certifications matter. Here are a few of the items we cover:

    • Tech stack: Rechner, Postgres, Python, and Terraform
    • How C++ doesn’t apply here
    • Cloud vs. Private Cloud for IoT
    • Security Challenges with IoT
    • Opportunities for IoT Engineers in this space

Make sure to check out the full video below to understand the role of the Industrial IoT Engineer.

Industrial IoT Engineer Interview

IoT Engineer Show Notes

Derek Morgan LinkedIn
Terraform
More than Certified in Terraform Course

Filed Under: Career Tagged With: IoT, IoT Engineer, Python

Defining IoT Message Brokers

July 8, 2018 by Thomas Henson

IoT Message Brokers

How Do IoT Message Brokers Work?

What are message brokers in IoT? Message brokers are the middleware in IoT and streaming applications. Think of these systems as queuing systems that allow for quick writes to one system that can then be read by many applications.

Message brokers are critical to IoT and streaming analytics because they give Data Engineers the ability to quickly move data/messages into a storage container. Once in those storage containers, the data can be read by multiple sources.
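As a rough illustration of that "write once, read by many" idea, here is a toy in-memory broker. Real brokers like Kafka add persistence, partitioning, and delivery guarantees that this sketch ignores, and all topic and device names below are made up:

```python
from collections import defaultdict, deque

class TinyBroker:
    """Toy message broker: a producer writes once to a topic, and every
    subscriber reads from its own independent queue."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of deques

    def subscribe(self, topic):
        queue = deque()
        self._subscribers[topic].append(queue)
        return queue

    def publish(self, topic, message):
        # One quick write fans out to every subscriber's queue.
        for queue in self._subscribers[topic]:
            queue.append(message)

broker = TinyBroker()
analytics = broker.subscribe("sensor/temp")
storage = broker.subscribe("sensor/temp")

broker.publish("sensor/temp", {"device": "pi-01", "temp_f": 72})

# Each consumer reads the same message at its own pace.
print(analytics.popleft())
print(storage.popleft())
```

The key property is the fan-out: the producer does one cheap append, and any number of downstream readers consume independently without slowing the writer down.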

In this video we will walk through the different open source message brokers in IoT & Streaming workflows.

Video

Transcript – Defining IoT Message Brokers

Hi, folks! Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today I want to talk more about IoT, like I was talking about in the last few videos. This is something huge, something I think a lot of data engineers should really start digging into. These are workloads we’re going to see, and even modern application developers, maybe you don’t do big data, you’re going to be impacted by this. I want to talk about message brokers in IoT: what a message broker is, how that architecture works, and also some of the major players in there. You’ve probably heard of a few of these, but find out more right after this.
Welcome back. Thanks for tuning in. Today, we’re going to dig into message brokers in IoT, so I really want to talk about how message brokers work in IoT and how the data gets pushed. It’s a little bit different, right? It’s not your traditional application. In IoT, your devices are out there with IP connections, and may have a spotty connection, so how do you ensure that you can bring the data back in? This is where we start to see message brokers being used.

Message brokers are the middleware in distributed applications, so it’s like a queueing system. It’s going to handle message validation, transformation, and the routing of the messages. It allows you to move data in. Think about it: if you have a Raspberry Pi set up for your garage, every time your garage door opens, you send a message out, and it sits in a queue at your message broker, where you can know that, hey, that garage door was in an open state, but now it’s in a closed state, or vice versa. Then, if it’s in an open state, maybe I’ve got somebody that subscribed to it to turn my air down, because chances are, if my garage door is open, it means I’m going to be home, or I’m just pulling in, so I want it to go ahead and kick that air down for me.
The architecture behind these message brokers is normally going to be publish-subscribe. Your IoT devices are going to publish updates. Like I said, they may have a spotty connection, so it’s important for them to be able to send those out. A device isn’t constantly sending out, “I’m open, I’m open, I’m open.” It’s going to send an update whenever that state has changed, whenever it has a connection. If you don’t have a connection for your garage door opener and it changes from open to closed, it’s still going to show as open in that message broker. Once that connection comes back up, it’s going to update to closed. This gives you the ability to, one, work with non-persistent connections and locations where you’re not going to have such a great connection, but it’s also going to give you the ability to have multiple subscribers. I talked about the air conditioner, but what about other applications?

What if I wanted to have certain lights that came on? What if I wanted to have multiple different subscribers, or different applications, or different, other IoT devices that are looking and keying off of what happens to that garage door from that Raspberry Pi?
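The garage-door example above can be sketched with a last-value broker that retains state per topic, similar in spirit to MQTT’s retained messages. The topic name and the air-conditioner rule are made up for illustration:

```python
from collections import defaultdict

class RetainedStateBroker:
    """Toy publish-subscribe broker that retains the last message per topic,
    so subscribers always see the latest known state."""

    def __init__(self):
        self._retained = {}
        self._callbacks = defaultdict(list)

    def subscribe(self, topic, callback):
        self._callbacks[topic].append(callback)
        # A late subscriber immediately sees the last known state.
        if topic in self._retained:
            callback(self._retained[topic])

    def publish(self, topic, state):
        self._retained[topic] = state
        for callback in self._callbacks[topic]:
            callback(state)

broker = RetainedStateBroker()
events = []

# Hypothetical rule: kick the air conditioner on when the garage door opens.
broker.subscribe("home/garage-door",
                 lambda state: events.append("ac on" if state == "open" else "ac eco"))

broker.publish("home/garage-door", "open")   # the Pi reconnects and reports
print(events[-1])  # ac on
```

Because the last value is retained, a subscriber that connects later (the lights, another app) still learns the current door state without waiting for the next publish.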

That’s just a little bit about that publish-subscribe pattern. I’ll probably do another video digging a little deeper, maybe throw up some slides on it, but I did want to talk a little bit about what some of the message brokers in IoT are. The first one I want to talk about is Apache Kafka. Kafka was incubated and developed out of LinkedIn, where they were looking for ways to take in all these messages and have them in a queueing system. Think about what they were doing; we’re talking about millions and millions of messages, right? For many years it was used in their production. You see it used a lot in streaming analytics. You’ve heard me talk about Kafka in the Lambda architecture, being able to support that streaming analytics and have that queueing system. As those messages come in, you just don’t have time for them to hit HDFS right then. That gives you the ability. Another one is Pravega. Pravega is open source out of Dell EMC. You’ve heard me talk about it when we talk about the Kappa architecture. This gives you the ability to have that messaging queue for those devices as messages come in. They’re sitting in that queue system, but because it’s part of the Kappa architecture, even your batch writes and your streaming writes can all be accessed through Pravega. Versus, typically, when we talk about a Lambda architecture, we have our batch layer, traditionally probably going to be in HDFS. Then you have your streaming layer that might be in Kafka, or Spark Streaming, or some of the other applications.

Then, you have two different code bases to be able to do that. Pravega was built from the ground up for streaming architectures, but it also gives you the ability to really take advantage of the Kappa architecture and have one code base to use to access and write your jobs, whether it be Spark jobs or old MapReduce jobs, those types of things. Then, the third one I wanted to talk about is RabbitMQ. Another message broker in IoT is RabbitMQ. It was originally developed for web services, to be able to respond to a call request, and if you look at the frameworks and languages it supports, we’re talking Ruby, PHP, .NET, a lot of the web development stack, even a lot of JavaScript. I’ve seen some people who have courses out there on RabbitMQ just for the JavaScript developer. It’s another message queueing system, built to stream, built for streaming analytics, and able to distribute those messages.

It’s still not as widely seen or as popular as Kafka when we start talking about big data analytics, but you’re starting to see a little movement in that area. There are also other ones out there, with Azure having one, and AWS IoT uses publish-subscribe in the architecture for their IoT platform. There are a lot of different ways to use these message brokers. I think this is a concept you really should be familiar with to some extent, because you’re probably already using one; you maybe just haven’t referred to it as a message broker.

That’s all I have for today. Make sure you subscribe to the YouTube channel here so you never miss an episode, and this gives you an opportunity to ask questions. Submit them down in the comments section below, and always stay tuned. Make sure to keep your big data and data engineering knowledge on point.

Thanks again.

IoT Message Broker Show Links

Apache Kafka  – https://kafka.apache.org/
Pravega – http://pravega.io/ 
RabbitMQ – https://www.rabbitmq.com/ 

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: IoT Tagged With: IoT, Kafka, Message Brokers

Phases of IoT Application Development

June 19, 2018 by Thomas Henson

Phases of IoT Application Development

IoT Application Development

The Internet of Things is generating many opportunities for Data Engineers to develop useful applications. Think about self-driving cars; each one is just a large moving IoT device. When developing IoT applications, developers typically start with three different phases in mind. In this video I will explain the 3 Phases of IoT Application Development.

Transcript – Phases of IoT Application Development

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s episode, I wanna talk more about IoT. So, I know we started talking about it in the previous video, but I really wanna dig into it a little bit more because I think it’s something very important for data engineers and anybody involved in any kind of modern applications or anybody involved in Big Data. So, find out about the phases, the three phases of IoT right after this.

Welcome back. And so, today, we’re gonna continue to talk more about IoT, or the Internet of Things. And so, if you’re not familiar, I’ve got a video up here where we talked about why it’s important for data engineers to know, but I think it’s important for anybody involved in modern applications or even on the business side of things. You’re really gonna see a lot of different information coming into your data center and into your projects because of these sensors out there. So, let’s get a little more comfortable with what’s going on with the technology and how it’s gonna apply to us.

And so, in this video, I wanna talk about the three phases. So, I think there are three phases of IoT and I think we’re starting to get into the third phase, and you’ll see why it’s gonna make sense for data engineers and modern applications when we talk about that third phase.

And so, just as a recap, remember, IoT is not a new concept. It’s the ability to have devices out in the physical world that have some kind of IP address, but are also able to send data and receive data back from your core data center or from your core analytics processing. So, think about the example I’ve used before, the dash button, right? You have a dash button where, if you’re out of toilet paper or whatever it is in your house, you’re able to push that button. It connects out to a gateway, it’s logged in the cloud with Amazon to be able to say, “Hey, order some more [INAUDIBLE 00:02:02] of this particular brand,” and send it to your door. So, a real quick example there.

But let’s talk a little bit more about the phases, and I think that will give you more of an understanding of, “Okay, this is how that concept and that whole ecosystem of devices and data and gateways all work together.” And so, just like from a web development perspective we talked about Web 1.0 and Web 2.0, I think with IoT we’ve gone through phases of IoT 1.0 and 2.0. I think these phases are more collapsed than the phases of the web, and part of that is just that we change so fast.

And so, the first phase was everybody had a sensor, right? Maybe not a smartwatch exactly, but think about tracking and the smartwatches. Everybody was like, “Oh, it’s kinda cool, right? I can get in contests with my friends and track how many steps I have.” That’s pretty cool. Really, we didn’t understand what to do with it. It was still somewhat of a novelty, and so everybody who already had these sensors just really didn’t know what to do with them other than tracking, right? Those are things we had been doing before, but now we have an internet connection and we can kinda control them on our phone.

Fast-forward into phase two, so once we go into phase two, it wasn’t just about these smart trackers and these devices that were attached to us, but it started to become… we had smart everything in our homes, right? So, in phase two, we started having, think of a refrigerator, so you had a smart refrigerator, and people are like, “Well, that’s kinda cool. We have a refrigerator that’s connected to the internet. I can look at photos of it from my phone. And so, if I’m at the grocery store, that’s like, hey, do I need any more ranch dressing or do I need any more Tide Pods…” well, maybe Tide Pods are in your refrigerator, but, well, hopefully not. But if you had pickles or things that you’re looking for at the grocery store or maybe even just your washing machine, you’re able to turn your washing machine on with your device and say, “Okay, let’s turn the washing machine on.” That’s pretty cool. You set it up.

It’s still kinda novel, and not really where we see this going, because we know, as data engineers and people that do analytics, that being able to predict and prescribe outcomes is where it really goes, and that’s where phase three is. So, phase three, that’s what we’re really entering right now.

Phase three is when we’re able to take all this information. So, think of that washing machine. We’re not just turning that washing machine on from our phone. That device has diagnostics in it that it’s gonna run. And so, let’s say there’s an error in the onboard computer, or maybe some kind of circuit, maybe just a $10 component on your washing machine that, if you replace it within the next 30 days, is gonna prevent you from having to get a brand new washing machine. Well, that’s pretty cool, right? So, it can send you that information, but instead of just sending you that information, check this out. The diagnostics that run go out and send that information to the data center, and the data center actually looks and finds service providers in your area, because it knows where this device is. You’ve registered it, so it knows where your home is. It’s gonna find those service providers in your area and send you back an alert saying, “Hey, we found this component that needs to be replaced on your washing machine. This is gonna prevent you from having to buy a new washing machine; maybe it will prevent you from having a flooded washing machine,” which, man, who wants to clean up a flood and then have to buy a new washing machine?

So, how about: these are some times we’ve scheduled, we have a service person that can come to your area and replace that part for you, when would you like to schedule that? That’s pretty awesome, right? That’s really starting to say, “Hey, we found an error. We believe this is the component that can fix it. And also, here are some times for us to be able to fix it.” So, how many steps did it take a human on the [INAUDIBLE 00:05:50]? Which is really good, right? From a consumer standpoint, we want products like that, right? And so, there’s a ton of different new use cases that we can start to see. We’re starting to see that now with what I call IoT phase three.
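That phase-three flow, a diagnostic comes in, a rule flags the failing part, and the backend pairs the alert with nearby service providers, can be sketched in a few lines. Everything here (the error code, component name, ZIP code, and provider names) is invented for illustration:

```python
# Hypothetical lookup table: ZIP code -> service providers in the area.
PROVIDERS_BY_ZIP = {"35801": ["Rocket City Appliance", "Valley Repair Co."]}

def build_alert(diagnostics, home_zip):
    """Turn a device diagnostic report into a consumer alert with nearby
    service providers, or return None if nothing is wrong."""
    if diagnostics.get("error_code") is None:
        return None
    return {
        "message": (f"Component {diagnostics['component']} should be replaced "
                    f"within {diagnostics['replace_within_days']} days."),
        "providers": PROVIDERS_BY_ZIP.get(home_zip, []),
    }

alert = build_alert(
    {"error_code": "E42", "component": "drain-pump-relay", "replace_within_days": 30},
    home_zip="35801",
)
print(alert["message"])
print(alert["providers"])
```

The real version would of course involve streaming ingest, trained failure models, and a scheduling service; the point is only that the device’s registration data (its location) is what lets the backend turn a raw diagnostic into an actionable, local offer.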

So, just remember the phases. The first one, think of sensors everywhere, smart sensors, but we were really just tracking things. The second became mobile control, the ability to have smart everything, so we have the smart refrigerator, we have the smart washer and dryer, but we still just didn’t know what we could do with it. And now we’re more into phase three. We’re starting to prescribe, so we’re starting to have these predictive analytics saying, “Hey, these are things that might happen. Oh, and by the way, this is how we can fix it.” This is actually gonna give consumers more information and just a better feeling about their products. It can save you time from having to pick up the phone and call to schedule a time for somebody to come fix your washing machine. It’s gonna prevent you from having to go out and buy a new washing machine. It makes products more sticky for those companies.

So, that’s all for today’s episode of Big Data Big Questions. Make sure to subscribe to the channel and submit any questions that you have. If you have any questions related to Big Data, IoT, machine learning, hey, just send them in and I’ll try to answer them if I get an opportunity. Submit those here or go to my website. Make sure you subscribe, and I’ll see you next time on the next episode of Big Data Big Questions.

 


Filed Under: IoT Tagged With: IoT, IoT Development

Why Data Engineers Should Care About IoT

June 4, 2018 by Thomas Henson

Data Engineers Should Care About IoT

Why Data Engineers Should Care About IoT

The Internet of Things has been around for a few years but has hit an all-time high for buzzword status. Is IoT important for Data Engineers and Machine Learning Engineers to understand? Gartner predicts there will be over 21 billion connected devices worldwide by 2020. The data from these devices will be included in current and emerging big data workflows. Data Engineers and Machine Learning Engineers will need to understand how to quickly process this data and merge it with existing data sources. Learn why Data Engineers should care about IoT in this episode of Big Data Big Questions.

Transcript

Hi folks, Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I want to tackle IoT for data engineers. I’m going to explain why IoT, or the Internet of Things, matters for data engineers, and how it’s going to affect our careers, how it’s going to affect our day-to-day jobs, and honestly, just the data that we’re going to manage. Find out more, right after this.

Today’s question is: what is IoT, and how does it affect data engineers? We’ve probably all seen the buzzword, or the concept, of the Internet of Things, but what does it really mean? Is it just these little dash buttons that we have? Is this? Wait a minute. Is that ordering something?

Is this what IoT is, or is it the whole ecosystem and concept around it? First things first. IoT, or the Internet of Things, is the concept of all these connected devices, right? It’s not something that is, I will say, brand new. It’s something that’s been out there for a while, and when we really get down to it, it is a sensor. We have these sensors, these cheap sensors.

We’ve had them for a long time, but what we haven’t had is all these devices connected with an IP address to the Internet, that can send the data. That’s the big part of the concept. It’s not just about the sensor, it’s about being able to move the data from the sensor.

This gives us the ability to manage things in the physical world, bring the data back, do some analytics on it, and even push data back out. The cool thing is, generally, IoT devices are, I would say, economical or cheap devices that have an IP address and can just pull in information. Think about a sensor: if you have a smart watch that’s connected to the Internet, it can feed information up to you. That’s where some of it all started. These dash buttons, I can have installed all around my house, and push a button whenever I need something. Or start to look at what we’re talking about with smart refrigerators. Smart refrigerators can take pictures and have images of the content that’s in your refrigerator, so if you’re at the store, you look and you’re like, “Hey, you know, do I need that ranch dressing? Yeah? Let me check my refrigerator here.”

Also, a sensor could be inside the refrigerator and tell you if something’s going wrong. Maybe the ice maker is blocked. Maybe you need a new water filter in your refrigerator, and the refrigerator knows that because it has a sensor in it. It can send information out to order that water filter for you and ship it to your home, so you don’t even have to remember, “Hey, has it been 90 days? Or was it 60 days? Is it time to change it?” Otherwise you’re going to forget. You’re going to let it go over, but now you have a sensor that’s going to tell you and order it for you. That’s the concept. It’s not just about the sensor. It’s about that ecosystem.

It’s about being able to move the data. For data engineers, what does this mean? Why do we care?

There are a lot of predictions out there about IoT and where it’s going. One of the big ones: Gartner has a prediction that by 2020 we will have over 20 billion of these devices. Not just the dash buttons, but think of all these sensors, all these things with IP addresses connected to the Internet. What does that mean from a data perspective? Some predictions I’ve seen say 44 zettabytes of data will be contributed between new data coming in and the data that already exists. Think about it. What is a zettabyte? It’s not a petabyte. It’s bigger than a petabyte.
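To put that unit in perspective, a quick check with the decimal (SI) prefixes:

```python
petabyte = 10**15    # 1 PB
zettabyte = 10**21   # 1 ZB

print(zettabyte // petabyte)        # 1000000 -> a zettabyte is a million petabytes
print(44 * zettabyte // petabyte)   # 44000000 petabytes in the predicted 44 ZB
```

So a single zettabyte is a million petabytes, which is why that prediction dwarfs the terabyte- and petabyte-scale clusters most teams run today.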

How are we going to manage all this data, when right now we’re still managing terabytes and petabytes and being like, “Man! This is a lot of data!” That’s why it’s important for data engineers: IoT is contributing to this deluge of data. How does all that affect us, as far as concepts go? We start talking about IoT, and sensors, and having this data on the edge, being able to pull information back but also push information out. What does that start to say?

As we’ve talked more and more about real-time analytics, this is where we’re going to see it really take hold. As soon as we can get that data, analyze it, and push information back out, that’s what’s going to help us. Think about automated cars and a lot of the things going on out in the physical world, where we have sensors and devices talking to devices. Streaming analytics is going to be huge in IoT.

The question becomes, if you’re looking to get involved in IoT, what are some of the projects? What are some of the things you can do to contribute and be a part of this IoT revolution? I would look into some of the messaging queues. Look at Pravega, look at Kafka, even look at RabbitMQ and some of the other messaging queues, because think about it: 20 billion devices, maybe more, by 2020. As these devices come in, their messages have to come into a queue. They have to be stored somewhere before they can be processed and before we can analyze them. I would look into the storage aspect of that.

Also, know how to do the processing. Look at some of your stream processing, whether it be Apache Beam, Flink, or Spark. I would look into those if you’re looking to get involved in IoT. If you have any questions, make sure you submit those in the comments section here below, or go to thomashenson.com/big-questions. Submit your questions, and I’ll try to answer them on here.

Until next time, see you again.

 

Filed Under: IoT Tagged With: Big Data Big Questions, Data Engineer, IoT

Kappa Architecture Examples in Real-Time Processing

October 11, 2017 by Thomas Henson

Kappa Architecture

“Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

Yes, we can build models based on real-time processing, and in fact there are some you use every day…

In today’s episode of Big Data Big Questions, we will explore some real-world Kappa Architecture examples. Watch this video and find out!

Video

Transcription

Hi, folks. Thomas Henson here with thomashenson.com. And today we’re going to have another episode of Big Data, Big Questions. And so, today’s episode, we’re going to focus on some examples of the Kappa Architecture. And so, stay tuned to find out more.

So, today’s question comes in from a user on YouTube, Yaso1977. They’ve asked: “Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

And so, I think this user is stemming this question from their defense for either their master’s degree or their Ph.D. So, first off, Yaso1977, congratulations on standing on your defense and creating your research project around this. I’m going to answer this question as best I could and put myself in your situation where if I was starting out and had to come up with a research project to be able to stand for either my Master’s or my Ph.D. What would I do, and what are some of the things I would look at?

And so, I’m going to base most of these around the Kappa Architecture because that is the future, right, of streaming analytics and IoT. And it’s kind of where we’re starting to see the industry trend. And so, some of those examples that we’re going to be looking for are not just going to be out there just yet, right? We still have a lot of applications and a lot of users that are on Lambda. And Kappa is still a little bit more on the cutting edge.

So, there are three main areas I would look at to find those examples. The first one is going to be IoT. In your newer IoT, or Internet of Things, workflows, you’re going to start to see that. One of the reasons is because there are millions and millions of these devices out there.

And so, you can think of any device, you know, whether it be from a manufacturer that has sensors on manufacturing equipment, smart cars, or even smartphones, and just information from multiple millions of users that are all streaming back in, doing some kind of prediction modeling, some kind of analytics, on that data as it comes in.

And so, on those newer workflows, you’re probably going to start to see the Kappa Architecture being implemented in there. So, I would focus first off looking at IoT workflows.

Second, this is the tried and true one that we’ve seen all throughout Big Data since we started implementing Hadoop: fraud detection, specifically with credit cards and some of the other pieces. So, look at that from a security perspective. I mean, we just had the Equifax data breach and so many others.

So, I would, for sure, look at some of the fraud detection around, you know, maybe, some of the major credit card companies and see kind of what they’re doing and what they have published around it. Because just like in our IoT example, we’re talking millions and millions, maybe, even billions of users all having, you know, multiple transactions going on at one time. All that data needs to be processed and needs to be logged, and, you know, we’re looking for fraud detection. That needs to be pretty quick, right? Because you need to be able to capture that in the moment that, you know…Whether you’re inserting your chip card or whether you’re swiping your card, you need to know whether that’s about to happen, right?

So, it has to be done pretty quickly. And so, it’s definitely a streaming architecture. My bet is there’s some people out there that are already using that Kappa Architecture.

And then another one is going to be anomaly detection. I’m going to break that into two different ones. The first is anomaly detection for security, from insider threats. So, think of being able to capture insider threats in your organization, people who are maybe trying to leak data or trying to give access to people who don’t need to have it. Those are things that happen in real-time, and the faster you can make that decision, the faster you can predict that somebody is an insider threat or doing something malicious on your network, the quicker you respond and the less damage is done to your environment.

And then, also, anomaly detection for manufacturers. We talked a little bit about IoT, but also look at manufacturing. There’s a great example, and I would say that, for your research, one of the books you’d want to look into is Introduction to Apache Flink. There’s an example in there from the manufacturer Ericsson, who implemented the Kappa architecture. What they have is, I think, 10 to 100 terabytes of data that they’re processing at one time, and they’re looking for anomaly detection in that workflow to see, you know, are there sensors, are there certain things happening, that are out of the norm, so that maybe they can stop a manufacturing defect or predict something that’s going to go wrong within their manufacturing area, and then also, externally, when users have their devices, be able to predict those too?
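As a toy illustration of that kind of anomaly detection, here is a rolling z-score check over a stream of sensor readings. A production Kappa-style job would run this logic distributed and stateful in something like Flink; this sketch only shows the core idea, and the readings and thresholds are invented:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(stream, window=5, threshold=3.0):
    """Flag readings more than `threshold` standard deviations away
    from the rolling mean of the last `window` readings."""
    recent = deque(maxlen=window)
    flagged = []
    for value in stream:
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(value)
        recent.append(value)
    return flagged

# A stable temperature sensor with one out-of-norm spike.
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 55.0, 20.1]
print(detect_anomalies(readings))  # [55.0]
```

The same pattern, maintain a small amount of state per sensor and compare each new event against it, is what scales out to the tens of terabytes in the Ericsson example.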

So, those are the three areas I would check. Definitely check out Introduction to Apache Flink; there’s a lot of talk about the Kappa architecture in there. Use it as one of your resources and be able to pull out some of those examples.

But remember, the three areas I would really key in on are IoT; fraud detection, so look at some of the credit card companies or other fraud detection work; and then anomaly detection, whether it be insider threats or manufacturers.

So, that’s the end of today’s episode of Big Data, Big Questions. I want to thank everyone for watching. And before you leave, make sure that you subscribe so you never miss an episode or any of my Big Data tips. So, make sure you subscribe, and I will see you next time. Thank you.

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, IoT, Kappa

Top 3 Recommendations for Real-Time Analytics

July 11, 2017 by Thomas Henson

I honestly think developing real-time analytics is one of the hardest feats for developers to take on!

I’ll admit I’m for sure biased, but that doesn’t make me wrong.

My first project in the Hadoop eco-system was a real-time application when the Hadoop community still didn’t have real-time processing. I’ve always been honest in my posts and always will. So let me not sugarcoat this…the project sucked and was deemed a failure!

My team didn't understand the requirements for real-time and couldn't meet them. The project came in over budget and delayed. However, all was not lost: years have passed, I learned a lot, and the Hadoop community now has new frameworks that speed up real-time processing. Before developing your real-time analytics project, please read these top 3 recommendations for real-time analytics! You will thank me….

Real-Time Analytics

What is Real-Time Analytics

Real-time analytics is the ability to analyze data as soon as it is created, even before all the data has arrived. Traditional batch architectures have all the data in place before processing begins, but real-time processing happens as the data is created.

To be picky, there is really no such thing as real-time analytics! What we have right now is near real-time analytics, which to humans means millisecond speed, or just faster than our competitors. For true real-time we would have to analyze the data at the same instant it occurs, and right now there is always some latency from networking, processing, etc. Let's table this discussion until quantum computing becomes mainstream……

Take for example a GPS-enabled application for brick-and-mortar stores. Using the phone as a sensor, the application knows the customer's proximity to the store. When the customer is close to the store, an offer is sent to the phone. Sounds simple, but imagine millions of sensors feeding data into the system while it tries to analyze all that location information. Add to this example the need to know store locations, store hours, local events, inventory levels, etc. Many things could go wrong here: the application could send an offer for a product that's not in stock, send an offer too late once the customer is out of range, or send an offer for a store that is closed.
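To make the example concrete, here is a minimal sketch of the per-event proximity check such an application would run. The coordinates, field names, and 0.5 km geofence radius are all invented for illustration, not taken from any real system:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def should_send_offer(customer, store, radius_km=0.5):
    """Send an offer only if the customer is inside the geofence
    AND the store is open AND the product is in stock -- guarding
    against exactly the failure modes described above."""
    in_range = haversine_km(customer["lat"], customer["lon"],
                            store["lat"], store["lon"]) <= radius_km
    return in_range and store["open"] and store["in_stock"]

# Hypothetical customer a block away from a hypothetical store
customer = {"lat": 34.730, "lon": -86.586}
store = {"lat": 34.731, "lon": -86.585, "open": True, "in_stock": True}
print(should_send_offer(customer, store))  # True
```

The hard part isn't this check; it's running it with millisecond latency across millions of concurrent location events, which is where the streaming frameworks discussed below come in.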

Still think building those real-time applications is easy? Let’s look into the future…

Tsunami of Real-Time Data

How much data are we talking about for the future of real-time analytics? Gartner predicts that by 2020 we will have 20.4 billion devices connected worldwide. The prediction roughly assumes a world population of 7 billion people with an average of 3 devices per person. Sounds like a lot; however, I think it's a conservative prediction. How many devices do you have connected in your home? I have 25 in mine, and I'm not considered on the bleeding edge. I've talked with quite a few people who have as many as 75-plus. So let's say a quarter of the population has 15 devices by 2020: that pushes the total closer to 28-plus billion devices.

Recommendations for Real-Time Analytics

Since we know why stream processing and real-time analytics are growing at such a rapid pace, let's discuss recommendations for building those real-time applications.

1 – Timing is Everything

Know the time to value for the insights in the data. All data has a value assigned to it, and that value degrades over time. Picture our previous example of a retailer using location services to send offers via a mobile application. How valuable is a potential customer's location? Extremely valuable, but only if the application can process the data quickly enough to send an incentive while the customer is near the physical store. Otherwise, the application is just reporting historical information.

After understanding the time value of the data, you can pick the right framework (Flink, Spark, Storm, etc.) to process it. Most streaming data needs to be processed in real-time for specific insights, for example, pulling that user's location and timestamp. Remember, not all data is processed the same way: batch vs. streaming.
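As a toy illustration of time-based processing (deliberately not tied to Flink, Spark, or Storm, which do this for you at scale), here is a sketch of a tumbling window that groups location events into fixed one-minute buckets. The event shape and window size are assumptions for the example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Group (timestamp, user_id, lat, lon) events into fixed,
    non-overlapping windows and count location updates per window."""
    counts = defaultdict(int)
    for ts, user_id, lat, lon in events:
        # Each event falls into exactly one window, keyed by its start time
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Hypothetical location pings: two in the first minute, one in the second
events = [
    (0,  "u1", 34.730, -86.586),
    (30, "u2", 34.740, -86.590),
    (65, "u1", 34.731, -86.585),
]
print(tumbling_window_counts(events))  # {0: 2, 60: 1}
```

A real streaming framework adds the parts this sketch ignores: event-time vs. processing-time semantics, late-arriving data, and distributing the window state across a cluster.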

2 – Make Sure Applications will Scale

Make sure your real-time application can scale, and not just with large influxes of data, but independently in processing and storage. In the future of IoT and streaming, data sources will be extremely unpredictable. One day you might ingest 2 TB of new data and the next 2 PB. If you think I'm joking, check out my talk from the DataWorks Summit on the Future Architecture of Streaming Analytics. Build applications on a foundation of architectures, services, and components that can scale. Remember our friend Murphy and his law about how things can go wrong.

Scaling isn't just about being able to ingest more data; it's about scaling compute and capacity independently. Make sure your real-time application supports a data lake strategy. Isilon's data lake platform gives you the ability to separate compute and capacity when growing your Hadoop clusters. So when a new 10 TB data set comes in that isn't really growing and will probably only be processed weekly or monthly, you can add capacity without adding unneeded compute. A data lake strategy also lets you opt out of HDFS's 3x replication (200% storage overhead) in favor of roughly 80% storage utilization on Isilon. Whether you use Isilon or not, make sure you have a data lake strategy built on an architecture of independent scaling!!

3 – Life Cycle Cost of Data

Since we know the value of data decreases over time, we need to assign a cost to that data. I know you probably just rolled your eyes when I mentioned the cost of data, but it's important to understand that data is a product. Just as Amazon sells books at different prices, you should price your data: its value varies over time, and the cost of keeping it should reflect that.
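One back-of-the-envelope way to reason about this is to model the data's value as decaying each month and compare it against a flat monthly storage cost. The decay rate and dollar figures below are invented purely for illustration, not a real pricing model:

```python
def data_value(initial_value, monthly_decay, month):
    """Exponential decay: the data loses `monthly_decay` (a fraction)
    of its remaining value each month."""
    return initial_value * (1 - monthly_decay) ** month

def months_worth_keeping(initial_value, monthly_decay, storage_cost_per_month):
    """First month when the data's remaining value falls below its
    monthly storage cost -- a candidate point to tier down or delete."""
    month = 0
    while data_value(initial_value, monthly_decay, month) >= storage_cost_per_month:
        month += 1
    return month

# e.g. data worth $10,000 today, losing 20% of its remaining value
# each month, costing $500/month to keep on hot storage
print(months_worth_keeping(10_000, 0.20, 500))  # 14
```

Even a crude model like this gives you a number to bring to that budget meeting: a defensible retention window instead of "keep everything forever."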

As big data developers we want to hold on to data forever and bring in as many new sources as possible. However, when your manager or CFO gets the bill for all the capacity you need, you will be sitting in endless meetings and writing justification reports about why you are holding all this data. That means less time doing what we love: coding in our Hadoop cluster. Know the value of your data and plan accordingly!!

Wrap Up of Real-time Analytics

Finishing up our discussion, remember that real-time analytics is the processing of data as soon as it is generated. By analyzing data as it's generated, decisions can be made more quickly, which helps create better applications for our users. When building real-time applications, make sure you follow my 3 recommendations: understand the time value of the data, build on systems that scale independently, and assign value to the data. Successfully building real-time applications depends on these 3 core points.

Filed Under: Streaming Analytics Tagged With: IoT, Real-Time Analytics, Streaming Analytics
