
Phases of IoT Application Development

June 19, 2018 by Thomas Henson Leave a Comment


The Internet of Things is generating many opportunities for Data Engineers to develop useful applications. Think about self-driving cars: each one is just a large moving IoT device. When developing IoT applications, developers typically start with 3 different phases in mind. In this video I will explain the 3 Phases of IoT Application Development.

Transcript – Phases of IoT Application Development

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s episode, I wanna talk more about IoT. So, I know we started talking about it in the previous video, but I really wanna dig into it a little bit more because I think it’s something very important for data engineers and anybody involved in any kind of modern applications or anybody involved in Big Data. So, find out about the phases, the three phases of IoT right after this.

Welcome back. And so, today, we're gonna continue to talk more about IoT or Internet of Things. And so, if you're not familiar, I've got a video up here where we talked about why it's important for data engineers to know, but I think it's important for anybody involved in modern applications or even on the business side of things. You're really gonna see a lot of different information that's able to come into your data center and into your projects because of these sensors out there. So, let's get a little more comfortable with what's going on with the technology and how that's gonna apply to us.

And so, in this video, I wanna talk about the three phases. So, I think there are three phases of IoT and I think we’re starting to get into the third phase, and you’ll see why it’s gonna make sense for data engineers and modern applications when we talk about that third phase.

And so, just as a recap, remember, IoT, it's not a new concept. It's the ability to have devices out in the physical world that are gonna have some kind of IP address, but that are also able to send data and receive data back from your core data center or from your core analytics processing. So, think about the example I've used before, the dash button, right? You have a dash button where if you're out of toilet paper or if you're out of whatever it is in your house, you're able to push that button. It connects out to a gateway, it's logged in the cloud with Amazon to be able to say, "Hey, order some more [INAUDIBLE 00:02:02] of this particular brand," and send it to your door, so a real quick example there.

But let’s talk a little bit more about the phases and I think that will give you a more understanding of, “Okay, this is how that concept and how that whole ecosystem of devices and data and gateways all work together.” And so, I think just like from a web development perspective when we talked about, hey, Web 1.0 and Web 2.0, I think with IoT, we’ve gone through phases of IoT 1.0 and 2.0. I think that these phases are more collapsed than the phases of the web, and part of that is just we change so fast.

And so, the first phase was everybody had a sensor, right? Maybe this is not a smartwatch, but think about tracking and the smartwatches. Everybody was like, "Oh, it's kinda cool, right? I can get in contests with my friends and track how many steps I have." That's pretty cool. Really, we didn't understand what to do with it. It was still somewhat of a novelty, and so everybody who already had these sensors just really didn't know what to do with them other than just tracking, right? Those are kinda things that we had been doing before, but now we have an internet connection and we can kinda control them on our phone.

Fast-forward into phase two, so once we go into phase two, it wasn’t just about these smart trackers and these devices that were attached to us, but it started to become… we had smart everything in our homes, right? So, in phase two, we started having, think of a refrigerator, so you had a smart refrigerator, and people are like, “Well, that’s kinda cool. We have a refrigerator that’s connected to the internet. I can look at photos of it from my phone. And so, if I’m at the grocery store, that’s like, hey, do I need any more ranch dressing or do I need any more Tide Pods…” well, maybe Tide Pods are in your refrigerator, but, well, hopefully not. But if you had pickles or things that you’re looking for at the grocery store or maybe even just your washing machine, you’re able to turn your washing machine on with your device and say, “Okay, let’s turn the washing machine on.” That’s pretty cool. You set it up.

It's still kinda novel, not really where we're really seeing this go, because we know as data engineers and people in analytics that being able to predict and being able to prescribe outcomes is where it really goes, and that's where phase three is. So, phase three, that's where we're really entering right now.

Phase three is when we're able to take all this information. So, think of that washing machine. We're not just turning that washing machine on from our phone. That device has diagnostics in it that it's gonna run. And so, let's say that there's an error in the onboard computer, or maybe some kind of circuit, maybe just a $10 component on your washing machine that, if you replace it within the next 30 days, is gonna prevent you from having to get a brand new washing machine. Well, that's pretty cool, right? That's really cool. So, it can send you that information, but instead of just sending you that information, check this out. That diagnostics run goes out, sends that information out to the data center, and the data center actually looks and finds service providers in your area, because it knows where this device is, you've registered it. It knows where your home is. So, it's gonna find those service providers in your area and it's gonna send you back an alert saying, "Hey, we found this component that needs to be replaced on your washing machine. This is gonna prevent you from having to buy a new washing machine, and maybe it will prevent you from having a flooded washing machine," which, man, who wants to clean that up and then have to buy a new washing machine?
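As a sketch, the whole phase-three loop just described (diagnostic event in, part lookup, provider lookup, alert out) fits in a few lines. Everything here is hypothetical: the error code, the part catalog, and the provider directory are made up for illustration, not any manufacturer's actual API.

```python
from dataclasses import dataclass

@dataclass
class DiagnosticEvent:
    device_id: str
    error_code: str
    home_zip: str

# Hypothetical lookup tables. A real system would query a parts
# catalog and a service-provider directory in the data center.
PART_FOR_ERROR = {"E42": ("drain-pump relay", 10.00)}
PROVIDERS_BY_ZIP = {"35801": ["Huntsville Appliance Repair"]}

def build_alert(event):
    """Turn a raw diagnostic event into a consumer-facing repair alert."""
    part, cost = PART_FOR_ERROR.get(event.error_code, (None, None))
    if part is None:
        return None  # unknown code: escalate for human review instead
    return {
        "device": event.device_id,
        "part": part,
        "providers": PROVIDERS_BY_ZIP.get(event.home_zip, []),
        "message": f"Replace the {part} (~${cost:.2f}) within 30 days "
                   "to avoid replacing the whole machine.",
    }
```

The point of the sketch is that the consumer never sees the raw error code, only the prescribed fix and who can perform it.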

So, "How about this: here are some times that we've scheduled. We have a service person that can come to your area and replace that part for you. When would you like to schedule that?" That's pretty awesome, right? That's really starting to say, "Hey, we found an error. We believe this is the component that can fix it. And, also, here are some times for us to be able to fix it." So, how many steps did it take a human on the [INAUDIBLE 00:05:50]? Which is really good, right? From a consumer standpoint, we want products like that, right? And so, there's a ton of different new use cases that we can start to see. So, we're starting to see that now with what I call IoT phase three.

So, the phases… just remember the phases. The first one, think of sensors everywhere, smart sensors, but we're really just tracking things. The second became like mobile control, or the ability to have smart everything, so we have the smart refrigerator, we have the smart washer and dryer, but we still just didn't know what we could do with it. And now, we're more into phase three. We're starting to prescribe, so we're starting to have these predictive analytics saying, "Hey, these are things that might happen. Oh, and by the way, this is how we can fix it." And, this is actually gonna give consumers and other products more information, and just a better feeling about the product. And so, it can save you time from having to pick up the phone and call to schedule a time for somebody to come in and fix your washing machine. It's gonna prevent you from having to go out and buy a new washing machine. It makes products more sticky for those companies.

So, that’s all for today’s episode of Big Data Big Questions. Make sure to subscribe to the channel, submit any questions that you have. If you have any questions that’s related to Big Data, IoT, machine learning, hey, just send me any questions, I’ll try to answer them for you if I get an opportunity. But submit those here or go to my website. Make sure you subscribe and I’ll see you next time on the next episode of Big Data Big Questions.

 

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: IoT Tagged With: IoT, IoT Development

Isilon Quick Tips: Snapshots in OneFS

June 11, 2018 by Thomas Henson Leave a Comment


Snapshots in OneFS 8.0 & Beyond

Isilon's OneFS snapshots allow administrators to protect data at the directory and sub-directory level. Snapshots can be taken as one-time snaps or on a schedule. OneFS snapshots also integrate with Windows Shadow Copies to allow rollback of directories and files. In this video we will walk through setting up snapshots in OneFS from the WebUI. Watch to find out how to set up one-time snapshots in OneFS.

Transcript

Coming soon…

 


Filed Under: Isilon Tagged With: Isilon, Isilon Quick Tips

GDPR Good or Bad?

June 7, 2018 by Thomas Henson Leave a Comment


Is GDPR Good or Bad?

How many emails have you received about GDPR? At this point I almost have to set up a rule in Outlook to send all emails with the word "GDPR" in them to a separate folder. I've explained what GDPR is and how it applies to Data Engineers, but is it good or bad? Generally regulations are put in place to make society better, but does Big Data need regulation? Find out my thoughts on the policies put in place with GDPR in the video below.

Transcript – GDPR Good Or Bad?

Hi folks! Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I'm going to jump back in a little bit more around GDPR. We've had a lot of questions, I've seen a lot of things on Twitter, and I just thought it would be a great time to discuss: is GDPR good or bad?

This is not going to be about politics. It's going to be about policy and what's really driving GDPR. What does it really mean? Is it a good thing for those of us that are involved in big data? And for consumers? Find out more, right after this.

[Sound effects]

Welcome back. This is the second episode where we’re going to talk about GDPR. If you’re curious about what GDPR is and what it means to data engineers, make sure you check out the video that I did before just talking about, what does GDPR mean to data engineers, machine learning engineers, or data scientists?

I really wanted to focus this time on, we’ve talked about what it is, but what does it really mean? Is it a good thing? I’ve gotten a ton of emails just on my personal stuff, from people who’ve built websites for me, from different HortonWorks, and Cloudera, and everybody’s kind of talking about, what does GDPR mean to us? Every time you turn around right now, you’re going to have to update some kind of policy, whether it be from Apple on your iPhone, or from Android, or anybody that’s collecting or holding onto your data, all those privacy updates are all going on, and you’re going to have to click yes on each one of them.

Yes, I understand that you’re going to protect my data, and it’s going to be more private. Is that a good thing or is it a bad thing? Is it okay for us to have regulation around it?

I look at it from this perspective. I was thinking about it, and it’s like, if you really think about where we’re going, there’s regulation for everything. For most products, as they get big. What this really means to me, and why I think it’s a good thing, is because this shows that your digital data is growing up. It’s maturing. When you think about it, in America, when cars first came out, we didn’t really have regulations around it. You didn’t have to get a license. It was just something fun that you could do, and if you could afford a car, you could get it. As that product started maturing, we started realizing, “Hey, this is something that needs to be regulated to some extent.”

We need to have some kind of standards around who's going to drive on what side of the road, and how all that's going to work. If you think about digital data, we're getting to that point. A couple reasons why we're at that point: if you think about it, the first thing, privacy matters. Privacy's always kind of mattered, and people really pay attention to being able to be private and have those things. For a long time, data has not been seen as one of them.

We have regulations and laws around whether people can go into private residences without consent and things like that. Your data, it's the same way, and that's where we're starting to look at it and say, "Okay, that data, you have rights to it. It's yours. You created it, so your privacy does matter." That's where the regulations are coming from. Also a big thing is, think about how many different data breaches we have.

For a long time, if you follow Troy Hunt, or anybody that's big in security, you could see at least weekly they're talking about a huge data breach that happened. That's compounded with trying to figure out, "Okay, if you're collecting this data, how much of a liability is that for you, and then how much of a responsibility is it of yours if the data becomes breached? Are there certain standards that you should have to follow to be able to better protect that data, so that you can turn around and say, 'Hey, we do have some bad actors out there that have hacked and taken this data, but we went through these steps'?"

There’s not really been a standard for what those steps are, and so this is a further implementation of it. The thing, and one of the reasons, a couple of the reasons, actually, that I think that it’s a good thing, right? Not talking politics here, just why I think GDPR’s good.

It gives you back control of your data. It gives you the opportunity to say, “Hey, I would like for you to be able to report and see what data you have on me. What does my digital footprint look like?” What kind of data are you collecting on me? You have the authorization to ask for that and to be able to get an answer to that.

Secondly, you can say, “Hey, I want to drop off. I want all my data gone. I don’t want you to collect and hold onto my data.” I think this is a big point, because while I’m on Facebook, and I’ve been a Facebook user for I don’t know how long, just a long time, I’ve heard of other people and other stories around people who’ve gone off Facebook. You’ve probably not seen them. They’ve deleted their profile, only to come back a year or two later, and all their stuff’s still there.

I can't say that I've seen that happen for me, but I've heard a ton of stories, so I know that there must be some sort of truth to that. This is an opportunity where, if you do want to get off the grid, so it's like, "Hey, you know, it's 2018, I'm going to get off the grid," this gives you the opportunity. That's another reason why I think it's really good. It puts you in control of your data and lets you decide.
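Those two rights, access ("show me what you have on me") and erasure ("delete it all"), are straightforward to express in code. This is a toy in-memory sketch, not a real compliance implementation, and the function names are my own.

```python
# In-memory stand-in for a user-data store. export_user_data covers the
# right of access; delete_user_data covers the right to erasure.
user_store = {}

def collect(user_id, key, value):
    """Record a piece of data held about a user."""
    user_store.setdefault(user_id, {})[key] = value

def export_user_data(user_id):
    """Right of access: report everything held on this user."""
    return dict(user_store.get(user_id, {}))

def delete_user_data(user_id):
    """Right to erasure: remove the user's data entirely, not just hide it.

    Returns True if anything was actually deleted."""
    return user_store.pop(user_id, None) is not None
```

The Facebook anecdote above is exactly the case `delete_user_data` guards against: erasure has to actually remove the records, not flag the profile as hidden.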

Also, it’s going to create a framework for companies to have a standard around how they’re going to protect that data. It’s going to protect companies and organizations that collect data by having a set of standards that we’re able to follow, to say, “Hey, we’re doing as much as we can to be able to protect, and make sure that, your data, when it comes in, is as secure as can be.” This gives us the opportunity to start setting those standards and testing it. Maybe we won’t have as many data breaches in the future.

Maybe, we can trust and understand that, while there are bad actors out there, that maybe there will be less involvement around the hackers, because it really puts the onus on the people who collect the data. We had some of that before, but a lot of it has been, I would say, public perception. You want your public perception to be okay. How much of a law, and really bearing, is going to be on companies if that data is discovered, or data is breached? Now, this gives us the framework to say, “Hey, there are regulations, and we are saying that, you know, this is something that you need to protect.”

That was just my thoughts on it. I’d love to hear your thoughts. If anybody has any opposing views or anything like that, put them in the comments section here below or just reach out and ask. Let’s jump on YouTube, and let’s record a video, and maybe talk about it a little bit more. Let me know where I’m wrong, but, that’s my thoughts. I think, in general, GDPR is good. I think there’s going to be a lot of opportunities around products and around people with that expertise, so if you’re looking to get involved in big data, and you like looking into and following regulations, and putting security metrics into works, then I think GDPR is a good place to go.

I think there's going to be a lot of companies that are going to make products. There's going to be products out there that are going to help with GDPR compliance, because May 25th, 2018, is coming. I don't know that everybody's going to be ready. Until next time.

Filed Under: Business Tagged With: GDPR

Why Tensorflow is Awesome for Machine Learning

June 5, 2018 by Thomas Henson Leave a Comment


Machine Learning and Deep Learning have exploded in both growth and workflows in the past year. When I first started out with Machine Learning the process was still somewhat limited, as were the frameworks. Data Scientists would configure and tune models on a local machine only to have to recreate the work when pushing to production. This process was extremely time consuming. Google and the Google Brain team released Tensorflow in 2015. Find out why Tensorflow is awesome for Machine Learning in the video below.

Transcript – Why Tensorflow is Awesome for Machine Learning

Hi folks, welcome back to another episode of Big Data Big Questions. Today, I'm going to tackle a question around Tensorflow. I wanted to give you my feedback. I've been diving into Tensorflow and looking into how you set it up, and then actually playing around with it, and seeing how it differs from some of the other machine learning programs and things that I've used in the past, like Mahout and MADlib, and some other things.
I wanted to give you my take on Tensorflow, tell you why I think it’s great. Tell you how you can get hands-on with it, and just give you some background on it. Find out more, right after this.

Welcome back. Before we jump into my thoughts on Tensorflow, I did want to encourage you to make sure you subscribe to the YouTube channel here, and then also, if you have any questions, go ahead. Send them in. You can go to my website, thomashenson.com/big-questions.

I will answer any of your questions there. You can put them in the YouTube comments section here below, or you can use the hashtag #bigdatabigquestions on Twitter. I will answer those as quickly as I can. Thank you everybody for subscribing. Now, let's talk a little bit about Tensorflow.

I've been going down more of the deep learning path. I've been doing some research and some learning on my own. One of the first things that I've started really diving into is Tensorflow. I wanted to look at Tensorflow because I have a background… when I first started out in the Hadoop ecosystem, we weren't really doing streaming, so you've probably heard me talk a little bit about the Kappa and the Lambda architectures here. Make sure you check those videos out. One of the things that we did use back when we were just using batch, more of a Lambda architecture workflow, is Mahout. I used Mahout a good bit.

We used Mahout, and I used SVD. I wanted to see how Tensorflow differed, because a lot of people are talking about Tensorflow, like, “Hey, you know, a lot of training.” There’s a lot of training out there. There’s a lot of YouTube videos out there, and there’s just a lot of excitement for Tensorflow. Me, wanting to dig in, I looked in, and I started playing around with it.

One of the first things that I really noticed, and one of the things that I really liked about Tensorflow, was the fact that, when we think about using Mahout or some of the other, older frameworks, one of the problems that I had was, we had our data scientists, and they would look at and play around with their data model and figure out exactly what they wanted from it, exactly what algorithms they were going to use.

A lot of times, and we were still new to this, they were coming in from using things on their machine. They were using MATLAB, or Octave, or some of them had been using Excel. Once you go, and you say, "Hey, man, well, you know, I had this, and we were looking at this little sample of data. Now, let's scale it out to, you know, terabytes and terabytes of data. I want to see how this is going to work."

Those algorithms are totally different. What you can run on your local machine, and the way that it's processed, is totally different than the way Mahout does it, or the way MADlib, or MLlib, or any of the distributed machine learning libraries do it. Not all the work that you did there carried over, and there were a lot of new steps that you had to go through. Versus with Tensorflow, the thing is, you can run it on your local machine. You don't have to have a distributed environment, but it's the same processes and the same way the algorithm works. It's going to run on your huge cluster.

Just think about it like this. To use Tensorflow, you don't have to set up a distributed network. It's not going to time out. It's just not going to go as fast on your single machine. If you're trying to churn through a terabyte of data that you've got on your laptop, have at it, but it's not going to be as efficient as a setup in your data center.

The cool thing is, when you're doing sampling and you're doing testing, you can do that locally. You can do that on your local machine. Then, when it comes time to scale it out, you're really just porting it, because you can use Docker and some other cool tools on the back-end to be able to expand that into your data center.
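The reason the same model code can run on a laptop sample or a sharded cluster dataset comes down to how the math decomposes. Tensorflow handles this under the hood; here is a toy pure-Python illustration (no Tensorflow required) showing that a metric computed shard-by-shard, the way distributed workers would compute it, matches the single-machine computation.

```python
def mean_squared_error(pairs):
    """Full-dataset metric: mean of (prediction - label)^2."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

def sharded_mse(shards):
    """Same metric computed shard-by-shard, as a cluster would:
    each worker returns (partial_sum, count) and the driver combines them."""
    partials = [(sum((p - y) ** 2 for p, y in s), len(s)) for s in shards]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Because the metric is a sum divided by a count, splitting the data across workers changes where the arithmetic happens, not what it computes. That is the property that lets you prototype on a sample locally and then run the identical logic at scale.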

I thought that was really cool. A little bit about Tensorflow: Tensorflow was incubated out of Google. If you're interested in it, I'll put the link in the show notes on my website, but I would check out the research paper, "Large-Scale Machine Learning on Heterogeneous Distributed Systems," so, Tensorflow. It goes into some of the research behind it, and why Tensorflow, why now?

I’m really, really heavy into it, and I know sometimes these research papers, or most of the time for me, the research papers are kind of over your head. First time you read it, you might be like, “I don’t really understand it,” but then the second time, and as you see it more, it’s going to help you. That’s my little tip about research papers, just go ahead, read them, become familiar with them. It’s okay that you don’t understand it, because it means that you’re actually learning.

Also, a lot of resources out there for Tensorflow. There’s a website that you can go to, and you can start playing around with how these neural networks go to work in Tensorflow, and different parameters you can play with, and it just gives you a visualization for how it’s going to identify image data, and then, be able to use Tensorflow in your own environment. I would encourage you to use the website, to go ahead and play, and look for all the stuff in the show notes here.

Until then, that’s all I wanted to talk about today on Tensorflow, but until the next time, I will see you on Big Data Big Questions. Thank you.

Filed Under: Uncategorized

Why Data Engineers Should Care About IoT

June 4, 2018 by Thomas Henson Leave a Comment


The Internet of Things has been around for a few years but has hit an all-time high for buzzword status. Is IoT important for Data Engineers and Machine Learning Engineers to understand? Gartner predicts there will be over 20 billion connected devices worldwide by 2020. The data from these devices will be included in current and emerging big data workflows. Data Engineers and Machine Learning Engineers will need to understand how to quickly process this data and merge it with existing data sources. Learn why Data Engineers should care about IoT in this episode of Big Data Big Questions.

Transcript

Hi folks, Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I want to tackle IoT for data engineers. I’m going to explain why IoT, or the Internet of Things, matters for data engineers, and how it’s going to affect our careers, how it’s going to affect our day-to-day jobs, and honestly, just the data that we’re going to manage. Find out more, right after this.
[Sound effects]

Today's question is, what is IoT, and how does that affect data engineers? You've probably seen the buzzword, or the concept of the Internet of Things, but what does that really mean? Is it just these little dash buttons that we have? Is this? Wait a minute. Is that ordering something?

Is this what IoT is, or is it the whole ecosystem and concept around it? First things first. IoT, or the Internet of Things, is the concept of all these connected devices, right? It’s not something that is, I will say, brand new. Something that’s been out there for a while, and when we really think about it, getting down to it, it is a sensor. We have these sensors, these cheap sensors.

We’ve had them for a long time, but what we haven’t had is all these devices connected with an IP address to the Internet, that can send the data. That’s the big part of the concept. It’s not just about the sensor, it’s about being able to move the data from the sensor.

This gives us the ability to be able to manage things in the physical world, bring data back, do some analytics on it, and even push data back out. The cool thing is, generally, IoT devices are, I would say, economical or cheap devices that have an IP address, that can just pull in information. Think about a sensor: if you have a smartwatch that's connected to the Internet, it can feed information up to you. That's where some of it all started. These dash buttons: I can have these dash buttons installed all around my house, push a button whenever I need something. Or start to look at what we're talking about with smart refrigerators. Smart refrigerators can take pictures and have images of all the content that's in your refrigerator, so if you're at the store, you look, and you're like, "Hey, you know, what am I…? Do I need that ranch dressing? Yeah? Let me check in my refrigerator, here."

Also, a sensor could be inside the refrigerator, and tell you if something’s going wrong. Maybe the ice maker is blocked. Maybe you need a new water filter in your refrigerator, and the refrigerator knows that, has a sensor into it. It can send information to wherever, to be able to order that water filter for you and send it to your home, so you don’t even have to go in, and remember, “Hey, has it been 90 days? Or was it 60 days? Is it time? Is it time to change it?” Then, you’re going to forget. You’re going to let it go over, but now, you can have this sensor that’s going to tell you, and it’s going to order that for you. That’s the concept. It’s not just about the sensor. It’s about that ecosystem.
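That water-filter check is simple enough to sketch. The telemetry fields below are made up for illustration; a real appliance would emit something vendor-specific.

```python
import json
from datetime import date

def filter_status(installed_on, today, lifetime_days=90):
    """Emit a telemetry event saying whether the water filter is due.

    Hypothetical field names; a real appliance's schema would differ."""
    age = (today - installed_on).days
    return json.dumps({
        "sensor": "water_filter",
        "age_days": age,
        "reorder": age >= lifetime_days,
    })

# A filter installed March 1st is 95 days old by June 4th, past the
# 90-day lifetime, so the event flags a reorder.
event = filter_status(date(2018, 3, 1), date(2018, 6, 4))
```

The device only emits the event; deciding to place the order, and shipping the filter to your door, happens back in the cloud, which is the ecosystem point being made here.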

It’s about being able to move the data. For data engineers, what does this mean? Why do we care?

There are a lot of predictions out there about IoT and where it's going. One of the big ones is, Gartner has a prediction that by 2020 we will have 20 billion, over 20 billion, of these devices. Not just the dash buttons, but just think of all these sensors, all these things with IP addresses connected to the Internet. What does that mean, from a data perspective? Some of the predictions I've seen say 44 zettabytes of data are going to be contributed, new data coming in on top of the data that we have that's already existing. Think about it. What is a zettabyte? It's not a petabyte. It's bigger than a petabyte.

How are we going to manage all this data, when right now we're still managing terabytes and petabytes of data, and being like, "Man! This is a lot of data!" That's why it's important for data engineers: it's contributing to this deluge of data. How does all that affect us, as far as what are some of the concepts? When we start talking about IoT, and sensors, and having this data on the edge, being able to pull information back, but also being able to push information out, what does that start to say?
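For scale: in decimal (SI) units a petabyte is 10^15 bytes and a zettabyte is 10^21, so one zettabyte is a million petabytes, and the 44 ZB prediction works out to 44 million petabytes:

```python
PB = 10 ** 15  # petabyte, decimal (SI) units
ZB = 10 ** 21  # zettabyte: a million petabytes

# The predicted 44 ZB expressed in petabytes.
print(44 * ZB // PB)  # 44000000
```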

As we’ve talked more and more about real-time analytics, this is where we’re really going to start to see real-time analytics really taking hold. As soon as we can get that data, and be able to analyze it and push information back out, that’s what’s going to help us. Think about it with automated cars, with a lot of the things that are going on outside in the physical world, where we have sensors, and devices talking to devices, streaming analytics is going to be huge in IoT.

The question becomes, if you're looking to get involved in IoT, what are some of the projects? What are some of the things you can do to contribute and be a part of this IoT revolution? I would look into some of the messaging queues. Look at Pravega, look at Kafka, even look at RabbitMQ and some of the other messaging queues, because think about it: 20 billion devices, maybe more, by 2020. As the data from these devices comes in, it has to come into a queue. It has to be stored somewhere before it can be processed and before we can analyze it. I would look into the storage aspect of that.
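The buffering role those brokers play can be illustrated with Python's standard-library queue: producers (devices) enqueue readings as they arrive, and a consumer drains them at its own pace. A real broker like Kafka or Pravega adds persistence, partitioning, and replay, which this toy version leaves out.

```python
import queue
import threading

# Stand-in for a message broker: devices enqueue events faster than the
# consumer processes them, and the queue absorbs the burst.
events = queue.Queue()
processed = []

def device(device_id, n):
    """Simulated IoT device emitting n readings."""
    for i in range(n):
        events.put({"device": device_id, "reading": i})

def consumer():
    """Drain events until a None sentinel signals shutdown."""
    while True:
        msg = events.get()
        if msg is None:
            break
        processed.append(msg)
        events.task_done()

c = threading.Thread(target=consumer)
c.start()
workers = [threading.Thread(target=device, args=(f"sensor-{d}", 5)) for d in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
events.put(None)  # all producers done; tell the consumer to stop
c.join()
print(len(processed))  # 15 readings buffered and drained
```

Nothing is lost even though the producers and consumer run at different speeds, which is exactly why ingest queues sit in front of stream processors.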

Also, know how to do the processing. Look at some of the stream processing frameworks, whether it be Apache Beam, whether it be Flink, or whether it be Spark. I would look into those, if you're looking to get involved in IoT. If you have any questions, make sure you submit those in the comments section here below, or go to thomashenson.com/big-questions. Submit your questions, and I'll try to answer them on here.

Until next time, see you again.

 

Filed Under: IoT Tagged With: Big Data Big Questions, Data Engineer, IoT

4 Types of NoSQL Databases

May 30, 2018 by Thomas Henson Leave a Comment