Archives for June 2018

Web Developer to Machine Learning Engineer

June 25, 2018 by Thomas Henson

My Journey from Web Developer to Machine Learning Engineer

How can a web developer become a Machine Learning Engineer? It's a journey I made quite a few years back, when Hadoop became a popular platform for analyzing large data sets in a distributed environment. A lot has changed in machine learning since those days, but many of the concepts are still the same. TensorFlow, and now TensorFlow.js, has opened the door for machine learning to run on all kinds of machines. In fact, with TensorFlow.js web developers can train machine learning and deep learning models right in the browser using JavaScript.

In this video I will give you 3 tips for becoming a Machine Learning Engineer with your Web Development skills. Watch the video to learn more.

Transcript – Web Developer to Machine Learning Engineer

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s question I’m gonna tackle is about Web Development to Machine Learning Engineer. So, what I wanna talk about is how do you become a machine learning engineer if you’ve got a background with web development? So, not only are we gonna talk about my journey on it, but I’m also gonna give you three tips for making that journey if you’re a web developer thinking about that. So, find out more right after this.
And, we're back. And so, today's question that we're gonna tackle is about moving from Web Development into Machine Learning Engineer, but before we tackle that, I wanna encourage everybody out there: if you have any Big Data-related questions or any questions in general, just go ahead and throw them in the comment section here below or go to my website thomashenson.com/big-questions. Submit your questions there. I will answer them here on YouTube. I really love the engagement. I love being able to get out in front of the community, so just keep bringing those in. And then, if you haven't already, go ahead and subscribe to my YouTube channel. That way, you never miss an episode of Big Data Big Questions or some of the other cool things I'm doing with interviews, book reviews, and just general technology and awesomeness, right?

So, let's jump into the question for today. Today's question comes in about how do you transition from a web developer to a machine learning engineer? One of the reasons behind this is, I had a video where we talked about what a machine learning engineer is. It's a huge, huge hot topic area, and a lot of people are looking into how to become a data engineer or a machine learning engineer. So, there's a lot of interest in the web development community. Maybe there are a lot of people out there looking at this and saying, "Hey, I kinda wanna be on that bleeding edge. I wanna see what else is going on with this machine learning engineer thing. How do I get into that? Is that even possible?" Maybe you're sitting there writing C# applications, or maybe you're a PHP developer, ASP.NET, any genre there, maybe just general JavaScript, maybe you're a Node engineer, and you're wanting to really branch out and see what that's all about.

So, before we jump into the three tips I'm gonna give you for how you can do that, I wanna tell you that it is possible. My journey into Big Data: I was an ASP.NET developer, and this was quite a few years back, in the early days of the Hadoop community. So, I was involved starting back in Hadoop 1.0. For me, I was a contractor and we were coming to the end of a project that I'd been working on for many years, and I was looking around trying to see what else was out there. One of the projects that was out there was a data analytics project, and I knew that it was gonna be real heavy in research and kind of on the cutting edge, so I really started looking into it. As I got more involved with it, I started learning about Hadoop and I learned about some of the things that were being done as a community.

And so, I really got sparked from that point, but it is possible. It's something that I was able to do. There were a lot of things I had to learn, and there were a lot of things that, if I could go back, I would have learned first and done a little bit differently. My journey here now is because of all those trials and tribulations that I went through, but the cool thing is I get to share some of those for anybody out there who's a web developer trying to look into it, or anybody who's just getting started. Today, though, I wanna focus on the web developer. Your skills are gonna translate, right? But there are gonna be some gaps and some things that you need to do. So, if you're a web developer out there saying, "Hey, how do I jump in and transition into a machine learning engineer?" and you want to be involved with the algorithms and the day-to-day development activities for these huge, massive machine learning or deep learning or AI projects, how can you get involved?

So, my first tip, and you might not like this one: don't cheat and don't do the other tips first. Take two weeks, 30 minutes a day, and start learning about linear regression and linear algebra. There's a ton of free resources out there. You can go to Coursera, they have some free courses on it, and you can opt in and take the certificate to get certified. There's a lot on YouTube, so you can go through some daily training there. There are a lot of different blog posts out there. Just take 30 minutes a day to go through the math portion.

The math and statistics are gonna be one of your gaps, and so that's one of the things to close. I didn't take any formal courses, I just kinda went back and started looking at some of my old college notes, trying to figure it out. But I would go through the two weeks: watch some YouTube videos, read some blog posts, go to Coursera and sign up for a course. If you're on Pluralsight or any of the other online training platforms out there, you'll find plenty of resources there, so take it and do it for two weeks. Also, these are pretty big careers, so maybe sign up for a full course. If you're in college right now, take a linear regression course. If you're not in college but you have the opportunity, I can see why you would wanna do something like that, if this is a path that you're going down and really serious about.

So, I know it's not the most fun part of it, but I promise you, it's gonna help you down the road in understanding the terminology and the math behind what we're doing. Even if you're not looking to be a data scientist, and we've talked about data scientist versus data engineer here before, this is more toward the data engineer side, you're still gonna wanna have that math background.
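To make that first tip concrete, here's a minimal sketch, not from the video, of fitting a simple linear regression with NumPy's least-squares solver. The data and the study-hours example are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: hours of study (x) vs. quiz score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 61.0, 70.0, 78.0, 89.0])

# Design matrix [x, 1] so we solve y = slope * x + intercept
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimizes ||A @ [slope, intercept] - y||^2
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"y is roughly {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)
```

Working through why that design matrix and least-squares objective give you the best-fit line is exactly the kind of thing those two weeks of linear algebra cover.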

Tip number two: take two weeks, maybe three, but two weeks for sure, 30 minutes a day, so you can start to see the trend here, and walk through the Hadoop or Spark tutorials. You want to learn and understand how to do the basics. Do the basic tutorials. I'm not talking about setting up your own full Hadoop cluster in your own data center. I'm saying go to the tutorials with the Sandbox. There's a ton of resources out there. I have a lot of Pluralsight courses that are based around having just a little bit of SQL knowledge and a little bit of Linux knowledge, and that will help you go through it, but there are other resources out there, too. Obviously, I encourage you and would love for you to join in and watch some of my Pluralsight courses, but there's a ton of resources out there. This is a huge community.

Coursera has some baseline courses. You can find things on YouTube. I've got a ton of free resources on my blog that you can just walk through, so if you like walking through a tutorial, grab some of the code. I've got a ton of tutorials and a ton of just basic command-line stuff that you can do to start learning Hadoop and Spark. Like I said, I would just take 30 minutes a day, not too much at once. You can download one of the Sandboxes, whether it comes from Cloudera or from Hortonworks; they'll have Hadoop installed and also Spark. You can run through some of the basic tutorials. There's a ton of resources out there. Like I said, I've got a Pluralsight course, and we actually have quite a few Pluralsight courses around that. But take the time and go through it.
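As a minimal sketch of what tip two looks like in practice, and this is my own illustrative example rather than any specific Sandbox tutorial, here's the classic first Spark exercise in PySpark, assuming you have Spark available locally or in a Sandbox VM:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

# Local SparkSession; on a Sandbox or real cluster the master URL would differ
spark = SparkSession.builder.master("local[*]").appName("first-steps").getOrCreate()

# Tiny in-code sample so the example runs anywhere; swap in a real file or HDFS path later
lines = spark.createDataFrame(
    [("Hadoop and Spark are the baseline",),
     ("Spark runs on Hadoop or standalone",)],
    ["value"],
)

# Classic word count: split lines into words, group, and count
counts = (
    lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
         .where(col("word") != "")
         .groupBy("word")
         .count()
         .orderBy(col("count").desc())
)

counts.show(10)
spark.stop()
```

Thirty minutes a day against small exercises like this is plenty to get comfortable with how the work gets distributed before you ever touch a real cluster.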

So, we've gone through the math portion, and we've gone through learning the baseline of some of the big data stuff. Now, let's get into some of the machine learning. This is where you get to use your skills and your background with JavaScript, and I would start looking into TensorFlow.js. TensorFlow.js is machine learning in the browser. If you're not familiar with TensorFlow, I've talked through it here a little bit around why it's awesome, but TensorFlow, released by Google, is really, really popular right now in the Big Data community, especially around machine learning and deep learning. There are some really cool features in it, and there's a lot of stuff you can do with it from the browser side.

So, this is TensorFlow.js, and this is where your background gets to shine. Go through the tutorials, but don't skip the other steps. Go through these tutorials, start playing with the Pac-Man demo that they have, do the pitch one, is it a fastball or is it a curveball, and now you're gonna understand some of that math because you did step one first, right? You didn't cheat. I know everybody right now is opening a browser tab saying, "I'm going to TensorFlow.js right now. It's a really cool resource." But you're gonna understand the math behind it, and this is gonna get you started.
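To give a feel for the kind of model those tutorials build, here's a rough sketch of my own, not code from the TensorFlow.js demos, written with Python's Keras API, which the TensorFlow.js Layers API deliberately mirrors, so the JavaScript version with tf.sequential() and tf.layers.dense() reads almost line for line the same. The pitch-style data here is random and only stands in for real features.

```python
import numpy as np
import tensorflow as tf

# Made-up training data: 4 numeric features per pitch, 3 pitch types (one-hot labels)
x_train = np.random.rand(256, 4).astype("float32")
labels = np.random.randint(0, 3, size=256)
y_train = tf.keras.utils.to_categorical(labels, num_classes=3)

# Small dense classifier; in TensorFlow.js this would be tf.sequential() + tf.layers.dense()
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)

# Class probabilities for one sample, the same kind of call the browser demos make per pitch
print(model.predict(x_train[:1]))
```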

And so, now you've got the math background, and you've also got the ability to say, "Hey, if we wanted to scale this up and put it in some kind of distributed system, whether it be Hadoop or Spark, I understand some of that baseline, too." You've done all this in probably six weeks, and it gives you an opportunity to start looking around and understanding. Maybe there are some projects internally in your organization that you can look for, or maybe it's something you're trying to line up further down the road. Maybe you're in college right now looking to do it. By doing some of these steps right now, you're really setting yourself up for success.

Well, thanks again. I hope everyone got a lot of information from this. I hope you’re going out there and learning linear algebra and you can also play that Pac-Man game on TensorFlow.js. Any questions, make sure you submit them. Until next time. Thanks.

Want More Data Engineering Tips?

Sign up for my newsletter so you never miss a post or a YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Career Tagged With: BigDataBigQuestions, Machine Learning Engineer

Phases of IoT Application Development

June 19, 2018 by Thomas Henson


IoT Application Development

The Internet of Things is generating many opportunities for Data Engineers to develop useful applications. Think about self-driving cars: each one is essentially a large, moving IoT device. When developing IoT applications, developers typically keep 3 different phases in mind. In this video I will explain the 3 Phases of IoT Application Development.

Transcript – Phases of IoT Application Development

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s episode, I wanna talk more about IoT. So, I know we started talking about it in the previous video, but I really wanna dig into it a little bit more because I think it’s something very important for data engineers and anybody involved in any kind of modern applications or anybody involved in Big Data. So, find out about the phases, the three phases of IoT right after this.

Welcome back. Today, we're gonna continue to talk more about IoT, or the Internet of Things. If you're not familiar, I've got a video up here where we talked about why it's important for data engineers to know, but I think it's important for anybody involved in modern applications, or even on the business side of things. You're really gonna see a lot of new information coming into your data center and into your projects because of all these sensors out there. So, let's get a little more comfortable with what's going on with the technology and how it's gonna apply to us.

And so, in this video, I wanna talk about the three phases. So, I think there are three phases of IoT and I think we’re starting to get into the third phase, and you’ll see why it’s gonna make sense for data engineers and modern applications when we talk about that third phase.

And so, just as a recap, remember, IoT is not a new concept. It's the ability to have devices out in the physical world that have some kind of IP address and can also send data to, and receive data back from, your core data center or your core analytics processing. Think about the example I've used before, the dash button, right? You have a dash button where, if you're out of toilet paper or whatever it is in your house, you're able to push that button. It connects out to a gateway and up to the cloud with Amazon to be able to say, "Hey, order some more of this particular brand," and send it to your door. So, a real quick example there.

But let's talk a little bit more about the phases, and I think that will give you more of an understanding of, "Okay, this is how that concept and that whole ecosystem of devices and data and gateways all work together." Just like from a web development perspective, where we talked about Web 1.0 and Web 2.0, I think with IoT we've gone through phases of IoT 1.0 and 2.0. I think these phases are more compressed than the phases of the web, and part of that is just that we move so fast now.

And so, the first phase was everybody had a sensor, right? Maybe this one's not a smartwatch, but think about fitness trackers and smartwatches. Everybody was like, "Oh, it's kinda cool, right? I can get in contests with my friends and track how many steps I have." That's pretty cool. Really, we didn't understand what to do with it. It was still somewhat of a novelty, and so everybody who had these sensors just really didn't know what to do with them other than tracking, right? Those are things we had been doing before, but now we had an internet connection and we could control them from our phones.

Fast-forward into phase two, so once we go into phase two, it wasn’t just about these smart trackers and these devices that were attached to us, but it started to become… we had smart everything in our homes, right? So, in phase two, we started having, think of a refrigerator, so you had a smart refrigerator, and people are like, “Well, that’s kinda cool. We have a refrigerator that’s connected to the internet. I can look at photos of it from my phone. And so, if I’m at the grocery store, that’s like, hey, do I need any more ranch dressing or do I need any more Tide Pods…” well, maybe Tide Pods are in your refrigerator, but, well, hopefully not. But if you had pickles or things that you’re looking for at the grocery store or maybe even just your washing machine, you’re able to turn your washing machine on with your device and say, “Okay, let’s turn the washing machine on.” That’s pretty cool. You set it up.

It's still kinda novel, though, and not really where we see this going, because as data engineers we know that analytics, being able to predict and being able to prescribe outcomes, is where it really goes, and that's where phase three is. Phase three is what we're really entering right now.

Phase three is when we're able to take all this information. Think of that washing machine. We're not just turning that washing machine on from our phone. That device has diagnostics in it that it's gonna run. And so, let's say that there's an error in the onboard electronics, or maybe some kind of circuit, maybe just a $10 component on your washing machine that, if you replace it within the next 30 days, is gonna prevent you from having to get a brand-new washing machine. Well, that's pretty cool, right? That's really cool. So, it can send you that information, but instead of just sending you that information, check this out. The diagnostics that run go out and send that information to the data center. The data center actually looks and finds service providers in your area, because it knows where this device is, you've registered it, it knows where your home is. So, it's gonna find those service providers in your area and it's gonna send you back an alert saying, "Hey, we found this component that needs to be replaced on your washing machine. This is gonna prevent you from having to buy a new washing machine, and maybe it will prevent you from having a flooded washing machine," and man, who wants to clean that up and then have to buy a new washing machine?

So, how about this: here are some times we've scheduled, we have a service person who can come to your area and replace that part for you, when would you like to schedule that? That's pretty awesome, right? That's really starting to say, "Hey, we found an error, we believe this is the component that can fix it, and here are some times for us to be able to fix it." How many steps did that take a human on the [inaudible]? Which is really good, right? From a consumer standpoint, we want products like that. And so, there's a ton of different new use cases that we can start to see. We're starting to see that now with what I call IoT phase three.
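Boiled down, a phase-three flow like that washing machine is a small diagnostic event going out and a recommendation coming back. Here's a purely hypothetical sketch, with every field name, error code, and the provider lookup invented for illustration, of what that round trip might look like in code:

```python
import json
from datetime import datetime, timezone

# Hypothetical diagnostic event the appliance sends up to the data center
event = {
    "device_id": "washer-0042",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "error_code": "PUMP_RELAY_DEGRADED",   # made-up code for the failing $10 part
    "severity": "warning",
    "home_zip": "35801",                   # registered location, used to match providers
}

def build_service_offer(evt):
    """Stand-in for the data center side: match a provider and propose repair times."""
    providers = {"35801": "Rocket City Appliance Repair"}   # fabricated lookup table
    return {
        "device_id": evt["device_id"],
        "recommended_part": "pump relay",
        "provider": providers.get(evt["home_zip"], "no local provider found"),
        "proposed_times": ["2018-07-02T09:00", "2018-07-03T14:00"],
        "message": "Replacing this part now avoids replacing the whole washer.",
    }

print(json.dumps(build_service_offer(event), indent=2))
```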

So, just remember the phases. The first one: sensors everywhere, smart sensors, but we were really just tracking things. The second became mobile control, the ability to have smart everything, so we had the smart refrigerator, we had the smart washer and dryer, but we still just didn't know what we could do with it. And now, we're moving into phase three. We're starting to prescribe, so we're starting to have these predictive analytics saying, "Hey, these are things that might happen, and by the way, this is how we can fix it." This is actually gonna give consumers more information and just a better experience with these products. It can save you from having to pick up the phone and call to schedule a time for somebody to come fix your washing machine. It's gonna prevent you from having to go out and buy a new washing machine. And it makes products more sticky for those companies.

So, that’s all for today’s episode of Big Data Big Questions. Make sure to subscribe to the channel, submit any questions that you have. If you have any questions that’s related to Big Data, IoT, machine learning, hey, just send me any questions, I’ll try to answer them for you if I get an opportunity. But submit those here or go to my website. Make sure you subscribe and I’ll see you next time on the next episode of Big Data Big Questions.

 

Want More Data Engineering Tips?

Sign up for my newsletter so you never miss a post or a YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: IoT Tagged With: IoT, IoT Development

Isilon Quick Tips: Snapshots in OneFS

June 11, 2018 by Thomas Henson


Snapshots in OneFS 8.0 & Beyond

Isilon's OneFS snapshots allow administrators to protect data at the directory and sub-directory level. Snapshots can be taken as one-time snaps or on a schedule. OneFS snapshots also integrate with Windows Shadow Copies to allow rolling back directories and files. In this video we will walk through setting up snapshots in OneFS from the WebUI. Watch to find out how to set up one-time snapshots in OneFS.

Transcript

Coming soon…

 

Want More Data Engineering Tips?

Sign up for my newsletter so you never miss a post or a YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Isilon Tagged With: Isilon, Isilon Quick Tips

GDPR Good or Bad?

June 7, 2018 by Thomas Henson


Is GDPR Good or Bad?

How many emails have you received about GDPR? At this point I almost have to set a rule in Outlook to send every email with the word "GDPR" in it to a separate folder. I've explained what GDPR is and how it applies to Data Engineers, but is it good or bad? Generally, regulations are put in place to make society better, but does Big Data need regulation? Find out my thoughts on the policies put in place with GDPR in the video below.

Transcript – GDPR Good Or Bad?

Hi folks! Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I'm going to jump back in a little bit more around GDPR. I've had a lot of questions and seen a lot of things on Twitter, and I just thought it would be a great time to discuss: is GDPR good or bad?

This is not going to be about politics. It's going to be about policy and what's really driving GDPR. What does it really mean? Is it a good thing for those of us involved in big data, and for us as consumers? Find out more, right after this.

[Sound effects]

Welcome back. This is the second episode where we’re going to talk about GDPR. If you’re curious about what GDPR is and what it means to data engineers, make sure you check out the video that I did before just talking about, what does GDPR mean to data engineers, machine learning engineers, or data scientists?

I really wanted to focus this time on, we've talked about what it is, but what does it really mean? Is it a good thing? I've gotten a ton of emails just on my personal accounts, from people who've built websites for me, from Hortonworks, and Cloudera, and everybody's kind of talking about what GDPR means to us. Every time you turn around right now, you're going to have to accept some kind of policy update, whether it be from Apple on your iPhone, or from Android, or anybody that's collecting or holding onto your data. All those privacy updates are going on, and you're going to have to click yes on each one of them.

Yes, I understand that you’re going to protect my data, and it’s going to be more private. Is that a good thing or is it a bad thing? Is it okay for us to have regulation around it?

I look at it from this perspective. I was thinking about it, and it’s like, if you really think about where we’re going, there’s regulation for everything. For most products, as they get big. What this really means to me, and why I think it’s a good thing, is because this shows that your digital data is growing up. It’s maturing. When you think about it, in America, when cars first came out, we didn’t really have regulations around it. You didn’t have to get a license. It was just something fun that you could do, and if you could afford a car, you could get it. As that product started maturing, we started realizing, “Hey, this is something that needs to be regulated to some extent.”

We need to have some kind of standards around who's going to drive on what side of the road, and how all of that's going to work. If you think about digital data, we're getting to that point. A couple of reasons why we're at that point: the first thing, privacy matters. Privacy has always mattered, and people really pay attention to being able to be private and have those things. For a long time, though, data hasn't been treated as one of them.

We have regulations and laws around whether people can go into private residences without consent and things like that. Your data is the same way, and that's where we're starting to look at it and say, "Okay, that data, you have rights to it. It's yours. You created it, so your privacy does matter." That's where the regulations are coming from. Another big thing is, think about how many different data breaches we have.

For a long time, if you follow Troy Hunt, or anybody that's big in security, you could see at least weekly that they're talking about a huge data breach that happened. That's compounded with trying to figure out, "Okay, if you're collecting this data, how much of a liability is that for you, and then how much of a responsibility is it of yours if the data is breached? Are there certain standards you should have to follow to better protect that data, so that you can turn around and say, 'Hey, we do have some bad actors out there that have hacked and taken this data, but we went through these steps'?"

There’s not really been a standard for what those steps are, and so this is a further implementation of it. The thing, and one of the reasons, a couple of the reasons, actually, that I think that it’s a good thing, right? Not talking politics here, just why I think GDPR’s good.

It gives you back control of your data. It gives you the opportunity to say, “Hey, I would like for you to be able to report and see what data you have on me. What does my digital footprint look like?” What kind of data are you collecting on me? You have the authorization to ask for that and to be able to get an answer to that.

Secondly, you can say, “Hey, I want to drop off. I want all my data gone. I don’t want you to collect and hold onto my data.” I think this is a big point, because while I’m on Facebook, and I’ve been a Facebook user for I don’t know how long, just a long time, I’ve heard of other people and other stories around people who’ve gone off Facebook. You’ve probably not seen them. They’ve deleted their profile, only to come back a year or two later, and all their stuff’s still there.

I can't say that I've seen that happen to me, but I've heard a ton of stories, enough that I know there must be some sort of truth to it. This is an opportunity where, if you do want to get off the grid, so it's like, "Hey, you know, it's 2018, I'm going to get off the grid," this gives you the opportunity. That's another reason why I think it's really good. It puts you in control of your data and lets you decide.

Also, it’s going to create a framework for companies to have a standard around how they’re going to protect that data. It’s going to protect companies and organizations that collect data by having a set of standards that we’re able to follow, to say, “Hey, we’re doing as much as we can to be able to protect, and make sure that, your data, when it comes in, is as secure as can be.” This gives us the opportunity to start setting those standards and testing it. Maybe we won’t have as many data breaches in the future.

Maybe we can trust and understand that, while there are bad actors out there, there will be fewer successful attacks, because it really puts the onus on the people who collect the data. We had some of that before, but a lot of it has been, I would say, about public perception. You want your public perception to be okay. How much legal bearing is there going to be on companies if that data is breached? Now, this gives us the framework to say, "Hey, there are regulations, and we are saying that this is something that you need to protect."

Those are just my thoughts on it. I'd love to hear yours. If anybody has any opposing views or anything like that, put them in the comments section here below or just reach out and ask. Let's jump on YouTube, let's record a video, and maybe talk about it a little bit more. Let me know where I'm wrong, but those are my thoughts. I think, in general, GDPR is good. I think there are going to be a lot of opportunities around products and around people with that expertise, so if you're looking to get involved in big data, and you like digging into and following regulations and putting security measures into practice, then I think GDPR is a good place to go.

I think there’s going to be a lot of companies that are going to make products. There’s going to be products that are out there, that’s going to help with GDPR compliance, because May 25th’s coming, 2018. I don’t know that everybody’s going to be ready. Until next time.

Filed Under: Business Tagged With: GDPR

Why Tensorflow is Awesome for Machine Learning

June 5, 2018 by Thomas Henson


Why Tensorflow is Awesome for Machine Learning

Machine learning and deep learning have exploded in both growth and workflows in the past year. When I first started out with machine learning, the process was still somewhat limited, as were the frameworks. Data scientists would configure and tune models on a local machine, only to have to recreate the work when pushing to production. This process was extremely time consuming. Google and the Google Brain team released TensorFlow in 2015. Find out why TensorFlow is awesome for machine learning in the video below.

Transcript – Why Tensorflow is Awesome for Machine Learning

Hi folks, welcome back to another episode of Big Data Big Questions. Today, I'm going to tackle a question around TensorFlow. I wanted to give you my feedback. I've been diving into TensorFlow, looking into how you set it up, actually playing around with it, and seeing how it differs from some of the other machine learning tools I've used in the past, like Mahout and MADlib, and some other things.
I wanted to give you my take on Tensorflow, tell you why I think it’s great. Tell you how you can get hands-on with it, and just give you some background on it. Find out more, right after this.

Welcome back. Before we jump into my thoughts on Tensorflow, I did want to encourage you to make sure you subscribe to the YouTube channel here, and then also, if you have any questions, go ahead. Send them in. You can go to my website, thomashenson.com/big-questions.

I will answer any of your questions there. You can put them in the YouTube comments section here below, or you can use the hashtag #bigdatabigquestions on Twitter. I will answer those as quickly as I can. Thank you everybody for subscribing. Now, let's talk a little bit about TensorFlow.

I've been going down more of the deep learning path, doing some research and some learning on my own. One of the first things that I've started really diving into is TensorFlow. I wanted to look at TensorFlow because of my background: when I first started out in the Hadoop ecosystem, we weren't really doing streaming. You've probably heard me talk a little bit about the Kappa and the Lambda architectures here; make sure you check those videos out. One of the things that we did use back when we were just doing batch, more of a Lambda architecture workflow, is Mahout, and I used it a good bit.

We used Mahout, and I used SVD. I wanted to see how Tensorflow differed, because a lot of people are talking about Tensorflow, like, “Hey, you know, a lot of training.” There’s a lot of training out there. There’s a lot of YouTube videos out there, and there’s just a lot of excitement for Tensorflow. Me, wanting to dig in, I looked in, and I started playing around with it.

One of the first things that I really noticed, and one of the things that I really liked about TensorFlow, is this: when we think about using Mahout or some of the other, older frameworks, one of the problems that I had was that our data scientists would play around and figure out what they wanted from their data model, exactly which algorithms they were going to use.

A lot of times, and we were still new to this, they were coming in from using things on their own machines. They were using MATLAB, or Octave, or some had been using Excel. Then you go and say, "Hey, well, you know, we were looking at this little sample of data. Now, let's scale it out to terabytes and terabytes of data. I want to see how this is going to work."

Those algorithms are totally different. What you can run on your local machine, and the way it's processed, is totally different from the way Mahout does it, or MADlib, or MLlib, or any of the distributed machine learning libraries. Not all the work that you did there was lost, but there were a lot of new steps you had to go through. With TensorFlow, the thing is, you can run it on your local machine. You don't have to have a distributed environment, but it's the same process and the same way the algorithm works that's going to run on your huge cluster.

Just think about it like this. To do TensorFlow, you don't have to set up a distributed environment. It's not going to time out. It's not going to be fast on your single machine, so if you're trying to churn through a terabyte of data that you've got on your laptop, have at it, but it's not going to be as efficient as a setup in your data center.

The cool thing is, when you're doing sampling and testing, you can do that locally, on your local machine. Then, when it comes time to scale it, you're really just porting it, because you can use Docker and some other cool tools on the back end to expand that into your data center.
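That local-first workflow is the part worth calling out. Here's a rough sketch of my own, with made-up data and a hypothetical output path, of a small Keras model you could train on a laptop-sized sample; the same script, packaged in a Docker image, is what you'd later point at the full dataset on bigger hardware.

```python
import numpy as np
import tensorflow as tf

# Made-up tabular data standing in for a small local sample of the real dataset
x = np.random.rand(512, 8).astype("float32")
y = (x.sum(axis=1) > 4.0).astype("float32")   # synthetic binary label

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The same fit/save code runs unchanged on a laptop or inside a container on a cluster;
# only the data source and the hardware underneath change.
model.fit(x, y, epochs=3, batch_size=64, verbose=0)
model.save("local_model")   # hypothetical SavedModel output path
print(model.evaluate(x, y, verbose=0))
```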

I thought that was really cool. A little bit about TensorFlow: it was incubated out of Google. If you're interested in it, I would encourage you, and I'll put the link in the show notes on my website, to check out the research paper, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems." It goes into some of the research behind it, and why TensorFlow, why now.

I’m really, really heavy into it, and I know sometimes these research papers, or most of the time for me, the research papers are kind of over your head. First time you read it, you might be like, “I don’t really understand it,” but then the second time, and as you see it more, it’s going to help you. That’s my little tip about research papers, just go ahead, read them, become familiar with them. It’s okay that you don’t understand it, because it means that you’re actually learning.

Also, there are a lot of resources out there for TensorFlow. There's a website you can go to where you can start playing around with how these neural networks work in TensorFlow and the different parameters you can tweak. It gives you a visualization of how it's going to identify image data, and then you can take that and use TensorFlow in your own environment. I would encourage you to use that website, go ahead and play, and look for all the links in the show notes here.

Until then, that’s all I wanted to talk about today on Tensorflow, but until the next time, I will see you on Big Data Big Questions. Thank you.

Filed Under: Uncategorized

Why Data Engineers Should Care About IoT

June 4, 2018 by Thomas Henson


Why Data Engineers Should Care About IoT

The Internet of Things has been around for a few years but has hit an all-time high for buzzword status. Is IoT important for Data Engineers and Machine Learning Engineers to understand? Gartner predicts there will be over 20 billion connected devices worldwide by 2020. The data from these devices will be included in current and emerging big data workflows. Data Engineers and Machine Learning Engineers will need to understand how to quickly process this data and merge it with existing data sources. Learn why Data Engineers should care about IoT in this episode of Big Data Big Questions.

Transcript

Hi folks, Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I want to tackle IoT for data engineers. I’m going to explain why IoT, or the Internet of Things, matters for data engineers, and how it’s going to affect our careers, how it’s going to affect our day-to-day jobs, and honestly, just the data that we’re going to manage. Find out more, right after this.
[Sound effects]

Today's question is: what is IoT, and how does it affect data engineers? You've probably seen the buzzword, or the concept, of the Internet of Things, but what does that really mean? Is it just these little dash buttons that we have? Is it this? Wait a minute. Is that ordering something?

Is this what IoT is, or is it the whole ecosystem and concept around it? First things first. IoT, or the Internet of Things, is the concept of all these connected devices, right? It's not something that is, I would say, brand new. It's something that's been out there for a while, and when we really get down to it, it is a sensor. We have these sensors, these cheap sensors.

We’ve had them for a long time, but what we haven’t had is all these devices connected with an IP address to the Internet, that can send the data. That’s the big part of the concept. It’s not just about the sensor, it’s about being able to move the data from the sensor.

This gives us the ability to manage things in the physical world, bring that data back, do some analytics on it, and even push data back out to the devices. The cool thing is, generally, IoT devices are, I would say, economical or cheap devices that have an IP address and can just pull in information. Think about a sensor: if you have a smartwatch that's connected to the Internet, it can feed information up to you. That's where some of it all started. These dash buttons, I can have them installed all around my house and push a button whenever I need something. Or start to look at what we're talking about with smart refrigerators. Smart refrigerators can take pictures and have images of all the content that's in your refrigerator, so if you're at the store, you look and you're like, "Hey, you know, do I need that ranch dressing? Yeah? Let me check my refrigerator here."

Also, a sensor could be inside the refrigerator, and tell you if something’s going wrong. Maybe the ice maker is blocked. Maybe you need a new water filter in your refrigerator, and the refrigerator knows that, has a sensor into it. It can send information to wherever, to be able to order that water filter for you and send it to your home, so you don’t even have to go in, and remember, “Hey, has it been 90 days? Or was it 60 days? Is it time? Is it time to change it?” Then, you’re going to forget. You’re going to let it go over, but now, you can have this sensor that’s going to tell you, and it’s going to order that for you. That’s the concept. It’s not just about the sensor. It’s about that ecosystem.

It’s about being able to move the data. For data engineers, what does this mean? Why do we care?

There are a lot of predictions out there about IoT and where it's going. One of the big ones is Gartner's prediction that by 2020 we will have over 20 billion of these devices. Not just the dash buttons, but think of all these sensors, all these things with IP addresses connected to the Internet. What does that mean from a data perspective? Some of the predictions I've seen are around 44 zettabytes of data, counting the new data coming in plus the data we already have. Think about it. What is a zettabyte? It's not a petabyte. It's a million petabytes.

How are we going to manage all this data, when right now we're still managing terabytes and petabytes of data and being like, "Man! This is a lot of data!"? That's why it's important for data engineers: IoT is contributing to this deluge of data. How does all that affect us, as far as the concepts go? We're starting to talk about IoT, and sensors, and having this data on the edge, being able to pull information back, but also being able to push information out. What does that start to tell us?

As we’ve talked more and more about real-time analytics, this is where we’re really going to start to see real-time analytics really taking hold. As soon as we can get that data, and be able to analyze it and push information back out, that’s what’s going to help us. Think about it with automated cars, with a lot of the things that are going on outside in the physical world, where we have sensors, and devices talking to devices, streaming analytics is going to be huge in IoT.

The question becomes, if you're looking to get involved in IoT, what are some of the projects? What are some of the things you can do to contribute and be a part of this IoT revolution? I would look into some of the messaging queues. Look at Pravega, look at Kafka, even look at RabbitMQ and some of the other messaging queues, because think about it: 20 billion devices, maybe more, by 2020. As these readings come in, they have to land in a queue. They have to be stored somewhere before they can be processed and before we can analyze them. I would look into the storage aspect of that.
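As a minimal sketch of that ingest idea, here's what producing a single sensor reading onto a Kafka topic looks like with the kafka-python client. The broker address, topic name, and payload fields are all hypothetical placeholders.

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client

# Hypothetical local broker; a real deployment would list several bootstrap servers
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Made-up sensor payload standing in for a device reading
reading = {"device_id": "thermostat-17", "temp_f": 71.3, "ts": time.time()}

# Queue the reading on a topic so downstream processing (Spark, Flink, Beam) can pick it up
producer.send("iot-sensor-readings", value=reading)
producer.flush()
```

The stream-processing frameworks mentioned below would then subscribe to that same topic on the consuming side.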

Also, know how to do the processing. Look at some of your streaming processing, whether it be Apache Beam, whether it be Flink, or whether it be Spark. I would look into those, if you’re looking to get involved in IoT. If you have any questions, make sure you submit those in the comments section here below, or go to thomashenson.com/big-questions. Submit your questions, and I’ll try to answer them on here.

Until next time, see you again.

 

Filed Under: IoT Tagged With: Big Data Big Questions, Data Engineer, IoT
