Archives for November 2017

Should Data Engineers Know Machine Learning Algorithms?

November 10, 2017 by Thomas Henson

How involved should Data Engineers be in learning Machine Learning Algorithms?

For the past few years, Data Scientist has been one of the hottest jobs in IT. A huge part of what Data Scientists do is selecting Machine Learning Algorithms for projects like SmartHomes and SmartCars. What about Data Engineers: should they know Machine Learning Algorithms as well? Find out in this episode of Big Data Big Questions.

Transcript – Should Data Engineers Know Machine Learning Algorithms?

Hi, folks. Thomas Henson here, with ThomasHenson.com, and today is another episode of Big Data, Big Questions. And so, today’s question that comes in is, “Should data engineers know machine-learning algorithms?” So, we’re going to tackle that question right after this.

Welcome back. So, today, we’re going to talk a little bit about algorithms, right? So, you know, put your math hat on, and let’s dive into this question today. And so, today’s question, it’s one I get a lot. It’s about the role of a data engineer in machine-learning. And basically, it is… You know, I’ve taken this question from a couple of different sources that I’ve seen, where they’ve asked, you know, “Should data engineers know machine learning algorithms?” And kind of where some of that falls into is, you know, what is the role of the data engineer, and what is the role of the data scientist? And so, really, this question, for me, is really simple. I’m going to go off of my experience and kind of share with you what I’ve done around machine-learning algorithms and how I’ve approached it in my career as a data engineer, software engineer, you know, Hadoop administrator.

There’s a couple of different ways to look at it, but basically, the way that I’ve approached it is I haven’t really learned it. And when I say, “Learned it,” or, “Know it,” I’ve not been in…you know, I’m not going to make a recommendation on it. So, you know, the way I look at it is you should be familiar with them. So, you should be familiar with them, especially familiar with them as far as, like, what’s involved in the package? So, are you using Mahout? You know, what are the algorithms in there, what are the algorithms in your workflow? And then, all the other libraries too.

So, if you’re evaluating other libraries…so, maybe you guys are looking to…you know, maybe you haven’t used Spark and you want to look at the MLlib library that’s there, and you’re kind of going back and forth through those, you want to understand from a basic very high-level, you know, what those algorithms are, and for sure, what algorithms you’re using in your environment, so you can make an educated recommendation saying, “Hey, you know, I think we should move this. Let’s still have the data scientists involved, and have them, you know, look and make sure that the algorithms that we’re going to be using from those packages are going to fall in line with what we’re really using,” because that’s one of the things too, you’ll find that they will differ a little bit, so, you know, what we’re using in Mahout may not be exactly the same, you know, version in, you know, MADlib, or, you know, the MLlib library.

And so, just be able to understand kind of for sure what’s in your workflow. Be familiar with them too. Another thing that I did…so, like I said, be familiar with them from a high level, but don’t be making a recommendation. I actually, you know, picked one, so I would say, you know, be familiar with them, but pick one that you really want to…you know, really want to understand and learn. I picked Singular Value Decomposition, because that’s something that we used a lot in our workflow, and so, I was just kind of…had a natural curiosity for it, and it…you know, it had a really cool story too around it. So, you know, I found some stories around it, you know, it was made really popular with the Netflix Challenge. So, back…Netflix had a challenge for…you know, to, “Beat our data scientists with your algorithm.” And so, SVD was used to, you know, do some of the sorting there, and it was kind of made famous from that perspective, and so, you know, I was familiar with it, but I made sure that I understood one, just for natural curiosity.
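For a feel of what Singular Value Decomposition does, here is a minimal sketch in plain Python. It uses power iteration to estimate only the largest singular value and its two vectors, then builds a rank-1 approximation of a tiny ratings table. The `ratings` numbers and the function name `top_singular_triplet` are made up for illustration, and this is not the method used in the Netflix Prize entries or in any of the libraries mentioned above.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Estimate the largest singular value and its left/right vectors.

    Uses power iteration on A^T A: repeatedly apply A, then A^T,
    and renormalize. Illustrative only; real libraries use more
    robust algorithms.
    """
    rows, cols = len(matrix), len(matrix[0])
    # Start from a uniform unit vector.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # With v converged, A v = sigma * u.
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical user-by-movie ratings (rows are users, columns are movies).
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]

sigma, u, v = top_singular_triplet(ratings)

# Rank-1 approximation: the single strongest "taste" pattern in the table.
rank1 = [[sigma * u[i] * v[j] for j in range(len(v))] for i in range(len(u))]

print("largest singular value:", round(sigma, 3))
for row in rank1:
    print([round(x, 2) for x in row])
```

The rank-1 table keeps the dominant pattern and drops the rest, which is the basic idea behind using SVD-style factorization to fill in ratings a user has not given yet.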

Now, if you are looking to, you know, at some point, make a jump, right, to data scientist, if you’re a data engineer, and at some point down the road, you’d like to be…you know, “I want to be the data scientist. I want to say, ‘Hey, this is the algorithm we should use.’” You know, maybe you just want to be a data scientist because, you know, for a couple years running, it’s been the…you know, the sexiest career, you know, in IT for a while, and so, if that’s kind of your approach, you know, definitely start to know them.

Obviously, learn the ones that are in your environment first, because that’s going to be the easiest, because you’re going to have the access to, you know, why you’re using it, how you’re using it, and you have access to the data scientists too, to kind of, you know, take you under their wing, to some extent, and, you know, show you the ins and outs of why you’re using what you did and, you know, kind of why you didn’t use other ones too. For an aspiring data scientist, then yes, for sure, you want to jump in and, you know, start to understand and start to know them. But for a data engineer, I don’t think you have to learn the algorithms, right? I think you have to be familiar with them, I think, you know, for natural curiosity, you know, maybe learn one or two.

But really, our role is not to recommend and say, “Hey, you know, these are the algorithms I think we should use,” or even, like, to pick packages and say, “Hey, these packages here, we’re going to…you know, we’re going to standardize on that and that’s the only thing we’re going to use.” That’s…you know, that’s not really our role, right?

If you have any questions, make sure you submit them to Big Data, Big Questions. You can do it from the website, go to Twitter and use the hashtag #BigDataBigQuestions, or put them in the comment section here, however you want to get in touch with me and get those questions answered. Also, make sure you subscribe so that you never miss any of this Big Data, Big Questions goodness, and so that you can always, you know, learn more. Thanks again, folks!

Show Notes

Mahout

Spark MLlib

MADlib

Big Data Big Questions

Netflix Prize

Singular-Value Decomposition

Python vs. Scala Freelance Data Engineers

November 9, 2017 by Thomas Henson

Which is better for Freelance Data Engineers: Scala or Python?

Picking up freelance gigs can be a challenge, especially when just starting out. So which language is better for getting freelance gigs, Scala or Python? In today’s episode of Big Data Big Questions, I answer a question about freelance options for Data Engineers. Watch the video below to find out about Scala and Python freelance options.

Transcript – Python vs. Scala Freelance Data Engineers Video

Hi. I’m Thomas Henson with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question focuses around Python versus Scala for freelancers. Today, we’re going to tackle that question. Find out more right after this.

Today’s question comes in from our Big Data Big Questions series. If you have any questions, go to thomashenson.com Big Data Big Questions. Submit your questions. That’s where today’s is coming from.

Today, our question’s coming in from Gill Taso [Phonetic]. I hope I’m saying your name right. I hope I’m not mispronouncing it. Thanks for following, and thanks for asking a question, Gill.

Today, Gill’s question is, “I saw your video on learning how to become a data engineer, and it was very interesting.” Thank you, Gill. “I subscribe to your Twitter and your YouTube channel.” Well, thanks again. That’s awesome. That’s how you can make sure that you never miss an episode. Currently Gill is a junior data engineer who works mainly with technologies like Impala and Hive, with a little bit of Spark SQL.

Said, “I would like to improve my skills in Spark. I have intermediate skills in Python, but as I review the industry, it seems it’s better to learn Scala, which is very tough, at least for me. Do you think it’s better for me to sharpen my skills in Python or transition to Scala now? Also, I’m looking for ways to be a freelance data engineer in the short-term, maybe two to three years, because I’m looking more for more flexibility. Do you have any advice to give me for training, skills, and technology?”

Thanks for the question, Gill. That’s actually a two-part question, I feel like, so I feel like there’s a couple different ways that we can look at it. First, the question is, for a junior data engineer who has really good intermediate skills in Python, should you transition more to Scala? The second part of the question is, what should you do for freelancing?

There’s two different ways to look at it. If we look at it in your current career, where the industry’s going, is Python a viable option, or should you start to do Scala? Or, where’s that market going to be at for freelancing? I think those are two different things. What we see for freelancing and then what we see for industry and in your career progression, and especially if you’re in the Spark community. It seems like you’ve hinted that you’ve been working with Spark for a little while, and starting to get into more of the Spark, and that’s why you’re starting to look at Scala.

We’ll break it down real quick. First, my thoughts on Python versus Scala, just in the industry, I think that, with you saying that you have intermediate Python skills, if you’re looking at Spark, and that’s what you’re really wanting to specialize a little bit more in, I don’t think it would hurt you. I think it would be beneficial for you to start going ahead and looking into Scala, and maybe doing some research, there.

I definitely wouldn’t push back, and say, “I don’t need those Python skills anymore.” Both of them have interpreters. You can write both of those inside of Spark, so you’re covered there. There’s a little more functionality, it seems like, and more documentation around Scala. Part of it is because it’s extending out that Java, so it keeps a lot of people from having to write those Java jobs versus just using the Scala interpreter, the REPL, or just using the functions there.

From where I see it, and where you want to go, and especially with you saying that you’re a junior data engineer, if you want to progress, I don’t think there’s a problem with you learning Scala. I just wouldn’t say that you really need to jump to it now.

It’s hard for us. As technologists, as engineers, as developers, we love playing with new toys. Any time there’s a new language or a new toy, a new kind of technology, whether it be some kind of project with Docker containers, or Mesos, or YARN, anytime there’s something new, we always look at that, like, “Oh, man!” That’s a shiny new object we want to play with.

What should we do from our career perspective? I don’t think you have to go all in on Scala. I think that, even with Python, and where we see Python now, and just look at the different open source projects and everything that’s going on with Python. I think you’re set there. One recommendation that I was given early in my career, that I want to pass along to you, is being a junior engineer, saying you have intermediate Python skills, what I would start to do is, I would start to maybe start looking at specialization.

Look at a couple different open source projects that revolve around Python, and maybe get involved. That can help you build your foundation, your portfolio, and your career. You still have options.

You’ve heard me talk about it before in the data engineer video. Take 30 minutes. Take 30 minutes a day and maybe two to three times a week, mix in some Scala research in there. That way, you’re learning it, and you’re starting to get it, but you’re also still contributing to your Python skills, which, like I said, you’re at the intermediate phase. There’s a way for you to specialize, because Python’s a pretty big language, and so there are different specializations that you can do. I would look at that.

The second part of the question that’s a little bit different, and I’m going to have somewhat of a different opinion here, is you’re asking about options for freelancing. Which one should you focus on if you want to freelance?

The way that I would look at that is, there are three important things that I look at for freelancing. If you’re freelancing, you’re focused on your brand. That’s how you’re going to get freelancing jobs. That’s how you’re going to be able to market yourself as a freelancer and continue to get that revenue there.

That starts with specialization. Like I was talking about a little bit with Python and starting to pick a specialization, I would really focus on it if that’s what I wanted to do. If I wanted to get into freelancing, wanted to be a freelance data engineer, you really want to start focusing and start specializing.

That’s not to say that you’re not going to pick up other skills and continue to learn those, but I think specialization, especially when you’re new to freelancing, is going to be very, very important, even more so important than just in your current career, where you’re working, maybe, at your company. Specialization is important there, but not as important as a freelancer. So, the second thing that’s really big with freelancing is the market. What is the total market need? Just looking based off that, if you were looking and asking that question, “Hey, I want to be a freelancer,” Python or Scala? There’s going to be a lot more jobs that are available right now, a lot more gigs to be able to get, that are going to revolve around Python, and I think that trend’s going to continue. I don’t think that Scala’s going to overtake that market.

Even segmented in the big data area, too, so even if we got outside of what’s going on with the big data community, I think those jobs are still going to continue to be maybe two to three times more for Python developers.

You’re looking at the brand. You’re looking at the market need, but then the third key for freelancing is your reputation. Being new, you’re not going to have a lot of reputation, so what are some things that you can do to be able to get your name out there, and also get involved in some jobs?

It’s going to go back to the open source. I would look at some open source projects, maybe around the areas that you want to specialize in, and just get involved. Getting involved in those is sometimes just as easy as joining the user discussion list, and maybe answering some questions there. Getting on the developer list, and just maybe reading the emails, and see how it’s going.

I’m telling you, just getting on those email lists and seeing what’s going on, and as you start to contribute on there, that’s a great way for you to be able to, one, have code that you can point back to, or that you’ve been a part of the community. Two, that helps your reputation, so it’s going to be easier for you to find jobs. There’s a lot of times that you’ll see jobs, or requests for podcast interviews, or just a lot of different things, all come from that open source email list. I would get involved there.

Also, too, another thing that you do, just being new to the area of freelancing, is maybe you look for one of the gig, one of the freelancing places online, and start doing some of those jobs. A lot of times, those jobs, because you’re just starting out, they’re going to be very competitive, and you’re going to maybe be doing it for 10 or 20 bucks, or 30 bucks an hour, for a project, but that helps you build that portfolio. I wouldn’t focus too much on being a part of that, and saying that’s going to be your revenue for freelancing, but I think it’s a great way for you to get started, and start to see how it works for freelancing, and if it’s something that you’re interested in, too.

It’ll help you build that reputation, but I think the open source, long term, is going to really build that reputation for a lot longer than even the gig sites, and some of the other things. Three things for freelancing. Definitely, for that question around Python versus Scala, I would continue to focus on Python for your freelancing gigs, because I think that’s going to be your fastest way to be able to build that specialization. I think the market need’s there, and then also being able to brand yourself as well. That’s today’s episode of Big Data Big Questions.

Make sure you subscribe, so that you never miss an episode. Also, go, if you have any questions for Big Data Big Questions, check it out, thomashenson.com, I’ve got a place out there for you to just fill out. I can answer any questions you have.

Thanks again. See you!

Book Review: Boyd the Fighter Pilot Who Changed the Art of War

November 8, 2017 by Thomas Henson

Why read a fighter pilot book?

Ever heard of the OODA loop?

It’s the basis for agile development. Observe, Orient, Decide, and Act (OODA) is the feedback loop coined by John Boyd. The point of the loop is to go through these steps repeatedly, faster than your foe.

In Agile software development we try to mimic these steps in our iterations of work. First, we find a problem to solve (Observe). Next, we create a user story for how to solve the problem (Orient). Then, we add the user story to our development schedule (Decide). Finally, we develop the solution based on the user story (Act). At the end of this loop the feature/enhancement is pushed to production, where its usefulness is tested, which starts a new feedback loop. The faster we can iterate through these loops, the better our applications will be for our users.

Watch the video below to find out more.

Transcript – How a Fighter Pilot Created DevOps

Hi, Thomas Henson here with thomashenson.com, and today, we’re going to do a book review on Boyd: The Fighter Pilot Who Changed the Art of War.

Stay tuned, right after this.

Hi, welcome back. Today, I’m doing a book review on Boyd: The Fighter Pilot Who Changed the Art of War. You’re probably thinking, “Okay, wait a minute. This is a fighter pilot book. How is this going to relate to technology, big data, you know, software development?” Let me just say, it absolutely applies. You use it every day, and you probably learned about it in any of your business courses or any of your corporate meetings that you’ve been in.

I’m sure somebody’s referenced something in there called the OODA loop. I’ll talk a little bit about that here in just a second. I did want to tell you how I got into the book.

I got into the book following Ryan Holiday’s, I think it was books, maybe 25 or 20 different books that are biographies that everyone must read. It’s been on that list, and I’ve been making my way through that list. I’ve also seen it recommended a couple different other places. I’ve been wanting to do it, kind of been putting it off.

I probably should have gotten to it a little bit faster. The book is about John Boyd. He was a fighter pilot back in, I think he started out in World War II. He didn’t really ever see any combat experience there. The book is really good. It’s really about his career and the way that he solved problems.

The way that he approached solving different problems. One of the first things that he did, and was really known for, is he changed the way that fighter pilots and the way that fighter planes are designed, and how they’re measured as well.

He came up with a formula and a different way of being able to measure velocity, and banking, and some of the other terms that I don’t really understand around fighter pilots and around planes, but just avionics in general.

He came up with a way to change that, and it changed the way that every fighter plane around the world, not just in America, was designed, and how they’re designed still to this day. He totally changed that part.

Another huge part that he changed was, he changed the way that wars are fought. A little bit about war strategy in there, so if you’ve read The Art of War, and some of the other books, he took a lot of that information for a lot of different campaigns throughout the years, researched some of it, and actually came up with a different way to strategize, and to do war fighting.

When you start looking at that, that actually bleeds over into business. He came up with what you’ve probably heard of as the OODA loop. It’s the observe, orient, decide, and act. The OODA loop is what we use in business strategy, but it’s also what we use in agile development.

Jeff Sutherland was a fighter pilot, I think in Vietnam. You can read in his agile book where he actually references the OODA loop in the way that agile development is done, and the way that we solve problems even in software engineering.

You’ve probably seen it, too, in a lot of corporate events, and I think it started becoming part of business curriculum in a lot of colleges around the US, sometime in the early ’80s, too, so all from that strategy, John Boyd was the father of that OODA loop and the strategy there, and doesn’t get a lot of credit for it.

I’m not going to spoil too much of it and tell you why he doesn’t get credit, or how it all evolves, but let me just say, I recommend the book. It’s a really good read. It’s probably something you can do in a week, week and a half, if you’re just reading about 30 to 40 pages a day.

I definitely think that anybody should read it, so there’s also some personality things in there too that you can learn from. It’s not just about learning all the things he did, but you can actually, because it follows his career, and it’s a biography of him, you could follow and learn from some of the mistakes that he made, too.

Really good book. Everybody should probably check it out. At least put it on your list and get to it at some point. Learn more about the OODA loop and some of the other things. Make sure you subscribe. I know this is a book review, but if you have any questions around Big Data Big Questions, you can subscribe here, so that you’re never going to miss an episode of that. If you have any questions, make sure to put it in the comments here or go to my website, thomashenson.com, and look for the Big Data Big Questions, and submit the questions there.

Thanks, and I will see you again next time.

Big Data Beard Podcast Announcement

November 7, 2017 by Thomas Henson

How do you keep up with all the news going on in the Big Data community?
Announcing the Big Data Beard Podcast, a Podcast devoted to Big Data news, architecture, and the software powering the big data ecosystem. Watch the video below to learn how I feel about the new podcast.

Transcript – Big Data Beard Podcast Announcement

Hi, folks! Thomas Henson here with thomashenson.com. Today, I’m in a different location. Looks like a construction site, right?

That’s because changes are coming. I’m building an office right now, and at some point in the future, I’m going to have a video maybe showing that office off.

With all these changes coming, I wanted to announce another big change. That’s a new podcast for you to listen to. If you follow me on Twitter, you’ve probably heard about the Big Data Beard Podcast. Look at all these tweets.

That’s a good way to keep in touch, but the Big Data Beard Podcast just released a couple weeks ago. Big Data Beard Podcast? This is going to be epic!

Beard. Check. Talks about big data, check. The Big Data Beard Podcast is a Podcast with a group of engineers I’ve been working with, and what we decided was, we decided we should take our conversations that we have over coffee, or beers, or just at conferences, start recording those, and maybe have some guests along the way. This is a great way for you to be able to find out what’s going on in big data and data analytics, and also a great way to get more information.

I’m all about learning, and all about finding ways to get more involved in the community, and find out what’s going on. This is a great way, in under an hour, once a week, to be able to be involved, have some information, and then even interact with us.

If you’d like to appear on the show, or you have any ideas for the show, post them in the comments here. Send them on Twitter. However you can get in touch with me, just give us those ideas, and we’ll be sure to field those questions.

Make sure you subscribe and check out the Big Data Beard Podcast. Thanks, folks!

Isilon Quick Tips: Creating Snapshots with Isilon’s OneFS from Command Line

November 6, 2017 by Thomas Henson

How do you manage OneFS snapshots from the CLI?

It’s easy to use the isi snapshot snapshots commands.

We have worked through setting up Isilon’s OneFS Snapshots from the WebUI in multiple Isilon Quick Tips. Let’s turn our focus now to setting up our snapshots from the CLI. Watch the video below and follow along while we use the CLI to create one-time snapshots and snapshot schedules.

Transcript – Creating Snapshots with Isilon’s OneFS from Command Line

Welcome back to another episode of Isilon Quick Tips. Today, we’re going to be talking about snapshots. We’ve covered snapshots in previous episodes, but everything we’ve done has always revolved around that WebUI.

Today, we’re going to go behind the scenes, and see what we can accomplish with snapshots, as far as creating and listing out different snapshots, all from the command line. Get ready to follow along by opening up your command prompt.

Once we’re logged in to the CLI, we can use isi snapshot snapshots list to list out all our snapshots. You can see your ID, name, and path here. What if we want to get some more details on this? We can use isi snapshot snapshots view, and we’re going to pull in that specific ID, so the ID I want to pull is number two, which corresponds to the nasa-snaps. When we run that command, we can see that ID, but we can also see the path, so we know that it’s on the NASA directory under /ifs. We can see when it expires. We can also see the size and some other information, too. Let’s create a one-time snap using the command line.

To do that, what we’re going to do is, we’re going to use our isi snapshot snapshots create command. What we’re going to do with that is, we’re going to pick a path. We’ve already got a snapshot schedule set up for the NASA directory, but what I want to do is, I want to set one up for the videos directory. I’m going to put the absolute path, and so that’s /ifs/videos. Then, we’ll also pass in our name. The name I’m going to use is video-snaps.

That complete, let’s list out our snapshots and see if our one-time snap was taken. Remember, that’s isi snapshot snapshots list, so we take out that S.
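The three commands so far can be written down as a small Python sketch. It only assembles and prints the command lines; nothing is run against a cluster. The helper name `isi_snapshot` is made up for this sketch, the ID 2, the /ifs/videos path, and the video-snaps name come from the walkthrough above, and the `--name` option is how I recall the OneFS 8.0 CLI reference spelling it, so check that against the guide for your OneFS version.

```python
import shlex


def isi_snapshot(*args):
    """Return an `isi snapshot ...` command as an argument list.

    Hypothetical helper for illustration; it does not execute anything.
    """
    return ["isi", "snapshot", *args]


# List all snapshots (ID, name, path).
list_cmd = isi_snapshot("snapshots", "list")

# Show details for the snapshot with ID 2 (the nasa-snaps one in the video).
view_cmd = isi_snapshot("snapshots", "view", "2")

# Take a one-time snapshot of /ifs/videos named video-snaps.
# Assumption: the name is passed with --name.
create_cmd = isi_snapshot("snapshots", "create", "/ifs/videos", "--name", "video-snaps")

for cmd in (list_cmd, view_cmd, create_cmd):
    print(shlex.join(cmd))
```

Keeping each command as an argument list rather than one long string makes it easy to hand to `subprocess.run` later if you decide to script this on a node where the `isi` command is available.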

That was how we take a one-time snap. What happens when we want to set a schedule up for our snapshots? Before we set up that snapshot schedule, I want to reference the CLI guide. In the CLI guide, here, you can see a table with all these different percent signs and letters. I’m going to reference these as we’re creating that snapshot schedule. These are going to be a way for us to be able to name how we want to show the time-date stamp on our snapshot schedules.

Our snapshot schedule we’re going to create is going to be for the /ifs/videos directory, but we want to set a schedule instead of just a one-time snap. We’re going to use isi snapshot schedules create, going to pass in our name, so video-snaps, going to keep that as the name for this one. We’re going to do it on /ifs/videos, that’s our directory.

Now, we’re going to pass in video-%c, and that’s going to give us the year, month, day of the week, hour, minute, and second, for each time the snap is taken. The %c is what I was talking about, use the table that we had just looked at to be able to pass that in. Now, we’re going to select every day, every hour. I want a snap every day, of every hour. The last parameter we’re going to pass in is going to be the duration. That duration is going to be when we want it to expire.

I’m going to let these snaps be okay for a year. They’re going to roll off in a FIFO fashion every year. We can create that schedule, and we want to view it. To view it, we’re going to use isi snapshot schedules list. You can see we have two schedules here. The video snap that we just created, and one we previously had for our NASA snapshots.

Now, let’s view the details. isi snapshot schedules view, and then the ID number, so 3. Now, we can see we have an ID number of 3. That’s our absolute path, and we have that snapshot schedule happening every day, of every hour, and the duration is for one year. We didn’t specify an alias. We can see when it’s going to run next. That’s how you view snapshots from the command line, how you create one-time snaps, and even set up snapshot schedules, all from the command line. Make sure that you subscribe, so that you never miss an episode of Isilon Quick Tips, or more videos on big data and Hadoop.
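Here is a rough Python sketch of the schedule command described above, plus a preview of what a video-%c name might expand to. The function names are made up for this sketch. The argument order (name, path, pattern, schedule), the "Every day every hour" wording, and the `--duration 1Y` form are how I recall the OneFS 8.0 CLI reference, so treat them as assumptions and confirm them in the guide. The preview uses Python's own date formatting as a stand-in for %c; OneFS does the real expansion and its output may differ.

```python
import shlex
from datetime import datetime


def schedule_create_command(name, path, pattern, schedule, duration):
    """Build (but do not run) an `isi snapshot schedules create` command.

    Assumption: positional order is name, path, pattern, schedule,
    with the expiry passed as --duration.
    """
    return [
        "isi", "snapshot", "schedules", "create",
        name, path, pattern, schedule,
        "--duration", duration,
    ]


def preview_snapshot_name(pattern, when):
    """Approximate the snapshot name for a given time.

    %c is replaced by explicit fields (weekday, month, day, time, year)
    so the preview does not depend on platform-specific %c handling.
    """
    return when.strftime(pattern.replace("%c", "%a %b %d %H:%M:%S %Y"))


cmd = schedule_create_command(
    "video-snaps",           # schedule name from the walkthrough
    "/ifs/videos",           # directory to snapshot
    "video-%c",              # naming pattern for each snapshot
    "Every day every hour",  # assumed schedule wording
    "1Y",                    # assumed duration form for one year
)

print(shlex.join(cmd))
print(preview_snapshot_name("video-%c", datetime(2017, 11, 6, 14, 0, 0)))
```

Previewing the name first is a cheap way to notice that a %c style pattern produces names with spaces in them, which may or may not be what you want in your snapshot listings.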

Video Links

Isilon OneFS 8.0 CLI Reference