Data Engineers Archives - Thomas Henson

O’Reilly AI Conference London 2019

October 9, 2019 by Thomas Henson

The Big Data Big Questions show is heading to London for the O’Reilly AI Conference, October 15 – 17, 2019. I’m excited to be a part of the O’Reilly AI Conference series. In fact, this will be my third O’Reilly AI Conference in the past year. Let’s look back at those events and forward to London.

San Jose & New York

Late night packing my conference gear for my trip to O’Reilly AI Conference this week. Most important items: 1️⃣ Stickers 2️⃣ 🎧 3️⃣ 💻 4️⃣ Bandages? (I’ll explain later) 5️⃣ 📚 (this week it’s my Neural Networking) What’s your list of must-have gear for tech conferences? #programming #coding #AI #conference #techconference

A post shared by Thomas Henson (@thomas_henson) on Sep 5, 2018 at 5:09am PDT

First, in 2018 I attended the San Jose conference, where I spent a good portion of the time in the Dell EMC booth talking with Data Engineers and Data Scientists. One of the major themes I heard from data professionals was that they were attending to learn how to incorporate TensorFlow into their workflows. In my opinion, TensorFlow was talked about in every aspect of the conference. We had a blast learning from attendees and discussing how to scale Deep Learning workloads. Also, this was my first time attending a conference with 14 stitches in my left hand (trouble on the pull-up bar)!

Next was O’Reilly AI New York. This conference will forever be known in my head as the Sofia the Robot trip. During this conference I worked with Sofia the Robot not only at the conference but also at a Dell EMC event at Times Square Studios (part of the Dell Technologies Magic of AI Series). Before the Magic of AI event, Sofia and I spent the day recording with O’Reilly TV about the current state of AI and what’s driving its widespread adoption. After a day of recording, I gave a keynote on day two of the O’Reilly AI Conference, where I discussed how AI is already impacting future generations. Then there was a whirlwind of activity as Sofia the Robot took questions at the Dell Technologies booth. The last thing of the day was the Magic of AI event at Times Square Studios, where we had 100 people taking part in a question-and-answer session with Sofia the Robot.

Keynote O’Reilly AI Conference New York

Coffee with Sofia the Robot

https://youtu.be/KbBvdoUOpmY

On To London

Next up is O’Reilly AI London. To say I’m excited is an understatement. This trip will include many firsts for me.

To begin with, it’s my first international conference and my first time in London. So many things to see and so little time to see them. Feel free to suggest places to visit in the comment section below.

Second, at O’Reilly AI London I will give my first breakout session at an O’Reilly conference. While I’ve been on O’Reilly TV and given a keynote, I’ve yet to have a breakout session. My session is titled AI Growing Pains: Platform Considerations for Moving from POC to Large-Scale Deployments. The world is changing to innovate and incorporate Artificial Intelligence in many applications and services. However, with all this excitement, many Data Engineers are still struggling with how to get projects past the Proof-of-Concept (POC) phase and into production. Production environments present a list of challenges. The 3 biggest challenges I see when moving from POC to production are the following:

The gravity of data is just as real as gravity in the physical world. As Deep Learning workloads continue to grow, so does the amount of data stored to train these models. That data has gravity that attracts services and applications to it. The trouble here is making sure you have the correct data pipeline strategy in place.
Once I had dinner with one of the co-founders of Hortonworks, during which he said, “Everything at scale is exponentially harder.” Have you ever moved photos around on your desktop? For the most part this is an easy task, except when you accidentally move a large set of photos. Right after moving those large folders you are endlessly waiting for the hourglass to finish. Imagine doing this with 10 PB of data. I think you get the picture here.
The talent pool today is much larger than in the early days of “Big Data”. However, the demand for skills in Deep Learning, Machine Learning, and Data Engineering is stressing the system, which still leaves a skills gap for experienced engineers with Deep Learning and Machine Learning skills. The skills gap is one huge factor in why many projects get stuck in the POC phase instead of moving into production.
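To put a rough number on the "move 10 PB" thought experiment above, here is a minimal back-of-the-envelope sketch. The 10 Gbit/s link speed and the 10 GB photo folder are purely illustrative assumptions (they are not figures from my session), and the function name `transfer_seconds` is made up for this example.

```python
# Back-of-the-envelope estimate of how long a bulk data move takes.
# All sizes and link speeds here are illustrative assumptions, not measurements.

PB = 10 ** 15    # bytes in a petabyte (decimal)
GB = 10 ** 9     # bytes in a gigabyte (decimal)
GBIT = 10 ** 9   # bits per second in 1 Gbit/s


def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Seconds needed to move size_bytes over a link.

    efficiency is the fraction of the nominal link speed actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (size_bytes * 8) / (link_bits_per_second * efficiency)


if __name__ == "__main__":
    link = 10 * GBIT  # assumed 10 Gbit/s link
    photos = transfer_seconds(10 * GB, link)
    training_set = transfer_seconds(10 * PB, link)
    print(f"10 GB of photos: {photos:.0f} seconds")
    print(f"10 PB of training data: {training_set / 86_400:.1f} days")
```

Under these assumptions the photo folder moves in about 8 seconds, while the 10 PB data set needs roughly 92.6 days of sustained transfer, which is why the pipeline strategy has to be planned around where the data already lives.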

If you would like to know more about moving projects from POC to production, make sure to check out my session if you are attending the O’Reilly AI Conference in London: AI Growing Pains: Platform Considerations for Moving from POC to Large-Scale Deployments @ 11:55 on October 16, 2019.

Want More Data Engineering Tips?

Sign up for my newsletter to never miss a post or YouTube episode of Big Data Big Questions, where I answer questions from the community about Data Engineering.

Why Data Engineers Should Blog

July 16, 2019 by Thomas Henson

Blogging For Data Engineers?

How important is it for Data Engineers to have a blog? In this episode of Big Data Big Questions I talk about the importance of building a blog for your career in Data Engineering, Data Analysis, or Data Science. Learn my thoughts on why every Data Engineer should have a blog in the video below.

Transcript – Why Data Engineers Should Blog

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of…

Big Data Big Questions. Today’s question, I thought I would take a topic that I’ve seen and keeps coming up in some of my videos, and really dig down into it. Maybe this is going to be a multi-part series, but we’re going to talk about starting a blog to build your brand as a data engineer, data scientist, or if you’re watching this and you’re just a technologist or somebody that just wants to do book reviews, trust me, there’s going to be some topics in here that are generalized for everybody, but it really shows you how to key in on your field.

Before we jump into that, though, I want to say, if you have any questions, put them in the comment section here below. This is where I find content to make sure I’m interacting with the community and answering the questions that you want. It also gives me an idea. Hey, there’s enough people that ask a question or interested in a certain topic, and I haven’t done any research on it, gives me an opportunity to study and see what’s going on. This is all about being a community here. Reach out to me on thomashenson.com/big-questions if you don’t want to put it in the comment section here below. I’ll do my best to answer those quick as I can.

Today, I want to talk about why you should start a blog as a data engineer, or data scientist, or if you’re a web developer, and you’re watching this, or anything. I think it’s very important. In 2019, should you start a blog? I think so. I don’t think it’s something that is going away. Just because I say start a blog, you don’t have to start a blog and just write. You can start a vlog. I think you definitely should have your own domain. I bought thomashenson.com. It cost me, I think, $12 a month. No, $12 a year, but it’s like, hosting and everything like that can be really, really cheap. I wouldn’t worry about that. It’s really important. I’m going to talk a little bit first about my journey and why I started a blog.

When I got my first job, like I said, I’ve talked about it before, I was a web developer. One of the things where I was working at, we weren’t really embracing. We were using open source, but we weren’t really contributing, and it was shunned upon or shied upon for us to actually have any code to be able to show or anything like that. One of the things, I didn’t really think about it at the time, but you get a couple years into your role, and you might get opportunities to interview at other places, to do other things, and one of the things that came up that was a real hole when I was going through the interview process was, I didn’t have any example code or anything like that I could show. I wasn’t involved in the open source community outside of work, and I didn’t have my code. It was my company’s property, and there were some other pretty big reasons I couldn’t, I didn’t have anything I could point to and show.

That got me thinking. I don’t have anything that really captures the work and some of the things that I do. Then, at this time, too, I’d already embraced trying to do at least 30 minutes a day, or maybe even four times a week getting 30 minutes in of learning new things. I had all these ideas and all these things that I was going through and learning in the process, but I could only talk about them on a whiteboard or from a resume perspective, but I didn’t really, couldn’t really show. Couldn’t let it stand on its own.

That’s where I started really looking into blogging. I was like, “Man, maybe I should start a blog.” Start a blog, didn’t really know what I was going to do with it. If you go back and look at some of my early posts, it was like, “You know, I’m doing this, and I’m starting a business!” It really wasn’t a business, it was just me writing. As I started writing, I started talking about some of the things I’ve learned. I would go through and look, and be able to create articles around something I’ve learned, maybe even create some test projects.

A lot of that, they weren’t very good when I started. It can be an opinion thing if they’re good now, but I definitely know that I’ve improved, and I feel like that, but I think it’s something that really helped me and really focused me, too. Like I said, I was a web developer. You’ve all heard my story before, about when I became a data engineer, and jumped into the Hadoop area. I had that platform, and I had already been practicing doing some of the blogging and stuff like that. It was really easy for me, as I was going through, and learning, and learning things that other people wanted to see, to be able to start writing Pig Latin tutorials, Hive, and what I’m doing with HBase and HDFS, and just general tips of things that I learned. It was like, strengthening that muscle, and it really helped me just accelerate just in being a part of the community as well, too. That’s my journey. That’s one of the main reasons that I’m so big on it, is because I came from that area, where I didn’t have anything that I could point to and say, “Hey, look.” These are all the cool things that I’m doing.

That’s why I started a blog, but why do I think that you should? What should your story be? Your story, you’re still writing it. You should write it on a blog. I really think it’s something that’ll help you build out your brand, and I think it’s always something good that shows, one, you’re interactive in the community. It keeps you honest and keeps you motivated, too. It’s late at night. I didn’t really want to have to record any videos. I wanted to put it off. I have an audience. I have a schedule, and I try to keep content coming out. This made me come out to the office, and make sure that I got on camera, and was able to create content here, too.

The same thing with your blog. If you create a blog, say you create a schedule, and you’re like, hey. I mean, I’ve done this before. I’m going to publish once every month. When I was first starting out, you feel horrible when you don’t. I missed quite a few months. It took me a long time before I published every month. I just really wasn’t consistent. It’ll keep you honest about learning. It’ll keep you honest about creating content and being a part of that community, too. I really think that it’s good at any stage in your career, but especially if you’re watching this channel, and you’re trying to figure out, “Where do I get started? What are some things that I should be doing?” You’ve probably heard me say it a ton of times. Start creating something to be a part of the community. I’m not saying go out and… We’ll have a longer session about how to start blogging and how to find, how to create your own content. I’m not saying go out and borrow people’s content or anything like that and put it as your own. There’s a definite way that you can do a lot of different things.

I’m going to end this video this time, but maybe this is, we’ll just call this part one. I definitely think we should dig into how to start that blog, some content ideas, but I think today just kick around the idea, just think about it, start churning, start kicking those ideas around in your head, and then we’ll talk, and follow up later on with some content ideas. I’ll show you how to set up on, I think, I used DreamHost, but there’s a ton of other places out there. It’s something simple that you can set up in 10 minutes, and if you’re using [Inaudible 00:07:18] you can start publishing some of your own content, having your own audience, heck, you can put it in the comment section here below, to build, and we can use our audience to help everybody push their content out there. We can all support each other as well, too.

That’s all I have for today. Like I said, I’m going to follow up. I really like this idea, here. If you have some comments, or you think it’s a bad idea to start a blog in 2019, which I don’t think it is, but I’d love to hear your opinion. All opinions are welcome, so, thanks again, and I will see you next time on Big Data Big Questions.

[Music]

Speaking Skills For Data Engineers

July 15, 2019 by Thomas Henson

How Important Is Public Speaking For Data Engineers?

A brand new question on Big Data Big Questions is about public speaking in Data Engineering. I’ve often heard that public speaking is the number 1 fear for most people, and many people choose to avoid it for various reasons. While nowhere will you see public speaking called out in Data Engineering job descriptions, I believe it’s a skill that’s worth investing in. Find out my thoughts on Speaking Skills for Data Engineers in the video below.

Transcript – Speaking Skills For Data Engineers

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in from a user. If you have a question, find me on Twitter or put it in the comment section here below. Send me an email. There’s a ton of different ways to get in touch and have your question answered on the next episode of Big Data Big Questions.

Today’s comes in from Bobby, and it says, “Can you let me know which career path is better between data scientist or data engineers, which we’ve talked about, but this one is for a person suffering with anxiety or difficulty giving presentations?” So, thank you for your question, Bobby, and I totally understand where you’re coming from as far as having challenges that you’re trying to deal with. Trying to pick out a career path, like, we want to play to things that we’re going to be successful at and things that we’re going to be able to excel in. You’re looking for that career path. I’ll say just right off the bat, a couple things stuck out to me about it. I’m going to get to those as we talk about why I think presentations and stepping outside your comfort level are some options for you. Let’s answer your question first, before we dive into Thomas’s thoughts on some of that.

Depending on which way you want to go, it’s not going to matter. It’s going to be more about if you’re more technical as far as wanting to code, and be hands-on, and build out clusters. Maybe starting to play with Kubernetes, Linux, those types of systems. Then, being on the data engineer side, it’s going to be a good way to go, or if you’re more math-based and want to get into the specifics of, hey, some of these features or some of these pieces of data may be able to give us better insight into what we’re trying to solve, then the data science path is going to be there. Don’t let your anxiety or your difficulty giving presentations say that, “I must go data engineer,” or must go data science, because I think they’re both equal to give you the opportunity to not have to present and not have to have as much interaction as you would maybe in a different role where it’s more customer-facing and job-driven.

My thought process about how much you’re going to have to deal with in that situation is, I’ve worked with people who never had to present. When we were in that role, that just wasn’t their thing. They may be at the meetings. They’ll be at the meetings, but they’re not the point person. Maybe get a one-off question or something like that, but most of that’s in the confines of their team. You’re still going to have team interaction, but there’s still a ton of downtime where it’s like, “Hey, headphones on,” just banging out your own code, or doing your deployments, and stuff like that. There’s not a ton of interaction there. You may have some user interaction that you’re working through, depending on where you are in the stage of your project, but for the most part, I don’t think even outside of the questions here, most of your customer interactions, a lot of times, maybe not so much on the data science side, but it’s going to be nothing like you would think from a web development perspective or front end developer. Still engaging with the users, but more on the team atmosphere. Feel free to choose any of those paths to be able to deal with your anxiety and difficulty giving those presentations. I think you’ll be totally fine, and I think you can get away with never having to give a presentation, if that’s in your vote.

But, I think you should. I think you should try to work towards conquering those difficulties and those presentations, and I’m not saying that you start off going out, and being like, “You know what? I’m going to try to go to a conference and give a keynote.” I’m going to try to go to a conference and give a breakout session. That’s not what I’m saying. I think you should start a little bit smaller, and just on your team, and then if you find a new feature or new software tool, or just a new process that you like doing, present that to your team. I know it’s tough, and I know it’s hard, because they even did a study a while back about the number one fear most people said that they fear public speaking more than they fear death.

Let me say that again. They feared death less than they feared public speaking. Most people would rather die than do public speaking. Definitely it’s something that I’ve been working on for quite a few years, and I’ll be honest, I get nervous each time. I get nervous, start talking to people. I’m like, “Oh, I’m about to go on.” It doesn’t matter. It doesn’t matter to the fact that, maybe I’ve given a certain presentation 25 times.

Heck, every time I turn the camera on, and there’s nobody in this room, here, on Big Data Big Questions, I still get nervous, too. There’s going to be some amount of nervousness, and I understand that, there’s varying levels, too. I’m not looking over and saying that, “Hey, you know, everybody, you know, everybody can be able to do that.” I do think that you can work towards it, and so maybe everybody’s not going to be able to do it on the same is maybe what I mean to say. I think it’s something you should try to, because presenting is going to open up doors for your career. It’s going to make you feel good, too. Each time I talked about how nervous I was, I just spoke in front of over 1,000 people for the first time in my life. That was huge, but I didn’t start out that way all in one day. I’ll tell you, I was super nervous, and it was just for a short amount of time, but I was nervous the whole time leading up to it, and then afterwards, after you get it, it’s like, yes! You get that amazing feeling that you’ve done something. I don’t know if you’re into sports or something like that, but you feel like you’ve won. Even though, who knows, it’s the first time speaking to that many people. I’ll probably hopefully have that opportunity again, and I’ll be better at it next time. It probably wasn’t my best time, if you’re looking at it.

It’s something that you start to work towards. It’ll be interesting, how much networking, and how many doors are open by doing that, and it’s all about giving back to the community as well. To recap, I don’t think that you have to choose data science or choose data engineer to be able to not have to present and do some of the other things. However, I think most people, and if you’re watching this channel, and you’re really curious about career development, I do think that everybody should have some kind of presentation skills, and this is something they should practice towards, and I totally understand. There’s a lot of anxiety whenever you’re doing something like that. If it’s something that you can work towards, and you can conquer, then I think it’s going to be something that’s going to be amazing. One, for the community, because we need more voices. And then two, it’s going to be something that you’re going to be proud of, and you’re going to be able to work on, and it’s just another challenge, too.

That’s all I have today for Big Data Big Questions. Make sure that you hit the subscribe and ring that bell, so you never miss another episode of [Whispers] Big Data Big Questions.

Data Engineers: Python VS. C#

June 18, 2019 by Thomas Henson

Which Is Better Python Or C#?

Getting into wars over different programming languages is a no-no in the world of programming. However, I recently had a question on Big Data Big Questions about which is better for Data Engineers, Python or C#. So, in the spirit of examining the difference through the lens of Data Engineering, I decided to weigh in.

Python has long been used in Data Analytics for building Machine Learning models. C# is an object-oriented programming language developed by Microsoft and used widely in all ranges of applications. Both have a ton of community support and a large user base, but which one is better? In this episode of Big Data Big Questions I break down both Python and C# for Data Engineers. Make sure to watch the video to find out my thoughts on which is better in Data Engineering.

Transcript – Data Engineers: Python VS. C#

Hi folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question, we’re going to do a comparison between Python and C#. It’s a question that I’ve had coming in, and it’s also something that’s a passion of mine, because I used to be a C# developer back in the day. Then, I’ve currently, I guess the last four or five, maybe six years, I’ve learned Python. I thought it would be good to go through some of that, especially if you’re just starting out. Maybe you’re in high school, or maybe you’re in college, or maybe you’re even looking to make a jump into data engineering or machine learning engineering, and you’re like, “Hey, man, there’s C# out here. There’s Python.” What are some of the differences? What should I learn? Find out right after this.

Today’s episode of Big Data Big Questions, I wanted to do some of the differences between Python and C#. First thing, we’ll start off with C#. C#, heavily developed by Microsoft. I think it was released in 2000. It’s an object-oriented programming language. You see it a lot. I used it, for instance, back when I was doing ASP.NET. There’s a lot of things that you can do, use it for. It relies on the .NET framework. You have to have the .NET framework to be able to go. They are in version 7.0. Primarily used, I used it a lot for web application development, but you can do a lot of different things with it, build out really complex and awesome applications, whether it be a desktop application, whether it be web, mobile, they’ve just got so much of a community that, there’s a lot of different things that you can do with it.

Another thing, too, one of the comparisons to it is, it looks just like Java. Another reason I gravitated to it, because one of my first languages I learned, I think I learned VB first, but I did a lot of stuff around Java, and actually when I graduated out of college, I thought I was going to be a Java developer for a long time. Really got engrained in that community there. Fast forward to being a web developer, and transitioning to C#, it was a really natural process for me. Like I said, heavy community, heavy packages, and frameworks, and things to be able to use. See it a lot with Microsoft. If you’re doing C#, you’re probably using Visual Studio or I think it’s VS Code. They’ve got a couple different IDEs for development and everything like that. See it a lot there.

Python. If you’ve been following this channel, you’ve probably seen a ton of videos that we’ve done around Python. Python was developed in 1991. It’s in version three. We talked about C# being in version 7. Python’s in version 3. I wouldn’t put a lot into that, because we talked about C# being in 2000 and Python’s been around since 1991. Heavily involved, both of them. It’s object-oriented just like that, just like C#. Also, you see it a lot used in, for sure, data analytics, but there’s a lot of different other frameworks that you can use to do web development. Pretty much, you can do anything you want with Python. You do have to install Python and have that running in your version. Sometimes that can be a little bit clunky, especially maybe in a Windows environment, but it’s something that you can download and start playing with, and have going on your machine. Man, probably in less than five minutes. Maybe I should do a video on that, but you can go ahead, download that, and be up and running, and start running your own code.

Huge community support. There’s a ton of things out there for it. Like I said, talked about, I think even in our book review, we talked about some of the books for data engineers. I think there were two to three Python books that I had showed there, too. Heavy use there. Like I said, a lot of involvement from data analytics, whether it be data scientists or machine learning engineers, and just like with TensorFlow, or PyTorch, a lot of the deep learning frameworks that we’ve talked about on this channel have Python APIs.

The question is, you’re a data engineer just starting out, which one should you learn? I’m going to go through three different questions, where we’re going to talk about what you should learn, and which one is better? I hate doing which one is better, because each one is a different tool, and some tools are better at other things, or have more functionality to do certain tasks. Let’s jump right into it.

Which one is easier to learn? Err! I’m having to put myself in there, because I’m biased as far as C# and just having been a part of that community. Like I said, my first language being in VB, which was similar, and then a ton of work in Java. C# on the premise looked a little easier, but the way I’m going to do this criteria is which one do I think is easier to get up and get started from a data engineering perspective or data analytics perspective. I’m going to have to give it to Python. Like I said, can be a little clunky when you’re first installing it, but if you were just able to open up a Linux, build out a Linux machine, you can do, especially if you’re on Red Hat, you can do yum install python, and then you can start scripting away on some of the code there. Then, also, too, I’ll give it to Python just from the perspective of a lot of things from a data analytics perspective.

Number two, I’m a data engineer. I’m a machine learning engineer. Which one should I learn today? Which one would I start off with if I had to choose, only choose between Python and C#? I would probably go with Python, right? Go ahead and learn Python. I would encourage anybody watching this channel, jump into that community. There’s a ton of books out there. We’ve talked about on this channel where you can go, and learn how to do data analytics from that perspective. Python’s going to get the win there.

Which one do I enjoy coding in more? Personal preference, man, I think C# will always have that win for me. Like I said, this is a data engineering channel, but like I said. I started off as a web developer. I really like Visual Studio, and I know there’s some plugins you can do with VS Code. You can use that as your IDE for Python and everything like that. There’s something about C# and that language that I really found comfortable and probably will always have a special place in my heart. Like I said, just coming from a Java perspective and everything like that, I’ll give that the win.

The overall win, the overall win between the three categories, if you’re a data engineer, a machine learning engineer, you have to start somewhere, I’d say start with Python. Go through some of the tutorials. Got some on this channel. I’ve got some on my blog, but get started there. I hope you enjoyed this. Tell me what you think. Did I miss something on the differences? Would you have chosen C# as something to start off with? Do you like Python better than C#, versus like I said, C# has a special place in my heart, let me know in the comments section below, or if you have any questions you want me to answer on the show, put them in here, and then make sure you subscribe and ring that bell, so you never miss an episode of Big Data Big Questions.
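For anyone wondering what "start scripting away" might look like once Python is installed, here is a minimal sketch of a first data-engineering-style script. It was not shown in the episode; the sample data, column names, and the helper name `average_by_key` are all hypothetical, and it only uses the Python standard library.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a real export or log file.
SAMPLE_CSV = """sensor,reading
a,10.0
b,4.5
a,12.0
b,5.5
"""


def average_by_key(rows, key, value):
    """Average the numeric `value` column for each distinct `key` column entry."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


if __name__ == "__main__":
    # csv.DictReader yields one dict per data row, keyed by the header line.
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    print(average_by_key(reader, "sensor", "reading"))
```

Saved as a file and run with the Python interpreter, this reads the rows, groups them by sensor, and prints the average reading per sensor. Swapping the in-memory sample for `open("some_file.csv")` would be the natural next step.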

Spark vs. Hadoop 2019

June 12, 2019 by Thomas Henson

Spark vs. Hadoop 2019

In 2019, which skill is in more demand for Data Engineers: Spark or Hadoop? For career or aspiring Data Engineers, it makes sense to keep up with what skills are in demand in the market. Today Spark is hot and Hadoop seems to be on its way out, but how true is that?

Hadoop, born out of the web search era and part of the open source community since 2006, has defined Big Data. However, Spark’s release into the open source Big Data community, boasting 100x faster processing for Big Data, created a lot of confusion about which tool is better and how each one works. Find out what Data Engineers should be focusing on in this episode of Big Data Big Questions: Spark vs. Hadoop 2019.

Transcript – Spark vs. Hadoop 2019

Hi folks. Thomas Henson here with thomashenson.com, and today is another episode of Wish That Chair Spun Faster. …Big Data Big Questions!

Today’s question comes in from some of the things that I’ve been seeing in my live sessions, so some of the chats, and then also comments that have been posted on some of the videos that we have out there. If you have any comments or have any ideas for the show, make sure you put them in the comments section here below or just let me know, and I’ll try my best to answer these. This question comes in around, should I learn Spark, or should I learn Hadoop in 2019? What’s your opinion?

A lot of people are just starting out, and they’re like, “Hey, where do I start?” I’ve heard Spark, I’ve heard Hadoop’s dead. What do we do here? How do you tackle it?

If you’ve been watching this show for a long time, you’ve probably seen me answer questions similar to this and compare the differences between Spark and Hadoop. This is still a viable question, because I’ve actually changed a little bit the way I think about it, and I’m going to take a different approach with the way that I answer this question, especially for 2019.

In the past I’ve said that it really just depends on what you want to do. Should you learn Spark? Should you learn Hadoop? Why can’t you learn both? Which, I still think, from the perspective of your overall learning technology and career, you’re probably going to want to learn both of them. If we’re talking about, hey, I’ve only got 30 days, 60 days. “I want the quickest results possible, Thomas.” How can I move into a data engineer role, find a career? Maybe I just graduated college, or maybe I’m in high school, and I want to get an internship that maybe turns into a full-time gig. Help me, in the next 30 to 90 days, get something going.

Instead of saying depends, I’m really going to tell you that I think it’s going to be Spark. That’s a little bit of a change, and I’ll talk about some of the reasons why I think that change, too. Before we jump into that, let’s talk a little bit about some of the nomenclature that we have around Hadoop. A lot of times when we talk about Hadoop, we’re talking about Hadoop, and MapReduce, and HDFS in this whole piece. From the perspective of writing MapReduce jobs or processing our data, Spark is far and away the leader in that. Even MapReduce is being decoupled, has been decoupled, and more and more jobs are not written in MapReduce. They’re more written with Flink, or Spark, or Apache Beam, or even [Inaudible 00:02:32] on the back-end. That war has been won by Spark for the most part. Secondly, when we talk about Hadoop, I like to talk about it from an ecosystem perspective. We’re talking about HDFS, we’re talking about even Spark included in that, and Flume, all the different pieces that make up what we call the big data ecosystem. We just call that the Hadoop ecosystem.
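For readers new to the processing model mentioned above, here is a minimal sketch of the map, shuffle, and reduce phases as a word count. It is plain Python rather than the Hadoop or Spark API, and the function names and sample lines are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a MapReduce mapper."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, like a MapReduce reducer."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input lines, standing in for files stored in HDFS.
lines = ["spark and hadoop", "learn spark first"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

Engines such as Spark express the same idea with higher-level operations and keep intermediate data in memory where they can, which is a large part of the speed difference people talk about.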

The way that I’m answering this question today is, hey, I’m looking for something in 2019 that could really move the needle. What do you see that’s in demand? I see Spark is very, very much in demand, and I even see Spark being used outside of just HDFS as well. That’s not saying that if you’ve learned Hadoop or you’ve learned HDFS you’ve gone down the wrong path. I don’t think that’s the case, and I think that’s still viable. You’re asking me, what can you do to move the needle in 30 to 90 days? Digging down and becoming a Spark developer, that opens up a career option. That’s one of the quickest ways that you can get there, and it’s one of the big things we’ve seen out there with the roles for data engineers.

Another huge advantage, we’ve talked about it a little bit on this channel, is the big announcement about where Databricks is going from the perspective of investment and what their valuation is. They’re a $2.5 billion advancement, and they’re huge in the Spark community. They’re part of the incubators and on a lot of steering committees for Spark. They have some tools and everything that they sell on top of that, but it’s just really opened my eyes to what’s out there. I knew Spark was pretty big, but with Databricks and where they’re going, I think that’s a lot of what we’re seeing.

Another point, too, you’ve heard me talk about it a good bit, is where we’re going with deep learning frameworks and bringing them into the core big data area. Spark is going to be that big bridge, I believe. People love to develop in Spark. Spark’s been out there. It gives you the opportunity now, with Project Hydrogen and some of the other things that are coming, to be able to take and do ETL over GPUs, but also import data and be able to implement and use TensorFlow or PyTorch, or even Caffe2.

If you’re looking in 2019 to choose between Spark and Hadoop to find something in the next 30 to 90 days, I would go all in with Spark. I would learn Spark, whether it be from Java, Scala, or Python, and start doing some tutorials around that, being able to code, being able to build out your own projects. I think that’s going to really open your eyes, and that can really get the needle moving. At some point, you want to go back and learn how to navigate data with HDFS, and how to find the things that are going on in the Hadoop ecosystem, because it’s all a big piece here. But if you’re asking me the one thing to do to move the needle in 30 to 90 days, learn Spark.

Thanks again. That’s all I have today for Big Data Big Questions. Remember, subscribe and ring that bell, so you never miss an episode of Big Data Big Questions. If you have any questions, put them in the comment section here below. We’ll answer them on Big Data Big Questions.

Certifications Required For Hadoop Administrators?

June 11, 2019 by Thomas Henson 1 Comment

Hadoop Certifications

Data Engineers looking to grow their careers are constantly learning and adding new skills. What kind of impact do Hadoop Certifications have during the hiring process?

Data Engineers, Developers, and IT in general are known for their abundance of certifications. Everyone has an opinion as well about how much those certifications mean to real skills. On this episode of Big Data Big Questions find out what my thoughts are for Hadoop Admin Certifications and if Enterprises are requiring those for Data Engineers.

Transcript – Certifications Required For Hadoop Administrators

It’s the Big Data Big Questions show! Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of… Come on, I just said it. Big Data Big Questions. Today’s question comes in from a user here on YouTube. If you have a question, make sure you put it in the comments section here below or reach out to me on thomashenson.com/big-questions. I’ll do my best to answer your questions here, live on one of our shows, or in one of our live YouTube sessions that we’re starting to do on Saturday mornings. Throw those questions in there. Let me know how we’re doing with this channel, and also if you have any questions around data engineering or machine learning. I’ll even take some of those data science questions, as well. Today’s question comes in around certifications in the Hadoop ecosystem. Are certifications required for Hadoop administrators/Hadoop developers? Absolutely, positively not. They’re not required, right?

Now, there may be some places where they’ll require you to. I did see that back in my day, in software engineering, but in general, they’re not going to require you to have that before gaining entry. Now, they might be nice to have. Especially if you’re talking about going into an established team or an established group within your organization that says, hey! We’re on the Hortonworks stack, and we like to have everybody up to par from a certification perspective.

I haven’t seen that a lot specifically in the data engineering field, but it is something I’ve seen over the years in software engineering, but just not as much here lately. Now, does that mean that I’m saying that you shouldn’t go get a certification? That’s not what I’m saying at all. Especially if you’re learning and trying to get into the Hadoop ecosystem, and you’re like, hey, where do I really start?

First, you start with Big Data Big Questions.

Really, you can use the certifications. Whether it be from Azure, AWS, Cloudera, Hortonworks, or Google GCP, Google’s Cloud Platform, you can take any of their certifications and really see, and build out your own learning path. That’s an opportunity there. Even if you’re not going to go down the path of trying to get that certification, if you’re trying to gain information and learn some of the things that you need to know as a good data engineer, whether it be on the developer side or on the administrative side, that’s definitely where I would start.

When we look at the question, it poses more of a philosophical question, if we will, in the data engineering and IT world, meaning how do we feel about IT certifications? I’ve answered this question. I had myself and Aaron Banks [Phonetic 00:02:31]. We were talking about specifically around IT certifications, and are they worth it, and we have a full-length video where we really dip into it. I’ll give you a little bit of a preview of my thought process around it.

The way that I look at certifications is, if you’re looking to be able to prove yourself, especially if you’re outside of a field, then hey, getting a certification might benefit you and make you more desirable when getting your application and your brand in there. Having a certification does lend some credence in those situations. However, if you’re established in the role, and you’ve been doing data engineering, and you have a lot of experience in it, you’re not necessarily going to need that certification. You’ve been proven. You’ve done the due diligence of being in that role, and you’re applying for a role as a data engineer. You don’t necessarily have to go through that certification process.

Like I said, I really think that certifications are really good. Whenever we’re talking about, hey, maybe I don’t have that experience in that role, and I want to prove to you that, hey, I’m coming from, maybe you’re a web developer like I was. You’re a web developer, and you’re like, “Man. I’d really like to get into this,” Hadoop and this data engineering side of things. Where can I start, or how can I really identify myself to be somebody who wants to take on that next role? That’s where a certification is really going to help. You can get that certification. You can walk through it, but, you’re not going to walk in though, and say, “Hey, I’ve got the certification,” and you, Mr. Data Engineer, Miss Data Engineer, that’s been in that role for six years, “I know more than you do, because I have the certification.” That’s not really the case, and that’s probably not what you want to do, especially if you’re new to an organization.

Be honest, and be gentle in your interview process if you have a certification, but you don’t have the experience, and just say, “Hey, you know, I’m really passionate about it. I’ve been following Big Data Big Questions for some time, and I thought that it’d be good to get into the data engineering field. To show how serious I am, I actually went through and got, walked through some of the certification process in there, too.” It’s just an opportunity there for you to stand out from the crowd and show your experience, when you don’t really have experience. Let me know how you feel about this question and this answer here. Put it in the comment section below. Love to hear feedback. Also, if you have any questions, make sure you put them in the comments section here below, and never, never forget to subscribe and ring that bell, so that you’ll never miss an episode. Thanks again, and I’ll see you on the next episode of Big Data Big Questions.

Data Engineer in 2019

May 31, 2019 by Thomas Henson Leave a Comment

What’s a Data Engineer Career Like in 2019?

Times change and keeping up with maintaining skills while managing day to day projects can be exhausting.

Should I learn Hive or Tensorflow?

Which is better Flink or Spark?

How as a Data Engineer will I focus on Containers?

Questions like these come up all the time when I’m speaking with aspiring and career-focused Data Engineers. Find out my thoughts around skills and career outlook for Data Engineers in 2019 on this episode of Big Data Big Questions.

Transcript – Data Engineer in 2019

Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in around, what does data engineering in 2019 look like? What are some of the trends? What are some of the things that are going on? This question came in from the comments section here on YouTube, so if you have a question, make sure you put it in the comments section here below or reach out to me on thomashenson.com. I’ll answer them in an orderly fashion as they come in, provided I have the time. I’ve been getting a ton of questions, so I really appreciate it. Thank you for this community here.

Today’s question, before we jump into it, I want to give you my three top trends to watch for in 2019. Before we do, I did want to give credit for an article that they did on their 10 trends in big data. I talked about them on my YouTube live session, so if you’re ever around Saturday mornings, jump on. Throw me a question in the chat. Let’s get to cracking. I try to answer as many questions as I can there, and try to do that Saturday mornings. Jump in there.

The 10 here, you can check in the comments section here below, where I have the link to the [Inaudible 00:01:13] trends. I’m going to read some of them real quick. The first one they said for the top 10 trends in 2019: data management and [Inaudible 00:01:22]. They’re talking a little bit about ETL and how ETL’s not going away. I’ve said that for a while, but we did read an article not too long ago that’s saying, “Hey, you know, there’s some tools out there that are really going to make ETL kind of a thing of the past.” We’ll see. Hopefully, right?

I’m not for ETL, I just, man. Started out there, and it seemed like I was never going to get out of it. Number two, data siloes continue to proliferate. This goes into what we saw when Hadoop emerged as this huge, big data lake, where the data’s only going to exist there. We’ve been talking about it, especially on this channel, over the past few years where, hey, data has a lot of gravity to it. There’s going to be data out on your edge. There’s going to be data in the cloud. There’s going to be data still in core data centers.

The idea of a fluid data lake is a little bit more consolidated. You still have those main areas, but you still have to do analytics in place in some areas. Number three, streaming analytics has a breakout year. I’ve talked about streaming analytics on this channel for the last couple of years. I actually did a session about the future architectures of streaming analytics at the 2017, was it Hadoop Summit? They call it DataWorks, now.

Data governance builds steam, talked about some of that here. Soft skills start to emerge as tech evolves. Just talking about the soft skills of understanding the business, talked about that with the book, the Big Data MBA, here. Deep learning gets a little bit deeper. Hm. Have we talked about deep learning on this channel? Special K expands footprint. They’re talking about Kubernetes and what’s going on with Dockerization. Clouds are hard to ignore. New tech will emerge, talking about how Silicon Valley and a lot of open source, and closed source, tools have emerged, and they don’t see that stopping anytime soon. Then, smart things everywhere. I’ve talked about those a good bit here, too.

Without further ado, let’s jump into my three trends to watch for in 2019. The first one, deep learning and Hadoop. How are these ecosystems going to interact with each other? A lot of projects out there have talked about it over the last year: Project Hydrogen, Submarine, another project, and NVIDIA’s RAPIDS. It’s all about being able to use GPUs and also be able to use those deep learning libraries with data that’s in your Hadoop ecosystem or just for some ETL. That’s one of the things that NVIDIA RAPIDS… Maybe I should do a video just specifically on that. Watch that trend. Start watching what’s going on with TensorFlow and being able to use it integrated in with Spark and some of your other tools that are more traditional in the Hadoop ecosystem. That was number one.

Number two. Two? Yep. Number two, containerization of the world overtakes data engineering. Similar to what they were talking about in [Inaudible 00:04:11], with their trends, with Special K being special. I think the containerization, we’ve seen it a lot, a lot of announcements here lately with cloud native applications and cloud native experiences on the Cloudera side, and you even saw in Hadoop 3.0 where they were laying the groundwork to be able to containerize your YARN scheduling engine and some of the other components there. We’re going to continue to see that, and that’s one skill that you’re going to be looking for. If you’re in data engineering right now, and you want to know what’s coming down the pipe for you, I would look into doing some things and getting more familiar with containerization. That’s actually in my roadmap for the end of the year, to understand a little more around Docker, and Kubernetes, and that whole ecosystem. That is a big trend we will see for data engineering. It’s not going to slow down. It’s been picking up steam a lot here lately, but it’s going to go full force.

My third trend, the thing that I’m looking for, for data engineers in 2019, is streaming analytics. I was doing some research and looking around some IDC numbers around where we’re going from a data perspective. One of the interesting tidbits that they were talking about is how streaming analytics will take up around 30% of all the analytics and things that are going on in Azure. Think about all these different devices bringing in data here by 2025. 30% of that’s going to have to be streaming analytics. That’s a huge number. There’s a number of tools out there that are helping to try to deal with what’s going on from a streaming analytics perspective.

We’ve got , we’ve got Kafka. On the cloud side we’ve got Kinesis. A lot of different tools. We had [Inaudible 00:05:42] on this channel here, but there’s a lot of tools in place, a lot of tools being created, because streaming analytics is a huge beast of data to handle. It’s a different kind of problem than what we’ve seen, and it’s only going to get worse as we start bringing in more data, more devices. Really cool opportunities for you as data engineers. Outside of my goals for 2019, if you’re looking for some things to jump into and some educational paths for yourself as a data engineer in 2019, I would look into those three trends. Deep learning, containerization, and then streaming analytics. That’s all I have for today. Make sure to subscribe and ring that bell so that you never miss an episode of Big Data Big Questions. Throw a comment in the comments section here below if you have any questions. If you like the video, if you hated it, just let me know how you feel about this, and I will see you next time on the next episode of Big Data Big Questions.
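To show why streaming is a different kind of problem than batch, here is a minimal sketch of a tumbling-window count, one of the basic building blocks of streaming analytics. It is plain Python and does not use the Kafka or Kinesis APIs; the event layout and the device names are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns {window_start_seconds: Counter({key: count})}.
    """
    windows = defaultdict(Counter)
    for timestamp, key in events:
        # Align each timestamp to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return dict(windows)


# Hypothetical device readings: (seconds since start, device id).
events = [
    (1, "sensor-a"),
    (4, "sensor-b"),
    (9, "sensor-a"),
    (12, "sensor-a"),
    (19, "sensor-b"),
]
counts = tumbling_window_counts(events, window_seconds=10)
for start in sorted(counts):
    print(start, dict(counts[start]))
```

A real streaming engine also has to deal with events that arrive late or out of order and with state that does not fit on one machine, which this sketch leaves out.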

Data Engineering Courses Without Java

February 11, 2018 by Thomas Henson Leave a Comment

There are a ton of resources out there for Data Engineers to learn to develop/administrate Hadoop/Big Data Ecosystem. Do all these courses require basic Java Knowledge? When you are starting out learning to become a data engineer it can seem everywhere you look basic Java skills are required, but I’m telling you that is not the case. In this video I’ll explore the options and thoughts on learning big data without Java skills.

Video – Data Engineer Courses without Java

Landing a Data Engineer Internship

February 10, 2018 by Thomas Henson Leave a Comment

What are the skills required to land a Big Data Engineer Internship? Is it possible to gain those skills in under 3 months?

Today’s question comes in from YouTube: “Please make a video on how much time is required to become big data engineer ?? is it possible in 3 months and also try to mention about minimum skills required for internship.”

In this episode of Big Data Big Questions we will explore what the basic skills are for landing a Data Engineer Internship and tips for excelling after landing that internship. Watch the video below to find out what skills Data Engineers should focus on before applying for a Data Engineer Internship.

Video – Data Engineer Internship

Talking Heron Real-Time Analytics with Streamlio

January 31, 2018 by Thomas Henson Leave a Comment

To say Streaming Analytics is popular is an understatement. Right now Streaming Engineering is a top skill Data Engineers must understand. There are a lot of options and development stacks when it comes to analyzing data in a streaming architecture. Today I sat down with Lewis Kaneshiro (CEO & Co-founder) and Karthik Ramasamy (Co-founder) of Streamlio to get their thoughts on Streaming Analytics and Data Engineering careers.

Streamlio Opensource Stack

Streamlio is a full stack streaming solution that handles the messaging, processing, and stream storage in real-time applications. The Streamlio development stack is built primarily from Heron, Pulsar, and BookKeeper. Let’s discuss each of these opensource projects.

Heron

Heron is a real-time processing engine used/incubated by Twitter. Currently Heron is going through the transition of moving into the Apache Software Foundation (learn more about this in the interview). Heron is at the heart of real-time analytics by processing data before the time value expires.

Pulsar

Pulsar is an Apache incubated project for distributed publish-subscribe messaging in real-time architectures. The origin of Pulsar is similar to that of many opensource big data projects in that it was used first by Yahoo.

BookKeeper

BookKeeper is the scalable, fault-tolerant, and low-latency storage service used in many development stacks. BookKeeper is under the Apache Software Foundation and popular in many opensource streaming architectures.

Interview Questions

Have we as a community accepted Hadoop related tools to be virtualized or containerized?
How do Data Engineers get started with Streamlio?
What are the biggest real-time Analytics use cases?
Is the Internet Of Things (IoT) the primary driver behind the explosion in Streaming Analytics?
What skills should new Data Engineers focus on to be amazing Data Engineers?

Check out the interview to learn answers to these and more questions.

Video – Streamlio Interview

Links from Streamlio Interview

Learning Roadmap for Data Engineers?

December 19, 2017 by Thomas Henson Leave a Comment

Is there a learning Roadmap for Data Engineers?

Data Engineering is a highly sought-after field for Developers and Administrators. One factor driving developers into that space is the average salary of 100K – 150K, which is well above average for IT professionals. How does a developer start to transition into the field of Data Engineering? In this video I will give the four pillars that developers/administrators need to follow to develop skills in Data Engineering. Watch the video to learn how to become a better Data Engineer…

Transcript – Learning Roadmap For Data Engineers

Thomas: Hi, Folks. I’m Thomas Henson with thomashenson.com. Welcome back to another episode of Big Data, Big Questions. Today we’re going to talk about some learning challenges for the data engineer. And so we’re actually going to title this a roadmap learning for data engineers. So, find out more right after this.

Big Data Big Questions

Thomas: So, today’s question comes in from YouTube. And so if you want to ask a question, post it here in the comments to have these questions answered. So, most of these questions that I’m answering are questions that I’ve gotten from the field, that I’ve met with customers and talked about, or that I get over, and over, and over. And then a lot of the questions that I’m answering are coming in from YouTube, from Twitter. You can go to Big Data, Big Questions on my website, thomashenson.com…Big Data, Big Questions, submit your questions there. Any way that you want to use it, use the #bigdatabigquestions on Twitter, and I will pull those out and answer those questions. So, today’s question comes in from YouTube. It’s from Chris. And Chris says, “Hi, Thomas. I hold a master’s degree in computer information systems. My question is, is there any roadmap to learn on this course called data engineer? I have intermediate knowledge of Python and SQL. So, if there’s anything else I need to learn, please reply.”

Well, thanks for your question, Chris. It’s a very common question. It’s something that we’re always wanting to understand: how can I learn more, how can I move up in my field, how can I become a specialist. So, a data engineer is in IT. It’s a sought-after field with high demand, but there’s not really a roadmap for it, so you can see what some people are learning, what other people are saying is a specification. So, you’re asking what I see and what I think are the skills that you need based off your Python and your SQL background. Well, I’m going to break it down into four different categories. I think there’s four important things that you need to learn. And there’s different ways to learn them. And I’ll talk a little bit about that and give you some resources for that. And all the resources for this will be posted on my blog. So, I’ll have it on thomashenson.com. Look up Roadmap Learning for Data Engineer. And under that video, you’ll see all these links for the resources.

Ingesting Data

The first thing is you need to be able to move data in and out. And so most likely you’re going to want to know how to move data into HDFS. So, you want to know how to move that data in, how to move it out, and how to use the tools. You can use Flume, or just use some of the HDFS commands. You also want to know how to do that maybe from an object perspective. So, if in your workflow, you’re looking to be able to move data from object-based storage and still use that in Hadoop or the Hadoop ecosystem, then you’d want to know that. And then also I mix in a little bit of Kafka there, too. So, understanding Kafka. So, the important point there is being able to move data in and out. So, ingest data into your system.
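As a small illustration of moving data in and out with the HDFS commands, the sketch below builds the `hdfs dfs -put` and `hdfs dfs -get` command lines in Python. It only prints them, so it runs without a cluster; the helper names and the file paths are hypothetical and chosen for this example.

```python
def hdfs_put_command(local_path, hdfs_path):
    """Build the argument list that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def hdfs_get_command(hdfs_path, local_path):
    """Build the argument list that copies a file from HDFS to local disk."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


# Hypothetical paths, used only to show the shape of the commands.
put_cmd = hdfs_put_command("events.csv", "/data/raw/events.csv")
get_cmd = hdfs_get_command("/data/raw/events.csv", "events_copy.csv")

print(" ".join(put_cmd))
print(" ".join(get_cmd))
```

On a machine with a Hadoop client installed, passing one of these lists to `subprocess.run(put_cmd, check=True)` would execute the copy.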

ETL

The next one is to be able to perform ETL. So, being able to transform that data that’s already in place or as it’s coming into your system. Some of the tools there… If you watch any part of my videos, you know that I got my start in Pig, so being able to use Pig, or use MapReduce jobs, or maybe even some Python jobs to be able to do it. Or Spark just to be able to transform that data. So, we want to be able to take some kind of maybe structured data or semi structured data and transform it, being able to move that into a MapReduce job, a Spark job, or transform it maybe with Pig and pull some information out. So, being able to do ETL on the data, which rolls into the next point which is being able to analyze the data.
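The extract, transform, load flow described above can be sketched in a few lines of plain Python. This is only an illustration of the shape of an ETL job, not a Pig, MapReduce, or Spark program; the record layout (day, service, status, latency) and the sample lines are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical semi-structured input: day,service,status,latency_ms
RAW_LINES = [
    "2019-05-01,web,200,120",
    "2019-05-01,web,500,340",
    "not a valid record",
    "2019-05-01,api,200,80",
    "2019-05-02,web,200,100",
]


def extract(lines):
    """Parse comma-separated lines, skipping ones that do not fit the layout."""
    for line in lines:
        parts = line.split(",")
        if len(parts) != 4:
            continue
        day, service, status, latency_ms = parts
        if not (status.isdigit() and latency_ms.isdigit()):
            continue
        yield {
            "day": day,
            "service": service,
            "status": int(status),
            "latency_ms": int(latency_ms),
        }


def transform(records):
    """Keep successful requests and average their latency per (day, service)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [latency sum, request count]
    for record in records:
        if record["status"] != 200:
            continue
        key = (record["day"], record["service"])
        totals[key][0] += record["latency_ms"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def load(summary):
    """Render the summary as sorted CSV lines, standing in for a table write."""
    return [
        f"{day},{service},{average:.1f}"
        for (day, service), average in sorted(summary.items())
    ]


output = load(transform(extract(RAW_LINES)))
for row in output:
    print(row)
```

The same three steps carry over when the job is rewritten for Pig or Spark: only the engine that runs them changes.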

Analyze & Visualize

So, being able to analyze the data whether you have that data, you’re transforming it, maybe you’re moving it into a data warehouse in the Hadoop ecosystem. So, maybe you move it into Hive. Or maybe you’re just transforming some of that data, and capturing it, and pulling it into HBase, and then you want to be able to analyze that data maybe with Mahout or MLlib. And so there’s a lot of different tutorials out there that you can do, and it’s just kind of getting your feet wet, understanding, “Okay, we’ve got the data. We were able to move the data in, perform some kind of ETL on it, and now we want to analyze that data.” Which brings us to our last point. The last thing that you want to be able to do and be familiar with is be able to visualize the data. And so with visualizing the data, you have some options there. So, you can use Zeppelin or some of the other notebooks out there, or even some custom built… If you’re familiar with front end development, you can kind of focus in on some of the tools out there for making some really cool charts and really cool different ways to be able to visualize the data that’s coming in.

Four Pillars – Learning Road Map for Data Engineers

So, the four big pillars there, remember, are moving your data – so being able to load data in and out of HDFS and object-based storage, and then also I’d mix a little Kafka in there – performing some kind of ETL on the data, being able to analyze the data, and then being able to visualize the data. In my career, I’ve put more emphasis around the moving data and the ETL portion. But for whatever you’re trying to do… Or your skill base may be different. Maybe you’re going to focus more on the analyzing of the data and the visualization of the data. But those are the four keys that I would look at for a roadmap to becoming a better data engineer or even just getting into data engineering. All that being said, I will say… I did four. Kind of draw a box around those four pillars and say, as we’re doing those, make sure we’re understanding how to secure that data for bonus points. So, as you’re doing it, make sure you’re using security best practices and learning some of those pieces, because as we start implementing these and putting them into the enterprise, we want to make sure that we’re securing that data. So, that’s all today for the Big Data, Big Questions. Make sure you subscribe, so you never miss an episode, all this awesome greatness. If you have any questions, make sure you use the #bigdatabigquestions on Twitter. Put it in the comment section here on the YouTube video or go to my blog and see Big Data, Big Questions. And I will answer your questions here. Thanks again, folks.

Show Notes

HDFS Command Line Course

Pig Latin Getting Started

Should Data Engineers Know Machine Learning Algorithms?

November 10, 2017 by Thomas Henson Leave a Comment

How involved should Data Engineers be in learning Machine Learning Algorithms?

For the past few years Data Scientist has been one of the hottest jobs in IT. A huge part of what Data Scientists do is selecting Machine Learning Algorithms for projects like SmartHomes and SmartCars. What about the Data Engineer, should they know Machine Learning Algorithms as well? Find out in this episode of Big Data Big Questions.

Transcript – Should Data Engineers Know Machine Learning Algorithms?

Hi, folks. Thomas Henson here, with ThomasHenson.com, and today is another episode of Big Data, Big Questions. And so, today’s question that comes in is, “Should data engineers know machine-learning algorithms?” So, we’re going to tackle that question right after this.

Welcome back. So, today, we’re going to talk a little bit about algorithms, right? So, you know, put your math hat on, and let’s dive into this question today. And so, today’s question, it’s one I get a lot. It’s about the role of a data engineer in machine-learning. And basically, it is… You know, I’ve taken this question from a couple of different sources that I’ve seen, where they’ve asked, you know, “Should data engineers know machine learning algorithms?” And kind of where some of that falls into is, you know, what is the role of the data engineer, and what is the role of the data scientist? And so, really, this question, for me, is really simple. I’m going to go off of my experience and kind of share with you what I’ve done around machine-learning algorithms and how I’ve approached it in my career as a data engineer, software engineer, you know, Hadoop administrator.

There’s a couple of different ways to look at it, but basically, the way that I’ve approached it is I haven’t really learned it. And when I say, “Learned it,” or, “Know it,” I’ve not been in…you know, I’m not going to make a recommendation on it. So, you know, the way I look at it is you should be familiar with them. So, you should be familiar with them, especially familiar with them as far as, like, what’s involved in the package? So, are you using Mahout? You know, what are the algorithms in there, what are the algorithms in your workflow? And then, all the other libraries too.

So, if you’re evaluating other libraries…so, maybe you guys are looking to…you know, maybe you haven’t used Spark and you want to look at the ML library that’s there, and you’re kind of going back and forth through those, you want to understand from a basic very high-level, you know, what those algorithms are, and for sure, what algorithms you’re using in your environment, so you can make an educated recommendation saying, “Hey, you know, I think we should move this. Let’s still have the data scientists involved, and have them, you know, look and make sure that the algorithm that we’re going to be using from those packages are going to fall in line with what we’re really using,” because that’s one of the things too, you’ll find that they will differentiate a little bit, so, you know, what we’re using in Mahout may not be exactly the same, you know, version in, you know, MADlib, or, you know, the ML library.

And so, just be able to understand kind of for sure what’s in your workflow. Be familiar with them too. Another thing that I did…so, like I said, be familiar with them from a high level, but not be making a recommendation, I actually did, you know, picked one, so I would say, you know, be familiar with them, but pick one that you really want to…you know, really want to understand and learn. I picked Singular Value Decomposition, because that’s something that we used a lot in our workflow, and so, I was just kind of…had a natural curiosity for it, and it…you know, it had a really cool story too around it. So, you know, I found some stories around it, you know, it was made really popular with the Netflix Challenge. So, back…Netflix had a challenge for…you know, to, “Beat our data scientists with your algorithm.” And so, SVD was used to, you know, do some of the sorting there, and it was kind of made famous from that perspective, and so, you know, I was familiar with it, but I made sure that I understood one, just for natural curiosity.
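For anyone who wants a feel for what Singular Value Decomposition does, here is a minimal sketch that finds only the largest singular value and its two vectors using power iteration, then builds the best rank-1 approximation of a small ratings matrix. It is plain Python written for illustration, not the algorithm used in the Netflix challenge or in Mahout, and the ratings are made up.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Swap the rows and columns of a matrix."""
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    """Scale a vector to unit length and also return its original length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm


def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its left and right vectors.

    Runs power iteration on (A^T A). Assumes the matrix is not all zeros and
    that the all-ones start vector is not orthogonal to the answer.
    """
    matrix_t = transpose(matrix)
    v, _ = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        v, _ = normalize(mat_vec(matrix_t, mat_vec(matrix, v)))
    u, sigma = normalize(mat_vec(matrix, v))
    return u, sigma, v


# Hypothetical ratings: rows are users, columns are movies.
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]
u, sigma, v = top_singular_triplet(ratings)

# The best rank-1 approximation is sigma * u * v^T.
rank_one = [[sigma * ui * vj for vj in v] for ui in u]
for row in rank_one:
    print([round(x, 2) for x in row])
```

Keeping a few of the largest singular values rather than just one is the idea behind the low-rank models used in recommendation systems.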

Now, if you are looking to, you know, at some point, make a jump, right, to data scientist, if you’re a data engineer, and at some point down the road, you’d like to be…you know, “I want to be the data scientist. I want to say, ‘Hey, this is the algorithm we should use.’” You know, maybe you just want to be a data scientist because, you know, for a couple years running, it’s been the…you know, the sexiest career, you know, in IT for a while, and so, if that’s kind of your approach, you know, definitely start to know them.

Obviously, learn the ones that are in your environment first, because that’s going to be the easiest. You’re going to have access to why you’re using it and how you’re using it, and you have access to the data scientists too, to kind of take you under their wing, to some extent, and show you the ins and outs of why you’re using what you’re using and why you didn’t use other ones too. For an aspiring data scientist, then yes, for sure, you want to jump in and start to understand and start to know them. But for a data engineer, I don’t think you have to learn the algorithms, right? I think you have to be familiar with them, and I think, for natural curiosity, maybe learn one or two.

But really, our role is not to recommend and say, “Hey, you know, these are the algorithms I think we should use,” or even, like, to pick packages and say, “Hey, these packages here, we’re going to…you know, we’re going to standardize on that and that’s the only thing we’re going to use.” That’s…you know, that’s not really our role, right?

If you have any questions, make sure you submit them to Big Data, Big Questions. You can do it from the website, go to Twitter and use the hashtag #BigDataBigQuestions, or put them in the comment section here, however you want to get in touch with me and get those questions answered. Also, make sure you subscribe so that you never miss all this Big Data, Big Questions goodness, and so that you can always learn more. Thanks again, folks!

Show Notes

Mahout

Spark MLlib

MADlib

Big Data Big Questions

Netflix Prize

Singular-Value Decomposition