Data Engineer vs. Data Scientist

December 27, 2017 by Thomas Henson

What’s the Difference Between the Data Engineer & Data Scientist

The Data Scientist has been one of the top trending career choices for the past 3-4 years, but where is the love for the Data Engineer? In reality, I think many people are confused about the roles in Big Data. The titles Data Scientist and Data Engineer are often used interchangeably, but the roles require different skill sets. In this video I will break down the differences between the Data Engineer and the Data Scientist.

Transcript – Data Engineer vs. Data Scientist

Thomas: Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. And so today, we’re going to be tackling the differences between a data scientist and a data engineer. Find out more right after this.

Thomas: So, do you want to become a data engineer, or do you want to become a data scientist? This is something we see a lot: it’s all about the data scientists, and big data, and then data engineering. But what’s the difference between the two roles, and why is it that since 2012 data scientist has been the sexiest career in IT? And there’s been a lot of publication about it. There’s been a lot of information about it. We’re all talking about machine learning, and smart cars, and the battle between Tesla and Apple for machine learning engineers, data scientists, and how they can battle that out. But what about the data engineer, too? And kind of what are the differences? Well, I’ll tell you. Recently, Information Weekly had a survey out there for 2018 on the highest-paying salaries in IT. Data engineer came in at number five. The only other roles that were above the data engineer were all C-level positions. So, think of your CIO, your CSO, your CTO. So, data engineers are getting a lot of love, too. But what are the differences between those two roles? So, we’ll break it down first jumping into what a data scientist does.

So, what does a data scientist do in their day-to-day work? Well, one of the things that they do is they evaluate data. So, that’s kind of a given. But how do they do that? So, they use different algorithms. They look at different results from data. So, say that we’re trying to find out a way for a web application to have more users that are engaged with it. So, how do I create more engaging content for my users and for my customers and give them something of value? Well, a data scientist would be able to take and look at different variables. So, maybe we get in a room, and maybe everyone kind of comes up with some variables and says, “Okay, how can we improve user retention? So, does this piece of content work? We’ve got some testing on these other pieces. Here’s some of our historical data.”

And so the data scientist, what they’ll do is they’ll evaluate all those data points and tell you which ones are going to be the most relevant. And they’ll do that by using algorithms. So, they’ll look, and maybe they’ll use SVD to find out, okay, which variables are going to make the most sense for us to have more engaging content, have a web application that makes users want to stay and engage with it longer. And so that’s kind of where their role is. Now, they’re not going to be the ones that are writing the MapReduce jobs or doing some of the Spark jobs. We really want them just evaluating the data and helping us build different data models that are going to give us the results we’re looking for.

So, if we can just increase our user retention time or increase our engagement of our content, our web application is going to be more popular. So, in our example, that’s what we want. We want our data scientists really evaluating…finding correlations between data and also eliminating correlations. So, this variable that we predicted, that we thought was going to be very key to engaging our users in our web application, it really doesn’t make a difference. And so it gives our developers, and our engineers, and our product and marketing team things for them to look at and say, “Hey, these are the variables that we need to focus on, and this is what’s going to make our web application…give us the desired results we’re looking for and increase that user retention time, increase the engagement for our users in our web application.” So, that’s our data scientist.
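As a rough illustration of the “finding correlations and also eliminating correlations” step, here is a minimal Python sketch. The variable names and numbers are hypothetical and are not from the episode, and Pearson correlation is only one simple way to screen variables; a data scientist on a real project may well use something else.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-user measurements (made up for illustration).
session_minutes = [3.0, 5.0, 8.0, 12.0, 15.0, 20.0]
candidates = {
    "videos_watched": [0, 1, 2, 3, 4, 5],
    "banner_color_id": [1, 2, 1, 2, 1, 2],
}

# Rank candidate variables by the strength of their correlation with retention.
scores = {name: pearson(values, session_minutes) for name, values in candidates.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

for name in ranked:
    print(f"{name}: r = {scores[name]:.2f}")
```

In this made-up data the first variable tracks session length closely and the second barely does, which is the kind of result that tells the team where to focus and which guess to drop.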

Now, on the flip side, what is our data engineer going to do? So, our data engineer, they’re the ones that are going to say, “We’ve got this data here. We’re moving the data maybe into our Hadoop cluster. So, we’re moving it into our Hadoop cluster or wherever we’re storing it for analytics.” And so they’re the ones that are really moving that data there. They’re also writing those MapReduce jobs or Spark jobs. They’re doing the development portion of big data. So, our data scientists are over here saying this is the data that I need. The data engineer is saying, “Oh, we have the data. What kind of format…? How should we clean the data? How fast do you need the data, too?” So, how much speed is a concern for some of these variables and being able to ferret out some of the details, and being able to give…maybe improve that product a little bit faster to get it to the users.

And so that’s where you’re going to see the data engineer. They’re also going to be the ones that are managing and configuring our Hive and HBase deployments and doing some of the technical debt work that we’ve talked about before with making sure that we have a strategy for backup, making sure we have a strategy for high availability. So, this product that we’ve got here for our web application, we want to make sure that we’re still feeding our data in, and our data models are feeding our data back to our data scientists. But then we’re also pushing out those results from what the data scientists have given us, too. So, you kind of see two distinct roles.

So, our data engineer, they’re going to be involved in the tech. They’re going to be the ones that are really building those systems out. Whereas our data scientists, they’re involved with the data. They’re involved with the technology as far as how to use the…what tools are going to help them be able to [INAUDIBLE 00:05:20], use different algorithms, and be able to say, “This data point really makes a difference where this other data point may not be making as much of a difference, and so it’s going to…” They’re going to be using those tools for that. But basically what they’re doing is they’re involved in the data. And you see the data engineers involved in the technology, and implementing, and kind of using that strategy.

So, I’m not saying one is better than the other one, but I may be a little bit biased because I’m a data engineer and like data engineering. But two different skillsets, two very important skillsets, two amazingly great career choices right now in IT, two of the probably highest paying individual contributor roles in IT right now. So, you can’t go wrong either way. If you’re looking for more tips and more information about being a data engineer, make sure you subscribe to this channel and find out more information about data engineers. We explore how to do different things. If you have any questions like this, make sure you submit them. Big Data, big questions, using the #bigdatabigquestions on Twitter, go to my website, thomashenson.com, Big Questions, submit your questions there. Put it in the comments below here. I’ll answer it on YouTube the best I can. Any questions like that, just get in touch with me. I’ll answer them on here. Make sure you subscribe. Thanks again for tuning in, and I’ll see you next time.

Show Notes

Singular Value Decomposition

Big Data Beard Podcast Episode 13: A LESSON IN DATA MONETIZATION FROM THE DEAN OF BIG DATA

Salary for Highest Paying Tech Jobs

Learning Roadmap for Data Engineers?

December 19, 2017 by Thomas Henson

Is there a learning Roadmap for Data Engineers?

Data Engineering is a highly sought-after field for developers and administrators. One factor driving developers into that space is the average salary of 100K – 150K, which is well above average for IT professionals. How does a developer start to transition into the field of Data Engineering? In this video I will give the four pillars that developers and administrators need to follow to develop skills in Data Engineering. Watch the video to learn how to become a better Data Engineer…

Transcript – Learning Roadmap For Data Engineers

Thomas: Hi, Folks. I’m Thomas Henson with thomashenson.com. Welcome back to another episode of Big Data, Big Questions. Today we’re going to talk about some learning challenges for the data engineer. And so we’re actually going to title this a roadmap learning for data engineers. So, find out more right after this.

Big Data Big Questions

Thomas: So, today’s question comes in from YouTube. And so if you want to ask a question, post it here in the comments, have these questions answered. So, most of these questions that I’m answering are questions that I’ve gotten from the field that I’ve met with customers and talked about, or I get over, and over, and over. And then a lot of the questions that I’m answering are coming in from YouTube, from Twitter. You can go to Big Data, Big Questions on my website, thomashenson.com…Big Data, Big Questions, submit your questions there. Any way that you want to use it, use the #bigdatabigquestions on Twitter, and I will pull those out and answer those questions. So, today’s question comes in from YouTube. It’s from Chris. And Chris says, “Hi, Thomas. I hold a master’s degree in computer information systems. My question is, is there any roadmap to learn on this course called data engineer? I have intermediate knowledge of Python and SQL. So, if there’s anything else I need to learn, please reply.”

Well, thanks for your question, Chris. It’s a very common question. It’s something that we’re always wanting to understand: how can I learn more, how can I move up in my field, how can I become a specialist. So, a data engineer is in IT. It’s a sought-after field with high demand, but there’s not really a roadmap for it, so you can see what some people are learning, what other people are saying is a specification. So, you’re asking what I see and what I think are the skills that you need based off your Python and your SQL background. Well, I’m going to break it down into four different categories. I think there are four important things that you need to learn. And there are different ways to learn them. And I’ll talk a little bit about that and give you some resources for that. And all the resources for this will be posted on my blog. So, I’ll have it on thomashenson.com. Look up Roadmap Learning for Data Engineer. And under that video, you’ll see all these links for the resources.

Ingesting Data

The first thing is you need to be able to move data in and out. And so most likely you’re going to want to know how to move data into HDFS. So, you want to know how to move that data in, how to move it out, and how to use the tools. You can use Flume, or just use some of the HDFS commands. You also want to know how to do that maybe from an object perspective. So, if in your workflow, you’re looking to be able to move data from object-based storage and still use that in Hadoop or the Hadoop ecosystem, then you’d want to know that. And then also I mix in a little bit of Kafka there, too. So, understanding Kafka. So, the important point there is being able to move data in and out. So, ingest data into your system.
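To make “just using some of the HDFS commands” a little more concrete, here is a small Python sketch that builds the `hdfs dfs` shell commands for a basic ingest and only runs them when the `hdfs` client is installed. The file name and HDFS directory are hypothetical placeholders, and the helper functions are illustrative only, not part of any Hadoop library.

```python
import shutil
import subprocess


def hdfs_command(*args):
    """Build an `hdfs dfs` command line as a list of arguments."""
    return ["hdfs", "dfs", *args]


def ingest_commands(local_file, hdfs_dir):
    """Commands to create a target directory, upload a file, and list the result."""
    return [
        hdfs_command("-mkdir", "-p", hdfs_dir),            # create the directory if missing
        hdfs_command("-put", "-f", local_file, hdfs_dir),  # copy local file in, overwriting
        hdfs_command("-ls", hdfs_dir),                     # confirm what landed
    ]


def export_command(hdfs_path, local_dir):
    """Command to copy a file back out of HDFS to the local filesystem."""
    return hdfs_command("-get", hdfs_path, local_dir)


def run_all(commands):
    """Run the commands only if the hdfs client is on the PATH."""
    if shutil.which("hdfs") is None:
        return False  # no Hadoop client here; nothing is executed
    for command in commands:
        subprocess.run(command, check=True)
    return True


# Hypothetical paths, for illustration only.
commands = ingest_commands("clickstream.log", "/data/raw/clickstream")
for command in commands:
    print(" ".join(command))
```

Calling `run_all(commands)` on a machine with a configured Hadoop client would execute the same steps you would otherwise type by hand; tools like Flume and Kafka cover the continuous, streaming version of this one-off copy.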

ETL

The next one is to be able to perform ETL. So, being able to transform that data that’s already in place or as it’s coming into your system. Some of the tools there… If you watch any part of my videos, you know that I got my start in Pig, so being able to use Pig, or use MapReduce jobs, or maybe even some Python jobs to be able to do it. Or Spark just to be able to transform that data. So, we want to be able to take some kind of maybe structured data or semi structured data and transform it, being able to move that into a MapReduce job, a Spark job, or transform it maybe with Pig and pull some information out. So, being able to do ETL on the data, which rolls into the next point which is being able to analyze the data.
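As a sketch of what “some Python jobs” doing ETL might look like at the smallest possible scale, the following takes semi-structured log lines, extracts fields, drops malformed rows, and loads the result as CSV text. The log format and field names are assumptions made up for this example; a real job would run the same kind of logic inside Pig, MapReduce, or Spark against data in the cluster.

```python
import csv
import io

# Hypothetical semi-structured input: a timestamp followed by key=value pairs.
RAW_LINES = [
    "2017-12-19T10:00:01 user=alice page=/home ms=120",
    "2017-12-19T10:00:05 user=bob page=/pricing ms=340",
    "not a valid line",
    "2017-12-19T10:00:09 user=alice page=/pricing ms=95",
]

FIELDS = ["timestamp", "user", "page", "ms"]


def extract(line):
    """Parse one raw line into a dict, or return None if it is malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    timestamp, *pairs = parts
    fields = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    if not {"user", "page", "ms"} <= fields.keys():
        return None
    try:
        ms = int(fields["ms"])  # convert the latency to a number
    except ValueError:
        return None
    return {"timestamp": timestamp, "user": fields["user"], "page": fields["page"], "ms": ms}


def transform(lines):
    """Keep only the lines that parse cleanly."""
    return [record for record in map(extract, lines) if record is not None]


def load(records):
    """Write the cleaned records as CSV text (a stand-in for a real sink)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = transform(RAW_LINES)
csv_text = load(records)
print(csv_text)
```

The structured output is what makes the next step, analysis, straightforward, because every row now has the same typed columns.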

Analyze & Visualize

So, being able to analyze the data whether you have that data, you’re transforming it, maybe you’re moving it into a data warehouse in the Hadoop ecosystem. So, maybe you move it into Hive. Or maybe you’re just transforming some of that data, and capturing it, and pulling it into HBase, and then you want to be able to analyze that data maybe with Mahout or MLlib. And so there are a lot of different tutorials out there that you can do, and it’s just kind of getting your feet wet, understanding, “Okay, we’ve got the data. We were able to move the data in, perform some kind of ETL on it, and now we want to analyze that data.” Which brings us to our last point. The last thing that you want to be able to do and be familiar with is be able to visualize the data. And so with visualizing the data, you have some options there. So, you can use Zeppelin or some of the other notebooks out there, or even something custom built… If you’re familiar with front-end development, you can kind of focus in on some of the tools out there for making some really cool charts and really cool different ways to be able to visualize the data that’s coming in.

Four Pillars – Learning Road Map for Data Engineers

So, the four big pillars there, remember, are moving your data – so being able to load data in and out of HDFS and object-based storage, and then also I’d mix a little Kafka in there – performing some kind of ETL on the data, being able to analyze the data, and then being able to visualize the data. In my career, I’ve put more emphasis around the moving data and the ETL portion. But for whatever you’re trying to do… Or your skill base may be different. Maybe you’re going to focus more on the analyzing of the data and the visualization of the data. But those are the four keys that I would look at for a roadmap to becoming a better data engineer or even just getting into data engineering. All that being said, I will say… I did four. Kind of draw a box around those four pillars and say as we’re doing those, make sure we’re understanding how to secure that data, for bonus points. So, as you’re doing it, make sure you’re using security best practices and learning some of those pieces, because as we start implementing and putting these into the enterprise, we want to make sure that we’re securing that data. So, that’s all today for the Big Data, Big Questions. Make sure you subscribe, so you never miss an episode, all this awesome greatness. If you have any questions, make sure you use the #bigdatabigquestions on Twitter. Put it in the comment section here on the YouTube video or go to my blog and see Big Data, Big Questions. And I will answer your questions here. Thanks again, folks.

Show Notes

HDFS Command Line Course

Pig Latin Getting Started

Should Data Engineers Know Machine Learning Algorithms?

November 10, 2017 by Thomas Henson

How involved should Data Engineers be in learning Machine Learning Algorithms?

For the past few years Data Scientist has been one of the hottest jobs in IT. A huge part of what Data Scientists do is selecting Machine Learning Algorithms for projects like SmartHomes and SmartCars. What about the Data Engineer, should they know Machine Learning Algorithms as well? Find out in this episode of Big Data Big Questions.

Transcript – Should Data Engineers Know Machine Learning Algorithms?

Hi, folks. Thomas Henson here, with ThomasHenson.com, and today is another episode of Big Data, Big Questions. And so, today’s question that comes in is, “Should data engineers know machine-learning algorithms?” So, we’re going to tackle that question right after this.

Welcome back. So, today, we’re going to talk a little bit about algorithms, right? So, you know, put your math hat on, and let’s dive into this question today. And so, today’s question, it’s one I get a lot. It’s about the role of a data engineer in machine-learning. And basically, it is… You know, I’ve taken this question from a couple of different sources that I’ve seen, where they’ve asked, you know, “Should data engineers know machine learning algorithms?” And kind of where some of that falls into is, you know, what is the role of the data engineer, and what is the role of the data scientist? And so, really, this question, for me, is really simple. I’m going to go off of my experience and kind of share with you what I’ve done around machine-learning algorithms and how I’ve approached it in my career as a data engineer, software engineer, you know, Hadoop administrator.

There’s a couple of different ways to look at it, but basically, the way that I’ve approached it is I haven’t really learned it. And when I say, “Learned it,” or, “Know it,” I’ve not been in…you know, I’m not going to make a recommendation on it. So, you know, the way I look at it is you should be familiar with them. So, you should be familiar with them, especially familiar with them as far as, like, what’s involved in the package? So, are you using Mahout? You know, what are the algorithms in there, what are the algorithms in your workflow? And then, all the other libraries too.

So, if you’re evaluating other libraries…so, maybe you guys are looking to…you know, maybe you haven’t used Spark and you want to look at the MLlib library that’s there, and you’re kind of going back and forth through those, you want to understand from a very basic, high level, you know, what those algorithms are, and for sure, what algorithms you’re using in your environment, so you can make an educated recommendation saying, “Hey, you know, I think we should move this. Let’s still have the data scientists involved, and have them, you know, look and make sure that the algorithms that we’re going to be using from those packages are going to fall in line with what we’re really using,” because that’s one of the things too, you’ll find that they will differ a little bit, so, you know, what we’re using in Mahout may not be exactly the same, you know, version in, you know, MADlib, or, you know, MLlib.

And so, just be able to understand kind of for sure what’s in your workflow. Be familiar with them too. Another thing that I did…so, like I said, be familiar with them from a high level, but not be making a recommendation, I actually did, you know, picked one, so I would say, you know, be familiar with them, but pick one that you really want to…you know, really want to understand and learn. I picked Singular Value Decomposition, because that’s something that we used a lot in our workflow, and so, I was just kind of…had a natural curiosity for it, and it…you know, it had a really cool story too around it. So, you know, I found some stories around it, you know, it was made really popular with the Netflix Challenge. So, back…Netflix had a challenge for…you know, to, “Beat our data scientists with your algorithm.” And so, SVD was used to, you know, do some of the sorting there, and it was kind of made famous from that perspective, and so, you know, I was familiar with it, but I made sure that I understood one, just for natural curiosity.
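For readers who want a feel for what Singular Value Decomposition does, here is a minimal Python sketch. It is not the Netflix Prize method and not how Mahout, MLlib, or MADlib implement SVD; it uses plain power iteration to approximate only the largest singular value and its vectors, on a tiny ratings matrix whose numbers are made up for illustration.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its left/right vectors
    using power iteration on A^T A (assumes a non-negative, non-zero matrix)."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # starting guess for the right singular vector
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u, then normalize to get the next estimate of v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def rank_one_approximation(matrix):
    """Best rank-1 summary of the matrix: sigma * u * v^T."""
    sigma, u, v = top_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Hypothetical ratings: rows are users, columns are titles.
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]

sigma, u, v = top_singular_triplet(ratings)
approx = rank_one_approximation(ratings)
print("largest singular value:", round(sigma, 3))
for row in approx:
    print([round(x, 2) for x in row])
```

The rank-1 approximation captures the single strongest pattern in the ratings; a full SVD keeps several such patterns, and recommendation approaches built on it use those patterns to estimate ratings a user has not given yet.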

Now, if you are looking to, you know, at some point, make a jump, right, to data scientist, if you’re a data engineer, and at some point down the road, you’d like to be…you know, “I want to be the data scientist. I want to say, ‘Hey, this is the algorithm we should use.’” You know, maybe you just want to be a data scientist because, you know, for a couple years running, it’s been the…you know, the sexiest career, you know, in IT for a while, and so, if that’s kind of your approach, you know, definitely start to know them.

Obviously, learn the ones that are in your environment first, because that’s going to be the easiest, because you’re going to have the access to, you know, why you’re using it, how you’re using it, and you have access to the data scientists too, to kind of, you know, take you under their wing, to some extent, and, you know, show you the ins and outs of why you’re using what you did and, you know, kind of why you didn’t use other ones too. For an aspiring data scientist, then yes, for sure, you want to jump in and, you know, start to understand and start to know them. But for a data engineer, I don’t think you have to learn the algorithms, right? I think you have to be familiar with them, I think, you know, for natural curiosity, you know, maybe learn one or two.

But really, our role is not to recommend and say, “Hey, you know, these are the algorithms I think we should use,” or even, like, to pick packages and say, “Hey, these packages here, we’re going to…you know, we’re going to standardize on that and that’s the only thing we’re going to use.” That’s…you know, that’s not really our role, right?

If you have any questions, make sure you submit them to Big Data, Big Questions. You can do it from the website, go to Twitter, use the hashtag #BigDataBigQuestions, in the comment section there, however you want to get in touch with me and get those questions answered. Also, make sure you subscribe so that you never miss all these Big Data, Big Questions goodness, and so that you can always, you know, learn more. Thanks again, folks!

Show Notes

Mahout

Spark MLlib

MADlib

Big Data Big Questions

Netflix Prize

Singular-Value Decomposition

Python vs. Scala Freelance Data Engineers

November 9, 2017 by Thomas Henson

Which is better for Freelance Data Engineers Scala or Python?

Picking up freelance gigs can be a challenge especially when just starting out. So which language is better for getting freelance gigs Scala or Python? In today’s episode Big Data Big Questions I answer a question about freelance options for Data Engineers. Watch the video below to find out about Scala and Python Freelance options.

Transcript – Python vs. Scala Freelance Data Engineers Video

Hi. I’m Thomas Henson with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question focuses around Python versus Scala for freelancers. Today, we’re going to tackle that question. Find out more right after this.

Today’s question comes in from our Big Data Big Questions series. If you have any questions, go to thomashenson.com Big Data Big Questions. Submit your questions. That’s where today’s is coming from.

Today, our question’s coming in from Gill Taso [Phonetic]. I hope I’m saying your name right. I hope I’m not mispronouncing it. Thanks for following, and thanks for asking a question, Gill.

Today, Gill’s question is, “I saw your video on learning how to become a data engineer, and it was very interesting.” Thank you, Gill. “I subscribe to your Twitter and your YouTube channel.” Well, thanks again. That’s awesome. That’s how you can make sure that you never miss an episode. Currently Gill is a junior data engineer who works mainly with technologies like Impala and Hive, with a little bit of Spark SQL.

Gill said, “I would like to improve my skills in Spark. I have intermediate skills in Python, but as I review the industry, it seems it’s better to learn Scala, which is very tough, at least for me. Do you think it’s better for me to sharpen my skills in Python or transition to Scala now? Also, I’m looking for ways to be a freelance data engineer in the short-term, maybe two to three years, because I’m looking for more flexibility. Do you have any advice to give me for training, skills, and technology?”

Thanks for the question, Gill. That’s actually a two-part question, I feel like, so I feel like there’s a couple different ways that we can look at it. First, the question is, for a junior data engineer that has really good intermediate skills in Python, should you transition more to Scala? The second part of the question is, what should you do for freelancing?

There’s two different ways to look at it. If we look at it in your current career, where the industry’s going, is Python a viable option, or should you start to do Scala? Or, where’s that market going to be at for freelancing? I think those are two different things. What we see for freelancing and then what we see for industry and in your career progression, and especially if you’re in the Spark community. It seems like you’ve hinted that you’ve been working with Spark for a little while, and starting to get into more of the Spark, and that’s why you’re starting to look at Scala.

We’ll break it down real quick. First, my thoughts on Python versus Scala, just in the industry, I think that, with you saying that you have intermediate Python skills, if you’re looking at Spark, and that’s what you’re really wanting to specialize a little bit more in, I don’t think it would hurt you. I think it would be beneficial for you to start going ahead and looking into Scala, and maybe doing some research, there.

I definitely wouldn’t push back, and say, “I don’t need those Python skills anymore.” Both of them have interactive shells. You can write both of those inside of Spark, so you’re covered there. There’s a little more functionality, it seems like, and more documentation around Scala. Part of it is because it’s building on Java, so it keeps a lot of people from having to write those Java jobs versus just using the Scala interpreter (the REPL) or just using the functions there.

From where I see it, and where you want to go, and especially if, you being a junior, saying that you’re a junior, data engineer, if you want to progress, I don’t think there’s a problem with you learning Scala. I just wouldn’t say that you really need to jump to it now.

It’s hard for us, as technologists, as engineers, as developers; we love playing with new toys. Any time there’s a new language or a new toy, a new kind of technology, whether it be some kind of project with Docker containers, or Mesos, or YARN, anytime there’s something new, we always look at that, like, “Oh, man!” That’s a shiny new object we want to play with.

What should we do from our career perspective? I don’t think you have to go all in on Scala. I think that, even with Python, and where we see Python now, and just look at the different open source projects and everything that’s going on with Python. I think you’re set there. One recommendation that I was given early in my career, that I want to pass along to you, is being a junior engineer, saying you have intermediate Python skills, what I would start to do is, I would start to maybe start looking at specialization.

Look at a couple different open source projects, maybe get involved, that revolve around Python. That can help you build your foundation, your portfolio, and yourself as a career. You still have options.

You’ve heard me talk about it before in the data engineer video. Take 30 minutes. Take 30 minutes a day and maybe two to three times a week, mix in some Scala research in there. That way, you’re learning it, and you’re starting to get it, but you’re also still contributing to your Python skills, which, like I said, you’re at the intermediate phase. There’s a way for you to specialize, because Python’s a pretty big language, and so there are different specializations that you can do. I would look at that.

The second part of the question that’s a little bit different, and I’m going to have somewhat of a different opinion here, is you’re asking about options for freelancing. Which one should you focus on if you want to freelance?

The way that I would look at that is, there are three important things that I look at for freelancing. If you’re freelancing, you’re focused on your brand. That’s how you’re going to get freelancing jobs. That’s how you’re going to be able to market yourself as a freelancer and continue to get that revenue there.

That starts with specialization. Like I was talking about a little bit with Python and starting to pick a specialization, I would really focus on it if that’s what I wanted to do. If I wanted to get into freelancing, wanted to be a freelance data engineer, you really want to start focusing and start specializing.

That’s not to say that you’re not going to pick up other skills and continue to learn those, but I think specialization, especially when you’re new to freelancing, is going to be very, very important, even more so important than just in your current career, where you’re working, maybe, at your company. Specialization is important there, but not as important as a freelancer. So, the second thing that’s really big with freelancing is the market. What is the total market need? Just looking based off that, if you were looking and asking that question, “Hey, I want to be a freelancer,” Python or Scala? There’s going to be a lot more jobs that are available right now, a lot more gigs to be able to get, that are going to revolve around Python, and I think that trend’s going to continue. I don’t think that Scala’s going to overtake that market.

Even segmented in the big data area, too, so even if we got outside of what’s going on with the big data community, I think those jobs are still going to continue to be maybe two to three times more for Python developers.

You’re looking at the brand. You’re looking at the market need, but then the third key for freelancing is your reputation. Being new, you’re not going to have a lot of reputation, so what are some things that you can do to be able to get your name out there, and also get involved in some jobs?

It’s going to go back to the open source. I would look at some open source projects, maybe around the areas that you want to specialize in, and just get involved. Getting involved in those is sometimes just as easy as joining the user discussion list, and maybe answering some questions there. Getting on the developer list, and just maybe reading the emails, and see how it’s going.

I’m telling you, just getting on those email lists and seeing what’s going on, and as you start to contribute on there, that’s a great way for you to be able to, one, have code that you can point back to, or that you’ve been a part of the community. Two, that helps your reputation, so it’s going to be easier for you to find jobs. There’s a lot of times that you’ll see jobs, or requests for podcast interviews, or just a lot of different things, all come from that open source email list. I would get involved there.

Also, too, another thing that you do, just being new to the area of freelancing, is maybe you look for one of the gig, one of the freelancing places online, and start doing some of those jobs. A lot of times, those jobs, because you’re just starting out, they’re going to be very competitive, and you’re going to maybe be doing it for 10 or 20 bucks, or 30 bucks an hour, for a project, but that helps you build that portfolio. I wouldn’t focus too much on being a part of that, and saying that’s going to be your revenue for freelancing, but I think it’s a great way for you to get started, and start to see how it works for freelancing, and if it’s something that you’re interested in, too.

It’ll help you build that reputation, but I think the open source, long term, is going to really build that reputation for a lot longer than even the gig sites, and some of the other things. Three things for freelancing. Definitely, for that question around Python versus Scala, I would continue to focus on Python for your freelancing gigs, because I think that’s going to be your fastest way to be able to build that specialization. I think the market need’s there, and then also being able to brand yourself as well. That’s today’s episode of Big Data Big Questions.

Make sure you subscribe, so that you never miss an episode. Also, go, if you have any questions for Big Data Big Questions, check it out, thomashenson.com, I’ve got a place out there for you to just fill out. I can answer any questions you have.

Thanks again. See you!