Learning Roadmap for Data Engineers?

Is there a learning Roadmap for Data Engineers?

Data Engineering is a highly sought-after field for developers and administrators. One factor driving developers into that space is the average salary of 100K – 150K, which is well above average for IT professionals. How does a developer start to transition into the field of Data Engineering? In this video I will give the four pillars that developers/administrators need to follow to develop skills in Data Engineering. Watch the video to learn how to become a better Data Engineer…

Transcript – Learning Roadmap For Data Engineers

Thomas: Hi, folks. I’m Thomas Henson with thomashenson.com. Welcome back to another episode of Big Data, Big Questions. Today we’re going to talk about some learning challenges for the data engineer. And so we’re actually going to title this a learning roadmap for data engineers. So, find out more right after this.

Big Data Big Questions

Thomas: So, today’s question comes in from YouTube. And so if you want to ask a question, post it here in the comments and have your questions answered. So, most of these questions that I’m answering are questions that I’ve gotten from the field, that I’ve met with customers and talked about, or that I get over, and over, and over. And then a lot of the questions that I’m answering are coming in from YouTube, from Twitter. You can go to Big Data, Big Questions on my website, thomashenson.com…Big Data, Big Questions, submit your questions there. Any way that you want to use it, use the #bigdatabigquestions on Twitter, and I will pull those out and answer those questions. So, today’s question comes in from YouTube. It’s from Chris. And Chris says, “Hi, Thomas. I hold a master’s degree in computer information systems. My question is, is there any roadmap to learn on this course called data engineer? I have intermediate knowledge of Python and SQL. So, if there’s anything else I need to learn, please reply.”

Well, thanks for your question, Chris. It’s a very common question. It’s something that we’re always wanting to understand: how can I learn more, how can I move up in my field, how can I become a specialist. So, a data engineer is in IT. It’s a sought-after field with high demand, but there’s not really a roadmap for these, so you can see what some people are learning, what other people are saying is a specification. So, you’re asking what I see and what I think are the skills that you need based off your Python and your SQL background. Well, I’m going to break it down into four different categories. I think there’s four important things that you need to learn. And there’s different ways to learn them. And I’ll talk a little bit about that and give you some resources for that. And all the resources for this will be posted on my blog. So, I’ll have it on thomashenson.com. Look up Roadmap Learning for Data Engineer. And under that video, you’ll see all these links for the resources.

Ingesting Data

The first thing is you need to be able to move data in and out. And so most likely you’re going to want to know how to move data into HDFS. So, you want to know how to move that data in, how to move it out, and how to use the tools. You can use Flume, or just use some of the HDFS commands. You also want to know how to do that maybe from an object perspective. So, if in your workflow, you’re looking to be able to move data from object-based storage and still use that in Hadoop or the Hadoop ecosystem, then you’d want to know that. And then also I mix in a little bit of Kafka there, too. So, understanding Kafka. So, the important point there is being able to move data in and out. So, ingest data into your system.
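As a rough illustration of the "HDFS commands" route mentioned above, here is a minimal Python sketch that wraps the standard `hdfs dfs -put` and `hdfs dfs -get` commands. The helper names (`hdfs_put`, `hdfs_get`, `run`) and the file paths are hypothetical and chosen only for this example; it assumes the `hdfs` client is on your PATH, and it defaults to a dry run so nothing executes unless you ask for it.

```python
import shlex
import subprocess


def hdfs_put(local_path, hdfs_path):
    """Build the command that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def hdfs_get(hdfs_path, local_path):
    """Build the command that copies a file out of HDFS to local disk."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        # Assumes a working Hadoop client; raises if the command fails.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Paths below are made up for the example.
    run(hdfs_put("events.csv", "/data/raw/events.csv"))
    run(hdfs_get("/data/raw/events.csv", "events_copy.csv"))
```

Running it as-is only prints the two commands. Passing `dry_run=False` on a machine with a configured Hadoop client would actually move the file in and back out.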

ETL

The next one is to be able to perform ETL. So, being able to transform that data that’s already in place or as it’s coming into your system. Some of the tools there… If you watch any part of my videos, you know that I got my start in Pig, so being able to use Pig, or use MapReduce jobs, or maybe even some Python jobs to be able to do it. Or Spark, just to be able to transform that data. So, we want to be able to take some kind of maybe structured data or semi-structured data and transform it, being able to move that into a MapReduce job, a Spark job, or transform it maybe with Pig and pull some information out. So, being able to do ETL on the data, which rolls into the next point, which is being able to analyze the data.
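To make the ETL idea concrete, here is a small plain-Python sketch of the "Python jobs" option: it takes semi-structured lines, cleans them up, and pulls a simple count out. It is not Pig, MapReduce, or Spark code; the sample lines, field layout, and function names are all made up for illustration, and the same extract/transform/load shape would simply be expressed in whichever of those tools you pick.

```python
from collections import Counter

# Made-up sample input: date, event type, user name.
RAW_LINES = [
    "2017-05-01,login,alice",
    "2017-05-01,purchase,bob",
    "bad line",
    "2017-05-02,login,ALICE ",
]


def extract(lines):
    """Split each line into fields, skipping malformed rows."""
    for line in lines:
        fields = line.split(",")
        if len(fields) == 3:
            yield fields


def transform(rows):
    """Normalize user names so 'ALICE ' and 'alice' match."""
    for date, event, user in rows:
        yield {"date": date, "event": event, "user": user.strip().lower()}


def load(records):
    """Count events per user (a stand-in for writing to a real table)."""
    counts = Counter(record["user"] for record in records)
    return dict(counts)


result = load(transform(extract(RAW_LINES)))
print(result)
```

The malformed row is dropped during extract, and the two spellings of the same user are merged during transform before anything is counted.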

Analyze & Visualize

So, being able to analyze the data whether you have that data, you’re transforming it, maybe you’re moving it into a data warehouse in the Hadoop ecosystem. So, maybe you move it into Hive. Or maybe you’re just transforming some of that data, and capturing it, and pulling it into HBase, and then you want to be able to analyze that data maybe with Mahout or MLlib. And so there’s a lot of different tutorials out there that you can do, and it’s just kind of getting your feet wet, understanding, “Okay, we’ve got the data. We were able to move the data in, perform some kind of ETL on it, and now we want to analyze that data.” Which brings us to our last point. The last thing that you want to be able to do and be familiar with is be able to visualize the data. And so with visualizing the data, you have some options there. So, you can use Zeppelin or some of the other notebooks out there, or even some custom-built… If you’re familiar with front-end development, you can kind of focus in on some of the tools out there for making some really cool charts and really cool different ways to be able to visualize the data that’s coming in.
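If you just want to get your feet wet with the analyze and visualize steps before picking up Hive, MLlib, or a notebook, a tiny standard-library sketch like the one below shows the idea. The daily counts and the function names are invented for the example, and the text bar chart is only a stand-in for a real charting tool.

```python
from statistics import mean

# Made-up event counts per day, e.g. the output of an ETL job.
EVENTS_PER_DAY = {"Mon": 12, "Tue": 30, "Wed": 18}


def summarize(counts):
    """Analyze: compute a few basic statistics over the counts."""
    values = list(counts.values())
    return {
        "total": sum(values),
        "mean": mean(values),
        "busiest": max(counts, key=counts.get),
    }


def bar_chart(counts, width=20):
    """Visualize: render one text bar per label, scaled to the peak value."""
    peak = max(counts.values())
    lines = []
    for label, value in counts.items():
        bar = "#" * round(value / peak * width)
        lines.append(f"{label:<4}{bar} {value}")
    return "\n".join(lines)


print(summarize(EVENTS_PER_DAY))
print(bar_chart(EVENTS_PER_DAY))
```

The same two questions (what does the data say, and how do I show it) carry over directly once the numbers come from a real table instead of a hard-coded dictionary.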

Four Pillars – Learning Road Map for Data Engineers

So, the four big pillars there, remember, are moving your data – so being able to load data in and out of HDFS and object-based storage, and then also I’d mix a little Kafka in there – performing some kind of ETL on the data, being able to analyze the data, and then being able to visualize the data. In my career, I’ve put more emphasis around the moving data and the ETL portion. But for whatever you’re trying to do… Or your skill base may be different. Maybe you’re going to focus more on the analyzing of the data and the visualization of the data. But those are the four keys that I would look at for a roadmap to becoming a better data engineer or even just getting into data engineering. All that being said, I will say… I did four. Kind of draw a box around those four pillars and say as we’re doing those, make sure we’re understanding how to secure that data, for bonus points. So, as you’re doing it, make sure you’re using security best practices and learning some of those pieces, because as we start implementing these and putting them into the enterprise, we want to make sure that we’re securing that data. So, that’s all today for Big Data, Big Questions. Make sure you subscribe, so you never miss an episode, all this awesome greatness. If you have any questions, make sure you use the #bigdatabigquestions on Twitter. Put it in the comment section here on the YouTube video or go to my blog and see Big Data, Big Questions. And I will answer your questions here. Thanks again, folks.

Show Notes

HDFS Command Line Course

Pig Latin Getting Started