Big Data Big Questions: Learning to Become a Data Engineer?

September 22, 2017 by Thomas Henson

For the past few years, Data Scientist has been named the sexiest job in IT. However, the Data Engineer is a huge part of the Big Data movement and one of the top paying jobs in IT. On average, a Data Engineer can make anywhere from $90K to $150K a year.

Data Engineers are responsible for moving large amounts of data, administering the Hadoop/streaming/analytics clusters, and writing MapReduce, Spark, Flink, and similar data processing jobs.
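
To make that last point concrete, here is a minimal sketch of the kind of Spark batch job a Data Engineer might write: read raw logs, aggregate them, and write the result back for analysts to query. This is only an illustration; the HDFS paths and column names (event_time, user_id) are hypothetical, not taken from any real project.

    # Minimal PySpark batch job sketch: read raw logs, aggregate, write results.
    # The HDFS paths and column names below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

    # Read raw JSON event logs from HDFS (path is an assumption for illustration).
    events = spark.read.json("hdfs:///data/raw/events/")

    # Count events per user per day.
    daily_counts = (
        events
        .withColumn("event_date", F.to_date("event_time"))
        .groupBy("user_id", "event_date")
        .count()
    )

    # Write the aggregate back to HDFS as Parquet for downstream analysis.
    daily_counts.write.mode("overwrite").parquet("hdfs:///data/curated/daily_event_counts/")

    spark.stop()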

With all this excitement for Data Analytics and Data Engineers, how can you get involved in this community?

Ready to learn tips for becoming a Data Engineer? Check out the video below.


Transcript

Hi Folks, I’m Thomas Henson, with thomashenson.com, and welcome back to another episode of Big Data, Big Questions. Today’s question is: What are some tips for learning to become a better data engineer? Find out more right after this.

So, today’s episode is all about tips for learning to become a better data engineer. So, if you’re watching this, you’re probably concerned with, one, how can I start out becoming a data engineer? What are some ways that I can learn to become better? Or maybe you’re just looking to answer one specific question. But all those are encompassed in what we call the data engineer.

A data engineer is somebody who's concerned with moving data in and out of the Hadoop ecosystem, being able to give data scientists and data analysts better views into the data. So, we're involved with the day-to-day interactions of how that data is coming in. How are we ingesting that data? How are we creating those applications and tuning those applications so that the data comes in faster? All to support those business analysts, those business decisions, and data scientists in creating better models and having just more data to put their hands on.

And so, a lot of times what we're asked to do is take on a couple terabytes of data here, or implement and do all the configuration for your Hive implementation, or HBase, or anything else in that big data ecosystem. Some of the tips that I've found for just getting started, if you're brand new to this and don't know where to start: the first thing I would recommend is to go out and just download the sandboxes.

So, download Cloudera's sandbox, or download Hortonworks' sandbox, and just start playing with it. Go through some of the tutorials. Stand it up on your local machine in a VM environment, and just start playing with moving some of the data around. Find some sample data, so go to data.world. Also, I have a post and a video on where to find some data sets, so take those data sets and start ingesting them. I have a ton of resources and a ton of material on simple examples that you can walk through with Pig, and some around Hive. So, go there and find some of those. But basically what I'm saying is, just get hands-on. Start creating applications. Start trying to do some simple things: ingest some data, put it into Hive, create a table, and pull some of that data out with maybe some simple Hive queries. Then do the same thing with Pig, and just go around to some of those applications that you're curious about and start playing with them.
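
If it helps to see what that hands-on exercise can look like, below is a rough PySpark sketch of it: ingest a sample CSV, save it as a Hive table, and pull some data back out with a simple query. It assumes Spark with Hive support is available, as it is on the Hortonworks and Cloudera sandboxes; the file path, table name, and columns are made-up examples, not taken from the tutorials mentioned above.

    # Hands-on sketch: ingest a sample CSV, save it as a Hive table, query it.
    # The CSV path, table name, and columns are hypothetical examples.
    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark create and query Hive tables on the sandbox.
    spark = (
        SparkSession.builder
        .appName("sandbox-hive-exercise")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Ingest a sample data set (for example, a CSV downloaded from data.world).
    flights = spark.read.csv("/tmp/sample_flights.csv", header=True, inferSchema=True)

    # Save the DataFrame as a Hive table in the default database.
    flights.write.mode("overwrite").saveAsTable("flights")

    # Pull some of that data back out with a simple query.
    spark.sql("""
        SELECT origin, COUNT(*) AS num_flights
        FROM flights
        GROUP BY origin
        ORDER BY num_flights DESC
        LIMIT 10
    """).show()

    spark.stop()

The same ingest-and-query flow could just as easily be done in Pig or straight from the Hive shell; the point is simply to get some data in and pull it back out.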

Another thing is, once you start playing, and sampling, and testing that data, get involved. By getting involved, I mean ask some questions, create a blog post, try to find a way that you can contribute back to the community. That's what I did when I was first starting out. I started off with a sandbox, and I made sure that every day for 30 minutes, I was learning something new in the Hadoop ecosystem. And so, that's another tip for you: try to do this 30 minutes a day, every day. Even Saturdays, Sundays. Don't take a day off. And it's only 30 minutes. If it's something that you're passionate about and you like doing, that time is just going to fly by. But over time, that's really going to give you more and more time in the Hadoop ecosystem. So, whether you're doing this for a project at work, or you're already in the ecosystem and just trying to improve, that 30 minutes a day is really going to help. And it's something that I've continued to do, even now that I've been part of the community for three or four years. It's how I just continue to learn, so I make sure I'm always kind of pushing.

Filed Under: Career Tagged With: Big Data, Big Data Big Questions, Data Engineer
