
What’s New in Hadoop 3.0?

December 20, 2017 by Thomas Henson 1 Comment

New in Hadoop 3.0

Major Hadoop Release!

Hadoop 3.0 has dropped! There is a lot of excitement in the Hadoop community around a 3.0 release. Now is the time to find out what’s new in Hadoop 3.0 so you can plan an upgrade for your existing Hadoop clusters. In this video I will explain the major changes in Hadoop 3.0 that every Data Engineer should know.

Transcript – What’s New in Hadoop 3.0?

Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. In today’s episode, we’re going to talk about some exciting new changes in Hadoop 3.0 and why Hadoop has decided to go with a major release in Hadoop 3.0, and what all is in it. Find out more right after this.

Thomas: So, today I wanted to talk to you about the changes that are coming in Hadoop 3.0. It’s already gone through the alpha, and now we’re actually in the beta phase, so you can go out there, download it, and play with it. But what are these changes in Hadoop 3.0, and why did we go with such a major release for Hadoop 3.0? So, what all is in this one? There are two major ones that we’re going to talk about, but let me talk about some of the other changes that are involved, too. The first one is more support for containerization. If you go to the Hadoop 3.0 website, you can actually go through some of the documentation and see where they’re starting to support some of the Docker pieces. And so this is just more evidence of the containerization of the world. We’ve seen it with Kubernetes, and there are a lot of other pieces out there with Docker. It’s almost a buzzword to some extent, but it’s really, really been popularized.

These are really cool changes, too, when you think about it. Because if we go back to when we were in Hadoop 1.0 and even 2.0, it’s kind of been a third rail to say, “Hey, we’re going to virtualize Hadoop.” And now we’re fast-forwarding and switching to some of the containers, so those are some really cool changes that are coming. Obviously there are going to be more and more changes that happen [INAUDIBLE 00:01:37], but this is really laying some of the foundation for support for Docker and some of the other major container players out there in the IT industry.

Another big change that we’re starting to see… Once again, I won’t say it’s a monumental change, but it’s more evidence of support for the cloud. The first piece is expanded support for Azure Data Lake. So, think of the unstructured data there, maybe some of our HDFS components. And then there are also some big changes for Amazon’s AWS S3. With S3, they’re actually going to allow for easier management of your metadata with DynamoDB, which is a huge NoSQL database used on the AWS platform. So, those are what I would say are some of the minor changes. Those changes alone probably wouldn’t have pushed it to be Hadoop 3.0 or a major release.

The two major changes…and these deal with the way that we store data, and also with the way that we protect our data for disaster recovery, when you start thinking about those enterprise features that you need to have. The first one is support for more than two NameNodes. We’ve had support since Hadoop 2.0 for a standby NameNode. Before we had a standby NameNode, or even a secondary NameNode, if your NameNode went down, your Hadoop cluster was all the way down, right?

Because that’s where all of your metadata is stored, and it knows what data is allocated on each of the DataNodes. And so once we were able to have that standby NameNode and that shared journal, if one NameNode went down, you could fail over to another one. But when we start thinking about fault tolerance and disaster recovery for enterprises, we probably want to be able to expand that out. And so being able to have more NameNodes is one of the ways that we’re actually going to tackle that in the enterprise.

So, Hadoop 3.0 is able to support more than two NameNodes. If you think about it with just some simple calculations, one of the examples is: if you have three NameNodes and five shared JournalNodes, you can actually take the loss of two NameNodes. So, you could lose two NameNodes, and your Hadoop cluster would still be up and running, still able to run your MapReduce jobs, or if you’re using Spark or something like that, you still have access to your Hadoop cluster. And so that’s a huge change when we start to think about where we’re going with the enterprise and enterprise adoption. You’re seeing a lot of features and requests coming from enterprise customers saying, “Hey, this is the way that we do DR. We’d like to have more fault tolerance built in.” And you’re starting to see that.
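To make that a little more concrete, here is a minimal sketch of the hdfs-site.xml properties involved, written out as a Python dictionary. The property names are the standard HDFS HA keys, but the “mycluster” nameservice ID, the host names, and the ports are hypothetical placeholders I picked for illustration, and this is not a complete HA configuration (failover proxy provider, fencing, and ZooKeeper settings are left out).

```python
# A sketch (hypothetical hosts, not from the video) of the HA-related settings
# for a cluster with three NameNodes and five JournalNodes in Hadoop 3.
ha_settings = {
    "dfs.nameservices": "mycluster",
    # Hadoop 3.0 lets you list more than two NameNode IDs here.
    "dfs.ha.namenodes.mycluster": "nn1,nn2,nn3",
    "dfs.namenode.rpc-address.mycluster.nn1": "nn1.example.com:8020",
    "dfs.namenode.rpc-address.mycluster.nn2": "nn2.example.com:8020",
    "dfs.namenode.rpc-address.mycluster.nn3": "nn3.example.com:8020",
    # All NameNodes share one Quorum Journal Manager URI; with five
    # JournalNodes, a majority of three must stay up to accept edits.
    "dfs.namenode.shared.edits.dir": (
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;"
        "jn3.example.com:8485;jn4.example.com:8485;"
        "jn5.example.com:8485/mycluster"
    ),
}
```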

So, that was a huge change. One caveat around that: there is support for those extra NameNodes, but they’re still in standby mode. So, they’re not what we talk about when we talk about HDFS federation. It’s not supporting three or four different NameNodes over different portions of HDFS. I’ve actually got a blog post that you can check out about HDFS federation and kind of where I see that going and how that’s a little bit different, too. So, that was a big change. And then the huge change…I’ve seen some of the results on this before it even came out to the alpha. I think they did some testing at Yahoo Japan. But it’s about using erasure coding for storing the data. So, think about how we store data in HDFS. Remember the default replication factor of three, so three-times replication. As data comes in, it’s written to one of your DataNodes, and then two [INAUDIBLE 00:05:04] copies are moved to a different rack on two different DataNodes. And that’s to give you fault tolerance. So, if you lose one DataNode, you still have your data in a separate rack, and you’d still be able to run your MapReduce jobs or your Spark jobs, or whatever you’re trying to do with your data. Maybe you’re just trying to pull it back.

That’s how we’ve traditionally stored it. If you needed more protection, you just bumped the replication factor up. But that’s really inefficient. Sometimes we talk about that being 200% overhead for one data block. But really, it’s more than that, because most customers will have a DR cluster, and they have it triple replicated over there. So, when you start to think about it: “Okay, in our Hadoop cluster, we have it triple replicated. In our DR Hadoop cluster, we have it triple replicated.” Oh, and the data may exist somewhere else as the source data outside of your Hadoop clusters. That’s seven copies of the data. And how efficient is that for data that’s maybe mostly archive? Or maybe it’s compliance data you want to keep in your Hadoop cluster.

Maybe you run [INAUDIBLE 00:06:03] over it once a year. Maybe not. Maybe it’s just something you want to hold on to so that if you do want to run a job, you can. So, what erasure coding is going to do is give you the ability to store that data at a different rate. Instead of having to triple replicate it, what erasure coding basically says is, “Okay, if we have data, we’re going to break it into six different data blocks, and then we’re going to store three [INAUDIBLE 00:06:27],” versus, when we’re doing triple replication, think of having 12. And so the ability to break that data down and pull it back from the [INAUDIBLE 00:06:36] is going to give you a better ratio for how you’re storing that data and what your efficiency rate is, too.
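To put some rough numbers on that, here is a quick back-of-the-envelope sketch. The figures are mine, based on the RS-6-3 Reed-Solomon layout that Hadoop 3 ships as a built-in policy, not numbers from the video.

```python
# Back-of-the-envelope storage comparison for six blocks of data:
# 3x replication versus Reed-Solomon RS(6,3) erasure coding.
data_blocks = 6                         # the original data, split into 6 blocks

replicated_blocks = data_blocks * 3     # 3x replication writes 18 blocks
ec_blocks = data_blocks + 3             # erasure coding writes 6 data + 3 parity = 9

print(replicated_blocks / data_blocks)  # 3.0 -> 300% total footprint (200% overhead)
print(ec_blocks / data_blocks)          # 1.5 -> 150% total footprint (50% overhead)
```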

So, instead of 200%, maybe it’s going to be closer to 125 or 150. It’s just going to depend as you scale. Just something to look forward to. But it’s really cool, because it gives you the ability to, one, store more data – bring in more data, hold on to it, and not think so much about “okay, this is going to take up three times the space just for how big the file is.” And so it gives you the ability to hold on to more data and take somewhat more of a risk and be like, “Hey, I don’t know that we need that data right now, but let’s hold on to it, because we know that we can use erasure coding and store it at a different rate. And then if it’s something that we need later on, we can bring it back.” So, think of erasure coding as more of an archive for your data in HDFS.
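If you want to try this once you’re on Hadoop 3, erasure coding policies are applied per directory with the hdfs ec command. Here is a minimal sketch that shells out to that command; the /data/archive path is a hypothetical example I made up, and RS-6-3-1024k is one of the built-in Reed-Solomon policies.

```python
# A minimal sketch (hypothetical path, not from the video) of applying the
# built-in RS-6-3-1024k erasure coding policy to an archive directory.
# Assumes the hdfs client is on the PATH and you're on a Hadoop 3.x cluster.
import subprocess

archive_path = "/data/archive"  # hypothetical HDFS directory for cold data

# Show the erasure coding policies the cluster knows about.
subprocess.run(["hdfs", "ec", "-listPolicies"], check=True)

# Apply the Reed-Solomon 6-data/3-parity policy to the archive directory;
# new files written under it are erasure coded instead of triple replicated.
subprocess.run(
    ["hdfs", "ec", "-setPolicy", "-path", archive_path, "-policy", "RS-6-3-1024k"],
    check=True,
)

# Confirm which policy is now in effect on the directory.
subprocess.run(["hdfs", "ec", "-getPolicy", "-path", archive_path], check=True)
```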

And so those are the major changes in Hadoop 3.0. I just wanted to talk to you guys about that and just kind of get that out there. Feel free to send me any questions. So, if you have any questions for Big Data, Big Questions, feel free to go to my website, put it on Twitter, #bigdatabigquestions, put it in the comments section here below. I’ll answer those questions here for you. And then as always, make sure you subscribe so you never miss an episode. Always talking big data, always talking big questions and maybe some other tidbits in there, too. Until next time. See everyone then. Thanks.

Show Notes

Hadoop 3.0 Alpha Notes

Hadoop Summit Slides on Yahoo Japan Hadoop 3.0 Testing

DynamoDB NoSQL Database on AWS

Kubernetes 

 

Filed Under: Hadoop Tagged With: Big Data Big Questions, Hadoop, HDFS

Learning Roadmap for Data Engineers?

December 19, 2017 by Thomas Henson Leave a Comment

Learning Roadmap for Data Engineers

Is there a learning roadmap for Data Engineers?

Data Engineering is a highly sought-after field for Developers and Administrators. One factor driving developers into that space is the average salary of $100K – $150K, which is well above average for IT professionals. How does a developer start to transition into the field of Data Engineering? In this video I give the four pillars that developers and administrators need to follow to develop skills in Data Engineering. Watch the video to learn how to become a better Data Engineer…

Transcript – Learning Roadmap For Data Engineers

Thomas: Hi, Folks. I’m Thomas Henson with thomashenson.com. Welcome back to another episode of Big Data, Big Questions. Today we’re going to talk about some learning challenges for the data engineer. And so we’re actually going to title this a roadmap learning for data engineers. So, find out more right after this.

Big Data Big Questions

Thomas: So, today’s question comes in from YouTube. If you want to ask a question, post it here in the comments and have it answered. Most of the questions that I’m answering are questions that I’ve gotten from the field, from meeting with customers, or that I get over, and over, and over. And then a lot of the questions are coming in from YouTube and from Twitter. You can go to Big Data, Big Questions on my website, thomashenson.com, and submit your questions there. Whatever way you want to ask, use #bigdatabigquestions on Twitter, and I will pull those out and answer them. So, today’s question comes in from YouTube. It’s from Chris. And Chris says, “Hi, Thomas. I hold a master’s degree in computer information systems. My question is, is there any roadmap to learn on this course called data engineer? I have intermediate knowledge of Python and SQL. So, if there’s anything else I need to learn, please reply.”

Well, thanks for your question, Chris. It’s a very common question. We’re always wanting to understand how we can learn more, how we can move up in our field, how we can become a specialist. So, data engineering is a sought-after field in IT with high demand, but there’s not really a defined roadmap for it, so you have to look at what some people are learning and what other people are saying the required skills are. So, you’re asking what I see and what I think are the skills that you need, based off your Python and SQL background. Well, I’m going to break it down into four different categories. I think there are four important things that you need to learn, and there are different ways to learn them. I’ll talk a little bit about that and give you some resources. All the resources for this will be posted on my blog, thomashenson.com. Look up Roadmap Learning for Data Engineer, and under that video you’ll see all the links for the resources.


Ingesting Data

The first thing is you need to be able to move data in and out. Most likely you’re going to want to know how to move data into HDFS. So, you want to know how to move that data in, how to move it out, and how to use the tools. You can use Flume, or just use some of the HDFS commands. You also want to know how to do that from an object perspective. So, if in your workflow you’re looking to move data from object-based storage and still use it in Hadoop or the Hadoop ecosystem, then you’d want to know that. And then I also mix in a little bit of Kafka there, too. So, understanding Kafka. The important point is being able to move data in and out, so you can ingest data into your system.
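As a tiny illustration of that pillar, here is a sketch of moving a local file into HDFS and back out with the plain hdfs dfs commands. The file and directory paths are made-up placeholders; in a real pipeline you might use Flume or Kafka instead of a one-off copy.

```python
# A sketch of the "move data in and out" pillar using plain hdfs dfs commands.
# Paths are hypothetical placeholders; assumes the hdfs client is on the PATH.
import subprocess

local_file = "/tmp/events.csv"    # hypothetical local file to ingest
hdfs_dir = "/data/raw/events"     # hypothetical landing directory in HDFS

# Create the landing directory (with parents) and copy the file in.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)

# Pull data back out of HDFS, e.g. for a downstream system.
subprocess.run(
    ["hdfs", "dfs", "-get", hdfs_dir + "/events.csv", "/tmp/events_copy.csv"],
    check=True,
)
```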

ETL

The next one is to be able to perform ETL. So, being able to transform data that’s already in place or as it’s coming into your system. Some of the tools there… If you’ve watched any of my videos, you know that I got my start in Pig, so being able to use Pig, or MapReduce jobs, or maybe even some Python jobs, or Spark, to be able to transform that data. We want to be able to take some kind of structured or semi-structured data and transform it, being able to move that into a MapReduce job, a Spark job, or transform it maybe with Pig and pull some information out. So, being able to do ETL on the data, which rolls into the next point, which is being able to analyze the data.
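Since Spark keeps coming up, here is a minimal PySpark sketch of that ETL step: read some semi-structured CSV, clean and transform it, and write it back out as Parquet. The paths, column names, and filter condition are all hypothetical examples, not from the video.

```python
# A minimal PySpark ETL sketch: read semi-structured CSV, clean/transform it,
# and write it back out as Parquet. All paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

raw = spark.read.csv("/data/raw/events", header=True, inferSchema=True)

cleaned = (
    raw.dropna(subset=["user_id"])                        # drop rows missing a key field
       .withColumn("event_date", F.to_date("event_time"))  # derive a date column
       .filter(F.col("event_type") == "purchase")          # keep only the rows we care about
)

cleaned.write.mode("overwrite").parquet("/data/curated/purchases")
```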

Analyze & Visualize

So, being able to analyze the data, whether you already have that data and you’re transforming it, or maybe you’re moving it into a data warehouse in the Hadoop ecosystem. So, maybe you move it into Hive. Or maybe you’re transforming some of that data, capturing it, and pulling it into HBase, and then you want to be able to analyze that data, maybe with Mahout or MLlib. There are a lot of different tutorials out there that you can do, and it’s just kind of getting your feet wet, understanding, “Okay, we’ve got the data. We were able to move the data in, perform some kind of ETL on it, and now we want to analyze that data.” Which brings us to our last point. The last thing that you want to be familiar with is being able to visualize the data. And with visualizing the data, you have some options. You can use Zeppelin or some of the other notebooks out there, or even something custom built… If you’re familiar with front-end development, you can focus in on some of the tools out there for making really cool charts and really cool ways to visualize the data that’s coming in.
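To round out those last two pillars, here is a small sketch that aggregates the curated data with Spark SQL and then plots the result. It assumes the hypothetical Parquet output from the ETL sketch above, plus pandas and matplotlib available on the driver; in practice you might do the same thing interactively in a Zeppelin notebook.

```python
# A sketch of the analyze + visualize pillars: aggregate with Spark SQL, then
# chart the small result with matplotlib. Paths and column names are hypothetical.
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyze-sketch").getOrCreate()

purchases = spark.read.parquet("/data/curated/purchases")
purchases.createOrReplaceTempView("purchases")

# Analyze: a simple aggregate over the curated data.
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS purchase_count
    FROM purchases
    GROUP BY event_date
    ORDER BY event_date
""")

# Visualize: the aggregate is small, so pull it to the driver and plot it.
pdf = daily.toPandas()
pdf.plot(x="event_date", y="purchase_count", kind="bar", legend=False)
plt.ylabel("purchases")
plt.tight_layout()
plt.savefig("daily_purchases.png")
```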

Four Pillars – Learning Road Map for Data Engineers

So, the four big pillars there, remember, are moving your data – being able to load data in and out of HDFS and object-based storage, and I’d also mix a little Kafka in there – performing some kind of ETL on the data, being able to analyze the data, and then being able to visualize the data. In my career, I’ve put more emphasis on moving data and the ETL portion. But your goals or your skill base may be different; maybe you’ll focus more on analyzing and visualizing the data. But those are the four keys that I would look at for a roadmap to becoming a better data engineer, or even just getting into data engineering. All that being said, I did four, but kind of draw a box around those four pillars and, as we’re doing those, make sure we understand how to secure that data as a bonus point. So, as you’re doing it, make sure you’re using security best practices and learning some of those pieces, because when we start implementing and putting these into the enterprise, we want to make sure that we’re securing that data. So, that’s all today for Big Data, Big Questions. Make sure you subscribe so you never miss an episode of all this awesome greatness. If you have any questions, make sure you use #bigdatabigquestions on Twitter, put it in the comment section here on the YouTube video, or go to my blog and see Big Data, Big Questions. And I will answer your questions here. Thanks again, folks.

 

Show Notes

HDFS Command Line Course

Pig Latin Getting Started

Kafka

Spark

Mahout

MADlib

HBase

 

 

Filed Under: Data Engineers Tagged With: Big Data, Data Engineer Skills, Data Engineers

How to Find HDFS Path URL?

December 17, 2017 by Thomas Henson 1 Comment