
What’s New in Hadoop 3.0?

December 20, 2017 by Thomas Henson 1 Comment

New in Hadoop 3.0

Major Hadoop Release!

Hadoop 3.0 has dropped! There is a lot of excitement in the Hadoop community for a 3.0 release. Now is the time to find out what's new in Hadoop 3.0 so you can plan an upgrade to your existing Hadoop clusters. In this video I explain the major changes in Hadoop 3.0 that every Data Engineer should know.

Transcript – What’s New in Hadoop 3.0?

Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. In today’s episode, we’re going to talk about some exciting new changes in Hadoop 3.0 and why Hadoop has decided to go with a major release in Hadoop 3.0, and what all is in it. Find out more right after this.

Thomas: So, today I wanted to talk to you about the changes that are coming in Hadoop 3.0. It's already gone through the alpha, and now we're in the beta phase, so you can actually go out there, download it, and play with it. But what are these changes in Hadoop 3.0, and why did we go with such a major release? There are two major changes that we're going to talk about, but let me mention some of the other ones that are involved with this release, too. The first one is more support for containerization. If you go to the Hadoop 3.0 website, you can look through some of the documentation and see where they're starting to support some of the Docker pieces. This is just more evidence for the containerization of the world. We've seen it with Kubernetes, and there are a lot of other pieces out there with Docker. It's almost a buzzword to some extent, but it's really, really been popularized.

These are really cool changes, too, when you think about it. Because if we go back to Hadoop 1.0 and even 2.0, it was kind of a third rail to say, "Hey, we're going to virtualize Hadoop." Now we're fast forwarding and switching to containers, and those are going to be some really cool changes. Obviously there are going to be more and more changes that happen [INAUDIBLE 00:01:37], but this is really laying the foundation for that support for Docker and some of the other major container players out there in the IT industry.

Another big change that we're starting to see… Once again, I won't say it's a monumental change, but it's more evidence of support for the cloud. First, there's expanded support for Azure Data Lake. So, think unstructured data there, maybe some of our HDFS components. And then there are also some big changes around Amazon's AWS S3. With S3, they're going to allow for easier management of your metadata with DynamoDB, which is a huge NoSQL database on the AWS platform. So, those are two of what I would call the minor changes. Those changes alone probably wouldn't have pushed it to be Hadoop 3.0, or a major release.

The two major changes deal with the way that we store data, and with the way that we protect our data for disaster recovery, when you start thinking about those enterprise features that you need to have. The first one is support for more than two namenodes. We've had support since Hadoop 2.0 for a standby namenode. Before we had a standby namenode, or even a secondary namenode, if your namenode went down, your whole Hadoop cluster was down, right?

Because that's where all your metadata is stored, and it knows what data is allocated on each of the data nodes. Once we were able to have that standby namenode and that shared journal, if one namenode went down, you could fail over to another one. But when we start thinking about fault tolerance and disaster recovery for enterprises, we probably want to be able to expand that out. And so this is one of the ways that we're going to tackle that in the enterprise.

So, Hadoop 3.0 is able to support more than two namenodes. If you do some quick calculations, one of the examples is: if you have three namenodes and five shared journal nodes, you can actually tolerate the loss of two namenodes. So, you could lose two namenodes, and your Hadoop cluster would still be up and running, still able to run your MapReduce jobs, or your Spark jobs or something like that, and you'd still have access to your Hadoop cluster. That's a huge change when we start to think about where we're going with enterprise adoption. You're seeing a lot of feature requests coming from enterprise customers saying, "Hey, this is the way that we do DR. We'd like to have more fault tolerance built in." And you're starting to see that.
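To put rough numbers on that fault-tolerance math, here is a small sketch (my own illustration of the arithmetic, not something from the Hadoop docs): the JournalNodes form a majority quorum for edit-log writes, so a minority of them can fail, and with automatic failover any surviving namenode can take over as active.

```python
# Back-of-the-envelope fault tolerance for an HA HDFS setup.
# Illustrative only; exact behavior depends on your deployment.

def journal_node_failures_tolerated(journal_nodes: int) -> int:
    """JournalNodes require a majority quorum to accept edit-log
    writes, so the cluster survives losing a minority of them."""
    return (journal_nodes - 1) // 2

def namenode_failures_tolerated(namenodes: int) -> int:
    """With automatic failover, the cluster stays up as long as
    at least one namenode survives."""
    return namenodes - 1

# The example from the video: 3 namenodes, 5 shared journal nodes.
print(journal_node_failures_tolerated(5))  # -> 2 journal nodes can fail
print(namenode_failures_tolerated(3))      # -> 2 namenodes can fail
```

With three namenodes and five journal nodes, both numbers come out to two, which matches the "lose two namenodes and stay up" example above.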

So, that was a huge change. One caveat: you get support for those extra namenodes, but they're still in standby mode. They're not what we talk about with HDFS federation, so it's not supporting three or four different namenodes serving different portions of HDFS. I've actually got a blog post you can check out about HDFS federation, where I see that going, and how that's a little bit different, too. So, that was a big change. And then the huge change… I've seen some of the results on this before it even came out in the alpha; I think Yahoo Japan did some of the testing. It's about using erasure coding for storing the data. Think about how we store data in HDFS. Remember the default of three, so three times replication: as data comes in, it's written to one of your data nodes, and then two additional copies are moved to two different data nodes on a different rack. That's to give you fault tolerance. If you lose one data node, you still have your data in a separate rack and can run your MapReduce jobs or your Spark jobs, or whatever you're trying to do with your data. Maybe you're just trying to pull it back.
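The default placement described above (one copy on the local data node, two more on a different rack) can be sketched roughly like this. The node and rack names are made up for illustration; the real policy in HDFS has more nuance around load and topology.

```python
import random

def place_replicas(local_node, local_rack, racks):
    """Rough sketch of HDFS's default 3x placement: replica 1 on
    the writer's node, replicas 2 and 3 on two different nodes
    of a single remote rack."""
    remote_racks = [r for r in racks if r != local_rack]
    remote_rack = random.choice(remote_racks)
    second, third = random.sample(racks[remote_rack], 2)
    return [(local_rack, local_node),
            (remote_rack, second),
            (remote_rack, third)]

# Hypothetical two-rack cluster with three data nodes per rack.
racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
print(place_replicas("dn1", "rack1", racks))
# e.g. [('rack1', 'dn1'), ('rack2', 'dn5'), ('rack2', 'dn4')]
```

Losing any single data node, or even the whole local rack, still leaves at least one readable copy, which is the fault tolerance the transcript is describing.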

That's how we traditionally stored it, and if you needed more protection, you just bumped the replication up. But that's really inefficient. We sometimes talk about that being 200% overhead for each data block. But really, it's more than that, because most customers will have a DR cluster with the data triple replicated over there, too. So, you start to think, "Okay, in our Hadoop cluster, we have it triple replicated. In our DR Hadoop cluster, we have it triple replicated." Oh, and the data may exist somewhere else as the source data outside of your Hadoop clusters. That's seven copies of the data. How efficient is that for data that's maybe mostly archive? Or maybe it's compliance data that you want to keep in your Hadoop cluster.

Maybe you run [INAUDIBLE 00:06:03] over it once a year. Maybe not. Maybe it's just something you want to hold on to so that if you do want to run a job, you can. What erasure coding is going to do is give you the ability to store that data at a different rate. Instead of triple replicating it, erasure coding basically says, "Okay, we're going to break the data into six data blocks, and then we're going to store three parity blocks," versus the 18 blocks you'd have with triple replication. The ability to break the data down and reconstruct it from the parity blocks gives you a much better ratio for how you store that data, and a better efficiency rate, too.

So, instead of 200%, maybe it's going to be closer to 125% or 150%. It's going to depend as you scale. Just something to look forward to. It's really cool because it gives you the ability to, one, store more data: bring in more data, hold on to it, and not think so much about, "Okay, this is going to take up three times the space just for how big the file is." It lets you hold on to more data and take somewhat more of a risk and say, "Hey, I don't know that we need that data right now, but let's hold on to it, because we know we can use erasure coding and store it at a cheaper rate. Then if it's something we need later on, we can bring it back." So, think of erasure coding as more of an archive tier for your data in HDFS.
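The storage math above works out roughly like this. This is a back-of-the-envelope sketch; the 6-data, 3-parity layout matches the Reed-Solomon RS(6,3) scheme the transcript describes, and the overhead figures are idealized (they ignore small-file and block-padding effects).

```python
def replication_storage_factor(replicas: int) -> float:
    """Total storage as a multiple of the raw data size
    under plain N-times replication."""
    return float(replicas)

def erasure_coding_storage_factor(data_blocks: int, parity_blocks: int) -> float:
    """Reed-Solomon stores data blocks plus parity blocks, so total
    storage relative to raw data is (data + parity) / data."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_storage_factor(3))        # 3.0x stored -> 200% overhead
print(erasure_coding_storage_factor(6, 3))  # 1.5x stored -> 50% overhead
```

So RS(6,3) stores 1.5x the raw data instead of 3x while still tolerating the loss of up to three of the nine blocks, which is where the "closer to 150%" figure comes from.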

And so those are the major changes in Hadoop 3.0. I just wanted to talk to you guys about that and get it out there. Feel free to send me any questions. If you have any questions for Big Data, Big Questions, go to my website, put it on Twitter with #bigdatabigquestions, or put it in the comments section below, and I'll answer those questions for you. And as always, make sure you subscribe so you never miss an episode. Always talking big data, always talking big questions, and maybe some other tidbits in there, too. Until next time. See everyone then. Thanks.

Show Notes

Hadoop 3.0 Alpha Notes

Hadoop Summit Slides on Yahoo Japan Hadoop 3.0 Testing

DynamoDB NoSQL Database on AWS

Kubernetes 

 

