
Learn HDFS Without Java?

June 3, 2019 by Thomas Henson 1 Comment


HDFS Skills Without Java

In the world of Hadoop and Big Data, HDFS is king. Data Engineers looking to boost their administrative skills first learn to navigate the Hadoop Distributed File System (HDFS) before jumping to more complex tasks. But if Hadoop is written in Java, does that mean you need to know Java programming to work with HDFS? In this video I break down what HDFS is and how to learn it without needing to know Java. Find out more by watching this episode of Big Data Big Questions.

Transcript – Learn HDFS Without Java?

Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question came in from a live session. If you’re not familiar, I do a live session sometimes on the weekends, and I’m thinking about incorporating another one. If you’d like to be a part of those, make sure you check them out. I’ll post those. Also, let me know if there’s a better time for me to do these. If you’d like to see maybe a Wednesday night or a Tuesday night episode, let me know. Put it in the comments section below, and if you have a question, go ahead and throw it down just like this one.

This one came from my live session. It was actually one of the last questions as I was dropping off, so I wanted to make sure I got it answered and out there. The question is: can you learn HDFS without Java? It’s a little bit similar to some of the other questions I’ve answered around Hadoop and MapReduce, like can you do Hadoop, or MapReduce, or Spark without Java? But this one takes a twist toward the administrator side. We’ve discussed before the difference between the big data developer and the big data administrator, and this one is more around the administration. On the other questions, like can you do Spark or MapReduce without Java, the answer was always, hey, it depends. You absolutely can, but there might be an instance where you need to import something or work with somebody who’s already using Java. For this one, no Java.

You’re cleared. You don’t have to worry about that, and one of the reasons is, if you think about it from an administrator perspective, what we’re really trying to do is move data around and handle tasks like updates. I did a whole course around HDFS from the command line, and you can go through that entire course and never touch anything Java-related. It’s more about configuration files and working from that perspective. No worries, no need for Java to be able to do HDFS.

From a high level, let’s look at some HDFS commands and understand what we mean when we say, “Hey, no need for Java,” but definitely a need for Linux. From the command line, all the commands we’re going to run are hdfs dfs commands. To list out the files you have, you’ll use hdfs dfs -ls. This command will show you everything that’s in a directory, right? We’re looking at the files in a directory here, and it’s really similar to what you would do if you logged in to your favorite version of Linux and ran ls from the command line.
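As a rough sketch (the paths here are just placeholders, not anything from a real cluster), this is what that side-by-side looks like on a node with the Hadoop client installed:

    # List the contents of an HDFS directory, much like ls on Linux
    hdfs dfs -ls /user/thomas

    # The equivalent on the local Linux filesystem
    ls -l /home/thomas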

A lot of these commands are going to be the same. Like I said, I have a course that digs through all these different commands, but look at this one, too: hdfs dfs -mkdir. What do you think we’re doing here? If you have a background in Linux, you understand that we’re just making directories.

Lastly, some of the things you’ll also want to carry over from Linux are permissions. How can you ensure that Bob doesn’t have access to a file he shouldn’t have access to, or that the HDFS user is allowing other users to create files? That’s where permissions come in, and again, it’s similar to what we do from a Linux perspective. I have a course that’s all around this if you’re interested in checking it out, but these are some of the commands and skills you’ll need to be an HDFS administrator. I’ve also got some other resources in the description that walk through quick tutorials you can start using. All the commands you need to know, like I said, are nothing you need to recite. I actually created some of those blog posts because I couldn’t remember some of the commands myself. It’s mostly a Linux perspective, so no need to worry and no need to think, “Man, how am I going to learn Java if I want to be an HDFS administrator or start working in HDFS?” You’re totally able to do that, and you can see how simply we were able to jump in and do it. If you’re looking to run some of the commands like we just showed, go download one of the sandboxes or set up a Hadoop environment on your own. That gives you the ability to play with it in your own lab and start building out some of those other skills. Now, thanks for tuning in. Thanks for the question. If anybody has a question, make sure you put it in the comments section below. I’ll try my best to answer them, and we’ll see you on another episode of Big Data Big Questions.
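For reference, here’s a quick sketch of the commands mentioned in this episode, with placeholder paths, files, and users (your cluster’s directories and accounts will differ):

    # Make a new directory in HDFS, just like mkdir on Linux
    hdfs dfs -mkdir /user/thomas/raw_data

    # Copy a local file into HDFS
    hdfs dfs -put sales.csv /user/thomas/raw_data

    # Permissions work like Linux: restrict access to the owner and group,
    # then hand ownership to the right user and group
    hdfs dfs -chmod 750 /user/thomas/raw_data
    hdfs dfs -chown thomas:analysts /user/thomas/raw_data

    # The long listing shows the familiar Linux-style permission bits
    hdfs dfs -ls /user/thomas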

Filed Under: Hadoop Tagged With: Hadoop, HDFS, Java

What’s New in Hadoop 3.0?

December 20, 2017 by Thomas Henson 1 Comment

New in Hadoop 3.0

Major Hadoop Release!

Hadoop 3.0 has dropped! There is a lot of excitement in the Hadoop community for a 3.0 release. Now is the time to find out what’s new in Hadoop 3.0 so you can plan an upgrade for your existing Hadoop clusters. In this video I explain the major changes in Hadoop 3.0 that every Data Engineer should know.

Transcript – What’s New in Hadoop 3.0?

Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. In today’s episode, we’re going to talk about some exciting new changes in Hadoop 3.0 and why Hadoop has decided to go with a major release in Hadoop 3.0, and what all is in it. Find out more right after this.

Thomas: So, today I wanted to talk to you about the changes that are coming in Hadoop 3.0. It’s already gone through the alpha, and now we’re actually in the beta phase, so you can go out there, download it, and play with it. But what are these changes in Hadoop 3.0, and why did we go with such a major release? What all is in this one? There are two major changes we’re going to talk about, but let me mention some of the other ones that are involved with this release, too. The first one is more support for containerization. If you go to the Hadoop 3.0 website, you can actually go through some of the documentation and see where they’re starting to support some of the Docker pieces. This is just more evidence for the containerization of the world. We’ve seen it with Kubernetes, and there are a lot of other pieces out there with Docker. It’s almost like a buzzword to some extent, but it’s really, really been popularized.

These are really cool changes, too, when you think about it. Because if we go back to Hadoop 1.0 and even 2.0, it was kind of a third rail to say, “Hey, we’re going to virtualize Hadoop.” Now we’re fast forwarding and switching to containers, so there are some really cool changes coming. Obviously there are going to be more and more changes that happen [INAUDIBLE 00:01:37], but this is really laying the foundation for support for Docker and some of the other major container players out there in the IT industry.

Another big change that we’re starting to see… Once again, I won’t say it’s a monumental change, but it’s more evidence of support for the cloud. The first piece is expanded support for Azure Data Lake. So, think about the unstructured data there, maybe some of our HDFS components. Then there are also some big changes around Amazon’s AWS S3. With S3, they’re going to allow for easier management of your metadata with DynamoDB, which is a huge NoSQL database on the AWS platform. Those are, I would say, some of the minor changes. Those changes alone probably wouldn’t have pushed it to be Hadoop 3.0 or a major release.

The two major changes deal with the way that we store data and the way that we protect our data for disaster recovery, when you start thinking of those enterprise features that you need to have. The first one is support for more than two namenodes. We’ve had support since Hadoop 2.0 for a standby namenode. Before we had a standby namenode, or even a secondary namenode, if your namenode went down, your Hadoop cluster was all the way down, right?

Because that’s where all your metadata is stored, and it’s what knows which data is allocated on each of the datanodes. Once we had that standby namenode and that shared journal, if one namenode went down, you could fail over to another one. But when we start thinking about fault tolerance and disaster recovery for enterprises, we probably want to be able to expand that out. And this is one of the ways we’re actually going to tackle that in the enterprise: by having those changes.

So, now we’re able to support more than two namenodes. If you think about it with just some quick calculations, one of the examples is that if you have three namenodes and five journal nodes, you can actually tolerate the loss of two namenodes. So, you could lose two namenodes, and your Hadoop cluster would still be up and running, still able to run your MapReduce jobs, or if you’re using Spark or something like that, you’d still have access to your Hadoop cluster. That’s a huge change when we start to think about where we’re going with the enterprise and enterprise adoption. You’re seeing a lot of features and requests coming from enterprise customers saying, “Hey, this is the way that we do DR. We’d like to have more fault tolerance built in.” And you’re starting to see that.
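As a small illustration, assuming a hypothetical HA setup where the three namenodes are registered as nn1, nn2, and nn3 in hdfs-site.xml, you can check which one is active and exercise a failover with the hdfs haadmin tool:

    # Ask each namenode for its HA state; with three namenodes configured,
    # one should report "active" and the other two "standby"
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -getServiceState nn3

    # Manually fail over from the current active namenode to another one
    hdfs haadmin -failover nn1 nn2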

So, that was a huge change. One caveat around that: there’s support for those additional namenodes, but they’re still in standby mode. So, this isn’t what we talk about with HDFS federation. It’s not supporting three or four different namenodes serving different portions of HDFS. I’ve actually got a blog post you can check out about HDFS federation, where I see that going, and how that’s a little bit different, too. So, that was a big change. And then the huge change… I’ve seen some of the results on this before it even came out in the alpha; I think they did some testing at Yahoo Japan. But it’s about using erasure coding for storing the data. Think about how we store data in HDFS. Remember the default of three, so three-times replication. As data comes in through your namenode, it’s written to one of your datanodes, and then two [INAUDIBLE 00:05:04] copies are placed on a different rack on two different datanodes. That’s what gives you fault tolerance. If you lose one datanode, you still have your data in a separate rack, so you can still run your MapReduce jobs or your Spark jobs, or whatever you’re trying to do with your data. Maybe you’re just trying to pull it back.
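If you want to see that placement for yourself, fsck will report which datanodes hold each block replica, and setrep is the traditional way to bump the protection on a file (the path below is just an example):

    # Show the blocks behind a file and which datanodes hold each replica
    hdfs fsck /user/thomas/raw_data/sales.csv -files -blocks -locations

    # Change the replication factor for a file; -w waits for re-replication
    hdfs dfs -setrep -w 5 /user/thomas/raw_data/sales.csv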

That’s how we’ve traditionally stored it. If you needed more protection, you just bumped the replication up. But that’s really inefficient. Sometimes we talk about that being 200% overhead for a single data block. But really, it’s more than that, because most customers will have a DR cluster, and they have it triple replicated over there, too. So when you start to think about it: in our Hadoop cluster we have it triple replicated, in our DR Hadoop cluster we have it triple replicated, and the data may exist somewhere else as the source data outside of your Hadoop clusters. That’s seven copies of the data. How efficient is that for data that’s maybe mostly archive? Or maybe it’s compliance data you want to keep in your Hadoop cluster.

Maybe you run [INAUDIBLE 00:06:03] over it once a year. Maybe not. Maybe it’s just something you want to hold on to so that if you do want to run a job, you can. What erasure coding gives you is the ability to store that data at a different rate. Instead of having to triple replicate it, what erasure coding basically says is, “Okay, if we have data, we’re going to break it into six different data blocks, and then we’re going to store three [INAUDIBLE 00:06:27],” versus, when we’re doing triple replication, think of having 12. The ability to break that data down and pull it back from the [INAUDIBLE 00:06:36] gives you a better ratio for how you’re storing that data and what your efficiency rate is, too.

So, instead of 200%, maybe it’s going to be closer to 125% or 150%. It’s just going to depend as you scale. It’s something to look forward to, though, because it gives you the ability to, one, store more data: bring in more data, hold on to it, and not think so much about “Okay, this is going to take up three times the space just because of how big the file is.” It gives you the ability to hold on to more data and take somewhat more of a risk and say, “Hey, I don’t know that we need that data right now, but let’s hold on to it, because we know we can use erasure coding and store it at a different rate. Then if it’s something we need later on, we can bring it back.” So, think of erasure coding as more of an archive tier for your data in HDFS.
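If you want to experiment with this once you’re on Hadoop 3, the hdfs ec subcommand manages erasure coding policies per directory. Here’s a minimal sketch, assuming a hypothetical /archive directory and the built-in Reed-Solomon 6+3 policy (six data blocks plus three parity blocks, roughly 1.5x storage versus 3x for triple replication):

    # See which erasure coding policies the cluster ships with
    hdfs ec -listPolicies

    # Enable the Reed-Solomon 6+3 policy and apply it to an archive directory
    hdfs ec -enablePolicy -policy RS-6-3-1024k
    hdfs ec -setPolicy -path /archive -policy RS-6-3-1024k

    # Confirm which policy the directory is using
    hdfs ec -getPolicy -path /archive

    # Rough math: 6 data + 3 parity = 9 blocks stored per 6 blocks of data (1.5x),
    # versus 18 blocks (3x) under triple replication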

And so those are the major changes in Hadoop 3.0. I just wanted to talk to you guys about that and just kind of get that out there. Feel free to send me any questions. So, if you have any questions for Big Data, Big Questions, feel free to go to my website, put it on Twitter, #bigdatabigquestions, put it in the comments section here below. I’ll answer those questions here for you. And then as always, make sure you subscribe so you never miss an episode. Always talking big data, always talking big questions and maybe some other tidbits in there, too. Until next time. See everyone then. Thanks.

Show Notes

Hadoop 3.0 Alpha Notes

Hadoop Summit Slides on Japan Yahoo Hadoop 3.0 Testing

DynamoDB NoSQL Database on AWS

Kubernetes

Filed Under: Hadoop Tagged With: Big Data Big Questions, Hadoop, HDFS

How to Find HDFS Path URL?

December 17, 2017 by Thomas Henson 1 Comment