
Certifications Required For Hadoop Administrators?

June 11, 2019 by Thomas Henson 1 Comment


Data Engineers looking to grow their careers are constantly learning and adding new skills. What kind of impact do Hadoop certifications have during the hiring process?

Data Engineers, Developers, and IT professionals in general are known for their abundance of certifications, and everyone has an opinion about how much those certifications reflect real skills. On this episode of Big Data Big Questions, find out my thoughts on Hadoop Admin certifications and whether enterprises are requiring them for Data Engineers.

Transcript – Certifications Required For Hadoop Administrators

It’s the Big Data Big Questions show! Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of… Come on, I just said it. Big Data Big Questions. Today’s question comes in from a user here on YouTube. If you have a question, make sure you put it in the comments section below or reach out to me at thomashenson.com/big-questions. I’ll do my best to answer your questions here, live on one of our shows, or in one of our live YouTube sessions that we’re starting to do on Saturday mornings. Throw those questions in there. Let me know how we’re doing with this channel, and also if you have any questions around data engineering or machine learning. I’ll even take some of those data science questions, as well. Today’s question comes in around certifications in the Hadoop ecosystem. Are certifications required for Hadoop administrators or Hadoop developers? Absolutely, positively not. They’re not required, right?

Now, there may be some places that will require it. I did see that back in my day in software engineering, but in general, they’re not going to require you to have one before gaining entry. Now, it might be nice to have, especially if you’re talking about going into an established team or an established group within your organization that says, hey, we’re on the Hortonworks stack, and we like to have everybody up to par from a certification perspective.

I haven’t seen that a lot specifically in the data engineering field. It is something I’ve seen over the years in software engineering, just not as much here lately. Now, does that mean I’m saying you shouldn’t go get a certification? That’s not what I’m saying at all. Especially if you’re learning and trying to get into the Hadoop ecosystem, and you’re like, hey, where do I really start?

First, you start with Big Data Big Questions.

Really, you can use the certifications. Whether it be from Azure, AWS, Cloudera, Hortonworks, or Google Cloud Platform (GCP), you can take any of their certifications and use them to build out your own learning path. That’s an opportunity there. Even if you’re not going to go down the path of actually getting that certification, if you’re trying to gain information and learn the things you need to know as a good data engineer, whether on the developer side or the administrative side, that’s definitely where I would start.

When we look at the question, it poses more of a philosophical question, if you will, for the data engineering and IT world: how do we feel about IT certifications? I’ve answered this question before. Aaron Banks [Phonetic 00:02:31] and I were talking specifically about IT certifications and whether they are worth it, and we have a full-length video where we really dig into it. I’ll give you a little bit of a preview of my thought process around it.

The way that I look at certifications is, if you’re looking to prove yourself, especially if you’re coming from outside the field, then getting a certification might make you more desirable and help get your application and your brand in front of people. Having a certification does lend some credence in those situations. However, if you’re established in the role, you’ve been doing data engineering, and you have a lot of experience in it, you’re not necessarily going to need that certification. You’ve been proven. You’ve done the due diligence of being in that role, and you’re applying for a role as a data engineer. You don’t necessarily have to go through that certification process.

Like I said, I really think certifications are good whenever we’re talking about, hey, maybe I don’t have the experience in that role and I want to prove that I’m serious. Maybe you’re a web developer like I was, and you’re thinking, “Man, I’d really like to get into Hadoop and this data engineering side of things. Where can I start, or how can I identify myself as somebody who wants to take on that next role?” That’s where a certification is really going to help. You can get that certification and walk through it. But you’re not going to walk in and tell the data engineer who’s been in that role for six years, “I know more than you do, because I have the certification.” That’s not really the case, and that’s probably not what you want to do, especially if you’re new to an organization.

Be honest, and be gentle in your interview process if you have a certification but don’t have the experience. Just say, “Hey, I’m really passionate about it. I’ve been following Big Data Big Questions for some time, and I thought it’d be good to get into the data engineering field. To show how serious I am, I actually went through some of the certification process.” It’s an opportunity for you to stand out from the crowd and show your commitment when you don’t really have the experience. Let me know how you feel about this question and this answer in the comments section below. I’d love to hear your feedback. Also, if you have any questions, make sure you put them in the comments section below, and never forget to subscribe and ring that bell so that you’ll never miss an episode. Thanks again, and I’ll see you on the next episode of Big Data Big Questions.

 

Filed Under: Hadoop Tagged With: Certifications, Data Engineers, Hadoop, Hadoop Admin, Hadoop Distributed File System

Setting Up Passwordless SSH for Ambari Agent

October 26, 2017 by Thomas Henson Leave a Comment

Setting Up Passwordless SSH for Ambari Agent

Want to know one of the hardest parts for me when installing Hadoop with Ambari?

Setting up passwordless SSH for all the nodes so that the Ambari Agent could do the install. Looking back, it might seem like a trivial thing to get right, but at the time my Linux skills were lacking. Plus, I had been cast into the Hadoop Administrator role after only a month as a Pig Developer.

Having a background in Linux is very beneficial to excelling as a Data Engineer or Hadoop Administrator. However, if you have just been thrown into the role or are looking to build your first cluster from scratch, check out the video below on setting up passwordless SSH for the Ambari Agent.

Transcript – Setup Passwordless SSH for Ambari Agent

One step we need to do before installing Ambari is to set up passwordless SSH on our Ambari boxes. So, what we’re going to do is we’re actually going to generate a key on our master node and send those out to our data nodes. I wanted to caution you that this sounds very easy, and if you’re familiar with Linux and you’ve done this a couple of times, you understand that it might be trivial. But if it’s something you haven’t done before or if it’s something you haven’t done in a while, you want to make sure that you walk through this step. One of the reasons that you really want to walk through this before we install anything Ambari-related or Hadoop-related is because this is going to help us troubleshoot problems that we might have with permissions. So, if we know that this piece works, we can eliminate all the other problems.

No problem if you haven’t set it up before. We’re actually going to walk through that in the demo here. But first, let’s just look at it from an architectural perspective. So, what we’re going to do is, on our master node, we’re going to generate both a public and a private key. Then we’re going to share out that public key with all the data nodes, and this is going to allow the master node to log in via SSH with no password into data node 1, 2, and 3. So, since the master node can log in to these, we only have to install Ambari on the master node and then let the master node run the installation on all the other nodes. You’ll see more of that once we get into installing Ambari and the Ambari Agent, but just know that we have to have this public key working in order to have passwordless SSH.

The steps to walk through it are pretty easy. So, what we’re going to do is we’re going to login to our master node and we’re going to create a key. So, we’ll type in ssh-keygen. From there, it’ll generate the public and private key, and then we will copy the public key to data node 1, 2 and 3. Next, we’re going to add that key to the authorized list on all the data nodes. We’re going to test from our main node into data node 1, 2, and 3 just to make sure that our passwordless SSH works and that we can log in as root. Now let’s step through that in a demo.

Now we’re ready to set up passwordless SSH in our environment. So, in my environment, I have node one, which will be my master node, and I’m going to set up passwordless SSH on node 2, 3, and 4. But in this demo, we’re just going to walk through doing it on node 1 and node 2, and then we can just replicate it—the same process—on the other nodes.

So, the first thing we need to do is, on our master node or node 1, we’re going to generate our public key and our private key. So, ssh-keygen. We’re going to keep it defaulted to go into the .ssh folder. I’m not going to enter anything for my passphrase.

You can see there’s a random image and we can run an ll on our .ssh directory, and we see that we have both our public and our private key. Now what we need to do is we need to move that public key over to our data node 1, and then we’ll be able to login without using a password. So, I’m going to clear out the screen, and now what we’re going to do is we’re going to use scp and just move that public key over to node 2.

So, since we haven’t set up our passwordless SSH yet, it will prompt us for a password here. The transfer completes, and now I’m going to log in to node 2, still having to use a password. If we run a quick ll, we can see we have our public key here, and now all we need to do is set up our .ssh directory and add this public key to the authorized keys. So, we’re going to make that directory, change into it, and we can see nothing is in it. Now it’s time to move that public key into this .ssh directory.

Then we have our public key, and now let’s just cat that file into an authorized_keys file. Now we have two files here: our public key, and the authorized_keys file we just wrote, which contains that public key. So, we’re going to exit out. As you can see, I’m back in node 1. So, now we should be able to just SSH in and not be prompted for a password. And you can see, now we’re here in node 2.
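For reference, here is a rough sketch of the commands from the demo. The hostnames node1 and node2, the root user, and the /root/ landing path for the key are assumptions based on my environment, so adjust them for yours. The chmod steps aren’t shown in the demo, but sshd typically requires those permissions before it will accept the key.

# On node1 (master): generate the public/private key pair, accepting the default ~/.ssh location
ssh-keygen
# Copy the public key over to node2 (this transfer still prompts for a password)
scp ~/.ssh/id_rsa.pub root@node2:/root/
# On node2 (data node): create the .ssh directory and append the key to authorized_keys
ssh root@node2
mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat /root/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
exit
# Back on node1: this login should no longer prompt for a password
ssh root@node2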

So that’s how you set up your passwordless SSH. We’ll need to do this for all the data nodes that we’re going to add to the cluster. Once we have Ambari installed on our main node, this will allow Ambari to go out and make changes to all the data nodes and do all the updates and upgrades at one time, so that you’re not having to manage each individual update and upgrade.

Filed Under: Ambari Tagged With: Ambari, Hadoop, Hadoop Distributed File System, Hortonworks, Learn Hadoop

7 Commands for Copying Data in HDFS

May 15, 2017 by Thomas Henson Leave a Comment

What happens when you need a duplicate file in two different locations?

It’s not a hard problem: you just need to copy that file to the new location. In Hadoop and HDFS you can copy files easily. You just have to understand how you want to copy and then pick the correct command. Let’s walk through all the different ways of copying data in HDFS.
Copying Data in HDFS

HDFS dfs or Hadoop fs?

Many commands in HDFS are prefixed with either hdfs dfs -[command] or the legacy hadoop fs -[command], although not all hadoop fs and hdfs dfs commands are interchangeable. To ease the confusion, below I have broken down both the hdfs dfs and hadoop fs copy commands. My preference is to use the hdfs dfs prefix over hadoop fs.
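If you’re ever unsure of a command’s exact syntax, both prefixes include built-in help. For example:

hdfs dfs -help cp
hdfs dfs -usage copyFromLocal
hadoop fs -help copyToLocal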

Copy Data in HDFS Examples

The example commands assume my HDFS data is located in /user/thenson and local files are in the local /tmp directory (not to be confused with the HDFS /tmp directory). The example data is the loan data set from Kaggle. Using the same data set or file structure isn’t necessary; it’s just a frame of reference.

Hadoop fs Commands

Hadoop fs cp – Easiest way to copy data from one source directory to another. Use hadoop fs -cp [source] [destination].
hadoop fs -cp /user/thenson/loan.csv /loan.csv
Hadoop fs copyFromLocal – Need to copy data from local file system into HDFS? Use the hadoop fs -copyFromLocal [source] [destination].
hadoop fs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv

Hadoop fs copyToLocal – Copying data from HDFS to local file system? Use the hadoop fs -copyToLocal [source] [destination].
hadoop fs -copyToLocal /user/thenson/loan.csv /tmp/
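A quick way to confirm a copy landed where you expected is to list the destination. For example, to check the HDFS directory and the local file from the examples above:

hadoop fs -ls /user/thenson
ls -l /tmp/loan.csv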

Copying Data in HDFS

HDFS dfs Commands

HDFS dfs cp – Easiest way to copy data from one source directory to another. The same as using hadoop fs -cp. Use hdfs dfs -cp [source] [destination].
hdfs dfs -cp /user/thenson/loan.csv /loan.csv
HDFS dfs copyFromLocal – Need to copy data from local file system into HDFS? The same as using hadoop fs -copyFromLocal. Use hdfs dfs -copyFromLocal [source] [destination].
hdfs dfs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv
HDFS dfs copyToLocal – Copying data from HDFS to local file system? The same as using hadoop fs -copyToLocal. Use the hdfs dfs -copyToLocal [source] [destination].
hdfs dfs -copyToLocal /user/thenson/loan.csv /tmp/loan.csv

Hadoop Cluster to Cluster Copy

Distcp used in Hadoop – Need to copy data from one cluster to another? Use DistCp, Hadoop’s distributed copy, to move the data with a MapReduce job. In the command below, the original data exists on the namenode cluster in the /user/thenson directory and is being transferred to the newNameNode cluster. Make sure to use the full HDFS URL in the command: hadoop distcp [source] [destination].
hadoop distcp hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson
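When re-running a cluster-to-cluster copy, DistCp’s -update and -overwrite flags control how files that already exist at the destination are handled, and -p preserves attributes such as permissions and ownership. A sketch reusing the example cluster names from above:

hadoop distcp -update -p hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson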

It’s the Scale that Matters

While copying data is a simple matter in most applications, everything in Hadoop is more complicated because of the scale. When copying data in HDFS, make sure you understand the use case and the scale, then choose one of the commands above.
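Scale is also where DistCp’s tuning options come in: the -m flag caps the number of map tasks, and therefore the parallelism, of the copy job. A sketch, again using the hypothetical cluster names from the example above:

hadoop distcp -m 20 hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson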

Interested in learning more HDFS commands? Check out my Top HDFS Commands post.

Filed Under: Hadoop Tagged With: Hadoop, Hadoop Distributed File System, HDFS, HDFS Commands, Learn Hadoop

Top HDFS Commands

July 5, 2016 by Thomas Henson 6 Comments