Thomas Henson

Ultimate Battle: TensorFlow vs. Hadoop

October 4, 2019 by Thomas Henson 1 Comment

TensorFlow vs. Hadoop

The Battle for #BigData 

This post has been a long time coming!

Today I talk about the differences between TensorFlow and Hadoop. While Hadoop was built for processing data in a distributed fashion, there are some comparisons with TensorFlow. One is that both trace their roots to Google engineering: Hadoop grew out of Google's MapReduce and GFS papers, while TensorFlow came directly out of Google. Another is that both were created to bring insight to data, although they take different approaches to that mission.

Who is now the king of #BigData? To be fair, the comparison is not like for like, but the two are often bound together as if it has to be one or the other. Find my thoughts on TensorFlow vs. Hadoop in the latest episode of Big Data Big Questions.

Transcript – Ultimate Battle: TensorFlow vs. Hadoop

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today's question is really a conversation that came from my little brother, who was telling me about something he heard at a conference. He brought it to my attention: "Hey, Thomas, you're involved in big data. I was talking to some folks at a GIS conference around Hadoop and TensorFlow. One person came up to me and said, 'Ah! Hadoop's dead. It's all TensorFlow now.'" I really wanted to take today to talk about the differences between Hadoop and TensorFlow, and do a level set for all the data engineers out there, all the big data developers, or people that are just interested in finding out, "Okay, what's happening in the marketplace?" Today's question comes in around TensorFlow versus Hadoop, and we'll find out all the things we need to know from a data engineering perspective. In the end, we'll even talk about which one's going to be around in five years. Find out more right after this.

Welcome back. Today, as promised, we're going to tackle the question of which is better: what are the differences between TensorFlow and Hadoop, and where does each fit in data analytics, the marketplace, and solving the world's problems? If you're watching this channel, and you're invested in the data analytics community, you know how we feel about it: we're passionate about being able to solve problems using data. First we're going to break them down, and then at the end, we're going to talk about some of the differences, where we see the market going, and which one is going to make it in five years. Or, will both? Who knows.

First, what is TensorFlow? We've talked about it a good bit on this channel, but TensorFlow is a framework to do deep learning. Deep learning is a subset, a branch of machine learning, and it's all about processing data. The really cool thing about TensorFlow, and the reason TensorFlow and similar frameworks in the deep learning realm are so awesome, is that they give you the portability to run and analyze your data on your local machine or even spread it out in a distributed environment. It comes with a lot of different algorithms and neural networks that you can use and incorporate into solving problems. One of the cool things about deep learning is the ability to analyze more video data or do voice recognition, right? If you're going on Instagram or YouTube, and you're looking for examples on deep learning, chances are somebody's going to build some kind of video or photo identification that will help you identify a cat. That's the classic example you'll see: "Hey, can we detect a cat by feeding in data, and looking, and analyzing this?"

TensorFlow doesn't use Hadoop, but TensorFlow uses big data. You use these large data sets to train models that can be used on edge devices. If you've ever used a drone, or if you've ever used a remote control with natural language processing to change the channel, then you've used some portion of deep learning or natural language processing. Not saying it's TensorFlow, but that's the kind of thing TensorFlow really does. It's very popular, developed by Google, open sourced, and housed by Google. There are a lot of free resources out there, and for data scientists and machine learning engineers, it's a very, very exciting product for being able to start analyzing your data quicker. Couple together the excitement for deep learning with the ease of use of TensorFlow, and that's why the market has been so hot for TensorFlow and those other frameworks.
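To make that concrete — and this is my sketch, not something from the episode — here's roughly what a tiny "cat or not a cat" classifier looks like in TensorFlow's Keras API. The layer sizes and the 64x64 input shape are illustrative assumptions:

```python
# Minimal sketch: a binary image classifier ("cat or not a cat") in TensorFlow.
# Layer sizes and the 64x64 input shape are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of "cat"
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training would look like: model.fit(train_images, train_labels, epochs=5)
# where train_images is a batch of 64x64 RGB images scaled to [0, 1].
```

The point isn't the specific layers; it's that a few lines define a neural network you can train on a laptop and later distribute or push out toward edge devices.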

What is Hadoop? Hadoop, it's all about elephants, right? Hadoop has really been around since, I don't know, we're probably 12 to 13 years into it being open source. Think back to analyzing data coming in from the web, being able to index the entire web: Google helped develop the ideas behind that, and Yahoo, and a lot of the teams from Cloudera and Hortonworks, really helped push Hadoop into the open source arena. Hadoop is synonymous with saying big data. You can't say big data without thinking about Hadoop.

Hadoop's been around for a long time, and there are a lot of different components to it. On this channel, whenever we talk about Hadoop, we're really talking about the ecosystem: the ability to process data, but also the ability to store large amounts of data with HDFS, the Hadoop Distributed File System. There are APIs, and there are other tools that help you do it, but one of the things I really like to think about when we talk about why Hadoop was so record-breaking, why it really opened the market for big data, is the ability to set up distributed systems and analyze large amounts of data. These large amounts of data would be more on the unstructured side, so think of data not sitting in a database, though a lot of it would still be text-based. A very popular example is setting up an API to pull in Twitter data and doing sentiment analysis over it. Not so much deep learning; the ecosystem is trying to get into the deep learning area right now, but it's more machine learning, using algorithms like singular value decomposition or k-nearest neighbor, over large sets of data, across multiple machines.

Hadoop's been around for a while and is more seen as replacing the enterprise data warehouse. With TensorFlow now on the scene, where does Hadoop fit in, what's going on, and what are some of the differences?
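To picture the MapReduce-style processing described above, here's a hedged sketch of the classic Hadoop Streaming pattern: a tiny Python mapper that tags tweets as happy or unhappy. The word lists and file name are made-up illustrations, not a real sentiment lexicon:

```python
#!/usr/bin/env python
# mapper.py -- a toy Hadoop Streaming mapper for sentiment counting.
# The word lists are illustrative assumptions, not a real sentiment lexicon.
import sys

HAPPY = {"love", "great", "awesome", "happy"}
UNHAPPY = {"hate", "terrible", "awful", "unhappy"}

for line in sys.stdin:
    for word in line.lower().split():
        if word in HAPPY:
            print("happy\t1")    # emit key<TAB>value pairs for the reducer
        elif word in UNHAPPY:
            print("unhappy\t1")
```

Hadoop Streaming would fan this script out across the cluster over tweets stored in HDFS and pair it with a reducer that sums the 1s per key; the heavy lifting is the distribution, not the script.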

Hadoop was written in Java. TensorFlow was written in C++. Both of them have APIs. For the processing of data in Hadoop, you can do it in Java, you can do it in Python, you can do it in Scala; there are a lot of different options from a Hadoop perspective. TensorFlow, too: you can use C++, and you can also use Python. Python is one of the more popular ones; I actually did a course using TFLearn and TensorFlow to show that.

When we think about the tools, it's a little bit different. With Hadoop, we're actually building out a distributed system, and then we're using things like Spark to analyze that data. We're going to pull insight from that data, back to our sentiment analysis that's going to say, "Hey, when we see these specific words, this tweet is unhappy," or, "This tweet is happy." TensorFlow, same thing: it's more of a processing engine, a framework, to pull in the data, analyze it, and give you insights on whether that image contained a cat or not a cat. You're starting to see some of the differences. We talked about Python versus Java, and both have different APIs you can use. I'll say right now that I haven't seen a lot of Java with TensorFlow, but I'm sure somebody has an API or some kind of framework out there that works for it.

Another big difference is the way the processing is done. The Hadoop ecosystem is really trying to get into it right now, but from a TensorFlow perspective, we're really seeing it on GPUs, right? Think of being able to use GPUs to process data 10 to 20 times faster than what we see on a CPU. Hadoop is more CPU-based: the way we solve problems with Hadoop is we throw a lot of CPUs in a distributed model to process the data and then pull it back in. TensorFlow, same thing, distributed networks. As you start to scale out your data, you really need to distribute those systems, but we're doing it with GPUs, and that's speeding up the process. A little bit of a difference there, just in the approach, but that's one of the big key differences.

If we're data engineers, and we're evaluating these, where do they come in? Ease of use: with Hadoop, you're building out your distributed system. It's really Java-based, so if you have a Java background, that's really good, but you can get by without it in some areas. It's not exactly an ease-of-use comparison, but if we're talking about just standing something up and starting to mess around with it, it's going to be a little more complicated and harder from a Hadoop perspective than with TensorFlow. With TensorFlow, you can actually point at an NFS file system; you can feed in data from different file systems. With Hadoop, you're building that system out, and also building out a file system, distributed systems, disaster recovery, and some of the other components. It's harder from a Hadoop perspective, but there's more expertise in it, because you're building out a whole solution set, versus TensorFlow, which is the processing system you're using. If somebody tries to talk to you about that comparison, explain that these are two different systems, right?

When we're talking about which one to use, that's really what it comes down to. If you're looking at a project, and somebody says, "Hey! Should we use TensorFlow here, or Hadoop?", it's going to be pretty easy to spot, I think. If you think of Hadoop, think of something that's replacing or falling in line with the enterprise data warehouse. What are we doing? Do we have massive amounts of data? It could be structured or semi-structured, but you're wanting to offload it and run huge analytics over it. That's probably going to be Hadoop; we're probably building out that system when we think of the traditional enterprise data warehouse. That's the bucket we're going to fall in. If we're talking about doing some sort of artificial intelligence or some things with deep learning, maybe not so much in the machine learning area, you're going to want to look at TensorFlow. Especially listen for keywords like: what are we doing with images, or video, or voice? For any of those media-rich types of data, you're probably going to use TensorFlow. If you have machine learning engineers and data scientists, and you're trying to do rich media, TensorFlow is going to be the really popular one. If you have more data analysts, and even data scientists, but you're looking at large amounts of data sitting in some kind of structured, standardized system and wanting to marry it together, then Hadoop may be your bucket.

Which one of these is going to be around in five years? I think they'll both be around, but I will say the popularity of Hadoop will continue to some degree, more in continuing to replace that enterprise data warehouse: think of what you do from a traditional perspective in holding all your company's information. Where we're seeing more product development, more media-rich things being done from an artificial intelligence perspective, we'll see more TensorFlow. Will TensorFlow still be the number one deep learning framework in five years? I can't answer that here. Would I learn it if I were just starting out as a data engineer? Yeah, definitely. Definitely from the perspective of, I want to learn how to implement it and how to use it. You don't have to become an expert; we're not trying to become data scientists here. But start looking at some of the frameworks, going through some of the simple examples they have, and then make heavy use of Docker, containers, and that whole world of being able to build those out. That'll help you if you're really trying to look into, hey, what could be next for data engineers? Or, what's going on now? What's cutting edge? I hope you enjoyed this video. If you have any comments, or if I missed something, put it in the comments section here below. I'm always happy to carry on the discussion. Until next time, see you again on Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Tensorflow Tagged With: Data Engineering, Hadoop, Tensorflow

Spark vs. Hadoop 2019

June 12, 2019 by Thomas Henson Leave a Comment

Spark vs. Hadoop 2019

In 2019, which skill is in more demand for Data Engineers: Spark or Hadoop? As career or aspiring Data Engineers, it makes sense to keep up with which skills the market demands. Today Spark is hot and Hadoop seems to be on its way out, but how true is that?

Hadoop, born out of the web search era and part of the open source community since 2006, has defined Big Data. However, Spark's release into the open source Big Data community, boasting 100x faster processing for Big Data, created a lot of confusion about which tool is better and how each one works. Find out what Data Engineers should be focusing on in this episode of Big Data Big Questions: Spark vs. Hadoop 2019.

 

Transcript – Spark vs. Hadoop 2019

Hi folks. Thomas Henson here with thomashenson.com, and today is another episode of Wish That Chair Spun Faster. …Big Data Big Questions!

Today’s question comes in from some of the things that I’ve been seeing in my live sessions, so some of the chats, and then also comments that have been posted on some of the videos that we have out there. If you have any comments or have any ideas for the show, make sure you put them in the comments section here below or just let me know, and I’ll try my best to answer these. This question comes in around, should I learn Spark, or should I learn Hadoop in 2019? What’s your opinion?

A lot of people are just starting out, and they're like, "Hey, where do I start? I've heard Spark, I've heard Hadoop's dead. What do we do here?" How do you tackle it?

If you've been watching this show for a long time, you've probably seen me answer questions similar to this and compare the differences between Spark and Hadoop. This is still a viable question, because I've actually changed a little bit the way I think about it, and I'm going to take a different approach with the way I answer it, especially for 2019.

In the past I’ve said that it really just depends on what you want to do. Should you learn Spark? Should you learn Hadoop? Why can’t you learn both? Which, I still think, from the perspective of your overall learning technology and career, you’re probably going to want to learn both of them. If we’re talking about, hey, I’ve only got 30 days, 60 days. “I want the quickest results possible, Thomas.” How can I move into a data engineer role, find a career? Maybe I just graduated college, or maybe I’m in high school, and I want to get an internship that maybe turns into a full-time gig. Help me, in the next 30 to 90 days, get something going.

Instead of saying it depends, I'm really going to tell you that I think it's going to be Spark. That's a little bit of a change, and I'll talk about some of the reasons why I think that changed, too. Before we jump into that, let's talk a little bit about some of the nomenclature around Hadoop. When we talk about Hadoop, a lot of the time we're talking about Hadoop, and MapReduce, and HDFS as this whole piece. From the perspective of writing MapReduce jobs and processing our data, Spark is far and away the leader. MapReduce is being decoupled, has been decoupled, and more and more jobs are not written in MapReduce. They're written with Flink, or Spark, or Apache Beam, or even [inaudible] on the back-end. That war has been won by Spark for the most part. Secondly, when we talk about Hadoop, I like to talk about it from an ecosystem perspective. We're talking about HDFS, we're talking about even Spark included in that, and Flume, all the different pieces that make up what we call the big data ecosystem. We just call that the Hadoop ecosystem.

The way I'm answering this question today is: hey, I'm looking for something in 2019 that could really move the needle. What do I see that's in demand? I see Spark is very, very much in demand, and I even see Spark being used outside of just HDFS as well. That's not saying that if you've learned Hadoop or you've learned HDFS you've gone down the wrong path. I don't think that's the case, and I think that's still viable. But if you're asking me what you can do to move the needle in 30 to 90 days, digging down and becoming a Spark developer opens up a career option. That's one of the quickest ways you can get there, and it's one of the big things we've seen out there with the roles, roles for data engineers.

Another huge advantage, and we've talked about it a little bit on this channel, is the big announcement around where Databricks is going from an investment perspective and what their valuation is. They're at a $2.5 billion valuation, and they're huge in the Spark community. They're part of the incubators and on a lot of steering committees for Spark. They have some tools and everything that they sell on top of that, but it's really opened my eyes to what's out there. I knew Spark was pretty big, but the fact of where Databricks is going, I think that says a lot about what we're seeing.

Another point, too, and you've heard me talk about it a good bit: where we're going with deep learning frameworks and bringing them into the core big data area. Spark is going to be that big bridge, I believe. People love to develop in Spark. Spark's been out there, and with Project Hydrogen and some of the other things that are coming, it gives you the opportunity to do ETL over GPUs, but also to import data and be able to implement and use TensorFlow or PyTorch, or even Caffe2.

If you're looking in 2019 to choose between Spark and Hadoop to find something in the next 30 to 90 days, I would go all in with Spark. I would learn Spark, whether it be from Java, Scala, or Python, and start doing some tutorials around it, being able to code, being able to build out your own projects. I think that's going to really open your eyes, and that can really get the needle moving. At some point, you'll want to go back and learn how to navigate data with HDFS, how to find things that are going on in the Hadoop ecosystem, because it's all one big piece here. But if you're asking me the one thing to do to move the needle in 30 to 90 days: learn Spark.
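If you're wondering what those first tutorials look like, here's a minimal PySpark sketch; the file path and column names are placeholder assumptions, not from the episode:

```python
# Minimal PySpark sketch: count events per user in a CSV.
# The file path and column names are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("first-spark-job").getOrCreate()

# Read a CSV with a header row, letting Spark guess the column types.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Aggregate: events per user, biggest first.
(df.groupBy("user_id")
   .agg(F.count("*").alias("events"))
   .orderBy(F.desc("events"))
   .show(10))

spark.stop()
```

That's a realistic first project: read a file, aggregate, show results — then graduate to running the same job with spark-submit against data on a cluster.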

Thanks again. That’s all I have today for Big Data Big Questions. Remember, subscribe and ring that bell, so you never miss an episode of Big Data Big Questions. If you have any questions, put them in the comment section here below. We’ll answer them on Big Data Big Questions.

 

Filed Under: Hadoop Tagged With: Big Data, Data Engineers, Hadoop, Spark

Certifications Required For Hadoop Administrators?

June 11, 2019 by Thomas Henson 1 Comment

Certifications Required For Hadoop Administrators
Hadoop Certifications

Data Engineers looking to grow their careers are constantly learning and adding new skills. What kind of impact do Hadoop Certifications have during the hiring process?

Data Engineers, Developers, and IT professionals in general are known for their abundance of certifications. Everyone has an opinion as well about how much those certifications say about real skills. On this episode of Big Data Big Questions, find out my thoughts on Hadoop Admin Certifications and whether Enterprises are requiring them for Data Engineers.

Transcript – Certifications Required For Hadoop Administrators

It's the Big Data Big Questions show! Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of… Come on, I just said it. Big Data Big Questions. Today's question comes in from a user here on YouTube. If you have a question, make sure you put it in the comments section here below or reach out to me at thomashenson.com/big-questions. I'll do my best to answer your questions here on one of our shows, or in one of our live YouTube sessions that we're starting to do on Saturday mornings. Throw those questions in there. Let me know how we're doing with this channel, and also if you have any questions around data engineering or machine learning. I'll even take some of those data science questions, as well. Today's question comes in around certifications in the Hadoop ecosystem. Are certifications required for Hadoop administrators/Hadoop developers? Absolutely, positively not. They're not required, right?

Now, there may be some places that will require you to have one. I did see that back in my day in software engineering, but in general, they're not going to require you to have that before gaining entry. Now, they might be nice to have, especially if you're going into an established team or an established group within your organization that says, hey! We're on the Hortonworks stack, and we like to have everybody up to par from a certification perspective.

I haven't seen that a lot specifically in the data engineering field. It is something I've seen over the years in software engineering, just not as much here lately. Now, does that mean I'm saying you shouldn't go get a certification? That's not what I'm saying at all. Especially if you're learning and trying to get into the Hadoop ecosystem, and you're like, hey, where do I really start?

First, you start with Big Data Big Questions.

Really, you can use the certifications. Whether it be from Azure, AWS, Cloudera, Hortonworks, or Google's Cloud Platform, GCP, you can take any of their certifications and use them to build out your own learning path. That's an opportunity there. Even if you're not going to go down the path of actually getting the certification, if you're trying to gain information and learn the things you need to know as a good data engineer, whether on the developer side or the administrative side, that's definitely where I would start.

When we look at the question, it poses more of a philosophical question, if you will, for the data engineering and IT world: how do we feel about IT certifications? I've answered this question before. Aaron Banks and I were talking specifically about IT certifications and whether they're worth it, and we have a full-length video where we really dig into it. I'll give you a little bit of a preview of my thought process around it.

The way I look at certifications is this: if you're looking to prove yourself, especially from outside the field, then getting a certification might benefit you by making your application more desirable and getting your brand in there. Having a certification does lend some credence in those situations. However, if you're established in the role, and you've been doing data engineering, and you have a lot of experience in it, you're not necessarily going to need that certification. You've been proven. You've done the due diligence of being in that role, and if you're applying for a role as a data engineer, you don't necessarily have to go through that certification process.

Like I said, I really think certifications are good whenever we're talking about, hey, maybe I don't have experience in that role, and I want to prove myself. Maybe you're a web developer like I was, and you're like, "Man, I'd really like to get into this Hadoop and data engineering side of things. Where can I start, or how can I identify myself as somebody who wants to take on that next role?" That's where a certification is really going to help. You can get that certification and walk through it. But you're not going to walk in and say, "Hey, I've got the certification," to Mr. or Miss Data Engineer who's been in that role for six years, "I know more than you do, because I have the certification." That's not really the case, and that's probably not what you want to do, especially if you're new to an organization.

Be honest, and be humble in your interview process if you have a certification but don't have the experience. Just say, "Hey, you know, I'm really passionate about it. I've been following Big Data Big Questions for some time, and I thought it'd be good to get into the data engineering field. To show how serious I am, I actually went through some of the certification process, too." It's an opportunity for you to stand out from the crowd and show your commitment when you don't really have experience. Let me know how you feel about this question and this answer. Put it in the comments section below; I'd love to hear feedback. Also, if you have any questions, make sure you put them in the comments section here below, and never, never forget to subscribe and ring that bell, so that you'll never miss an episode. Thanks again, and I'll see you on the next episode of Big Data Big Questions.

 

Filed Under: Hadoop Tagged With: Certifications, Data Engineers, Hadoop, Hadoop Admin, Hadoop Distributed File System

Learn HDFS Without Java?

June 3, 2019 by Thomas Henson 1 Comment

Learn HDFS without Java

HDFS Skills Without Java

In the world of Hadoop and Big Data, HDFS is king. Data Engineers looking to boost their administrative skills first learn to navigate the Hadoop Distributed File System (HDFS) before jumping to more complex tasks. If Hadoop is written in Java, does that mean HDFS requires knowing Java programming? In this video I break down what HDFS is and how to learn it without needing to know Java. Find out more by watching this episode of Big Data Big Questions.

Transcript – Learn HDFS Without Java?

Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today's question came in from a live session. If you're not familiar, I do a live session sometimes on the weekends, and I'm thinking about incorporating another one. If you'd like to be a part of those, make sure you check them out. I'll post those. Also, let me know if there's a better time for me to do these. If you'd like to see maybe a Wednesday night or a Tuesday night episode, let me know. Put it in the comments section here below, and if you have a question, go ahead and throw it down just like this one.

This one came out of my live session, one of the last questions as I dropped off, so I wanted to make sure I was getting it done and out there. The question is: can you learn HDFS without Java? It's a little bit similar to some of the other questions I've answered around Hadoop and MapReduce: can you do Hadoop, or MapReduce, or Spark without Java? This one takes a twist toward the administrator side. We've discussed before the difference between the big data developer and the big data administrator, and this one is more around administration. On the other questions, can you do Spark, can you do MapReduce without Java, the answer was always, hey, it depends. You absolutely can, but there might be an instance where you need to import something or work with somebody that's already using it. For this one: no Java.

You're cleared. You don't have to worry about that, and one of the reasons is, if you think about it from an administrator perspective, really what we're trying to do is move data around and understand some of the other tasks, like updates. I did a whole course around HDFS from the command line. You can go through that course and never do anything Java-related. It's pretty cool to be able to go in and do that; it's more about configuration files from that perspective. No worries, no need for Java to be able to do HDFS. From a high level, let's look at some HDFS commands and understand what we're talking about when we say, "Hey, no need for Java," but more of a need for Linux. All the commands we're going to look at are hdfs dfs commands; there's a quick sketch of them just below. To list out the files you have, you'll use hdfs dfs -ls. This command will show you everything that's in a directory, right? We're looking at the files we have in a directory, and it's really similar to what you would do if you just logged in to your favorite version of Linux and did ls from the command line.
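Here's roughly what those commands look like at the prompt; the paths, file names, and user names are made-up examples:

```bash
# List files in a directory -- same idea as `ls` in Linux
hdfs dfs -ls /user/thomas

# Make a new directory -- same idea as `mkdir`
hdfs dfs -mkdir /user/thomas/tweets

# Copy a local file into HDFS
hdfs dfs -put tweets.json /user/thomas/tweets/

# Manage permissions -- same idea as `chmod`/`chown`
hdfs dfs -chmod 750 /user/thomas/tweets
hdfs dfs -chown thomas:analysts /user/thomas/tweets
```

If you know ls, mkdir, and chmod from Linux, you already know the shape of all of these.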

A lot of these commands are going to be the same. Like I said, I have a course that digs through all these different commands. But look at the second command in the sketch above: hdfs dfs -mkdir. What do you think we're doing there? If you have a background in Linux, you understand we're just making directories.

Lastly, some of the things you'll also want to bring from a Linux perspective are permissions. How can you ensure that Bob doesn't have access to a file he doesn't need access to, or that the HDFS user is allowing other users to create files? That's where permissions come in. Like I said, it's similar to what we do from a Linux perspective, and I have a course that's all around this if you're interested in checking it out, but these are some of the commands and skills you'll need to be an HDFS administrator. I've also got some other resources that I'll put in the description that walk through some quick tutorials you can start using. None of these commands are anything you need to recite; I actually created some of those blog posts because I couldn't remember the commands myself.

Like I was saying, it's mostly a Linux perspective, so no need to worry, no need to panic about, "Man, how am I going to learn Java if I want to be an HDFS administrator or start working in HDFS?" You're totally able to do it, and you can see just how simple it was to jump in here. If you want to run some of the commands we just showed, go download one of the sandboxes or set up a Hadoop environment of your own. That gives you the ability to play with it in your own lab and start building out some of those other requirements. Now, thanks for tuning in. Thanks for the question. If anybody has a question, make sure you put it in the comments section here below. I'll try my best to answer them, and we'll see you on another episode of Big Data Big Questions.

Filed Under: Hadoop Tagged With: Hadoop, HDFS, Java

Freelance Hadoop Administrative Roles

May 25, 2018 by Thomas Henson Leave a Comment

Freelance Hadoop Admin

Freelance Hadoop Admin Roles

A lot of the world's economy is shifting to a freelance/contracting model, or what Seth Godin terms a "gig" economy. For Data Engineers heavy on the development side of Hadoop projects, that is an easy transition: software development projects have a natural flow with a start and end point. How does that work for the data engineers who are Hadoop Admins? Traditionally, Operations or Ops roles are full time with no end in sight. In fact, most have on-call hours where Administrators have to be available 24/7. How can these types of Data Engineers find Freelance Hadoop Admin Roles? Find my thoughts on what a Freelance Hadoop Admin role looks like and where to look in the video below.

 

Transcript – Freelance Hadoop Admin Roles

Hi, folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. In today's episode, I'm going to tackle a question about what projects freelance Hadoop engineers, or Hadoop administrators, can take on. What projects in the freelance world are going to translate well and be good targets if you're looking to grab some kind of freelance Hadoop administration job?
Find out more, right after this.

Welcome back. Before we jump into today’s question, I just want to remind you. If you have any questions, submit them in the comments section here, below, and then also, make sure you subscribe to my channel, so that you never miss an episode. I will answer as many of these questions as I can get to. I just need you to keep coming in, and submitting the questions, and giving me feedback on the types of content that you’d like to see.

Thank you everyone for subscribing. I really appreciate it, and now let’s jump into today’s Big Data Big Question. My question comes in from a YouTube comment. What freelance projects can be done by Hadoop administrators?

This one's a little bit tougher, I believe, than when we talk about data engineers on the development side. I feel like those projects are a little bit easier to find, as far as being able to bid for them, and new projects come in on the development side a lot. When you think about the administrative side, think about continuously holding up that operations side for Hadoop. You've got a Hadoop cluster. You're continually adding new clusters. You're patching it. You're doing the day-to-day operations. Those are more permanent roles.

It's a little bit harder to find a freelance position for a couple of months, or something like that, on the administrative side than on the development side. However, I will say, if you're looking to fill those roles, you will find contracts; they just won't look like development contracts, where a project comes in and may take you two weeks or two months. I believe the Hadoop administrator roles are going to run a little longer if you're looking for a contract position. A lot of those are going to be more consultative, so think of new, emerging companies. They're starting up their Hadoop environment, jumping into the Hadoop ecosystem. They don't really have a basis for how they're going to do it, so they're looking for people to come in and be the knowledge experts who help them get off the ground.

That kind of engagement is probably going to be at least three months, probably six, maybe even a little longer. These are more long-term contracts, in my opinion, that you're going to be able to find. The cool thing about these roles is, if you have a background as a trainer, or a desire to help lead and teach other people (that's one of my passions), these are the kinds of roles where you get to do that. Not only do you get to be hands-on and technical, you get to help a team that is brand new to the Hadoop ecosystem, maybe with a ton of experience in other areas, and you get to draw on that experience and help them build out their first Hadoop cluster and start working on some of their first use cases. It can be very rewarding.

If you're looking for these roles, and that's probably what the question is referring to, I would look for companies that are just starting to dive into the Hadoop ecosystem. Look to see who's hiring Hadoop administrators and some of those other roles, and then just reach out and contact those companies. Let them know. Give them your background, and talk to them a little bit about some of the technologies they're working on. If you have anything that you've been contributing to or working with in the open source community, around HDFS, or Ambari, or any of the administrative tools like ZooKeeper, that's where you can really shine. You can say, "Hey, look. I'm involved in this community. I'd love to come in, have a conversation, and talk to you about how you're setting up your Hadoop cluster, where some of the troubleshooting issues are going to come up, and what are some of the things I've seen with my experience that I think you should look out for and can help with."

Those are going to be amazing roles. Like I said, it's probably going to be harder than for the data engineer who focuses more on the development side, but in my opinion it could be a lot more rewarding, because you're probably going to be more of a consultant, and you're going to be running a team. You're still hands-on in the tech, but you're actually able to train and communicate to others how they're going to have this system up and running a long time after your engagement ends.

Thanks for the question, and make sure you subscribe to the channel, and then we will see you on the next episode of Big Data Big Questions.

Show Notes Links

Ambari 

Ambari Meetup

Ambari Mailing List

HDFS Documentation 

Hadoop Mailing List 

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Hadoop Tagged With: Data Engineer, Hadoop, Hadoop Admin

What is the Difference Between Spark & Hadoop

May 14, 2018 by Thomas Henson 1 Comment

Hadoop & Spark

Spark & Hadoop Workloads are Huge

Data Engineers and Big Data Developers spend a lot of time developing their skills in both Hadoop and Spark. For years, Hadoop's MapReduce was king of the processing portion of Big Data applications. However, for the last few years Spark has emerged as the go-to for processing Big Data sets. Still, it can be unclear what the differences are between Spark & Hadoop. In this video I'll break down the differences every Data Engineer should know between Spark & Hadoop.

Make sure to watch the video below to find out the differences and subscribe to never miss an episode of Big Data Big Questions.

 

Transcript – What is the Difference Between Spark & Hadoop

Hi, folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. So, today's question is one I've been wanting to tackle for a long time. I'm not sure why I haven't gotten to it, but I'm ready to now. It's the ultimate showdown: what's the difference between Hadoop and Spark, and which one will win in the fight? Find out how I'll answer that question right after this.

Welcome back. So, today’s question comes in from a user. It came on through a YouTube comment section. So, post your question down here below. You can actually go to my website and go to Big Questions. So, thomashenson.com/bigquestions. Put it out on Twitter. Use the hashtag Big questions. I’ll look it up, try to answer those questions.

So, today’s question comes in and it says, YouTube comment, “Nowadays, there are predominantly two softwares that are used for dealing with big data, Hadoop Ecosystem and Spark. Could you elaborate on the similarities and differences in those two technologies?”

So, that's an amazing question. It's one that we hear all the time. Hadoop is a very mature technology; it's been out there, and it's associated with a lot of things going on in the Big Data community. When you say big data, it's almost synonymous: you're going to say Hadoop as well. But with Hadoop being over 10 years, maybe 13 years old, just depending on how you look at it, a lot of people are calling for its death, and Spark is the one that's going to do it.

But there's a little bit of difference. Like I said, we say that Hadoop is this all-encompassing thing. You hear me say it all the time: the Hadoop Ecosystem. I call it an ecosystem because a lot of things get pulled into it. A lot of people assume that Hadoop runs and does all the processing, and has all the functionality for your applications. But in a lot of data centers, you can run big data clusters and not be using Hadoop, or not be using MapReduce.

And so, let me explain a little bit what I mean by the true definition of Hadoop, and then we'll talk a little bit about Spark. So, Hadoop is built of two components; we separate it out into two different pieces. The first one we're going to break down is MapReduce. You've probably heard of MapReduce. That's what started it all: being able to process large datasets, in somewhat of an indexing way of working with data. So, if you have a cluster, you're able to run your mapper and your reducer jobs, and process data that way, and that functionality is called MapReduce. That's one portion of Hadoop.

Another part of Hadoop, the really cool part, the part that I've been involved with a ton, is called the Hadoop Distributed File System, or HDFS. HDFS is the way all the data is stored. So, we have MapReduce controlling how the data is going to be processed, but HDFS is how we store that data. And so many applications, whether they're in the Hadoop Ecosystem or new to data processing, or even just scripting, use Hadoop, or HDFS, to pull data and use it as a file system.

And so, you have those two pieces right there and those two components.

When they talk about Hadoop being old, or Hadoop being slow, or portions of Hadoop that people aren’t interested in, most of the time they’re talking about the MapReduce portion. And so, there’s been a lot of things that have come out. So, there’s been MapReduce 1, and then MapReduce version 2, and Tez, and just different components around to compete with MapReduce, and Spark is one of those technologies as well.

And so, Spark is a framework. It's called lightning fast, but it's a framework for processing data. You can still process your data that exists in HDFS, or that exists in S3. There are other places your data can exist and be processed by Spark, but predominantly, most data centers still have their data in HDFS. So, things were built upon HDFS. HDFS is where your data is housed, and you process it whether you're using Spark, whether you're using Tez, whether you're using any new way of processing the data, or whether you're still using MapReduce, but you can have all of that in HDFS. So, when you think about it, the two do compete, but primarily as processing engines.
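That decoupling is easy to see in code: the same Spark job can point at HDFS or S3 just by changing the path scheme. A minimal sketch, with made-up paths:

```python
# Same Spark word count, different storage: only the path scheme changes.
# The paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("storage-agnostic").getOrCreate()

# Reading from HDFS...
logs = spark.read.text("hdfs:///data/logs/2018/")

# ...or from S3 would just be:
# logs = spark.read.text("s3a://my-bucket/logs/2018/")

# Split each line into words and count them.
words = logs.select(F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
words.groupBy("word").count().orderBy(F.desc("count")).show(10)
```

Spark is the processing engine here; HDFS (or S3) is just where the data lives, which is why the two compete far less than the headlines suggest.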

And so, I've got a couple of blog posts out there that I'll link to in the show notes, where you can see me break down the difference between batch and streaming and some of those different workloads. Spark really came on whenever we started talking about being able to stream data and process data faster as it comes in.

And so, that's why you see a lot of people talking about Hadoop being the past technology and Spark being the newer technology that's going to take over the world. There are still going to be components from traditional Hadoop, like we talked about with HDFS. That's still going to be used for a long time, I think. Like I said, there are still a ton of people and a ton of developers using MapReduce. MapReduce has its functionality when we talk about batch workloads, and there's still development going on with MapReduce, and then Tez and some other platforms that are encompassed in the Hadoop community.

So, I would say, if you’re looking at it from a learning perspective, all right, which one do I want to learn, do I want to learn Hadoop, or do I want to learn Spark, and thinking that it’s all or nothing. I would say it’s not. I would focus mainly depending on what you’re looking to do, but I would definitely focus and learn HDFS, and so understand how the file system works and how you can compress, and how you can make those calls because chances are you’re going to be using HDFS and you’re also going to be using Spark, and Tez, and HBase, and Pig, and Hive, and a lot of different other tools in the ecosystem.

And so, I would say it's not an either-or. You're not going to pick and say, "I'm only going to do Spark," or, "I'm only going to do Hadoop." You're more than likely going to be using a lot: using Spark for your streaming applications and for processing your data, but still using Hadoop, and the things in Hadoop with HDFS, and being able to manage your data maybe with [inaudible], and some of the other functionalities in that ecosystem. So, it's not an all-or-nothing thing, and learning one is not going to stop you from getting a job, or prevent you from learning the other one. It's not an either-or thing, but if you're asking who will win in the future, I would say they both win.

Well, that's all I have for today. Make sure to subscribe to the channel so you never miss an episode. We've got a lot of things we're working on: some Isilon Quick Tips that are still rolling out, some book reviews, and we're starting to do some interviews, so you can see some interviews that [inaudible] in the past, and then also these Big Data Big Questions. Anything you want to see, just pop it here in the comments section and I'll try to answer it or tackle it the best I can. Thanks again.

Filed Under: Hadoop Tagged With: Data Engineer, Hadoop, Spark

What’s New in Hadoop 3.0?

December 20, 2017 by Thomas Henson 1 Comment

New in Hadoop 3.0

Major Hadoop Release!

Hadoop 3.0 has dropped! There is a lot of excitement in the Hadoop community for a 3.0 release. Now is the time to find out what's new in Hadoop 3.0 so you can plan an upgrade for your existing Hadoop clusters. In this video I explain the major changes in Hadoop 3.0 that every Data Engineer should know.

Transcript – What’s New in Hadoop 3.0?

Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. In today’s episode, we’re going to talk about some exciting new changes in Hadoop 3.0 and why Hadoop has decided to go with a major release in Hadoop 3.0, and what all is in it. Find out more right after this.

So, today I wanted to talk to you about the changes that are coming in Hadoop 3.0. It's already gone through the alpha, and now we're actually in the beta phase, so you can go out there, download it, and play with it. But what are these changes in Hadoop 3.0, and why did we go with such a major release? What all is in this one? There are two major changes we're going to talk about, but let me mention some of the other ones that are involved with this release, too. The first is more support for containerization. If you go to the Hadoop 3.0 website, you can actually go through some of the documentation and see where they're starting to support some of the Docker pieces. This is just more evidence for the containerization of the world. We've seen it with Kubernetes, and there are a lot of other pieces out there with Docker. It's almost a buzzword to some extent, but it's really, really been popularized.

These are really cool changes, too, when you think about it. Because if we go back to when we were in Hadoop 1.0 and even 2.0, it was kind of a third rail to say, "Hey, we're going to virtualize Hadoop." Now we're fast-forwarding and switching to containers, so some really cool changes are coming. Obviously there are going to be more and more changes that happen [inaudible], but this is really laying the foundation for supporting Docker and some of the other major container players out there in the IT industry.

Another big change we're starting to see… Once again, I won't say it's a monumental change, but it's more evidence of support for the cloud. The first piece is expanded support for Azure Data Lake, so think of the unstructured data there, maybe some of our HDFS components. And then there are also some big changes for Amazon's AWS S3. With S3, they're actually going to allow for easier management of your metadata with DynamoDB, which is a huge NoSQL database used in the AWS platform. Those are two of what I would call the minor changes. Those changes alone probably wouldn't have pushed it to be a Hadoop 3.0 or a major release.

The two major changes deal with the way we store data and the way we protect our data for disaster recovery, when you start thinking of those enterprise features you need to have. The first one is support for more than two namenodes. We've had support since Hadoop 2.0 for a standby namenode. Before having a standby namenode, or even a secondary namenode, if your namenode went down, your Hadoop cluster was all the way down, right?

Because that's where all your metadata is stored; the namenode knows what data is allocated on each of the data nodes. Once we were able to have that standby namenode and the shared journal, if one namenode went down, you could fail over to another one. But when we start thinking about fault tolerance and disaster recovery for enterprises, we probably want to expand that out, and this is one of the ways we're going to tackle that in the enterprise.

So, being able to support more than two namenodes: if you do some calculations, one example is, if you have three namenodes and five shared journal nodes, you can actually take the loss of two namenodes. You could lose two namenodes, and your Hadoop cluster would still be up and running, still able to run your MapReduce jobs, or your Spark jobs or something like that; you'd still have access to your Hadoop cluster. That's a huge change when we start to think about where we're going with enterprise adoption. You're seeing a lot of features and requests coming from enterprise customers saying, "Hey, this is the way we do DR. We'd like to have more fault tolerance built in." And you're starting to see that.
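For the curious, standing up the extra namenodes is configuration rather than code. Here's a rough sketch of the relevant hdfs-site.xml properties, with a made-up nameservice ID and host names; treat it as an illustration, not a complete HA config, since you'd still need the shared journal and failover settings:

```xml
<!-- Sketch of hdfs-site.xml for three namenodes; cluster and host names are made up. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<!-- ...repeat the rpc-address property for nn2 and nn3... -->
```

The quorum math follows from that layout: with three namenodes and five journal nodes, the cluster keeps running as long as one namenode and a majority (three) of the journal nodes survive.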

So, that was a huge change. One caveat around that: there's support for those extra namenodes, but they're still in standby mode. They're not what we talk about when we talk about HDFS federation, so it's not supporting three or four different namenodes serving different portions of HDFS. I've actually got a blog post you can check out about HDFS federation, where I see that going, and how that's a little bit different, too. So, that was a big change. And then the huge change… I saw some of the results on this before it even came out in the alpha; I think they did some testing at Yahoo Japan. It's about using erasure coding for storing the data. Think about how we store data in HDFS: remember the default of three, so three-times replication. As data comes in, it's written to one of your data nodes, and then two more copies are moved to a different rack on two different data nodes. That's to give you fault tolerance. If you lose one data node, you still have your data in a separate rack and would still be able to run your MapReduce jobs or your Spark jobs, or whatever you're trying to do with your data, maybe just pull it back.

That's how we traditionally stored it. If you needed more protection, you just bumped it up. But that's really inefficient. Sometimes we talk about that being 200% overhead for one data block. But really, it's more than that, because most customers will have a DR cluster with the data triple-replicated over there, too. So when you start to think about it: in our Hadoop cluster, it's triple-replicated; in our DR Hadoop cluster, it's triple-replicated. Oh, and the data may exist somewhere else as the source data outside of your Hadoop clusters. That's seven copies of the data. And how efficient is that for data that's maybe mostly archive? Or maybe it's compliance data you want to keep in your Hadoop cluster.

Maybe you run [inaudible] over it once a year. Maybe not. Maybe it's just something you want to hold on to, so that if you do want to run a job, you can. What erasure coding is going to do is give you the ability to store that data at a different rate. Instead of triple-replicating it, what erasure coding basically says is, "Okay, if we have data, we're going to break it into six data blocks, and then we're going to store three parity blocks," versus triple replication, where you can think of having 12 more blocks on top of the original six. The ability to break that data down, and to rebuild the data from the parity blocks, gives you a better ratio for how you store that data and what your efficiency rate is.

So, instead of 200%, maybe it's going to be closer to 125% or 150%. It's just going to depend as you scale. It's something to look forward to, and it's really cool, because it gives you the ability to store more data: bring in more data, hold on to it, and not think so much about, "Okay, this is going to take up three times the data just based on how big the file is." It gives you the ability to hold on to more data and take somewhat more of a risk: "Hey, I don't know that we need that data right now, but let's hold on to it, because we know we can use erasure coding and store it at a different rate. Then if it's something we need later on, we can bring it back." So, think of erasure coding as more of an archive tier for your data in HDFS.
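To put numbers on that, here's a quick back-of-the-envelope comparison using the 6-data-block, 3-parity-block layout mentioned above (a sketch for intuition, not the full accounting, which varies with block and file sizes):

```python
# Back-of-the-envelope storage overhead: 3x replication vs. RS(6,3) erasure coding.
data_blocks = 6

# Triple replication: every block stored 3 times.
replication_total = data_blocks * 3          # 18 blocks on disk
replication_overhead = (replication_total - data_blocks) / data_blocks  # 2.0 -> 200%

# Reed-Solomon RS(6,3): 6 data blocks + 3 parity blocks; any 6 of the 9 recover the data.
ec_total = data_blocks + 3                   # 9 blocks on disk
ec_overhead = (ec_total - data_blocks) / data_blocks                    # 0.5 -> 50%

print(f"3x replication: {replication_overhead:.0%} overhead")
print(f"RS(6,3) erasure coding: {ec_overhead:.0%} overhead")
```

That 50% overhead is where the "closer to 150" figure above comes from: the data occupies about 1.5 times its raw size instead of 3 times.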

And so those are the major changes in Hadoop 3.0. I just wanted to talk to you guys about that and get it out there. Feel free to send me any questions. If you have any questions for Big Data, Big Questions, go to my website, put it on Twitter with #bigdatabigquestions, or put it in the comments section here below, and I'll answer those questions for you. And as always, make sure you subscribe so you never miss an episode. Always talking big data, always talking big questions, and maybe some other tidbits in there, too. Until next time. See everyone then. Thanks.

Show Notes

Hadoop 3.0 Alpha Notes

Hadoop Summit Slides on Yahoo Japan Hadoop 3.0 Testing

DynamoDB NoSQL Database on AWS

Kubernetes 

 

Filed Under: Hadoop Tagged With: Big Data Big Questions, Hadoop, HDFS

How to Find HDFS Path URL?

December 17, 2017 by Thomas Henson 1 Comment