K8s Archives - Thomas Henson

Comparing Growth in Kubernetes & Hadoop

How do you select a career path in Data Engineering? There are so many options for technologies to learn for System Administrators in Data Engineering. While Hadoop has been king for the past decade, we must make way for a higher-level technology. Kubernetes (K8s) is a distributed computing technology, but it is vastly different from Hadoop on the data processing side. K8s is an open-source technology for automating the deployment, management, and scaling of containers. When comparing it to Hadoop, think of the Systems Administration side of Data Engineering, not the Software Engineering side.

The question is, how does the K8s ecosystem compare to that of Hadoop?

Should I add K8s to my list of must-learn technologies?

https://youtu.be/FZP-OJyZf0Q

Kubernetes vs. Hadoop Transcript

Hi, folks. Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, in this episode we’re going to be talking and breaking down Kubernetes versus Hadoop and talking about specifically which one I would prefer, if I was starting out today, to learn as a data engineer. Before we even get into it, I’m not saying that these are the same technologies, but I am comparing the popularity and I’m also comparing what some of the innovations are that we want to study as data engineers, or just people in the industry as a whole, where do we see those markets. So, let’s jump right into that right after this.

All right, so, today’s question comes in… We’re going to talk a little bit about the popularity of Kubernetes specifically, and Hadoop, and where we’re at. One of the biggest things right now is the popularity of Kubernetes in the container world has eclipsed the popularity of Hadoop and the Hadoop ecosystem. If we were looking at a chart, we would see the number of contributions and development to those open source platforms, that Hadoop has just been eclipsed by it. Now, Kubernetes is not replacing Hadoop, but it is changing the way… And there are innovations in Hadoop that are taking advantage of containers and specifically Kubernetes. So, let’s break it down and talk a little bit about each one of these and then we’ll do a comparison of where we see it going for data engineers and then also provide recommendations for which one I would choose if I was starting fresh today.

Kubernetes is an open source orchestration system for automating application deployment, scaling, and management. It was originally designed by Google. Hmmm. That sounds familiar, right? [Laughs] So many things from the open source world comes. But if you think about Kubernetes… If you’ve done anything with containers, and specifically around Docker, Kubernetes is that orchestration, that layer that allows for you to cluster these together. Think of how Yarn works in the Hadoop world. We’ll talk a little bit about Hadoop here soon. Think of it as just being able to orchestrate, “I have all these different nodes that are deploying my application or running different portions of my application.” Kubernetes is the secret sauce to say, “Hey, I need three of these, or four of these,” and be able to not only do some of the load balancing and some of the other pieces, but the orchestration to scale those up and make them elastic and scale them down.
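As a rough illustration of the "I need three of these, or four of these" idea, here is a minimal Python sketch that builds a Kubernetes Deployment manifest and prints it as JSON. The name `web`, the `app` label, and the `nginx` image are placeholders and not something from the episode; the `replicas` field is the part that tells Kubernetes how many copies to keep running.

```python
import json


def deployment_manifest(name, image, replicas):
    """Build an apps/v1 Deployment asking Kubernetes to keep `replicas` copies running."""
    labels = {"app": name}  # hypothetical label used to tie pods to the Deployment
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # desired number of identical pods
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # "web" and "nginx" are placeholder values for illustration only.
    print(json.dumps(deployment_manifest("web", "nginx", 3), indent=2))
```

Saving that output to a file and passing it to `kubectl apply -f` is one way to submit it. Changing `replicas` and applying it again is how a scale up or scale down would be requested.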

Kubernetes is synonymous with cloud-native. What’s going on with a cloud-native? Being able to move applications from Azure to AWS and make it seamless, or on-prem. To be able to develop something on-prem and be able to push it out. Really cool, really popular. Just another open source piece that’s came out of Google. Man, thank goodness for Google and all the things that they’ve done for the open source world. But, just another way to do that orchestration. So, from a data engineer’s perspective, really cool for us because it changes how we can deploy and use our applications. Back to what we were talking about, even the cloud-native piece. Being able to deploy out, start something, from a POC perspective, to be able to start and take advantage of being able to use something on-prem or do it in the cloud and then pull it back down, that’s really cool. And then also changes that application layer, but first let’s talk a little bit more about Hadoop, then we’ll talk about how it all fits together.

So, Hadoop, we’ve talked about it a bit here. Been around for a long time, synonymous with being able to scale out and make large-scale data decisions, like, be able to have that storage layer and also be able to analyze your data too. So, think of it as a node-based architecture similar to what we were talking about with Kubernetes, little bit different, but a node-based architecture that’s going to allow for us to analyze data and be able to make decisions with it over a large cluster, like, you’re building out a huge system here. So, synonymous with, originally, in early days of MapReduce, but it’s taking over more of a Hadoop ecosystem where we’re talking about not only that storage layer of HDFS, but also that processing layer that actually is going to allow for us to process the data on individual nodes, bring those back to the user, and be able to take advantage of all that under the covers. Also another product built and written off of research that was published by Google. Not specifically open sourced by Google, but some of the research papers that they pushed out there led to the writing of a research paper and the open source portion of Hadoop that became popular. It’s been around… If you follow this channel, you’ve heard me talk about Hadoop a good bit too.

Now, let’s talk a little bit about the architecture. When we’re talking about Hadoop and the architecture, we’re talking about running our data processing maybe with Spark, or Hive, or it could be Pig, or anything like that. Then you’ve got your layer that is your orchestration and that’s where Yarn comes in. It’s going to process that we have the resources on all these different nodes and spread out. And then your data layer comes in at HDFS. That’s where my data is stored, with an HDFS perspective. Well, what Kubernetes can do, so how it fits in the Big Data world and where we really see it is… Think of Spark, TensorFlow, any of those tools that we were talking about. Now our orchestration layer can be actually with Kubernetes. And then we can do persistent storage in our databases, S3, or there are some innovations out there that you can do it in HDFS, but you’re seeing it used more in a cloud-native perspective, so little bit different change of the architecture. The really cool thing about it is you’re actually abstracting away, so you’re not only just using Yarn just for building out this cluster, building out a system, you’ll be using Kubernetes, which can also do your application development too, like, you are… I’m sorry, application deployment. So, you know, being able to build cloud-native applications that are not just for data analytics but maybe serving out your web host and some of the other pieces. So, really a lot of innovation around that and really cool. There’s a lot of stuff and I just can’t go into it. I’m trying to give you a high level from here but there’s a lot of resources, a lot of courses out there, a lot of other things that maybe we should pick up at some point to talk about, around the innovations with Kubernetes. It’s really cool. It’s really changing, it’s in flux. If anybody is saying that, “Hey, it’s always going to be this way,” it’s one of those things that’s continuing to innovate, just like the data analytics area too.
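To make the swap of the orchestration layer a little more concrete, here is a small Python sketch that assembles two `spark-submit` command lines for the same job, one targeting YARN and one targeting Kubernetes. The API server address, the container image name, and the file paths are made-up placeholders; the `--master`, `--deploy-mode`, `spark.executor.instances`, and `spark.kubernetes.container.image` settings are standard Spark options. Nothing is executed here, the commands are only built and printed.

```python
def spark_submit_args(master, app, executors, image=None):
    """Assemble a spark-submit command line for the given cluster manager."""
    args = [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        "--conf", f"spark.executor.instances={executors}",
    ]
    if image is not None:
        # Only needed on Kubernetes, where the driver and executors run as containers.
        args += ["--conf", f"spark.kubernetes.container.image={image}"]
    args.append(app)
    return args


# Hadoop-style: YARN hands out the resources.
yarn_job = spark_submit_args("yarn", "job.py", 3)

# Kubernetes-style: the API server URL, image, and path below are placeholders.
k8s_job = spark_submit_args(
    "k8s://https://k8s.example.com:6443",
    "local:///opt/spark/app/job.py",  # local:// means a path inside the container image
    3,
    image="example/spark-py:latest",
)

if __name__ == "__main__":
    print(" ".join(yarn_job))
    print(" ".join(k8s_job))
```

The job itself is the same in both cases. What changes is which system is asked for the executors, which is the architectural shift described above.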

At the end of the day, I guess the question is, if I were starting out today, would I focus solely on Hadoop, would I focus solely on Kubernetes if I can only choose one? Which, you can never only choose one, but I appreciate and I love these types of questions too because they really push me to make a decision. So, today… One of the things, the biggest thing… Cloud is a huge topic and being able to do things cloud-natively, so being able to support it on-prem, being able to support it and push it out to different multi-cloud… There’s so many different topics and buzzwords in that, but if we really look at that and that decision making, Kubernetes is really one of the big portions around that, and that has a huge impact on what we’re doing from a data analytics perspective.

And frankly, because of Hadoop’s not so much ease to use it in the cloud, I think that’s one of the reasons that we’ve seen Hadoop wane with their growth. We’ve seen Hortonworks, Cloudera merge together, MapR be picked up and purchased, and then also IBM’s BigInsights, because of the fact that these were systems that would only work on-prem. You had other options in other cloud perspectives, but AWS had their version versus Azure had their version under the covers, but if you wanted to really pull it back into your own on-prem area or push it out, it was a little more complicated, and you couldn’t just move it from AWS to Azure. Kubernetes has really pushed that, not just from the analytics world, but from that perspective.

So, if you made me choose today and you said, “Hey, man, you can only choose one and it’s something that you’ve got to get skilled up on in the next three to six months,” I’d choose Kubernetes. Not saying that I would not learn Hadoop from that perspective, but if I had to choose between the two of those right now, I think there’d be a bigger opportunity for data engineers and specifically systems administrators and those kinds of people that are more hands-on with the administration piece. I think that’s where we’re going to see a lot… And you’ve seen a lot with the open source tools out there in the Hadoop ecosystem, like Spark and some of the things going on with Project Submarine, just being able to support containers.

That’s all I have today for Big Data Big Questions. If you have a question make sure you put it in the comment section here below or reach out to me. I’ll do my best to answer those questions here, on the next episode of Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

One question I get a lot on Big Data Big Questions is “Thomas, what are you learning?” Honestly, not as much as I should. It’s true: I believe the key to being successful in any part of life is to continually learn.

Looking to change careers from Web Developer to DevOps?

Do you want to be a better partner or spouse?

Trouble with public speaking?

All the answers to the above questions start with learning and end with consistency. If you make it a habit to learn and are consistent with it, there isn’t anything you can’t accomplish. Alright, enough with selling you on learning! I wanted to share with you what I’m learning and to help motivate myself TO KEEP LEARNING. The way I plan to accomplish this is with monthly learning reports.

30 Minutes of Learning Everydayish…

For a long time I’ve advocated for the idea of taking 30 minutes every day to learn something new. I’ll go through periods where I’m hitting that every day, then others where I fall behind. While it’s only a target and nothing to beat myself up about, I find it a useful technique when learning any concept. The way I do my 30 minutes of learning is to set a timer on my phone for 30 minutes and focus only on that topic until it goes off. Recently, in order to track this habit, I’ve been using the Super Habit App. Below you can see how I’ve done over the last month. Honestly not my best work, but let’s see how it improves over time.

Pluralsight

My 30 Minutes of Learning mainly comes from Pluralsight courses. Not only am I an Author, but I’m also a student. Pluralsight was a part of my personal learning path long before I was an Author. Back when I was a fresh new Web developer I used Pluralsight to learn C# and the ASP.NET frameworks. Of course, I also dove into the world of JavaScript, jQuery, and other JS frameworks. Nowadays I still learn with Pluralsight, but the content is more Data Engineering and IT Ops focused.

Docker Big Picture Course

Docker seems to be taking over the world. In fact, contributions to and adoption of Docker and Kubernetes have outpaced Hadoop by a wide margin. So many new applications and services offer a containerized version. In the Hadoop 3.0 release, multiple features were added to support containers. All this container talk has pushed me to learn this amazing Platform-as-a-Service for OS-level virtualization. My guide on this journey is the great Nigel Poulton. Check out my notes from the Docker Big Picture Course:

Kubernetes originated out of Google (shocking I know)
Kubernetes is Greek for helmsman or Captain
Web Playground for K8s
Web Playground for Docker
Docker Engine – daemon –> containerd –> OCI
Docker has both community and enterprise versions
Kubernetes = K8s

Docker Deep Dive Course

After working my way through the Docker Big Picture Course I decided to stay in the Docker world by watching the Docker Deep Dive Course. I loved this course where I was able to get hands on with Docker and learn a good bit beyond the basics. Here are a few notes I jotted down throughout this course:

Docker Commands like
- List – docker ps
- Pull Image – docker image pull
- Build Image – docker image build
- Run Image – docker container run (docker exec runs a command in an already-running container)
Building Docker images is as easy as writing a Dockerfile; Compose and stack files are written in YAML
YAML = YAML Ain’t Markup Language (originally Yet Another Markup Language)
Docker networking with bridge driver on Linux or NAT driver on Windows
Stacks and Services – code –> container –> services
Docker Universal Control Plane (UCP) is installed on top of Docker Engine
Docker Trusted Registry can be set up for storing all Docker Images.
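To tie the command notes together, here is a small Python sketch that lists one possible pull, build, run, list, and exec sequence as Docker CLI argument lists. The image names, the `myapp` container name, and the `ls /` command are placeholders I made up for illustration. The script only prints the commands and does not run them; note that `docker container run` starts a container from an image, while `docker exec` runs a command inside a container that is already running.

```python
import shlex


def docker_workflow(base_image, tag, name):
    """Return a typical sequence of Docker CLI commands as argument lists."""
    return [
        ["docker", "image", "pull", base_image],            # download an image
        ["docker", "image", "build", "-t", tag, "."],        # build from ./Dockerfile
        ["docker", "container", "run", "-d", "--name", name, tag],  # start a container
        ["docker", "ps"],                                    # list running containers
        ["docker", "exec", name, "ls", "/"],                 # run a command inside it
    ]


if __name__ == "__main__":
    # All names here are hypothetical examples.
    for cmd in docker_workflow("alpine:latest", "myapp:dev", "myapp"):
        print(shlex.join(cmd))
```

Each list could be handed to `subprocess.run` on a machine with Docker installed, though that step is left out here on purpose.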

Data Related Podcast & Blogs

Data Engineer learning doesn’t only take place in courses. I also wanted to track some of the Podcasts and Articles I consumed throughout the month. The great thing about Podcasts is you can listen to them while commuting, working out, doing house chores, or just about anywhere. Here are a few Podcasts and Articles I’ve consumed over the past month.

Conversational AI Best Practices with Cathy Pearl and Jessica Dene Earley-Cha – GCP podcast that digs into the aspects of conversational AI. I loved this podcast to explore where conversational AI is going and where to get started with NLP in GCP. Actually gave me some ideas for my October learning goals.
Microservices.io – Uh I can’t even begin to summarize how much content is on this site. If you are looking to learn more around Microservices (which you should!!) then bookmark this site and read this content over time.
Doctor AI – Dell Tech (full disclosure: #iworkforDell) podcast diving into different topics around AI. In this episode host Jessica explores the possibilities of AI augmentation in the medical field. This is one of the areas I’ve spent a good bit of time researching and speaking about. Earlier this year I spoke with a group of Medical Doctors and Researchers at NYU around advances in AI.
Exploring AWS Lake Formation – AWS podcast with guests from around the AWS world. A lot of great content on this podcast. Listened to this particular episode while walking my son, so my attention wasn’t what it should have been. Mostly I remember that AWS Lake Formation is an AWS service that helps with cataloging and labeling data to support multiple services (MySQL, Redshift, S3).

On To Next Month

Thanks for supporting this new series. I’m excited to see how it matures over time, and I would also love to get more consistent with my learning. If you have ideas for things I should be learning or would like to share what you are learning, put it in the comments below. Right now my thoughts are to wrap up the Kubernetes Deep Dive course, then move on to Natural Language Processing (NLP). I’ve got ideas for some really cool projects in NLP, so it should be fun.
