Spark vs. Hadoop 2019

In 2019, which skill is in more demand for Data Engineers: Spark or Hadoop? For career or aspiring Data Engineers, it makes sense to keep up with which skills are in demand in the market. Today Spark is hot and Hadoop seems to be on its way out, but how true is that?

Hadoop, born out of the web search era and part of the open source community since 2006, has defined Big Data. However, Spark’s release into the open source Big Data community, boasting up to 100x faster processing for Big Data, created a lot of confusion about which tool is better and how each one works. Find out what Data Engineers should be focusing on in this episode of Big Data Big Questions: Spark vs. Hadoop 2019.

Transcript – Spark vs. Hadoop 2019

Hi folks. Thomas Henson here with thomashenson.com, and today is another episode of Wish That Chair Spun Faster. …Big Data Big Questions!

Today’s question comes in from some of the things that I’ve been seeing in my live sessions, so some of the chats, and then also comments that have been posted on some of the videos that we have out there. If you have any comments or have any ideas for the show, make sure you put them in the comments section here below or just let me know, and I’ll try my best to answer these. This question comes in around, should I learn Spark, or should I learn Hadoop in 2019? What’s your opinion?

A lot of people are just starting out, and they’re like, “Hey, where do I start? I’ve heard Spark, I’ve heard Hadoop’s dead. What do we do here?” How do you tackle it?

If you’ve been watching this show for a long time, you’ve probably seen me answer questions similar to this and compare the differences between Spark and Hadoop. This is still a viable question, because I’ve actually changed the way I think about it a little bit, and I’m going to take a different approach with the way that I answer this question, especially for 2019.

In the past I’ve said that it really just depends on what you want to do. Should you learn Spark? Should you learn Hadoop? Why can’t you learn both? Which, I still think, from the perspective of your overall technology learning and career, you’re probably going to want to learn both of them. But say we’re talking about, “Hey, I’ve only got 30 days, 60 days. I want the quickest results possible, Thomas. How can I move into a data engineer role, find a career? Maybe I just graduated college, or maybe I’m in high school, and I want to get an internship that maybe turns into a full-time gig. Help me, in the next 30 to 90 days, get something going.”

Instead of saying it depends, I’m really going to tell you that I think it’s going to be Spark. That’s a little bit of a change, and I’ll talk about some of the reasons why I made that change, too. Before we jump into that, let’s talk a little bit about some of the nomenclature that we have around Hadoop. When we talk about Hadoop, a lot of times we’re talking about Hadoop, and MapReduce, and HDFS, this whole piece. From the perspective of writing MapReduce jobs or processing our data, Spark is far and away the leader in that. Even MapReduce is being decoupled, has been decoupled, and more and more jobs are not written in MapReduce. They’re more often written with Flink, or Spark, or Apache Beam, or even [Inaudible 00:02:32] on the back-end. That war has been won by Spark for the most part.

Secondly, when we talk about Hadoop, I like to talk about it from an ecosystem perspective. We’re talking about HDFS, we’re talking about even Spark included in that, and Flume, all the different pieces that make up what we call the big data ecosystem. We just call that the Hadoop ecosystem.
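For readers who have not written a MapReduce job, here is a rough sketch of the pattern the name refers to, using the classic word-count example. It is plain Python with no Hadoop or Spark involved, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration; a real job would run these phases across a cluster rather than in one process.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases together (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark or hadoop", "learn spark first"]
    print(word_count(sample))
```

In Spark the same idea is usually written as a short chain of calls on a dataset (for example flatMap, map, and reduceByKey on an RDD), which is a large part of why people find it more pleasant to develop in.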

The way that I’m answering this question today is, hey, I’m looking for something in 2019 that could really move the needle. What do you see that’s in demand? I see Spark is very, very much in demand, and I even see Spark being used outside of just HDFS as well. That’s not saying that if you’ve learned Hadoop or you’ve learned HDFS you’ve gone down the wrong path. I don’t think that’s the case, and I think that’s still viable. But you’re asking me, what can you do to move the needle in 30 to 90 days? Digging down and becoming a Spark developer, that opens up a career option. That’s one of the quickest ways that you can get there, and it’s one of the big things we’ve seen out there with the roles for data engineers.

Another huge advantage, and we’ve talked about it a little bit on this channel, is the big announcement about what Databricks is doing from the perspective of investment and what their valuation is. They’re at a $2.5 billion valuation, and they’re huge in the Spark community. They’re part of the incubators and on a lot of steering committees for Spark. They have some tools and everything that they sell on top of that, but it’s just really opened my eyes to what’s out there. I knew Spark was pretty big, but Databricks and where they’re going, I think that’s a lot of what we’re seeing.

Another point, too, and you’ve heard me talk about it a good bit, is where we’re going with deep learning frameworks and bringing them into the core big data area. Spark is going to be that big bridge, I believe. People love to develop in Spark. Spark’s been out there. It gives you the opportunity now, with Project Hydrogen and some of the other things that are coming, to be able to do ETL over GPUs, but also import data and be able to implement and use TensorFlow or PyTorch, or even Caffe2.

If you’re looking in 2019 to choose between Spark and Hadoop to find something in the next 30 to 90 days, I would go all in with Spark. I would learn Spark, whether it be from Java, Scala, or Python, and start doing some tutorials around that, being able to code, being able to build out your own projects. I think that’s going to really open your eyes, and that can really get the needle moving. At some point, you’ll want to go back and learn how to navigate data with HDFS, how to find things, and what’s going on in the rest of the Hadoop ecosystem, because it’s all one big piece here. But if you’re asking me the one thing to do to move the needle in 30 to 90 days, learn Spark.

Thanks again. That’s all I have today for Big Data Big Questions. Remember, subscribe and ring that bell, so you never miss an episode of Big Data Big Questions. If you have any questions, put them in the comment section here below. We’ll answer them on Big Data Big Questions.