Spark vs. Hadoop 2019

June 12, 2019 by Thomas Henson Leave a Comment

In 2019, which skill is in more demand for Data Engineers: Spark or Hadoop? As career or aspiring Data Engineers, it makes sense to keep up with which skills the market is demanding. Today Spark is hot and Hadoop seems to be on its way out, but how true is that?

Hadoop, born out of the web search era and part of the open source community since 2006, has defined Big Data. However, Spark's release into the open source Big Data community, boasting 100x faster processing for Big Data, created a lot of confusion about which tool is better and how each one works. Find out what Data Engineers should be focusing on in this episode of Big Data Big Questions: Spark vs. Hadoop 2019.

 

Transcript – Spark vs. Hadoop 2019

Hi folks. Thomas Henson here with thomashenson.com, and today is another episode of Wish That Chair Spun Faster. …Big Data Big Questions!

Today’s question comes in from some of the things that I’ve been seeing in my live sessions, so some of the chats, and then also comments that have been posted on some of the videos that we have out there. If you have any comments or have any ideas for the show, make sure you put them in the comments section here below or just let me know, and I’ll try my best to answer these. This question comes in around, should I learn Spark, or should I learn Hadoop in 2019? What’s your opinion?

A lot of people are just starting out, and they’re like, “Hey, where do I start?” I’ve heard Spark, I’ve heard Hadoop’s dead. What do we do here? How do you tackle it?

If you’ve been watching this show for a long time, you’ve probably seen me answer questions similar to this and compare the differences between Spark and Hadoop. This is still a viable question, because I’ve actually changed a little bit the way I think about it, and I’m going to take a different approach with the way that I answer this question, especially for 2019.

In the past I’ve said that it really just depends on what you want to do. Should you learn Spark? Should you learn Hadoop? Why can’t you learn both? And I still think that, from the perspective of your overall technology learning and career, you’re probably going to want to learn both of them. But if we’re talking about, hey, I’ve only got 30 days, 60 days, “I want the quickest results possible, Thomas,” how can I move into a data engineer role and find a career? Maybe I just graduated college, or maybe I’m in high school, and I want to get an internship that maybe turns into a full-time gig. Help me, in the next 30 to 90 days, get something going.

Instead of saying it depends, I’m really going to tell you that I think it’s going to be Spark. That’s a little bit of a change, and I’ll talk about some of the reasons why I think that changed, too. Before we jump into that, let’s talk a little bit about some of the nomenclature we have to sort out around Hadoop. When we talk about Hadoop, a lot of times we’re talking about Hadoop, MapReduce, and HDFS as one whole piece. From the perspective of writing MapReduce-style jobs or processing our data, Spark is clearly the leader. Even MapReduce is being decoupled, has been decoupled, and more and more jobs are not written in MapReduce. They’re written with Flink, or Spark, or Apache Beam, or even [Inaudible 00:02:32] on the back-end. That war has been won by Spark for the most part. Secondly, when we talk about Hadoop, I like to talk about it from an ecosystem perspective. We’re talking about HDFS, we’re talking about even Spark included in that, and Flume, all the different pieces that make up what we call the big data ecosystem. We just call that the Hadoop ecosystem.

The way that I’m answering this question today is: hey, I’m looking for something in 2019 that could really move the needle. What do I see that’s in demand? I see Spark is very, very much in demand, and I even see Spark being used outside of just HDFS as well. That’s not saying that if you’ve learned Hadoop or you’ve learned HDFS you’ve gone down the wrong path. I don’t think that’s the case, and I think that’s still viable. But if you’re asking me what you can do to move the needle in 30 to 90 days, digging down and becoming a Spark developer opens up a career option. That’s one of the quickest ways to get there, and one of the big things we’ve seen out there with the roles for data engineers.

Another huge advantage, and we’ve talked about it a little bit on this channel, is the big announcement about where Databricks is going from the perspective of investment and what their valuation is. They’re at a $2.5 billion valuation, and they’re huge in the Spark community. They’re part of the incubators and on a lot of the steering committees for Spark. They have some tools and everything that they sell on top of that, but it’s really opened my eyes to what’s out there. I knew Spark was pretty big, but the fact of where Databricks is going says a lot about what we’re seeing.

Another point, too, and you’ve heard me talk about it a good bit, is where we’re going with deep learning frameworks and bringing them into the core big data area. Spark is going to be that big bridge, I believe. People love to develop in Spark. Spark’s been out there. It gives you the opportunity now, with Project Hydrogen and some of the other things that are coming, to do ETL over GPUs, but also to import data and be able to implement and use TensorFlow, or PyTorch, or even Caffe2.

If you’re looking in 2019 to choose between Spark and Hadoop to find something in the next 30 to 90 days, I would go all in with Spark. I would learn Spark, whether it be with Java, Scala, or Python, start doing some tutorials around it, get coding, and build out your own projects. I think that’s going to really open your eyes, and that can really get the needle moving. At some point you’ll want to go back and learn how to navigate data with HDFS, how to find things, and what’s going on in the Hadoop ecosystem, because it’s all one big piece here. But if you’re asking me the one thing to do to move the needle in 30 to 90 days: learn Spark.
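If you want a concrete place to start, here is a minimal PySpark sketch, assuming Python and a local Spark install; the input file path is a made-up example. It counts words in a text file, which is about the smallest end-to-end Spark starter project you can build:

# Minimal PySpark word-count sketch -- the kind of small starter project
# described above. The input path is hypothetical; point it at any text file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("starter-word-count").getOrCreate()

lines = spark.read.text("data/sample.txt")   # DataFrame with a single "value" column
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, count in counts.take(10):
    print(word, count)

spark.stop()

Running a tiny job like this end to end (read, transform, aggregate, inspect) is usually enough to get comfortable with the Spark programming model before moving on to bigger datasets.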

Thanks again. That’s all I have today for Big Data Big Questions. Remember, subscribe and ring that bell, so you never miss an episode of Big Data Big Questions. If you have any questions, put them in the comment section here below. We’ll answer them on Big Data Big Questions.

 

Filed Under: Hadoop Tagged With: Big Data, Data Engineers, Hadoop, Spark

What is the Difference Between Spark & Hadoop

May 14, 2018 by Thomas Henson 1 Comment

Spark & Hadoop Workloads are Huge

Data Engineers and Big Data Developers spend a lot of time developing their skills in both Hadoop and Spark. For years Hadoop’s MapReduce was king of the processing portion of Big Data applications. However, for the last few years Spark has emerged as the go-to for processing Big Data sets. Still, it can be unclear what the differences are between Spark and Hadoop. In this video I’ll break down the differences every Data Engineer should know between Spark and Hadoop.

Make sure to watch the video below to find out the differences and subscribe to never miss an episode of Big Data Big Questions.

 

Transcript – What is the Difference Between Spark & Hadoop

Hi, folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s question is one I’ve been wanting to tackle for a long time. I’m not sure why I haven’t gotten to it, but I’m ready to do it. So, it’s the ultimate showdown: what’s the difference between Hadoop and Spark, and which one will win the fight? So, find out how I’ll answer that question right after this.

Welcome back. So, today’s question comes in from a user. It came on through a YouTube comment section. So, post your question down here below. You can actually go to my website and go to Big Questions. So, thomashenson.com/bigquestions. Put it out on Twitter. Use the hashtag Big questions. I’ll look it up, try to answer those questions.

So, today’s question comes in and it says, YouTube comment, “Nowadays, there are predominantly two softwares that are used for dealing with big data, Hadoop Ecosystem and Spark. Could you elaborate on the similarities and differences in those two technologies?”

So, that’s an amazing question. It’s one that we hear all the time. So, Hadoop is a very mature technology; it’s been out there. Really, it’s associated with a lot of the things that are going on in the Big Data community, and when you say big data, it’s almost synonymous that you’re going to say Hadoop as well. But with Hadoop being over 10 years, maybe 13 years old, depending on how you look at it, a lot of people are calling for its death, and Spark is the one that’s going to do it.

But there’s a little bit of difference. Like I said, we say that Hadoop is this all-encompassing thing. You hear me say it all the time, the Hadoop Ecosystem. So, I call it an ecosystem because a lot of things get pulled into the Hadoop Ecosystem. A lot of people say things, like assuming that Hadoop runs and does all the processing, and has all the functionality for your applications, or if you’re running it. But in a lot of data centers, you can run big data clusters and not be using Hadoop or not be using MapReduce.

And so, let me explain a little bit what I mean by the true definition of Hadoop, and then we’ll talk a little bit about Spark. So, Hadoop is built of two components; we separate it out into two different components. And so, the first one we’re going to break down is MapReduce. So, you’ve probably heard of MapReduce; that’s what started it all, being able to process large datasets, and it’s somewhat of an indexing way to work with data. So, if you have a cluster, you’re able to run your mapper and your reducer jobs, and be able to process data that way, and that functionality is called MapReduce. That’s one portion of Hadoop.
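To make the mapper and reducer split concrete, here is a hedged word-count sketch in the Hadoop Streaming style, written as two small Python scripts (the file names mapper.py and reducer.py are just illustrative labels): the mapper emits key/value pairs, and the reducer sums them after the framework has sorted the output by key.

# mapper.py -- reads raw lines from stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")


# reducer.py -- Hadoop Streaming sorts the mapper output by key before the
# reducer runs, so identical words arrive on consecutive lines and can be summed.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")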

Another part of Hadoop, the really cool part, the part that I’ve been involved with a ton, is called the Hadoop Distributed File System, or HDFS. And so, HDFS is the way that all the data is stored. And so, we have our MapReduce that’s controlling how the data is going to be processed, but HDFS is how we store that data. And so many applications, whether they’re in the Hadoop Ecosystem or new to data processing or even just scripting, use Hadoop or HDFS to be able to pull data and to use your data as a file system.
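As a small illustration of that “use it like a file system” point, here is a hedged sketch of a plain Python application listing and reading data on HDFS through pyarrow’s HadoopFileSystem binding. This assumes the Hadoop native client (libhdfs) and configuration are available on the machine; the namenode host, port, and paths are made-up examples.

# Hedged sketch: treating HDFS like a regular file system from Python.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# List what's under a directory, much like `ls` on a local file system.
for info in hdfs.get_file_info(fs.FileSelector("/data/logs")):
    print(info.path, info.size)

# Read the first chunk of a single file straight out of HDFS.
with hdfs.open_input_stream("/data/logs/part-00000.txt") as stream:
    print(stream.read(1024)[:80])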

And so, you have those two pieces right there and those two components.

When they talk about Hadoop being old, or Hadoop being slow, or portions of Hadoop that people aren’t interested in, most of the time they’re talking about the MapReduce portion. And so, there’s been a lot of things that have come out. So, there’s been MapReduce 1, and then MapReduce version 2, and Tez, and just different components around to compete with MapReduce, and Spark is one of those technologies as well.

And so, Spark is a framework. It’s called lightning fast, but it’s a framework for processing data. And so, you can still process your data that exists in HDFS, or that exists in S3. There are other places where your data can exist and be processed by Spark, but predominantly, most data centers still have their data in HDFS. So, things were built upon HDFS. HDFS is where your data is housed, and so you process it whether you’re using Spark, whether you’re using Tez, whether you’re using any new way of processing the data, or you still may be using MapReduce, but you can have all of that in HDFS. So, when you think about it, the two do compete, but primarily as processing engines.
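As a small illustration of that point, the same Spark read works against different storage layers just by changing the path. This is a hedged PySpark sketch with made-up paths; the S3 read also assumes the hadoop-aws connector is on the classpath.

# Sketch: the same Spark code can process data wherever it lives -- the storage
# layer is just a path/URI. Both locations below are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic").getOrCreate()

# Data stored in HDFS (the common case in most on-prem clusters)...
events_hdfs = spark.read.parquet("hdfs://namenode:8020/data/events")

# ...or the same kind of read pointed at object storage such as S3.
events_s3 = spark.read.parquet("s3a://my-bucket/data/events")

print(events_hdfs.count(), events_s3.count())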

And so, I’ve got a couple of blog posts out there that I’ll link to in the show notes, but you can go out and see where I break down the difference between batch and streaming, and some of those different workloads. And so, Spark really came on whenever we started talking about being able to stream data and being able to process data faster as it comes in.
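For a feel of the streaming side, here is a hedged Structured Streaming sketch in PySpark; the input directory and schema are invented for illustration. It picks up new JSON files as they land and keeps a running count per action, instead of running one big batch job.

# Sketch of a Structured Streaming job: process new files as they arrive.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

stream = (spark.readStream
          .schema("user STRING, action STRING, ts TIMESTAMP")  # hypothetical schema
          .json("hdfs://namenode:8020/landing/events/"))        # hypothetical landing dir

counts = stream.groupBy("action").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()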

And so, that’s why you see a lot of people talking about Hadoop being the past technology and Spark being the newer technology that’s going to take over the world. There are still going to be components from traditional Hadoop, like we talked about with HDFS, and those are probably still going to be used for a long time. Like I said, there’s still a ton of people and a ton of developers still using MapReduce. And so, MapReduce has its functionality when we talk about batch workloads, and there’s still development going on with MapReduce 1, and then Tez and some other platforms that are encompassed in the Hadoop community.

So, I would say, if you’re looking at it from a learning perspective, all right, which one do I want to learn, do I want to learn Hadoop or do I want to learn Spark, and thinking that it’s all or nothing, I would say it’s not. I would focus mainly depending on what you’re looking to do, but I would definitely focus on and learn HDFS, and so understand how the file system works, how you can compress, and how you can make those calls, because chances are you’re going to be using HDFS and you’re also going to be using Spark, and Tez, and HBase, and Pig, and Hive, and a lot of other tools in the ecosystem.
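As one small example of the file-system details worth knowing, here is a hedged PySpark sketch that writes results back to HDFS as Snappy-compressed Parquet; the output path and sample rows are made-up examples.

# Hedged sketch: writing compressed output back to HDFS with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-compression").getOrCreate()

# Tiny invented dataset, just to have something to write.
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["user", "visits"])

# Parquet with Snappy compression is a common choice for data stored on HDFS.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs://namenode:8020/warehouse/visits"))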

And so, I would say, it’s not an either-or; you’re not going to pick and say, “I’m only going to do Spark,” or, “I’m only going to do Hadoop.” You’re more than likely going to be using Spark for your streaming applications and for processing your data, but you’re still using Hadoop, and the things in Hadoop with HDFS, and being able to manage your data maybe with the [INAUDIBLE 00:06:28], and some of the other functionalities that are in that ecosystem. So, it’s not an all-or-nothing thing. And so, learning one is not going to stop you from getting your job or prevent you from learning the other one. So, it’s not an either-or thing, but if you’re asking who will win in the future, I would say they both win.

Well, that’s all I have for today. Make sure to subscribe to the channel so you never miss an episode. We’ve got a lot of things that we’re working on, so we’ve got some Isilon Quick Tips that are still rolling out. We’ve got some book reviews, and we’re starting to get some interviews, so you can see some interviews that [INAUDIBLE 00:07:03] in the past, and then also these Big Data Big Questions. And so, anything that you want to see, just pop it here in the comment section and I’ll try to answer it or tackle it the best I can. Thanks again.

Filed Under: Hadoop Tagged With: Data Engineer, Hadoop, Spark

Should Data Engineers Know Machine Learning Algorithms?

November 10, 2017 by Thomas Henson Leave a Comment

How involved should Data Engineers be in learning Machine Learning Algorithms? 

For the past few years, Data Scientist has been one of the hottest jobs in IT. A huge part of what Data Scientists do is selecting Machine Learning Algorithms for projects like smart homes and smart cars. What about the Data Engineer: should they know Machine Learning Algorithms as well? Find out in this episode of Big Data Big Questions.

Transcript – Should Data Engineers Know Machine Learning Algorithms?

Hi, folks. Thomas Henson here, with ThomasHenson.com, and today is another episode of Big Data, Big Questions. And so, today’s question that comes in is, “Should data engineers know machine-learning algorithms?” So, we’re going to tackle that question right after this.

Welcome back. So, today, we’re going to talk a little bit about algorithms, right? So, you know, put your math hat on, and let’s dive into this question today. And so, today’s question, it’s one I get a lot. It’s about the role of a data engineer in machine-learning. And basically, it is… You know, I’ve taken this question from a couple of different sources that I’ve seen, where they’ve asked, you know, “Should data engineers know machine learning algorithms?” And kind of where some of that falls into is, you know, what is the role of the data engineer, and what is the role of the data scientist? And so, really, this question, for me, is really simple. I’m going to go off of my experience and kind of share with you what I’ve done around machine-learning algorithms and how I’ve approached it in my career as a data engineer, software engineer, you know, Hadoop administrator.

There’s a couple of different ways to look at it, but basically, the way that I’ve approached it is I haven’t really learned it. And when I say, “Learned it,” or, “Know it,” I’ve not been in…you know, I’m not going to make a recommendation on it. So, you know, the way I look at it is you should be familiar with them. So, you should be familiar with them, especially familiar with them as far as, like, what’s involved in the package? So, are you using Mahout? You know, what are the algorithms in there, what are the algorithms in your workflow? And then, all the other libraries too.

So, if you’re evaluating other libraries… so, maybe you haven’t used Spark and you want to look at the ML library that’s there, and you’re kind of going back and forth through those, you want to understand from a basic, very high level what those algorithms are, and for sure, what algorithms you’re using in your environment, so you can make an educated recommendation, saying, “Hey, I think we should move this. Let’s still have the data scientists involved, and have them look and make sure that the algorithms that we’re going to be using from those packages are going to fall in line with what we’re really using,” because that’s one of the things too, you’ll find that they will differentiate a little bit. So, what we’re using in Mahout may not be exactly the same version that’s in MADlib, or in the ML library.

And so, just be able to understand for sure what’s in your workflow, and be familiar with them. Another thing that I did… so, like I said, be familiar with them from a high level without making the recommendation, but I actually picked one. So I would say, be familiar with them, but pick one that you really want to understand and learn. I picked Singular Value Decomposition, because that’s something that we used a lot in our workflow, and so I just kind of had a natural curiosity for it, and it had a really cool story around it too. So, I found some stories around it; it was made really popular with the Netflix Challenge. So, Netflix had a challenge to “beat our data scientists with your algorithm,” and SVD was used to do some of the sorting there, and it was kind of made famous from that perspective. And so, I was familiar with it, but I made sure that I understood one, just out of natural curiosity.
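For a flavor of what SVD does, here is a toy NumPy sketch on a made-up user-by-movie ratings matrix (the numbers are invented purely for illustration). It keeps only the top singular values to build a low-rank approximation, which is the basic idea behind SVD-style recommenders like those made famous by the Netflix Prize.

# Toy illustration of Singular Value Decomposition on an invented ratings matrix.
import numpy as np

ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Full decomposition: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep the top-2 singular values for a low-rank approximation.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(approx, 2))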

Now, if you are looking to, you know, at some point, make a jump, right, to data scientist, if you’re a data engineer, and at some point down the road, you’d like to be…you know, “I want to be the data scientist. I want to say, ‘Hey, this is the algorithm we should use.’” You know, maybe you just want to be a data scientist because, you know, for a couple years running, it’s been the…you know, the sexiest career, you know, in IT for a while, and so, if that’s kind of your approach, you know, definitely start to know them.

Obviously, learn the ones that are in your environment first, because that’s going to be the easiest, because you’re going to have the access to, you know, why you’re using it, how you’re using it, and you have access to the data scientists too, to kind of, you know, take you under their wing, to some extent, and, you know, show you the ins and outs of why you’re using what you did and, you know, kind of why you didn’t use other ones too. For an aspiring data scientist, then yes, for sure, you want to jump in and, you know, start to understand and start to know them. But for a data engineer, I don’t think you have to learn the algorithms, right? I think you have to be familiar with them, I think, you know, for natural curiosity, you know, maybe learn one or two.

But really, our role is not to recommend and say, “Hey, you know, these are the algorithms I think we should use,” or even, like, to pick packages and say, “Hey, these packages here, we’re going to…you know, we’re going to standardize on that and that’s the only thing we’re going to use.” That’s…you know, that’s not really our role, right?

If you have any questions, make sure you submit them to Big Data, Big Questions. You can do it from the website, go to Twitter, use the hashtag #BigDataBigQuestions, in the comment section there, however you want to get in touch with me and get those questions answered. Also, make sure you subscribe so that you never miss all these Big Data, Big Questions goodness, and so that you can always, you know, learn more. Thanks again, folks!

Show Notes

Mahout 

Spark MLlib

MADlib

Big Data Big Questions

Netflix Prize

Singular-Value Decomposition

 

Filed Under: Data Engineers Tagged With: Data Engineers, Machine Learning, Mahout, Spark
