Big Data Big Questions: Big Data Kappa Architecture Explained

July 9, 2017 by Thomas Henson Leave a Comment

Big Data Kappa Architecture
Learning how to develop streaming architectures can be tricky. In Big Data, the Kappa Architecture has become the go-to streaming architecture because of the growing need to analyze streaming data. For the past few years the Lambda Architecture has been king, but over the past year the Big Data community has seen a shift toward the Kappa Architecture.

What is the Kappa Architecture? How can you implement the Kappa Architecture in your environment? Watch this video and find out!

Transcript

(Forgive any errors; the video was transcribed by a machine.)

Hi folks, Thomas Henson here with thomashenson.com, and this is another episode of Big Data Big Questions. Today we're going to tackle the Kappa Architecture, explain how we can use it in Big Data, and look at why it's so popular right now. Find out more right after this.

[Music]

In a previous episode we talked about the Lambda Architecture and how it was kind of the standard in Big Data before we had Spark, Flink, and all the other processing engines that do streaming. You can find that video right here. Check it out, we're in the same shirt, pretty cool. After you watch that video we need to talk about the Kappa Architecture, and the reason we're going to talk about Kappa is that it's based on, and really morphed from, the Lambda Architecture.

When we talked about the Lambda Architecture, we talked about how it's a dualistic framework: you have your speed layer and your batch or MapReduce layer, which is more transactional. So you have two layers, you're still moving your data into HDFS, and you're still putting your data into a queue. With the Kappa Architecture, what we're trying to do, and where the industry is going, is to not have to support two different frameworks. Any time you're supporting two versions of code, or two different layers of code, it's more complicated: you need more developers, and it's more risk. Look at the 80/20 rule: roughly 20% of your bugs cause 80% of your problems, so why manage two different layers? What we're starting to see instead is that we move all our data into one system where we can interact with it through our APIs, whether we're running a Flink job or some kind of distributed search, maybe using Solr or Elasticsearch. We want to collapse all of that down into one framework.

Okay, that sounds pretty simple, but it's not always implemented the way we think. Here's one big tip, and one thing I want you to pay attention to. With the Kappa Architecture you're saying, I'm going to have this one layer that everything interacts with, and I'm going to run all my jobs through it, whether I'm running Spark or Flink. What you want to make sure of is that you're not just using Kafka or some kind of message queue where you pull your streaming jobs through your APIs, but then still take that data, move it into HDFS, and run separate processing there. What we really want to see with the Kappa Architecture is that your source data comes in through whatever your queuing system is (you can check out pravega.io, there's some information there about that architecture layer), your data exists in that queuing system, and it flows out to your applications. But you don't want your applications writing directly to HDFS, because then you're just writing to two different systems again. You want something that abstracts away all that storage, so whether your data is archival and sitting in HDFS or some kind of object-based storage, or it's a streaming application pulling that data off as fast as it can, you only want to interact with that one system.

That's what we mean when we talk about Kappa, and that's what Kappa is really intended to be. Remember: you want to abstract away the storage layer behind your queuing system so you're only dealing with APIs, and you want to be running your Spark jobs, your Flink jobs, and your distributed search through one pipeline, not through two different pipelines where you're breaking up your speed layer and your batch or transactional layer.

So that's the Kappa Architecture explained. Make sure you subscribe so you never miss an episode; you definitely want to keep up with what's going on in Big Data. Any questions you have, submit those Big Data Big Questions in the comments below, send me an email, or go to the Big Data Big Questions section on my blog. Thanks again and I'll see you next time.
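To make the single-pipeline idea concrete, here is a minimal sketch of a Kappa-style job using PySpark Structured Streaming reading from a Kafka topic. The broker address, topic name, and the windowed count are hypothetical placeholders chosen for illustration (they are not from the video), and the snippet assumes the spark-sql-kafka connector is on the classpath. The point is that replaying history and processing live data go through the same code path instead of a separate batch layer.

```python
# Minimal Kappa-style pipeline sketch (assumed names: broker "broker:9092",
# topic "events"; requires the spark-sql-kafka package).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, count

spark = SparkSession.builder.appName("kappa-pipeline").getOrCreate()

# One ingestion path for everything: the log is the system of record.
# Replaying history is just restarting from the earliest offset,
# not maintaining a second batch codebase that reads HDFS directly.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")   # replay or tail -- same code
          .load())

# The same transformation logic serves both recomputed and fresh views.
counts = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .agg(count("*").alias("events_per_minute")))

# Console sink stands in for a real serving layer or dashboard.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

Because the log (Kafka, Pravega, or similar) is treated as the system of record, recomputing a view is just a restart from the earliest offset rather than a second pipeline that writes to and reads from HDFS.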

Filed Under: Streaming Analytics Tagged With: Big Data, Hadoop Architecture, Streaming Analytics, Unstructured Data

Big Data Lambda Architecture Explained

June 29, 2017 by Thomas Henson 2 Comments

Big Data Lambda Architecture

What is Lambda Architecture?

Since Spark, Storm, and other stream processing engines entered the Hadoop ecosystem, the Lambda Architecture has been the de facto architecture for Big Data workloads with real-time processing requirements. In this episode of Big Data Big Questions I'll explain what the Lambda Architecture is and how developers and administrators can implement it in their Big Data workflows.

Transcript

(Forgive any errors; the text was transcribed by a machine.)

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today's question is: what is the Lambda Architecture, and how does it relate to Big Data and the Hadoop ecosystem? Find out right after this.

When we talk about the Lambda Architecture and how it's implemented in Hadoop, we have to go back and look at Hadoop 1.0 and 2.0, when we really didn't have a speed layer or Spark for streaming analytics. Back in those traditional days we were using MapReduce for most of our processing. The way that worked is our data would come in, we would pull it into HDFS, and once it was in HDFS we would run some kind of MapReduce job, whether we used Pig or Hive, wrote our own custom job, or used one of the other frameworks in the ecosystem. That was mostly batch: all of our data had to be in HDFS, so we had to have a complete view of our data before we could process it.

Later on we started seeing that we needed to pull data in and process it even when the data isn't complete, when we only have part of it or it's continually being updated. That's where Spark, Flink, and some of the other stream processing engines came in: we wanted to process that data as it arrived, and do it a lot faster too. We took out the need to land it in HDFS before we started processing, because that write takes time; we wanted to move and process our data before it ever hit HDFS. But we still needed batch processing. Some analytics we want in real time, but other insights, like monthly or quarterly reports, are better suited to batch, and there's also holding on to historical data and using the platform like a traditional enterprise data warehouse, but on a larger Hadoop-based platform, with Hive, Presto, and some of the other SQL engines that run on top of Hadoop.

So the need arose for two different systems for processing data, and we started adopting the Lambda Architecture. With the Lambda Architecture, as your data comes in it sits in a queue, maybe Kafka or some other kind of message queue. Any data that needs to be pulled out and processed as a stream, we process in what we call the speed layer, maybe using Spark or Flink to pull out insights and push them straight to our dashboards. Data that's going to be used for batch or transactional processing, or held as historical data, goes to our MapReduce or batch layer. Think of it as a two-pronged approach: your speed layer sits on top pulling out insights, while the data that sits in the queue also lands in HDFS, where it's still there to run Hive on top of, hold for historical data, or run MapReduce jobs and push results up to a dashboard.

So you have a two-pronged approach, with your speed layer on top and your batch layer on the bottom. As the data comes in you still have it in HDFS, but you're also able to pull insights from your real-time processing as it arrives. That's what we mean when we say Lambda Architecture: a two-layer system with a batch layer to run our MapReduce jobs and a speed layer for streaming analytics, whether through Spark, Flink, or Apache Beam and some of the other pieces. It's a really good pattern to know, and it's been in the industry for quite a long time, so if you're new to the Hadoop environment you definitely want to understand it and be able to reference back to it. There are some other architectures we'll talk about in future episodes, so make sure you subscribe so you never miss an episode. Go right now and subscribe so that the next time we talk about an architecture you don't miss it. I'll check back with you next time. Thanks folks.
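For contrast, here is a rough sketch of the two Lambda layers in PySpark: a batch layer that recomputes a complete view from data landed in HDFS, and a speed layer that computes a low-latency view straight off the queue. The paths, topic, broker address, and the event_type column are hypothetical placeholders, not from the video; notice that essentially the same counting logic gets written twice, which is exactly the duplication the Kappa Architecture sets out to remove.

```python
# Sketch of the two Lambda layers (hypothetical paths, topic, and columns).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, count

spark = SparkSession.builder.appName("lambda-layers").getOrCreate()

# --- Batch layer: complete historical view recomputed from HDFS ---
historical = spark.read.json("hdfs:///data/events/")   # data landed from the queue
batch_view = historical.groupBy("event_type").agg(count("*").alias("total"))
batch_view.write.mode("overwrite").parquet("hdfs:///views/batch_counts/")

# --- Speed layer: low-latency view over data still in the queue ---
live = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load())

realtime_view = (live
                 .selectExpr("CAST(value AS STRING) AS value", "timestamp")
                 .groupBy(window(col("timestamp"), "1 minute"))
                 .agg(count("*").alias("recent_count")))

# A serving layer would merge batch_view and realtime_view for dashboards;
# the console sink is only a stand-in here.
(realtime_view.writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())
```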

Filed Under: Streaming Analytics Tagged With: Hadoop, Hadoop Architecture, Streaming Analytics
