August 2017 - Thomas Henson

Breaking The World of Processing

Streaming and Real-Time analytics are pushing the boundaries of our analytic architecture patterns. In the big data community we now break down analytics processing into batch or streaming. If you glance at the top contributions most of the excitement is on the streaming side (Apache Beam, Flink, & Spark).

What is causing the break in our architecture patterns?

A huge reason for the break in our existing architecture patterns is the concept of Bound vs. Unbound data. This concept is as fundamental as the Data Lake or Data Hub and we have been dealing with it long before Hadoop. Let’s break down both Bound and Unbound data.

Bound Data

Bound data is finite and unchanging data, where everything is known about the set of data. Typically Bound data has a known ending point and is relatively fixed. An easy example is what was last year’s sales numbers for Telsa Model S. Since we are looking into the past we have a perfect timebox with a fixed number of results (number of sales).

Traditionally we have analyzed data as Bound data sets looking back into the past. Using historic data sets to look for patterns or correlation that can be studied to improve future results. The timeline on these future results were measured in months or years.

For example, testing a marketing campaign for the Telsa Model S would take place over a quarter. At the end of the quarter sales and marketing metrics are measured deeming a success or failure for the campaign. Tweaks for the campaign are implemented for next quarter and the waiting cycle continues. Why not tweak and measure the campaign from the first onset?

Our architectures and systems were built to handle data in this fashion because we didn’t have the ability to analyze data in real-time. Now with the lower cost for CPU and explosion in Open Source Software for analyzing data, future results can be measured in days, hours, minutes, and seconds.

Unbound Data

Unbound data is unpredictable, infinite, and not always sequential. The data creation is a never ending cycle, similar to Bill Murray in Ground Hog Day. It just keeps going and going. For example, data generated on a Web Scale Enterprise Network is Unbound. The network traffic messages and logs are constantly being generated, external traffic can scale-up generating more messages, remote systems with latency could report non-sequential logs, and etc. Trying to analyze all this data as Bound data is asking for pain and failure (trust me I’ve been down this road).

Our world is built on processing unbound data. Think of ourselves as machines and our brains as the processing engine. Yesterday I was walking across a parking lot with my 5 year old daughter. How much Unbound data (stimuli) did I process and analyze?

Watching for cars in the parking lot and calculating where and when to walk
Ensuring I was holding my daughter’s hand and that she was still in step with me
Knowing the location of my car and path to get to car
Puddles, pot holes, and pedestrians to navigate

Did all this data (stimuli) come in concise and finite fashion for me to analyze? Of course not!

All the data points were unpredictable and infinite. At any time during our walk to the car more stimuli could be introduced(cars, weather, people, etc). In the real world all our data is Unbound and has always been.

How to Manage Bound vs. Unbound Data

What does this mean? It means we need better systems and architectures for analyzing Unbound data, but we also need to support those Bound data sets in the same system. Our systems, architectures, and software has been built to run bound data sets. Since the 1970’s where relations database were built to hold data collected. The problem is in the next 2-4 years we are going to have 20 – 30 billion connected devices. All sending data that we as consumers will demand instant feedback on!

On the processing side the community has shifted to true streaming analytics projects with Apache Flink, Apache Beam and Spark Streaming to name a few. Flink is a project showing strong promise of consolidating our Lambda Architecture into a Kappa Architecture. By switching to a Kappa Architecture developers/administrators can support on code base for both streaming and batch workloads. Not only does this help with the technical debt of managing two system, but eliminates the need for multiple writes for data blocks.

Scale-out architectures have provided us the ability to quickly expand our demand. Scale-out is not just Hadoop clusters that allow for Web Scale, but the ability to scale compute intense workloads vs. storage intense. Most Hadoop cluster are extremely CPU top heavy because each time storage is needed CPU is added as well.

Will your architecture support 10 TBs more? How about 4 PBs? Get ready for explosion in Unbound data….

Should I Use Kappa Architecture For Real-Time Analytics?

Analytics architectures are challenging to design. If you follow the latest trends in Big Data, you’ll see a lot different architecture patterns to chose from.

Architects have a fear of choosing the wrong pattern. It’s what keeps them up at night.

What architecture should be used for designing a real-time analytics application. Should I use the Kappa Architecture for real-time analytics? Watch this video and find out!

Video

Transcript

Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of Big Data, Big Questions. Today’s question is all about the Kappa architecture and real-time analytics. So, our question today came in from a user, and it’s going to be about how we can tackle the Kappa architecture, and is it a good fit for those real-time analytics, for sensor networks, how it all kind of works together. Find out more, right after this.

So, today’s question came in from Francisco. And it’s Francisco from Chile, and he says, “Best regards from Chile.” So, Francisco, thanks for your question, and thanks for watching. So, his question is, “Hi, I’m building a system for processing sensor network data in near real-time. All this time, I’ve been studying the Lambda architecture in order to achieve this. But now, I’ve ran into the Kappa architecture, and I’m having trouble deciding between which one.” He says, what he wants to do is, he wants to analyze this near real-time data in real-time. So, as the data is coming from the sensors, he wants to obtain knowledge, and then push those out in some kind of a UI. So, some kind of charts and graphs, and he’s saying, do we have any suggestions about why we would choose one of these architectures that we would recommend for him? Well, thanks again, Francisco, for your question. And so, yes, I have some thoughts about how we should set up that network. So, but let’s review, real quick, about what we’ve talked about in previous videos of the Lambda architecture, and what the Kappa architecture is, and then how we’re going to implement those.

So, if you remember, the Lambda architecture – we have two different streams. And so, we have a batch-level stream and we have a real-time. So, as your data comes in, it might come in through something that’s a queueing system, called Kafka, or it’s in a area right there, where we’re just using it to queue all the data as it comes in. And so, for that real-time, you will follow that real-time stream, and so, you might use Spark, or Flink, or some kind of real-time processing engine that’s going to do the analytics, and push that out to some of your dashboards for data that’s just as it’s coming in, right? So, as soon as that data comes in, you want to analyze it as quick as you can – it’s what we call near real-time, right? But, you also have your batch layer. So, for your batch processing, for your storing of the data, right? Because, at some point, your queueing system, whether it’s Kafka or something, it’s going to get very, very large, and some of that data’s going to be old, and you don’t need to have it in a area where you can stream it out and analyze it all the time. So, you want it to be able to tier, or you want to move that data off to, maybe, HTFS, or S3 Object. And so, from there, you can use your distributed search, you can have it in HTFS, use Cassandra, or some other kind of… maybe it’s HBase, or some kind of NoSQL database that’s working on top of Hadoop. And then, you also can run your batch jobs there. So, you can run your MapReduce jobs there, whether it’s traditional MapReduce, or whether it’s Spark’s batch-level processing. But, you have two layers.

And so, that’s one of the challenges with the Lambda architectures – you have these two different layers, right? So, you’re supporting two levels of code, and for a lot of your processing, a lot of your data that’s coming in, maybe you’re just using the real-time there, but maybe the batch processing is used every month. But, you’re still having to support those two different levels of code. And so, that’s why we talk about the Kappa architecture, right? So, the Kappa architecture, it simplifies it. So, as your data comes in, you want to have your data that’s in one queueing system, or one storage device – where your data comes in, you can do your analytics on it, so you can do your real-time processing, and push that data our to your dashboards, your web applications, or however you’re trying to consume that data. Then, also do your distributed search, as well. So, you can… If you’re using ElasticSearch, or some other kind of distributed search, maybe it’s Solr, or some of the other ones, you can be able to analyze that data and have it supporting that real-time search, as well. But, you might use Spark, and Flink, for your real-time analytics, but you also wanted to do your batch, too. So, you’re going to have some batch processing that’s going to be done. But, instead of creating a whole ‘nother tier, you want to be able to do that within that queueing system that you have. And so, whether you’re using Kafka, or whether you’re using Pravega, which is a new open-source product that just was released by Dell, you want to be able to have all that data in one spot, so that when you’re queueing that data, you know that it’s going to be there. But, you can also do your analytics on it. So, you can use it for your distributed search, you can use it for those streaming analytics jobs, but also, whenever you go back to do some of your batch, or some of your transitional processing, you know that it’s in that same location, too. That way, there’s not as much redundancy, right? So, you’re not having to store data in multiple locations, and it’s taking up more room than you really need.

And so, this is what we call the Kappa architecture, and this is why it’s so popular right now is, it simplifies that workstream. And so, when we start deciding between those two, back to Francisco’s question – Francisco, your application – it seems like it has a real need for real-time, right? So, there’s a lot of things that are going on there from the network, and a lot of traffic that’s coming in. And so, this is going to be where we break down a couple of different concepts. And so, we talked about bound and unbound. So, a bound dataset is data that we know how much data is going to come in, right? Or, we wait to do, and do the processing on that data, after it’s already came in. And so, when you think of bound data, think of sales orders, think of interview numbers. And so, we know that’s largely what we would consider transactional data, so we know all the data as it’s coming in, and then we’re running the calculation then. But, what your data is, is unbound. And so, when we talk about unbound data is, you don’t know how much data is coming in, right? And it’s infinitive. It’s not going to end. So, with network traffic, you don’t know how long that’s going to be going on. So, the network traffic’s going to continue to come in, you don’t know… You might get one terabyte at one point, you might get ten terabytes, you might scale up all in one second. And then, as the data comes in, it might come in uneven, right? So, you might have some that’s timestamped a little bit earlier than other data that’s coming in, too. And so, that’s what we call unbound data.

And so, for unbound data, the Kappa architecture works really well. It also works really well for bound data, too. So, when we start to look at that, and looking at your project, my recommendation is to use the Kappa architecture – go ahead and use it because you’re using real-time data. But then, for those batch levels, and I’m sure that you’ll start having some processing and some pieces that you start doing that are batch – you can also consume that in the Kappa architecture, as well. And so, there are some things you can look into, so, you can choose streaming analytics, with Spark streaming, you can look at Flink, Beam – those are some of the applications you can use. But, you can also use distributed search, so you can use Solr, you can use ElasticSearch – all those are going to work well, whether you choose the Kappa architecture, or whether you choose the Lambda architecture. My recommendation is, go with the Kappa architecture.

Well, thanks guys, that’s another episode of Big Data, Big Questions. Make sure you subscribe, so that you never miss an episode. If you have any questions, have your question answered on Big Data, Big Questions – just go to the website, put your comments below, reach out to me on Twitter, however you want. Submit those questions, and have me answer those questions here on Big Data, Big Questions. Thanks again, guys.

Archives for August 2017

Bound vs. Unbound Data in Real Time Analytics