Bound vs. Unbound Data in Real Time Analytics

August 9, 2017 by Thomas Henson


Breaking The World of Processing

Streaming and real-time analytics are pushing the boundaries of our analytic architecture patterns. In the big data community we now break down analytics processing into batch or streaming. If you glance at the top contributions, most of the excitement is on the streaming side (Apache Beam, Flink, and Spark).

What is causing the break in our architecture patterns?

A huge reason for the break in our existing architecture patterns is the concept of Bound vs. Unbound data. This concept is as fundamental as the Data Lake or Data Hub, and we have been dealing with it since long before Hadoop. Let's break down both Bound and Unbound data.

Bound Data

Bound data is finite and unchanging data, where everything is known about the set of data. Typically Bound data has a known ending point and is relatively fixed. An easy example is last year's sales numbers for the Tesla Model S. Since we are looking into the past, we have a perfect timebox with a fixed number of results (number of sales).

Traditionally we have analyzed data as Bound data sets looking back into the past, using historic data sets to look for patterns or correlations that could be studied to improve future results. The timeline on these future results was measured in months or years.

For example, testing a marketing campaign for the Tesla Model S would take place over a quarter. At the end of the quarter, sales and marketing metrics are measured to deem the campaign a success or failure. Tweaks are implemented for the next quarter, and the waiting cycle continues. Why not tweak and measure the campaign from the onset?

Our architectures and systems were built to handle data in this fashion because we didn't have the ability to analyze data in real-time. Now, with the lower cost of CPU and the explosion of open source software for analyzing data, future results can be measured in days, hours, minutes, and seconds.

Unbound Data

Unbound data is unpredictable, infinite, and not always sequential. The data creation is a never-ending cycle, similar to Bill Murray in Groundhog Day. It just keeps going and going. For example, data generated on a web-scale enterprise network is Unbound: network traffic messages and logs are constantly being generated, external traffic can scale up and generate more messages, and remote systems with latency can report logs out of sequence. Trying to analyze all this data as Bound data is asking for pain and failure (trust me, I've been down this road).

Bound vs. Unbound

Our world is built on processing unbound data. Think of ourselves as machines and our brains as the processing engine. Yesterday I was walking across a parking lot with my 5-year-old daughter. How much Unbound data (stimuli) did I process and analyze?

  • Watching for cars in the parking lot and calculating where and when to walk
  • Ensuring I was holding my daughter’s hand and that she was still in step with me
  • Knowing the location of my car and the path to get to it
  • Puddles, pot holes, and pedestrians to navigate

Did all this data (stimuli) come in concise and finite fashion for me to analyze? Of course not!

All the data points were unpredictable and infinite. At any time during our walk to the car, more stimuli could be introduced (cars, weather, people, etc.). In the real world all our data is Unbound and always has been.
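To make the distinction concrete, here is a minimal Python sketch of my own (a toy illustration, not tied to any framework). The same processing function works on a Bound data set, which is just a finite list, but an Unbound source behaves like a generator that never ends, so we have to act on each record as it arrives rather than waiting for the set to complete.

```python
import random
from itertools import islice

def process(reading):
    # Placeholder analytic step: flag any reading over a threshold.
    return reading > 90

# Bound: finite and known up front -- we can wait for all of it.
bound_data = [87, 93, 71, 99]
print([process(r) for r in bound_data])  # completes, has an end

def unbound_source():
    # Unbound: a never-ending stream, like network logs.
    while True:
        yield random.randint(0, 100)

# We can never "finish" an Unbound source; we act record by record.
for reading in islice(unbound_source(), 5):  # islice only so the demo terminates
    if process(reading):
        print(f"alert: {reading}")
```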


How to Manage Bound vs. Unbound Data

What does this mean? It means we need better systems and architectures for analyzing Unbound data, but we also need to support those Bound data sets in the same system. Our systems, architectures, and software have been built to run Bound data sets, going back to the 1970s when relational databases were built to hold the data we collected. The problem is that in the next 2-4 years we are going to have 20-30 billion connected devices, all sending data that we as consumers will demand instant feedback on!

On the processing side, the community has shifted to true streaming analytics projects such as Apache Flink, Apache Beam, and Spark Streaming, to name a few. Flink is a project showing strong promise of consolidating our Lambda Architecture into a Kappa Architecture. By switching to a Kappa Architecture, developers and administrators can support one code base for both streaming and batch workloads. Not only does this help with the technical debt of managing two systems, it also eliminates the need for multiple writes of data blocks.
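As a rough illustration of the Kappa idea, here is a minimal PyFlink sketch (assuming PyFlink 1.12+ and its RuntimeExecutionMode setting; the sensor data is made up). The same pipeline definition can run against a Bound input in batch mode or an Unbound source in streaming mode, so there is only one code base to maintain.

```python
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

def build_pipeline(env):
    # One pipeline definition shared by batch and streaming runs.
    readings = env.from_collection([("sensor-1", 93), ("sensor-2", 71), ("sensor-1", 99)])
    readings.filter(lambda r: r[1] > 90).print()

env = StreamExecutionEnvironment.get_execution_environment()

# Kappa in miniature: flip one switch instead of maintaining two systems.
# BATCH treats the input as Bound; STREAMING treats it as Unbound.
env.set_runtime_execution_mode(RuntimeExecutionMode.BATCH)

build_pipeline(env)
env.execute("bound-vs-unbound-demo")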

Scale-out architectures have provided us the ability to quickly expand with demand. Scale-out is not just Hadoop clusters that allow for web scale, but the ability to scale compute-intensive workloads separately from storage-intensive ones. Most Hadoop clusters are extremely CPU top-heavy because each time storage is needed, CPU is added as well.

Will your architecture support 10 TBs more? How about 4 PBs? Get ready for the explosion in Unbound data…


Top 3 Recommendations for Real-Time Analytics

July 11, 2017 by Thomas Henson

I honestly think developing real-time analytics is one of the hardest feats for developers to take on!

I’ll admit I’m for sure biased, but that doesn’t make me wrong.

My first project in the Hadoop ecosystem was a real-time application, back when the Hadoop community still didn't have real-time processing. I've always been honest in my posts and always will be. So let me not sugarcoat this…the project sucked and was deemed a failure!

My team didn’t understand the requirements for real-time and couldn’t meet the requirements. The project was over budget and delayed. However, all was not lost, years have past and I learned a lot and the Hadoop community now has a new frameworks to speed up processing in real-time. Before developing your real-time analytics project please read these top 3 recommendations for real-time analytics! You will thank me….


What is Real-Time Analytics

Real-time analytics is the ability to analyze data as soon as it is created. Not only as soon as it's created, but before the full data set has arrived. Traditional batch architectures have all the data in place before processing begins, but real-time processing happens as the data is created.

To be picky, there is really no such thing as real-time analytics! What we have right now is near real-time analytics, which to humans is millisecond speed, or just faster than our competitors. For true real-time we would have to analyze the data at the same instant it occurs, and right now there is always some latency from networking, processing, etc. Let's table this discussion until quantum computing becomes mainstream…

Take for example a GPS-enabled application for brick and mortar stores. Using the phone as a sensor, the application knows the customer's proximity to the store. When the customer is close to the store, an offer is sent via the phone. Sounds simple, but imagine millions of sensors entering data into the system while it tries to analyze all that location information. Add to this example the need to know store locations, store hours, local events, inventory levels, etc. Many things could go wrong here. For example, the application could send an offer for a product not in stock, send an offer too late once the customer is out of range, or send an offer for a store that is closed.
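To see why timing matters, here is a tiny Python sketch of the proximity check (my own toy illustration; the store coordinates, radius, and offer function are all made up). It uses the haversine formula to decide whether a customer is within the offer radius, and every millisecond spent deciding is a millisecond the customer keeps walking.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_meters(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points in meters.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

STORE = (34.7304, -86.5861)  # hypothetical store coordinates
OFFER_RADIUS_M = 200         # hypothetical "close to the store" threshold

def maybe_send_offer(customer_id, lat, lon, in_stock):
    # Only worth sending if the customer is close AND the product exists.
    if in_stock and haversine_meters(lat, lon, *STORE) <= OFFER_RADIUS_M:
        print(f"send offer to {customer_id}")

maybe_send_offer("customer-42", 34.7310, -86.5855, in_stock=True)
```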

Still think building those real-time applications is easy? Let’s look into the future…

Tsunami of Real-Time Data

How much data are we talking about for the future of real-time analytics? Gartner predicts that by 2020 we will have 20.4 billion devices connected worldwide. The prediction roughly estimates a world population of 7 billion people with an average of 3 devices per person. Sounds like a lot; however, I think it's a conservative prediction. How many devices do you have connected in your home? I have 25 in my home, and I'm not considered on the bleeding edge. I've talked with quite a few people who have as many as 75-plus. So let's say a quarter of the population has 15 devices by 2020; that would total closer to 28-plus billion devices.
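Here is the back-of-the-envelope arithmetic in Python (the fraction of a device each for the remaining population is my assumption, added only to make the figures concrete):

```python
population = 7_000_000_000

# Gartner's estimate: roughly 3 devices per person.
gartner_total = population * 3      # ~21 billion, close to the 20.4B forecast

# My scenario: a quarter of the population with 15 devices each.
heavy_users = population // 4
heavy_total = heavy_users * 15      # 26.25 billion from the heavy quarter alone

# Assumption for illustration: everyone else averages half a device,
# which still pushes the total past 28 billion.
total = heavy_total + (population - heavy_users) * 0.5

print(gartner_total, total)         # 21e9 vs ~28.9e9
```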

Recommendations for Real-Time Analytics

Since we know why stream processing and real-time analytics are growing at such a rapid pace, let's discuss recommendations for building those real-time applications.

1 – Timing is Everything

Know the time to value for the insights of the data. All data has a different value assigned to it, and that value degrades over time. Picture our previous example of a retailer using location services to send offers via a mobile application. How valuable is a potential customer's location? It's really valuable, but only if the application can process the data quickly enough to send an incentive while the customer is near the physical location. Otherwise, the application is providing historical information.

After understanding the time value of the data, you can find the correct framework (Flink, Spark, Storm, etc.) to process it. Most streaming data needs to be processed in real-time for specific insights, for example pulling that user location and time. Remember, not all data is processed the same way: batch vs. streaming.
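As a sketch of what time to value can look like in code (the half-life and threshold are invented for illustration), you can model the value of a location event as decaying exponentially with age and drop events that are no longer worth acting on:

```python
import time

HALF_LIFE_S = 30.0  # hypothetical: a location event loses half its value in 30s
MIN_VALUE = 0.25    # hypothetical: below this, the offer is just history

def event_value(created_at, now=None):
    # Exponential decay: value halves every HALF_LIFE_S seconds of age.
    age = (now or time.time()) - created_at
    return 0.5 ** (age / HALF_LIFE_S)

def worth_acting_on(event):
    return event_value(event["created_at"]) >= MIN_VALUE

event = {"customer": "customer-42", "created_at": time.time() - 10}
print(worth_acting_on(event))  # True: 10s old, value ~0.79
```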

2 – Make Sure Applications will Scale

Make sure your real-time application can scale. Not just scale with large influxes of data, but independently for processing and storage. In the future of IoT and streaming, data sources will be extremely unpredictable. One day you might ingest 2 TB of new data and the next, 2 PB. If you think I'm joking, check out my talk from the DataWorks Summit on the Future Architecture of Streaming Analytics. Build applications on a foundation of architectures, services, and components that can scale. Remember our friend Murphy and his law about how things can go wrong.

Scaling isn’t all focused on just being able to ingest more data, but scaling independently with compute and capacity. Make sure your real-time application supports a data lake strategy. Isilon’s Data Lake Platform give the ability to separate compute and capacity when growing your Hadoop clusters. So when a new set of data comes in that is 10 TB of data that isn’t really growing and probably will only run weekly or monthly you can scale your capacity without having to add unneeded capacity. Also, a data lake strategy gives you the ability to opt out of the 3x replication with 200% utilization vs. 80% utilization on Isilon. Whether you use Isilon or not make sure you have a data lake strategy that builds on the architecture of independent scaling!!

3 – Life Cycle Cost of Data

Since we know the value of data decreases over time, we need to assign a cost to that data. I know you probably just rolled your eyes when I mentioned the cost of data, but it's important to understand that data is a product. Just like Amazon sells books at different prices, the cost we assign to data should reflect that its value varies over time.

As big data developers we want to hold on to data forever and bring in as many new sources as possible. However, when your manager or CFO gets the bill for all the capacity you need, you will be sitting in endless meetings and writing up justification reports about why you are holding all this data. That means less time doing what we love: coding in our Hadoop cluster. Know the value of your data and plan accordingly!!

Wrap Up of Real-time Analytics

Finishing up our discussion, remember that real-time analytics is the processing of data as soon as it is generated. By analyzing data as it's generated, decisions can be made quicker, which helps create better applications for our users. When building real-time applications, make sure you follow my 3 recommendations: understand the time value of the data, build on systems that scale independently, and assign value to the data. Successfully building real-time applications depends on these 3 core points.

