Thomas Henson

Bound vs. Unbound Data in Real Time Analytics

August 9, 2017 by Thomas Henson

Breaking The World of Processing

Streaming and real-time analytics are pushing the boundaries of our analytic architecture patterns. In the big data community we now break analytics processing down into batch or streaming. If you glance at the top contributions, most of the excitement is on the streaming side (Apache Beam, Flink, and Spark).

What is causing the break in our architecture patterns?

A huge reason for the break in our existing architecture patterns is the concept of Bound vs. Unbound data. This concept is as fundamental as the Data Lake or Data Hub and we have been dealing with it long before Hadoop. Let’s break down both Bound and Unbound data.

Bound Data

Bound data is finite and unchanging data, where everything is known about the set of data. Typically Bound data has a known ending point and is relatively fixed. An easy example is last year’s sales numbers for the Tesla Model S. Since we are looking into the past, we have a perfect timebox with a fixed number of results (number of sales).
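
As a minimal sketch of what "everything is known about the set" buys us, the snippet below summarizes a closed year of sales in a single pass. The monthly figures and the `summarize_bound` helper are made up for illustration and are not real sales data.

```python
# Hypothetical monthly unit sales for one closed calendar year (made-up numbers).
LAST_YEAR_SALES = [120, 95, 130, 110, 105, 140, 150, 125, 115, 135, 145, 160]

def summarize_bound(sales):
    """Summarize a finite data set in one pass; nothing can arrive later."""
    total = sum(sales)  # safe because the list has a known end
    return {"months": len(sales), "total": total, "average": total / len(sales)}

summary = summarize_bound(LAST_YEAR_SALES)
print(summary)
```

Because the timebox is closed, the summary never needs to be revised, which is exactly the property Unbound data lacks.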

Traditionally we have analyzed data as Bound data sets, looking back into the past. We used historic data sets to look for patterns or correlations that could be studied to improve future results. The timeline on these future results was measured in months or years.

For example, testing a marketing campaign for the Tesla Model S would take place over a quarter. At the end of the quarter, sales and marketing metrics are measured and the campaign is deemed a success or a failure. Tweaks to the campaign are implemented for the next quarter and the waiting cycle continues. Why not tweak and measure the campaign from the onset?

Our architectures and systems were built to handle data in this fashion because we didn’t have the ability to analyze data in real time. Now, with the lower cost of CPU and the explosion in Open Source Software for analyzing data, future results can be measured in days, hours, minutes, and seconds.

Unbound Data

Unbound data is unpredictable, infinite, and not always sequential. The data creation is a never-ending cycle, similar to Bill Murray in Groundhog Day. It just keeps going and going. For example, data generated on a Web Scale Enterprise Network is Unbound. The network traffic messages and logs are constantly being generated, external traffic can scale up and generate more messages, remote systems with latency can report logs out of sequence, and so on. Trying to analyze all this data as Bound data is asking for pain and failure (trust me, I’ve been down this road).
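
To make "infinite and not always sequential" concrete, here is a small sketch in plain Python. The `log_events` generator, its 5-second delay, and the `process` helper are all hypothetical stand-ins for a real log source; the sketch only shows that an endless stream can be examined a slice at a time and that event times may arrive out of order.

```python
import itertools
from collections import defaultdict

def log_events():
    """Hypothetical endless log source: yields (event_time_seconds, message)."""
    for i in itertools.count():
        # Every third event reports 5 seconds late, mimicking a remote
        # system with latency, so event times arrive out of order.
        event_time = max(i - 5, 0) if i % 3 == 2 else i
        yield event_time, f"msg-{i}"

def process(events, window_seconds):
    """Count events per event-time window and track out-of-order arrivals."""
    counts = defaultdict(int)
    late = 0
    newest_seen = -1
    for event_time, _message in events:
        if event_time < newest_seen:
            late += 1  # arrived after a newer event was already seen
        newest_seen = max(newest_seen, event_time)
        counts[event_time // window_seconds] += 1
    return dict(counts), late

# sum() or len() over log_events() would never return, so look at a slice.
counts, late = process(itertools.islice(log_events(), 20), window_seconds=10)
print(counts, late)
```

Any count taken this way is only a snapshot: the next slice of the stream can still add events to a window that looked complete.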

Bound vs. Unbound

Our world is built on processing unbound data. Think of ourselves as machines and our brains as the processing engine. Yesterday I was walking across a parking lot with my 5-year-old daughter. How much Unbound data (stimuli) did I process and analyze?

  • Watching for cars in the parking lot and calculating where and when to walk
  • Ensuring I was holding my daughter’s hand and that she was still in step with me
  • Knowing the location of my car and the path to get to it
  • Puddles, pot holes, and pedestrians to navigate

Did all this data (stimuli) come in concise and finite fashion for me to analyze? Of course not!

All the data points were unpredictable and infinite. At any time during our walk to the car more stimuli could be introduced (cars, weather, people, etc.). In the real world all our data is Unbound and always has been.

How to Manage Bound vs. Unbound Data

What does this mean? It means we need better systems and architectures for analyzing Unbound data, but we also need to support those Bound data sets in the same system. Our systems, architectures, and software have been built to run Bound data sets ever since the 1970s, when relational databases were built to hold collected data. The problem is that in the next 2-4 years we are going to have 20-30 billion connected devices, all sending data that we as consumers will demand instant feedback on!

On the processing side, the community has shifted to true streaming analytics projects, with Apache Flink, Apache Beam, and Spark Streaming to name a few. Flink is a project showing strong promise of consolidating our Lambda Architecture into a Kappa Architecture. By switching to a Kappa Architecture, developers and administrators can support one code base for both streaming and batch workloads. Not only does this help with the technical debt of managing two systems, but it also eliminates the need for multiple writes of data blocks.
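
The "one code base" idea can be sketched without any streaming framework. The snippet below is not Flink or Beam code; `running_total`, `stored_log`, and `live_source` are hypothetical names, and the sketch only assumes that a batch job can be treated as a replay of stored events through the same logic that handles live events.

```python
from typing import Iterable, Iterator

def running_total(events: Iterable[int]) -> Iterator[int]:
    """Single processing path: emit an updated total after every event."""
    total = 0
    for value in events:
        total += value
        yield total

# Batch workload: replay a stored (Bound) log through the streaming logic.
stored_log = [3, 1, 4, 1, 5]
batch_result = list(running_total(stored_log))[-1]

def live_source() -> Iterator[int]:
    """Hypothetical live feed; here it simply re-emits the stored events."""
    yield from stored_log

# Streaming workload: the same function, with results available as events arrive.
stream_results = list(running_total(live_source()))
print(batch_result, stream_results)
```

The design choice is that Bound data is handled as a special case of Unbound data, so there is no second batch implementation to keep in sync.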

Scale-out architectures have given us the ability to quickly expand to meet demand. Scale-out is not just Hadoop clusters that allow for Web Scale, but also the ability to scale compute-intensive workloads separately from storage-intensive ones. Most Hadoop clusters are extremely CPU top-heavy because each time storage is needed, CPU is added as well.

Will your architecture support 10 TB more? How about 4 PB? Get ready for an explosion in Unbound data.

Filed Under: Streaming Analytics Tagged With: Big Data, Real-Time Analytics, Streaming Analytics, Unstructured Data
