Thomas Henson


Ultimate Big Data Battle: Batch Processing vs. Streaming Processing

May 8, 2017 by Thomas Henson

Today developers are analyzing terabytes and petabytes of data in the Hadoop Ecosystem. Many projects are helping to accelerate this innovation. All of these projects rely on batch or streaming processing, but what is the difference between the two? Let’s dive into the debate around batch vs. streaming.

Batch Processing vs. Streaming Processing

What is Streaming Processing in the Hadoop Ecosystem?

Streaming processing typically takes place as the data enters the big data workflow. Think of streaming as processing data that has yet to enter the data lake. While the data is queued, it’s being analyzed. As new data arrives, it is read and the results are recalculated. A streaming processing job is often called a real-time application because of its ability to process quickly changing data. While streaming processing is very fast, it has yet to be truly real-time (maybe some day).

Streaming processing is so fast because it analyzes the data before it hits disk. Reading data from disk incurs more latency than reading from RAM. Of course, this all comes at a cost: you are bound by how much data you can fit in memory (for now...).

To understand the differences between batch and streaming processing, let’s use a real-time traffic application as an example. The traffic application is a community-driven driving application that provides real-time traffic data. As drivers report conditions on their commute, the data is processed and shared with other commuters. The data is extremely time sensitive, since finding out about a traffic stop or fender bender an hour later would be worthless information. Streaming processing is used to provide updates on traffic conditions, estimate time to destination, and recommend alternative routes.
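To make the streaming side of the traffic example concrete, here is a minimal sketch in plain Python (not Storm, Spark, or any other engine's API). It assumes each driver report is a hypothetical `(road, speed_mph)` pair; the road names and the `RunningSpeed` class are made up for illustration. The point is that the estimate is recalculated as each report arrives, using only a small amount of in-memory state, without re-reading earlier reports from disk.

```python
from collections import defaultdict


class RunningSpeed:
    """Keeps a per-road running average that updates on every report."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, road, speed_mph):
        # Incremental mean: no need to store or re-read earlier reports.
        self.count[road] += 1
        self.mean[road] += (speed_mph - self.mean[road]) / self.count[road]
        return self.mean[road]


def process_stream(reports):
    """Consume reports one at a time and yield the refreshed estimate."""
    state = RunningSpeed()
    for road, speed_mph in reports:
        yield road, state.update(road, speed_mph)


# Hypothetical driver reports arriving in order: (road, speed in mph)
incoming = [("I-65", 60.0), ("I-65", 20.0), ("US-31", 45.0), ("I-65", 10.0)]
stream_results = list(process_stream(incoming))
for road, avg in stream_results:
    print(f"{road}: average reported speed now {avg:.1f} mph")
```

Every report immediately changes the answer other commuters would see, which is the property that makes streaming a fit for time-sensitive data.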


What is Batch Processing in the Hadoop Ecosystem?

Batch processing and Hadoop are often thought of as being the same thing. All data is loaded into HDFS and then MapReduce kicks off a batch job to process that data. If the data changes, the job needs to be run again. It is step-by-step processing that can be paused or interrupted, but the data set cannot change while the job runs. For a MapReduce job, the data typically already exists on disk in HDFS. Since the data already exists on the DataNodes, it must be read from each disk in the cluster where it is stored. Reading and shuffling this data and the results becomes the constraint in batch processing.

That is not a big deal unless the batch process takes longer than the data stays valuable. Using the data lake analogy, batch processing analysis takes place on data in the lake (on disk), not on the streams (data feeds) entering the lake.

Let’s step back into the traffic application to see how batch is used. What happens when a user wants to find out what her commute time will be for a future trip? In that case real-time data will be less important (the further away the commute time is) and the historic data will be the key to building that model. Predicting the commute could be processed with a batch engine because the data has typically already been collected.
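The batch side of the traffic example can be sketched the same way. This is plain Python rather than MapReduce, and it assumes the historic trips are hypothetical `(departure hour, minutes)` records that have already been collected. The whole data set is read in one pass to build a simple per-hour average, which then serves as a naive prediction for a future commute.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical historic trips already collected: (departure hour, minutes)
historic_trips = [(7, 32), (7, 40), (8, 55), (8, 49), (8, 52), (17, 45)]


def build_commute_model(trips):
    """One batch pass over the full data set: average minutes per hour."""
    by_hour = defaultdict(list)
    for hour, minutes in trips:
        by_hour[hour].append(minutes)
    return {hour: mean(values) for hour, values in by_hour.items()}


def predict_commute(model, hour):
    """Look up the estimate; None if no history exists for that hour."""
    return model.get(hour)


commute_model = build_commute_model(historic_trips)
print(predict_commute(commute_model, 8))
```

If new trips are added to the history, the model is stale until `build_commute_model` is run again over the full data set, which mirrors the "job needs to be run again" behavior described above.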

Is Streaming Better Than Batch?

Asking if streaming is better than batch is like asking which Avenger is better. If the Hulk can tear down buildings, does that make him better than Iron Man? What about the Avengers in Age of Ultron when they were trying to reflect off the Vibranium core? How would all that strength have helped there? In that case Iron Man and Thor were better than the Hulk.

Just like with the Avengers, streaming and batch are better when working together. Streaming processing is extremely well suited for cases when time matters. Batch processing shines when all the data has been collected and is ready for testing models. There is no "one is better than the other" argument right now with batch vs. streaming.

The Hadoop eco-system is seeing a huge shift toward streaming and batch coupled together to provide both processing models. Both workflow types come at a cost, so analyzing data with a streaming workflow when it could be analyzed in a batch workflow adds unnecessary cost to the results. Be sure the workflow matches the business objective.

Batch and Streaming Projects

Both workflows are fundamental to analyzing data in the Hadoop eco-system. Here are some of the projects and which workflow camp they fall into:

MapReduce – MapReduce is where it all began. Hadoop 1.0 was all about storing your data in HDFS and using MapReduce to analyze that data once it was loaded into HDFS. The process could take hours or days depending on the amount of data.

Storm – Storm is a real-time analysis engine. Where MapReduce processes data in batches, Storm does analysis in streams, as data is ingested. It was once seen as the de facto streaming analysis engine but has lost some of that momentum with the emergence of other streaming projects.

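Spark – General-purpose processing engine for data at scale that handles both batch and streaming workloads (Spark Streaming processes a stream as a series of small micro-batches). Widely regarded as the most popular streaming engine in the Hadoop eco-system, with a very active contributor base. Does not require data to be in HDFS for analysis.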

Flink – Hybrid processing engine that uses streaming and batch processing models. Data is broken down into bounded (batch) and unbounded (streaming) sets. At its core it is a stream processing engine that incorporates batch processing.

Beam – Another hybrid project: a unified programming model that covers both streaming and batch processing. Pipelines run on engines such as Spark and Flink. It has heavy support from Google. There is a lot of optimism around the project right now in the Hadoop eco-system because of its ability to run both batch and streaming processing depending on the workload.
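The split between batch-style and stream-style data sets that the hybrid projects above rely on can be illustrated with a small sketch. This is plain Python and does not use the Flink or Beam APIs; the `flag_slow` function, the threshold, and the `sensor_feed` generator are hypothetical. The same processing logic is applied once to a finite collection and once to a source that never ends.

```python
import itertools


def flag_slow(readings, threshold_mph=25.0):
    """Lazily yield readings below the threshold; works on any iterable."""
    for speed in readings:
        if speed < threshold_mph:
            yield speed


# Bounded (batch): the whole data set exists before processing starts.
bounded = [60.0, 20.0, 45.0, 10.0]
batch_result = list(flag_slow(bounded))


def sensor_feed():
    """Unbounded (streaming): a hypothetical feed that never ends."""
    for i in itertools.count():
        yield 10.0 if i % 3 == 0 else 50.0


# An unbounded source has no "end", so take results as they are produced.
first_two = list(itertools.islice(flag_slow(sensor_feed()), 2))
print(batch_result, first_two)
```

Writing the logic once and choosing the source later is, loosely, the appeal of the hybrid model: the business rule stays the same whether the workload is a historic data set or a live feed.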

Advice on Batch and Streaming Processing

At the end of the day, a solid developer will want to understand both workflows. It’s all going to come down to the use case and how either workflow will help meet the business objective.

Filed Under: Big Data
