Thomas Henson

  • Data Engineering Courses
    • Installing and Configuring Splunk
    • Implementing Neural Networks with TFLearn
    • Hortonworks Getting Started
    • Analyzing Machine Data with Splunk
    • Pig Latin Getting Started Course
    • HDFS Getting Started Course
    • Enterprise Skills in Hortonworks Data Platform
  • Pig Eval Series
  • About
  • Big Data Big Questions

DataWorks Summit 2017 Recap

June 19, 2017 by Thomas Henson Leave a Comment

All Things Data

Just coming off an amazing week with a ton of information in the Hadoop Ecosystem. It’s been a 2 years since I’ve been to this conference. Somethings have changed like the name from Hadoop Summit to DataWorks Summit. Other things stayed the same like breaking news and extremely great content.

I’ll try to sum up my thoughts from the sessions I attended and people I talked with.

First there was an insanely great session called The Future Architecture of Streaming Analytics put on by a very handsome Hadoop Guru, Thomas Henson. It was a well received session where I talked about how to architect streaming application for the next 2-5 years where we will see some 20 billion plus connected devices worldwide.

DataWorks Summit 2017 Recap

Hortonwork & IBM Partnership

Next there was breaking news with Hortonworks and IBM partnerships. The huge part of the partnership was that IBM’s BigInsights will merge with Hortonworks Data Platform. Both IBM and Hortonworks are part of the open data platform .

What does this mean to the Big Data community? Well more consolidation of Hadoop distros packages, but more collaboration into the big data frameworks. This is good for the community because it allows us to focus on the open-source frameworks inside the big data community. Now instead of having to work though the difference of BigInsights vs. HDP, development will be poured into Spark, Ambari, HDFS, etc.

Hadoop 3.0 Community Updates

New updates coming the with the next release of Hadoop 3.0 was great! There is a significant amount of changes coming with the release which is slated for GA August 15, 2017. The big focus is going to be with the introduction of Erasure Coding for data striping, supporting containers for YARN, and some minor changes. Look for an in-depth look at Hadoop 3.0 in a follow up post.

Hive LLAP

If you haven’t looked deeply at Hive in the last year or so….you’ve really missed out. Hive is really starting to mature to a EDW  on Hadoop!! I’m not sure how many different breakout sessions there were on Hive LLAP but I know it was mentioned in most I attended.

The first Hive breakout session was hosted by Hortonworks Co-founder Alan Gates. He walked through the latest updates and future roadmap for Hive.  Also the audience was posed a question: What do we except in a Data Warehouse?

  • Governance
  • High Performance
  • Management & Monitoring
  • Security
  • Replication & DR
  • Storage Capacity
  • Support for BI

We walked through where the Hive community was in addressing these requirements. Hive LLAP was certainly there on the higher performance. More on that now….

Another breakout session focused on a shoot off for the Hadoop SQLs. Wow this session was full and very interesting. Here is the list of SQL engines tested in the shoot out:

  • MapReduce
  • Presto
  • Spark SQL
  • Hive LLAP

All the test were run using the Hive Benchmark Testing on the same hardware. Hive LLAP was the clear winner with MapReduce the huge loser (no surprise here). The Spark SQL performed really well but there were issues using the thrift server which might have skewed the results. Kerberos was not implemented on the testing as well.

Pig Latin Updates

Of course there were sessions on Pig Latin! Yahoo presented their results on converting all Pig jobs from MapReduce to Tez jobs. After seeing the keynote about Yahoo’s conversation rate from MapReduce jobs to Tez/Spark/etc jobs shows that Yahoo is still running a ton of Pig jobs. Moving to Tez has increased the speed and efficiency of the Pig jobs at Yahoo. Also in the next few months Pig on Spark should be released.

DataWorks Summit 2017 Recap

Closing Thoughts

After missing last year at the Hadoop Summit or DataWorks Summit it was fun to be back. DataWorks Summit is still the premier events for Hadoop developer/admins to come and learn new features developed by the community. For sure this year the theme seemed to be benchmark testing, mix between Streaming Analytics, and Big Data EDW. It’s definitely an event I will try to make again next year to keep up with the Hadoop community.

 

Filed Under: Big Data Tagged With: DataWorks Summit, Hadoop, Streaming Analytics

Subscribe to Newsletter

Archives

  • February 2021 (2)
  • January 2021 (5)
  • May 2020 (1)
  • January 2020 (1)
  • November 2019 (1)
  • October 2019 (9)
  • July 2019 (7)
  • June 2019 (8)
  • May 2019 (4)
  • April 2019 (1)
  • February 2019 (1)
  • January 2019 (2)
  • September 2018 (1)
  • August 2018 (1)
  • July 2018 (3)
  • June 2018 (6)
  • May 2018 (5)
  • April 2018 (2)
  • March 2018 (1)
  • February 2018 (4)
  • January 2018 (6)
  • December 2017 (5)
  • November 2017 (5)
  • October 2017 (3)
  • September 2017 (6)
  • August 2017 (2)
  • July 2017 (6)
  • June 2017 (5)
  • May 2017 (6)
  • April 2017 (1)
  • March 2017 (2)
  • February 2017 (1)
  • January 2017 (1)
  • December 2016 (6)
  • November 2016 (6)
  • October 2016 (1)
  • September 2016 (1)
  • August 2016 (1)
  • July 2016 (1)
  • June 2016 (2)
  • March 2016 (1)
  • February 2016 (1)
  • January 2016 (1)
  • December 2015 (1)
  • November 2015 (1)
  • September 2015 (1)
  • August 2015 (1)
  • July 2015 (2)
  • June 2015 (1)
  • May 2015 (4)
  • April 2015 (2)
  • March 2015 (1)
  • February 2015 (5)
  • January 2015 (7)
  • December 2014 (3)
  • November 2014 (4)
  • October 2014 (1)
  • May 2014 (1)
  • March 2014 (3)
  • February 2014 (3)
  • January 2014 (1)
  • September 2013 (3)
  • October 2012 (1)
  • August 2012 (2)
  • May 2012 (1)
  • April 2012 (1)
  • February 2012 (2)
  • December 2011 (1)
  • September 2011 (2)

Tags

Agile AI Apache Pig Apache Pig Latin Apache Pig Tutorial ASP.NET AWS Big Data Big Data Big Questions Book Review Books Data Analytics Data Engineer Data Engineers Data Science Deep Learning DynamoDB Hadoop Hadoop Distributed File System Hadoop Pig HBase HDFS IoT Isilon Isilon Quick Tips Learn Hadoop Machine Learning Machine Learning Engineer Management Motivation MVC NoSQL OneFS Pig Latin Pluralsight Project Management Python Quick Tip quick tips Scrum Splunk Streaming Analytics Tensorflow Tutorial Unstructured Data

Recent Posts

  • Tips & Tricks for Studying Machine Learning Projects
  • Getting Started as Big Data Product Marketing Manager
  • What is a Chief Data Officer?
  • What is an Industrial IoT Engineer with Derek Morgan
  • Ultimate List of Tensorflow Resources for Machine Learning Engineers

Copyright © 2025 · eleven40 Pro Theme on Genesis Framework · WordPress · Log in