Thomas Henson


35 Reasons to Learn Hadoop Today

June 13, 2016 by Thomas Henson Leave a Comment

The best time to plant a tree was 20 years ago; the second-best time to plant one is today.

Okay, so what does this old Chinese proverb have to do with learning Hadoop? Let’s break down the proverb. Trees are awesome when they are huge and provide a ton of shade or have large branches for tire swings. However, to enjoy a tree like this, it has to have been planted a long time ago. I live in a new neighborhood, so I’m out of luck.


Learning a new technology is like planting a tree. Everyone wants to enjoy the shade or be an expert without having to put in the time.

Hadoop & Spark are hot topics right now in the Dev/IT space. Many companies are looking for Hadoop experts, but truth be told, there aren’t many out there. The technology is new and evolving daily. Just check out this blog post I wrote over a year ago about the popular frameworks; the number of new projects has doubled since that post was written.

Reasons to learn Hadoop today:

So why don’t you become an expert in the Hadoop space? The best time to start is today. Sign up for my newsletter to learn how you can become a Hadoop expert.

  1. You can command a higher salary (average 140K/year).
  2. You can get in on the ground floor of the Big Data movement.
  3. You can contribute to the “Big Data” frameworks through the open source community.
  4. Chance to work with enormous data sets. Well, it is BIG data…
  5. Opportunity to change the world with data.
  6. Huge community support.
  7. Work with some of the biggest companies on the planet: Facebook, Verizon, Netflix, MLB…
  8. Cutting-edge technology that is constantly evolving.
  9. Internet of Things.
  10. Get to play with Hadoop’s friends: Sqoop, Kafka, Pig, Hive, HBase, and many more.
  11. Top required skill employers are looking for.
  12. Unstructured data is exploding. Think exabytes.
  13. Many available languages to write MapReduce jobs with: Java, Python, Scala.
  14. Telling people at a dinner party that you do machine learning is a great conversation starter.
  15. Chance to learn something new EVERY DAY.
  16. Self-driving cars.
  17. You run Hadoop on Linux.
  18. Hadoop is a scale-out architecture.
  19. You will learn to love statistics.
  20. You will blow people away with your knowledge of algorithms.
  21. You get to use Mahout for k-means, Singular Value Decomposition, Least Squares, and more.
  22. You will begin to say phrases like “it was a small data set of only 10 terabytes”.
  23. Autonomous agents are mind-blowing.
  24. Your marketing friends are going to love you.
  25. The Elephants, Hives, and Pigs are going to need a ZooKeeper.
  26. You can analyze video data.
  27. Data Scientist is ranked the #1 sexiest job of the 21st century.
  28. MapReduce can be easy.
  29. High demand + talent gap = multiple opportunities.
  30. Hadoop runs from the command line.
  31. Java is your friend.
  32. You will future-proof your career.
  33. You can become a Chief Data Officer.
  34. Predictive and prescriptive analytics are cool.
  35. All companies are becoming data companies.

How many more reasons do you need to learn Hadoop today? If you are ready, I suggest you start with my Hadoop from the command line course to learn the basics.

Filed Under: Hadoop Tagged With: Big Data, Hadoop, Machine Learning

HDFS Getting Started Course

February 22, 2016 by Thomas Henson 4 Comments

Are you ready to get some Hadoop knowledge dropped on you?

Well here it is after eight long months since my last Pluralsight course.

HDFS Getting Started has been launched. I couldn’t be more excited to have this course released.

HDFS Getting Started

HDFS Getting Started is a baseline course for anyone working with Hadoop. Starting development with Hadoop is easy when testing in your local sandbox, but what happens when it’s time to go from testing to production?

Hadoop management and orchestration are hard. Most tasks are accomplished from the command line. Even something as simple as moving data from your local machine into HDFS can seem complicated.

What’s HDFS Getting Started about?

My new Pluralsight course, HDFS Getting Started, walks through real-life examples of moving data around in the Hadoop Distributed File System (HDFS). Learning to use the hdfs dfs commands will ensure you have the baseline Hadoop skills needed to excel in the Hadoop ecosystem.
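As a taste of what those commands look like, here is a quick sketch of basic file operations from a sandbox terminal (the paths and file names are just examples, not from the course):

```shell
# Copy a local file into HDFS (assumes a running cluster and an existing home directory)
hdfs dfs -put stock_data.csv /user/thomas/stocks/

# List the directory and peek at the file
hdfs dfs -ls /user/thomas/stocks/
hdfs dfs -cat /user/thomas/stocks/stock_data.csv

# Pull a copy back down to the local machine
hdfs dfs -get /user/thomas/stocks/stock_data.csv ./stock_data_copy.csv
```

Once you know -put, -ls, -cat, and -get, most of the other hdfs dfs subcommands start to feel familiar.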

Structured data is all around us in the form of relational databases. In this course we ingest data from a MySQL database into HDFS using Sqoop, walking through a quick tutorial of writing a Sqoop script to move structured stock market data from MySQL into HDFS.
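The shape of that kind of Sqoop import looks something like this (the connection string, table, and directory names here are hypothetical):

```shell
# Sketch: import a MySQL table into HDFS with Sqoop
sqoop import \
  --connect jdbc:mysql://localhost/stocks \
  --username sqoop_user -P \
  --table daily_prices \
  --target-dir /user/thomas/stocks/daily_prices \
  --num-mappers 1
```

Sqoop turns the import into MapReduce jobs under the hood, so the result lands in HDFS as ordinary delimited files ready for Pig or Hive.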

Pig and Hive are great ways to structure data in HDFS for analysis, but moving that data around in HDFS can get tricky. In this course we walk through using both applications to analyze stock market data, all from the Hive and Pig command lines.
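To give a feel for the Pig side of that analysis, here is a sketch of a script that averages closing prices per symbol (the file layout and field names are assumptions, not the course’s actual data):

```pig
-- Sketch: average closing price per stock symbol
prices = LOAD '/user/thomas/stocks/daily_prices' USING PigStorage(',')
         AS (symbol:chararray, trade_date:chararray, close:float);
by_symbol = GROUP prices BY symbol;
avg_close = FOREACH by_symbol GENERATE group AS symbol, AVG(prices.close) AS avg_close;
DUMP avg_close;
```

The Hive version of the same question would be a short SELECT with a GROUP BY, which is why it pays to know both tools.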

HBase is another hot application in the Hadoop ecosystem. Do you know how to move data from HDFS into HBase? In HDFS Getting Started, learn to take our stock market data, index it, and move it into HBase by writing a Pig script.
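A Pig-to-HBase load generally follows this shape, using Pig’s built-in HBaseStorage (the table name and column family below are hypothetical):

```pig
-- Sketch: store HDFS data into an HBase table; the first field becomes the row key
prices = LOAD '/user/thomas/stocks/daily_prices' USING PigStorage(',')
         AS (row_key:chararray, symbol:chararray, close:float);
STORE prices INTO 'hbase://stock_prices'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'quotes:symbol quotes:close');
```

The HBase table and its column family have to exist before the script runs, which is part of why the course covers the indexing step first.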

How is the Course Broken down?

HDFS Getting Started is broken down into six modules. The modules cover different applications and how they use HDFS to query, ingest, manipulate, and move data in Hadoop.

HDFS Getting Started Modules

  1. Understanding HDFS
  2. Creating, Manipulating and Retrieving HDFS Files
  3. Transferring Relational Data to HDFS using Sqoop
  4. Querying Data with Hive and Pig
  5. Processing Sparse Data with HBase
  6. Automating Basic HDFS Operations

Let me know if you have any questions about the course or a suggestion for a new course.

Filed Under: Hadoop Tagged With: Big Data, Hadoop, HDFS, Pluralsight

Apache Pig Eval Functions Series

July 27, 2015 by Thomas Henson 5 Comments

Ready to master Apache Pig but not sure how to get started?

How can I master Apache Pig?

The process for mastering a programming language is the same as learning any other skill: practice, practice, practice.

The practice needs to be focused and use different scenarios to be effective. Performing the same task over and over will not get you to the mastery level. However, using Pig functions you haven’t used before to process real-world examples is the kind of practice needed to master Pig Latin.

Imagine a race car driver trying to become an elite racer. He will practice racing on different tracks and even practice specific scenarios he might see in a race. Now let’s apply that logic to a Pig developer: you need to practice using Pig to solve real-world problems and to use new functions.

Pig eval functions deep dive

In this Apache Pig Eval Function series I have packaged together different data sets and a series of quick scenarios for developers to solve using Pig. Each post in this series focuses on a single Pig eval function, defining the function and then using it in a real-world example. All the code and data are provided for each of these examples, making your journey to becoming an Apache Pig master easier.

Looking for a complete way to master the basics of Pig? Try my Pluralsight course Pig Latin: Getting Started.

Why Eval Functions?

Pig ships with many different built-in functions, and learning these functions can save you time. The eval (evaluation) functions are a group of functions that you will typically not learn when you first start out with Pig. When we think about evaluation functions, the first things that come to mind are mathematical functions like addition or subtraction, but the Pig eval functions include useful string functions as well. Concatenating strings is one of the most useful: think about trying to merge two fields such as first and last name. Pig has a CONCAT() function built into the standard Pig build.
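For instance, merging a first and last name field with CONCAT() looks roughly like this (the file and field names are made up for illustration):

```pig
-- Sketch: CONCAT() only joins two values, so nest it to add the space
people = LOAD 'people.csv' USING PigStorage(',')
         AS (first:chararray, last:chararray);
full_names = FOREACH people GENERATE CONCAT(CONCAT(first, ' '), last) AS full_name;
DUMP full_names;
```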

The code and data can be found at my Example Pig Script Github page.

Deep dive for the Apache Pig Eval Functions:

  • Average – Learn how to get the average of a column using the AVG() function.
  • Sum – Get the sum of a column using the SUM() function.
  • Concatenation – Merge two or more columns together using the CONCAT() function.
  • Tokenize – Find out how to break down fields using the TOKENIZE() function.
  • Min – Use the MIN() function over data in Pig.
  • Max – Master using the MAX() function in Pig Latin.
  • Count – Take the population data from previous tutorials and use the COUNT() function.
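To show the shape these posts take, here is a quick sketch combining two of the functions above over a population data set (the file layout and field names are assumptions):

```pig
-- Sketch: AVG() and COUNT() over an entire relation
cities = LOAD 'population.csv' USING PigStorage(',')
         AS (city:chararray, population:long);
all_cities = GROUP cities ALL;
stats = FOREACH all_cities GENERATE AVG(cities.population) AS avg_pop,
                                    COUNT(cities) AS num_cities;
DUMP stats;
```

Note the GROUP ... ALL step: eval functions like AVG() and COUNT() operate on bags, so the relation has to be grouped before they can run over the whole column.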

Filed Under: Hadoop Pig Tagged With: Apache Pig, Apache Pig Latin, Hadoop, Pig Eval Series

Pig Eval Series: Tokenize

July 2, 2015 by Thomas Henson 1 Comment