Archives for October 2017

Complete Pig Join Example

October 30, 2017 by Thomas Henson 2 Comments

Let’s say you have two sets of structured or unstructured data. How do you combine two sets of data (relations) in Pig Latin?

Look at the example below. If you wanted to combine the cereal and price data sets, what would you use? Pig Latin offers Joins that are similar to SQL’s Joins. Let’s dive into Pig Latin Joins by comparing them to traditional SQL Joins.

Pig Join Example

SQL Joins

In SQL, joins combine rows from two tables based on a related column. For example, the ID column in the tables below can be used to join the two tables, and the joined result can include any of the columns from either table. It’s an easy concept to understand in SQL because our data is structured and ID serves as a primary key. In Pig, most of the time we are combining structured or unstructured data. Let’s see what happens when we want to marry the same data but use Pig Latin to JOIN the fields (see the sketch after the tables below).

Cereal
ID,Name,Calories,Protein,Fat,Carbs
1,AppleJacks,117,1,0.6,27
2,Boo Berry,118,1,0.8,27
3,Cap'n Crunch,144,1.3,2.1,31
4,Cinnamon Toast Crunch,169,2.7,4.4,3
5,Cocoa Blasts,130,1,1.2,29
6,Cocoa Puffs,117,1,1,26

Cereal-Price
ID,Name,Price,Denom
1,AppleJacks,2.25,US
2,Boo Berry,2.10,US
3,Cap'n Crunch,2.05,US
4,Special K,3.00,US
5,Total,2.50,US
6,Wheaties,3.00,US
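
If the two ID-keyed tables above were loaded as Pig relations, the SQL-style join on ID would look roughly like the sketch below. This is only an illustration; the file names and the id field here are assumptions, and the walkthrough that follows joins on the name field instead, since the sample files do not include an ID column.

-- Hypothetical sketch: joining the two ID-keyed tables above in Pig Latin
cereal = LOAD '/user/hue/Pig_Examples/cereal-with-id.csv' USING PigStorage(',') AS
(id:int, name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
price = LOAD '/user/hue/Pig_Examples/cereal-price-with-id.csv' USING PigStorage(',') AS
(id:int, name:chararray, price:chararray, denom:chararray);
 
-- Inner join on the shared ID column, the Pig equivalent of SQL's JOIN ... ON a.id = b.id
total = JOIN cereal by id, price by id;
 
DUMP total;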

 

Example Data

For the Pig Join examples I will use the Cereal data used in other Pig Tutorials. In this walkthrough I’ve added a new data set called Cereal-Price. All the code and data samples are located in my GitHub Example Pig Script repository.

Cereal

Name,Calories,Protein,Fat,Carbs
AppleJacks,117,1,0.6,27
Boo Berry,118,1,0.8,27
Cap'n Crunch,144,1.3,2.1,31
Cinnamon Toast Crunch,169,2.7,4.4,3
Cocoa Blasts,130,1,1.2,29
Cocoa Puffs,117,1,1,26
Cookie Crisp,117,1,0.9,26
Corn Flakes,101,2,0.1,24
Corn Pops,117,1,0.2,28
Crispix,113,2,0.3,26
Crunchy Bran,120,1.3,1.3,31
Froot Loops,118,2,0.9,26
Frosted Mini-Wheats,175,5,0.8,41
Golden Grahams,149,2.7,1.3,33
Honey Nut Clusters,214,4,2.7,46
Honey Nut Heaven,192,4,3.7,38
King Vitaman,80,1.3,0.7,17
Kix,87,1.5,0.5,20
Life,160,4,1.9,33
Lucky Charms,114,2,1.1,25
Multi-Grain Cheerios,108,2,1.2,24
Product 19,100,2,0.4,25
Raisin Bran,195,5,1.6,47
Reese's Puffs,171,2.7,3.9,31
Rice Chex,94,1.6,0.2,22
Rice Krispie Treats,160,1.3,1.7,35
Smart Start,182,4,0.7,43
Special K,117,7,0.4,22
Total,129,4,0.9,31
Wheaties,107,3,1,24

Cereal Price

AppleJacks,2.25,US
Boo Berry,2.10,US
Cap'n Crunch,2.05,US
Special K,3.00,US
Total,2.50,US
Wheaties,3.00,US

Pig Inner Join

The Inner Join is a quick operation for combining two relations (data sets) on a matching field, much like joining two tables in SQL.

First let’s load both data sets using the Pig Load Function.

-- Load the cereal nutrition data
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS
(name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
-- Load the cereal price data
price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS
(name:chararray, price:chararray);

After we have loaded both data sets we can use the Pig Inner Join function to combine price and nutrition information.

cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',')  AS
(name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',')  AS
(name:chararray, price:chararray);
 
total = JOIN cereal by name, price by name;
 
DUMP total;

[Image: Pig Inner Join example output]

What happens if we want to combine relations that don’t share key fields, or whose fields don’t all match up?

Pig Outer Join

The Pig Outer Join offers more options for how joins are performed in Pig. There are three variations of the Outer Join:

  1. Left – Returns every row from the left relation, along with the matching rows from the right relation.
  2. Right – Returns every row from the right relation, along with the matching rows from the left relation.
  3. Full – Returns every row from both relations, whether or not a match exists.

Using Pig Outer Joins allows us to merge relations together even without matching keys or fields. In the previous example our Inner Join only combined rows where a price existed in the price relation. Cereals without prices, like Cinnamon Toast Crunch and others, were left out when total was dumped. Outer Joins give us the option to merge all relations and rows together.

Left Outer Join Example

cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS
(name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS
(name:chararray, price:chararray);
 
-- LEFT OUTER keeps every cereal row, with nulls where no price exists
total = JOIN cereal by name LEFT OUTER, price by name;
 
DUMP total;

Left Outer Join Results

(Kix,87,1.5,0.5,20,,)
(Life,160,4,1.9,33,,)
(Name,Calories,Protein,Fat,Carbs,,)
(Total,129,4,0.9,31,Total,2.50)
(Crispix,113,2,0.3,26,,)
(Wheaties,107,3,1,24,Wheaties,3.00)
(Boo Berry,118,1,0.8,27,Boo Berry,2.10)
(Corn Pops,117,1,0.2,28,,)
(Rice Chex,94,1.6,0.2,22,,)
(Special K,117,7,0.4,22,Special K,3.00)
(AppleJacks,117,1,0.6,27,AppleJacks,2.25)
(Product 19,100,2,0.4,25,,)
(Cocoa Puffs,117,1,1,26,,)
(Corn Flakes,101,2,0.1,24,,)
(Froot Loops,118,2,0.9,26,,)
(Raisin Bran,195,5,1.6,47,,)
(Smart Start,182,4,0.7,43,,)
(Cap'n Crunch,144,1.3,2.1,31,Cap'n Crunch,2.05)
(Cocoa Blasts,130,1,1.2,29,,)
(Cookie Crisp,117,1,0.9,26,,)
(Crunchy Bran,120,1.3,1.3,31,,)
(King Vitaman,80,1.3,0.7,17,,)
(Lucky Charms,114,2,1.1,25,,)
(Reese's Puffs,171,2.7,3.9,31,,)
(Golden Grahams,149,2.7,1.3,33,,)
(Honey Nut Heaven,192,4,3.7,38,,)
(Honey Nut Clusters,214,4,2.7,46,,)
(Frosted Mini-Wheats,175,5,0.8,41,,)
(Rice Krispie Treats,160,1.3,1.7,35,,)
(Multi-Grain Cheerios,108,2,1.2,24,,)

The output from the Left Outer Join example displays every cereal row and combines it with the corresponding price where one exists.

Right Outer Join

cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS
(name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS
(name:chararray, price:chararray);
 
-- RIGHT OUTER keeps every price row, matched back to its cereal
total = JOIN cereal by name RIGHT OUTER, price by name;
 
DUMP total;

Right Outer Join Results

(Total,129,4,0.9,31,Total,2.50)
(Wheaties,107,3,1,24,Wheaties,3.00)
(Boo Berry,118,1,0.8,27,Boo Berry,2.10)
(Special K,117,7,0.4,22,Special K,3.00)
(AppleJacks,117,1,0.6,27,AppleJacks,2.25)
(Cap'n Crunch,144,1.3,2.1,31,Cap'n Crunch,2.05)

Results here are much cleaner than in our previous example because we are joining against the smaller relation, which only contains the six cereals that have prices.

Full Outer Join

cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS
(name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS
(name:chararray, price:chararray);
 
-- FULL OUTER keeps every row from both relations
total = JOIN cereal by name FULL OUTER, price by name;
 
DUMP total;

Full Outer Join Results

(Kix,87,1.5,0.5,20,,)
(Life,160,4,1.9,33,,)
(Name,Calories,Protein,Fat,Carbs,,)
(Total,129,4,0.9,31,Total,2.50)
(Crispix,113,2,0.3,26,,)
(Wheaties,107,3,1,24,Wheaties,3.00)
(Boo Berry,118,1,0.8,27,Boo Berry,2.10)
(Corn Pops,117,1,0.2,28,,)
(Rice Chex,94,1.6,0.2,22,,)
(Special K,117,7,0.4,22,Special K,3.00)
(AppleJacks,117,1,0.6,27,AppleJacks,2.25)
(Product 19,100,2,0.4,25,,)
(Cocoa Puffs,117,1,1,26,,)
(Corn Flakes,101,2,0.1,24,,)
(Froot Loops,118,2,0.9,26,,)
(Raisin Bran,195,5,1.6,47,,)
(Smart Start,182,4,0.7,43,,)
(Cap'n Crunch,144,1.3,2.1,31,Cap'n Crunch,2.05)
(Cocoa Blasts,130,1,1.2,29,,)
(Cookie Crisp,117,1,0.9,26,,)
(Crunchy Bran,120,1.3,1.3,31,,)
(King Vitaman,80,1.3,0.7,17,,)
(Lucky Charms,114,2,1.1,25,,)
(Reese's Puffs,171,2.7,3.9,31,,)
(Golden Grahams,149,2.7,1.3,33,,)
(Honey Nut Heaven,192,4,3.7,38,,)
(Honey Nut Clusters,214,4,2.7,46,,)
(Frosted Mini-Wheats,175,5,0.8,41,,)
(Rice Krispie Treats,160,1.3,1.7,35,,)
(Multi-Grain Cheerios,108,2,1.2,24,,)

The returned results include every row from both relations. Compared to the Inner Join, rows without pricing are still returned, with null fields where the price is missing.

Summing it Up

Pig Latin offers a lot of options for transforming both unstructured and semi-structured data inside of Hadoop. The Pig Latin Join relational operator is a powerful way to perform ETL on data before it enters HDFS or after it’s already in HDFS. Ready for another Pig Latin lesson? Check out my Pig Eval Series.
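
For reference, here is a minimal end-to-end sketch that ties the steps above together: load both sample files, join them, and write the result back to HDFS. The output directory is only an assumption for illustration.

-- Load both sample data sets
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS
(name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray);
 
price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS
(name:chararray, price:chararray);
 
-- Keep every cereal row, attaching a price where one exists
total = JOIN cereal by name LEFT OUTER, price by name;
 
-- Write the joined relation back to HDFS (hypothetical output directory)
STORE total INTO '/user/hue/Pig_Examples/cereal_with_price' USING PigStorage(',');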

 

Filed Under: Hadoop Pig Tagged With: Apache Pig, Apache Pig Latin, Apache Pig Tutorial

Setting Up Passwordless SSH for Ambari Agent

October 26, 2017 by Thomas Henson Leave a Comment


Want to know one of the hardest parts for me when installing Hadoop with Ambari?

Setting up passwordless SSH for all nodes so that the Ambari Agent could do the install. Looking back it might seem like a trivial thing to get right, but at that time my Linux skills were lacking. Plus, I had been cast into the Hadoop Administrator role after only a month as a Pig Developer.

Having a background in Linux is very beneficial to excelling as a Data Engineer or Hadoop Administrator. However, if you have just been thrown into the role or are looking to build your first cluster from scratch, check out the video below on Setting Up Passwordless SSH for Ambari Agent.

Transcript – Setup Passwordless SSH for Ambari Agent

One step we need to do before installing Ambari is to set up passwordless SSH on our Ambari boxes. So, what we’re going to do is we’re actually going to generate a key on our master node and send those out to our data nodes. I wanted to caution you that this sounds very easy, and if you’re familiar with Linux and you’ve done this a couple of times, you understand that it might be trivial. But if it’s something you haven’t done before or if it’s something you haven’t done in a while, you want to make sure that you walk through this step. One of the reasons that you really want to walk through this before we install anything Ambari-related or Hadoop-related is because this is going to help us troubleshoot problems that we might have with permissions. So, if we know that this piece works, we can eliminate all the other problems.

No problem if you haven’t set it up before. We’re actually going to walk through that in the demo here. But first, let’s just look at it from an architectural perspective. So, what we’re going to do is, on our master node, we’re going to generate both a public and a private key. Then we’re going to share out that public key with all the data nodes, and what this is going to do is this is going to allow for the master node to log in via SSH with no password into data node 1, 2, and 3. So, since master node can actually login to these, we only have to install Ambari on the master node and then allow the master node to run all the installation on all the other nodes. You’ll see more of that once we get into installing Ambari and Ambari Agent, but just know that we have to have this public key working in order to have passwordless SSH.

The steps to walk through it are pretty easy. So, what we’re going to do is we’re going to login to our master node and we’re going to create a key. So, we’ll type in ssh-keygen. From there, it’ll generate the public and private key, and then we will copy the public key to data node 1, 2 and 3. Next, we’re going to add that key to the authorized list on all the data nodes. We’re going to test from our main node into data node 1, 2, and 3 just to make sure that our passwordless SSH works and that we can log in as root. Now let’s step through that in a demo.

Now we’re ready to set up passwordless SSH in our environment. So, in my environment, I have node one, which will be my master node, and I’m going to set up passwordless SSH on node 2, 3, and 4. But in this demo, we’re just going to walk through doing it on node 1 and node 2, and then we can just replicate it—the same process—on the other nodes.

So, the first thing we need to do is, on our master node or node 1, we’re going to generate our public key and our private key. So, ssh-keygen. We’re going to keep it defaulted to go into the .ssh folder. I’m not going to enter anything for my passphrase.

You can see there’s a random image and we can run an ll on our .ssh directory, and we see that we have both our public and our private key. Now what we need to do is we need to move that public key over to our data node 1, and then we’ll be able to login without using a password. So, I’m going to clear out the screen, and now what we’re going to do is we’re going to use scp and just move that public key over to node 2.

So, since we haven’t set up our passwordless SSH, it will prompt us for a password here. So, we got the transfer complete and now I’m going to log in to node 2, still having to use that password. If we run a quick ll, we can see we have our public key here, and now all we need to do is set up our .ssh directory and add this public key to the authorized keys. So, we’re going to make that directory, move inside that directory, and we can see nothing is in it. Now it’s time to move that public key into this .ssh directory.

Then we have our public key, and now let’s just cat that file. We’re going to create an authorized keys. And we have two files here, so we have our public key, and then we’ve also written an authorized keys, which is going to be that public key. So, we’re going to exit out. As you can see, I’m back in node 1. So, now we should be able to just SSH in and not be prompted for a password. And you can see now we’re here in node 2.

So that’s how you set up your passwordless SSH. We’ll need to do this for all data nodes that we’re going to add to the cluster, and this will allow that, once we have Ambari installed on our main node, Ambari will be able to go and make changes to all the data nodes and do all the updates and upgrades all at one time so that you’re not having to manage each individual upgrade, each individual update.
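
For reference, here is a minimal sketch of the commands walked through above. It assumes the default RSA key name and placeholder hostnames (node1 for the master node, node2 for a data node); substitute your own nodes and repeat for each data node.

# On the master node (node1): generate the public/private key pair, accepting the defaults
ssh-keygen

# Copy the public key over to the data node (this copy still prompts for a password)
scp ~/.ssh/id_rsa.pub root@node2:~/

# On the data node (node2): create the .ssh directory and add the key to authorized_keys
mkdir -p ~/.ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

# On some systems the permissions also need tightening before SSH will accept the key
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Back on the master node: confirm you can now log in without a password
ssh root@node2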

Filed Under: Ambari Tagged With: Ambari, Hadoop, Hadoop Distributed File System, Hortonworks, Learn Hadoop

Kappa Architecture Examples in Real-Time Processing

October 11, 2017 by Thomas Henson Leave a Comment

Kappa Architecture

“Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

Yes, we can build models based on real-time processing, and in fact there are some you use every day…

In today’s episode of Big Data Big Questions, we will explore some real-world Kappa Architecture examples. Watch the video and find out!

Video

Transcription

Hi, folks. Thomas Henson here with thomashenson.com. And today we’re going to have another episode of Big Data, Big Questions. And so, today’s episode, we’re going to focus on some examples of the Kappa Architecture. And so, stay tuned to find out more.

So, today’s question comes in from a user on YouTube, Yaso1977. They’ve asked: “Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

And so, I think this user is stemming this question from their defense for either their master’s degree or their Ph.D. So, first off, Yaso1977, congratulations on standing on your defense and creating your research project around this. I’m going to answer this question as best I could and put myself in your situation where if I was starting out and had to come up with a research project to be able to stand for either my Master’s or my Ph.D. What would I do, and what are some of the things I would look at?

And so, I’m going to base most of these around the Kappa Architecture because that is the future, right, of streaming analytics and IoT. And it’s kind of where we’re starting to see the industry trend. And so, some of those examples that we’re going to be looking for are not just going to be out there just yet, right? We still have a lot of applications and a lot of users that are on Lambda. And Kappa is still a little bit more on the cutting edge.

So, there are three main areas that I would look for to find those examples. The first one is going to be in IoT. So your newer IoTs to the Internet of things workflows, you’re going to start to see that. One of the reasons that we’re going to see that is because there’s millions and millions of these devices that are out there.

And so, you can think of any device, you know, whether be it from a manufacturer that has sensors on manufacturing equipment, smart cards, or even smartphones, and just information from multiple millions of users that are all streaming back in and doing some kind of prediction modeling doing some kind of analytics on that data as it comes in.

And so, on those newer workflows, you’re probably going to start to see the Kappa Architecture being implemented in there. So, I would focus first off looking at IoT workflows.

Second, this is the tried and true one that we’ve seen all throughout Big Data since we’ve started implementing Hadoop, but fraud detection, specifically with credit cards and some of the other pieces. So, you know, look at that from a security perspective, and so a lot of security. I mean, we just had the Equifax data breach and so many other ones.

So, I would, for sure, look at some of the fraud detection around, you know, maybe, some of the major credit card companies and see kind of what they’re doing and what they have published around it. Because just like in our IoT example, we’re talking millions and millions, maybe, even billions of users all having, you know, multiple transactions going on at one time. All that data needs to be processed and needs to be logged, and, you know, we’re looking for fraud detection. That needs to be pretty quick, right? Because you need to be able to capture that in the moment that, you know…Whether you’re inserting your chip card or whether you’re swiping your card, you need to know whether that’s about to happen, right?

So, it has to be done pretty quickly. And so, it’s definitely a streaming architecture. My bet is there’s some people out there that are already using that Kappa Architecture.

And then another one is going to be anomaly detection. I’m going to break that into two different ones. So, anomaly detection ones talk about security from the insider threats. So, think of being able to capture, you know, insider threats in your organization that are maybe trying to leak data or trying to give access to people that don’t need to have it. Those are still things that happen in real-time. And, you know, the faster that you can make that decision, the faster that you could predict that somebody is an insider threat, or that they’re doing something malicious on your network, the quicker and the less damage that is going to be done to your environment.

And then, also, anomaly detection from manufacturers. So, we’re talking a little bit about IoT but also looking at manufacturing. So, there’s a great example. And I would say that, you know, for your research, one of the books that you would want to look into is Introduction to Apache Flink. There’s an example in there from the manufacturer Ericsson, who implemented the Kappa Architecture. And what they have is…I think it’s like 10 to 100 terabytes of data that they’re processing at one time. And they’re looking for anomaly detection in that workflow to see, you know, are there sensors? Are there certain things that are happening that are out of the norm so that maybe they can stop a manufacturing defect or predict something that’s going to go wrong within their manufacturing area, and then also, you know, externally, you know, from when the users have their devices and be able to predict those too?

So, those are the three areas that I would check, definitely check out the Introduction to Apache Flink, a lot of talk about the Kappa Architecture. Use that as some of your resources and be able to, you know, pull out some of those examples.

But remember, those three areas that I would really key on and look at are IoT, fraud detection. So, look at some of the credit companies or other fraud detections. And then also, anomaly detection, whether be insider threats or manufacturers.

So, that’s the end of today’s episode for Big Data, Big Questions. I want to thank everyone for watching. And before you leave, make sure that you subscribe. So, you never want to miss an episode. You never want to miss any of my Big Data Tips. So, make sure you subscribe, and I will see you next time. Thank you

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, IoT, Kappa
