Let’s say you have two sets of structured or unstructured data. How to combine two sets of data (relations) in Pig Latin?
Look at the example below. If you wanted to combine the cereal and price data sets what would you use? Pig Latin offers Joins which are similar to that of SQL’s Joins. Let’s dive into Pig Latin Joins by comparing to traditional SQL Joins.
SQL Joins
In SQL joins combine two tables by rows based off a related column. For example id in the table below can be used to join the two tables. The tables can be joined to show any of the columns joined by the Join. Easy concept in SQL to understand because we have our data structured and a primary key of ID. In Pig most of the time we are combine structured or unstructured data. Let’s see what happens when we want to marry the same data but use Pig Latin to JOIN the fields.
1 2 3 4 5 6 7 |
ID,Name,Calories,Protein,Fat,Carbs 1,AppleJacks,117,1,0.6,27 2,Boo Berry,118,1,0.8,27 3,Cap'n Crunch,144,1.3,2.1,31 4,Cinnamon Toast Crunch,169,2.7,4.4,3 5,Cocoa Blasts,130,1,1.2,29 6,Cocoa Puffs,117,1,1,26 |
1 2 3 4 5 6 7 |
ID, Name, Price, Denom 1,AppleJacks,2.25,US 2,Boo Berry,2.10,US 3,Cap'n Crunch,2.05,US 4,Special K,3.00,US 5,Total,2.50,US 6,Wheaties,3.00,US |
Example Data
For the Pig Join examples I will use the Cereal data used in other Pig Tutorials. In this walkthough I’ve added a new data set called Cereal-Price. All the code and data samples are located on my Github Example Pig Script.
Cereal
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
Name,Calories,Protein,Fat,Carbs AppleJacks,117,1,0.6,27 Boo Berry,118,1,0.8,27 Cap'n Crunch,144,1.3,2.1,31 Cinnamon Toast Crunch,169,2.7,4.4,3 Cocoa Blasts,130,1,1.2,29 Cocoa Puffs,117,1,1,26 Cookie Crisp,117,1,0.9,26 Corn Flakes,101,2,0.1,24 Corn Pops,117,1,0.2,28 Crispix,113,2,0.3,26 Crunchy Bran,120,1.3,1.3,31 Froot Loops,118,2,0.9,26 Frosted Mini-Wheats,175,5,0.8,41 Golden Grahams,149,2.7,1.3,33 Honey Nut Clusters,214,4,2.7,46 Honey Nut Heaven,192,4,3.7,38 King Vitaman,80,1.3,0.7,17 Kix,87,1.5,0.5,20 Life,160,4,1.9,33 Lucky Charms,114,2,1.1,25 Multi-Grain Cheerios,108,2,1.2,24 Product 19,100,2,0.4,25 Raisin Bran,195,5,1.6,47 Reese's Puffs,171,2.7,3.9,31 Rice Chex,94,1.6,0.2,22 Rice Krispie Treats,160,1.3,1.7,35 Smart Start,182,4,0.7,43 Special K,117,7,0.4,22 Total,129,4,0.9,31 Wheaties,107,3,1,24 |
Cereal Price
1 2 3 4 5 6 |
AppleJacks,2.25,US Boo Berry,2.10,US Cap'n Crunch,2.05,US Special K,3.00,US Total,2.50,US Wheaties,3.00,US |
Pig Inner Join
Inner Join Example //quick operation for combining two tuples/data sets or tables……
First let’s load both data sets using the Pig Load Function.
1 2 3 4 5 6 7 8 9 |
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS (name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray); price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS (name:chararray, price:chararray); total = JOIN cereal by name, price by name; DUMP total; |
After we have loaded both data sets we can use the Pig Inner Join function to combine price and nutrition information.
1 2 3 4 5 6 7 8 9 |
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS (name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray); price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS (name:chararray, price:chararray); total = JOIN cereal by name, price by name; DUMP total; |
What happens if we want to combine relations that don’t have key fields? Or fields that match up.
Pig Outer Join
Pig Outer Join offers more options for how joins are performed in Pig. Three options are added to the Outer Joins:
- Left – Allows for data to be joined from the left of the join. Matches everything in the left relations to the corresponding right side relations.
- Right – Places data joined from the right of the join. Returns all relations from the right side and corresponding matched fields.
- Full – Join similar to Inner join where all data is combined. Brute force
Using Pig Outer Joins allows for merge together relations with out matching keys or fields. In the previous example our Inner Join only combined tables where a price existed in the price variable. Cereals without prices like Cinnamon Toast Crunch and others were left when describe total was excited. Outer Joins give us the option to merge all relations and rows together.
Left Outer Join Example
1 2 3 4 5 6 7 8 9 |
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS (name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray); price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS (name:chararray, price:chararray); total = JOIN cereal by name, price by name; DUMP total; |
Left Outer Join Results
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
(Kix,87,1.5,0.5,20,,) (Life,160,4,1.9,33,,) (Name,Calories,Protein,Fat,Carbs,,) (Total,129,4,0.9,31,Total,2.50) (Crispix,113,2,0.3,26,,) (Wheaties,107,3,1,24,Wheaties,3.00) (Boo Berry,118,1,0.8,27,Boo Berry,2.10) (Corn Pops,117,1,0.2,28,,) (Rice Chex,94,1.6,0.2,22,,) (Special K,117,7,0.4,22,Special K,3.00) (AppleJacks,117,1,0.6,27,AppleJacks,2.25) (Product 19,100,2,0.4,25,,) (Cocoa Puffs,117,1,1,26,,) (Corn Flakes,101,2,0.1,24,,) (Froot Loops,118,2,0.9,26,,) (Raisin Bran,195,5,1.6,47,,) (Smart Start,182,4,0.7,43,,) (Cap'n Crunch,144,1.3,2.1,31,Cap'n Crunch,2.05) (Cocoa Blasts,130,1,1.2,29,,) (Cookie Crisp,117,1,0.9,26,,) (Crunchy Bran,120,1.3,1.3,31,,) (King Vitaman,80,1.3,0.7,17,,) (Lucky Charms,114,2,1.1,25,,) (Reese's Puffs,171,2.7,3.9,31,,) (Golden Grahams,149,2.7,1.3,33,,) (Honey Nut Heaven,192,4,3.7,38,,) (Honey Nut Clusters,214,4,2.7,46,,) (Frosted Mini-Wheats,175,5,0.8,41,,) (Rice Krispie Treats,160,1.3,1.7,35,,) (Multi-Grain Cheerios,108,2,1.2,24,,) |
The output from the left outer example displays all Cereal results and combines those with corresponding price.
Right Outer Join
1 2 3 4 5 6 7 8 9 |
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS (name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray); price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS (name:chararray, price:chararray); total = JOIN cereal by name, price by name; DUMP total; |
Right Outer Join Results
1 2 3 4 5 6 |
(Total,129,4,0.9,31,Total,2.50) (Wheaties,107,3,1,24,Wheaties,3.00) (Boo Berry,118,1,0.8,27,Boo Berry,2.10) (Special K,117,7,0.4,22,Special K,3.00) (AppleJacks,117,1,0.6,27,AppleJacks,2.25) (Cap'n Crunch,144,1.3,2.1,31,Cap'n Crunch,2.05) |
Results here are much cleaner than our previous example because we are combining off the smaller field.
Full Outer Join
1 2 3 4 5 6 7 8 9 |
cereal = LOAD '/user/hue/Pig_Examples/cereal.csv' USING PigStorage(',') AS (name:chararray, calories:chararray, protein:chararray, fat:chararray, carbs:chararray); price = LOAD '/user/hue/Pig_Examples/cereal-price.csv' USING PigStorage(',') AS (name:chararray, price:chararray); total = JOIN cereal by name, price by name; DUMP total; |
Full Outer Join Results
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
(Kix,87,1.5,0.5,20,,) (Life,160,4,1.9,33,,) (Name,Calories,Protein,Fat,Carbs,,) (Total,129,4,0.9,31,Total,2.50) (Crispix,113,2,0.3,26,,) (Wheaties,107,3,1,24,Wheaties,3.00) (Boo Berry,118,1,0.8,27,Boo Berry,2.10) (Corn Pops,117,1,0.2,28,,) (Rice Chex,94,1.6,0.2,22,,) (Special K,117,7,0.4,22,Special K,3.00) (AppleJacks,117,1,0.6,27,AppleJacks,2.25) (Product 19,100,2,0.4,25,,) (Cocoa Puffs,117,1,1,26,,) (Corn Flakes,101,2,0.1,24,,) (Froot Loops,118,2,0.9,26,,) (Raisin Bran,195,5,1.6,47,,) (Smart Start,182,4,0.7,43,,) (Cap'n Crunch,144,1.3,2.1,31,Cap'n Crunch,2.05) (Cocoa Blasts,130,1,1.2,29,,) (Cookie Crisp,117,1,0.9,26,,) (Crunchy Bran,120,1.3,1.3,31,,) (King Vitaman,80,1.3,0.7,17,,) (Lucky Charms,114,2,1.1,25,,) (Reese's Puffs,171,2.7,3.9,31,,) (Golden Grahams,149,2.7,1.3,33,,) (Honey Nut Heaven,192,4,3.7,38,,) (Honey Nut Clusters,214,4,2.7,46,,) (Frosted Mini-Wheats,175,5,0.8,41,,) (Rice Krispie Treats,160,1.3,1.7,35,,) (Multi-Grain Cheerios,108,2,1.2,24,,) |
The returned results are similar to the Inner Join except now there is a null reference in rows missing the pricing.
Summing it Up
Pig Latin offers a lot options for transforming both unstructured and semi-structured data inside of Hadoop. The Pig Latin Join relational operator is a powerful way to perform ETL on data before it enters HDFS or after it’s already in HDFS. Ready for another Pig Latin lesson? Check out my Pig Eval Series.