It all started with me trying to learn to use the Pig Subtract Function for my Pig Eval Series. It will be fairly straight forward, use this function to compare two data sets Hadoop right? Well it didn’t exactly workout as planned. Pig’s Subtract Function did work and it did help me compare to data sets but finding the correct use case was hard. My biggest struggle was making sure the data existed in the same bag. During the struggle I ended up cheating to one of my favorite Pig Functions. Let’s see how I used the function and which one I would recommend in the future…..
Subtract in Pig Latin
Using the Subtract function in Pig Latin is not a mathematical subtraction in the sense of taking 3 and subtracting 2 resulting in 1. Remember Pig Latin excels at ETL functions on unstructured and semi-structured data. The best way to describe the way Pig Latin’s Subtract function works with an example. Let’s say we were provided two separate lists of potential customers and we wanted to find out which customers existed on only one of the list. Here is an example
List1(john, zach, wayne, mary, jane) SUBTRACT (think compare) List2 (wayne, jane) = (john, zach, mary)
See in the above example 3 names were not in the second list of customers. Pig’s Subtract function is perfect for problems where you need to compare data sets in Hadoop.
Now let’s try another example using real world data.
Pig Latin Subtract Example
In this walk through let’s use 2 different Titanic passenger list. What we want to compare is to check both list for passengers that are listed as survivors. It would be horrible to report a passenger survived when they didn’t. Imagine having to report in today’s information age. Let’s use Pig Latin’s Subtract function to flag passengers who are not on both lists, so they will be tagged for further investigation.
Below are the two data sets we want to compare. All data and completed scripts can be found on my Github Example-Pig Project. In the first and second data set the second column a 1 means survived and 0 means didn’t survive.
1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S 2 St Louis, MO
1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S 11 Montreal, PQ / Chesterville, ON
1 0 Allison, Miss. Helen Loraine female 2 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON
1 0 Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781 151.5500 C22 C26 S 135 Montreal, PQ / Chesterville, ON
1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25 1 2 113781 151.5500 C22 C26 S Montreal, PQ / Chesterville, ON
1 1 Anderson, Mr. Harry male 48 0 0 19952 26.5500 E12 S 3 New York, NY
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
Writing the Pig Script
To setup are passengers to be flagged and investigated here is what we need to do
- Load both data sets using LOAD
- Define each column
- Shorten all columns in data set 1 except 1, 2 & 3 (we are only using last names for this tutorial)
- Shorten all columns in data set 2 except 1, 2 & 4
- Filter both data sets for only passengers who survived
- Combine two data sets using COGROUP
- Use Subtract Function comparing both data sets
- Display results using a simple DUMP
--0 == didn't survive / 1 == survived
titanic_list = LOAD '/user/hue/Pig_Examples/titanic_passenger_list.csv' USING PigStorage(',') AS (f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13);
--0 == didn't survive / 1 == survived
kaggle_list = LOAD '/user/hue/Pig_Examples/train.csv' USING PigStorage(',') AS (k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12);
k = FOREACH kaggle_list GENERATE k1, k2, k4;
ks = FILTER k BY k2 == 1;
w = FOREACH titanic_list GENERATE f1, f2, f3;
ws = FILTER w BY f2 == 1;
x = COGROUP ws BY f1, ks BY k1;
final = FOREACH x GENERATE SUBTRACT(1,2);
When I first looked at using the Pig’s Subtract function, I imagined it would work different. Having to join separate data sets into two separate bags to compare them may not be best used in the example above. On second thought for comparing the different Titanic passenger list using Pig’s inner join might work better. Let me know what you think about the Pig Subtract Function and where it should be used.