How many fingers am I holding up?

1….2…3…

Three is the answer I was looking for. Learning to count might sound trivial, but it is something you do every day. What happens when you need to count records in your data set?

If you are using Pig, which you probably are if you are reading this post, you can use the Apache Pig COUNT function. COUNT is easy to learn, but you have to group your data with the GROUP BY operator before invoking it.

Let’s walk through a quick demo using the population data I used for the Pig Sum example. In this scenario I want to count the number of age groups per year. The ages run from 0 to 90 in multiples of 5, which gives us 19 age groups, but the data is broken down by gender as well, so the total should be 19 × 2 = 38 age groups per year. That was easy to figure out with a small data set, but if I were looking at millions to billions of data points I would definitely run out of fingers and toes to count on.

### Population Data

All source files can be found at Pig-Example.

Here is a sample of the population data. Each row has year, age, gender, and population size.

    1850,0,1,1483789
    1850,0,2,1450376
    1850,5,1,1411067
    1850,5,2,1359668
    1850,10,1,1260099
    1850,10,2,1216114
    1850,15,1,1077133
    1850,15,2,1110619
    1850,20,1,1017281
    1850,20,2,1003841
    1850,25,1,862547
    1850,25,2,799482
    1850,30,1,730638
    1850,30,2,639636
    1850,35,1,588487
    1850,35,2,505012
    1850,40,1,475911
    1850,40,2,428185
    .....

### Grouping the YEAR

Start by loading the population data.

After we load the population data into the population variable and declare our field names, I am going to use the GROUP BY operator to categorize each entry by year. In Pig terms, each tuple (think row or entry) is placed in a bag (think sub-heading) for its year, so the population data is now nested by year, with one bag per year. When I dump the year variable I would see something like this: 1850(tuple 1, tuple 2…), 1851(tuple39, tuple40…).

    -- Load the CSV and name each field
    population = LOAD '/user/hue/pig-examples/population.csv' USING PigStorage(',')
        AS (year: int, age: int, gender: int, popsize: int);

    -- Nest the rows into one bag per year
    year = GROUP population BY year;

    DUMP year;
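If you want to check the shape of the grouped data without dumping every row, a DESCRIBE along these lines can help. This is only a sketch: the alias `by_year` is my own naming for illustration (chosen so it does not clash with the `year` field), and the file path is the one used in this post.

```pig
-- Load the CSV and name each field (same path and schema as in this post)
population = LOAD '/user/hue/pig-examples/population.csv' USING PigStorage(',')
    AS (year: int, age: int, gender: int, popsize: int);

-- by_year is a hypothetical alias used only in this sketch
by_year = GROUP population BY year;

-- Print the schema of the grouped relation instead of its contents
DESCRIBE by_year;
```

The schema should show two fields: `group`, which holds the year, and `population`, a bag holding the original tuples for that year. That bag is what COUNT works on.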

### Adding the Apache Pig COUNT

Now that I have my population data in bags by year, I can count the tuples (think rows or instances) in each bag. To use the COUNT function I use the FOREACH operator, passing in the year variable and counting the tuples in each population bag. The final line of the script is a simple DUMP of the result variable.

    -- Load the CSV and name each field
    population = LOAD '/user/hue/pig-examples/population.csv' USING PigStorage(',')
        AS (year: int, age: int, gender: int, popsize: int);

    -- Nest the rows into one bag per year
    year = GROUP population BY year;

    -- Count the tuples in each year's bag
    result = FOREACH year GENERATE COUNT(population);

    DUMP result;
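The script in this post prints only the counts, so you cannot tell which count belongs to which year. One possible variation, sketched here, also emits the group key. The aliases `by_year` and `counts` and the output field name `age_groups` are my own names for illustration, not part of the original script.

```pig
-- Load the CSV and name each field (same path and schema as in this post)
population = LOAD '/user/hue/pig-examples/population.csv' USING PigStorage(',')
    AS (year: int, age: int, gender: int, popsize: int);

-- by_year is a hypothetical alias; it avoids reusing the field name "year"
by_year = GROUP population BY year;

-- "group" holds the grouping key (the year); COUNT runs on each year's bag
counts = FOREACH by_year GENERATE group AS year, COUNT(population) AS age_groups;

DUMP counts;
```

Assuming the data really holds 38 rows for every year, each output tuple should pair a year with its count, for example (1850,38).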

### Final Results

The output shows 38 age groups in each year of the population data set. Remember, this data set is broken down by both gender and age.

Using the COUNT function is simple; just remember it is always preceded by a GROUP statement, here GROUP BY. Interested in more Pig Eval functions? Check out the entire Pig Eval Series.
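The grouping step does not have to be per year. If you want one count for the whole file, Pig's GROUP ... ALL puts every tuple into a single bag. This is a minimal sketch; the aliases `all_rows` and `total` and the field name `total_rows` are hypothetical names of mine.

```pig
-- Load the CSV and name each field (same path and schema as in this post)
population = LOAD '/user/hue/pig-examples/population.csv' USING PigStorage(',')
    AS (year: int, age: int, gender: int, popsize: int);

-- GROUP ... ALL places every tuple in one bag
all_rows = GROUP population ALL;

-- A single COUNT over that bag gives the total number of rows
total = FOREACH all_rows GENERATE COUNT(population) AS total_rows;

DUMP total;
```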

If you are interested in a deep dive into Pig Latin, be sure to watch my Pig Latin: Getting Started Pluralsight course.