Embedding Pig Latin into Python, the third millennium dinosaur!

What happens when a python eats a pig? Or better, when we embed a pig into a python?

Well, we are not talking about real animals but about two very powerful technologies: Apache Pig and Python! In this post we will show how to combine the two to create a blocking, single-threaded Python application that programmatically executes a Pig query and then returns an iterator over the aggregated results for further processing.

Why Pig?

Pig is a high-level programming tool for building MapReduce jobs based on a language called Pig Latin. Its main advantage over SQL-like languages on top of Hadoop, such as Hive, is that it supports ETL workflows: we can take the partial results of one query and pipe them as the input of the next one, and so on. What happens when the chain is completed? Pig can either output the results to stdout or store them in a file in HDFS. Very easy and friendly, Pig allows you to quickly crunch a large data set, applying operations, filters, UDFs and aggregations to produce a smaller set of results (theoretically fitting in memory). On the other hand, Pig is not really a complete programming language: for instance, it lacks an iteration statement and many basic features commonly provided by any programming environment.

Why Python?

Too many reasons to list; for our purpose it is enough that it is one of the most popular languages among data scientists.

The dinosaur!

Let’s suppose our goal is to load a TSV-formatted data set from HDFS into Pig, filter it based on some column value rules, group, aggregate, and then do some very complex processing of the aggregated results that is not expressible in Pig.
In this example, our data set looks like:

user country interests volume
2938243096 Italy Food, Football, Cars 54
4327804923 Brazil Football, Dancing 23
3272832938 France Food, Wine 65

We want to keep only Scandinavian countries, group by interests, and sum up the volume of each combination of interests into a partial result. We can achieve this with the following query:

data = LOAD 'DATA_INPUT_PATH' USING PigStorage('\t') AS (user, country, interests, volume);
scandinavia = FILTER data BY (country == 'Sweden') OR (country == 'Norway') OR (country == 'Denmark');
grouped_interests = GROUP scandinavia BY interests;
interests_volume = FOREACH grouped_interests GENERATE group, SUM(scandinavia.volume);
DUMP interests_volume;
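For readers more at home in Python, the GROUP/SUM step above is roughly equivalent to the following in-memory dictionary aggregation (a hypothetical analogue of ours for illustration only; the real job runs distributed on Hadoop):

```python
from collections import defaultdict

def group_and_sum(rows):
    """In-memory analogue of the FILTER + GROUP + SUM steps above."""
    totals = defaultdict(int)
    for user, country, interests, volume in rows:
        # Keep only the Scandinavian countries, as in the FILTER statement.
        if country in ('Sweden', 'Norway', 'Denmark'):
            totals[interests] += volume
    return dict(totals)

rows = [(2938243096, 'Sweden', 'Food, Wine', 65),
        (4327804923, 'Norway', 'Food, Wine', 23),
        (3272832938, 'Brazil', 'Football, Dancing', 54)]
print(group_and_sum(rows))  # {'Food, Wine': 88}
```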

The partial output will look like:

interests volume
Food, Football, Cars 43545656
Football, Dancing 344324
Music 54455445543

And finally we want to take those aggregated results and do some complex further processing with them: for example, computing the set of interests that covers 95% of the total volume. This operation is impossible to code directly in Pig, and we do not want to take on the cost and pain of writing a UDF. Moreover, we may want to consume those Pig results programmatically. We could run the Pig query on its own, read from stdout (or dump to a file and then read from HDFS), and pipe the output to our Python script. In this post, instead, we want to show how to integrate the Pig query directly into our script.
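As a minimal sketch of that 95% coverage computation (the function name, the greedy strategy and the threshold parameter are our own choices, not part of the Pig script): sort the interest combinations by volume, descending, and take entries until they account for 95% of the total.

```python
def coverage_set(volumes, threshold=0.95):
    """Return the smallest list of interest combinations whose summed
    volume reaches `threshold` of the total, taking the largest first."""
    total = sum(volumes.values())
    covered = 0.0
    selected = []
    # Greedily pick the highest-volume combinations first.
    for interests, volume in sorted(volumes.items(),
                                    key=lambda kv: kv[1], reverse=True):
        selected.append(interests)
        covered += volume
        if covered >= threshold * total:
            break
    return selected

volumes = {'Food, Football, Cars': 43545656,
           'Football, Dancing': 344324,
           'Music': 54455445543}
print(coverage_set(volumes))  # ['Music']
```

Here 'Music' alone already covers more than 95% of the total volume, so the result is a single-element list.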

# my_script.py

import sys
from org.apache.pig.scripting import Pig

def do_something(iterator):
    for pig_tuple in iterator:
        interests = str(pig_tuple.get(0))
        volume = float(str(pig_tuple.get(1)))
        # DO SOMETHING

if __name__ == '__main__':
    P = Pig.compile("""
        data = LOAD '$data' USING PigStorage('\\t') AS (user, country, interests, volume);
        scandinavia = FILTER data BY (country == 'Sweden') OR (country == 'Norway') OR (country == 'Denmark');
        grouped_interests = GROUP scandinavia BY interests;
        interests_volume = FOREACH grouped_interests GENERATE group, SUM(scandinavia.volume);
        DUMP interests_volume;
    """)
    params = {'data': sys.argv[1]}
    bound = P.bind(params)
    stats = bound.runSingle()
    if not stats.isSuccessful():
        raise RuntimeError('Pig query failed')
    iterator = stats.result("interests_volume").iterator()
    do_something(iterator)

The $data variable is the data input path that we take as a command-line argument and bind into our Pig environment. In this code, we first execute the Pig query, then get an iterator over our partially aggregated results, represented as tuples, and pass it to our custom function.
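Since `stats.result(...).iterator()` yields Pig Tuple objects whose fields are accessed with `get(index)`, the tuple-processing logic can be exercised locally with a small stand-in class. The `FakeTuple` stub and `total_volume` helper below are testing conveniences of ours, not part of the Pig API:

```python
class FakeTuple(object):
    """Minimal stand-in for a Pig Tuple, exposing only get(index)."""
    def __init__(self, *fields):
        self.fields = fields

    def get(self, index):
        return self.fields[index]

def total_volume(iterator):
    # Mirrors the field-access pattern used in do_something above.
    total = 0.0
    for pig_tuple in iterator:
        interests = str(pig_tuple.get(0))
        total += float(str(pig_tuple.get(1)))
    return total

rows = [FakeTuple('Food, Wine', 65), FakeTuple('Music', 23)]
print(total_volume(rows))  # 88.0
```

This lets you develop and unit-test the Python side without launching a Hadoop job.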

Since the native Pig API is written in Java, we will use Jython to run our Python code on the JVM.
To run the script, you need Jython on the Pig classpath. Download the Jython installer and execute the following command to build a standalone jar:

java -jar jython_installer-2.5.2.jar -s -d /tmp/jython-install -t standalone -j $JAVA_HOME

Before running the Pig script, you need to add jython.jar to the PIG_CLASSPATH variable:

export PIG_CLASSPATH=$PIG_CLASSPATH:/tmp/jython-install/jython.jar

You can execute your script as:

pig my_script.py 'HDFS_PATH' > OUTPUT_FILE.out

Pig will automatically recognize the script, compile it with Jython, and execute it in the proper environment. This lets you load the partially aggregated results into your application domain, execute the Python-coded instructions, and redirect the stdout to a file.

Cool, isn’t it? What do you think?


About Gianmario

Data Scientist with experience in building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geek community.
This entry was posted in Big Data, Pig, Python, Software Development.
