Embedding Pig Latin into Python, the third millennium dinosaur!

What happens when a python eats a pig? Or better, when we embed a pig into a python?

Well, we are not talking about real animals but about two very powerful technologies: Apache Pig and Python! In this post we will show how to combine them to create a blocking, single-threaded Python application that programmatically executes a Pig query and then returns an iterator over the aggregated results for further processing.

Why Pig?

Pig is a high-level programming tool for building MapReduce jobs, based on a language called Pig Latin. Its main advantage over SQL-like languages on top of Hadoop, such as Hive, is that it supports ETL workflows: we can take the partial results of one query and pipe them in as the input of the next one, and so on. What happens when the chain is completed? Pig can either print the results to stdout or store them in a file in HDFS. Easy and friendly, Pig lets you quickly crunch a large data set by filtering, applying UDFs and other operations, and aggregating into a smaller set of results (theoretically fitting in memory). On the other hand, Pig is not really a complete programming language: for instance, it lacks an iteration statement and many basic features commonly provided by any programming environment.

Why Python?

There are too many reasons to list. For our purposes, we just want to note that it is the most popular language among data scientists.

The dinosaur!

Let’s suppose our goal is to load a TSV-formatted data set from HDFS into Pig, filter it based on some column value rules, group, aggregate, and then do some very complex processing on the aggregated results that cannot be expressed in Pig.
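
To make that goal a bit more concrete, here is a minimal sketch of one way it could look. It is not necessarily the approach this post ends up with. It assumes the `pig` command-line client is on the PATH, and everything specific in it is hypothetical: the column names (`country`, `status`, `amount`), the filter rule, and the helper names `parse_dump_line` and `run_pig`. The idea is to generate a Pig Latin script, run it as a blocking child process, and lazily yield the `DUMP`ed tuples back to Python.

```python
import os
import subprocess
import tempfile
from typing import Iterator, Optional, Tuple

# Hypothetical Pig Latin script: schema and filter rule are illustrative only.
PIG_SCRIPT = """
records = LOAD '{input_path}' USING PigStorage('\\t')
          AS (country:chararray, status:int, amount:double);
valid   = FILTER records BY status == 200;
grouped = GROUP valid BY country;
totals  = FOREACH grouped GENERATE group AS country, SUM(valid.amount) AS total;
DUMP totals;
"""


def parse_dump_line(line: str) -> Optional[Tuple[str, float]]:
    """Parse one DUMP line such as '(US,42.5)'; return None for anything else.

    This is a naive parser that only handles the two-field tuples above.
    """
    line = line.strip()
    if not (line.startswith("(") and line.endswith(")")):
        return None  # not a tuple line (e.g. log output)
    key, sep, value = line[1:-1].rpartition(",")
    if not sep:
        return None
    try:
        return key, float(value)
    except ValueError:
        return None  # e.g. an empty (null) aggregate


def run_pig(input_path: str, pig_cmd: str = "pig") -> Iterator[Tuple[str, float]]:
    """Run the script in a child process and yield (country, total) pairs."""
    with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as f:
        f.write(PIG_SCRIPT.format(input_path=input_path))
        script_path = f.name
    try:
        # Assumption: `pig -f <file>` runs the script and DUMP writes to stdout.
        proc = subprocess.Popen(
            [pig_cmd, "-f", script_path], stdout=subprocess.PIPE, text=True
        )
        for line in proc.stdout:  # blocks until Pig produces output
            row = parse_dump_line(line)
            if row is not None:
                yield row
        if proc.wait() != 0:
            raise RuntimeError("pig exited with code %d" % proc.returncode)
    finally:
        os.unlink(script_path)
```

With something like this in place, the "very complex processing" becomes ordinary Python over an iterator, for example `for country, total in run_pig("/data/events.tsv"): ...`, where the path is again just a placeholder.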