Many of you are familiar with HBase. If you are not, HBase is a NoSQL database modeled after Google’s BigTable paper; it provides a key-value, column-oriented store on top of HDFS, the Hadoop Distributed File System.
HBase lets you insert and query data indexed by a rowkey and organized into columns and families of columns. The rowkey is unique for each row, but it can be associated with an “unlimited” number of cells, where each cell corresponds to a given column. Columns are grouped by family, so that columns belonging to the same family are always stored together.
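The logical model above can be sketched as a nested map: rowkey → family → qualifier → value. This is a minimal, hypothetical illustration in plain Python (the names `put` and `table` are ours, not HBase API calls), just to make the grouping by family concrete.

```python
# Sketch of HBase's logical data model as nested dicts:
# table[rowkey][family][qualifier] = value
table = {}

def put(rowkey, family, qualifier, value):
    """Store one cell; cells of the same family end up grouped together."""
    table.setdefault(rowkey, {}).setdefault(family, {})[qualifier] = value

put("example.com", "w", "referral", "google.com")
put("example.com", "w", "n_clicks", "42")
put("example.com", "w", "rank", "7")

# All columns of family 'w' for this row live in one place:
print(table["example.com"]["w"])
# {'referral': 'google.com', 'n_clicks': '42', 'rank': '7'}
```

In the real system the family set is fixed at table-creation time, while qualifiers can be added freely per row, which is what makes the number of cells per rowkey effectively unbounded.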
A full column identifier is therefore written as <COLUMN_FAMILY>:<COLUMN_NAME>.
For instance, suppose we have one family called ‘w’ and the columns ‘referral’, ‘n_clicks’, ‘rank’. The corresponding identifiers are ‘w:referral’, ‘w:n_clicks’, ‘w:rank’. Sometimes we want to organize our columns hierarchically, with multiple sub-columns under the same parent column, something like ‘n_clicks of 1st order’, ‘n_clicks of 2nd order’, or ‘rank on topwebsites.foo.bar’, ‘rank on hottestdomains.foo.bar’, and so on. A first idea would be to reuse the family/qualifier notation, with the colon ‘:’ character as delimiter: ‘w:n_clicks:1’, ‘w:n_clicks:2’, ‘w:rank:tws’, ‘w:rank:hd’… HBase converts the string representation of the column qualifier into bytes, so although it allows you to use any character, we want to show in this post why we do not recommend it.
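The trouble with reusing ‘:’ inside the qualifier can be sketched in a few lines of plain Python. HBase itself stores family and qualifier as separate byte arrays, but any tool that parses the flat ‘family:qualifier’ string (as many integrations do) can no longer tell where the family ends and the qualifier begins. The string values here are illustrative.

```python
# Intended meaning: family 'w', qualifier 'n_clicks:1'
flat = "w:n_clicks:1"

# A consumer that splits on every colon gets three parts -- ambiguous:
parts = flat.split(":")
print(parts)  # ['w', 'n_clicks', '1']

# Splitting only on the FIRST colon recovers the intent...
family, qualifier = flat.split(":", 1)
print(family, qualifier)  # w n_clicks:1

# ...but nothing in the flat string itself says which convention applies,
# so every tool in the pipeline must agree on it.
```

This ambiguity, not any limitation in HBase’s byte-level storage, is the core of the problem with colons in qualifiers.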
HBase is great for scalable and quick queries, but what if we want to scan and process the entire table? Since our physical data resides in HDFS, we can run a Map Reduce job on top of it. Good! But do we really want to write a raw M/R job? Wouldn’t it be nice to have a SQL-like language that automatically maps our data from HBase into something more structured?