Hive mapping of HBase columns containing colon ‘:’ character

Apache HBase

Many of you are familiar with HBase. If you are not, HBase is a NoSQL database modeled after Google’s BigTable paper; it aims to provide a key-value, column-oriented database on top of HDFS, the Hadoop Distributed File System.

HBase lets you insert and query data indexed by a rowkey and organized into columns and families of columns. The rowkey is unique for each row, but it can be associated with an “unlimited” number of cells, where each cell corresponds to a given column. Columns are grouped into families, so that columns belonging to the same family are always stored and partitioned together.

HBase data model. Source: HBase Schema Design and Cluster Sizing Notes, ApacheCon Europe, November 2012

A column identifier is then specified as <COLUMN_FAMILY>:<COLUMN_QUALIFIER>.

For instance, let’s suppose we have one family called ‘w’ and the columns are ‘referral’, ‘n_clicks’ and ‘rank’. The corresponding identifiers are ‘w:referral’, ‘w:n_clicks’ and ‘w:rank’. Sometimes we want to organize our columns hierarchically, so that we have multiple sub-columns of the same parent column, something like ‘n_clicks of 1st order’, ‘n_clicks of 2nd order’, or ‘rank on topwebsites.foo.bar’, ‘rank on hottestdomains.foo.bar’, and so on. A first idea would be to reuse the same notation as the family/qualifier separator, with the colon ‘:’ character as delimiter: ‘w:n_clicks:1’, ‘w:n_clicks:2’, ‘w:rank:tws’, ‘w:rank:hd’, and so on. HBase simply converts the string representation of the column qualifier into bytes, so it allows you to use any character; nevertheless, in this post we want to show why we do not recommend it.

HBase is great for scalable and quick queries, but what if we want to scan and process the entire table? Since the physical data resides in HDFS, we can run a MapReduce job on top of it. Good! But do we really want to write a raw MapReduce job? Wouldn’t it be nice to have a SQL-like language that automatically maps our data from HBase into something more structured?

Apache Hive

Hive is a data warehouse that projects structure onto the data stored in HDFS and provides a SQL-like language called HiveQL for easily building queries that are then translated into MapReduce jobs. Hive can also map data stored outside its own warehouse directory, as long as it resides in HDFS; this solution is called Hive external mapping, and it works (almost) fine with data stored in HBase.

CREATE EXTERNAL TABLE db.hive_table(rowkey STRING, rank STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ":key,w:rank")
TBLPROPERTIES ('hbase.table.name' = 'HBASE_TABLE_NAME');
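
Once this mapping is in place, the HBase table can be queried like any other Hive table. A minimal sketch, using the table and column names defined above:

-- query the externally mapped HBase table through Hive
SELECT rowkey, rank
FROM db.hive_table
WHERE rank IS NOT NULL
LIMIT 10;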

Hive uses a SerDe for serializing/deserializing data stored in any custom format. Unfortunately, in the current version 0.11.0, due to this issue the HBase SerDe is not able to map column qualifiers that contain the colon character.

For instance, you cannot define a mapping like the following:

CREATE EXTERNAL TABLE db.hive_table(rowkey STRING, rank STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ":key,w:rank:tws,w:rank:hd")
TBLPROPERTIES ('hbase.table.name' = 'HBASE_TABLE_NAME');

If you try, you will get an error like:

FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Error: the HBase columns mapping contains a badly formed column family, column qualifier specification.)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

According to the Javadoc in the source code:

   * Utility method for parsing a string of the form '-,b,s,-,s:b,...' as a means of specifying
   * whether to use a binary or an UTF string format to serialize and de-serialize primitive
   * data types like boolean, byte, short, int, long, float, and double. This applies to
   * regular columns and also to map column types which are associated with an HBase column
   * family. For the map types, we apply the specification to the key or the value provided it
   * is one of the above primitive types. The specifier is a colon separated value of the form
   * -:s, or b:b where we have 's', 'b', or '-' on either side of the colon. 's' is for string
   * format storage, 'b' is for native fixed width byte oriented storage, and '-' uses the
   * table level default.
   *
   * @param hbaseTableDefaultStorageType - the specification associated with the table property
   *        hbase.table.default.storage.type
   * @throws SerDeException on parse error.
   *

Hence, if you want to stick with your schema design, you need a workaround in the current release of Hive to map the underlying HBase table.

More recent Hive versions (0.12 onwards) have a column prefix matching feature that allows you to map all the columns matching a specified prefix into a single Hive MAP column.

CREATE EXTERNAL TABLE hive_hbase_test(rowkey STRING, rank MAP<STRING, STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,w:rank.*")
TBLPROPERTIES ("hbase.table.name" = "TEST_HBASE_TABLE");

The matching columns are exposed through the MAP<STRING, STRING> column, where the key of the map is the full column qualifier (e.g. ‘rank:tws’) and the value is the value of the cell.
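
For example, the individual ranks can then be extracted from the map by key; a minimal sketch against the table defined above:

-- access individual HBase columns through the Hive MAP column
SELECT rowkey,
       rank['rank:tws'] AS rank_tws,
       rank['rank:hd']  AS rank_hd
FROM hive_hbase_test;
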
If you are not running a recent enough version of Hive, this feature is not available. In that case, what you want to do is map the entire column family:

CREATE EXTERNAL TABLE hive_hbase_test(rowkey STRING, w MAP<STRING, STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,w:")
TBLPROPERTIES ("hbase.table.name" = "TEST_HBASE_TABLE");

Now you will have a single column containing a map of all the fields in the family. If you want to split them out, you can create a view on top of this table, extracting the individual columns from this big map.

CREATE VIEW hive_hbase_test_with_columns
AS
SELECT rowkey,
w['rank:tws'] as rank_tws,
w['rank:hd'] as rank_hd,
w['n_clicks:1'] as n_clicks_1,
w['n_clicks:2'] as n_clicks_2
FROM hive_hbase_test;
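
The view can then be queried like a regular Hive table, for example:

-- query the view instead of dealing with the raw map
SELECT rowkey, rank_tws, n_clicks_1
FROM hive_hbase_test_with_columns
WHERE rank_tws IS NOT NULL;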

The drawback is that the view is applied after the base mapping takes place, which means we cannot avoid loading all the columns of the family, even when we are only interested in a few of them.
Hopefully future releases of the SerDe will fix this problem and we will be able to simply map column qualifiers containing any arbitrary character, as HBase itself supports.


About Gianmario

Data Scientist with experience in building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geeks community.