HBase Secondary Indexes using Fuzzy Filter

HBase is a great technology for real-time querying of data using the rowkey prefix matching as index, but sometimes secondary indexes are required.

We can organize our data inserting some columns in the rowkey and the remaining ones as column qualifiers.

Example:

<USER>_<DATE_TIME>_<WEB_DOMAIN>

We are not able to filter all the data regarding one particular user just by doing a scan operation using STARTROW =>
“USERNAME” and FILTER => RowKeyPrefixFilter(“USERNAME”) .

What if we want to find out all the users for a given date and web domain?

Due to our design the only way possible in HBase is to have a full scan and use the rowkey regex filter but this means scanning the entire table with dramatic performance issues.

Secondary indexes also are possible, we could store the same data into a second table but with a different schema like:

<DATE_TIME>_<WEB_DOMAIN>

and a column users containing all the usernames seen for the given date and domain.

This approach implies duplicating the data and also maintaining the indexes consistent, which is not always an easy job.

The proposed solution

The proposed solution is based on the design of fixed length keywords in the rowkey and the Fuzzy Row Key filter in replacement of the regex one.

The technique is called fast-forwarding server-side filters using the fuzzy byte-mask. Continue reading HBase Secondary Indexes using Fuzzy Filter

Hive mapping of HBase columns containing colon ‘:’ character

Apache HBase

Many of you are familiar with HBase. If you are not, HBase is a NoSQL database modeled after Google’s BigTable paper was published and aims to provide a key-value columnar database on top of HDFS, the Hadoop File System.

HBase lets you insert/query data indexed by a rowkey and organized into columns and families of columns. The rowkey is unique for each data but it can be associated to “unlimited” number of cells, where each cell corresponds to a given column. Columns are then grouped by family so that columns belonging to the same family are always partitioned together.

HBase data model. Source: HBASE SCHEMA DESIGN and Cluster Sizing Notes ApacheCon Europe, November 2012
HBase data model. Source: HBASE SCHEMA DESIGN
and Cluster Sizing Notes
ApacheCon
Europe, November 2012

The column identifier is then specified by <COLUMN_FAMILY>:<COLUMN_NAME>

For instance let’s suppose we have one family called ‘w’ and the columns are: ”referral’, ‘n_clicks’, ‘rank’. The corresponding identifiers are: ‘w:referral’, ‘w:n_clicks’, ‘w:rank’. Sometimes we want to hierarchical organize our columns so that we have multiple sub-columns of the same parent column, something like: ‘n_clicks of 1st order’, ‘n_clicks of 2nd order’ … or ‘rank on topwebsites.foo.bar’, ‘rank on hottestdomains.foo.bar’ and so on. One first idea would be to keep the same notation of the column family/qualifier using the colon ‘:’ character as delimiter: ‘w:n_clicks:1’, ‘w:n_clicks:2’, ‘w:rank:tws’,’w:rank:hd’… HBase will convert the string representation of the column qualifier into bytes. Though HBase allows you to use any character, we want to show in this post why we do not recommend.

HBase is great for scalable and quick queries, but what if we want to scan and process the entire table? Since our physical data resides in HDFS we can run a Map Reduce job on top of it. Good! Do we really want to write a row M/R job? Wouldn’t be nice to have a SQL-like language that automatically maps our data from HBase into something more structured?

Apache Hive

Hive is a data warehouse that projects structure onto the data stored in HDFS and provides a SQL-like language called HiveQL for building easily query then translated in M/R jobs. Hive can externally maps data outside its repository as long as it resides in HDFS, this solution is called Hive external mapping and it works fine (almost) with data stored in HBase. Continue reading Hive mapping of HBase columns containing colon ‘:’ character