HBase Secondary Indexes using Fuzzy Filter

HBase is well suited to real-time queries that use rowkey prefix matching as the index, but sometimes secondary indexes are required.

We can organize our data by placing some fields in the rowkey and storing the remaining ones as column qualifiers.

Example:

<USER>_<DATE_TIME>_<WEB_DOMAIN>

We are now able to retrieve all the data for one particular user just by running a scan with STARTROW => "USERNAME" and FILTER => PrefixFilter("USERNAME").
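
To make the per-user lookup concrete, here is a minimal sketch in plain Python. It does not call HBase; it only imitates how a scan over lexicographically sorted rowkeys can seek to a prefix and stop as soon as the prefix no longer matches. The sample keys, the timestamp format and the function name prefix_scan are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical sample rowkeys in the <USER>_<DATE_TIME>_<WEB_DOMAIN> layout,
# kept sorted the way HBase keeps rows sorted by key.
ROW_KEYS = sorted([
    "alice_20240101120000_example.com",
    "alice_20240102090000_example.org",
    "bob_20240101120000_example.com",
    "carol_20240103180000_example.net",
])

def prefix_scan(sorted_keys, prefix):
    """Return the keys that start with prefix, reading only the matching range."""
    start = bisect_left(sorted_keys, prefix)  # seek to the first candidate key
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches

# Include the separator so that "alice_" does not also match a user "alice2".
alice_rows = prefix_scan(ROW_KEYS, "alice_")
```

The point of the sketch is that the scan touches only the rows of the requested user, which is why a leading-field lookup is cheap.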

What if we want to find out all the users for a given date and web domain?

With this design, the only option in HBase is a full scan with a regex filter on the rowkey, which means reading the entire table and paying a heavy performance cost.
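
The cost of the regex approach can be illustrated with a small simulation, again in plain Python rather than against HBase. The sample keys and the function name regex_scan are hypothetical; the counter shows that every row is examined even though only a few match.

```python
import re

# Hypothetical sample rowkeys in the <USER>_<DATE_TIME>_<WEB_DOMAIN> layout.
ROW_KEYS = sorted([
    "alice_20240101120000_example.com",
    "alice_20240102090000_example.org",
    "bob_20240101120000_example.com",
    "carol_20240103180000_example.net",
])

def regex_scan(sorted_keys, pattern):
    """Test every key against the regex and count how many keys were examined."""
    compiled = re.compile(pattern)
    examined = 0
    matches = []
    for key in sorted_keys:
        examined += 1  # the user is unknown, so no key range can be skipped
        if compiled.fullmatch(key):
            matches.append(key)
    return matches, examined

# Any user, one fixed date and one fixed domain.
matches, examined = regex_scan(ROW_KEYS, r"[^_]+_20240101120000_example\.com")
```

Here two rows match, but all four are read. On a real table the same ratio applies to every row in every region.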

Secondary indexes are also possible: we could store the same data in a second table with a different rowkey schema, such as:

<DATE_TIME>_<WEB_DOMAIN>

and a column users containing all the usernames seen for the given date and domain.

This approach implies duplicating the data and keeping the indexes consistent, which is not always an easy job.
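
A rough sketch of the index-table approach follows, using two in-memory dictionaries as stand-ins for the two HBase tables. The names main_table, index_table and put_visit are hypothetical, as is the "hits" column.

```python
from collections import defaultdict

# In-memory stand-ins for the two tables (assumption: not real HBase tables).
main_table = {}                  # "<USER>_<DATE_TIME>_<WEB_DOMAIN>" -> columns
index_table = defaultdict(set)   # "<DATE_TIME>_<WEB_DOMAIN>" -> usernames

def put_visit(user, date_time, domain, columns):
    """Write the row to the main table and update the index table."""
    main_table[f"{user}_{date_time}_{domain}"] = columns
    # Second, separate write: if it fails or is skipped, the index goes stale.
    index_table[f"{date_time}_{domain}"].add(user)

put_visit("alice", "20240101120000", "example.com", {"hits": 3})
put_visit("bob", "20240101120000", "example.com", {"hits": 1})

# The index answers "which users for this date and domain?" with one lookup.
users = sorted(index_table["20240101120000_example.com"])
```

Every write now touches two tables, and HBase guarantees atomicity only within a single row, so the application has to handle the case where one write succeeds and the other does not.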

The proposed solution

The proposed solution is based on designing the rowkey with fixed-length fields and using the Fuzzy Row Key filter in place of the regex one.

The technique is known as fast-forwarding server-side filtering with a fuzzy byte mask.
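
The matching rule behind a fuzzy byte mask can be sketched as follows. This is plain Python, not the HBase API: the field widths, the "#" padding character and the helper names are assumptions. The mask follows the convention documented for HBase's FuzzyRowFilter, where 0 marks a byte that must match and 1 marks a byte that may hold any value.

```python
USER_LEN, DATE_LEN, DOMAIN_LEN = 8, 14, 16  # hypothetical fixed field widths

def make_row_key(user, date_time, domain):
    """Build a fixed-length rowkey by padding each field to its width with '#'."""
    parts = [(user, USER_LEN), (date_time, DATE_LEN), (domain, DOMAIN_LEN)]
    for value, width in parts:
        if len(value) > width:
            raise ValueError(f"{value!r} is longer than {width} characters")
    return "_".join(v.ljust(width, "#") for v, width in parts).encode("ascii")

def fuzzy_match(row_key, pattern, mask):
    """mask byte 0: position must equal the pattern; mask byte 1: any byte."""
    if len(row_key) < len(pattern):
        return False
    return all(m == 1 or k == p for k, p, m in zip(row_key, pattern, mask))

rows = sorted([
    make_row_key("alice", "20240101120000", "example.com"),
    make_row_key("alice", "20240102090000", "example.org"),
    make_row_key("bob", "20240101120000", "example.com"),
    make_row_key("carol", "20240101120000", "example.net"),
])

# Unknown user (first USER_LEN bytes are fuzzy), fixed date and domain.
pattern = make_row_key("?" * USER_LEN, "20240101120000", "example.com")
mask = bytes([1] * USER_LEN + [0] * (len(pattern) - USER_LEN))

matched = [key for key in rows if fuzzy_match(key, pattern, mask)]
users = [key[:USER_LEN].decode("ascii").rstrip("#") for key in matched]
```

The sketch covers only the per-row test. It does not model the fast-forwarding itself: because every field sits at a known byte offset, the server-side filter can work out the next rowkey that could possibly match and seek to it, instead of testing each row the way the loop above does.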