HBase Secondary Indexes using Fuzzy Filter

HBase is a great technology for real-time querying of data using the rowkey prefix matching as index, but sometimes secondary indexes are required.

We can organize our data inserting some columns in the rowkey and the remaining ones as column qualifiers.

Example:

<USER>_<DATE_TIME>_<WEB_DOMAIN>

We are not able to filter all the data regarding one particular user just by doing a scan operation using STARTROW =>
“USERNAME” and FILTER => RowKeyPrefixFilter(“USERNAME”) .

What if we want to find out all the users for a given date and web domain?

Due to our design the only way possible in HBase is to have a full scan and use the rowkey regex filter but this means scanning the entire table with dramatic performance issues.

Secondary indexes also are possible, we could store the same data into a second table but with a different schema like:

<DATE_TIME>_<WEB_DOMAIN>

and a column users containing all the usernames seen for the given date and domain.

This approach implies duplicating the data and also maintaining the indexes consistent, which is not always an easy job.

The proposed solution

The proposed solution is based on the design of fixed length keywords in the rowkey and the Fuzzy Row Key filter in replacement of the regex one.

The technique is called fast-forwarding server-side filters using the fuzzy byte-mask.

If you are not familiar with HBase filters, there is a feature where each filter could give hints regarding the next rowkey to seek: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT .

For more information, see here: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html#getNextCellHint%28org.apache.hadoop.hbase.Cell%29

FuzzyRowFilter takes advantage of a byte-mask specified by the user to generate hints of the next predicted rowkey resulting in smart jumps among rows and avoiding a full table scan. In order to be able to specify a query mask we must design our rowkey schema so that each field has fixed length, truncating and padding could be used for adapting variable-length fields, if any.

Let’s suppose that the maximum number of characters for an username is 10 and that we can express a date using the format YYYYMMDD and a webdomain with at most 20 characters, we will have something like:

username date web domain variable length rowkey fixed length rowkey
alice 28th June 2014 foo.bar alice_20140628_foo.bar alice*****20140628foo.bar*************
bob 1st June 2014 example.com bob_20140601_example.com bob*******20140628example.com*********

If we want to perform a scan of all the usernames of 28th June 2014 for the domain foo.bar we will have to specify a mask as following:

??????????20140628foo.bar*************

? is the wildcard that represents the bytes of the rowkey that are unknown and for which we need to make predictions in the next seek hint.

This does not only apply to the prefix of the rowkey but we can mix any character with the wild char ?.

If we want to filter only by date, we will specify:

??????????20140628????????????????????

If we want to filter by year and month but we do not know the date but instead we are interested on all the domains of foo (foo.com, foo.bar, foo.net, ecc…), we could create a mask like:
??????????201406??foo.????????????????

It is important to specify foo. otherwise we may match stuff like foo-1.com or foooooo.net .

The documentation of the FuzzyRowFilter (http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/filter/FuzzyRowFilter.html) provides this example:

Filters data based on fuzzy row key. Performs fast-forwards during scanning. It takes pairs (row key, fuzzy info) to match row keys. Where fuzzy info is a byte array with 0 or 1 as its values:

  • 0 – means that this byte in provided row key is fixed, i.e. row key’s byte at same position must match
  • 1 – means that this byte in provided row key is NOT fixed, i.e. row key’s byte at this position can be different from the one in provided row key

Example: Let’s assume row key format is userId_actionId_year_month. Length of userId is fixed and is 4, length of actionId is 2 and year and month are 4 and 2 bytes long respectively. Let’s assume that we need to fetch all users that performed certain action (encoded as “99”) in Jan of any year. Then the pair (row key, fuzzy info) would be the following: row key = “????_99_????_01” (one can use any value instead of “?”) fuzzy info = “\x01\x01\x01\x01\x00\x00\x00\x00\x01\x01\x01\x01\x00\x00\x00” I.e. fuzzy info tells the matching mask is “????_99_????_01”, where at ? can be any value.

How it works?

HBase will start scanning from the first row applying the defined mask which will substitute the characters that are not wild (expressed with ?) resulting into an hint for the next rowkey.

Let’s assume we have the following series of rows:

  1. alice*****20140310foo.bar*************
  2. alice*****20140312foo.bar*************
  3. alice*****20140629foo.bar*************
  4. alice*****20140704foo.bar*************
  5. ali1989***20140310example.com*********
  6. ali1989***20140522example.com*********
  7. ali1989***20140628example.com*********
  8. ali1989***20140628example.net*********
  9. bob*******20140624example.com*********
  10. bob*******20140625example.com*********
  11. bob*******20140626example.com*********
  12. bob*******20140627example.com*********
  13. bob*******20140628example.com*********

If the first rowkey we hit in our table is ‘alice*****20140310foo.bar*************’, applying the mask ‘??????????20140628????????????????????’ the next hint will be startrow ‘alice*****20140628’, we can jump to the row 3 without scanning all the rows of alice user in the middle, if the next row at the specified startrow point does not start with ‘alice*****20140628’ means there is not data for user alice on the specified date, if this row starts with alice but with another date (lexicographical higher) we can skip the user alice and jump to the next user. If it directly points to a different user, we can re-applying the mask as we have done for alice and jump to the interested date, if any. If we do not know who is the next user, we can skip to the next unknown one simply incrementing the right-most character of ‘alice*****’ which will be ‘alice****+’, please note that ‘*’ is the 42th character in the ASCII code and ‘+’ is the 43th. The next jump will point to row 5 ‘ali1989***20140310foo.bar*************’. From row 5 can directly jump to row 7 applying the mask and now we found one result matching our filtering criteria. We can keep scanning next in-order rows until we hit something that does not match our mask anymore, aka we are scanning all the domains of user ali1989 on the day 20140628. The stopping row will be the 9, which has a new user bob. We can jump to 13 and so on…

 Efficiency

The efficiency strictly depends on the cardinality of the keyword we want to filter and how many jumps it has to perform in order to skip. In our example of getting all the rows of the specified date, for each user we were jumping to the next user-date prefix. The number of jumps will be equal to the cardinality of users. The skipping factor is given by the cardinality of the date multiplied by the cardinality of the domains, that represents the average number of rows to scan for each user. Supposing we have accumulated data for 1 year, 1000 users and that each user in average visits 30 web domain a day, we will have 30 * 365  * 1000 ~ 10 millions rows but we will retrieve only the rows of the specified date (30000) at the cost of performing 1000 jumps.

Advertisements

About Gianmario

Data Scientist with experience on building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geeks community.
This entry was posted in Big Data, HBase and tagged , , , . Bookmark the permalink.

One Response to HBase Secondary Indexes using Fuzzy Filter

  1. Ihor Mochurad says:

    That’s a great blogpost, thanks for sharing.
    I have a question though. Let’s assume I am using FuzzyRowFilter to fetch timeseries data.
    My composite key consists of 6 components, one of them is salt_bucket and one of them is timestamp. While I’ll be doing exact matching on 3 components, I still like to have an ability to specify the time range for my query. Otherwise, it will bring back millions of rows.
    Is FuzzyRowFilter designed for this kind of scenario?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s