The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour

From Data Science Milan meetup event:

In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a Kaggle-like competition, where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:

• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.

• The benefits of type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes (a small sketch follows this list).

• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.

• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.

• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).

• How Scala (and functional programming) helped our cause.
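
As a taste of the type-safe ETL point above, here is a small, purely illustrative sketch of how shopping behaviour can be modelled as nested case classes and transformed in a type-safe way (the field names are hypothetical, not the actual Barclays schema):

```scala
// Hypothetical nested representation of a customer's shopping behaviour.
case class Purchase(merchant: String, category: String, amount: Double)
case class Customer(customerId: String, postcode: String, purchases: Seq[Purchase])

// A typed transformation: total spend per category for one customer.
// The compiler checks field names and types, unlike string-keyed rows.
def spendByCategory(customer: Customer): Map[String, Double] =
  customer.purchases
    .groupBy(_.category)
    .map { case (category, ps) => category -> ps.map(_.amount).sum }
```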

Coding practices for data products development

This is part 2 of 4 of the "Lessons learnt from building Data Science systems at Barclays" series.

Coding practices

Code should be developed in a proper IDE and make use of advanced tools for refactoring, auto-completion, syntax highlighting and auto-formatting, at the very least.

Notebooks should use routine libraries from the main codebase. As soon as code developed in a notebook turns out to be reusable, it should be moved into the codebase. A good rule of thumb is that a notebook cell should not exceed 10 lines; beyond that it either needs refactoring or should be pulled out. The only exception is long code used solely for a one-off investigation that makes no sense outside that particular context.
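
For example, a hypothetical parsing routine that would otherwise grow inside a notebook cell can live in the main codebase and simply be imported, keeping the cell short (the package and names below are purely illustrative):

```scala
package com.example.etl // illustrative package name

import scala.util.Try

object TransactionParsers {
  case class Transaction(customerId: String, amount: Double)

  // Reusable, unit-testable parsing logic kept out of the notebook.
  def parseLine(line: String): Option[Transaction] =
    line.split(",") match {
      case Array(id, amount) => Try(Transaction(id, amount.toDouble)).toOption
      case _                 => None
    }
}

// The notebook cell then stays well under ten lines:
//   import com.example.etl.TransactionParsers._
//   val transactions = rawLines.flatMap(parseLine)
```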

Do not introduce unnecessary dependencies in the codebase (e.g. plotting libraries). Keep the code repository lean; add dependencies to your particular use case rather than to the project repository.

During development it is recommended to make frequent git commits. When the ticket is ready to go, the developer should first run a git diff develop and review their own code before creating the pull request (PR).

The pull request should only contain the minimum amount of code specified in the corresponding ticket requirements. Don't anticipate functions that you know you will need in the future, even if that future is only a couple of hours away. Avoid abstractions or general-purpose methods. First get working code for your specific use case, then refactor it.

Agile manifesto says:

“Simplicity–the art of maximizing the amount
of work not done–is essential.”

Make your code structure flat:

  • data containers
  • static classes containing functions/methods/utils
  • entry point classes defining the end-to-end job and putting all of the pieces together
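
A minimal Scala sketch of what such a flat structure could look like (all names are invented for illustration):

```scala
// 1. Data containers: plain case classes with no behaviour.
case class Transaction(customerId: String, merchant: String, amount: Double)
case class Recommendation(customerId: String, merchant: String, score: Double)

// 2. A static object holding pure functions/utils.
object RecommenderFunctions {
  def topMerchantsBySpend(txs: Seq[Transaction], n: Int): Seq[Recommendation] =
    txs.groupBy(t => (t.customerId, t.merchant))
      .map { case ((customer, merchant), ts) =>
        Recommendation(customer, merchant, ts.map(_.amount).sum)
      }
      .toSeq
      .sortBy(-_.score)
      .take(n)
}

// 3. An entry-point object defining the end-to-end job and wiring the pieces together.
object RecommenderJob {
  def main(args: Array[String]): Unit = {
    val transactions = Seq(
      Transaction("a", "coffee-shop", 3.5),
      Transaction("a", "coffee-shop", 2.8),
      Transaction("b", "bookstore", 12.0))
    RecommenderFunctions.topMerchantsBySpend(transactions, n = 2).foreach(println)
  }
}
```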

Copy and paste the same code if needed; duplication is not always bad if it makes the design simpler. Only extract methods and abstract classes when you have at least 3 use cases.

Comments in the code are very likely to end up as out-of-sync documentation. Clean code, good design and self-explanatory naming will make your code self-documenting. The only exceptions are TODOs, FIXMEs and annotations explaining why a hack was needed and under which conditions the current implementation might fail. Obviously avoiding hacks in the first place is the best solution, but sometimes we need to live with them. Use TODOs liberally, but do not leave non-working code without annotations.

Extreme attention should be paid to code style and conventions. Badly formatted code or inconsistent patterns make the code very hard to read and maintain.

After the PR is sent for review, chase your reviewer to review your code as soon as possible. Resist starting a new task until the review is finished and the PR merged into the develop branch. Do one thing at a time and move to the next only when the previous is 100% done.

Reviewers should not accept justifications for bad practices. Code review is the only way to guarantee that the team converges towards excellence, and it definitely pays off in the long term. The review process should go back and forth until both parties are satisfied.

Testing

(Image: "no test, no beer")

You should always come up with smart ways of testing your code. Laziness or "I know it works" attitudes should not be accepted. The only code that may not require tests is one-off analysis, since it is humanly supervised and is not going into production.

Code without tests is risky; it cannot be refactored and cannot be maintained, since unit tests also serve as documentation. If someone changes your code, you can still be blamed and held responsible for the failure even though your code used to work. Tests are the only way of protecting the validity of your solutions. Time spent on testing is the greatest long-term investment you can make for your project.

If you spot a bug that was not caught by your tests, that is an indicator that a test case should be added. Don't just fix it; make sure you first have a failing test for it. Debug your code by adding unit tests and breaking down end-to-end methods into smaller composable functions. Debugging by adding unit tests gives you a much safer and more repeatable way to make your code robust.
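
For instance, a small regression test for a hypothetical bug (the function, the bug and the use of ScalaTest are all assumptions made for illustration):

```scala
import org.scalatest.funsuite.AnyFunSuite // ScalaTest 3.1+ style import

// Hypothetical function under test: it used to throw on an empty sequence.
object Stats {
  def meanSpend(amounts: Seq[Double]): Double =
    if (amounts.isEmpty) 0.0 else amounts.sum / amounts.size
}

class StatsSpec extends AnyFunSuite {
  // The failing test written before the fix, kept afterwards as a regression test.
  test("meanSpend returns 0.0 on an empty sequence instead of failing") {
    assert(Stats.meanSpend(Seq.empty) == 0.0)
  }

  test("meanSpend averages non-empty sequences") {
    assert(Stats.meanSpend(Seq(2.0, 4.0)) == 3.0)
  }
}
```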

Read-eval-print-loop (REPL) debugging is just another type of exploratory analysis; if you want to go that way, remember to turn your manual checks into automated tests.

Obviously, none of the above problems would exist with test-driven development (TDD).

When your imagination for manual test cases is running out, or you are tired of adding tests that always succeed, consider also adding a few property-based tests with random generators.
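
A minimal sketch of what that can look like with ScalaCheck (the library choice, the helper function and the properties are assumptions for illustration): instead of hand-picking inputs, you state properties that must hold for any randomly generated input.

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object NormaliseMerchantSpec extends Properties("normaliseMerchant") {
  // Hypothetical helper: canonicalises raw merchant name strings.
  def normaliseMerchant(raw: String): String =
    raw.trim.toLowerCase.replaceAll("\\s+", " ")

  // Normalising twice must give the same result as normalising once.
  property("idempotent") = forAll { (raw: String) =>
    normaliseMerchant(normaliseMerchant(raw)) == normaliseMerchant(raw)
  }

  // The output never contains runs of consecutive spaces.
  property("no double spaces") = forAll { (raw: String) =>
    !normaliseMerchant(raw).contains("  ")
  }
}
```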

Unit tests are necessary, but it is the whole end-to-end flow that matters. Make sure you have at least a few integration tests in place; ideally, those integration tests map to real use cases.

Pair working

We found pair working to be much more productive than working as isolated individuals. A data science team is generally cross-functional, with people ranging from a more engineering background to a more theoretical analytical/statistics background. A good rule is to pair opposite individuals together and swap their competencies, so that whoever is good at coding does the modelling and vice versa. The code review process still applies as usual even though the code was written together; it might be worth involving someone else with no prior knowledge of the project to review the code and methodology.

Functional Programming

Functional programming offers a few advantages over other paradigms, and we found it suits data munging and machine learning algorithms very well. Just to name a few:

  • Implementing any complex logic as a combination of simple first-order functions instead of long, non-reusable methods.
  • No state and no side effects: the same code returns the same output at every single call. No debugging is needed.
  • Close match with maths. You can implement any algorithm the same way you read it in academic papers.
  • No need to think about how to make your code execute efficiently. Focus on functionality only.
  • A high level of abstraction keeps your brain trained on lateral thinking instead of following mechanical procedures.
  • Conciseness: you will be surprised by how many algorithms (single-node or distributed) can be implemented in a single line.
  • Higher readability: you only need to understand what the functions aim to do, not what the values of each variable represent at each step.
  • Concurrency for free at no extra cost. Full parallelism.
  • The same code for local implementations magically scales up in a distributed environment. That means you can prototype locally without having to re-engineer your solution for the big data system (see the sketch after this list).
  • The type system tells you which functions can be used and what form the intermediate transformations take. No need for read-eval-print loops or hacky print calls. It is easy to implement, reason about and refactor complex algorithms without introducing bugs.
  • No explicit loops; you know how your algorithm converges via recursion.
  • Flat and minimal structure, with no need to create tons of classes or verbose notation. You can use anonymous functions, pattern matching and wildcard notation.
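
To make the conciseness and local-to-distributed points concrete, here is a small sketch (names are illustrative; the distributed version assumes Spark's RDD API) of the same aggregation expressed on a local collection and on a cluster:

```scala
import org.apache.spark.SparkContext

// (customerId, amount) pairs; in reality these would come from the ETL.
val localTxs: Seq[(String, Double)] = Seq(("a", 3.5), ("b", 12.0), ("a", 2.8))

// Total spend per customer on a local Scala collection:
val localTotals: Map[String, Double] =
  localTxs.groupBy(_._1).map { case (c, ts) => c -> ts.map(_._2).sum }

// Essentially the same one-liner, distributed over a cluster
// (sc is an existing SparkContext):
def distributedTotals(sc: SparkContext): Map[String, Double] =
  sc.parallelize(localTxs).reduceByKey(_ + _).collect().toMap
```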

Popular languages in Data Science are not always natively functional but most of them offer their functional extension or some external library does. See for example this project of introducing the functional APIs of Scala to Python collections: http://pedrorodriguez.io/blog/2015/03/14/functional-programming-collections-python/.

If you work in Data Science or Big Data and have never done functional programming before, you should really look into it. You might find the learning curve a bit steep at the beginning, but once you master it you will be superbly productive.

Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are:

  1. Data stored in non-scalable infrastructure for analysis and processing
  2. Data governance and security policies

1. Data often resides in a central data warehouse and RDBMSs on which many legacy applications and analysts depend.
Data scientists, instead, cannot build their models or perform exploratory analysis through SQL queries alone. They need the data to be available in a scalable, programmatic and reactive stack such as Hadoop and Apache Spark, where they can develop their logic in languages such as Python, R or Scala (for how Python and Scala compare for Spark, see this post: 6 points to compare Python and Scala for Data Science using Apache Spark).
2. Nevertheless, data cannot just be transferred (in technical terms, sqoop-ed) to a Hadoop cluster without incurring tedious bureaucracy, ingestion inconsistencies and strict policies. In big corporations that translates into at least a month to decide which tables are interesting and a few more months to write the ETL logic, move the data and test its consistency.

At Barclays we developed a stack to logically map the raw data from the central data warehouse into Spark and use Tachyon to keep the data in memory for long-term availability. With this stack we can iterate fast, with immediate data availability from a scalable Big Data cluster, by skipping the data ingestion process while still complying with all of the data policies.
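
A minimal sketch of the idea (Spark 1.x era APIs; the JDBC connection string, table name and Tachyon path are placeholders, not the actual Barclays setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("logical-dwh"))
val sqlContext = new SQLContext(sc)

// Logically map a table straight from the central warehouse via JDBC,
// with no ETL pipeline to build and maintain.
val raw = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:<warehouse-connection-string>") // placeholder
  .option("dbtable", "customer_transactions")          // placeholder
  .load()

// Persist it once into Tachyon (an in-memory distributed file system) ...
raw.write.parquet("tachyon://tachyon-master:19998/dwh/customer_transactions")

// ... and re-read it in seconds on every subsequent iteration.
val cached = sqlContext.read
  .parquet("tachyon://tachyon-master:19998/dwh/customer_transactions")
```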

Tachyon was the key enabling technology for us.

Our workflow iteration time decreased from hours to seconds. Tachyon enabled something that was impossible before.

You can find the original article published on DZone in collaboration with Gene Pang, Software Engineer at Tachyon Nexus and Haoyuan Li, CEO of Tachyon Nexus:
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

AGILE DATA SCIENCE ITERATION 0: The Simple Solution

This is the third post of the "Agile Data Science Iteration 0" series.

Previously

What we have achieved so far (see previous posts above):

  1. Rigorous definition of the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather initial dataset into proper infrastructure
  6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset

At this stage you should have a clear statement of what problem you are trying to solve and how you can objectively measure the quality of any possible solution, regardless of what the final implementation will be. You have an initial background of the state of the art and an understanding of what the data looks like. You can now implement your first simple solution to the problem.

The Basic Solution

7. Define and quickly develop the simplest solution to the problem

The challenge here is: "Are you able to implement a basic solution that solves the end-to-end goal (not necessarily with the required quality) in a few days?"

Before trying to think of very scalable algorithms or advanced modelling techniques, have you thought about a simple rules classifier? What if you could easily predict whether a user is about to default on their bank loan by simply looking at the difference between how much they earned and spent in the past 3 months, and come up with a rule like "if that amount is less than X then the user is very likely to default"?
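
A hypothetical sketch of such a rules baseline (the field names and the threshold are invented for illustration, not an actual model):

```scala
// Aggregated earnings and spending over the past 3 months (hypothetical schema).
case class CustomerQuarter(customerId: String, earned: Double, spent: Double)

// The whole "model": flag customers whose net balance falls below a threshold X.
def likelyToDefault(c: CustomerQuarter, threshold: Double): Boolean =
  (c.earned - c.spent) < threshold

val customers = Seq(
  CustomerQuarter("a", earned = 9000.0, spent = 8500.0),
  CustomerQuarter("b", earned = 6000.0, spent = 7200.0))

val flagged = customers.filter(c => likelyToDefault(c, threshold = 0.0))
// flagged contains only customer "b", who spent more than they earned.
```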

Maybe you spent 3 days on analysis and 2 days on development and you solved your problem even before properly starting it! Or maybe it is not good enough, but you now have a baseline to compare against when, at each iteration, you analyse the risk of continuing the project versus what could be achieved in just 5 days of work.

8. Release/demo first basic solution

What could be more agile than releasing and demoing the basic solution? Never underestimate the value of feedback, and how your mind can focus on the next big thing once everything done so far is checkpointed and reviewed.

You now have a quick and simple solution that does try to solve the business goal, even if it might not be accurate yet. It could, but does not necessarily, qualify as your MVP; that depends on whether the acceptance criteria are fulfilled or not.
What is important is that you have spent just a few days and have something to deliver and demo. This gives you the following benefits:

  • trust from your stakeholders that you can deliver quickly
  • first set of feedback
  • inspiration for further improvements
  • baseline for comparison

You have all of the knowledge to start preparing your solution proposal.

The Proposal Preparation

9. Research the background for ways to improve the basic solution

Now you clearly know what goal you want to achieve, what minimum requirements you have to meet, what the data looks like and what basic solution to compare against. This is the right time to do some deeper research into better ways of solving the problem, using more advanced techniques, domain-specific knowledge and/or additional datasets.

10. Gather additional data into proper infrastructure (if required)

As in step 5, but only if additional data is required by the current proposal.

11. Ad-hoc Exploratory Data Analysis (EDA)

At this stage the EDA explicitly targets extracting knowledge related to the improved solution you are about to propose.

12. Propose a better solution, weighing potential risks against marginal gain

Because you now have a comparison baseline, you should prefer quantifying the incremental benefit of your model rather than its absolute performance, and try to trade off the additional complexity against the potential value gain.

At this stage you should have most of the requirements defined. You have probably changed your mind several times while researching the problem and re-scoping it into smaller problems. You now know what has to be implemented for your first MVP.

***

Details of how to structure your ETL will follow in the next post of the "Agile Data Science Iteration 0" series, stay tuned.
Meanwhile, why not share it or leave a comment below?
