The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour

From Data Science Milan meetup event:

In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted on using Bristol customer shopping behaviour data to make personalised recommendations in a sort of Kaggle-like competition where each team’s goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a specifically built framework.
The talk will cover:

• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.

• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.

• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.

• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.

• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).

• How Scala (and functional programming) helped our cause.

Robust and declarative machine learning pipelines for predictive buying

Proof of concept of how to use Scala, Spark and the recent library Sparkz for building production quality machine learning pipelines for predicting buyers of financial products.

The pipelines are implemented through custom declarative APIs that gives us greater control, transparency and testability of the whole process.

The example followed the validation and evaluation principles as defined in The Data Science Manifesto available in beta at