Proof of concept of how to use Scala, Spark and the recent Sparkz library to build production-quality machine learning pipelines for predicting buyers of financial products.
The pipelines are implemented through custom declarative APIs that give us greater control, transparency and testability over the whole process.
The example follows the validation and evaluation principles defined in the Data Science Manifesto, available in beta at http://www.datasciencemanifesto.org
At the Advanced Data Analytics team at Barclays, we solved the Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study consists of recommending a sequence of WordPress blog posts that users may like, based on their historical likes and blog/post/author characteristics.
What we want to share is a mix of methodology and tools for:
- Investigating the data interactively;
- Writing quality code in a productive environment;
- Embedding the developed functions into executable entry points;
- Presenting the results in a clean and visual way; and
- Meeting the required acceptance criteria.
AKA: delivering a Data Science MVP quickly and in a fully Agile way!
The topics covered in this workshop are:
- DataFrame/RDD conversions and I/O
- Exploratory Data Analysis (EDA)
- Scalable Feature Engineering
- Modelling (MLlib and ML)
- End-to-end Evaluation
- Agile Methodology for Data Science
At the end of the workshop, the lessons learnt were:
- Spark, DataFrames, RDDs
DataFrames are great for I/O and schema inference from the sources, and work well when you have flat schemas; operations become more complicated with nested and array fields.
RDDs give you the flexibility of doing your ETL using the richness of the Scala framework; on the other hand, you must be careful to optimize your execution plans yourself.
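A minimal sketch of this split of responsibilities, assuming a hypothetical posts dataset in JSON and the `sc` provided by the notebook (the record and field names are illustrative, not the competition schema):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Hypothetical, simplified record for illustration only.
case class BlogPost(postId: Long, blogId: Long, authorId: Long, likedBy: Seq[Long])

val sqlContext = new SQLContext(sc)

// DataFrame shines at I/O and schema inference...
val postsDF = sqlContext.read.json("data/posts.json")

// ...but for ETL over nested/array fields it is often easier to drop
// down to an RDD of case classes and use plain Scala collections.
val posts: RDD[BlogPost] = postsDF.rdd.map { row =>
  BlogPost(
    postId = row.getAs[Long]("postId"),
    blogId = row.getAs[Long]("blogId"),
    authorId = row.getAs[Long]("authorId"),
    likedBy = row.getAs[Seq[Long]]("likedBy"))
}

// A pure, side-effect-free transformation: likes per blog.
val likesPerBlog: RDD[(Long, Int)] =
  posts.map(p => p.blogId -> p.likedBy.size).reduceByKey(_ + _)
```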
Functional programming allowed us to express complex logic with simple and clear code, free of side effects.
Map-side joins with broadcast maps are very efficient, but we need to reduce the map to its minimum size before broadcasting it, e.g. by filtering out the keys that will never match before the join, or by capping the size of each value in the case of variable-size structures (e.g. hash maps).
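A minimal sketch of this pattern; the names and the `PostMeta` record are assumptions for illustration:

```scala
import org.apache.spark.rdd.RDD

// Illustrative metadata record; not the competition schema.
case class PostMeta(blogId: Long, authorId: Long)

def broadcastJoin(
    userLikes: RDD[(Long, Long)],          // (userId, postId)
    postMeta: Map[Long, PostMeta]): RDD[(Long, Long, PostMeta)] = {
  val sc = userLikes.sparkContext

  // Shrink the map before broadcasting: keep only keys that can match.
  val candidateKeys = userLikes.map(_._2).distinct().collect().toSet
  val filtered = postMeta.filterKeys(candidateKeys.contains).toMap // force a strict Map

  val metaBC = sc.broadcast(filtered)

  // The join is now a local lookup on each executor: no shuffle involved.
  userLikes.flatMap { case (userId, postId) =>
    metaBC.value.get(postId).map(meta => (userId, postId, meta))
  }
}
```

Note the `.toMap` after `filterKeys`: it forces a strict copy, so the broadcast does not accidentally ship the original full map through the lazy view.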
- ML, MLlib
Do not underestimate simple solutions. In the worst case, they serve as a baseline for benchmarking.
Even though Logistic Regression was better at classifying true/false, the simple model outperformed it when running the end-to-end ranking evaluation.
Focus on solving problems rather than models or algorithms.
Many Data Science problems can be solved with counts and divisions, e.g. Naïve Bayes.
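As a hedged illustration of the counts-and-divisions point, a like-rate scorer in the spirit of Naive Bayes could look like this (names and the smoothing constants are assumptions, not the workshop's actual model):

```scala
import org.apache.spark.rdd.RDD

// events: (userId, postId, liked) observations (illustrative shape).
def likeRatePerPost(events: RDD[(Long, Long, Boolean)]): RDD[(Long, Double)] =
  events
    .map { case (_, postId, liked) => (postId, (if (liked) 1L else 0L, 1L)) }
    .reduceByKey { case ((l1, n1), (l2, n2)) => (l1 + l2, n1 + n2) }
    // Laplace smoothing so rare posts are not scored exactly 0.0 or 1.0.
    .mapValues { case (likes, total) => (likes + 1.0) / (total + 2.0) }
```

Counts, one division per key, and a ranking falls out for free.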
Logistic Regression “raw scores” are NOT probabilities; treat them carefully!
- Spark Notebook
The Spark Notebook is good for EDA and as an entry point for calling APIs and presenting results.
Developing in the notebook is not very productive: the more code you write, the harder it becomes to track and refactor previously developed code.
It is better to write the code in IntelliJ and then either package it into a fat JAR and import it from the notebook, or copy and paste it each time into a dedicated notebook cell.
In order to keep normal notebook cells clean, they should not contain more than 4-5 lines of code or any complex logic; ideally they should contain just queries, in the form of functional processing, and entry points into a logic API.
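For instance, a cell might reduce to something like the following (the package and entry-point names are hypothetical; the real ones live in the project's API):

```scala
// All the heavy lifting lives in the packaged API; the cell only wires
// data into an entry point and shows the result.
import wordpressworkshop._ // hypothetical package imported from the fat JAR

val ranked = RankingEvaluation.evaluate(trainingPosts, testPosts)
ranked.take(10).foreach(println)
```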
Plotting in the notebook with the built-in visualization is handy but very rudimentary: it can only visualize 25 points, so we created a Pimp that takes any Array[(Double, Double)] and interpolates its values down to just 25 points.
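The actual enrichment lives in the repository; here is a minimal sketch of how such a pimp-my-library implicit could look (the names and the linear-interpolation scheme are assumptions):

```scala
object PlotPimp {
  // Enrich Array[(Double, Double)] with a down-sampler for the built-in chart.
  implicit class RichSeries(val points: Array[(Double, Double)]) extends AnyVal {
    def interpolateTo(n: Int = 25): Array[(Double, Double)] =
      if (points.length <= n) points
      else {
        val sorted = points.sortBy(_._1) // assumes distinct x values
        val (xMin, xMax) = (sorted.head._1, sorted.last._1)
        (0 until n).map { i =>
          val x = xMin + (xMax - xMin) * i / (n - 1)
          // Locate the surrounding points and interpolate linearly.
          val found = sorted.indexWhere(_._1 >= x)
          val idx = if (found == -1) sorted.length - 1 else math.max(found, 1)
          val ((x0, y0), (x1, y1)) = (sorted(idx - 1), sorted(idx))
          val y = if (x1 == x0) y0 else y0 + (y1 - y0) * (x - x0) / (x1 - x0)
          (x, y)
        }.toArray
      }
  }
}

// Usage in a notebook cell:
// import PlotPimp._
// curve.interpolateTo() // at most 25 uniformly spaced samples to plot
```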
Tip: when you visualize a Scala Map with Double keys in the range 0.0 to 1.0, the take(25) method already returns uniform samples in that range, and since the x-axis is numerical, the built-in visualization will automatically sort it for you.
We should probably have investigated more advanced libraries such as Bokeh or D3, which are already supported in the Notebook.
Check the source code on the GitHub page: https://github.com/gm-spacagna/wordpress-posts-recommender.