WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

At the Advanced Data Analytics team at Barclays we solved a Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is recommending a sequence of WordPress blog posts that users may like, based on their historical likes and blog/post/author characteristics.
Details of the competition are available at https://www.kaggle.com/c/predict-wordpress-likes.

What we want to share is a mix of methodology and tools for:

  • Investigating the data interactively;
  • Writing quality code in a productive environment;
  • Embedding the developed functions into executable entry points;
  • Presenting the results in a clean and visual way;
  • Meeting the required acceptance criteria.

AKA: Delivering a Data Science MVP quickly and in a completely Agile way!

The topics covered in this workshop are:

  • DataFrame/RDD conversions and I/O
  • Exploratory Data Analysis (EDA)
  • Scalable Feature Engineering
  • Modelling (MLlib and ML)
  • End-to-end Evaluation
  • Agile Methodology for Data Science

At the end of the workshop, the lessons learnt are:

  • Spark, DataFrames, RDDs:
    • DataFrame is great for I/O and schema inference from the sources, and works well as long as the schema is flat. Operations start to get more complicated with nested and array fields (see the DataFrame/RDD round-trip sketch after this list).
    • RDD gives you the flexibility of doing your ETL using the richness of the Scala language; on the other hand, you must be careful about optimising your execution plans.
      Functional programming allowed us to express complex logic with simple, clear code free of side effects.
    • Map-side joins with broadcast maps are very efficient, but we need to make sure we reduce the map to the minimum size before broadcasting it, e.g. by filtering out the unmatched keys before the join or by capping the size of each value in case of variable-size structures (e.g. hash maps). A sketch of this pattern follows the list.
  • ML, MLlib
    • ETL and feature engineering are the most time-consuming parts; once you have obtained the data you want in vector format, you can convert back to a DataFrame and use the ML APIs.
    • ML unfortunately does not wrap everything available in MLlib; sometimes you have to convert back to RDD[LabeledPoint] or RDD[(Double, Vector)] in order to use MLlib features (e.g. the evaluation metrics), as shown in the sketch after this list.
    • The ML pipeline API (Transformer, Estimator, Evaluator) seems cool, but for an MVP it is a premature abstraction.
  • Modeling
    • Do not underestimate simple solutions. In the worst case they serve as a baseline for benchmarking.
    • Even though the Logistic Regression was better at classifying true or false, the simple model outperformed it when running the end-to-end ranking evaluation.
    • Focus on solving problems rather than on models or algorithms.
      Many Data Science problems can be solved with counts and divisions, e.g. Naïve Bayes (see the counts-and-divisions sketch after this list).
    • Logistic Regression “raw scores” are NOT probabilities; treat them carefully!
  • Spark Notebook
    • The Spark Notebook is good for EDA and as an entry point for calling APIs and presenting results.
    • Developing in the notebook is not very productive: the more code you write, the harder it becomes to track and refactor previously developed code.
    • It is better to write code in IntelliJ and then either package it into a fat jar and import it from the notebook, or copy and paste it every time into a dedicated notebook cell.
    • In order to keep normal notebook cells clean, they should not contain more than 4-5 lines of code or complex logic; ideally they should just contain queries in the form of functional processing and entry points into a logic API.
  • Visualization
    • Plotting in the notebook with the built-in visualization is handy but very rudimentary: it can only visualize 25 points, so we created a pimp-my-library enrichment that takes any Array[(Double, Double)] and interpolates its values down to 25 points (see the sketch after this list).
    • Tip: when you visualize a Scala Map with Double keys in the range 0.0 to 1.0, the take(25) method already returns uniform samples in that range, and since the x-axis is numerical, the built-in visualization will automatically sort it for you.
    • We should probably have investigated more advanced libraries like Bokeh or D3, which are already supported in the notebook.
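
The sketches below illustrate a few of the lessons above. First, the DataFrame/RDD round trip: a minimal Spark 2.x sketch (the original project ran on an earlier Spark version, where the entry point was SQLContext) that reads JSON with schema inference, drops to an RDD for the logic on nested/array fields, and converts back once the data is flat. The file path, the field names and the PostStats case class are illustrative assumptions, not the competition schema.

    import org.apache.spark.sql.SparkSession

    // flat record we map the nested JSON into (names are illustrative assumptions)
    case class PostStats(blogId: String, nLikes: Long)

    object DataFrameRddRoundTrip {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("io-example").master("local[*]").getOrCreate()
        import spark.implicits._

        // DataFrames shine at I/O and schema inference
        val posts = spark.read.json("data/trainPosts.json") // hypothetical path
        posts.printSchema()

        // drop to RDD when the logic on nested/array fields gets awkward in SQL-style expressions
        val stats = posts.select("blog_id", "likes").rdd.map { row =>
          val likes = if (row.isNullAt(1)) Seq.empty[Any] else row.getSeq[Any](1)
          PostStats(row.getString(0), likes.size.toLong)
        }

        // and back to a DataFrame once the data is flat again
        stats.toDF().show(5)
        spark.stop()
      }
    }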
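
Second, the map-side join with a broadcast map. This is a minimal sketch on made-up data (post/author IDs and a small author-score map): the map is filtered down to the keys that can actually match before being broadcast, and the join itself is just a flatMap with no shuffle.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-join").setMaster("local[*]"))

        // hypothetical data: (postId, authorId) pairs and a small authorId -> score map
        val posts = sc.parallelize(Seq((1L, 10L), (2L, 11L), (3L, 99L)))
        val authorScores = Map(10L -> 0.3, 11L -> 0.7, 12L -> 0.1)

        // reduce the map to the minimum before broadcasting: keep only keys that can match
        val neededAuthors = posts.map(_._2).distinct().collect().toSet
        val smallMap = sc.broadcast(authorScores.filterKeys(neededAuthors).toMap)

        // map-side join: no shuffle, unmatched keys are simply dropped
        val joined = posts.flatMap { case (postId, authorId) =>
          smallMap.value.get(authorId).map(score => (postId, score))
        }

        joined.collect().foreach(println)
        sc.stop()
      }
    }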
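
Third, falling back from ML to MLlib for evaluation: a Spark 2.x sketch that fits a spark.ml LogisticRegression on a DataFrame and then converts the scored DataFrame back to an RDD of (score, label) pairs in order to use MLlib's BinaryClassificationMetrics. The toy training data is made up; in the project the features came from the ETL stage.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.sql.SparkSession

    object MlToMllibMetrics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ml-to-mllib").master("local[*]").getOrCreate()

        // toy labelled vectors (illustrative only)
        val training = spark.createDataFrame(Seq(
          (1.0, Vectors.dense(0.9, 0.1)),
          (0.0, Vectors.dense(0.1, 0.8)),
          (1.0, Vectors.dense(0.8, 0.3)),
          (0.0, Vectors.dense(0.2, 0.9))
        )).toDF("label", "features")

        val model = new LogisticRegression().fit(training)
        val scored = model.transform(training)

        // ML does not expose everything, so go back to RDD[(score, label)] for the MLlib metrics
        val scoreAndLabel = scored.select("probability", "label").rdd.map { row =>
          (row.getAs[Vector](0)(1), row.getDouble(1))
        }

        println(s"Area under ROC = ${new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()}")
        spark.stop()
      }
    }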
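
Fourth, a flavour of "counts and divisions": scoring a (user, blog) pair by the fraction of the blog's posts the user has liked so far. The input shape and the scoring rule are illustrative assumptions, not the model used in the project, but they show how far plain counting can go.

    import org.apache.spark.rdd.RDD

    object CountScorer {
      // score = likes / posts seen, computed per (user, blog) pair
      def likeRate(events: RDD[(String, String, Boolean)]): RDD[((String, String), Double)] =
        events
          .map { case (user, blog, liked) => ((user, blog), (if (liked) 1L else 0L, 1L)) }
          .reduceByKey { case ((l1, n1), (l2, n2)) => (l1 + l2, n1 + n2) }
          .mapValues { case (likes, total) => likes.toDouble / total }
    }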
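
Finally, the plotting "pimp": a minimal sketch of an enrichment that down-samples any Array[(Double, Double)] to the 25 points the built-in chart can render. The method name and the plain grid sub-sampling (rather than true interpolation) are assumptions, not the exact code from the project.

    object PlotOps {
      // enrich Array[(Double, Double)] with a method that keeps at most maxPoints points
      implicit class PlottableSeries(points: Array[(Double, Double)]) {
        def sampledForNotebook(maxPoints: Int = 25): Array[(Double, Double)] = {
          val sorted = points.sortBy(_._1)
          if (sorted.length <= maxPoints) sorted
          else (0 until maxPoints).map { i =>
            // pick points on a uniform grid over the sorted series
            sorted((i.toLong * (sorted.length - 1) / (maxPoints - 1)).toInt)
          }.toArray
        }
      }
    }

    // usage in a notebook cell: import PlotOps._ ; series.sampledForNotebook()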

Check the source code on the GitHub page: https://github.com/gm-spacagna/wordpress-posts-recommender.

About Gianmario

Data Scientist with experience in building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geek community.