Category Archives: Scala

In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio

Abstract: Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called “Data Lake”, where raw data is stored and periodically ingested into a distributed file system such as HDFS. Moreover, there are a … Continue reading

Posted in Agile, Big Data, Machine Learning, Open Source, Scala, Spark

The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour

In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a Kaggle-like competition where each team’s goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework. Continue reading

Posted in Agile, Machine Learning, recommender systems, Scala, Spark

Surfing and Coding in Lanzarote, the Barclays Data Science hackathon

This post has been published on the Cloudera blog and summarises the results and takeaways of a week-long hackathon that took place in Lanzarote in December 2015. The goal was to prototype a recommender system for retail customers of shops in Bristol, UK. The article shows how the stack composed of Scala and Spark was well suited to quickly writing prototype code that runs locally on a single laptop and, at the same time, scales to larger datasets processed on a cluster. Continue reading

Posted in Agile, Machine Learning, Scala, Spark

Robust and declarative machine learning pipelines for predictive buying

A proof of concept of how to use Scala, Spark and the recent library Sparkz to build production-quality machine learning pipelines for predicting buyers of financial products.

The pipelines are implemented through custom declarative APIs that give us greater control, transparency and testability of the whole process.

The example follows the validation and evaluation principles defined in The Data Science Manifesto, available in beta at http://www.datasciencemanifesto.org. Continue reading

Posted in Big Data, Classification, Machine Learning, Scala, Spark

Functional Data Validation using monads and applicative functors

ETL is probably the most time-consuming part of every Data Science project. The quality of the extracted and crunched data is one of the major factors affecting the final results. In fact, real-world data is always messy and inconsistent. Data Validation … Continue reading

Posted in Big Data, Data Munging, Scala, Spark

Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are: data stored in non-scalable infrastructure for analysis and processing, and data governance and security policies. 1. Data often resides in a central data warehouse and RDBMS of which many legacy applications … Continue reading

Posted in Agile, Big Data, Scala, Spark

6 points to compare Python and Scala for Data Science using Apache Spark

Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working with large datasets. It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach. I personally believe that Spark will inevitably become the de facto Big Data framework for Machine Learning and Data Science.

Despite the differing opinions about Spark, let’s assume that a data science team wants to start adopting it as its main technology. The choice of programming language is often a dilemma. Shall we build our models in Python or in Scala? Shall we run the exploratory analysis using the IPython notebook or iScala? Continue reading

Posted in Agile, Machine Learning, Python, Scala, Spark | 13 Comments

WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

At the Advanced Data Analytics team at Barclays we solved a Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is recommending a sequence of WordPress blog posts that users may like, based on their historical likes and on blog, post and author characteristics. Continue reading

Posted in Agile, Classification, Machine Learning, Scala, Spark

What is Spark? Six reasons why CIOs should find out (and one why they shouldn’t) – 02 Nov 2015 – Computing Analysis

via What is Spark? Six reasons why CIOs should find out (and one why they shouldn’t) – 02 Nov 2015 – Computing Analysis.

Posted in Big Data, Scala