Tag Archives: agile

In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio

Abstract: Legacy enterprise architectures still rely on relational data warehouses and require moving data to, and syncing it with, the so-called “Data Lake”, where raw data is stored and periodically ingested into a distributed file system such as HDFS. Moreover, there are a …

Posted in Agile, Big Data, Machine Learning, Open Source, Scala, Spark

Lessons learnt from building data-driven production systems at Barclays

Over the last few years at Barclays we have tried and learnt many things that made the Advanced Analytics team very successful inside a large organization, where being a productive data scientist is a tough challenge.

Our team works on a mix of descriptive, predictive and prescriptive projects that make use of machine learning and big data technologies, mainly on top of Apache Spark. Even though we deliver per-request insights coming from manual analysis, we primarily build automated and scalable systems that are used periodically, either internally for better decision-making or customer-facing in the form of analytics services (e.g. via the web portal).
In this post series I want to share some of the best practices, tools, methodologies and workflows that we experimented with, and the lessons learnt from them.

Posted in Agile, Big Data, Machine Learning | 4 Comments

The ScrumBan Jira board

Let’s start with one of the core tools of the agile workflow. We use a Jira board for tracking and organizing all of our projects. We developed a custom board that uses the sprint concept of Scrum, but in a more flexible way, as in Kanban.

Posted in Agile, Big Data | 4 Comments

Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are: data stored in non-scalable infrastructure for analysis and processing; data governance and security policies. 1. Data often resides in a central data warehouse and RDBMS of which many legacy applications …

Posted in Agile, Big Data, Scala, Spark

WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

In the Advanced Data Analytics team at Barclays we solved the Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is recommending a sequence of WordPress blog posts that users may like, based on their historical likes and on blog/post/author characteristics.

Posted in Agile, Classification, Machine Learning, Scala, Spark

The complete 18 steps to start a new Agile Data Science project

At the end of Iteration 0 you have a very solid starting point for your project, and you can now follow the typical Agile development cycle, whether you prefer Scrum, Kanban, a mix of them or your own custom methodology.

Posted in Agile | 5 Comments

Agile Data Science Iteration 0: The Hypothesis-Driven Analysis

Agile Data Science, even though it is production- and engineering-oriented, is not just software engineering. Agile Data Science is science, and thus it must comply with the scientific method.

Posted in Agile | 6 Comments

Agile Data Science Iteration 0: The Evaluation Strategy

In this series of posts I would like to share some best practices and procedures for starting a novel Data Science project from scratch in a safe and efficient way. The preparation steps consist of a list of tasks to perform, ideally, before writing the first line of code for your model.

Posted in Agile | 6 Comments