Lessons learnt from building data-driven production systems at Barclays

bart-models.gif

In the last years at Barclays we learnt and tried a lot of stuff that made the Advanced Analytics team very successful inside a large organization where, as such, being a productive data scientist is a tough challenge.

The data science team works on a mix of descriptive, predictive and prescriptive projects that make use of machine learning and big data technologies, mainly on top of Apache Spark. Even though we deliver per-request insights coming from manual analysis, we primarily build automated and scalable systems to be periodically used either internally for a better decision-making or customer-facing in the form of analytics services (e.g. via the web portal).
In this post series I want to share some of the best practices, tools, methodologies and workflows that we experimented and the lessons learnt from them. I will skip a few aspects of machine learning systems, since that I found those to be already well covered in other talks and articles, you can find the reference links at the end of this post.
Moreover not all of the data-driven projects require a machine learning component, at least not at every stage. I would like to quote Peter Norvig from a recent article published at KDnuggets:

“Machine Learning development is like the raisins in a raisin bread: 1. You need the bread first 2. It’s just a few tiny raisins but without it you would just have plain bread.”

Please keep in mind that each scenario is different thus there are not strict rules to advocate. Every data science team should come out with the workflow and stack that best suits their needs. Besides, they should be able to quickly adapt to the business and technical changes of their organization.

To conclude, I summarised the main take home knowledge of my experience in Barclays so far. I hope it will serve as an useful guideline or inspiration source for all of those data science teams focusing on building production systems. Many of those best practices still apply to research-oriented teams that focus more on the prototyping of solutions. Our team is a mix of engineering and modelling background, thus defining a little bit of structure and common workflows helped us being collaborative and productive.

The goal was not advocating a single methodology but showing possible other approaches that could fit well within your organization. We expect those practices to conflict amongst different teams. For example in the Xavier’s articles (see links below), he suggests to do all of the experiments using the notebook and use the same tools in production while in our experience we found this to be chaotic and non scalable for our use cases. There is no God law, try different approaches and stick with the most successful ones for your use cases.

***

A related blog post of “How to do Data Science that is both Exploratory and Production Quality” can be found here: https://www.linkedin.com/pulse/how-do-data-science-both-exploratory-production-quality-harry-powell.

Similar articles:

Seven Steps to Success Machine Learning in Practice https://daoudclarke.github.io/guide.pdf

http://technocalifornia.blogspot.co.uk/2014/12/ten-lessons-learned-from-building-real.html

And more recent additional 10 lessons:

https://medium.com/@xamat/10-more-lessons-learned-from-building-real-life-ml-systems-part-i-b309cafc7b5e#.58g9wrnt4

 

Advertisements

About Gianmario

Data Scientist with experience on building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geeks community.
This entry was posted in Agile, Big Data, Machine Learning and tagged , , , , . Bookmark the permalink.

4 Responses to Lessons learnt from building data-driven production systems at Barclays

  1. Pingback: Coding practices for data products development | Vademecum of Practical Data Science

  2. Pingback: The ScrumBan Jira board | Vademecum of Practical Data Science

  3. Pingback: The balance of exploratory analysis and development | Vademecum of Practical Data Science

  4. Pingback: Thoughts about data operations | Vademecum of Practical Data Science

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s