Deep Time-to-Failure: Predictive maintenance using RNNs and Weibull distributions

I have published on GitHub a tutorial on how to implement a predictive maintenance algorithm using survival analysis theory and gated Recurrent Neural Networks in Keras.

The tutorial is divided into:

  1. Fitting survival distributions and regression survival models using lifelines (a minimal sketch follows this list).
  2. Predicting the distribution of the future time-to-failure using raw time series of covariates as input to a Recurrent Neural Network in Keras.
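
To give a flavour of the first part, here is a minimal sketch (not the tutorial's exact code) of fitting a Weibull survival distribution and a Cox regression survival model with lifelines; the file and column names are hypothetical:

```python
import pandas as pd
from lifelines import WeibullFitter, CoxPHFitter

# Hypothetical dataset: one row per engine, with the observed duration,
# a failure flag (1 = failed, 0 = censored) and a couple of covariates.
df = pd.read_csv("engines.csv")  # columns: duration, failed, temp, pressure

# Fit a Weibull survival distribution to the durations.
wf = WeibullFitter()
wf.fit(df["duration"], event_observed=df["failed"])
print(wf.lambda_, wf.rho_)        # fitted scale and shape parameters
wf.survival_function_.plot()      # estimated survival curve

# Fit a regression survival model (Cox proportional hazards) on the covariates.
cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="failed")
cph.print_summary()               # effect of each covariate on the hazard
```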

The second part is an extension of the wtte-rnn framework developed by @ragulpr. The original work focused on time-to-event models for churn prediction, while here the focus is on the time-to-failure variant.

In a time-to-failure model each sequence always ends with the failure event, while in a time-to-event model each sequence can contain multiple target events and the goal is to estimate when the next event will happen. This small simplification allows us to train an RNN on sequences of arbitrary length to predict a single, fixed event in time.
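
As a rough, hedged illustration of the second part (this is not the tutorial's or wtte-rnn's exact code; the layer sizes, input shape and numerical details are assumptions), a GRU network can output the two Weibull parameters and be trained with a censored Weibull negative log-likelihood:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def weibull_nll(y_true, y_pred):
    """Negative Weibull log-likelihood with right censoring, up to a constant.
    y_true[:, 0] = observed time-to-failure, y_true[:, 1] = 1 if the failure
    was observed, 0 if the sequence was censored."""
    t, observed = y_true[:, 0], y_true[:, 1]
    alpha, beta = y_pred[:, 0], y_pred[:, 1]       # Weibull scale and shape
    cum_hazard = tf.pow((t + 1e-9) / alpha, beta)  # (t / alpha) ** beta
    # Log hazard rate, dropping a term that does not depend on the parameters.
    log_hazard = tf.math.log(beta) + beta * tf.math.log((t + 1e-9) / alpha)
    return -tf.reduce_mean(observed * log_hazard - cum_hazard)

def weibull_params(x):
    """Constrain the raw network outputs to valid (positive) Weibull parameters."""
    alpha = tf.exp(x[:, 0:1])
    beta = tf.nn.softplus(x[:, 1:2]) + 1e-9
    return tf.concat([alpha, beta], axis=-1)

# Hypothetical input: sequences of arbitrary length with 5 sensor covariates.
model = keras.Sequential([
    layers.GRU(32, input_shape=(None, 5)),
    layers.Dense(2),
    layers.Lambda(weibull_params),
])
model.compile(optimizer="adam", loss=weibull_nll)
```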

The tutorial is also a re-adaptation of the work done by @daynebatten on predicting the run-to-failure time of jet engines.

The approach can be used to predict component failures in many other application domains or, more generally, the time to any event that marks the end of the sequence of observations, that is, any model predicting a single target event in time.

You can find the rest of the tutorial at https://github.com/gm-spacagna/deep-ttf/.


Anomaly Detection using Deep Auto-Encoders

One of the determinants of a good anomaly detector is finding smart data representations that can easily reveal deviations from normal behaviour. Traditional supervised approaches would require a strong assumption about what is normal and what is not, plus a non-negligible effort in labeling the training dataset. Deep auto-encoders work very well in learning high-level abstractions and non-linear relationships of the data without requiring data labels. In this talk we will review a few popular techniques used in shallow machine learning and propose two semi-supervised approaches for novelty detection: one based on reconstruction error and another based on lower-dimensional feature compression.
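
As a hedged illustration of the reconstruction-error approach (the architecture, feature count and thresholding rule below are assumptions, not the talk's exact code):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 30  # hypothetical number of input features

# Symmetric deep auto-encoder: the 4-unit bottleneck is the compressed representation.
autoencoder = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(n_features,)),
    layers.Dense(4, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Semi-supervised: train only on data assumed to be normal (no labels needed).
x_normal = np.random.rand(1000, n_features)  # placeholder for real data
autoencoder.fit(x_normal, x_normal, epochs=20, batch_size=64, verbose=0)

# Calibrate a threshold on the reconstruction error of the normal data.
normal_errors = np.mean((x_normal - autoencoder.predict(x_normal)) ** 2, axis=1)
threshold = np.quantile(normal_errors, 0.95)  # tolerate ~5% false alarms

# Score new observations: high reconstruction error suggests an anomaly.
x_new = np.random.rand(10, n_features)
new_errors = np.mean((x_new - autoencoder.predict(x_new)) ** 2, axis=1)
anomalies = new_errors > threshold
```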


Demystifying Data Science in the industry

On June 7th I gave a quick introductory talk at AssoLombarda in Milan about the role of the Data Scientist in the 4th industrial revolution.

My presentation is an introduction to what Data Science in the industry is and what it is not.

If you would like to know more about the Data Science Milan community, visit www.datasciencemilan.org.


In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio

Abstract:

Legacy enterprise architectures still rely on relational data warehouses and require moving data to, and keeping it in sync with, the so-called “Data Lake”, where raw data is stored and periodically ingested into a distributed file system such as HDFS.

Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, for example for regulatory reasons or to reduce latency; in those cases Alluxio (previously known as Tachyon) can make the data available in-memory and shared among multiple applications.

We propose an Agile workflow combining Spark, Scala, DataFrames (and the recent Dataset API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from the source and develop high-quality machine learning pipelines that can then be deployed straight into production.
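
The talk itself used the Scala API; purely as an illustration of the idea, a minimal PySpark sketch of the loading-and-caching step might look like this (connection details, paths and the Alluxio endpoint are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logical-dwh").getOrCreate()

# Load a table from the relational source over JDBC (hypothetical connection).
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dwh-host:1521/DWH")
    .option("dbtable", "SCHEMA.TRANSACTIONS")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Alluxio exposes an HDFS-compatible filesystem, so a plain path scheme works;
# later iterations read the in-memory copy instead of hitting the RDBMS again.
transactions.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/dwh/transactions")
cached = spark.read.parquet("alluxio://alluxio-master:19998/dwh/transactions")
```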

In this talk we will:

* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet

* Explain the iterative exploratory process and advantages of adopting functional programming

* Critically analyse the issues faced with the existing methodology

* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds

* Discuss some future improvements to the overall architecture

Original meetup event: http://www.meetup.com/Alluxio/events/233453125/


The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour

From Data Science Milan meetup event:

In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a sort of Kaggle-like competition, where each team’s goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:

• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.

• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.

• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.

• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.

• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques); a minimal matrix-factorisation sketch follows the list.

• How Scala (and functional programming) helped our cause.
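
The hackathon code was written in Scala; as a purely illustrative sketch of one of the model families listed above, here is a minimal matrix-factorisation recommender using Spark MLlib's ALS in PySpark (the schema and parameters are hypothetical, not the actual hackathon implementation):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("retail-recommender").getOrCreate()

# Hypothetical schema: one row per (customer, merchant) with a visit count.
visits = spark.createDataFrame(
    [(1, 10, 3.0), (1, 11, 1.0), (2, 10, 5.0), (2, 12, 2.0)],
    ["customerId", "merchantId", "visits"],
)

als = ALS(
    userCol="customerId",
    itemCol="merchantId",
    ratingCol="visits",
    implicitPrefs=True,        # visit counts are implicit feedback, not ratings
    rank=10,
    coldStartStrategy="drop",
)
model = als.fit(visits)
top5 = model.recommendForAllUsers(5)  # five merchant recommendations per customer
```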


Surfing and Coding in Lanzarote, the Barclays Data Science hackathon

This post has been published on the Cloudera blog and summarises the results and takeaways of a week-long hackathon held in Lanzarote in December 2015. The goal was to prototype a recommender system for retail customers of shops in Bristol, UK. The article shows how the stack composed of Scala and Spark was great for quickly writing prototype code that runs locally on a single laptop, while at the same time scaling to larger datasets processed on the cluster.


Please continue reading at http://blog.cloudera.com/blog/2016/05/the-barclays-data-science-hackathon-using-apache-spark-and-scala-for-rapid-prototyping/.


Robust and declarative machine learning pipelines for predictive buying

A proof of concept of how to use Scala, Spark and the recent Sparkz library to build production-quality machine learning pipelines for predicting buyers of financial products.

The pipelines are implemented through custom declarative APIs that give us greater control, transparency and testability over the whole process.
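
Sparkz's own API is not reproduced here; the sketch below is only an illustrative Python toy of the declarative idea, a pipeline expressed as a list of named, individually testable stages (the stage names and columns are hypothetical, and a pandas DataFrame stands in for the real data type):

```python
from typing import Callable, List, Tuple
import pandas as pd

Stage = Tuple[str, Callable[[pd.DataFrame], pd.DataFrame]]

def pipeline(stages: List[Stage]) -> Callable[[pd.DataFrame], pd.DataFrame]:
    """Compose the stages in order; explicit names make each step easy to
    log, test in isolation and reason about."""
    def run(df: pd.DataFrame) -> pd.DataFrame:
        for name, stage in stages:
            df = stage(df)
        return df
    return run

# Hypothetical stages for a predictive-buying score (columns: balance, income).
predict_buyers = pipeline([
    ("drop incomplete rows", lambda df: df.dropna()),
    ("derive features", lambda df: df.assign(balance_ratio=df.balance / df.income)),
    ("score", lambda df: df.assign(score=df.balance_ratio.rank(pct=True))),
])
```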

The example followed the validation and evaluation principles defined in the Data Science Manifesto, available in beta at http://www.datasciencemanifesto.org.


Lessons learnt from building data-driven production systems at Barclays


Over the last few years at Barclays we tried and learnt a lot of things that made the Advanced Analytics team very successful inside a large organization, where being a productive data scientist is a tough challenge.

The data science team works on a mix of descriptive, predictive and prescriptive projects that make use of machine learning and big data technologies, mainly on top of Apache Spark. Even though we deliver on-request insights coming from manual analysis, we primarily build automated and scalable systems that are used periodically, either internally for better decision-making or customer-facing in the form of analytics services (e.g. via the web portal).
In this post series I want to share some of the best practices, tools, methodologies and workflows that we experimented with, and the lessons learnt from them. I will skip a few aspects of machine learning systems that I found to be already well covered in other talks and articles; you can find the reference links at the end of this post.
Moreover, not all data-driven projects require a machine learning component, at least not at every stage. I would like to quote Peter Norvig from a recent article published on KDnuggets:

“Machine Learning development is like the raisins in a raisin bread: 1. You need the bread first 2. It’s just a few tiny raisins but without it you would just have plain bread.”

Please keep in mind that each scenario is different, thus there are no strict rules to advocate. Every data science team should come up with the workflow and stack that best suit their needs, and should be able to quickly adapt to the business and technical changes of their organization.

To conclude, I have summarised the main take-home lessons of my experience at Barclays so far. I hope it will serve as a useful guideline or source of inspiration for data science teams focused on building production systems. Many of these best practices still apply to research-oriented teams that focus more on prototyping solutions. Our team has a mix of engineering and modelling backgrounds, thus defining a little bit of structure and common workflows helped us be collaborative and productive.

The goal was not to advocate a single methodology but to show other possible approaches that could fit well within your organization. We expect some of these practices to conflict across different teams. For example, in Xavier’s articles (see links below) he suggests doing all of the experiments using notebooks and using the same tools in production, while in our experience we found this to be chaotic and non-scalable for our use cases. There is no universal law: try different approaches and stick with the most successful ones for your use cases.

***

A related blog post, “How to do Data Science that is both Exploratory and Production Quality”, can be found here: https://www.linkedin.com/pulse/how-do-data-science-both-exploratory-production-quality-harry-powell.

Similar articles:

Seven Steps to Success: Machine Learning in Practice, https://daoudclarke.github.io/guide.pdf

http://technocalifornia.blogspot.co.uk/2014/12/ten-lessons-learned-from-building-real.html

And 10 more recent lessons:

https://medium.com/@xamat/10-more-lessons-learned-from-building-real-life-ml-systems-part-i-b309cafc7b5e#.58g9wrnt4

 


Thoughts about data operations

This is part 4 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Infrastructure

Get the IT team very close to the data scientists. Ideally, one member of the team should be a DevOps engineer, or the newer title, a DataOps. Read this article from InfoWorld: DevOps can take data science to the next level.
Finding the right balance between IT workarounds and clean solutions is difficult, especially when it involves long, tedious processes. It is good practice to “sign” contracts with the IT team stating what you are about to deliver and what requirements you need in order to do so.
As general advice, you want to operate in a familiar environment where all of the tools you like and proper cluster resources are available. Unfortunately, data is always fragmented across multiple systems. Try to get the data periodically ingested into your Data Lake (typically a Hadoop cluster). When this is not possible, make sure you have the permissions to sqoop it yourself. Data virtualization technologies also come in particularly handy for creating views of a dataset inside your Big Data environment.
Don’t implement solutions that are tied to the underlying infrastructure. The Spark DataFrame API, for example, does an excellent job of abstracting away the I/O operations. See this blog post on how to logically map tables from a relational database into a Spark cluster: https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon.
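
As a minimal PySpark sketch of that point (table names, paths and connection details are hypothetical), the pipeline logic is written once against DataFrames and the physical source can be swapped without touching it:

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def monthly_spend(transactions: DataFrame) -> DataFrame:
    """Pipeline logic, independent of where the data physically lives."""
    return (transactions
            .groupBy("customerId", F.month("date").alias("month"))
            .agg(F.sum("amount").alias("spend")))

# Interchangeable sources (hypothetical paths and connection details).
from_lake = spark.read.parquet("hdfs:///datalake/transactions")
from_dwh = spark.read.jdbc("jdbc:postgresql://dwh/warehouse", "transactions",
                           properties={"user": "reader", "password": "secret"})

report = monthly_spend(from_lake)  # the same call works on either source
```
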
Granting admin rights on the dev cluster will massively boost productivity and will let the team master their Unix skills. Trust and transparency are essential. Security should be enforced during the interview process, by hiring competent and smart people, and through personnel training, instead of killing productivity with nonsensical restrictions.

Release process

A data product, at the end of the day, is software that takes data as input and produces data as output containing insights, consumable either via a visual dashboard or by integrating them into an existing IT system.

A few options we recommend for releasing are:

  • Continuous delivery. Ideally one pull request per project per day.
  • Continuous integration. Ideally, a Jenkins box that runs your tests and automated scripts every single time a new ticket is merged into develop, and especially every single time a new release is made in master. If the box can access a data cluster, it can even run the end-to-end evaluation and store the results for you.
  • Every end of sprint should be matched with a new release consisting of:
    • merging the develop branch back into master (either manually or through an automated script such as the gitflow command line);
    • publishing your package containing source code and scripts to a common repository like Nexus;
    • reporting the latest results in Confluence (see the documentation section);
    • releasing all of the merged tickets from Jira so that they don’t show up on the board but are still accessible for reference;
    • demoing inside the team and/or to your stakeholders if the changes are relevant;
    • celebrating in a pub.

It would make sense to plan the release for the afternoon of the last day of the sprint (typically Friday), but sometimes it can be advantageous to release on Thursday so that you have Friday for hot-fixes if something goes wrong.

Deployment


It is very hard to give guidelines here, since each project has its own deployment process that depends on many factors, such as the business context and the practical issues associated with it.

If your application is deployed end-to-end by external teams whose workflow and data sources you don’t control, you will find it extremely helpful to have some data sanity checks performed at every single run. Those checks make sure that the people running your application don’t accidentally feed in data that does not conform to the schema and/or model assumptions. Throwing an exception with some context information is fundamental to making your system production-ready.
A typical example is validating the values of categorical fields. We packed into our jars the reference files containing all of the possible values and their descriptions. If the specified dataset contains values that don’t find any match, the data sanity check will throw an exception.
This handling of incorrect data may also be done during the ETL process, and it is generally not needed if the training is done by the data science team itself; in that case the deployment only concerns the trained model.
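
A simplified sketch of such a check (in Python for brevity; the original checks shipped as reference files inside the Scala jars, and the column name and values below are hypothetical):

```python
import pandas as pd

VALID_CHANNELS = {"ONLINE", "BRANCH", "TELEPHONE"}  # hypothetical reference values

def check_categorical(df: pd.DataFrame, column: str, valid_values: set) -> None:
    """Fail fast, with context, when a categorical column contains unknown codes."""
    unknown = set(df[column].dropna().unique()) - valid_values
    if unknown:
        raise ValueError(
            f"Data sanity check failed on column '{column}': "
            f"unexpected values {sorted(unknown)}; expected one of {sorted(valid_values)}"
        )

# check_categorical(input_df, "channel", VALID_CHANNELS)  # run at every ingestion
```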

Deployment is the stage with the highest number of blockers and technical issues. However, the final measure of success is only determined upon deployment in production, thus deployment issues should be top priority.


The balance of exploratory analysis and development

This is part 3 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Exploratory Data Analysis / Research


Exploratory analysis should precede and follow any task, from modelling, design and development to benchmarking. The major problem is: how do you share, track and monitor your findings? How do you make your analysis repeatable and open to scrutiny from the outside? This is still an open problem.

Notebooks tend to be the best tools for the job, but be careful. EDA is an open research/investigation task, thus you need a criterion for deciding when to stop. My suggestion is to avoid scope-free analysis and always accompany EDA with a well-defined goal/task. In a small and clearly defined task you know what you want to achieve; you just don’t know how to get there and what obstacles you may find.

A proposed sequential but iterative workflow is:

  1. The planned story defines the high-level goal you are working towards.
  2. Then start your EDA and as soon as you find something interesting you can stop investigating and define a development sub-ticket.
  3. You now start developing the minimum amount of code that implements the requirements defined during the analysis step.
    Those requirements should not change after their first definition; you want to complete the task and then refactor it later in another sub-ticket or in a different iteration.
  4. Before sending it for review and/or resolving the sub-ticket, you should perform another EDA step to verify that the newly created branch meets the intended requirements. You are not solving the greater problem; you only care about the just-defined sub-problem. It is very dangerous to mix development and analysis at the same time, since you may end up in an infinite loop where you keep changing your requirements as you analyse and never get to an end.
  5. After completion of the subtask, you can switch back to the main workflow thread.

My suggestion is to time-box any open-ended task: say you are going to spend no more than X hours/days on this research, and by then you will come up with some development requirements or an insights report that moves the project towards the final story goal. Remember, you will have to solve the story by the end of the sprint. Scope the problems small enough to reduce the risk of not meeting expectations.

Get to an end-to-end solution as quickly as possible and postpone any complications, ideas or new features to the next iterations. EDA/research is generally a good place for filling your backlog for future scoping.

I leave it as an open question what to do with those notebooks after the investigation is completed. They are a bit tricky to maintain: when you produce a change to the codebase or a new dataset comes in, the notebooks become obsolete, and we don’t want to refactor them every time to make sure they still work. I personally see notebooks more as one-off analyses that are archived after being used.

I tend to translate all of my findings and assumptions in the form of project requirements so that they don’t get lost. In my opinion only the automated tasks should be maintained over time. Results from manual tasks that cannot be automated should be documented, stamped and archived in the wiki.

Evaluation

Unit tests make sure that the code does what it is meant to do, but that does not imply it is solving the right problem in an acceptable way. The evaluation strategy should typically reflect the real business scenario in which the model will be used. The choice of performance metrics must have a meaningful explanation within the business context, and metrics should be easy to interpret for your stakeholders, who generally are not data scientists and only speak the company’s business language.

A good tip is to create a Kaggle-like framework (a minimal sketch follows the list) that:

  • defines the APIs reflecting your custom data types;
  • uses some abstract interface representing the particular implementation (which could be split into multiple components, e.g. transformer, trainer, model);
  • knows how to robustly validate the given implementation (e.g. cross-fold validation, domain-specific splitting that avoids data leakage, a mix of timestamp and customerId partitioning, and so on);
  • produces one or a pool of interpretable performance metrics, such as mean average precision @ N, uplift, spam rate, loss rate or retention rate; avoid abstract concepts like area under the curve or F-score.
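
A minimal sketch of what such a framework could look like (in Python for brevity; the interface, metric names and helper below are illustrative assumptions, not the actual framework, which was built in Scala):

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, List, Set

class Recommender(ABC):
    """Common interface every competing implementation must provide."""
    @abstractmethod
    def fit(self, transactions: List[dict]) -> None: ...
    @abstractmethod
    def recommend(self, customer_id: str, n: int) -> List[str]: ...

def purchases_by_customer(transactions: List[dict]) -> Dict[str, Set[str]]:
    grouped = defaultdict(set)
    for t in transactions:
        grouped[t["customerId"]].add(t["merchantId"])
    return dict(grouped)

def evaluate(model: Recommender, train: List[dict], test: List[dict],
             n: int = 10) -> Dict[str, float]:
    """The framework owns the split and the metrics, so results are comparable
    across teams; the split itself should avoid leakage (e.g. by time and customer)."""
    model.fit(train)
    test_purchases = purchases_by_customer(test)
    hits = sum(len(set(model.recommend(c, n)) & bought)
               for c, bought in test_purchases.items())
    total_bought = sum(len(b) for b in test_purchases.values())
    return {
        f"precision_at_{n}": hits / (n * max(len(test_purchases), 1)),
        f"recall_at_{n}": hits / max(total_bought, 1),
    }
```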

Soon you will find a blog post from our team about an offsite in Lanzarote where, following this Kaggle-like structure, we prototyped 6 different models for a recommender system in less than a week.

When building the evaluation framework, a few questions you want to ask are:

  • What does a positive/negative sample represent in this business scenario?
  • Is recall important? Why do you care about accuracy?
  • What actions can be taken upon prediction?
  • In which form can the model be used? How can the insights be presented/visualized? Can it be integrated into an existing IT system?
  • What are the capabilities/practical issues of following the decisions suggested by the model?
  • What is the uplift of the data-driven solution compared to the traditional business as usual performance?
  • How can you test the trained model in the live environment (is A/B testing possible, or would a bad outcome cause too much damage)?
  • Does the effectiveness of your solution depend only on your model or also on other parties? (e.g. predicting which customers to contact for marketing purposes also relies on the conversion rate of the marketing team)
  • How can you feed the results back to update the model? At what rate? Is the model easy to update, or must it be re-trained on every batch of newly collected data? Can you re-train it within the update interval?
  • Will the triggered actions influence the upcoming data (e.g. a recommender system can change the distribution of the future population)? Are there any amplification effects (if you recommend the most popular items, those will become even more popular, and so on)?

My experience suggests that the more time you spend implementing a robust and exhaustive evaluation framework, the easier and more reliable maintaining and improving the system will be later on. Time spent here is a good investment and requires a lot of thinking from all three data science aspects: business, statistics and engineering.

Demo

It is good practice to demo advances and new results to the team and/or stakeholders at the end of the sprint. Feeling the continuous pace of delivery and improvement is an excellent psychological element and increases trust and confidence.

Moreover, it is the place where scrutiny comes in and you can have your methodology and interpretations challenged. Any deliverable or document presented during the demo should be stored in the wiki with a date associated with it.
