This is part 3 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.
Exploratory Data Analysis / Research
Exploratory analysis should precede and follow any task, from modelling, design and development to benchmarking. A major problem is: how do you share, track and monitor your findings? How do you make your analysis repeatable and open to outside scrutiny? This is still an open problem.
Notebooks tend to be the best tools for the job, but be careful. EDA is an open research/investigation task, so you need criteria for deciding when to stop. My suggestion is to avoid scope-free analysis and always pair EDA with a well-defined goal/task. In a small and clearly defined task you know what you want to achieve; you just don’t know how to get there and what obstacles you may find.
A proposed sequential but iterative workflow is:
- The planned story defines the high-level goal you are working towards.
- Then start your EDA and as soon as you find something interesting you can stop investigating and define a development sub-ticket.
- You now start developing the minimum amount of code that implements the specified requirements defined during the analysis step.
Those requirements should not change after their first definition; you want to complete that work first and refactor it later, in another sub-ticket or a different iteration.
- Before sending it for review and/or resolving the sub-ticket, you should perform another EDA step to verify that the newly created branch meets the intended requirements. You are not solving the greater problem; you only care about the just-defined sub-problem. It is very dangerous to mix development and analysis at the same time, since you may end up in an infinite loop where you keep changing your requirements as you analyse and never reach an end.
- After completion of the subtask, you can switch back to the main workflow thread.
A suggestion is to time-box any open-ended task: you are going to spend no more than X hours/days on this research, and before then you will come out with development requirements or an insights report that moves the project towards the final story goal. Remember that you will have to solve the story by the end of the sprint. Scope the problems small enough to reduce the risk of not meeting expectations.
Get to an end-to-end solution as quickly as possible and postpone any complications, ideas or new features to the next iterations. EDA/Research is generally a good place for filling your backlog for future scoping.
I leave it as an open question what to do with those notebooks after the investigation is completed. They are a bit tricky to maintain: when you change the codebase or a new dataset comes in, the notebooks become obsolete, and we don’t want to refactor them every time just to make sure they still work. I personally see notebooks more as one-off analyses that are archived after use.
I tend to translate all of my findings and assumptions into project requirements so that they don’t get lost. In my opinion only the automated tasks should be maintained over time. Results from manual tasks that cannot be automated should be documented, stamped and archived in the wiki.
Evaluation
Unit tests make sure that the code does what it is meant to do, but that does not imply it solves the right problem in an acceptable way. The evaluation strategy typically reflects the real business scenario in which the model will be used. The choice of performance metrics must have a meaningful explanation within the business context. Metrics should be easy to interpret for your stakeholders, who generally are not data scientists and only speak the company’s business language.
A good tip is to create a Kaggle-like framework that:
- defines the APIs reflecting your custom data types
- uses an abstract interface representing the particular implementation (possibly split into multiple components, e.g. transformer, trainer, model)
- knows how to robustly validate a given implementation (e.g. cross-fold validation, domain-specific splitting that avoids data leakage, a mix of timestamp and customerId partitioning…)
- produces one or a pool of interpretable performance metrics, such as mean average precision @ N, uplift, spam rate, loss rate or retention rate. Avoid abstract concepts like area under the curve or F-score.
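In Python, the abstract interfaces above could be sketched roughly as follows. All names here (Transformer, Trainer, Model, Evaluator) are illustrative assumptions rather than the actual framework; the point is that every competing implementation plugs into the same validation loop:

```python
from abc import ABC, abstractmethod
from typing import Sequence

# Hypothetical sketch of a Kaggle-like framework; names are illustrative.

class Transformer(ABC):
    """Turns raw records into model-ready features."""
    @abstractmethod
    def transform(self, records: Sequence[dict]) -> Sequence[dict]: ...

class Model(ABC):
    """A trained artifact that scores feature rows."""
    @abstractmethod
    def predict(self, features: Sequence[dict]) -> Sequence[float]: ...

class Trainer(ABC):
    """Knows how to fit a Model from features and labels."""
    @abstractmethod
    def train(self, features: Sequence[dict],
              labels: Sequence[float]) -> Model: ...

class Evaluator:
    """Validates any (Transformer, Trainer) pair the same way,
    so competing implementations are directly comparable."""
    def __init__(self, splitter, metric):
        self.splitter = splitter  # yields (train_idx, test_idx) folds
        self.metric = metric      # maps (labels, scores) -> float

    def evaluate(self, transformer: Transformer, trainer: Trainer,
                 records: Sequence[dict], labels: Sequence[float]) -> list:
        features = transformer.transform(records)
        fold_scores = []
        for train_idx, test_idx in self.splitter(records):
            model = trainer.train([features[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
            preds = model.predict([features[i] for i in test_idx])
            fold_scores.append(
                self.metric([labels[i] for i in test_idx], preds))
        return fold_scores
```

The splitter and metric are injected, so swapping a random split for a leakage-free customer-level split, or accuracy for MAP@N, does not touch any model code.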
Soon you will find a blog post from our team about an offsite in Lanzarote where, following this Kaggle-like structure, we prototyped 6 different models for a recommender system in less than a week.
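To make the leakage point and the metric choice concrete, here is a minimal sketch of customer-level splitting and mean average precision @ N, in plain Python with no library dependencies. The row layout and the "customerId" field name are assumptions for illustration:

```python
import random

def split_by_customer(rows, test_fraction=0.3, seed=42):
    """Assign whole customers to train or test, so no customer's
    history leaks across the split."""
    customers = sorted({r["customerId"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(customers)
    cut = int(len(customers) * test_fraction)
    test_ids = set(customers[:cut])
    train = [r for r in rows if r["customerId"] not in test_ids]
    test = [r for r in rows if r["customerId"] in test_ids]
    return train, test

def average_precision_at_n(recommended, relevant, n):
    """AP@N for one customer: rewards relevant items ranked early."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), n) if relevant else 0.0

def mean_average_precision_at_n(recs_by_customer, relevant_by_customer, n):
    """MAP@N: readable as 'how well do we rank, per customer,
    the items that customer actually wanted'."""
    aps = [average_precision_at_n(recs_by_customer[c],
                                  relevant_by_customer.get(c, set()), n)
           for c in recs_by_customer]
    return sum(aps) / len(aps) if aps else 0.0
```

Unlike area under the curve, MAP@N can be explained to a stakeholder in one sentence: out of the top N items we show each customer, how early do the ones they actually wanted appear?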
When building the evaluation framework, a few questions you want to ask are:
- What does a positive/negative sample represent in this business scenario?
- Is recall important? Why do you care about accuracy?
- What actions can be taken upon prediction?
- In which form can the model be used? How can the insights be presented/visualized? Can it be integrated into an existing IT system?
- What are the capabilities/practical issues of following the decisions suggested by the model?
- What is the uplift of the data-driven solution compared to the traditional business-as-usual performance?
- How can you test the trained model in the live environment (is A/B testing possible, or would the bad scenario cause too much damage)?
- Does the effectiveness of your solution depend only on your model, or also on other parties? (e.g. predicting which customers to contact for marketing purposes relies on the conversion rate of the marketing team as well)
- How can you feed the results back to update the model? At what rate? Is the model easy to update, or must it be re-trained for every batch of newly collected data? Can you re-train it within the update interval?
- Will the triggered actions influence the upcoming data (e.g. a recommender system can change the distribution of the future population)? Are there any amplification effects (if you recommend the most popular items, those will become even more popular, and so on)?
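One of the questions above, the uplift over business as usual, reduces to simple arithmetic once an A/B-style comparison is available. A minimal sketch, where the function names and the example counts are illustrative rather than from any real campaign:

```python
def conversion_rate(conversions: int, contacted: int) -> float:
    """Fraction of contacted customers who converted."""
    return conversions / contacted if contacted else 0.0

def uplift(model_conversions: int, model_contacted: int,
           baseline_conversions: int, baseline_contacted: int) -> float:
    """Absolute uplift in conversion rate of the model-driven
    campaign over the business-as-usual baseline."""
    return (conversion_rate(model_conversions, model_contacted)
            - conversion_rate(baseline_conversions, baseline_contacted))

# e.g. model targets 1,000 customers with 80 conversions (8%),
# baseline contacts 1,000 customers with 50 conversions (5%):
# the uplift is 3 percentage points.
```

Framed this way the metric speaks the business language directly: extra conversions per customer contacted, rather than an abstract score.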
My experience suggests that the more time you spend implementing a robust and exhaustive evaluation framework, the easier and more reliable maintaining and improving the system will be later. Time spent here is a good investment and requires a lot of thinking across all three data science aspects: business, statistics and engineering.
It is good practice to demo advances and new results to the team and/or stakeholders at the end of the sprint. Feeling the continuous pace of delivery and improvement is an excellent psychological element and increases trust and confidence.
Moreover, the demo is where scrutiny comes in and where your methodology and interpretations can be challenged. Any deliverable or document presented during the demo should be stored in the wiki with a date associated with it.