The balance of exploratory analysis and development

This is part 3 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Exploratory Data Analysis / Research

Exploratory analysis should precede and follow any task, from modelling, design and development to benchmarking. The major problem is: how do you share, track and monitor your findings? How do you make your analysis repeatable and open to scrutiny from the outside? This is still an open problem.

Notebooks tend to be the best tools for the job, but be careful. EDA is an open research/investigation task, so you need a criterion for drawing the line of when to stop. My suggestion is to avoid scope-free analysis and always accompany EDA with a well-defined goal/task. In a small and clearly defined task you know what you want to achieve; you just don’t know how to get there and what obstacles you may find.

A proposed sequential but iterative workflow is:

  1. The planned story defines the high-level goal you are working towards.
  2. Start your EDA, and as soon as you find something interesting, stop investigating and define a development sub-ticket.
  3. You now develop the minimum amount of code that implements the requirements specified during the analysis step.
    Those requirements should not change after their first definition; complete them as they stand and refactor later in another sub-ticket or a different iteration.
  4. Before sending it for review and/or resolving the sub-ticket, perform another EDA step to verify that the newly created branch meets the intended requirements. You are not solving the greater problem; you only care about the just-defined sub-problem. It is very dangerous to mix development and analysis at the same time, since you may end up in an infinite loop where you keep changing your requirements as you analyse and never reach an end.
  5. After completion of the subtask, you can switch back to the main workflow thread.

My suggestion is to time-box any open-ended task: say you are going to spend no more than X hours/days on this research, and by then you will come out with development requirements or insight reports that move the project towards the final story goal. Remember you will have to complete the story by the end of the sprint. Scope the problems small enough to reduce the risk of not meeting expectations.

Get to an end-to-end solution as quickly as possible and postpone any complications, ideas or new features to the next iterations. EDA/research is generally a good way to fill your backlog for future scoping.

I leave it as an open question what to do with those notebooks after the investigation is completed. They are a bit tricky to maintain: when you change the codebase or a new dataset comes in, the notebooks become obsolete, and we don’t want to refactor them every time to make sure they still work. I personally see notebooks more as one-off analyses that are archived after being used.

I tend to translate all of my findings and assumptions into project requirements so that they don’t get lost. In my opinion only the automated tasks should be maintained over time. Results from manual tasks that cannot be automated should be documented, stamped and archived in the wiki.

Evaluation

Unit tests make sure that the code does what it is meant to do, but that does not imply you are solving the right problem in an acceptable way. The evaluation strategy should reflect the real business scenario in which the model will be used. The choice of performance metrics must have a meaningful explanation within the business context. Metrics should be easy to interpret for your stakeholders, who generally are not data scientists and only speak the company’s business language.

A good tip is to create a Kaggle-like framework (a minimal sketch follows the list below) that:

  • defines the APIs reflecting your custom data types
  • uses an abstract interface representing the particular implementation (which could be split into multiple components, e.g. transformer, trainer, model)
  • knows how to robustly validate the given implementation (e.g. cross-fold validation, domain-specific splitting that avoids data leakage, a mix of timestamp and customerId partitioning…)
  • produces one or a pool of interpretable performance metrics such as mean average precision @ N, uplift, spam rate, loss rate or retention rate; avoid abstract concepts like area under the curve or F-score
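
Here is a minimal sketch of what such a framework could look like, using Python and a hypothetical recommender use case; the `Recommender` interface, the timestamp-based split and the MAP@N metric are illustrative choices on my part, not the actual implementation we used.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Recommender(ABC):
    """Abstract interface that every candidate implementation must satisfy."""

    @abstractmethod
    def fit(self, interactions):
        """interactions: iterable of (customer_id, item_id, timestamp) tuples."""

    @abstractmethod
    def recommend(self, customer_id, n):
        """Return a ranked list of the top-n item_ids for the given customer."""


def split_by_time(interactions, cutoff_ts):
    """Train on everything before the cutoff, test on everything after it,
    so that no future information leaks into the training set."""
    train = [x for x in interactions if x[2] < cutoff_ts]
    test = [x for x in interactions if x[2] >= cutoff_ts]
    return train, test


def average_precision_at_n(recommended, relevant, n):
    """Average precision @ n for a single customer."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), n)


def evaluate(model, interactions, cutoff_ts, n=10):
    """Fit on the past, report mean average precision @ n on the future."""
    train, test = split_by_time(interactions, cutoff_ts)
    model.fit(train)
    relevant_by_customer = defaultdict(set)
    for customer, item, _ in test:
        relevant_by_customer[customer].add(item)
    scores = [
        average_precision_at_n(model.recommend(customer, n), relevant, n)
        for customer, relevant in relevant_by_customer.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

The point is that any candidate model only has to implement the interface; the splitting and metric code stays fixed, which is what makes different prototypes directly comparable.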

Soon you will find a blog post from our team about an offsite in Lanzarote where, following this Kaggle-like structure, we prototyped 6 different models for a recommender system in less than a week.

When building the evaluation framework, a few questions you want to ask are:

  • What does a positive/negative sample represent in this business scenario?
  • Is recall important? Why do you care about accuracy?
  • What actions can be taken upon prediction?
  • In which form can the model be used? How can the insights be presented/visualized? Can it be integrated into an existing IT system?
  • What are the capabilities/practical issues of following the decisions suggested by the model?
  • What is the uplift of the data-driven solution compared to traditional business-as-usual performance? (A back-of-the-envelope computation is sketched after this list.)
  • How can you test the trained model in the live environment (is A/B testing possible, or would a bad scenario cause a lot of damage)?
  • Does the effectiveness of your solution depend only on your model, or also on other parties? (e.g. predicting which customers to contact for marketing purposes also relies on the conversion rate of the marketing team)
  • How can you feed the results back to update the model? At which rate? Is the model easy to update, or must it be re-trained for every batch of newly collected data? Can you re-train it within the update interval?
  • Will the triggered actions influence the upcoming data (e.g. a recommender system can change the distribution of the future population)? Are there any amplification effects (if you recommend the most popular items, those will become even more popular, and so on…)?
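
To make the uplift question concrete, here is a back-of-the-envelope sketch; the function and the numbers are purely illustrative.

```python
def uplift(treated_conversions, treated_size, control_conversions, control_size):
    """Incremental conversion rate of the model-driven group over business as usual."""
    return treated_conversions / treated_size - control_conversions / control_size


# Hypothetical figures: 300 conversions out of 10,000 model-targeted customers
# versus 200 out of 10,000 customers contacted under the business-as-usual rules.
print(uplift(300, 10_000, 200, 10_000))  # 0.01, i.e. one extra conversion per 100 contacts
```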

My experience suggests that the more time you spend implementing a robust and exhaustive evaluation framework, the easier and more reliable it will be to maintain and improve the system later. Time spent here is a good investment; it requires a lot of thinking across all three data science aspects: business, statistics and engineering.

Demo

It is good practice to demo progress and new results to the team and/or stakeholders at the end of the sprint. Feeling the continuous pace of delivery and improvement is an excellent psychological element and increases trust and confidence.

Moreover, it is the place where scrutiny comes in and your methodology and interpretations can be challenged. Any deliverable or document presented during the demo should be stored in the wiki with a date associated with it.

Agile Data Science Iteration 0: The Initial Investigation

This is the second post of the “Agile Data Science Iteration 0” series.

Previously

What we have achieved so far (see the previous posts of this series):

  1. Rigorously define the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (i.e. the acceptance test)

At this stage you should have a clear statement of what problem you are trying to solve and how you can objectively measure the quality of any possible solution regardless of what the final implementation will be. It’s time to start doing some initial research.

The Initial Investigation

4. Stop thinking, start googling!

You will be surprised by how many people have tried to solve similar problems before. Don’t attempt to think about your solution before you have a clear view of the state of the art: published papers, libraries, blog posts, presentations, tools… You recall a friend working on something similar at his previous company a year ago? Ring him!

5. Gather the initial dataset into proper infrastructure

Your company dumps data into the corporate Data Warehouse, and all you have is an ODBC driver or an ad-hoc client for running SQL queries while the tedious procedure of getting the right data into the “Analytics Cluster” is being completed. Maybe not the best approach!

Never let bad technology slow down your development. You are a Data Scientist, you want to be Agile, you are familiar with your toolset, and you have no time to spend on legacy infrastructure.

Have you got your analytical cluster, but there is no data in it, or just a sample of “the year 2004 of the 10% of the male population aged 50-55 that watched cricket on Friday night”? Go and get the dataset yourself!

As an Agile Data Scientist you are also expected to find your way around IT blockers and to have a minimum of engineering and DevOps skills.

Wrong technology decisions made at this early stage can cause expensive technical debt later on. Be wise!

6. Initial Exploratory Data Analysis (EDA) targeted at understanding the underlying dataset

You might find two opposite types of Data Scientists:

type a) those who will spend months inquiring into the data and discovering patterns/insights, always finding better approaches and showing how the model could be improved;
type b) those who already know what they want to do and don’t feel the need to spend time investigating whether it may work or not.

Neither of them has it right: you should always be curious, explore and visualise details, but never forget that you are time-bound and that the “best solution ever, delivered in a year’s time” does not bring the same value as an imprecise but quick MVP delivered in a month.

The goal of your initial investigation is to gain a quick understanding of the underlying dataset and of whether or not a proposed solution may be applicable, before jumping straight into an implementation that you may sooner or later find out to be unsuitable.

To conclude, you should now have a high-level overview of what has already been done to approach your problem. You have a big-enough sample of your data in a proper analytical toolkit. You have gained an initial understanding of what the data look like, what distributions you can observe and what easy-to-spot correlations you have found. You probably have not spent enough time to dive deep into it, but you have collected enough knowledge to start thinking about your first simple solution. You are now ready to implement your first MVP.
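
As a minimal illustration of such a quick first pass, assuming a hypothetical CSV sample extract and pandas as the analytical toolkit (the file name and columns are made up):

```python
import pandas as pd

# Hypothetical sample extract pulled into the analytical environment.
df = pd.read_csv("transactions_sample.csv", parse_dates=["timestamp"])

# Shape, types and missing values: what does the data actually look like?
print(df.info())

# Distributions of the numerical columns and cardinality of the categorical ones.
print(df.describe(include="all"))
print(df.select_dtypes(include="object").nunique())

# Easy-to-spot correlations between the numerical columns.
print(df.corr(numeric_only=True))
```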

Remember we are working in an agile environment: the quicker we iterate, the quicker we will be able to go back and improve. Focus on quickly gathering the minimum amount of information needed and leave the deep investigation for later stages.

***

Details of how to build your first simple data solution will follow in the next post of the “Agile Data Science Iteration 0” series, stay tuned.
Meanwhile, why not share it or comment below?
