New software projects commonly start in a highly uncertain and chaotic scenario, surrounded by plenty of ideas about which features we might want to implement. In Data Science the problem is amplified by its nondeterministic nature. At the start of a Data Science project we not only don't know what we are trying to implement; we also don't know how to implement it, nor under which circumstances doing so would be possible and correct.
This initial lack of structure often manifests itself as an early spike of unnecessary development and, later in the project, as technical debt and unexplained inconsistencies. You might spend a lot of resources before finding out that the delivered solution simply does not fit the business nature of the problem.
In Agile Data Science the goal is not producing charts, reports, or hacky scripts that call some machine learning library. In Agile Data Science we want to iteratively build production-quality applications that solve true business needs by extracting hidden knowledge from the data.
In this series of posts I would like to share some best practices and procedures for starting a new Data Science project from scratch in a safe and efficient way. The preparation stage consists of a list of tasks to ideally perform before writing the first line of code for your model.
The main steps to cover are (in order):
- The Evaluation Strategy (this post)
- The Initial Investigation
- The Simple Solution
- The ETL
- The Hypothesis-Driven Analysis (HDA)
- The Final Checklist
Even though it might look like old-school Waterfall planning, I don't aim to propose a strict checklist to perform in the declared order. I would rather share what I have learned in my experience as a Data Scientist and what I have seen colleagues do. You might argue that having a list of 20 predefined to-do tasks is not Agile, and I would generally agree with that. Nevertheless, I do personally think that all of them should be addressed (regardless of the order) as a preparation for the Agile development cycle of a Data Science application specifically.
The term "Iteration 0" refers to that preparation stage and does not necessarily mean all of the steps should happen in a single iteration. Following an Agile approach, we will probably split them into a few iterations and very likely revise/refactor some of them incrementally.
The Business Problem
1. Rigorous definition of the business problem we are attempting to solve and why it is important
A clear statement should answer, at least, the following:
- What is the business's basic need
- What are the expectations
- How the business will benefit from it
- What is the current state of the product we are trying to improve
- What are the requirements/constraints
- What will be the form of the final deliverable
- How that deliverable will be able to integrate with the business and drive human actions
The QA Strategy
2. Define your objective acceptance criteria
The Acceptance Test should always be one of the vital parts of any project. Without an acceptance test you cannot prove that your solution is correct and you cannot compare different solutions, thus you cannot iterate fast and safely. Acceptance tests are your only way of preventing mistakes and avoiding delivering solutions that do not actually meet the true business requirements. And this is by no means different in Data Science.
If you can’t define it, you can’t measure it. If it can’t be measured, it shouldn’t be reported. Define, then measure, then report.
Acceptance testing means treating any possible implementation as a black box and running an unbiased validation test that sets minimum expectations on the model, according to measures that quantify the real business value.
Acceptance tests must not necessarily evaluate the accuracy of a Machine Learning model. They should cover the whole end-to-end workflow, from data cleansing to the final deliverable.
Let's suppose you are implementing a model for resolving and linking entities from one dataset to a reference one. Suppose that internally you will be implementing a predictive model for inferring a missing feature from the N dimensions that represent your entities in your domain space. You may want to evaluate the accuracy of your predictions using cross-validation, confusion matrices, ROC curves and so on. But this evaluation would only refer to your particular implementation and only test a subcomponent of the final solution. Instead, you want to ask yourself: "why does the business care about doing this entity resolution and linkage?". Maybe the answer is "we need to normalise our dataset to be able to deduplicate and increase the number of patterns detected by our current product". Thus, you want to evaluate the following measures instead:
- Growth of patterns recognised before and after the normalisation
- Percentage of unlinked records
- Confidence intervals of the linked records
- Stability of your model by adding/removing samples
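To make the idea concrete, here is a minimal sketch (not the author's code) of how such business-level measures could be computed for a hypothetical linkage pipeline; the `link_records` callable and the pattern counts are assumptions for the example:

```python
# Illustrative sketch: business-facing acceptance measures for a
# hypothetical entity-linkage run. `link_records` is an assumed black-box
# callable returning (matched_reference_or_None, confidence) per record.

def linkage_measures(records, link_records, patterns_before, patterns_after):
    """Summarise business-level quality measures of a linkage run."""
    links = [link_records(r) for r in records]
    unlinked = sum(1 for match, _ in links if match is None)
    confidences = [conf for match, conf in links if match is not None]
    return {
        # Growth of recognised patterns after the normalisation
        "pattern_growth": (patterns_after - patterns_before) / patterns_before,
        # Share of records we failed to link to the reference dataset
        "pct_unlinked": unlinked / len(records),
        # Average confidence of the links we did make
        "mean_link_confidence": (
            sum(confidences) / len(confidences) if confidences else 0.0
        ),
    }
```

Note that none of these measures mention the internal predictive model at all: they quantify the outcome the business actually asked for.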
Obviously, having a high AUC in the predictive model used internally will probably improve the overall quality, but that is not the final goal you want to achieve and evaluate at this stage.
A Machine Learning specialist would take a single metric or objective function and try to maximise its performance. A Data Scientist would take the existing product or business case and try to improve it as a whole.
There is no single measure that can evaluate the quality of a given solution, only a pool of different and possibly independent measures. AUC alone means nothing: what is it that you want to achieve? Are false positives and false negatives equally costly? Shouldn't you set minimum requirements along both dimensions? As a Data Scientist you must understand which expectations you must meet before a model can safely go into production without disappointing your users/customers.
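As an illustration of setting minimum requirements along both error dimensions, one could gate a model on precision (controlling false positives) and recall (controlling false negatives) separately rather than on a single AUC figure; the threshold values below are invented for the example:

```python
# Hedged sketch: minimum requirements on both error dimensions instead of
# a single aggregate score. Thresholds are hypothetical, to be agreed with
# the business.

MIN_PRECISION = 0.90  # limits the false positives the business tolerates
MIN_RECALL = 0.70     # limits the false negatives the business tolerates

def meets_requirements(tp, fp, fn):
    """True only if BOTH dimensions clear their minimum bar."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision >= MIN_PRECISION and recall >= MIN_RECALL
```

A model that excels on one dimension while failing the other would still be rejected, which is exactly the point.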
3. Develop the validation framework (ergo, the acceptance test)
A few rules here:
- An Acceptance Test should always have at least a binary outcome: Passed or Failed.
- Don't let yourself or anyone else cheat! Build a robust and unbiased test that cannot be worked around, regardless of who will implement the solution. Consider how tempting it would be to start hacking when under pressure and rushing to deliver a finished, semi-working product.
- The validation framework should be able to compare measures across different implementations. An accuracy of 75% does not tell me much. An accuracy of 75% against a current model that only achieves 58% quantifies the added value.
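The rules above can be sketched as a tiny test harness; everything here (names, the 5% minimum gain) is an assumption for illustration, not a prescribed implementation:

```python
# Illustrative sketch of an acceptance-test harness: each implementation is
# a black box summarised by a dict of measures; the test compares a
# candidate against the current baseline and yields a binary outcome.
# The `min_gain` margin is a hypothetical, business-agreed value.

def acceptance_test(candidate_measures, baseline_measures, min_gain=0.05):
    """Return 'Passed' only if the candidate beats the baseline on every
    measure by at least `min_gain` (absolute)."""
    passed = all(
        candidate_measures[name] >= baseline_measures[name] + min_gain
        for name in baseline_measures
    )
    return "Passed" if passed else "Failed"
```

Because the harness only sees measures, not model internals, it cannot be gamed by a particular implementation, and any two candidates become directly comparable.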
Now you have a clear statement of what problem you are trying to solve and how you can objectively measure the quality of any possible solution, regardless of what the final implementation will be. Your stakeholders won't need to understand the technical details or how your model works, because all they will have to understand are the evaluation results.
Please stay tuned for the next post of the "Agile Data Science Iteration 0" series, regarding "The Initial Investigation"...