Starting a new project in a highly uncertain and chaotic scenario, surrounded by plenty of ideas about which features we might want to implement, is a very common pattern in software development. In Data Science the problem is amplified even further by its nondeterministic nature. At the start of a Data Science project we not only don't know what we are trying to implement, we also don't know how to implement it, or under which circumstances that would be possible and correct.
This initial lack of structure often manifests itself first as a spike of unnecessary development and, later in the project, as technical debt and unexplained inconsistencies. You might spend a lot of resources before finding out that the delivered solution simply does not fit the business nature of the problem.
In Agile Data Science the goal should not be producing charts and reports, or hacky scripts calling some machine learning library. In Agile Data Science we want to iteratively build production-quality applications that solve true business needs by extracting hidden knowledge from the data.
This is the final summarising post of the Agile Data Science Iteration 0 series:
- The problem definition and the evaluation strategy
- The initial investigation
- The simple solution
- The ETL
- The Hypothesis-Driven Analysis (HDA)
- The complete checklist <=
The Complete Checklist
- Rigorous definition of the business problem we are attempting to solve and why it is important
- Define your objective acceptance criteria
- Develop the validation framework (ergo, the acceptance test)
- Stop thinking, start googling!
- Gather initial dataset into proper infrastructure
- Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
- Define and quickly develop the simplest solution to the problem
- Release/demo first basic solution
- Research of background for ways to improve the basic solution
- Gather additional data into proper infrastructure (if required)
- Ad-hoc Exploratory Data Analysis (EDA)
- Propose a better solution, weighing potential risks against the marginal gain
- Develop the Data Sanity check
- Define the Data Types of your application domain
- Develop the ETL and output the normalised data into a proper infrastructure
- Clearly state all of the assumptions/hypotheses and document whether they have been verified or not, and how they can be verified
- Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data
- Analyse the output of the automated HDA to adjust/revise the proposed solution
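The "acceptance test" item from the checklist above can be sketched in code. The following is a minimal illustration, not a prescribed implementation: it assumes the agreed business criterion is a minimum ROC AUC on a held-out set, and the model scores, labels, and threshold are all toy placeholders.

```python
# A minimal sketch of an automated acceptance test, assuming the agreed
# business criterion is "ROC AUC of at least 0.75 on a held-out set".
# The data and the threshold are illustrative placeholders.

def roc_auc(labels, scores):
    """Plain-Python ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def acceptance_test(labels, scores, threshold=0.75):
    """The objective acceptance criterion, encoded as a single boolean check."""
    return roc_auc(labels, scores) >= threshold

# Toy held-out evaluation set: true labels and model scores.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(acceptance_test(y_true, y_score))  # -> True (AUC is 8/9 here)
```

The point is that the criterion is executable and binary: a release candidate either passes or it does not, with no room for interpretation at demo time.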
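Similarly, the Data Sanity check and the automated Hypothesis-Driven Analysis items can be sketched together. This is a toy example under loudly stated assumptions: the record schema (`user_id`, `age`, `spend`) and the hypotheses are invented for illustration, not part of the series' methodology.

```python
# A minimal sketch of a Data Sanity check plus an automated
# Hypothesis-Driven Analysis (hypothesis validation + statistics summary)
# over normalised records. Field names and hypotheses are illustrative.
from statistics import mean, stdev

records = [  # toy stand-in for the normalised output of the ETL
    {"user_id": 1, "age": 34, "spend": 120.0},
    {"user_id": 2, "age": 27, "spend": 80.5},
    {"user_id": 3, "age": 45, "spend": 210.0},
]

def sanity_check(rows):
    """Fail fast if the data violates basic structural expectations."""
    assert rows, "dataset must not be empty"
    for r in rows:
        assert set(r) == {"user_id", "age", "spend"}, f"unexpected schema: {r}"
        assert r["age"] > 0 and r["spend"] >= 0, f"out-of-range values: {r}"

def hypothesis_driven_analysis(rows):
    """Validate each stated hypothesis and emit a statistics summary."""
    ages = [r["age"] for r in rows]
    spends = [r["spend"] for r in rows]
    hypotheses = {
        "H1: all users are adults": min(ages) >= 18,
        "H2: no negative spend": min(spends) >= 0,
    }
    summary = {"mean_age": mean(ages), "spend_stdev": stdev(spends)}
    return hypotheses, summary

sanity_check(records)
checks, stats = hypothesis_driven_analysis(records)
print(checks, stats)
```

Because both steps run unattended on top of the normalised data, any hypothesis that fails shows up as a named boolean rather than a surprise buried in a notebook.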
At the end of Iteration 0 you have a very solid starting point for your project, and you can now follow the typical Agile development cycle, whether you prefer Scrum, Kanban, a mix of the two, or your own ad-hoc methodology.
Regardless of whether you use a strict or a flexible workflow, keep in mind that the main difference from Agile iterations for software development is that a ticket is typically broad and open-ended. You should not be surprised if the majority of your tickets then get split into multiple sub-tickets after the initial investigation of the problem. You should allow subtasks to be created even after the sprint planning. In some cases you may prefer to mark them as blockers and re-scope them into the next sprint; in other cases you may want to allow them to affect the current sprint.
What is important is that you start implementing production-quality code only when the requirements and the acceptance test are well defined. In Data Science this is unlikely to be true all the time. Every time you are presented with an open problem to investigate and solve, you should try to break it into research/analysis and development subtasks.
What not to do?
- Do not start any development without first doing detailed research/investigation
- Do not just deliver analysis code in notebooks; after your investigation, move the code to production-quality standards
- Do not blindly trust external libraries or APIs if you don't know exactly what they do and return; run some tests if needed
- Do not generate manual reports of your findings until the experiments are reproducible and automated
- Do not deploy any model until all of the assumptions have been stated and verified
- Do not be too lazy to learn better technologies and methodologies!
To conclude, in this series of posts I just wanted to share some of my experience of starting new Data Science projects, and some common problems that I have seen addressed in a confused and chaotic way. I hope that by following these guidelines you can reduce the technical debt of the project and the risk of working for several months without ever delivering a correct and working solution.
More details of the Agile cycle for Data Science applications, and in particular how to time-box open-ended questions, will be covered in another post. Stay tuned and get ready to run!
The Hypothesis-Driven Analysis << prev