This is the fifth post of the Agile Data Science Iteration 0 series:
- The Evaluation Strategy
- The Initial Investigation
- The Simple Solution
- The ETL
- The Hypothesis-Driven Analysis (HDA) <=
- The Final Checklist
What we have achieved so far (see previous posts above):
- Rigorous definition of the business problem we are attempting to solve and why it is important
- Define your objective acceptance criteria
- Develop the validation framework (ergo, the acceptance test)
- Stop thinking, start googling!
- Gather initial dataset into proper infrastructure
- Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
- Define and quickly develop the simplest solution to the problem
- Release/demo first basic solution
- Research background material for ways to improve the basic solution
- Gather additional data into proper infrastructure (if required)
- Ad-hoc Exploratory Data Analysis (EDA)
- Propose better solution minimising potential risks and marginal gain
- Develop the Data Sanity check
- Define the Data Types of your application domain
- Develop the ETL and output the normalised data into a proper infrastructure
At this stage you have already modelled some entities of your application logic. You know the raw data well and have already produced a normalised and cleaned version of your dataset. Your data is now sanitised and stored in a proper analytical infrastructure. Ask yourself: what assumptions have I made so far, and which am I going to make? Agile Data Science, even though it is production and engineering oriented, is not just software engineering. Agile Data Science is Science, and thus it must comply with the scientific method.
The Oxford dictionary defines “scientific method” as:
“a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.”
And this is by no means different in the Data Science methodology.
The Hypothesis-Driven Analysis
16. Clearly state all of your assumptions/hypotheses and document whether they have been verified or not and how they can be verified
In Data Science we implement models and applications in a highly non-deterministic context where we often make assumptions to simplify the problem. Assumptions are generally made based on intuition, common sense, previous experience, or domain knowledge, or sometimes simply because the model requires them.
Even though they might seem appropriate, they are dangerous! Unverified assumptions can easily lead to inconsistencies or, even worse, silently produce wrong results.
We can’t get rid of all of our assumptions and build an assumption-free model, but we should document them, verify them as soon as possible, and track them over time. It is fine to have not-yet-fully-verified assumptions at this early stage, but they should not be forgotten, and their verification should be planned in the immediately following iterations.
Every time we present any result, we should clearly state all of the assumptions that have been made and whether they have been verified or not.
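One lightweight way to keep assumptions documented and trackable is a small registry in code. The sketch below is a minimal illustration, not a prescribed tool: the `Assumption` class, the example statements, and the verification notes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Assumption:
    """One documented assumption/hypothesis and its verification status."""
    statement: str          # what we assume to be true
    rationale: str          # why we made it (intuition, domain knowledge, ...)
    verified: bool = False  # has it been checked against the data yet?
    how_to_verify: str = "" # the planned verification (test, query, experiment)

# Hypothetical examples of entries such a registry could hold
registry = [
    Assumption(
        statement="User sessions are independent of each other",
        rationale="Simplifies the model; common sense for anonymous traffic",
        verified=False,
        how_to_verify="Test autocorrelation between consecutive sessions",
    ),
    Assumption(
        statement="Prices are strictly positive",
        rationale="Domain knowledge",
        verified=True,
        how_to_verify="Automated check in the data sanity job",
    ),
]

# When presenting results, report every assumption and its status
for a in registry:
    status = "VERIFIED" if a.verified else "NOT YET VERIFIED"
    print(f"[{status}] {a.statement} (verify via: {a.how_to_verify})")
```

Keeping this list in version control next to the model code means the not-yet-verified entries are visible in every review and can be planned into the next iterations rather than forgotten.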
17. Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data
What if the underlying dataset or the observed environment has changed? Are our hypotheses still valid?
It is extremely important to develop an automated framework for running tests and experiments to validate all of the existing hypotheses.
We cannot be confident about our deliverables if we are not sure that our hypotheses are correct, and if anything has changed we must be able to find out immediately.
Yet, it is often hard to have tests with a boolean outcome: Success or Failure. It is good practice, though, to have at least an automated job that calculates some key descriptive statistics that can help us understand the underlying dataset and guide the validation of our hypotheses. Think carefully about which measures you would need to know in order to understand whether the proposed solution makes sense.
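An automated HDA job can combine the two parts described above: boolean hypothesis checks where they are possible, plus a descriptive-statistics summary where they are not. The sketch below is a minimal illustration, assuming the normalised data arrives as a list of records; the column names and the specific hypotheses are purely hypothetical.

```python
import statistics

def hda_report(rows):
    """Run hypothesis checks plus a descriptive-statistics summary.

    `rows` is a list of records from the normalised data store;
    the columns and hypotheses below are illustrative only.
    """
    prices = [r["price"] for r in rows]
    ages = [r["age"] for r in rows]
    # Boolean hypothesis checks (Success/Failure where possible)
    checks = {
        "price_is_positive": all(p > 0 for p in prices),
        "age_within_plausible_range": all(0 <= a <= 120 for a in ages),
    }
    # Key descriptive statistics to guide the non-boolean judgments
    summary = {
        "price_mean": statistics.mean(prices),
        "price_stdev": statistics.stdev(prices),
        "age_median": statistics.median(ages),
    }
    return {"checks": checks, "summary": summary}

# Toy input standing in for the normalised dataset
rows = [
    {"price": 9.99, "age": 23},
    {"price": 4.50, "age": 41},
    {"price": 12.00, "age": 35},
]
report = hda_report(rows)
failed = [name for name, ok in report["checks"].items() if not ok]
assert not failed, f"Hypotheses invalidated: {failed}"
```

Scheduling a job like this to run whenever the underlying data is refreshed is what makes invalidated hypotheses surface immediately rather than silently corrupting results.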
18. Analyse the output of the automated HDA to adjust/revise the proposed solution
The output of your HDA framework is your best friend when going back to make the first changes to the proposed solution. You want to account for what the real phenomena are, regardless of what your original thoughts were.
If you manage to get all of your hypotheses right on the first shot, think twice!
Now you have a very detailed picture of your proposed solution and all of its requirements. You have gained a deep understanding of every detail you will need during the development and evaluation of your model, and you have already built all of the tools to support you in that. You can feel safe trying out whatever you want, because you know that your tests will check its validity. You have reduced the risks on this project to a minimum before even implementing the first line of code for your model.
Align with your stakeholders and product owners and define the initial roadmap and expectations you want to meet for the first MVP.
The summary of the complete “Agile Data Science Iteration 0” series will be published soon, stay tuned.
Meanwhile, why not share or comment below?