Agile Data Science Iteration 0: The Hypothesis-Driven Analysis

This is the fifth post of the Agile Data Science Iteration 0 series:

Previously

What we have achieved so far (see previous posts above):

  1. Rigorously define the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather the initial dataset into proper infrastructure
  6. Perform an initial Exploratory Data Analysis (EDA) targeted at understanding the underlying dataset
  7. Define and quickly develop the simplest solution to the problem
  8. Release/demo the first basic solution
  9. Research background for ways to improve the basic solution
  10. Gather additional data into proper infrastructure (if required)
  11. Perform ad-hoc Exploratory Data Analysis (EDA)
  12. Propose a better solution, weighing potential risks against marginal gain
  13. Develop the Data Sanity check
  14. Define the Data Types of your application domain
  15. Develop the ETL and output the normalised data into proper infrastructure

At this stage you have already modelled some entities of your application logic. You know the raw data well and have already produced a normalised and cleaned version of your dataset. Your data is now sanitised and stored in a proper analytical infrastructure. Ask yourself: what assumptions have I made so far, and what assumptions am I going to make? Agile Data Science, even though it is production- and engineering-oriented, is not just software engineering. Agile Data Science is science, and thus it must comply with the scientific method.

The Oxford dictionary defines “scientific method” as:

“a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.”

And this is by no means different in the Data Science methodology.

The Hypothesis-Driven Analysis

16. Clearly state all of your assumptions/hypotheses and document whether or not they have been verified and how they can be verified

In Data Science we implement models and applications in a highly non-deterministic context where we often make assumptions to simplify the problem. Assumptions are generally made based on intuition, common sense, previous experience, domain knowledge, or sometimes simply because the model requires them.

Even though they might seem appropriate, assumptions are dangerous! Unverified assumptions can easily lead to inconsistencies or, even worse, silently produce wrong results.

We can’t get rid of all of our assumptions and build an assumption-free model, but we should document them, verify them as soon as possible, and track them over time. It is fine to have not-yet-fully-verified assumptions at this early stage, but they should not be forgotten, and their verification should be planned in the immediately following iterations.

Every time we present any result we should clearly state all of the assumptions that have been made and whether or not they have been verified.
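One lightweight way to keep assumptions documented and attached to every result is a small assumptions register in code. The sketch below is hypothetical (the `Assumption` class, the example assumptions, and the `report` helper are illustrative names, not part of any framework mentioned in this series):

```python
# Hypothetical assumptions register: each assumption is documented together
# with its rationale, its verification status, and how it can be verified.
from dataclasses import dataclass
from typing import List


@dataclass
class Assumption:
    statement: str     # what we assume
    rationale: str     # why we assume it (intuition, domain knowledge, ...)
    verified: bool     # has it been verified yet?
    verification: str  # how it can be (or was) verified


ASSUMPTIONS: List[Assumption] = [
    Assumption(
        statement="Daily observations are independent",
        rationale="Simplifies the model; common in similar problems",
        verified=False,
        verification="Autocorrelation test on the normalised time series",
    ),
    Assumption(
        statement="Missing values are missing at random",
        rationale="Domain expert opinion",
        verified=True,
        verification="Compared distributions of rows with/without missing fields",
    ),
]


def report(assumptions: List[Assumption]) -> str:
    """Summary to attach to every presentation of results."""
    lines = []
    for a in assumptions:
        status = "VERIFIED" if a.verified else "NOT YET VERIFIED"
        lines.append(f"[{status}] {a.statement} -- verify via: {a.verification}")
    return "\n".join(lines)


print(report(ASSUMPTIONS))
```

Keeping the register in version control means the not-yet-verified entries double as a backlog for the next iterations.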

17. Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data

What if the underlying dataset or the observed environment has changed? Are our hypotheses still valid?
It is extremely important to develop an automated framework for running tests and experiments that validate all of the existing hypotheses.
We cannot be confident about our deliverables if we are not sure that our hypotheses are correct, and if anything has changed we must be able to find out immediately.

Yet, it is often hard to have tests with a boolean outcome: success or failure. It is good practice, though, to have at least an automated job that calculates some key descriptive statistics to help us understand the underlying dataset and guide the validation of our hypotheses. Think carefully about which measures your model would need in order to understand whether or not the proposed solution makes sense.
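A minimal HDA job, then, combines the two parts described above: boolean hypothesis checks where a verdict is possible, plus a descriptive-statistics summary where it is not. The following sketch assumes a toy list of values standing in for the normalised data, and the hypotheses and thresholds (`check_non_negative`, `check_low_dispersion`, `max_cv`) are invented for illustration:

```python
# Minimal sketch of an automated Hypothesis-Driven Analysis (HDA) job:
# boolean hypothesis checks plus a descriptive statistics summary.
import statistics


def summary_stats(values):
    """Key descriptive statistics to guide hypothesis validation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def check_non_negative(values):
    """Hypothesis: the measured quantity is never negative."""
    return all(v >= 0 for v in values)


def check_low_dispersion(values, max_cv=1.0):
    """Hypothesis: dispersion stays within an assumed coefficient of variation."""
    s = summary_stats(values)
    return s["stdev"] / s["mean"] <= max_cv


def run_hda(values):
    """One run of the automated HDA over the normalised data."""
    return {
        "stats": summary_stats(values),
        "hypotheses": {
            "non_negative": check_non_negative(values),
            "low_dispersion": check_low_dispersion(values),
        },
    }


# Toy data standing in for the real normalised dataset
data = [3.2, 4.1, 3.8, 4.5, 3.9, 4.0]
result = run_hda(data)
```

Scheduled to run whenever the data is refreshed, a job like this surfaces immediately when a previously verified hypothesis stops holding.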

18. Analyse the output of the automated HDA to adjust/revise the proposed solution

The output of your HDA framework is your best friend when going back to make the first changes to the proposed solution. You want to account for what the real phenomena are, regardless of what your original thoughts were.
If you manage to get all of your hypotheses right on the first shot, think twice!
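Concretely, the failed checks in the HDA output are the list of things to revisit before modelling starts. A tiny illustrative sketch (the report dict and hypothesis names are made up for the example):

```python
# Hypothetical sketch: turning an HDA report into a revision to-do list.
def revisions_needed(hda_output):
    """Return the hypotheses that failed and therefore require the
    proposed solution (or the assumption itself) to be revised."""
    return [name for name, ok in hda_output["hypotheses"].items() if not ok]


report = {"hypotheses": {"non_negative": True, "stationarity": False}}
print(revisions_needed(report))  # prints ['stationarity']
```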

Now you have a very detailed picture of your proposed solution and all of its requirements. You have gained a deep understanding of every detail you will need during the development and evaluation of your model, and you have already built all of the tools to support you in that. You can feel safe trying out whatever you want, because you know that your tests will check its validity. You have reduced the risks on this project to a minimum before even implementing the first line of code for your model.

Align with your stakeholders and product owners and define the initial roadmap and expectations you want to meet for the first MVP.

***

A summary of the complete “Agile Data Science Iteration 0” series will be published soon, stay tuned.
Meanwhile, why not share or comment below?

The ETL << prev | next >> The Final Checklist


About Gianmario

Data Scientist with experience in building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geek community.


