Agile Data Science Iteration 0: The Initial Investigation

This is the second post in the Agile Data Science Iteration 0 series.

Previously

What we have achieved so far (see the previous posts in the series):

  1. Rigorously define the business problem we are attempting to solve and why it is important
  2. Define the objective acceptance criteria
  3. Develop the validation framework (i.e., the acceptance test)

At this stage you should have a clear statement of what problem you are trying to solve and how you can objectively measure the quality of any possible solution regardless of what the final implementation will be. It’s time to start doing some initial research.

The Initial Investigation

4. Stop thinking, start googling!

You will be surprised at how many people have tried to solve similar problems before. Don’t start designing your solution before you have a clear view of the state of the art. That includes published papers, libraries, blog posts, presentations, tools… Do you recall a friend who worked on something similar at a previous company a year ago? Give them a ring!

5. Gather initial dataset into proper infrastructure

Does your company dump its data into the corporate Data Warehouse, leaving you with an ODBC driver or an ad-hoc client for running SQL queries until the tedious procedure of getting the right data into the “Analytics Cluster” is completed? That is probably not the best approach!

Never let bad technology slow down your development. You are a Data Scientist, you want to be Agile, and you are familiar with your toolset; you have no time to spend on legacy infrastructure.

Have you got your analytical cluster, but there is no data in it, or just a sample such as “the year 2004 for the 10% of the male population aged 50-55 that watched Cricket on Friday night”? Go get the dataset yourself!

As an Agile Data Scientist you are also expected to find your way around IT blockers and to have at least a minimum of engineering and DevOps skills.

Wrong technology decisions made at this early stage can cause expensive technical debt later on. Be wise!
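To make “go get the dataset yourself” a bit more concrete, here is a minimal sketch of streaming a reproducible sample out of a SQL source into CSV, using only the Python standard library. Everything named here is an assumption made for illustration: the `events` table and its columns, the `export_sample` and `demo_warehouse` helpers, and the in-memory SQLite database that stands in for the corporate warehouse. With a real warehouse you would pass in whatever DB-API connection your driver provides, and the placeholder syntax in the query may differ.

```python
import csv
import sqlite3

# Hypothetical sampling query: keep only users whose id is a multiple of `?`.
SAMPLE_QUERY = (
    "SELECT user_id, year, minutes_watched "
    "FROM events WHERE user_id % ? = 0 ORDER BY user_id"
)


def export_sample(conn, query, params, out_file, chunk_size=10_000):
    """Stream the result of `query` into `out_file` as CSV; return the row count."""
    cursor = conn.cursor()
    cursor.execute(query, params)
    # Column names come from the cursor metadata, not from a hard-coded list.
    header = [col[0] for col in cursor.description]
    writer = csv.writer(out_file)
    writer.writerow(header)
    n_rows = 0
    while True:
        # Fetch in chunks so memory use stays flat for large result sets.
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        writer.writerows(rows)
        n_rows += len(rows)
    return n_rows


def demo_warehouse(n_users=1000):
    """Build a toy in-memory database standing in for the real warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (user_id INTEGER, year INTEGER, minutes_watched REAL)"
    )
    rows = [(i, 2004 + i % 3, float(i % 7)) for i in range(n_users)]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    import sys

    # Export roughly 10% of a tiny toy table to standard output.
    export_sample(demo_warehouse(n_users=50), SAMPLE_QUERY, (10,), sys.stdout)
```

Filtering on the user id modulo a constant keeps roughly one user in ten, and always the same ones, so the sample is reproducible between runs. Whether that is a fair sample of your population is a separate question worth checking during the EDA.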

6. Initial Exploratory Data Analysis (EDA) targeted at understanding the underlying dataset

You might find two opposite types of Data Scientist:

type a) who spends months interrogating the data, discovering patterns and insights, and always finding new ways to show how the model might be improved.
type b) who already knows what they want to do and sees no need to spend time investigating whether it will work or not.

Neither of them has got it right. You should always be curious, explore, and visualise details, but never forget that you are time-bound: the “best solution ever, but delivered in a year’s time” does not bring the same value as an imprecise but quick MVP delivered in a month.

The goal of your initial investigation is to gain a quick understanding of the underlying dataset and of whether a proposed solution may be applicable, before jumping straight into an implementation that you may sooner or later find to be unsuitable.
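As a rough idea of what a quick first pass can look like, the sketch below computes basic descriptive statistics per column and lists the pairs of columns that are strongly correlated. It uses only the Python standard library; the helper names (`summarise`, `pearson`, `correlation_report`), the 0.5 threshold, and the toy columns are assumptions made for illustration, and in practice you would likely reach for your usual analytical toolkit instead.

```python
import math
import statistics


def summarise(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
        # Sample standard deviation needs at least two points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_report(columns, threshold=0.5):
    """List (name_a, name_b, r) for column pairs with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            # A nan correlation fails this comparison and is skipped.
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


if __name__ == "__main__":
    # Toy data, invented purely for the example.
    columns = {
        "age": [20, 30, 40, 50],
        "minutes_watched": [10, 20, 30, 40],
        "noise": [1, -1, -1, 1],
    }
    for name, values in columns.items():
        print(name, summarise(values))
    print(correlation_report(columns))
```

A report like this only flags easy-to-spot linear relationships. That is usually enough to decide whether a first simple solution is worth trying, which is all this stage needs.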

To conclude, you should now have a high-level overview of what has already been done to approach your problem. You have got a big-enough sample of your data into a proper analytical toolkit. You have gained an initial understanding of what the data look like, what distributions you can observe, and what easy-to-spot correlations exist. You have probably not spent enough time to dive deep into it, but you have collected enough knowledge to start thinking about your first simple solution. You are now ready to implement your first MVP.

Remember that we are working in an agile environment: the quicker we iterate, the quicker we will be able to go back and improve. Focus on quickly gathering the minimum amount of information needed and leave the deep investigation for later stages.

***

Details of how to build your first simple data solution will follow in the next post of the “Agile Data Science Iteration 0” series. Stay tuned.
Meanwhile, why not share or comment below?



About Gianmario

Data Scientist with experience in building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies among the data geek community.