This is the second post of the Agile Data Science Iteration 0 series:
- The Evaluation Strategy
- The Initial Investigation <=
- The Simple Solution
- The ETL
- The Hypothesis-Driven Analysis (HDA)
- The Final Checklist
What we have achieved so far (see previous posts above):
- Rigorously defined the business problem we are attempting to solve and why it is important
- Defined our objective acceptance criteria
- Developed the validation framework (i.e. the acceptance test)
At this stage you should have a clear statement of what problem you are trying to solve and how you can objectively measure the quality of any possible solution regardless of what the final implementation will be. It’s time to start doing some initial research.
The Initial Investigation
4. Stop thinking, start googling!
You will be surprised by how many people have tried to solve similar problems before. Don't start thinking about your own solution until you have a clear view of the state of the art. That includes: published papers, libraries, blog posts, presentations, tools… Do you recall a friend who worked on something similar at a previous company a year ago? Ring them!
5. Gather initial dataset into proper infrastructure
Suppose your company dumps data into the corporate Data Warehouse, and all you have is an ODBC driver or an ad-hoc client for running SQL queries until the tedious procedure of getting the right data into the "Analytics Cluster" is completed. Maybe not the best approach!
Never let bad technology slow down your development. You are a Data Scientist, you want to be Agile, you are familiar with your toolset, and you have no time to spend on legacy infrastructure.
Have you got your analytics cluster, but there is no data in it, or only a sample like "the year 2004 of the 10% of the male population aged 50-55 that watched cricket on Friday night"? Go get the dataset yourself!
As an Agile Data Scientist you are also expected to find your way around IT blockers, which requires a minimum of engineering and DevOps skills.
Wrong technology decisions made at this early stage can cause expensive technical debt later on. Be wise!
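A minimal sketch of "go get the dataset yourself": pull a representative sample from the warehouse with plain SQL and persist it locally, so later analysis does not depend on the warehouse being reachable. The table name, columns, and the in-memory SQLite stand-in below are all hypothetical; in practice you would point an ODBC/SQLAlchemy connection at your real source.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the corporate warehouse: an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, channel TEXT, watched_min REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "cricket", 42.0), (2, "news", 10.5), (3, "cricket", 88.0)],
)

# Pull a random sample rather than a biased slice of the population.
sample = pd.read_sql_query(
    "SELECT * FROM events ORDER BY RANDOM() LIMIT 10000", conn
)

# Persist locally so the investigation can proceed with your own toolset.
sample.to_csv("events_sample.csv", index=False)
```

The point is not the specific storage format: any snapshot you fully control (CSV, Parquet, a local database) beats re-querying a gatekept warehouse on every iteration.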
6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
You might encounter two extremes of Data Scientist:
type a) those who will spend months inquiring into the data, discovering patterns and insights, and forever finding better ways to improve what the model may look like;
type b) those who already know what they want to do and don't feel the need to spend time investigating whether it may work or not.
Neither of them gets it right. You should always be curious, explore, and visualise details, but never forget that you are time-bounded: the "best solution ever, delivered in a year's time" does not bring the same value as an imprecise but quick MVP delivered in a month.
The goal of your initial investigation is to gain a quick understanding of the underlying dataset, and of whether a proposed solution may be applicable, before jumping straight into an implementation that you may sooner or later discover to be unsuitable.
To conclude, you should now have a high-level overview of what has already been done to approach your problem. You have a big-enough sample of your data in a proper analytical toolkit. You have gained an initial understanding of what the data look like, what distributions you can observe, and what easy-to-spot correlations you can find. You probably have not yet had time to dive deep, but you have collected enough knowledge to start thinking about your first simple solution. You are now ready to implement your first MVP.
Remember we are working in an agile environment, the quicker we iterate the quicker we will be able to go back and improve. Focus on quickly gathering the minimum amount of information needed and leave the deep investigation for later stages.
Details of how to build your first simple data solution will follow in the next post of the "Agile Data Science Iteration 0" series, stay tuned.
Meanwhile, why not share or comment below?