In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio

Abstract:

Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called “Data Lake”, where raw data is stored and periodically ingested into a distributed file system such as HDFS.

Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, for example because of regulations or to reduce latency, in which case Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.

We propose an Agile workflow that combines Spark, Scala, DataFrames (and the recent Dataset API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from the source and develop high-quality machine learning pipelines that can then be deployed straight into production.

In this talk we will:

* Present how to load raw data from an RDBMS and use Spark to make it available as a Dataset (a minimal sketch follows this list)

* Explain the iterative exploratory process and advantages of adopting functional programming

* Make a critical analysis of the issues faced with the existing methodology

* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds

* Discuss some future improvements to the overall architecture
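
As a flavour of the first bullet above, here is a minimal sketch of loading a table from an RDBMS over JDBC and exposing it as a strongly typed Dataset. The connection URL, table name and schema are illustrative placeholders, not the actual warehouse setup described in the talk.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

// Placeholder schema for the raw table we want to expose as a Dataset.
case class Transaction(id: Long, customerId: Long, amount: Double)

object JdbcToDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-to-dataset").getOrCreate()
    import spark.implicits._

    val props = new Properties()
    props.setProperty("user", sys.env.getOrElse("DB_USER", "readonly"))
    props.setProperty("password", sys.env.getOrElse("DB_PASSWORD", ""))

    // Read the table in parallel by partitioning on a numeric column, so that each
    // Spark task opens its own JDBC connection to a slice of the source table.
    val raw = spark.read.jdbc(
      url = "jdbc:oracle:thin:@//warehouse-host:1521/DWH", // placeholder URL
      table = "transactions",
      columnName = "id", lowerBound = 0L, upperBound = 10000000L, numPartitions = 32,
      connectionProperties = props)

    // Strongly typed view of the raw DataFrame: downstream code works with case classes.
    val transactions = raw
      .select("id", "customer_id", "amount")
      .toDF("id", "customerId", "amount")
      .as[Transaction]

    println(transactions.head())
    spark.stop()
  }
}
```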

Original meetup event: http://www.meetup.com/Alluxio/events/233453125/

Lessons learnt from building data-driven production systems at Barclays


Over the last few years at Barclays we learnt and tried a lot of things that made the Advanced Analytics team very successful inside a large organization, where being a productive data scientist is a tough challenge.

The data science team works on a mix of descriptive, predictive and prescriptive projects that make use of machine learning and big data technologies, mainly on top of Apache Spark. Even though we deliver on-request insights coming from manual analysis, we primarily build automated and scalable systems to be used periodically, either internally for better decision-making or customer-facing in the form of analytics services (e.g. via the web portal).
In this post series I want to share some of the best practices, tools, methodologies and workflows that we experimented with and the lessons learnt from them. I will skip a few aspects of machine learning systems, since I found those to be already well covered in other talks and articles; you can find the reference links at the end of this post.
Moreover, not all data-driven projects require a machine learning component, at least not at every stage. I would like to quote Peter Norvig from a recent article published at KDnuggets:

“Machine Learning development is like the raisins in a raisin bread: 1. You need the bread first 2. It’s just a few tiny raisins but without it you would just have plain bread.”

Please keep in mind that each scenario is different, thus there are no strict rules to advocate. Every data science team should come up with the workflow and stack that best suit their needs. Besides, they should be able to quickly adapt to the business and technical changes of their organization.

To conclude, I summarised the main take-home lessons of my experience at Barclays so far. I hope they will serve as a useful guideline or source of inspiration for all of those data science teams focusing on building production systems. Many of those best practices still apply to research-oriented teams that focus more on the prototyping of solutions. Our team is a mix of engineering and modelling backgrounds, thus defining a little bit of structure and common workflows helped us be collaborative and productive.

The goal was not to advocate a single methodology but to show other possible approaches that could fit well within your organization. We expect those practices to conflict amongst different teams. For example, in Xavier’s articles (see links below) he suggests doing all of the experiments in the notebook and using the same tools in production, while in our experience we found this to be chaotic and not scalable for our use cases. There is no golden rule: try different approaches and stick with the most successful ones for your use cases.

***

A related blog post of “How to do Data Science that is both Exploratory and Production Quality” can be found here: https://www.linkedin.com/pulse/how-do-data-science-both-exploratory-production-quality-harry-powell.

Similar articles:

Seven Steps to Success: Machine Learning in Practice https://daoudclarke.github.io/guide.pdf

http://technocalifornia.blogspot.co.uk/2014/12/ten-lessons-learned-from-building-real.html

And 10 additional, more recent lessons:

https://medium.com/@xamat/10-more-lessons-learned-from-building-real-life-ml-systems-part-i-b309cafc7b5e#.58g9wrnt4

 

The ScrumBan Jira board

This is the part 1 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Agile board

Let’s start with one of the core tools of the agile workflow. We use a Jira board for tracking and organizing all of our projects. We developed a custom board which uses the sprint concept from Scrum but in a more flexible way, as in Kanban.


 

The Scrumban board is configured as follows:

  • Horizontally divided in swimlanes (top-down in order of priority):
    • Critical / Blockers
    • Current work
    • Stories backlog
    • Sub-tickets backlog
    • Completed
  • The columns are:
    • To do
    • In progress
    • In review
    • Done / resolved
    • You can optionally have “Ready to release”
  • Quick filters should include at least one filter for each member of the team, filtering on their own assigned tickets.

The idea is that during the planning you select from the backlog which high-level stories you want to deliver by the end of the sprint (typically 2 weeks long) and then you create subtasks as you need them.
The reason is that in data science you don’t know beforehand what you are about to implement. You need to investigate, implement and test all the time, and as you do so you discover what to do next. What is important is that whatever subtask is created, it is done by the end of the sprint so that the story is completed.

Define stories with a clear goal and a small scope. They should not span multiple sprints, and since they come with uncertainty about which tasks will be required, you really need to break a big problem into smaller, well-defined problems that can be accomplished no matter what.

Avoid having tasks for exploratory analysis or for adding unit tests. Each task should bring some value, potentially a new feature. Each task will then require an exploratory analysis as well as some development and testing. Those steps are already part of the definition of “Done”. See the sections below for more explanations about tests and exploratory analysis.

Always plan less than your capacity. Delivering your stories a few days earlier is a very good sign. Delaying them is bad. If you manage to get your work done by Thursday, spend the whole of Friday in a pub celebrating your amazing delivery.

In Jira you must assign each story to one individual, but remember that in an agile team either the whole team succeeds or fails. If that person does not manage to finish their tasks on time, it is a team failure. That’s what the morning stand-up is for: to make sure everything is under control and team resources are allocated in a way that makes the sprint successful.

Never change the scope of your sprints or add tasks that were not planned, unless they are required hotfixes. If you are asked to do something else, invite the product owners to join your next sprint planning and only then allocate resources for them.
Remember the goal of a sprint is to have a working, even if simplistic, deliverable, not to solve sparse tasks.

At the end of the sprint have a retrospective meeting to discuss what went well and what did not. Make sure to take actions to prevent the same blockers from appearing again in the future.

Documentation

Documentation should be as simple as possible.

  • Release notes: a page where you note the major changes since the previous version, the list of new tickets that have been merged (linked to Jira) and a link to a more detailed report.
  • The detailed report contains snapshots of the most recent logs, results, observations, limitations, assumptions and performance of the model/ETL/application. Often it contains some charts that quickly explain how good the product is. We use those detailed but concise reports to track how the product is evolving. The detailed release report also contains the help messages explaining how to run the application and all of the command line interface (CLI) options.
    If all of your tests and procedures are fully automated then this page is simply a copy and paste of the results.
  • The usage of a particular job class or script, with the list of CLI arguments and default values, is also accessible using the --help argument; many libraries help you do that (bash getopts, Scala Scallop, …). A minimal sketch follows this list.
  • Other pages are used to explain the complex parts of the logic. Reserve those pages for cases where the logic is very complicated and hard to understand by just reading the code.
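
As a flavour of the --help point above, here is a minimal sketch of a self-documenting CLI using the Scallop library (assumed to be on the classpath); the job name, options and defaults are illustrative, not the real interface.

```scala
import org.rogach.scallop.ScallopConf

// Declarative CLI definition: every option carries its own description and default,
// so --help produces exactly the usage text the release report can copy verbatim.
class JobConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  banner("Usage: spark-submit ... --class ExampleJob app.jar [OPTIONS]")
  val input  = opt[String](required = true, descr = "Path of the raw input data")
  val output = opt[String](required = true, descr = "Path where results are written")
  val limit  = opt[Int](default = Some(1000), descr = "Maximum number of records to process")
  verify()
}

object ExampleJob {
  def main(args: Array[String]): Unit = {
    // Running with --help prints the banner plus every option, description and default.
    val conf = new JobConf(args)
    println(s"input=${conf.input()}, output=${conf.output()}, limit=${conf.limit()}")
  }
}
```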

Documentation is hard to keep in sync; that’s why we document what’s new since the last release rather than going through the whole wiki and updating every single page.

Ideally the documentation comes from the source code, unit tests and Jira tickets. Individual analyses, findings and insights can be documented separately, but they should represent static reports rather than project documentation.

In the hierarchical structure of the pages, we limit the maximum depth to 2, which means we have root-level pages with at most one level of child pages. Nested structures make it very hard to find content when you need it.

Branching and versioning

Code should always and only exist in a git repository. Sparse snippets or random script files should be avoided.

We follow the gitflow branching model where each ticket is mapped to a feature branch. If you integrate Jira with Stash then from the ticket web page you can automatically create the corresponding branch in the repository, using develop as the base branch.

You do not need to use the complete gitflow branching model, but you should have at least the master, develop and feature branches. It’s up to the deployment strategy to define how to handle hotfix, bugfix and release branches. Make sure this strategy is clearly defined and consistently enforced. See deployment.

Story tickets generally don’t have an associated branch; their sub-tasks do.

Install a git hook so that every commit message includes the ticket code as a prefix (which you can parse out of the branch name). Tracking each commit against the corresponding ticket is a life-saver when, in the future, you try to reverse engineer what a method is doing and why it was created in the first place. You can then go through the whole git history and access the corresponding tickets that touched that piece of code.

Discussions

Discussions of specific tasks should go into the corresponding Jira ticket page. This makes the conversation public and tracked, and anyone can jump into the discussion with the full context available. Reference files or supporting documents should also be attached to the Jira ticket itself, or to the wiki if they serve a general purpose. Remember that each Jira ticket can be linked from the release wiki pages, which means we never lose track of them. Moreover, the query engine is quite good.

We found emails to be the worst place for discussions to happen, especially for sharing files that will soon become out-of-date.

When someone sends you an Excel file, reply saying that your laptop does not have an Office installation on it. If you are sharing small data files, TSV or JSON is the way to go.
Avoid comma-separated files with quotes wrapping text fields. You want your files to be editable using simple bash commands rather than having to load them into a CSV parsing library.

We also tried mounted shared drives, but Confluence is a much better collaborative way to share and organize files, with integrated version control and metadata.

Avoid meetings as much as you can: invent some excuse, ask for a clear agenda beforehand. Educate your colleagues to communicate with you by raising issues. Reserve meetings for important discussions, and spend your meeting time presenting to and checkpointing with your stakeholders more frequently.

 

 

Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are:

  1. Data stored in non-scalable infrastructure for analysis and processing
  2. Data governance and security policies

1. Data often resides in central data warehouses and RDBMSs on which many legacy applications and analysts depend.
Data scientists, instead, cannot build their models or perform exploratory analysis using SQL queries alone. They need the data to be available in a scalable, programmatic and reactive stack such as Hadoop and Apache Spark, and to develop their logic using languages such as Python, R, Scala… (for a comparison of Python and Scala for Spark, see this post: 6 points to compare Python and Scala for Data Science using Apache Spark).
2. Nevertheless, data cannot just be transferred (in technical terms, sqoop-ed) to a Hadoop cluster without incurring tedious bureaucracy, ingestion inconsistencies and strict policies. In big corporations that translates to at least a month to decide which tables are interesting and a few more months to write the ETL logic, move the data and test the consistency.

At Barclays we developed a stack to logically map the raw data from the central data warehouse into Spark and use Tachyon to save the data in memory for long-term availability. With this stack we are able to iterate fast, with immediate data availability from a scalable Big Data cluster, by skipping the data ingestion process while still complying with all of the data policies.

Tachyon was the key enabling technology for us.

Our workflow iteration time decreased from hours to seconds. Tachyon enabled something that was impossible before.
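
As an illustration of this workflow, here is a minimal sketch of the in-memory checkpoint: pull a table once from the warehouse over JDBC, persist it as Parquet on an Alluxio (formerly Tachyon) path, and let every later job re-read it from memory instead of the RDBMS. The master address, paths and JDBC details are placeholders, and the sketch assumes the Alluxio client is on the Spark classpath; the original workflow also mentions Kryo-serialized objects as an alternative format.

```scala
import org.apache.spark.sql.SparkSession

object AlluxioCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alluxio-checkpoint").getOrCreate()

    // Placeholder Alluxio path; requires the Alluxio client jar on the classpath.
    val snapshotPath = "alluxio://alluxio-master:19998/datasets/transactions.parquet"

    // One-off (slow) load from the source, e.g. the JDBC read sketched earlier.
    val fromSource = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//warehouse-host:1521/DWH") // placeholder
      .option("dbtable", "transactions")
      .option("user", sys.env.getOrElse("DB_USER", "readonly"))
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // Write once into Alluxio memory; every subsequent read starts from this copy
    // instead of going back to the warehouse, which is what turns hours into seconds.
    fromSource.write.mode("overwrite").parquet(snapshotPath)

    // Any other Spark application can now start directly from the in-memory snapshot.
    val cached = spark.read.parquet(snapshotPath)
    println(s"Rows available in Alluxio: ${cached.count()}")

    spark.stop()
  }
}
```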

You can find the original article published on DZone in collaboration with Gene Pang, Software Engineer at Tachyon Nexus and Haoyuan Li, CEO of Tachyon Nexus:
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

At the Advanced Data Analytics team at Barclays we solved a Kaggle competition as a proof-of-concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is recommending a sequence of WordPress blog posts that users may like, based on their historical likes and blog/post/author characteristics.
Details of the competition are available at https://www.kaggle.com/c/predict-wordpress-likes.

What we want to share is a mix of methodology and tools for:

  • Investigating the data interactively;
  • Writing quality code in a productive environment;
  • Embedding the developed functions into executable entry points;
  • Presenting the results in a clean and visual way; and
  • Meeting the required acceptance criteria.

AKA: delivering a Data Science MVP quickly in a completely Agile way!

The topics covered in this workshop are:

  • DataFrame/RDD conversions and I/O
  • Exploratory Data Analysis (EDA)
  • Scalable Feature Engineering
  • Modelling (MLlib and ML)
  • End-to-end Evaluation
  • Agile Methodology for Data Science

At the end of the workshop the lessons learnt are:

  • Spark, DataFrames, RDDs:
    • DataFrame is great for I/O, schema inference from the sources and when you have flat schemas. Operations start to become more complicated with nested and array fields.
    • RDDs give you the flexibility of doing your ETL using the richness of the Scala framework; on the other hand, you must be careful about optimizing your execution plans.
      Functional programming allowed us to express complex logic with simple and clear code, free of side effects.
    • Map-side joins with broadcast maps are very efficient, but you need to make sure the map is reduced to the minimum size before broadcasting it, e.g. by filtering out the unmatched keys before the join or by capping the size of each value in the case of variable-size structures (e.g. hash maps). A minimal sketch follows this list.

  • ML, MLlib
    • ETL and feature engineering is the most time-consuming part; once you have obtained the data you want in vector format, you can convert back to DataFrame and use the ML APIs.
    • ML unfortunately does not wrap everything available in MLlib; sometimes you have to convert back to RDD[LabeledPoint] or RDD[(Double, Vector)] in order to use the MLlib features (e.g. evaluation metrics).
    • The ML pipeline API (Transformer, Estimator, Evaluator) seems cool, but for an MVP it is a premature abstraction.
  • Modelling
    • Do not underestimate simple solutions. In the worst case they serve as a baseline for benchmarking.
    • Even though the Logistic Regression was better at classifying true or false, the simple model outperformed it when running the end-to-end ranking evaluation.
    • Focus on solving problems rather than on models or algorithms.
      Many Data Science problems can be solved with counts and divisions, e.g. Naïve Bayes.
    • Logistic Regression “raw scores” are NOT probabilities, treat them carefully!
  • Spark Notebook
    • The Spark Notebook is good for EDA and as an entry point for calling APIs and presenting results.
    • Developing in the notebook is not very productive: the more code you write, the harder it becomes to track and refactor previously developed code.
    • It is better to write code in IntelliJ and then either pack it into a fat jar and import it from the notebook, or copy and paste it every time into a dedicated notebook cell.
    • In order to keep normal notebook cells clean, they should not contain more than 4/5 lines of code or complex logic; ideally they should just contain queries in the form of functional processing and entry points into a logic API.
  • Visualization
    • Plotting in the notebook with the built-in visualization is handy but very rudimentary: it can only visualize 25 points, so we created a Pimp to take any Array[(Double, Double)] and interpolate its values down to 25 points.
    • Tip: when you visualize a Scala Map with Double keys in the range 0.0 to 1.0, the take(25) method will already return uniform samples in that range, and since the x-axis is numerical, the built-in visualization will automatically sort it for you.
    • Probably we should have investigated advanced libraries like Bokeh or D3, which are already supported in the Notebook.
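
Here is a minimal sketch of the map-side join with a broadcast map mentioned above; the dataset names and fields are illustrative, not the actual competition schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastMapJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-map-join").setMaster("local[*]"))

    // Large side: (blogId, postId) pairs, too big to shuffle cheaply.
    val posts = sc.parallelize(Seq((1L, "post-a"), (2L, "post-b"), (42L, "post-c")))

    // Small side: blog metadata. Trim it before broadcasting (drop keys that can
    // never match, cap variable-size values) so the broadcast stays as small as possible.
    val blogCategories = Map(1L -> "tech", 2L -> "food", 3L -> "travel")
    val broadcastMap = sc.broadcast(blogCategories)

    // Map-side join: each partition looks the key up in the broadcast map, so the
    // large RDD is never shuffled; unmatched keys (42L here) are simply dropped.
    val joined = posts.flatMap { case (blogId, postId) =>
      broadcastMap.value.get(blogId).map(category => (postId, category))
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}
```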

Check the source code on the GitHub page: https://github.com/gm-spacagna/wordpress-posts-recommender.

The complete 18 steps to start a new Agile Data Science project

Introduction

It is a very common pattern in software development to start a new project in a highly uncertain and chaotic scenario, surrounded by plenty of ideas of what features we might want to implement. In Data Science the problem is amplified even more by its nondeterministic nature. At the start of a Data Science project we not only don’t know what we are trying to implement, we also don’t know how to implement it, nor under which circumstances that would be possible and correct.

This initial lack of structure often manifests itself as an initial spike of unnecessary development and, later in the project, in the form of technical debt and unexplained inconsistencies. You might spend a lot of resources before finding out that the delivered solution simply does not fit the business nature of the problem.

In Agile Data Science the goal should not be producing charts and reports or hacky scripts calling some machine learning library. In Agile Data Science we want to iteratively build production-quality applications that solve the true business needs by extracting hidden knowledge from the data.

This is the final, summarising post of the Agile Data Science Iteration 0 series.

The Complete Checklist

  1. Rigorous definition of the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather initial dataset into proper infrastructure
  6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
  7. Define and quickly develop the simplest solution to the problem
  8. Release/demo first basic solution
  9. Research of background for ways to improve the basic solution
  10. Gather additional data into proper infrastructure (if required)
  11. Ad-hoc Exploratory Data Analysis (EDA)
  12. Propose better solution minimising potential risks and marginal gain
  13. Develop the Data Sanity check
  14. Define the Data Types of your application domain
  15. Develop the ETL and output the normalised data into a proper infrastructure
  16. Clearly state all of the assumptions/hypothesis and document whether they have been verified or not and how they can be verified
  17. Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data
  18. Analyse the output of the automated HDA to adjust/revise the proposed solution

At the end of Iteration 0 you have a very solid starting point for your project and you can now follow the typical Agile development cycle, whether you prefer Scrum, Kanban, a mix of the two or your own ad-hoc custom methodology.

Regardless of whether you want to use a strict or flexible workflow, keep in mind that the main difference from the Agile iterations of software development is that a ticket is typically broad and open-ended. You should not be surprised if the majority of your tickets get split into multiple sub-tickets after the initial investigation of the problem. You should allow subtasks to be created even after the sprint planning. In some cases you may prefer to mark them as blockers and re-scope them into the next sprint; in other cases you may want to allow them to affect the current sprint.
What is important is that you should start implementing production-quality code only when the requirements and the acceptance test are well defined. In Data Science this is not very likely to happen all the time. Every time you are presented with an open problem to investigate and solve, you should try to break it into research/analysis and development subtasks.

What not to do?

  • Do not start any development without having done prior detailed research/investigation
  • Do not just deliver analysis code in notebooks; after your investigation, move the code to production-quality standards
  • Do not blindly trust external libraries or APIs if you don’t know exactly what they do and return; run some tests if needed
  • Do not generate manual reports of your findings until the experiments are reproducible and automated
  • Do not deploy any model if not all of the assumptions have been stated and verified
  • Do not be too lazy to learn better technologies and methodologies!

To conclude, in this series of posts I just wanted to share some of my experience of starting new Data Science projects and common problems that I have seen addressed in a confused and chaotic way. I hope that by following these guidelines you can reduce the technical debt of the project and the risk of working for several months without ever delivering a correct and working solution.

More details of the Agile cycle for Data Science applications, and in particular how to time-box open-ended questions, will be covered in another post. Stay tuned and get ready to run!

***


Agile Data Science Iteration 0: The Hypothesis-Driven Analysis

This is the fifth post of the Agile Data Science Iteration 0 series.

Previously

What we have achieved so far (see previous posts above):

  1. Rigorous definition of the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather initial dataset into proper infrastructure
  6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
  7. Define and quickly develop the simplest solution to the problem
  8. Release/demo first basic solution
  9. Research of background for ways to improve the basic solution
  10. Gather additional data into proper infrastructure (if required)
  11. Ad-hoc Exploratory Data Analysis (EDA)
  12. Propose better solution minimising potential risks and marginal gain
  13. Develop the Data Sanity check
  14. Define the Data Types of your application domain
  15. Develop the ETL and output the normalised data into a proper infrastructure

At this stage you have already modelled some entities of your application logic. You know the raw data well and have already produced a normalised and cleaned version of your dataset. Your data is now sanitised and stored in a proper analytical infrastructure. Ask yourself: what assumptions have I made so far, and what assumptions am I going to make? Agile Data Science, even though it is production and engineering oriented, is not just software engineering. Agile Data Science is Science, thus it must comply with the scientific method.

The Oxford dictionary defines “scientific method” as:

“a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.”

And this is by no means different in the Data Science methodology.

The Hypothesis-Driven Analysis

16. Clearly state all of the assumptions/hypothesis and document whether they have been verified or not and how they can be verified

In Data Science we implement models and applications in a highly non-deterministic context where we often make assumptions to simplify the problem. Assumptions are generally made based on intuition, common sense, previous experience, domain knowledge or sometimes simply because the model requires them.

Even though they might seem appropriate, they are dangerous! Unverified assumptions can easily lead to inconsistencies or, even worse, silently produce wrong results.

We can’t get rid of all of our assumptions and build an assumption-free model, but we should try to document them, verify them as soon as possible and track them over time. It is fine to have not-yet-fully-verified assumptions at this early stage, but they should not be forgotten and their verification should be planned in the immediately following iterations.

Every time we present any result we should clearly state all of the assumptions that have been made and whether or not they have been verified.

17. Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data

What if the underlying data set or the observed environment has changed? Are our hypotheses still valid?
It is extremely important to develop an automated framework for running tests and experiments to validate all of the existing hypotheses.
We cannot achieve confidence in our deliverables if we are not sure that our hypotheses are correct, and if anything has changed we must be able to find out immediately.

Yet, it is often hard to have tests with a boolean outcome: Success or Failure. It is good practice, though, to have at least an automated job that calculates some key descriptive statistics that can help us understand the underlying dataset and guide the validation of our hypotheses. Think carefully about which measures you would need in order to understand whether the proposed solution makes sense or not.
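
A minimal sketch of what such an automated Hypothesis-Driven Analysis job could look like is shown below: each hypothesis is a named boolean check against the normalised data, followed by a descriptive statistics summary. Column names and thresholds are illustrative placeholders, not the actual checks used at Barclays.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// One named, boolean-outcome check against the normalised dataset.
case class Hypothesis(name: String, check: DataFrame => Boolean)

object HypothesisDrivenAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hda").getOrCreate()
    val data = spark.read.parquet(args(0)) // normalised dataset produced by the ETL

    // Illustrative hypotheses; real ones come from the assumptions documented in step 16.
    val hypotheses = Seq(
      Hypothesis("amounts are non-negative",
        df => df.filter(col("amount") < 0).count() == 0),
      Hypothesis("less than 1% of customer ids are missing",
        df => df.filter(col("customerId").isNull).count() < 0.01 * df.count())
    )

    // Boolean outcome for each hypothesis: Success or Failure.
    hypotheses.foreach { h =>
      val passed = h.check(data)
      println(s"[${if (passed) "PASSED" else "FAILED"}] ${h.name}")
    }

    // Descriptive statistics summary to guide the validation of softer assumptions.
    data.describe("amount").show()

    spark.stop()
  }
}
```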

18. Analyse the output of the automated HDA to adjust/revise the proposed solution

The output of your HDA framework is your best friend for helping you go back and make the first changes to the proposed solution. You want to account for what the real phenomena are, regardless of what your original thoughts were.
If you manage to get all of your hypotheses right at the first shot, think twice!

Now you have a very detailed picture of what your solution proposal is and what all of the requirements are. You have gained a deep understanding of every detail you will need during the development and evaluation of your model. You have already built all of the tools to support you in that. You can feel safe to try out whatever you want, because you know that your tests will check its validity. You have now reduced the risks on this project to a minimum before even starting to implement the first line of code for your model.

Align with your stakeholders and product owners and define the initial roadmap and expectations you want to meet for the first MVP.

***

A summary of the complete “Agile Data Science Iteration 0” series will be published soon, stay tuned.
Meanwhile, why not share or comment below?


Agile Data Science Iteration 0: The Evaluation Strategy

It is a very common pattern in software development to start a new project in a highly uncertain and chaotic scenario, surrounded by plenty of ideas of what features we might want to implement. In Data Science the problem is amplified even more by its nondeterministic nature. At the start of a Data Science project we not only don’t know what we are trying to implement, we also don’t know how to implement it, nor under which circumstances that would be possible and correct.

This initial lack of structure often manifests itself as an initial spike of unnecessary development and, later in the project, in the form of technical debt and unexplained inconsistencies. You might spend a lot of resources before finding out that the delivered solution simply does not fit the business nature of the problem.

In Agile Data Science the goal should not be producing charts and reports or hacky scripts calling some machine learning library. In Agile Data Science we want to iteratively build production-quality applications that solve the true business needs by extracting hidden knowledge from the data.

In this series of posts I would like to share some best practices and procedures for starting a novel Data Science project from scratch in a safe and efficient way. The preparation steps consist of a list of tasks to ideally perform before writing the first line of code for your model.

The main steps to cover, in order, are those summarised in the Complete Checklist above.

Even though it might look like old-school Waterfall planning, I don’t aim to propose a strict checklist to perform in the declared order. I would rather like to share what I have learned in my experience as a Data Scientist and what I have seen other colleagues doing. You might argue that having a list of 20 predefined to-do tasks is not Agile, and I would generally agree with that. Nevertheless, I do personally think that all of them should be addressed (regardless of the order) as a preparation for the Agile development cycle, specifically for a Data Science application.

The term “Iteration 0” is meant to refer to that preparation stage and does not necessarily mean that all of the steps should happen in a single iteration. By following an Agile approach, we will probably split them into a few iterations and very likely incrementally revise/refactor some of them.

The Business Problem

1. Rigorous definition of the business problem we are attempting to solve and why it is important

A clear statement should answer at least, but not only, the following:

  • What is the basic need
  • What are the expectations
  • How the business will benefit from it
  • What is the current state of the product we are trying to improve
  • What are the requirements/constraints
  • What will be the form of the final deliverable
  • How that deliverable will be able to integrate with the business and drive human actions

The QA Strategy

2. Define your objective acceptance criteria

The Acceptance Test should always be one of the vital parts of any project. Without an acceptance test you cannot prove whether your solution is correct, you cannot compare different solutions, and thus you cannot iterate fast and safely. Acceptance tests are your only way of preventing mistakes and avoiding delivering solutions that do not actually meet the true business requirements. And this is by no means different in Data Science.

If you can’t define it, you can’t measure it. If it can’t be measured, it shouldn’t be reported. Define, then measure, then report.

— Bryan Hudson

Acceptance testing is about considering any possible implementation as a black box and running an unbiased validation test that sets some minimum expectations on the model, according to measures that quantify the real business value.

Acceptance tests do not necessarily have to evaluate the accuracy of a Machine Learning model. They should cover the whole end-to-end workflow, from the data cleansing to the final deliverable.

Example:
Let’s suppose you are implementing a model for resolving and linking entities from one dataset to a reference one. Suppose that internally you will be implementing a predictive model for inferring a missing feature from the N dimensions that represent your entities in your domain space. You may want to evaluate the accuracy of your predictions using cross-validation tests, confusion matrices, ROC curves and so on. But this evaluation would only refer to your particular implementation and would only test a subcomponent of the final solution. Instead, you want to ask yourself: “why does the business care about doing this entity resolution and linkage?”. Maybe the answer is “we need to normalise our dataset to be able to deduplicate and increase the number of patterns detected from our current product”. Thus, you want to evaluate the following measures instead:

  • Growth in the number of patterns recognised before and after the normalisation
  • Percentage of unlinked records (a minimal sketch follows this list)
  • Confidence intervals of the linked records
  • Stability of your model when adding/removing samples
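
As an example of one of these end-to-end measures, here is a minimal sketch of computing the percentage of unlinked records from the output of any candidate implementation, treated as a black box. The schema (a nullable linkedId column) is an illustrative assumption.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LinkageEvaluation {
  // Share of output records that were left without a link to the reference dataset.
  def percentageUnlinked(linked: DataFrame): Double = {
    val total = linked.count().toDouble
    val unlinked = linked.filter(col("linkedId").isNull).count().toDouble
    if (total == 0) 0.0 else 100.0 * unlinked / total
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("linkage-evaluation").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy output of a candidate implementation: record id plus optional linked reference id.
    val output = Seq(("rec-1", Some("ref-9")), ("rec-2", None), ("rec-3", Some("ref-2")))
      .toDF("recordId", "linkedId")

    println(f"Unlinked records: ${percentageUnlinked(output)}%.1f%%")
    spark.stop()
  }
}
```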

Obviously, having a high AUC for the predictive model used internally will probably improve the overall quality, but that is not the final goal you want to achieve and evaluate at this stage.

A Machine Learning specialist would take a single metric or objective function and try to maximise its performance. A Data Scientist would take the existing product or business case and try to improve it as a whole.

There is no single measure that can evaluate the quality of a given solution, but only a pool of different and possibly independent measures. AUC alone means nothing: what is it that you want to achieve? Are false positives and false negatives equally important? Are you setting minimum requirements in both dimensions? As a Data Scientist you must understand the expectations you must meet before a model can safely go into production without disappointing your users/customers.

3. Develop the validation framework (ergo, the acceptance test)

A few rules here:

  • An Acceptance Test should always have at least a binary outcome: Passed or Failed.
  • Don’t let yourself or anyone else cheat! Build a robust and unbiased test that cannot be worked around, regardless of who will be the person implementing the solution. Consider how tempted you could be to start hacking when under pressure and rushing to deliver a finished, semi-working product.
  • The validation framework should be able to compare measures related to different implementations. An accuracy of 75% does not tell me much. An accuracy of 75%, given that the current model only achieves 58%, quantifies the added value.
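
A minimal sketch of these rules, assuming illustrative metric names and thresholds: a binary-outcome acceptance test that compares a candidate implementation against the currently deployed baseline on the same held-out evaluation set.

```scala
object AcceptanceTest {
  // Illustrative evaluation summary; real projects would use the business measures above.
  final case class EvaluationResult(accuracy: Double, falsePositiveRate: Double)

  def accept(candidate: EvaluationResult, baseline: EvaluationResult): Boolean = {
    // Passed only if the candidate beats the baseline AND meets an absolute
    // requirement on false positives; otherwise the whole release Fails.
    candidate.accuracy > baseline.accuracy && candidate.falsePositiveRate <= 0.05
  }

  def main(args: Array[String]): Unit = {
    // Placeholder numbers echoing the 58% baseline vs 75% candidate example above.
    val baseline  = EvaluationResult(accuracy = 0.58, falsePositiveRate = 0.04)
    val candidate = EvaluationResult(accuracy = 0.75, falsePositiveRate = 0.03)

    val passed = accept(candidate, baseline)
    println(if (passed) "ACCEPTANCE TEST PASSED" else "ACCEPTANCE TEST FAILED")
    // A non-zero exit code lets a CI job or scheduler treat the failure as a blocker.
    if (!passed) sys.exit(1)
  }
}
```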

Now you have a clear statement of what problem you are trying to solve and how you can objectively measure the quality of any possible solution, regardless of what the final implementation will be. Your stakeholders won’t need to understand the technical details or how your model works, because all they will have to understand are the evaluation results.

Please stay tuned for the next post of the “Agile Data Science Iteration 0” series regarding “The initial investigation”…