This is part 4 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.
Get the IT team very close to the data scientists. Ideally, one member of the team should be a DevOps engineer, or to use the newer title, a DataOps engineer. Read this article from InfoWorld: “DevOps can take data science to the next level”.
Finding the right balance between IT workarounds and clean solutions is difficult, especially when it involves long, tedious processes. It is good practice to “sign” contracts with the IT team covering what you are about to deliver and what you need from them in order to do so.
As general advice, you want to operate in a familiar environment where all of the tools you like and proper cluster resources are available. Unfortunately, data is always fragmented across multiple systems. Try to get the data periodically ingested into your Data Lake (typically a Hadoop cluster). When this is not possible, make sure you have the permissions to sqoop it yourself. Data virtualization technologies also come in particularly handy for creating a view of a dataset inside your Big Data environment.
Don’t implement solutions that are tied to the underlying infrastructure. The Spark DataFrame API, for example, does an excellent job of abstracting away the I/O operations. See this blog post on how to logically map tables from a relational database into a Spark cluster: https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon.
Request admin rights on a dev cluster: it will massively improve productivity and let the team master their Unix skills. Trust and transparency are essential. Security should be enforced during the interview process, by hiring competent and smart people, and through staff training, instead of killing productivity with nonsensical restrictions.
A data product is, at the end of the day, a piece of software that takes data as input and produces data as output. That output contains insights that are consumed either via a visual dashboard or by integrating them into an existing IT system.
A few options we recommend for releasing are:
- Continuous delivery. Ideally one pull request per project per day.
- Continuous integration. Ideally, a Jenkins box runs your tests and automated scripts every single time a ticket is merged into develop, and especially every single time a new release is made in master. If the box can access a data cluster, it can even run the end-to-end evaluation and store the results for you.
- The end of every sprint should be matched with a new release, consisting of:
  - Merging the develop branch back into master (either manually or through an automated script such as the gitflow command line).
  - Publishing your package, containing source code and scripts, to a common repository like Nexus.
  - Reporting the latest results in Confluence (see documentation section).
  - Releasing all of the merged tickets in Jira so that they don’t show up on the board but are still accessible for reference.
  - Demo-ing inside the team and/or to your stakeholders if the changes are relevant.
  - Celebrating in a pub.
It would make sense to plan the release for the afternoon of the last day of the sprint (typically Friday), but sometimes it might be advantageous to release on Thursday so that you have Friday for hot-fixes if something goes wrong.
It is very hard to give guidelines here, since each project has its own deployment process, which depends on many factors such as the business context and the practical issues associated with it.
If your application is deployed end-to-end by external teams, and you have no control over the workflow and data sources they are using, you will find it extremely helpful to have some Data Sanity checks performed at every single run. Those checks make sure that the people running your application don’t accidentally input data that does not conform to the schema and/or the model assumptions. Throwing an exception with some context information is fundamental to making your system production-ready.
A typical example is validating the values of categorical fields. We packed into our jars the reference files containing all of the possible values and their descriptions. If the specified dataset contains values that do not match any reference value, the data sanity check throws an exception.
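A check like this can be sketched in a few lines. The snippet below is only an illustration, written in plain Python rather than in the language of the original system, and every name in it is an assumption made for the example: the `DataSanityError` class, the `value,description` layout of the reference file, the `account_type` field and the dataset name.

```python
import csv
import io


class DataSanityError(ValueError):
    """Raised when input data violates a schema or model assumption."""


def load_reference_values(lines):
    """Read allowed values from CSV lines shaped as `value,description`."""
    return {row["value"]: row["description"] for row in csv.DictReader(lines)}


def check_categorical(records, field, allowed, dataset_name):
    """Raise DataSanityError if any record holds a value outside `allowed`."""
    unknown = {}
    for row_number, record in enumerate(records, start=1):
        value = record.get(field)
        if value not in allowed:
            # Remember the first row where each unknown value appears.
            unknown.setdefault(value, row_number)
    if unknown:
        details = ", ".join(
            f"{value!r} (first seen at row {row})" for value, row in unknown.items()
        )
        # Include enough context for whoever runs the job to locate the problem.
        raise DataSanityError(
            f"Dataset '{dataset_name}': field '{field}' contains values with no "
            f"match in the reference file: {details}. "
            f"Allowed values: {sorted(allowed)}"
        )


# Hypothetical reference file contents, inlined here to keep the sketch self-contained.
REFERENCE_CSV = """value,description
CUR,Current account
SAV,Savings account
"""

allowed_account_types = load_reference_values(io.StringIO(REFERENCE_CSV))

good = [{"account_type": "CUR"}, {"account_type": "SAV"}]
bad = [{"account_type": "CUR"}, {"account_type": "XXX"}]

# Conforming data passes silently; `bad` would raise DataSanityError.
check_categorical(good, "account_type", allowed_account_types, "accounts")
```

Collecting every unknown value before raising, rather than failing on the first one, means a single failed run tells the external team everything they need to fix.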
This handling of incorrect data may be done during the ETL process, and is generally not needed if the training is done by the data science team itself. In the latter case, the deployment only concerns the trained model.
Deployment is the stage with the highest number of blockers and technical issues. The final measure of success, after all, is only determined upon deployment in production, so deployment issues should be top priority.