Demystifying Data Science in the industry

On June 7th I gave a quick introductory talk at AssoLombarda in Milan about the role of the Data Scientist in the 4th industrial revolution.

My presentation is an introduction to what Data Science in the industry is and what it is not.

If you would like to know more about the Data Science Milan community, visit www.datasciencemilan.org.

In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio

Abstract:

Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called "Data Lake", where raw data is stored and periodically ingested into a distributed file system such as HDFS.

Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, for example because of regulations or to reduce latency. In such cases Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.

We propose an Agile workflow by combining Spark, Scala, DataFrame (and the recent DataSet API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop high quality machine learning pipelines that can then be deployed straight into production.
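
As a rough sketch of what this stack can look like in code (using the modern SparkSession API for brevity; the connection string, table name and Alluxio address below are placeholders invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

object LogicalWarehouseSketch {

  // Hypothetical record type; in practice this mirrors the source table schema.
  case class Transaction(customerId: Long, merchant: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("logical-dwh").getOrCreate()
    import spark.implicits._

    // Load the raw table straight from the RDBMS over JDBC, no ingestion step.
    // The appropriate JDBC driver must be on the classpath.
    val raw = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dwh-host:1521/DWH") // placeholder connection string
      .option("dbtable", "TRANSACTIONS")                      // placeholder table name
      .load()

    // Work with a typed Dataset instead of untyped rows.
    val transactions = raw
      .select($"CUSTOMER_ID".as("customerId"), $"MERCHANT".as("merchant"), $"AMOUNT".as("amount"))
      .as[Transaction]

    // Persist once as Parquet on Alluxio so later iterations reload in seconds.
    transactions.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/dwh/transactions")

    // Subsequent runs read from memory-speed storage instead of hitting the warehouse again.
    val cached = spark.read.parquet("alluxio://alluxio-master:19998/dwh/transactions").as[Transaction]
    println(cached.count())
  }
}
```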

In this talk we will:

* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet

* Explain the iterative exploratory process and advantages of adopting functional programming

* Give a critical analysis of the issues faced with the existing methodology

* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds

* Discuss some future improvements to the overall architecture

Original meetup event: http://www.meetup.com/Alluxio/events/233453125/

The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour

From the Data Science Milan meetup event:

In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping-behaviour data to make personalised recommendations, in a Kaggle-like competition where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:

• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.

• The benefits of doing type-safe ETL, representing data in hybrid, and possibly nested, structures such as case classes (see the sketch after this list).

• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.

• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.

• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).

• How Scala (and functional programming) helped our cause.
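
To give a flavour of the type-safe ETL point above, here is a minimal, hypothetical sketch; the case classes and field names are invented for illustration and are not the actual schema used in the hackathon:

```scala
// Hybrid, possibly nested, domain types instead of untyped rows.
case class Merchant(name: String, category: String)
case class Purchase(customerId: Long, merchant: Merchant, amount: BigDecimal)

// A type-safe transformation: the compiler rejects a misspelled field
// or a wrong type before the job ever runs on the cluster.
def spendPerCategory(purchases: Seq[Purchase]): Map[String, BigDecimal] =
  purchases
    .groupBy(_.merchant.category)
    .map { case (category, ps) => category -> ps.map(_.amount).sum }
```

The same case classes travel unchanged from a local collection into a Spark RDD or Dataset, which is what makes the laptop-to-cluster workflow possible.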

Surfing and Coding in Lanzarote, the Barclays Data Science hackathon

This post was published on the Cloudera blog and summarises the results and takeaways of a week-long hackathon that took place in Lanzarote in December 2015. The goal was to prototype a recommender system for retail customers of shops in Bristol, UK. The article shows how the stack composed of Scala and Spark was great for quickly writing prototype code that runs locally on a single laptop and, at the same time, scales to larger datasets processed on the cluster.


Please continue reading at http://blog.cloudera.com/blog/2016/05/the-barclays-data-science-hackathon-using-apache-spark-and-scala-for-rapid-prototyping/.

Lessons learnt from building data-driven production systems at Barclays


Over the last few years at Barclays we learnt and tried a lot of things that made the Advanced Analytics team very successful inside a large organization, where being a productive data scientist is a tough challenge.

The data science team works on a mix of descriptive, predictive and prescriptive projects that make use of machine learning and big data technologies, mainly on top of Apache Spark. Even though we deliver per-request insights coming from manual analysis, we primarily build automated and scalable systems to be used periodically, either internally for better decision-making or customer-facing in the form of analytics services (e.g. via the web portal).
In this post series I want to share some of the best practices, tools, methodologies and workflows that we experimented with and the lessons learnt from them. I will skip a few aspects of machine learning systems, since I found those to be already well covered in other talks and articles; you can find the reference links at the end of this post.
Moreover, not all data-driven projects require a machine learning component, at least not at every stage. I would like to quote Peter Norvig from a recent article published at KDnuggets:

“Machine Learning development is like the raisins in a raisin bread: 1. You need the bread first 2. It’s just a few tiny raisins but without it you would just have plain bread.”

Please keep in mind that each scenario is different, thus there are no strict rules to advocate. Every data science team should come up with the workflow and stack that best suit their needs. Besides, they should be able to quickly adapt to the business and technical changes of their organization.

To conclude, I have summarised the main take-home lessons of my experience at Barclays so far. I hope they will serve as a useful guideline or source of inspiration for data science teams focusing on building production systems. Many of these best practices still apply to research-oriented teams that focus more on prototyping solutions. Our team is a mix of engineering and modelling backgrounds, thus defining a little bit of structure and common workflows helped us be collaborative and productive.

The goal was not to advocate a single methodology but to show other possible approaches that could fit well within your organization. We expect these practices to conflict amongst different teams. For example, in Xavier's articles (see links below) he suggests doing all of the experiments using the notebook and using the same tools in production, while in our experience we found this to be chaotic and non-scalable for our use cases. There is no golden rule: try different approaches and stick with the most successful ones for your use cases.

***

A related blog post of “How to do Data Science that is both Exploratory and Production Quality” can be found here: https://www.linkedin.com/pulse/how-do-data-science-both-exploratory-production-quality-harry-powell.

Similar articles:

Seven Steps to Success Machine Learning in Practice https://daoudclarke.github.io/guide.pdf

http://technocalifornia.blogspot.co.uk/2014/12/ten-lessons-learned-from-building-real.html

And more recent additional 10 lessons:

https://medium.com/@xamat/10-more-lessons-learned-from-building-real-life-ml-systems-part-i-b309cafc7b5e#.58g9wrnt4


The balance of exploratory analysis and development

This is the part 3 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Exploratory Data Analysis / Research


Exploratory analysis should precede and follow any task, from modelling, design and development to benchmarking. The major problem is: how do you share, track and monitor your findings? How do you make your analysis repeatable and open to scrutiny from the outside? This is still an open problem.

Notebooks tend to be the best tools for the job, but be careful. EDA is an open research/investigation task, thus you need a criterion for when to stop. My suggestion is to avoid scope-free analysis and always accompany EDA with a well-defined goal/task. In a small and clearly defined task you know what you want to achieve; you just don't know how to get there and what obstacles you may find.

A proposed sequential but iterative workflow is:

  1. The planned story defines the high-level goal you are working towards.
  2. Start your EDA and, as soon as you find something interesting, stop investigating and define a development sub-ticket.
  3. Now develop the minimum amount of code that implements the requirements specified during the analysis step.
    Those requirements should not change after the first definition; complete them first and refactor later in another sub-ticket or in a different iteration.
  4. Before sending it for review and/or resolving the sub-ticket, perform another EDA step to verify that the newly created branch meets the intended requirements. You are not solving the greater problem; you only care about the just-defined sub-problem. It is very dangerous to mix development and analysis at the same time, since you may end up in an infinite loop where you keep changing your requirements as you analyse and never get to an end.
  5. After completing the subtask, switch back to the main workflow thread.

My suggestion is to time-box any open-ended task: say you are going to spend no more than X hours/days on this research, and before that deadline you will come up with some development requirements or an insights report that moves the project towards the final story goal. Remember you will have to complete the story by the end of the sprint. Scope the problems small enough so that you reduce the risk of not meeting expectations.

Get to an end-to-end solution as quickly as possible and postpone any complications, ideas or new features to the next iterations. EDA/research is generally a good source for filling your backlog for future scoping.

I leave it as an open question what to do with those notebooks after the investigation is completed. They are a bit tricky to maintain: when you change the codebase or a new dataset comes in, the notebooks become obsolete, and we don't want to refactor them every time to make sure they still work. I personally see notebooks more as one-off analyses that are archived after being used.

I tend to translate all of my findings and assumptions into project requirements so that they don't get lost. In my opinion only the automated tasks should be maintained over time. Results from manual tasks that cannot be automated should be documented, stamped and archived in the wiki.

Evaluation

Unit tests make sure that the code does what it is meant to do, but that does not imply it solves the right problem in an acceptable way. The evaluation strategy typically reflects the real business scenario in which the model will be used. The choice of performance metrics must have a meaningful explanation within the business context. Metrics should be easy to interpret for your stakeholders, who generally are not data scientists and only speak the company's business language.

A good tip is to create a Kaggle-like framework (a sketch follows the list) that:

  • defines the APIs reflecting your custom data types
  • uses some abstract interface representing the particular implementation (which could be split into multiple components, e.g. transformer, trainer, model)
  • knows how to robustly validate the given implementation (e.g. cross-fold validation, domain-specific splitting that avoids data leakage, a mix of timestamp and customerId partitioning…)
  • produces one or a pool of interpretable performance metrics such as mean average precision @ N, uplift, spam rate, loss rate, retention rate. Avoid abstract concepts like area under the curve or F-score.
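
A minimal sketch of what such a framework's interfaces could look like in Scala; the type names, the Recommender trait and the precision metric are illustrative assumptions, not the actual internal APIs:

```scala
// Illustrative domain types; placeholders for your custom data types.
case class Customer(id: Long)
case class Item(id: Long)

// Every model implementation plugs in behind the same interface.
trait Recommender {
  def train(history: Seq[(Customer, Item)]): Unit
  def recommend(customer: Customer, n: Int): Seq[Item]
}

// The framework, not the modeller, owns splitting and scoring, so every
// implementation is validated and compared in exactly the same way.
object Evaluator {
  def meanPrecisionAtN(model: Recommender,
                       train: Seq[(Customer, Item)],
                       test: Seq[(Customer, Item)],
                       n: Int): Double = {
    model.train(train)
    // Ground truth: the items each test customer actually bought.
    val truth = test.groupBy(_._1).map { case (c, pairs) => c -> pairs.map(_._2).toSet }
    val precisions = truth.map { case (customer, bought) =>
      model.recommend(customer, n).count(bought.contains).toDouble / n
    }
    if (precisions.isEmpty) 0.0 else precisions.sum / precisions.size
  }
}
```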

Soon you will find a blog post from our team about an offsite in Lanzarote where, following the Kaggle-like structure, we prototyped 6 different models for a recommender system in less than a week.

When building the evaluation framework, a few questions you want to ask are:

  • What does a positive/negative sample represent in this business scenario?
  • Is recall important? Why do you care about accuracy?
  • What actions can be taken upon a prediction?
  • In which form can the model be used? How can the insights be presented/visualized? Can it be integrated into an existing IT system?
  • What are the capabilities/practical issues of following the decisions suggested by the model?
  • What is the uplift of the data-driven solution compared to the traditional business-as-usual performance?
  • How can you test the trained model in the live environment (is A/B testing possible, or would a bad scenario cause a lot of damage)?
  • Does the effectiveness of your solution depend only on your model or also on other parties? (e.g. predicting which customers to contact for marketing purposes relies on the conversion rate of the marketing team as well)
  • How can you feed the results back for updating the model? At which rate? Is the model easy to update or must it be re-trained on every batch of newly collected data? Can you re-train it within the update interval?
  • Will the triggered actions influence the upcoming data (e.g. a recommender system can change the distribution of the future population)? Are there any amplification effects (if you recommend the most popular items, those will become even more popular, and so on)?

My experience suggests that the more time you spend implementing a robust and exhaustive evaluation framework, the easier and more reliable maintaining and improving the system will be later. Time spent here is a good investment and requires a lot of thinking from all three data science angles: business, statistics and engineering.

Demo

It is good practice to demo advances and new results to the team and/or stakeholders at the end of the sprint. Feeling the continuous pace of delivery and improvement is an excellent psychological element and increases trust and confidence.

Moreover, it is the place where scrutiny comes in and you can have your methodology and interpretations challenged. Any deliverable or document presented during the demo should be stored in the wiki with a date associated with it.

Coding practices for data products development

This is the part 2 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Coding practices

Code should be developed in a proper IDE and make use of advanced tools for refactoring, auto-completion, syntax highlighting and auto-formatting, at the very least.

Notebooks should use routine libraries from the main codebase. As soon as some code developed in a notebook is reusable, it should be moved into the codebase. A rule of thumb might be that each notebook cell should not exceed 10 lines; beyond that it either needs refactoring or should be pulled out. The only exception is long code used only and specifically for a one-off investigation that does not make sense outside that particular context.

Do not introduce unnecessary dependencies in the codebase (e.g. plotting libraries). Keep the code repository lean and add dependencies to your particular use case rather than to the project repository.

During development it is recommended to commit frequently. When the ticket is ready to go, the developer should first run git diff develop and review their own code before creating the pull request (PR).

The pull request should only contain the minimum amount of code specified in the corresponding ticket requirements. Do not anticipate functions that you know you will need in the future, even if that future is a couple of hours away. Avoid abstractions or general-purpose methods. First write working code for your specific use case, then refactor it.

The Agile manifesto says:

“Simplicity–the art of maximizing the amount
of work not done–is essential.”

Make your code structure flat:

  • data containers
  • static classes containing functions/methods/utils
  • entry point classes defining the end-to-end job and putting all of the pieces together
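
As a sketch of the flat structure above (with invented names, purely for illustration):

```scala
// 1. Data containers: plain case classes, no behaviour.
case class Account(id: Long, balance: Double)

// 2. Static-style objects holding pure functions/utils.
object AccountOps {
  def totalBalance(accounts: Seq[Account]): Double = accounts.map(_.balance).sum
}

// 3. Entry point that wires the pieces together end-to-end.
object BalanceReportJob {
  def main(args: Array[String]): Unit = {
    val accounts = Seq(Account(1L, 120.0), Account(2L, 80.5)) // stub input for the sketch
    println(s"Total balance: ${AccountOps.totalBalance(accounts)}")
  }
}
```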

Copy and paste the same code if needed; duplication is not always bad if it makes the design simpler. Only extract methods and abstract classes once you have at least 3 use cases.

Comments in the code are very likely to cause out-of-sync documentation. Clean code, good design and self-explanatory naming will make your code self-documenting. The only exceptions are TODO, FIXME and annotations explaining why a hack was needed and under which conditions the current implementation might fail. Obviously avoiding hacks in the first place is the best solution, but sometimes we need to cope with them. Use TODOs liberally, but do not leave non-working code without annotations.

Extreme attention should be paid to code style and conventions. Badly formatted code or inconsistent patterns make the code very hard to read and maintain.

After the PR is sent for review, chase your reviewer to review your code as soon as possible. Resist starting a new task until the review is finished and the PR is merged into the develop branch. Do one thing at a time and move to the next only when the previous is 100% done.

Reviewers should not accept justifications for bad practices. Code review is the only way to guarantee that the team converges towards excellence. It definitely pays off in the long term. The process of code reviewing should go back and forth until both parties are satisfied.

Testing


You should always come up with smart ways of testing your code. Laziness or "I know it works" approaches should not be accepted. The only code that may not require tests is one-off analysis, since it is humanly supervised and is not going into production.

Code without tests is risky, cannot be refactored and cannot be maintained, since unit tests also serve as documentation. If someone changes your code, you can still be blamed and held responsible for the failure even though your code used to work. Tests are the only way of protecting the validity of your solutions. Time spent testing is the greatest long-term investment you can make for your project.

If you spot a bug that was not caught by your tests, that is an indicator that a test case should be added. Don't just fix it; make sure you first have a failing test for it. Debug your code by adding unit tests and breaking down end-to-end methods into smaller composable functions. Debugging by adding unit tests gives you a much safer and more repeatable way to make your code robust.

Read-eval-print-loop (REPL) debugging is just another type of exploratory analysis; if you want to go that way, remember to turn your manual techniques into automated tests.

Obviously, none of the above problems would exist in the case of TDD.

When your imagination for creating manual test cases is running out, or you are too tired of adding tests that always succeed, consider also adding a few property-based tests with random generators.
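
For example, a property-based test written with ScalaCheck might look like the following toy sketch (assuming ScalaCheck is available on the test classpath):

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// Toy example: each property is checked against many randomly generated
// inputs instead of a handful of hand-written cases.
object ReverseSpec extends Properties("List reverse") {

  property("reversing twice gives back the original list") =
    forAll { (xs: List[Int]) =>
      xs.reverse.reverse == xs
    }

  property("reverse preserves length") =
    forAll { (xs: List[Int]) =>
      xs.reverse.length == xs.length
    }
}
```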

Unit tests are necessary, but it is the whole end-to-end flow that matters. Make sure you have at least a few integration tests in place. Ideally those integration tests map to real use cases.

Pair working

We found pair working to be much more productive than working as isolated individuals. A data science team is generally cross-functional, with people ranging from a more engineering background to a more theoretical analytical/statistics background. A good rule is to pair opposite individuals together and swap their competencies, so that whoever is good at coding does the modelling and vice versa. The code review process still applies as usual even though the code was written together; it might be worth involving someone else with no prior knowledge of the project to review the code and methodology.

Functional Programming

Functional programming offers a few advantages over other paradigms and we found it to suit data munging and machine learning algorithms very well. Just to name a few:

  • Implement any complex logic as a combination of simple first-order functions instead of long and non-reusable methods.
  • No state, no side effects: the same code will return the same output on every single call. No debugging is needed.
  • Close match with maths. You can implement any algorithm the same way you read it in academic papers.
  • No need to think about how to make your code execute efficiently. Focus on functionality only.
  • High level of abstraction: keep your brain trained on lateral thinking instead of following mechanical procedures.
  • Conciseness: you will be surprised by how many algorithms (single-node or distributed) can be implemented in a single line.
  • Higher readability: you only need to understand what the functions aim to do, not what the values of each variable represent at each step.
  • Concurrency for free at no extra cost. Full parallelism.
  • The same code for local implementations magically scales up in a distributed environment. That means you can prototype locally without having to re-engineer your solution for the big data system.
  • Type system: you know which functions can be used and what the form of the intermediate transformations is. No need for read-eval-print loops or hacky print calls. It is easy to implement, reason about and refactor complex algorithms without introducing bugs.
  • No explicit loops: you know how your algorithm converges via recursion.
  • Flat and minimal structure: no need to create tons of classes or verbose notation. You can use anonymous functions, pattern matching and wildcard notations.
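
As a toy illustration of the conciseness and local-versus-distributed points above (a hedged sketch, not code taken from any of the projects mentioned):

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("functional programming scales", "programming in scala")

    // Local version: a pipeline of functions on a plain Scala collection.
    val localCounts = lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

    // Distributed version: the same pipeline expressed on an RDD; Spark owns the execution plan.
    val spark = SparkSession.builder().appName("wordcount").master("local[*]").getOrCreate()
    val distributedCounts = spark.sparkContext
      .parallelize(lines)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(localCounts)
    distributedCounts.collect().foreach(println)
    spark.stop()
  }
}
```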

Popular languages in Data Science are not always natively functional, but most of them offer a functional extension, or some external library does. See for example this project introducing the functional APIs of Scala to Python collections: http://pedrorodriguez.io/blog/2015/03/14/functional-programming-collections-python/.

If you work in Data Science or Big Data and have never done functional programming before, you should really look into it. You might find it a bit steep at the beginning, but once you master it you will be superbly productive.

The ScrumBan Jira board

This is the part 1 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Agile board

Let's start with one of the core tools of the agile workflow. We use a Jira board for tracking and organizing all of our projects. We developed a custom board which uses the sprint concept of Scrum but in a more flexible way, as in Kanban.



The Scrumban board is configured as follows:

  • Horizontally divided into swimlanes (top-down in order of priority):
    • Critical / Blockers
    • Current work
    • Stories backlog
    • Sub-tickets backlog
    • Completed
  • The columns are:
    • To do
    • In progress
    • In review
    • Done / resolved
    • You can optionally have “Ready to release”
  • Quick filters should at least have one filter for each member of the team, filtering on their own assigned tickets.

The idea is that during planning you select from the backlog which high-level stories you want to deliver by the end of the sprint (typically 2 weeks long) and then you create subtasks as you need them.
The reason is that in data science you don't know beforehand what you are about to implement. You need to investigate-implement-test all the time, and as you do it, you discover what to do next. What is important is that whatever subtask is created is done by the end of the sprint so that the story is completed.

Define stories with a clear goal and a small scope. They should not span multiple sprints and, since they come with uncertainty about which tasks will be required, you really need to break a big problem into smaller, well-defined problems that are accomplishable no matter what.

Avoid having tasks for exploratory analysis or for adding unit tests. Each task should bring some value, potentially a new feature. Each task will then require some exploratory analysis as well as some development and testing. Those steps are already part of the definition of "Done". See the sections on tests and exploratory analysis for more explanation.

Always plan less than your capacity. Delivering your stories a few days early is a very good sign; delaying them is bad. If you manage to get your work done by Thursday, spend the whole of Friday in a pub celebrating your amazing delivery.

In Jira you must assign each story to one individual, but remember that in an agile team either the whole team succeeds or it fails. If that person does not manage to finish their tasks on time, it is a team failure. That's what you have the morning stand-up for: to make sure everything is under control and team resources are allocated in a way that makes the sprint successful.

Never change the scope of your sprints or add tasks that were not planned, unless they are required hotfixes. If you are asked to do something else, then invite the product owners to join your next sprint planning and only then allocate resources for them.
Remember the goal of a sprint is to have a working, even if simplistic, deliverable, not to solve sparse tasks.

At the end of the sprint, have a retrospective meeting to discuss what went well and what did not. Make sure to take actions to avoid those blockers appearing again in the future.

Documentation

Documentation should be as simple as possible.

  • Release notes: a page where you note the major changes since the previous version, the list of new tickets that have been merged (linking to Jira) and a link to a more detailed report.
  • The detailed report contains snapshots of the most recent logs, results, observations, limitations, assumptions and performance of the model/ETL/application. Often it contains some charts that quickly explain how good the product is. We use those detailed but concise reports to track how the product is evolving. The release detailed report also contains the help messages explaining how to run the application and all of the command line interface (CLI) options.
    If all of your tests and procedures are fully automated then this page is simply a copy and paste of the results.
  • The usage of a particular job class or script, with the list of CLI arguments and default values, is also accessible using the --help argument; many libraries help you do that (bash getopts, Scala Scallop, …). A Scallop sketch follows the list.
  • Other pages are used to explain the complex parts of the logic. Limit those pages to logic that is very complicated and hard to understand by just reading the code.
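
A minimal sketch of how the Scallop library can expose CLI options and an auto-generated --help message; the job name and option names below are hypothetical:

```scala
import org.rogach.scallop.{ScallopConf, ScallopOption}

// Hypothetical CLI for a scoring job; option names are invented for illustration.
class JobConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val input: ScallopOption[String]  = opt[String](required = true, descr = "Path to the input dataset")
  val output: ScallopOption[String] = opt[String](required = true, descr = "Path where results are written")
  val topN: ScallopOption[Int]      = opt[Int](default = Some(10), descr = "Number of recommendations per customer")
  verify()
}

object ScoringJob {
  def main(args: Array[String]): Unit = {
    // Running with --help prints the options and descriptions above and exits.
    val conf = new JobConf(args)
    println(s"Reading from ${conf.input()} and writing top ${conf.topN()} results to ${conf.output()}")
  }
}
```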

Documentation is hard to keep in sync; that's why we want to document what's new since the last release rather than going through the whole wiki and updating every single page.

Ideally the documentation comes from the source code, unit tests and Jira tickets. Individual analyses, findings and insights can be documented separately, but they should represent static reports rather than project documentation.

In the hierarchical structure of the pages, we limit the maximum depth to 2, which means we have root-level pages with at most one level of child pages. Nested structures make it very hard to find content when you need it.

Branching and versioning

Code should always and only exist in a git repository. Sparse snippets or random script files should be avoided.

We follow the gitflow branching model, where each ticket is mapped to a feature branch. If you integrate Jira with Stash then from the ticket web page you can automatically create the corresponding branch in the repository, using develop as the branch base.

You do not need to use the complete gitflow branching model, but at least use the master, develop and feature branches. It's up to the deployment strategy to define how to handle hotfix, bugfix and release branches. Make sure this strategy is clearly defined and consistently enforced. See deployment.

Story tickets generally don't have a branch associated with them; their sub-tasks do.

Install a git hook so that every commit message includes the ticket code as a prefix (which you can parse out of the branch name). Tracking each commit with the corresponding ticket is a life-saver when, in the future, you try to reverse-engineer what a method is doing and why it was created in the first place. You can then go through the whole git history and access the corresponding tickets that touched that piece of code.

Discussions

Discussions of specific tasks should go into the corresponding Jira ticket web page. This makes the conversation public and tracked, and anyone can jump into the discussion with the full context available. Reference files or supporting documents should also be attached to the Jira ticket itself, or to the wiki if they serve a general purpose. Remember each Jira ticket can be linked from the release wiki pages, which means we never lose track of them. Moreover, the query engine is quite good.

We found emails to be the worst place for discussions, especially for sharing files that will soon become out-of-date.

When someone sends you an Excel file, reply saying that your laptop does not have an Office installation on it. If you are sharing small data files, TSV or JSON is the way to go.
Avoid comma-separated files with quotes wrapping text fields. You want to make your file editable using simple bash commands rather than having to load it into a CSV-parsing library.

We also tried mounted shared drives, but Confluence is a much better collaborative way to share and organize files, with integrated version control and metadata.

Avoid meetings as much as you can: invent some excuse, ask for a clear agenda beforehand. Educate your colleagues to communicate with you by raising issues. Leave meetings for important discussions only, and spend your meeting time presenting to and checkpointing with your stakeholders more frequently.


Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are:

  1. Data stored in non-scalable infrastructure for analysis and processing
  2. Data governance and security policies

1. Data often resides in central data warehouses and RDBMSs on which many legacy applications and analysts depend.
Data Scientists, instead, cannot build their models or perform exploratory analysis using SQL queries alone. They need the data to be available in a scalable, programmatic and reactive stack such as Hadoop and Apache Spark, and to develop their logic using languages such as Python, R, Scala… (for a comparison of Python and Scala for Spark, see this post: 6 points to compare Python and Scala for Data Science using Apache Spark).
2. Nevertheless, data cannot just be transferred (in technical terms, sqoop-ed) to a Hadoop cluster without incurring tedious bureaucracy, ingestion inconsistencies and strict policies. In big corporations that translates to at least a month to decide which tables are interesting and a few more months to write the ETL logic, move the data and test its consistency.

At Barclays we developed a stack to logically map the raw data from the central data warehouse into Spark and use Tachyon to keep the data in memory for long-term availability. With such a stack we are able to iterate fast, with immediate data availability from a scalable Big Data cluster, by skipping the data ingestion process while still complying with all of the data policies.

Tachyon was the key enabling technology for us.

Our workflow iteration time decreased from hours to seconds. Tachyon enabled something that was impossible before.

You can find the original article published on DZone in collaboration with Gene Pang, Software Engineer at Tachyon Nexus and Haoyuan Li, CEO of Tachyon Nexus:
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

6 points to compare Python and Scala for Data Science using Apache Spark

Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working over large datasets. It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach. I personally believe that Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science.

Despite the differing opinions about Spark, let's assume that a data science team wants to start adopting it as its main technology. The choice of programming language is often a dilemma. Shall we build our models in Python or in Scala? Shall we run the exploratory analysis using the iPython notebook or iScala?
A common understanding is that Python is the scientific language and Scala is an engineering language, seen as a better replacement for Java. Whilst there is truth in that, it does not always have to be the case.

Since the comparison between the two languages has already been evaluated in detail elsewhere, I would like to restrict the comparison to the particular use case of building data products leveraging Apache Spark in an agile workflow.

In particular, I can identify 6 important aspects that a Data Science programming language in this context should provide:

  1. Productivity
  2. Safe refactoring
  3. Spark integration
  4. Out-of-the-box Machine Learning/Statistics packages
  5. Documentation / Community
  6. Interactive Exploratory Analysis and built-in visualization tools

Why only Scala and Python?
Apache Spark comes with 4 APIs: Scala, Java, Python and, recently, R. The reason why I am only considering "PyScala" is that they mostly provide features similar to the other two languages (Scala over Java and Python over R) with, in my opinion, a better overall score. Moreover, R is not a general-purpose language and its API is still in an experimental phase.

1. Productivity

Even though coding close to the bare metal always produces the most optimized results, premature optimization is known to be the root of all evil. Especially in the initial MVP phase we want to achieve high productivity with the fewest possible lines of code, possibly guided by a smart IDE.

Python is a very simple to learn and highly productive language for getting things done quickly from day 1. Scala requires a little more thinking and abstraction due to its high-level functional features, but as soon as you get familiar with that, your productivity will dramatically increase. Code conciseness is quite comparable: both can be very concise depending on how good you are at coding. Reading Python is more explicit: it shows you step by step what your code executes and the state of each variable. Scala, on the other hand, focuses more on describing what you are trying to achieve as a final result, hiding most of the implementation details and execution order. But remember, with great power comes great responsibility: whilst pattern matching is a very cool way to extract variables, advanced features like implicits or custom DSLs can be confusing to the non-expert user.
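
As a small illustration of the pattern-matching point (a toy example written for this post, not taken from any project):

```scala
// Extracting variables by shape: the compiler checks the structure and exhaustiveness for us.
case class Customer(name: String, segment: Option[String])

def describe(customer: Customer): String = customer match {
  case Customer(name, Some("premium")) => s"$name gets the premium treatment"
  case Customer(name, Some(segment))   => s"$name belongs to segment $segment"
  case Customer(name, None)            => s"$name has not been segmented yet"
}
```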

In terms of IDEs, both IntelliJ and PyCharm are smart and productive environments. Nevertheless, Scala can take advantage of type information and compile-time cross-references that provide some extra functionality more naturally and without ambiguity, unlike in scripting languages. Just to name a few: find classes/methods by name in the project and linked dependencies, find usages, auto-completion based on type compatibility, development-time errors and warnings.
On the other hand, all of those compile-time features come at a cost: IntelliJ, sbt and all of the related tools are very slow and memory/CPU-hungry. You shouldn't be surprised if 2GB of your RAM is allocated in order to open multiple parallel projects in Scala. Python is more lightweight in this regard.

Conclusion: both score very well here. My recommendation: if you are developing simple, intuitive logic then Python does the job greatly; if you want to do something more complex then it may be worth investing in learning and writing functional code in Scala.

2. Safe Refactoring

This requirement mainly comes from the agile methodology: we want to safely change the requirements of our code as we perform data exploration and adjust them at each iteration. Very commonly you first write some code with associated tests and, shortly after, the tests, implementations and APIs are broken by a change. Every time we perform a refactoring we face the risk of introducing bugs and silently breaking the previous logic.

Both languages require tests (unit tests, integration tests, property-based tests, etc.) in order to be safely refactored. Scala, being a compiled language, has an advantage there, but I am not going to argue the pros and cons of compiled vs scripting languages. I will skip that debate, but at least for me there are some clear benefits in having typed code.

Conclusion: Scala very well, Python average.

3. Spark Integration

The majority of time and resources is generally spent on loading, cleaning and transforming data and extracting the most informative bits out of it. For that task, what is better than expressing your domain-specific logic as a combination of functions and not bothering about how it is lazily executed? No wonder that Big Data is turning more and more functional.

You would now expect me to say that Scala does better since it is natively functional. Actually, in this scenario the big difference is made by Spark rather than the programming language. Even though Python is not 100% functional (you could make it so via external libraries), it wraps the Spark API, which is indeed functional.

The implementation of the single map or reduce functions can then be either functional or not, but at least the main logic is expressed as a pipeline of transformations and operations over the raw data, and the execution plan is defined by the computation framework.

You still have to use the different Spark APIs smartly in order to make your code scalable and optimized, but this task is the same in both cases. If we consider code execution performance, we all know that JVM-compiled code runs faster than Python code, but Spark is moving towards language-agnostic abstractions like DataFrame, which will optimize most of the work for you and produce comparable performance results.

Thus, the solution is "use Spark". That said (and independently of the functional nature), Scala supports Spark natively, which comes in particularly handy when performing low-level tuning, optimization and debugging. If you have used the Spark framework you are well familiar with its serialization exceptions; since Python code is wrapped and executed in the JVM, you have less control over what is enclosed in your functions. Moreover, some new features in recent Spark releases may only be available in Scala before being ported to Python.

Conclusion: Scala is better when it comes to engineering; the two are equivalent in terms of Spark integration and functionality.

4. Out-of-the-box machine learning/statistics packages

When you marry a language, you marry the whole family. And Python has much more to bring to the table when it comes to out-of-the-box packages implementing most of the standard procedures and models you generally find in the literature and/or broadly adopted in the industry. Scala is still way behind in that respect, yet it can benefit from Java library compatibility and from the community developing some of the popular machine learning algorithms in their distributed version directly on top of Spark (see MLlib, H2O Sparkling Water, DeepLearning4j…). A little note regarding MLlib: from my experience its implementation is a bit hacky and often hard to modify or extend due to a mediocre design and nonsensical restrictions of private fields and classes.

Regarding the Java compatibility, honestly I don't see any Java framework that comes anywhere close to what Python today provides with its amazing scikit-learn and related libraries. On the other hand, many of those Python implementations only work locally (unless you use some bootstrapping/bagging plus model-ensembling technique, see https://cornercases.wordpress.com/2013/10/23/example-python-machine-learning-algorithm-on-spark/), and their out-of-the-box implementations lack strong scalability when it comes to distributed algorithms. Scala, on the other hand, provides only a few implementations, but they are already scalable and production-ready.

Nevertheless, do not forget that many big data problems can be reduced to small data problems, especially after accurate feature selection, filtering and aggregation. In some scenarios it might make sense to crunch your large dataset into a vector space that perfectly fits in memory and take advantage of the richness of advanced algorithms available in Python.

Conclusion: it really depends on the size of your data. Prefer Python every time the data can fit in memory, but also keep in mind the requirements of your project: is it just a prototype or is it something you want to deploy/maintain in a production system? Python offers a complete selection of already-implemented packages that can satisfy any need. Scala only provides the basics, but in the case of "productionisation" it is a better engineering choice.

5. Documentation / Community

If we compare the two plain languages (without their external libraries) in terms of community size, then Python belongs to tier 1 while Scala comes right after in tier 2; see http://readwrite.com/2010/12/10/ranking-programming-languages. Practically speaking it means both of them have enough tutorials and answers on StackOverflow covering the majority of use cases and how-tos.

If we consider documentation of the machine learning and statistics frameworks, the Python data science community is more mature, and in fact you can find many tutorials and examples of how to solve a lot of problems and do cool analyses using most of the Python libraries.

Unfortunately we cannot say the same for Scala. The ML and MLlib documentation is very poor; the only way to really understand how those libraries work is by reading the code. Likewise for some other open source libraries that I found on GitHub.

Conclusion:
Both have a good and comparable community in terms of software development. When we consider the data science community and cool data science projects, Python is hard to beat.

6. Interactive Exploratory Analysis and built-in visualization tools

iPython is one of the greatest tools ever invented in the scientific world; one year ago it would without doubt have been the Oscar winner. Today we can find many implementations of notebooks inspired by the iPython notebook available for any language. Jupyter, the iPython evolution, supports different kernels, while iScala actually re-implements the notebook on top of an Akka Play RESTful service. If you only consider opening a web-based notebook and starting to write and interact with some code, I think they are very similar.

If we consider using the notebook to interact with Spark, it may be a little more useful to use the Spark Notebook (in Scala), since it is specifically designed for this purpose and provides a few utilities to generate custom Spark contexts or stop the current in-progress job without having to access the Spark UI or run commands from the command line. While it is a nice-to-have feature, I don't think it makes a huge difference.

The pain comes with dependency management, and in that aspect Scala is a true nightmare! Being a compiled JVM language, all of the dependencies must be available in the classpath, and the kernel must be restarted every time a jar changes or a new one comes into the path. Moreover, using dependency management tools like sbt for some reason generates a whole lot of traffic, and all of your dependencies are then packed into a fat jar of hundreds of MBs which must then be loaded by the JVM executing your back-end code. Python does much better here because everything is resolved at runtime: you can simply import code or libraries and the interpreter will automatically resolve them for you without ever restarting your kernel. This aspect is extremely important, especially when separating development in the IDE from exploration in the notebook that calls the APIs of your implemented logic from the source folder. I raised this issue with the Typesafe and Spark Notebook folks hoping that it can be addressed somehow in a more efficient way.

Built-in visualizations: Spark Notebook includes a very rudimentary built-in viz library, a simple but acceptable WISP library and a few wrappers around JavaScript technologies such as D3 and Rickshaw. Generally speaking, it can render and wrap any JavaScript library, but in a way that is neither friendly nor intuitive. Python is without any doubt superior in the offer and selection of cool and advanced ways of plotting and building interactive dashboards.

Conclusion: Python wins; Scala is not mature enough yet, even though the Spark Notebook does a good job. We haven't yet considered the recent Apache Zeppelin, which provides some fancy visualization features and supports the concept of a language-agnostic notebook where each cell can contain any type of code (Scala, Python, SQL…) and is specifically designed to integrate well with Spark.

Final Verdict

Shall I use Scala or Python? The answer is: Yes!
Give both of them a try and test for yourself what works better for your specific use case. As a rule of thumb, Python is more analytics-oriented while Scala is more engineering-oriented, but both are great languages for building Data Science applications. The ideal scenario would be a data science team confident with both of them and able to swap when needed.

Nonetheless, technology choices are often driven by what people are already comfortable with. Pressure to deliver does not give you enough resources to spend on researching new libraries, reading papers or learning new tools and languages. What most data scientists care about at the end of the day is delivering using whatever means does the job.

If you do have to decide, my view is that if your scope is research, then a scripting language is complete enough for experimentation and prototyping. If your goal is to build a product, then you want to consider something more robust that gives you experimentation and, at the same time, delivers a product.

Since the best solution is never black or white, I encourage trying hybrid approaches that can adapt to each project's specification. A typical scenario could be developing the whole ETL, data cleansing and feature extraction in Scala, then distributing the data over multiple partitions and learning using algorithms written in Python, then collecting the results and presenting them in a Jupyter notebook. Moreover, since at that last stage we don't need Spark anymore, we could even deploy an interactive and stunning dashboard using Shiny by RStudio.

My motto is "the best tool for each task". Whatever balance you choose, avoid splitting into two teams: Data Science Engineers (the Big Data/Scala guys) and Data Science Analysts (the Python and SQL folks). Aim to build a cross-functional team with the full skill set to operate on the full end-to-end development of your product, from the raw data to the manual analysis and from the modelling to a scalable deployment.

I hope this article is useful for both experienced data scientists and enthusiasts who want to start their career in this industry. Please consider that the above comparison is mainly specific to the Apache Spark use case, which I strongly recommend, but if you are using a different stack and/or language choice, I think many concepts are still valid and can be extended to the broader families of compiled vs. scripting languages.

***

Related links:

https://www.quora.com/Which-one-should-I-learn-Python-or-Scala

https://www.linkedin.com/pulse/build-tool-pain-why-data-science-isnt-going-typed-sam-savage

https://www.quora.com/Is-Scala-a-better-choice-than-Python-for-Apache-Spark

http://stackoverflow.com/questions/32464122/spark-performance-for-scala-vs-python

Scala vs Python

http://datavirtualizer.com/popularity-vs-productivity-vs-performance/

Pro Python:

http://blog.mikiobraun.de/2013/11/how-python-became-the-language-of-choice-for-data-science.html

https://www.quora.com/Why-is-Python-a-language-of-choice-for-data-scientists

I am sorry, but the majority of comparisons of Python with other languages for data science are mainly Python vs. R. I could not find many other pro-Python links comparing it with Scala.

Pro Scala:

https://tech.coursera.org/blog/2014/02/18/why-we-love-scala-at-coursera/

http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang

https://www.linkedin.com/pulse/data-science-technology-choice-case-study-harry-powell