Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working with large datasets. It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach. I personally believe that Spark will inevitably become the de facto Big Data framework for Machine Learning and Data Science.
Whatever the different opinions about Spark, let’s assume that a data science team wants to start adopting it as its main technology. The choice of programming language is often a dilemma. Shall we build our models in Python or in Scala? Shall we run the exploratory analysis using the IPython notebook or IScala?
A common understanding is that Python is the scientific language and Scala is an engineering language seen as a better replacement for Java. Whilst there is truth in that, it does not always have to be the case.
Since the two languages have already been compared in detail elsewhere, I would like to restrict the comparison to the particular use case of building data products that leverage Apache Spark in an agile workflow.
In particular, I can identify 6 important aspects that a Data Science programming language should provide in this context:
- Productivity
- Safe refactoring
- Spark integration
- Out-of-the-box Machine Learning/Statistics packages
- Documentation / Community
- Interactive Exploratory Analysis and built-in visualization tools
Why only Scala and Python?
Apache Spark comes with 4 APIs: Scala, Java, Python and, recently, R. The reason why I am only considering “PyScala” is that they mostly provide features similar to those of the other 2 languages (Scala over Java and Python over R) with, in my opinion, better overall scoring. Moreover, R is not a general-purpose language and its API is still in an experimental phase.
1. Productivity
Even though coding close to the bare metal tends to produce the most optimized results, premature optimization is known to be the root of all evil. Especially in the initial MVP phase, we want to achieve high productivity with the fewest possible lines of code, ideally guided by a smart IDE.
Python is very simple to learn and highly productive, a language for getting things done quickly from day 1. Scala requires a little more thinking and abstraction due to its high-level functional features, but as soon as you get familiar with them your productivity will increase dramatically. Code conciseness is quite comparable: both can be very concise, depending on how good you are at coding. Python reads more explicitly: it shows you step by step what your code executes and the state of each variable. Scala, on the other hand, focuses more on describing what you are trying to achieve as a final result, hiding most of the implementation details and the execution order. But remember that with great power comes great responsibility. Whilst pattern matching is a very cool way to extract variables, advanced features like implicits or custom DSLs can be confusing to the non-expert user.
In terms of IDEs, both IntelliJ and PyCharm are smart and productive environments. Nevertheless, Scala can take advantage of types and compile-time cross-references, which provide some extra functionality more naturally and without ambiguity, unlike in scripting languages. To name a few: finding classes and methods by name in the project and its linked dependencies, finding usages, auto-completion based on type compatibility, and development-time errors or warnings.
On the other hand, all of those compile-time features come at a cost: IntelliJ, sbt and all of the related tools are very slow and hungry for memory and CPU. You shouldn’t be surprised if 2GB of your RAM is allocated just to open multiple Scala projects in parallel. Python is more lightweight in this respect.
Conclusion: Both score very well here. My recommendation is: if you are developing simple, intuitive logic then Python does the job very well; if you want to do something more complex then it may be worth investing in learning and writing functional code in Scala.
2. Safe Refactoring
This requirement mainly comes from the agile methodology: we want to safely change the requirements of our code as we perform data explorations and adjust them at each iteration. Very commonly you first write some code with associated tests, and immediately afterwards the tests, implementations and APIs are broken. Every time we perform a refactoring we face the risk of introducing bugs and silently breaking the previous logic.
Both languages require tests (unit tests, integration tests, property-based tests, etc.) in order to be safely refactored. Scala, being a compiled language, has an advantage here, but I am not going to argue the pros and cons of compiled vs scripting languages. I will skip that debate, but at least for me there are some useful benefits in having typed code.
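To make the benefit of typed code a bit more concrete, here is a minimal Python sketch of a refactoring that silently breaks a caller. The `Event` class, its fields and the `total_for` function are hypothetical names invented for this illustration. The stale reference is only discovered when the offending line actually runs, whereas a compiler such as Scala's would reject the equivalent code before execution.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical record: the field used to be called `user`
    # and was renamed to `user_id` during a refactoring.
    user_id: str
    amount: float


def total_for(events, user):
    # Stale caller: it still reads the old field name `user`.
    return sum(e.amount for e in events if e.user == user)


events = [Event("alice", 10.0), Event("bob", 5.0)]

try:
    total_for(events, "alice")
    failure = None
except AttributeError as exc:
    # Nothing flagged the stale reference until this line executed.
    failure = str(exc)

print(failure)
```

A unit test covering `total_for` would catch this as well; the point is only that in Python something has to exercise the code path before the mistake shows up.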
Conclusion: Scala scores very well, Python average.
3. Spark Integration
The majority of time and resources is generally spent on loading, cleaning and transforming data and on extracting the most informative bits out of it. For that task, what is better than expressing your domain-specific logic as a combination of functions, without worrying about how it is lazily executed? No wonder that Big Data is turning more and more functional.
You would now expect me to say that Scala does better since it is natively functional. Actually, in this scenario the big difference is made by Spark rather than by the programming language. Even though Python is not fully functional (you can get closer with external libraries), it wraps the Spark API, which is indeed functional.
The implementation of the individual map or reduce functions can then be either functional or not, but at least the main logic is expressed as a pipeline of transformations and operations over the raw data, and the execution plan is defined by the computation framework.
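The idea of a lazily executed pipeline can be sketched without Spark at all. The example below uses only plain Python built-ins (`map`, `filter`, `functools.reduce`) as a stand-in for the framework; the input records and the `parse` and `add_to_totals` helpers are made up for illustration. Spark's RDD API offers methods of the same flavour (`map`, `filter`, `reduceByKey`), with the difference that Spark also plans and distributes the execution.

```python
from functools import reduce

# Hypothetical raw input: one "user,amount" record per line.
raw_lines = ["alice,10.5", "bob,3.0", "broken-line", "alice,4.5", "bob,oops"]


def parse(line):
    # Return (user, amount), or None when the record is malformed.
    parts = line.split(",")
    if len(parts) != 2:
        return None
    try:
        return parts[0], float(parts[1])
    except ValueError:
        return None


def add_to_totals(totals, record):
    # Reduce step: build a new dict rather than mutating the accumulator.
    user, amount = record
    return {**totals, user: totals.get(user, 0.0) + amount}


# map and filter are lazy in Python 3: nothing has been parsed yet.
parsed = map(parse, raw_lines)
valid = filter(lambda record: record is not None, parsed)

# The reduce plays the role of the "action" that pulls data through the pipeline.
totals = reduce(add_to_totals, valid, {})
print(totals)
```

The body of `parse` is ordinary imperative code, while the overall logic is still a chain of transformations followed by a single aggregation.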
You still have to use the different Spark APIs smartly in order to make your code scalable and optimized, but this task is the same in both cases. If we consider code execution performance, JVM-compiled code generally runs faster than Python code, but Spark is moving towards language-agnostic abstractions like DataFrame, which will optimize most of the work for you and produce comparable performance.
Thus, the solution is “use Spark”. That said (and independently of its functional nature), Scala is Spark’s native language, which comes in particularly handy when performing low-level tuning, optimization and debugging. If you have used the Spark framework you are well familiar with its serialization exceptions. Since the Python API is a wrapper around Spark’s JVM engine, you have less control over what is enclosed in your functions. Moreover, some new features in recent Spark releases may only be available in Scala before being ported to Python.
Conclusion: Scala is better when it comes to engineering; the two are equivalent in terms of Spark integration and functionality.
4. Out-of-the-box machine learning/statistics packages
When you marry a language, you marry the whole family. And Python has much more to bring to the table when it comes to out-of-the-box packages implementing most of the standard procedures and models that you generally find in the literature and/or that are broadly adopted in the industry. Scala is still way behind in that respect, yet it can benefit from compatibility with Java libraries and from the community developing distributed versions of some of the popular machine learning algorithms directly on top of Spark (see MLlib, H2O Sparkling Water, DeepLearning4j …). A little note regarding MLlib: in my experience its implementation is a bit hacky and often hard to modify or extend, due to a mediocre design and nonsensical restrictions imposed by private fields and classes.
Regarding Java compatibility, I honestly don’t see any Java framework coming anywhere close to what Python provides today with its amazing scikit-learn and related libraries. On the other hand, many of those Python implementations only work locally: out of the box they lack strong scalability when it comes to distributed algorithms, unless you use some bootstrapping/bagging + model ensembling technique (see https://cornercases.wordpress.com/2013/10/23/example-python-machine-learning-algorithm-on-spark/). Scala, in contrast, provides only a few implementations, but they are already scalable and production-ready.
Nevertheless, do not forget that many big data problems can be reduced to small data problems, especially after careful feature selection, filtering and aggregation. In some scenarios it might make sense to crunch your large dataset into a vector space that fits comfortably in memory and then take advantage of the rich set of advanced algorithms available in Python.
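As a rough illustration of that reduction, the sketch below filters and aggregates a stream of raw events into one small feature row per user. It is plain Python over a tiny in-memory list; the event tuples, the `selected` event types and the variable names are all hypothetical, and in a real project the aggregation step would be the part that runs on the cluster.

```python
from collections import defaultdict

# Hypothetical raw events: (user, event_type, value). In a real pipeline
# this would be the large dataset processed on the cluster.
events = [
    ("alice", "click", 1.0),
    ("alice", "purchase", 20.0),
    ("bob", "click", 1.0),
    ("alice", "click", 1.0),
    ("bob", "refund", 5.0),  # dropped by the feature selection below
]

# Feature selection: keep only the event types considered informative.
selected = ("click", "purchase")

# Aggregation: one small accumulator per user instead of one row per event.
sums = defaultdict(lambda: defaultdict(float))
for user, event_type, value in events:
    if event_type in selected:
        sums[user][event_type] += value

# Dense feature matrix: one row per user, one column per selected type.
users = sorted(sums)
matrix = [[sums[user][t] for t in selected] for user in users]

print(users)
print(matrix)
```

The resulting matrix has as many rows as there are users rather than events, which is what can make it small enough to hand over to an in-memory library.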
Conclusion: It really depends on the size of your data. Prefer Python whenever the data can fit in memory, but also keep in mind the requirements of your project: is it just a prototype, or is it something you want to deploy and maintain in a production system? Python offers a comprehensive selection of already-implemented packages that can satisfy almost any need. Scala will only provide the basics, but when it comes to “productionisation” it is the better engineering choice.
5. Documentation / Community
If we compare the two plain languages (without their external libraries) in terms of community size, then Python belongs to tier 1 while Scala comes right after in tier 2, see http://readwrite.com/2010/12/10/ranking-programming-languages. Practically speaking, it means that both of them have enough tutorials and answers on Stack Overflow to cover the majority of use cases and how-to’s.
If we consider the documentation of the machine learning and statistics frameworks, the Python data science community is more mature, and you can find many tutorials and examples showing how to solve a lot of problems and run cool analyses using most of the Python libraries.
Unfortunately we cannot say the same for Scala. The documentation of the ML and MLlib libraries is very poor; the only way to really understand how they work is by reading the code. The same goes for some other open source libraries that I found on GitHub.
Both of them have a good and comparable community in terms of software development. When we consider the data science community and cool data science projects, Python is hard to beat.
6. Interactive Exploratory Analysis and built-in visualization tools
IPython is one of the greatest tools ever invented in the scientific world; one year ago it would have been the Oscar winner without a doubt. Today we can find many notebook implementations inspired by the IPython notebook, available for almost any language. Jupyter, the evolution of IPython, supports different kernels, while IScala actually re-implements it on top of an Akka/Play RESTful service. If all you want is to open a web-based notebook and start writing and interacting with some code, I think they are very similar.
If we consider using the notebook to interact with Spark, it may be a little more useful to use the Spark Notebook (in Scala), since it is specifically designed for this purpose and provides a few utilities to generate custom Spark contexts or to stop the job currently in progress without having to access the Spark UI or run commands from the command line. While it is a nice-to-have feature, I don’t think it makes a huge difference.
The pain comes when we get to dependency management, and in that respect Scala is a true nightmare! Since it is a compiled JVM language, all of the dependencies must be available in the classpath, and the kernel has to be restarted every time a jar changes or a new one is added to the path. Moreover, dependency management tools like sbt for some reason generate a whole lot of traffic, and all of your dependencies are then packed into a fat jar hundreds of MBs in size, which must then be loaded by the JVM executing your back-end code. Python does much better here because everything is resolved at runtime: you can simply import code or libraries and the interpreter will automatically resolve them for you without ever restarting your kernel. This aspect is extremely important, especially when separating the development in the IDE from the exploration in the notebook, which calls the APIs of your implemented logic from the source folder. I raised this issue with the Typesafe and SparkNotebook folks, hoping that it can be addressed somehow in a more efficient way.
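The Python side of this workflow can be sketched with the standard library alone. In the example below a temporary folder stands in for the project’s source folder and `mylogic` is a made-up module name; the module is imported, edited on disk as if from the IDE, and picked up again with `importlib.reload` while the same interpreter session keeps running.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the project's source folder; `mylogic` is a made-up module name.
src_dir = Path(tempfile.mkdtemp())
module_path = src_dir / "mylogic.py"
module_path.write_text("def score(x):\n    return x * 2\n")

# Make the folder importable, as a notebook would do for the project sources.
sys.path.insert(0, str(src_dir))
importlib.invalidate_caches()

import mylogic

before = mylogic.score(10)

# Simulate editing the file in the IDE while the session keeps running.
module_path.write_text("def score(x):\n    return x * 2 + 1\n")

# Re-execute the module in place: no interpreter or kernel restart.
importlib.reload(mylogic)
after = mylogic.score(10)

print(before, after)
```

One caveat worth keeping in mind: objects created before the reload keep referring to the old definitions, so this works most smoothly for functions called through the module.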
Conclusion: Python wins; Scala is not mature enough yet, even though the SparkNotebook does a good job. We haven’t yet considered the recent Apache Zeppelin, which provides some fancy visualization features and supports the concept of a language-agnostic notebook where each cell can contain any type of code (Scala, Python, SQL…) and which is specifically designed to integrate well with Spark.
Shall I use Scala or Python? The answer is: Yes!
Give both of them a try and test for yourself what works best for your specific use case. As a rule of thumb, Python is more analytics-oriented while Scala is more engineering-oriented, but both are great languages for building Data Science applications. The ideal scenario would be to have a data science team that is confident with both of them and can swap when needed.
Nonetheless, technology choices are often driven by what people are already comfortable with. Pressure to deliver does not leave you enough resources to spend on researching new libraries, reading papers or learning new tools and languages. What most data scientists care about at the end of the day is delivering, using whatever means does the job.
If you do have to decide, my view is that if your scope is research, then a scripting language is complete enough for experimentation and prototyping. If your goal is to build a product, then you want to consider something more robust that gives you experimentation and at the same time delivers a product.
Since the best solution is never black or white, I encourage trying hybrid approaches that can adapt to each project’s specification. A typical scenario could be developing the whole ETL, data cleansing and feature extraction in Scala, then distributing the data over multiple partitions and learning with algorithms written in Python, and finally collecting the results and presenting them in a Jupyter notebook. Moreover, since we don’t need Spark anymore at that last stage, why not even deploy a stunning interactive dashboard using Shiny by RStudio?
My motto is “the best tool for each task”. Whatever balance you choose, avoid splitting into two teams: Data Science Engineers (the Big Data/Scala guys) and Data Science Analysts (the Python and SQL folks). Aim to build a cross-functional team with the full skillset to operate on the end-to-end development of your product, from the raw data to the manual analysis and from the modelling to a scalable deployment.
I hope this article proves useful for both experienced data scientists and enthusiasts who want to start their career in this industry. Please consider that the above comparison is mainly specific to the Apache Spark use case, which I strongly recommend, but if you are using a different stack and/or choice of languages, I think many of the concepts are still valid and can be extended to the broader families of compiled vs. scripting languages.
I am sorry, but the majority of comparisons of Python with other languages for data science are Python vs. R. I could not find many other pro-Python links comparing it with Scala.