This is part 2 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.
Code should be developed in a proper IDE, making use at the very least of advanced tools for refactoring, auto-completion, syntax highlighting and auto-formatting.
Notebooks should use routines from the main codebase. As soon as code developed in a notebook becomes reusable, it should be moved into the codebase. As a rule of thumb, a notebook cell should not exceed 10 lines; beyond that, it either needs refactoring or should be moved out of the notebook. The only exception is long code written specifically for a one-off investigation that makes no sense outside that particular context.
Do not introduce unnecessary dependencies in the codebase (e.g. plotting libraries). Keep the code repository lean and add dependencies to your particular use case rather than the project repository.
During development it is recommended to make frequent git commits. When the ticket is ready to go, the developer should first run git diff develop and review their own code before creating the pull request (PR).
The pull request should contain only the minimum amount of code required by the corresponding ticket. Do not anticipate functions that you know you will need in the future, even if that future is only a couple of hours away. Avoid abstractions and general-purpose methods. First write working code for your specific use case, then refactor it.
The Agile Manifesto says:
“Simplicity–the art of maximizing the amount
of work not done–is essential.”
Make your code structure flat:
- data containers
- static classes containing functions/methods/utils
- entry point classes defining the end-to-end job and putting all of the pieces together
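As a rough illustration of this flat layout, here is a minimal Python sketch. The names Transaction, TransactionUtils and SpendingReportJob are hypothetical and only stand in for the three kinds of building blocks listed above.

```python
from dataclasses import dataclass
from typing import List


# 1. Data container: holds values, no behaviour.
@dataclass(frozen=True)
class Transaction:
    customer_id: str
    amount: float


# 2. Static class: a namespace of stateless functions.
class TransactionUtils:
    @staticmethod
    def only_positive(transactions: List[Transaction]) -> List[Transaction]:
        return [t for t in transactions if t.amount > 0]

    @staticmethod
    def total_amount(transactions: List[Transaction]) -> float:
        return sum(t.amount for t in transactions)


# 3. Entry point: wires the pieces together into the end-to-end job.
class SpendingReportJob:
    def run(self, transactions: List[Transaction]) -> float:
        positive = TransactionUtils.only_positive(transactions)
        return TransactionUtils.total_amount(positive)


sample = [Transaction("a", 10.0), Transaction("a", -2.5), Transaction("b", 5.0)]
report_total = SpendingReportJob().run(sample)
```

There is no inheritance and no deep nesting: the job reads top to bottom, and each piece can be tested on its own.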
Copy and paste the same code if needed: duplication is not always bad if it makes the design simpler. Only extract methods and abstract classes once you have at least 3 use cases.
Comments in the code are very likely to end up as out-of-sync documentation. Clean code, good design and self-explanatory names will make your code self-documenting. The only exceptions are TODO, FIXME and annotations explaining why a hack was needed and under which conditions the current implementation might fail. Obviously, avoiding hacks in the first place is the best solution, but sometimes we need to cope with them. Use TODOs liberally, but do not leave non-working code without annotations.
Extreme attention should be paid to code style and conventions. Badly formatted code and inconsistent patterns make the code very hard to read and maintain.
After the PR is sent for review, chase your reviewer to review your code as soon as possible. Resist starting a new task until the review is finished and the PR is merged into the develop branch. Do one thing at a time and move to the next only when the previous one is 100% done.
Reviewers should not accept justifications for bad practices. Code review is the only way to guarantee that the team converges towards excellence. It definitely pays off in the long term. The review process should go back and forth until both parties are satisfied.
You should always come up with smart ways of testing your code. Laziness or “I know it works” approaches should not be accepted. The only code that may not require tests is one-off analysis, since it is supervised by a human and does not go into production.
Code without tests is risky, cannot be safely refactored and cannot be maintained, since unit tests also serve as documentation. If someone changes your code, then you can still be blamed and held responsible for the failure even though your code used to work. Tests are the only way of protecting the validity of your solutions. Time spent on testing is the greatest long-term investment you can make in your project.
If you spot a bug that was not caught by your tests, that is an indicator that a test case is missing. Don’t just fix it: first make sure you have a failing test for it. Debug your code by adding unit tests and breaking down end-to-end methods into smaller composable functions. Debugging by adding unit tests will give you a much safer and more repeatable way to make your code robust.
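A minimal sketch of this "failing test first" workflow, written as plain test functions that a runner such as pytest could collect. The function mean and the bug itself are hypothetical, and returning 0.0 for an empty input is an assumption made only for the example.

```python
def mean(values):
    """Arithmetic mean; returns 0.0 for an empty input.

    The empty-input guard is the bug fix. The earlier (hypothetical)
    version divided by len(values) unconditionally and raised
    ZeroDivisionError on an empty list.
    """
    if not values:
        return 0.0
    return sum(values) / len(values)


def test_mean_of_empty_input_is_zero():
    # Written first, while it still failed, to reproduce the reported bug.
    assert mean([]) == 0.0


def test_mean_of_regular_input():
    # Existing behaviour must stay intact after the fix.
    assert mean([1.0, 2.0, 3.0]) == 2.0
```

Once the regression test is in the suite, the same bug cannot silently come back after a later refactoring.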
Read-eval-print-loop (REPL) debugging is just another type of exploratory analysis. If you want to go that way, remember to turn your manual techniques into automated tests.
Obviously, none of the above problems would exist with test-driven development (TDD).
When you are running out of imagination for manual test cases, or you are tired of adding tests that always succeed, consider also adding a few property-based tests with random generators.
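To give an idea of what a property-based test with a random generator can look like, here is a small sketch that uses only the Python standard library. The function normalise and the chosen properties are hypothetical examples. Dedicated libraries such as Hypothesis automate input generation and the shrinking of failing cases, but they are not needed to illustrate the idea.

```python
import random


def normalise(values):
    """Scale positive values so that they sum to 1 (hypothetical function under test)."""
    total = sum(values)
    return [v / total for v in values]


def random_positive_list(rng, max_len=20):
    """Random generator for inputs: a non-empty list of positive floats."""
    length = rng.randint(1, max_len)
    return [rng.uniform(0.1, 1000.0) for _ in range(length)]


def test_normalise_properties(runs=200, seed=42):
    rng = random.Random(seed)  # a fixed seed keeps any failure reproducible
    for _ in range(runs):
        values = random_positive_list(rng)
        result = normalise(values)
        # Property 1: the output has the same length as the input.
        assert len(result) == len(values)
        # Property 2: the output sums to 1, up to floating-point error.
        assert abs(sum(result) - 1.0) < 1e-9
        # Property 3: every weight lies in (0, 1].
        assert all(0.0 < r <= 1.0 for r in result)
```

Instead of asserting one expected output for one hand-picked input, the test asserts invariants that must hold for every generated input.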
Unit tests are necessary, but it is the whole end-to-end flow that matters. Make sure you have at least a few integration tests in place. Ideally, those integration tests map to real use cases.
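As an illustration, an integration test can feed a small but realistic input through the whole job and check the final output only. The job below (CSV text in, totals per customer out) and all of its names are hypothetical.

```python
import csv
import io


def parse_rows(csv_text):
    """Parse CSV text with 'customer' and 'amount' columns into (str, float) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["customer"], float(row["amount"])) for row in reader]


def total_per_customer(rows):
    """Sum the amounts per customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def run_job(csv_text):
    """End-to-end job: raw CSV text in, per-customer totals out."""
    return total_per_customer(parse_rows(csv_text))


def test_job_end_to_end():
    # The input mimics a real use case: an extract with a repeated customer.
    csv_text = "customer,amount\nalice,10.5\nbob,3.0\nalice,4.5\n"
    assert run_job(csv_text) == {"alice": 15.0, "bob": 3.0}
```

The unit tests for parse_rows and total_per_customer would still exist; this test only checks that the pieces work together.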
We found pair working to be much more productive than working as isolated individuals. A data science team is generally cross-functional, with people ranging from an engineering background to a more theoretical analytics/statistics background. A good rule is to pair individuals with opposite backgrounds and swap their competencies, so that whoever is good at coding does the modelling and vice versa. The code review process still applies as usual even though the code was written together; it may be worth involving someone with no prior knowledge of the project to review the code and methodology.
Functional programming offers a few advantages over other paradigms, and we found it to suit Data Munging and Machine Learning algorithms very well. Just to name a few:
- Implementing any complex logic as a combination of simple, composable functions instead of long and non-reusable methods.
- No state, no side effects: the same code will return the same output at every single call. Much less debugging is needed.
- Close match with math. You can implement any algorithm the same way you read it in academic papers.
- No need to think about how to make your code execute efficiently. Focus on functionality only.
- High level of abstraction: it keeps your brain trained on lateral thinking instead of following mechanical procedures.
- Conciseness: you will be surprised by how many algorithms (single node or distributed) can be implemented in a single line.
- Higher readability: you only need to understand what the functions aim to do, not what the value of each variable represents at each step.
- Concurrency for free. Full parallelism.
- The same code used for local implementations magically scales up in a distributed environment. That means you can prototype locally without having to re-engineer your solution for the big data system.
- Type system: you know which functions can be used and what form the intermediate transformations take. No need for read-eval-print loops or hacky print calls. It is easy to implement, reason about and refactor complex algorithms without introducing bugs.
- No explicit loops, you know how your algorithm is converging via recursion.
- Flat and minimal structure, no need to create tons of classes or verbose notations. You can use anonymous functions, pattern matching and wildcard notations.
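Several of the points above (composition of small functions, no state, and reusing the same logic with a concurrent map) can be sketched in a few lines of Python using only the standard library. The word-count example and all function names are hypothetical. The thread pool is used only to keep the sketch self-contained; whether it brings any real speed-up depends on the workload and the runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def compose(*functions):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)


# Small pure functions: no state, no side effects.
def tokenize(line):
    return line.lower().split()


def count_words(words):
    return reduce(lambda counts, w: {**counts, w: counts.get(w, 0) + 1}, words, {})


def merge_counts(left, right):
    return {k: left.get(k, 0) + right.get(k, 0) for k in {*left, *right}}


line_to_counts = compose(tokenize, count_words)


def word_count(lines, map_fn=map):
    """Map each line to partial counts, then reduce the partial results."""
    return reduce(merge_counts, map_fn(line_to_counts, lines), {})


lines = ["to be or not to be", "that is the question", "To be"]

sequential_result = word_count(lines)

# The same logic mapped concurrently: only the map implementation changes.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_result = word_count(lines, map_fn=pool.map)
```

Because every function is pure, swapping the built-in map for pool.map cannot change the result, which is the same reasoning that lets map/reduce style code move from a laptop to a cluster.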
Popular languages in Data Science are not always natively functional, but most of them offer functional extensions, either built in or through an external library. See for example this project, which brings the functional APIs of Scala to Python collections: http://pedrorodriguez.io/blog/2015/03/14/functional-programming-collections-python/.
If you work in Data Science or Big Data and have never done functional programming before, you should really look into it. You might find it a bit steep at the beginning, but once you master it you will be superbly productive.