Coding practices for data products development

This is part 2 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Coding practices

Code should be developed in a proper IDE that provides, at a minimum, advanced tools for refactoring, auto-completion, syntax highlighting and auto-formatting.

Notebooks should call routine libraries from the main codebase. As soon as code developed in a notebook becomes reusable, it should be moved into the codebase. A rule of thumb might be that no notebook cell should exceed 10 lines; beyond that it either needs refactoring or it should be pulled out. The only exception is long code used exclusively for a one-off investigation that makes no sense outside that particular context.

Do not introduce unnecessary dependencies into the codebase (e.g. plotting libraries). Keep the code repository lean and add dependencies to your particular use case rather than to the project repository.

During development it is recommended to make frequent git commits. When the ticket is ready to go, the developer should first run a git diff develop and review their own code before creating the pull request (PR).

The pull request should only contain the minimum amount of code specified in the corresponding ticket requirements. Do not anticipate functions that you know you will need in the future, even if that future is only a couple of hours away. Avoid abstractions or general-purpose methods. First write working code for your specific use case, then refactor it.

The Agile manifesto says:

“Simplicity–the art of maximizing the amount
of work not done–is essential.”

Make your code structure flat:

  • data containers
  • static classes containing functions/methods/utils
  • entry point classes defining the end-to-end job and putting all of the pieces together

Copy and paste the same code if needed; duplication is not always bad if it makes the design simpler. Only extract methods and abstract classes when you have at least three use cases.

Comments in the code are very likely to become out-of-sync documentation. Clean code, good design and self-explaining names will make your code self-documenting. The only exceptions are TODO, FIXME and annotations explaining why a hack was needed and under which conditions the current implementation might fail. Obviously avoiding hacks in the first place is the best solution, but sometimes we need to cope with them. Abuse TODOs, but do not leave non-working code without annotations.

Extreme attention should be paid to code style and conventions. Badly formatted code or inconsistent patterns make the code very hard to read and maintain.

After the PR is sent for review, chase your reviewer to review your code as soon as possible. Resist starting a new task until the review is finished and the PR is merged into the develop branch. Do one thing at a time and move to the next only when the previous one is 100% done.

Reviewers should not accept justifications for bad practices. Code review is the only way to guarantee the team converges towards excellence, and it definitely pays off in the long term. The review process should go back and forth until both parties are satisfied.

Testing


You should always come up with smart ways of testing your code. Laziness or “I know it works” attitudes should not be accepted. The only code that may not require tests is one-off analysis, since it is manually supervised and is not going into production.

Code without tests is risky: it cannot be refactored and cannot be maintained, since unit tests also serve as documentation. If someone changes your code, you can still be blamed and held responsible for the failure even though your code used to work. Tests are the only way of protecting the validity of your solutions. Time spent on testing is the greatest long-term investment you can make for your project.

If you spot a bug that was not caught by your tests, that is an indicator that a test case should be added. Don’t just fix it; make sure you first have a failing test for it. Debug your code by adding unit tests and breaking down end-to-end methods into smaller composable functions. Debugging by adding unit tests gives you a much safer and more repeatable way to make your code robust.
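As a toy illustration (the functions below are made up for this example and are not part of any real codebase), an end-to-end method can be broken into small pure functions that are each pinned down by a cheap unit test instead of a debugger session:

def parseAmount(raw: String): Option[Double] =
  scala.util.Try(raw.trim.toDouble).toOption

def totalSpend(rawAmounts: Seq[String]): Double =
  rawAmounts.flatMap(parseAmount).sum

// Each piece can now be verified in isolation
assert(parseAmount(" 12.5 ") == Some(12.5))
assert(parseAmount("n/a").isEmpty)
assert(totalSpend(Seq("1.0", "oops", "2.0")) == 3.0)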

Read-eval-print-loop (REPL) debugging is just another type of exploratory analysis; if you want to go down that route, remember to turn your manual checks into automated tests.

Obviously, none of the above problems would exist in the case of TDD.

When you are running out of ideas for manual test cases, or you are tired of adding tests that always succeed, consider also adding a few property-based tests with random generators.
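As a minimal sketch, assuming ScalaCheck as the property-testing library (the properties below are generic examples, not project code):

import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object ReverseSpec extends Properties("reverse") {
  // Each property is checked against many randomly generated lists
  property("reversing twice gives back the original list") =
    forAll { xs: List[Int] => xs.reverse.reverse == xs }

  property("reverse preserves the size") =
    forAll { xs: List[Int] => xs.reverse.size == xs.size }
}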

Unit tests are necessary, but it is the whole end-to-end flow that matters. Make sure you have at least a few integration tests in place; ideally those integration tests map to real use cases.

Pair working

We found pair working to be much more productive than working as isolated individuals. A data science team is generally cross-functional, with people ranging from a more engineering background to a more theoretical analytical/statistics background. A good rule is to pair opposite individuals together and swap their competencies, so that whoever is good at coding does the modelling and vice versa. The code review process still applies as usual even though the code was written together; it might be worth involving someone else with no prior knowledge of the project to review the code and methodology.

Functional Programming

Functional programming offers a few advantages over other paradigms, and we found it to suit Data Munging and Machine Learning algorithms very well. Just to name a few (a short example follows this list):

  • Implementing any complex logic as a combination of simple first-order functions instead of long, non-reusable methods.
  • No state, no side effects: the same code returns the same output at every single call. No debugging is needed.
  • Close match with maths: you can implement an algorithm the same way you read it in academic papers.
  • No need to think about how to make your code execute efficiently; focus on functionality only.
  • High level of abstraction: keep your brain trained on lateral thinking instead of following mechanical procedures.
  • Conciseness: you will be surprised by how many algorithms (single node or distributed) can be implemented in a single line.
  • Higher readability: you only need to understand what the functions aim to do, not what the value of each variable represents at each step.
  • Concurrency for free at no extra cost. Full parallelism.
  • The same code written for a local implementation magically scales up in a distributed environment, meaning you can prototype locally without having to re-engineer your solution for the big data system.
  • Type system: you know which functions can be used and what form the intermediate transformations take. No need for read-eval-print loops or hacky print calls. It is easy to implement, reason about and refactor complex algorithms without introducing bugs.
  • No explicit loops: you know how your algorithm converges via recursion.
  • Flat and minimal structure: no need to create tons of classes or verbose notation. You can use anonymous functions, pattern matching and wildcard notation.
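For instance, here is a word count expressed as a pipeline of higher-order functions on a plain Scala collection (a toy example; the same shape maps almost one-to-one onto a Spark RDD):

val lines = List("the quick brown fox", "jumps over the lazy dog")

val wordCounts: Map[String, Int] =
  lines.flatMap(_.split("\\s+"))   // split each line into words
       .groupBy(_.toLowerCase)     // group occurrences of the same word
       .map { case (word, occurrences) => word -> occurrences.size }  // count them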

Popular languages in Data Science are not always natively functional, but most of them offer functional extensions, either natively or through external libraries. See for example this project introducing Scala-style functional APIs for Python collections: http://pedrorodriguez.io/blog/2015/03/14/functional-programming-collections-python/.

If you work in Data Science or Big Data and have never done functional programming before, you should really look into it. You might find the learning curve a bit steep at the beginning, but once you master it you will be superbly productive.


The ScrumBan Jira board

This is part 1 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Agile board

Let’s start with one of the core tools of the agile workflow. We use a Jira board for tracking and organising all of our projects. We developed a custom board which uses the sprint concept of Scrum but in a more flexible way, as in Kanban.

[Screenshot: the ScrumBan Jira board]

The ScrumBan board is configured as follows:

  • Horizontally divided into swimlanes (top-down in order of priority):
    • Critical / Blockers
    • Current work
    • Stories backlog
    • Sub-tickets backlog
    • Completed
  • The columns are:
    • To do
    • In progress
    • In review
    • Done / resolved
    • You can optionally have “Ready to release”
  • Quick filters should include at least one filter per team member, filtering on their own assigned tickets.

The idea is that during planning you select from the backlog the high-level stories you want to deliver by the end of the sprint (typically two weeks long) and then create subtasks as you need them.
The reason is that in data science you don’t know beforehand what you are about to implement. You need to investigate-implement-test all the time, and as you do it you discover what to do next. What matters is that whatever subtask is created is done by the end of the sprint, so that the story is completed.

Define stories with a clear goal and a small scope. They should not span multiple sprints, and since they come with uncertainty about which tasks will be required, you really need to break a big problem into smaller, well-defined problems that can be accomplished no matter what.

Avoid having tasks for exploratory analysis or for adding unit tests. Each task should bring some value, potentially a new feature. Each task will then require an exploratory analysis as well as some development and testing. Those steps are already part of the definition of “Done”. See the sections below for more explanations about tests and exploratory analysis.

Always plan less than your capacity. Delivering your stories a few days early is a very good sign; delaying them is bad. If you manage to get your work done by Thursday, spend the whole of Friday in a pub celebrating your amazing delivery.

In Jira you must assign each story to one individual, but remember that in an agile team the whole team either succeeds or fails. If that person does not manage to finish their tasks on time, it is a team failure. That’s what the morning stand-up is for: making sure everything is under control and team resources are allocated so that the sprint will be successful.

Never change the scope of your sprints or add tasks that were not planned, unless they are required hotfixes. If you are asked to do something else, invite the product owners to join your next sprint planning; only then can you allocate resources for them.
Remember, the goal of a sprint is to have a working, even if simplistic, deliverable, not to solve sparse tasks.

At the end of the sprint, hold a retrospective meeting to discuss what went well and what did not. Make sure to take actions to prevent the same blockers from appearing again in the future.

Documentation

Documentation should be as simple as possible.

  • Release notes: a page where you note the major changes since the previous version, the list of new tickets that have been merged (linking to Jira) and a link to a more detailed report.
  • The detailed report contains snapshots of the most recent logs, results, observations, limitations, assumptions and performance of the model/ETL/application. Often it contains some charts that quickly explain how good the product is. We use those detailed but concise reports to track how the product is evolving. The detailed release report also contains the help messages explaining how to run the application and all of the command line interface (CLI) options.
    If all of your tests and procedures are fully automated, then this page is simply a copy and paste of the results.
  • The usage of a particular job class or script, with the list of CLI arguments and default values, is also accessible via the --help argument; many libraries help you do that (bash getopts, Scala Scallop… see the sketch after this list).
  • Other pages are used to explain the complex parts of the logic. Limit those pages to cases where the logic is very complicated and hard to understand by just reading the code.
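As a minimal sketch of the --help idea, assuming Scallop as the CLI library (option names and defaults are made up for the example):

import org.rogach.scallop._

class JobConf(args: Seq[String]) extends ScallopConf(args) {
  // Each declaration contributes to the auto-generated --help message
  val input = opt[String]("input", required = true, descr = "Path to the input events")
  val minDate = opt[String]("min-date", default = Some("2015-01-01"), descr = "Start date (ISO format)")
  verify()
}

val conf = new JobConf(Seq("--input", "/data/events", "--min-date", "2015-06-01"))
println(conf.input()) // /data/events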

Documentation is hard to keep in sync; that’s why we prefer to document what’s new since the last release rather than going through the whole wiki and updating every single page.

Ideally the documentation comes from the source code, unit tests and Jira tickets. Individual analyses, findings and insights can be documented separately, but they should be treated as static reports rather than project documentation.

In the hierarchical structure of the pages we limit the maximum depth to 2, which means we have root-level pages with at most one level of child pages. Nested structures make it very hard to find content when you need it.

Branching and versioning

Code should always and only exist in a git repository. Sparse snippets or random script files should be avoided.

We follow the gitflow branching model, where each ticket maps to a feature branch. If you integrate Jira with Stash, you can automatically create the corresponding branch in the repository from the ticket web page, using develop as the base branch.

You do not need to use the complete gitflow branching model, but at least have master, develop and feature branches. It’s up to the deployment strategy to define how to handle hotfix, bugfix and release branches. Make sure this strategy is clearly defined and consistently enforced. See deployment.

Story tickets generally don’t have an associated branch; their sub-tasks do.

Install a git hook so that every commit message is prefixed with the ticket code (which you can parse out of the branch name). Tracking each commit with the corresponding ticket is a life-saver when, in the future, you try to reverse engineer what a method is doing and why it was created in the first place. You can then go through the whole git history and access the corresponding tickets that touched that piece of code.

Discussions

Discussions of specific tasks should go into the corresponding Jira ticket web page. This keeps the conversation public and tracked, and anyone can jump into the discussion with the full context available. Reference files or supporting documents should also be attached to the Jira ticket itself, or to the wiki if they serve a general purpose. Remember that each Jira ticket can be linked from the releases wiki page, which means we never lose track of them. Moreover, the query engine is quite good.

We found emails to be the worst place for discussions to happen, especially for sharing files that will become soon out-of-date.

When someone sends you an Excel file, reply saying that your laptop does not have an Office installation on it. If you are sharing small data files, TSV or JSON is the way to go.
Avoid comma-separated files with quotes wrapping text fields. You want your file to be editable using simple bash commands rather than having to load it into a CSV parsing library.

We also tried mounted shared drives, but Confluence is a much better collaborative way to share and organise files, with integrated version control and metadata.

Avoid meetings as much as you can: invent some excuse, ask for a clear agenda beforehand. Educate your colleagues to communicate with you by raising issues. Reserve meetings for important discussions and spend your meeting time presenting to and checkpointing with your stakeholders more frequently.

 

 


Functional Data Validation using monads and applicative functors


ETL is probably the most time-consuming part of every Data Science project. The quality of the extracted and crunched data is one of the major factors affecting the final results. In fact, real-world data is always messy and inconsistent. Data validation is a must for enforcing the correctness of the proposed solution and making sure the underlying data represents the true business scenario.

When performing a data validation, the following issues often arise:

  • We want to keep track of how much information we lose, for debugging and reporting purposes.
  • Sometimes we want to cleanse invalid data instead of filtering it out.
  • Part of the validation logic depends on the project requirements and/or model assumptions. These change often, and the refactoring may introduce bugs.

In this tutorial we show how to use monads, applicative functors and other functional programming concepts to safely and elegantly define the validation logic using a modular pattern. Each rule is defined individually and the final logic is built using two types of composition:

  • Monad composition: one rule after the other; if one fails, the next rule is not applied.
  • Applicative composition: all of the rules are applied independently and the validation results are collected and merged together.

Moreover, data that does not pass the validation tests is not discarded but moved into a separate pipeline, with all of the needed metadata attached to it explaining why this particular record was rejected. This allows us to:

  • Log all of the specific causes of data loss.
  • Easily recover previously invalidated data if the validation rules change.
  • Re-use part of the discarded data further down in the data pipeline. The whole ETL workflow is aware of what has been discarded before.

Spark, Scala, Scalaz and Sparkz

The tutorial is part of the open source project Sparkz, which aims to extend the Apache Spark framework by providing more functional APIs. The implementation is in Scala and leverages Scalaz, the framework that inspired Sparkz.

Scalaz provides a Validation data structure that is similar to Either (where an object can be either Left/Failure or Right/Success), but it is not a monad: it is an applicative functor, because instead of chaining the result from one step to the next, Validation validates all of them. Since in case of failure there must be at least one error message, we enforce the failure type to be a non-empty list of error messages. For this purpose Scalaz already provides the ValidationNel data structure, which accumulates all of the error messages into a Nel (non-empty list).

See this page for documentation: http://eed3si9n.com/learning-scalaz/Validation.html
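To make the idea concrete, here is a minimal Scalaz sketch (the values and error messages are made up) showing how applicative composition accumulates every failure instead of stopping at the first one:

import scalaz._, Scalaz._

val ok: ValidationNel[String, Int] = 42.successNel[String]
val ko1: ValidationNel[String, Int] = "negative value".failureNel[Int]
val ko2: ValidationNel[String, Int] = "value out of range".failureNel[Int]

// The function is applied only if all inputs are Success,
// otherwise all of the failures are accumulated into a NonEmptyList
val combined: ValidationNel[String, Int] = (ok |@| ko1 |@| ko2)(_ + _ + _)
// => Failure(NonEmptyList("negative value", "value out of range"))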

The concepts and methodology can be applied to any data computation framework and programming language. You will just have to re-implement yourself part of the boilerplate code shown in this tutorial that does all of the magic for you.

The user/events data validation use case

The use case we are using for this example is a simple data type consisting of a triple of userId, eventCode and timestamp:


case class UserEvent(userId: Long, eventCode: Int, timestamp: Long)

Each UserEvent can either be marked as correct or as invalid. For the latter case we will wrap it into another case class InvalidEvent containing the invalid event as well as some meta information regarding the error cause:

sealed trait InvalidEventCause

case class InvalidEvent(event: UserEvent, cause: InvalidEventCause)

The goal is to build a function that takes a UserEvent and returns a ValidationNel of either the correct event or the non-empty list of all of the failure causes:


UserEvent => ValidationNel[InvalidEvent, UserEvent]

The reason why we want to return ValidationNel[InvalidEvent, UserEvent] instead of ValidationNel[InvalidEventCause, UserEvent] is that we want to keep the original datum for possible later recovery instead of only storing the error causes. This implies that the same object is duplicated multiple times, which is not efficient, but we are not addressing optimisation issues in this tutorial; we will leave them for future posts.

Validation Rules

The easiest way to define each rule was via partial functions that map a UserEvent into an InvalidEventCause. A partial function is a function that is only defined for a subdomain of its input arguments. In our case it is a function which tries to invalidate a datum and is not defined for correct records. The full validation logic will be expressed as a List of partial functions such as:


val validationRules: List[PartialFunction[UserEvent, InvalidEventCause]]

In order to reduce the boilerplate code the PartialFunction returns an InvalidEventCause and then our implicit logic will wrap it together with the original UserEvent object into an InvalidEvent container.
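For readers less familiar with Scala, this is what a standalone partial function looks like and how it behaves outside its domain (a toy example, unrelated to the project code):

val negative: PartialFunction[Int, String] = {
  case n if n < 0 => s"$n is negative"
}

negative.isDefinedAt(-1) // true
negative.isDefinedAt(3)  // false
negative(-1)             // "-1 is negative"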

Some rules are simply pre-defined, such as checking that a timestamp is in a min-max range or that the userId is in a whitelist, and so on. Others are more complicated and are derived from the underlying raw data (before validation).
The method generating the final validation function takes as arguments the RDD with the raw data plus a bunch of parameters and objects used to define the individual rules:

def validationFunction(events: RDD[UserEvent],
 eligibleUsers: Set[Long],
 validEventCodes: Set[Int],
 blackListEventCodes: Set[Int],
 minDate: String, maxDate: String): UserEvent => ValidationNel[InvalidEvent, UserEvent] 

In order to compile the code snippets you will have to add some dependencies via the following imports:

 
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.joda.time.{DateTime, Interval, LocalDate}
import sparkz.utils.Pimps._
import scalaz.Scalaz._
import scalaz.ValidationNel

First we grab the SparkContext from the events RDD:


val sc = events.context

Valid event code

We want to filter out all of the events whose code does not belong to the validEventCodes set.

 

case object NonRecognizedEventType extends InvalidEventCause

val validEventCodesBV: Broadcast[Set[Int]] = sc.broadcast(validEventCodes)
val notRecognizedEventCode: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if !validEventCodesBV.value.contains(event.eventCode) => NonRecognizedEventType
}

We could have enclosed the set directly in the partial function, but we preferred to broadcast it and retrieve it using the value API.

Why to use broadcast variables in Spark is explained here: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

Eligible users

Just like event codes we want to make sure that we only select data from a pool of predefined eligible users:


case object NonEligibleUser extends InvalidEventCause

val eligibleUsersBV: Broadcast[Set[Long]] = sc.broadcast(eligibleUsers)
val customerNotEligible: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if !eligibleUsersBV.value.contains(event.userId) => NonEligibleUser
}

N.B. if the eligibleUsers set is too large, you cannot broadcast it as a shared variable; you would rather turn it into a paired RDD and use the userId as the key for a join. If you want to preserve the information about why you discarded a particular user, you will have to perform an outer join instead of an inner join.
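A hypothetical sketch of that join-based alternative (not part of the original code; in practice a set too large to broadcast would probably already live in an RDD rather than in a driver-side Set):

val eligibleUsersRDD: RDD[(Long, Unit)] =
  sc.parallelize(eligibleUsers.toSeq).map(userId => (userId, ()))

// The left outer join keeps every event and tells us whether its user is eligible
val eligibilityChecked: RDD[(UserEvent, Option[InvalidEventCause])] =
  events.keyBy(_.userId)
    .leftOuterJoin(eligibleUsersRDD)
    .map { case (_, (event, eligibility)) =>
      event -> (if (eligibility.isEmpty) Some(NonEligibleUser) else None)
    }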

Blacklist users

This logic is slightly more complicated: we want to filter out all of the events of those users for which we observed at least one event in the black list. We will first have to scan the raw dataset in order to build the set of blacklisted user ids.


case object BlackListUser extends InvalidEventCause

val blackListEventCodesBV: Broadcast[Set[Int]] = sc.broadcast(blackListEventCodes)
// Users for which we observed a black list event
val blackListUsersBV: Broadcast[Set[Long]] = sc.broadcast(
  events.filter(event => blackListEventCodesBV.value.contains(event.eventCode))
  .map(_.userId).distinct().collect().toSet
)
val customerIsInBlackList: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if blackListUsersBV.value.contains(event.userId) => BlackListUser
}

We first run distinct() to reduce the size of the RDD before calling collect() and turning it into a set.

Timestamp out of global interval

We specified minDate and maxDate as strings in ISO format and we want to filter out all of the timestamps outside this range. For this logic we don’t need any pre-computation; we can directly implement it as:


case object OutOfGlobalIntervalEvent extends InvalidEventCause

val eventIsOutOfGlobalInterval: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if !new Interval(DateTime.parse(minDate), DateTime.parse(maxDate)).contains(event.timestamp) =>
    OutOfGlobalIntervalEvent
}

 

We use joda-time to parse strings and timestamp epoch numbers into more manageable classes.

First day to consider of the user

This time we want a stricter rule regarding the event timestamps. We want to avoid border effects by removing all of the events on the date on which we started observing the first event of a particular user: that first date may be incomplete, not contain all of the events, and thus invalidate our data assumptions. Since we also introduced the concept of a global min/max interval, the first date to consider for a particular user is the max between the first date we observed in the data and the start of the global interval.


case object FirstDayToConsiderEvent extends InvalidEventCause

// max between first date we have ever seen a customer event and the global min date
val customersFirstDayToConsiderBV: Broadcast[Map[Long, LocalDate]] =
  sc.broadcast(
    events.keyBy(_.userId)
    .mapValues(personalEvent => new DateTime(personalEvent.timestamp).toLocalDate)
    .reduceByKey((date1, date2) => List(date1, date2).minBy(_.toDateTimeAtStartOfDay.getMillis))
    .mapValues(firstDate => List(firstDate, LocalDate.parse(minDate)).maxBy(_.toDateTimeAtStartOfDay.getMillis))
    .collect().toMap
  )
val eventIsFirstDayToConsider: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if customersFirstDayToConsiderBV.value(event.userId).isEqual(new DateTime(event.timestamp).toLocalDate) =>
    FirstDayToConsiderEvent
}

We reduced the events RDD into the minimum date for each userId using the reduceByKey monoidal aggregation. Then we took the max between that minimum date and the global one.

Validation rules composition

All we have to do is take the individually defined partial functions (which in Scala are objects like everything else) and put them into a List.

We realised that eventIsFirstDayToConsider depends on eventIsOutOfGlobalInterval. If an event is filtered out because it is outside the global interval, there is no need to put it through the first-day-to-consider rule. Thus we can compose the two rules monadically using the orElse method on the first partial function, which takes the second partial function as argument and applies it if and only if the first one is not defined (in our case, if the event is not already outside the global min-max range). The orElse method returns a new partial function which will be treated as a single validation rule and internally combines the two of them.

val validationRules: List[PartialFunction[UserEvent, InvalidEventCause]] =
  List(customerNotEligible, notRecognizedEventCode, customerIsInBlackList,
    eventIsOutOfGlobalInterval.orElse(eventIsFirstDayToConsider)
  )

The order of the validation rules does not matter, since all of them will be applied independently even if, computationally, they are applied sequentially. If you want to apply them in parallel, you can use the par method of a list, which turns it into a parallel collection.

The final validation function will be generated by a couple of syntactic sugar implicits that we implemented in Sparkz.


(event: UserEvent) => validationRules.map(_.toFailureNel(event, InvalidEvent(event, _))).reduce(_ |+++| _)

The magic operators

In the list of imports we specified:

import sparkz.utils.Pimps._

Pimps are a nice pattern used in Scala to implicitly add methods to existing classes (the act of “pimping”). It is particularly useful when you want to use postfix notation to call a method on a class that does not expose that method.

In order to implement our validation pattern we had to pimp two classes: PartialFunction and ValidationNel. The pimped methods hide the boilerplate logic computed behind the scenes.

implicit class PimpedPartialFunction[X, E](pf: PartialFunction[X, E]) {
  def toFailureNel[W](x: X, toW: E => W = identity _): ValidationNel[W, X] =
    pf.andThen(e => toW(e).failureNel[X]).applyOrElse(x, (_: X).successNel[W])

  def toFailureNel(x: X): ValidationNel[E, X] = toFailureNel(x, identity)
}

The toFailureNel method attempts to apply the partial function (X => E) to the element x and, if the function is defined, returns a failureNel (a non-empty list containing a single failure of type E) where the failure object is the one returned by the original function. If the function is not defined, the pimped method creates a successNel of type X.

The more general method also takes an extra argument toW that converts the error object e returned by the original function (if defined) and wraps it into another type W. It acts as a functor for the failure case. This method allows us to encapsulate the original event that generated the error together with its cause.

In our use case the generics types of X, E and W are:

X: UserEvent
E: InvalidEventCause
W: InvalidEvent

Once we have converted the partial functions into ValidationNel instances, we need to reduce them into a single one. Scalaz provides a binary monoid operator, +++, that takes two ValidationNel instances and merges them together. In case of failures, it appends all of the failures into a single list of type E. In case of success results, it applies an external monoid operator on type X. In other words, that operator knows how to reduce failures by concatenating the errors, but it requires you to specify how to combine the correct results when all of them are Success.

Since in our use case we are not transforming the correct data (we either discard it or keep it as it is), the monoid operator is straightforward. We assume that the success results always contain the same original object, i.e. they never transform it. Thus the monoid operator simply returns the object itself, the two objects to reduce being copies of each other.

Our PimpedValidationNel provides a simplified operator that does not require a semigroup to be defined for the type X:


implicit class PimpedValidationNel[E, X](x1: ValidationNel[E, X]) {
  // Extension of scalaz.Validation.+++ operator, does not require the semigroup defined for X
  def |+++|(x2: ValidationNel[E, X]) = x1 match {
    case Failure(a1) => x2 match {
      case Failure(a2) => Failure(a1 append a2)
      case Success(b2) => x1
    }
    case Success(b1) => x2 match {
      case b2@Failure(_) => b2
      case Success(b2) if b1 == b2 => Success(b1)
      case Success(b2) => throw new IllegalArgumentException(s"$b1 not equals to $b2")
    }
  }
}

This operator also allows us to merge together results coming from different validation pipelines. If we have two RDDs with the same ValidationNel generic types, we could theoretically join them and merge the results of the same objects. This operation is very expensive though; it is much preferable to combine the lazy functions that generate the validation objects and apply the final combined function to each record in a single pass over the dataset.

What you can do with the ValidationNel API

The simplest thing would be getting only the valid records:

def onlyValidEvents(events: RDD[UserEvent],
                    validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): RDD[UserEvent] =
  events.map(validationFunc).flatMap(_.toOption)

Or the opposite: getting only the invalid events and flat-mapping them into their corresponding error-wrapping class:

def invalidEvents(events: RDD[UserEvent],
 validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): RDD[InvalidEvent] =
 events.map(validationFunc).flatMap(_.swap.toOption).flatMap(_.toList)

Suppose we want to extract all of the original events that failed because of a particular cause, for instance when their timestamp was out of range:

def outOfRangeEvents(events: RDD[UserEvent],
                     validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): RDD[UserEvent] =
  events.map(validationFunc).flatMap(_.swap.toOption).flatMap(_.toSet).flatMap {
    case InvalidEvent(event, OutOfGlobalIntervalEvent) => event.some
    case _ => Nil
  }

N.B. We are exploiting the implicit conversion from Option to Iterable to combine a filter and a map into a single flatMap operation.
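Outside Spark the same trick looks like this (a toy example):

val numbers = List(1, 2, 3, 4, 5)

// Keep only the even numbers and double them, in a single flatMap pass
val doubledEvens = numbers.flatMap(n => if (n % 2 == 0) Some(n * 2) else None)
// => List(4, 8)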

Now, suppose we would like to print a debug message with the count of invalid events grouped by the set of their error causes.

def causeSetToInvalidEventsCount(events: RDD[UserEvent],
                                 validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): Map[Set[InvalidEventCause], Int] =
  events.map(validationFunc)
  .map(_.swap).flatMap(_.toOption).map(_.map(_.cause).toSet -> 1)
  .reduceByKey(_ + _)
  .collect().toMap

The above method will return a map that looks like:

Map(Set(NonEligibleCustomer, NonRecognizedEventType) -> 36018450,
Set(NonEligibleUser) -> 9037691,
Set(NonEligibleUser, BlackListUser, NonRecognizedEventType) -> 137816,
Set(NonEligibleUser) -> 464694973,
Set(BeforeFirstDayToConsiderEvent, NonRecognizedEventType) -> 5147475,
Set(OutOfGlobalIntervalEvent, NonRecognizedEventType) -> 983478).

Note that we are not counting by individual cause but grouping by the combination of causes that co-occurred. This gives us much more debugging power with no loss of information, as opposed to the traditional sequential monadic validation where only the first cause would be recorded.

Moreover, what we might really be interested in is how many users we lost as an effect of the event validation; in other words, how many users were left with no events after validation.

def causeSetToUsersLostCount(events: RDD[UserEvent],
                             validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): Map[Set[InvalidEventCause], Int] = {
  val survivedUsersBV: Broadcast[Set[Long]] =
    events.context.broadcast(events.map(validationFunc).flatMap(_.toOption).map(_.userId).distinct().collect().toSet)

  events.map(validationFunc).flatMap(_.swap.toOption)
  .keyBy(_.head.event.userId)
  .filter(_._1 |> (!survivedUsersBV.value(_)))
  .mapValues(_.map(_.cause).toSet)
  .mapValues(Set(_))
  .reduceByKey(_ ++ _)
  .flatMap(_._2)
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .collect().toMap
}

What the above code does is the following:

  1. Compute the set of “survived” users from the correct events after validation.
  2. Filter only the invalid events of the users who did not survive.
  3. Each list of failures is turned into a Set of failures (so that the order does not matter).
  4. All of the cause sets are grouped by userId and de-duplicated, so that each combination of causes appears only once per userId.
  5. Count for how many non-surviving users each cause set appears.
  6. Return a map from cause set to an integer representing the count of lost users.

It should return something like:

Map(Set(NonEligibleCustomer, NonRecognizedEventType) -> 1545,
Set(NonEligibleUser) -> 122,
Set(NonEligibleUser, BlackListUser, NonRecognizedEventType) -> 3224,
Set(NonEligibleUser) -> 4,
Set(BeforeFirstDayToConsiderEvent, NonRecognizedEventType) -> 335,
Set(OutOfGlobalIntervalEvent, NonRecognizedEventType) -> 33)

Conclusions

In this tutorial we showed how we can extend Scalaz to provide a functional and elegant way of applying validation rules to clean a raw dataset, and how the whole logic is parallelisable and scalable using the Apache Spark framework. The main advantage of this approach is that we never lose information: we are able to split the data into two pipelines (valid/invalid) and mark each invalid record with some metadata. We presented a simple tutorial for a common use case, showing how to define validation rules and how to use the API of the generalised ValidationNel objects to perform debugging and cleansing tasks.

The source code is available at: https://github.com/gm-spacagna/sparkz/blob/master/src/examples/scala/sparkz/DataValidation.scala.

The whole procedure is not fully optimised for efficiency: the creation of immutable objects may add unnecessary overhead for the garbage collector, and the way we wrap the original datum in case of failure creates a lot of duplicated clones of the same object. We leave the task of optimising it to future blog posts.

We hope that this tutorial will inspire Data Scientists and Engineers, regardless of the language and/or technology stack, to approach their coding in a more functional way. Functional programming offers the elegance and conciseness of implementing arbitrarily complicated logic as a simple combination of reusable higher-order functions, as opposed to the classic imperative programming paradigm. We found this way of writing code much more suitable for implementing maths and data transformation algorithms.

***

Similar articles about Data Validation using Scalaz:

https://github.com/FranklinChen/data-validation-demo
https://www.innoq.com/en/blog/validate-your-domain-in-scala/


Mapping DataFrame to a typed RDD

I have recently published a blog post on DZone, “Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds”, which describes the workflow and methodology that we use at Barclays to load data from the raw source (relational database) into the Data Science cluster (Spark). One of the described components is the mapping from a DataFrame to a typed RDD of a custom case class.

There are a bunch of reasons why you would want to make your DataFrame typed; the following is a summary:

[Table: DataFrame vs. RDD comparison]

Examples of when it is more convenient to use DataFrame vs. RDD can be found in this workshop: WordPress Blog Posts Recommender.

In this tutorial I have pulled out from the Tachyon blog post the part related to the conversion from DataFrame to RDD. The inverse conversion, RDD to DataFrame, is straightforward and can be found in the recommender workshop mentioned above.

Typed Case Class Mapping

After we have constructed the DataFrame from the raw source, we can map it into an RDD of our ad-hoc case classes. Since a DataFrame is also an RDD of type org.apache.spark.sql.Row, it already provides the map/flatMap methods.

If there are no null values in any row, we could use pattern matching to extract each column from the Row object:

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)
dataframe.map {
  case Row(a: java.math.BigDecimal, b: String, c: Int, d: String, _: java.sql.Date,
           e: java.sql.Date, _: java.sql.Timestamp, _: java.sql.Timestamp, _: java.math.BigDecimal,
           _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
}

This approach will fail for null values due to the explicit type casts on each field in the unapply method of the Row class. You can discard all the rows containing null values by doing:

dataframe.na.drop()

But that will drop records even if the null fields are not the ones we use in our case class.
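If only a subset of the columns matters, you can restrict the drop to those columns (the column names below are placeholders):

// Drop rows only when nulls appear in the columns we actually map
dataframe.na.drop(Seq("a", "b", "c"))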

If you want to handle it using Scala options you could turn the Row object into a List and then use the following pattern:

case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)
dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int, d: String, _: java.sql.Date,
            e: java.sql.Date, _: java.sql.Timestamp, _: java.sql.Timestamp, _: java.math.BigDecimal,
            _: String) => MyClass(a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
})

If the columns you are interested in are sparse, then you could fetch them individually, either by index or by column name:

row.getAs[SQLPrimitiveType](columnIndex: Int)
row.getAs[SQLPrimitiveType](columnName: String)

For the mapping between SQL primitive types and their corresponding Java/Scala classes, see: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html.
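A small helper along these lines (hypothetical, not from the original post) makes nullable columns explicit by returning an Option:

import org.apache.spark.sql.Row

// Read a possibly-null column into an Option, by column name
def getOption[T](row: Row, column: String): Option[T] = {
  val i = row.fieldIndex(column)
  if (row.isNullAt(i)) None else Some(row.getAs[T](i))
}

// e.g. c = getOption[Int](row, "c") when building MyClass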

N.B. The described procedure does not take advantage of the recently released DataSet API (http://spark.apache.org/docs/1.6.0/sql-programming-guide.html#datasets), which should automate the whole process of converting between DataFrames and RDDs. At the time we wrote this note we had not yet tested DataSet. Also, there are open-source projects like Frameless (https://github.com/adelbertc/frameless) and an ongoing discussion on its Gitter channel about how to leverage the awesome Shapeless (https://github.com/milessabin/shapeless) library to make Spark more functional and compile-time type-safe.

Similar articles:

Type safety on Spark Dataframes: http://www.51zero.com/blog/2016/2/24/type-safety-on-spark-dataframes-part-1


Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are:

  1. Data stored in non-scalable infrastructure for analysis and processing
  2. Data governance and security policies

1. Data often resides in central data warehouses and RDBMSs on which many legacy applications and analysts depend.
Data Scientists, instead, cannot build their models or perform exploratory analysis using SQL queries. They need the data to be available in a scalable, programmatic and reactive stack such as Hadoop and Apache Spark, and to develop their logic using languages such as Python, R, Scala… (for a comparison of Python and Scala for Spark, see this post: 6 points to compare Python and Scala for Data Science using Apache Spark).
2. Nevertheless, data cannot simply be transferred (in technical terms, sqoop-ed) to a Hadoop cluster without incurring tedious bureaucracy, ingestion inconsistencies and strict policies. In big corporations that translates to at least a month to decide which tables are interesting and a few more months to write the ETL logic, move the data and test the consistency.

At Barclays we developed a stack to logically map the raw data from the central data warehouse into Spark and use Tachyon to keep the data in memory for long-term availability. With such a stack we are able to iterate fast, with immediate data availability from a scalable Big Data cluster, by skipping the data ingestion process while still complying with all of the data policies.

Tachyon was the key enabling technology for us.

Our workflow iteration time decreased from hours to seconds. Tachyon enabled something that was impossible before.

You can find the original article published on DZone in collaboration with Gene Pang, Software Engineer at Tachyon Nexus and Haoyuan Li, CEO of Tachyon Nexus:
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds


6 points to compare Python and Scala for Data Science using Apache Spark

Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working with large datasets. It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach. I personally believe that Spark will inevitably become the de facto Big Data framework for Machine Learning and Data Science.

Despite the different opinions about Spark, let’s assume that a data science team wants to start adopting it as its main technology. The choice of programming language is often a dilemma. Shall we build our models in Python or in Scala? Shall we run the exploratory analysis using the iPython notebook or iScala?
A common understanding is that Python is the scientific language and Scala is an engineering language, seen as a better replacement for Java. Whilst there is truth in that, it does not always have to be the case.

Since the two languages have already been compared in detail elsewhere, I would like to restrict the comparison to the particular use case of building data products leveraging Apache Spark in an agile workflow.

In particular, I can identify 6 important aspects that a Data Science programming language in this context should provide:

  1. Productivity
  2. Safe refactoring
  3. Spark integration
  4. Out-of-the-box Machine Learning/Statistics packages
  5. Documentation / Community
  6. Interactive Exploratory Analysis and built-in visualization tools

Why only Scala and Python?
Apache Spark comes with 4 APIs: Scala, Java, Python and, recently, R. The reason why I am only considering “PyScala” is that these two mostly provide similar features to the other two languages (Scala over Java and Python over R) with, in my opinion, a better overall score. Moreover, R is not a general-purpose language and its API is still in an experimental phase.

1. Productivity

Even though coding close to the bare metal always produces the most optimised results, premature optimisation is known to be the root of all evil. Especially in the initial MVP phase, we want to achieve high productivity with the fewest possible lines of code, ideally guided by a smart IDE.

Python is a very simple to learn and highly productive language for getting things done quickly, from day one. Scala requires a little more thinking and abstraction due to its high-level functional features, but as soon as you get familiar with them your productivity will boost dramatically. Code conciseness is quite comparable; both can be very concise depending on how good you are at coding. Reading Python is more explicit: it shows you step by step what your code is executing and the state of each variable. Scala, on the other hand, focuses more on describing what you are trying to achieve as the final result, hiding most of the implementation details and execution order. But remember: with great power comes great responsibility. Whilst pattern matching is a very cool way to extract variables, advanced features like implicits or custom DSLs can be confusing to the non-expert user.

In terms of IDEs, both IntelliJ and PyCharm are smart and productive environments. Nevertheless, Scala can take advantage of types and compile-time cross-references, which provide some extra functionality more naturally and without ambiguity, unlike scripting languages. Just to name a few: finding classes/methods by name in the project and linked dependencies, finding usages, auto-completion based on type compatibility, development-time errors and warnings.
On the other hand, all of those compile-time features come with a cost: IntelliJ, sbt and all of the related tools are very slow and memory/CPU hungry. You shouldn’t be surprised if 2GB of your RAM is allocated just to open multiple parallel projects in Scala. Python is more lightweight in this regard.

Conclusion: both score very well here. My recommendation is: if you are developing simple, intuitive logic then Python does the job greatly; if you want to do something more complex then it may be worth investing in learning and writing functional code in Scala.

2. Safe Refactoring

This requirement mainly comes with the agile methodology: we want to safely change the requirements of our code as we perform data explorations and adjust them at each iteration. Very commonly you first write some code with associated tests and, soon after, tests, implementations and APIs are broken. Every time we perform a refactoring we face the risk of introducing bugs and silently breaking the previous logic.

Both languages require tests (unit tests, integration tests, property-based tests, etc.) in order to be safely refactored. Scala, being a compiled language, has an advantage here, but I am not going to argue the pros and cons of compiled vs. scripting languages. I will skip that, but at least for me there are some clear benefits to having typed code.

Conclusion: Scala very well, Python average.

3. Spark Integration

The majority of time and resources are generally spent on loading, cleaning and transforming data and extracting the most informative bits out of it. For that task, what is better than expressing your domain-specific logic as a combination of functions and not bothering about how it is lazily executed? No wonder Big Data is turning more and more functional.

You would now expect me to say that Scala does better, since it is natively functional. Actually, in this scenario the big difference is made by Spark rather than the programming language. Even though Python is not 100% functional (you could make it so via external libraries), it wraps the Spark API, which is indeed functional.

The implementation of the individual map or reduce functions can then be functional or not, but at least the main logic is expressed as a pipeline of transformations and operations over the raw data, and the execution plan is defined by the computation framework.

You still have to use the different Spark APIs smartly in order to make your code scalable and optimised, but this task is the same in both cases. If we consider code execution performance, we all know that JVM-compiled code runs faster than Python code, but Spark is moving towards language-agnostic abstractions like DataFrame, which will optimise most of the work for you and produce comparable performance results.

Thus, the solution is “use Spark”. That said (and independently of the functional nature), Scala supports it natively, which comes in particularly handy when performing low-level tuning, optimisation and debugging. If you have used the Spark framework you are well familiar with its serialization exceptions. Since the Python code is wrapped and executed in the JVM, you have less control over what is enclosed in your functions. Moreover, some new features in recent Spark releases may only be available in Scala before being ported to Python.

Conclusion: Scala is better when it comes to engineering; the two are equivalent in terms of Spark integration and functionality.

4. Out-of-the-box machine learning/statistics packages

When you marry a language, you marry the whole family. And Python has much more to bring to the table when it comes to out-of-the-box packages implementing most of the standard procedures and models you find in the literature and/or broadly adopted in the industry. Scala is still way behind in that, yet it can benefit from Java library compatibility and from the community developing distributed versions of some of the popular machine learning algorithms directly on top of Spark (see MLlib, H2O Sparkling Water, DeepLearning4j…). A little note regarding MLlib: from my experience its implementation is a bit hacky and often hard to modify or extend due to a mediocre design and nonsensical restrictions of private fields and classes.

Regarding Java compatibility, honestly I don’t see any Java framework anywhere close to what Python provides today with its amazing scikit-learn and related libraries. On the other hand, many of those Python implementations only work locally (unless using some bootstrapping/bagging + model ensembling technique, see https://cornercases.wordpress.com/2013/10/23/example-python-machine-learning-algorithm-on-spark/) and their out-of-the-box implementations lack strong scalability when it comes to distributed algorithms. Scala, in contrast, provides only a few implementations, but they are already scalable and production-ready.

Nevertheless, do not forget that many big data problems can be reduced to small data problems, especially after accurate feature selection, filtering and aggregation. In some scenarios it might make sense to crunch your large dataset into a vector space that perfectly fits in memory and take advantage of the richness of advanced algorithms available in Python.

Conclusion: it really depends on the size of your data. Prefer Python every time the data can fit in memory, but also keep in mind the requirements of your project: is it just a prototype, or is it something you want to deploy/maintain in a production system? Python offers a complete selection of already-implemented packages that can satisfy any need. Scala only provides the basics, but in case of “productionisation” it is the better engineering choice.

5. Documentation / Community

If we compare the two plain languages (without their external libraries) in terms of community size, Python belongs to tier 1 while Scala follows right after in tier 2, see http://readwrite.com/2010/12/10/ranking-programming-languages. Practically speaking, it means both of them have enough tutorials and answers on StackOverflow covering the majority of use cases and how-tos.

If we consider the documentation of the machine learning and statistics frameworks, the Python data science community is more mature, and in fact you can find many tutorials and examples of how to solve a lot of problems and run cool analyses using most of the Python libraries.

Unfortunately we cannot say the same for Scala. The documentation of the ML and MLlib libraries is very poor; the only way to really understand how they work is by reading the code. The same goes for some other open source libraries I found on GitHub.

Conclusion:
Both of them have good and comparable communities in terms of software development. When we consider the data science community and cool data science projects, Python is hard to beat.

6. Interactive Exploratory Analysis and built-in visualization tools

iPython is one of the greatest tools ever invented in the scientific world; one year ago it would have been, without doubt, the Oscar winner. Today we can find many implementations of notebooks inspired by the iPython notebook available for any language. Jupyter, the iPython evolution, supports different kernels, while iScala actually re-implements it on top of an Akka/Play RESTful service. If you only consider opening a web-based notebook and starting to write and interact with some code, I think they are very similar.

If we consider using the notebook to interact with Spark, it may be a little more useful to use the Spark Notebook (in Scala), since it is specifically designed for this purpose and provides a few utils to generate custom Spark contexts or to stop the currently running job without having to access the Spark UI or run commands from the command line. While it is a nice-to-have feature, I don’t think it makes a huge difference.

The pain comes with dependency management, and in that aspect Scala is a true nightmare! Being a compiled JVM language, all of the dependencies must be available in the classpath, and the kernel needs to be restarted every time a jar changes or a new one enters the path. Moreover, using dependency management tools like sbt for some reason generates a whole lot of traffic, and all of your dependencies are then packed into a fat jar of hundreds of MBs, which must then be loaded by the JVM executing your back-end code. Python does much better here because everything is specified at runtime: you can simply import code or libraries and the interpreter will automatically resolve them for you without ever restarting your kernel. This aspect is extremely important, especially when separating development in the IDE from exploration in the notebook, calling the APIs of your implemented logic from the source folder. I raised this issue with the Typesafe and SparkNotebook folks hoping that it can be addressed somehow in a more efficient way.

Built-in visualisations: Spark Notebook includes a very rudimentary built-in viz library, a simple but acceptable WISP library and a few wrappers around JavaScript technologies such as D3 and Rickshaw. Generally speaking, it can render and wrap any JavaScript library, but in a way that is neither friendly nor intuitive. Python is without any doubt superior in the range and selection of cool and advanced ways of plotting and building interactive dashboards.

Conclusion: Python wins; Scala is not mature enough yet, even though the SparkNotebook does a good job. We haven’t yet considered the recent Apache Zeppelin, which provides some fancy visualisation features and supports the concept of a language-agnostic notebook where each cell can contain any type of code (Scala, Python, SQL…) and is specifically designed to integrate well with Spark.

Final Verdict

Shall I use Scala or Python? The answer is: Yes!
Give both of them a try and test for yourself which works better for your specific use case. As a rule of thumb, Python is more analytics-oriented while Scala is more engineering-oriented, but both are great languages for building Data Science applications. The ideal scenario would be a data science team confident with both and able to swap between them when needed.

Nonetheless, technology choices are often driven by what people are already comfortable with. Pressure to deliver does not leave you enough resources to spend on researching new libraries, reading papers or learning new tools and languages. What most data scientists care about at the end of the day is delivering using whatever means does the job.

If you do have to decide, my view is that if your scope is doing research, then a scripting language is complete enough for experimentation and prototyping. If your goal is to build a product, then you want to consider something more robust that gives you both the experimentation and, at the same time, a deliverable product.

Since the best solution is never black or white, I encourage trying hybrid approaches that adapt to each project's specification. A typical scenario could be developing the whole ETL, data cleansing and feature extraction in Scala, then distributing the data over multiple partitions and learning using algorithms written in Python, and finally collecting the results and presenting them in a Jupyter notebook. Moreover, since at that last stage we don't need Spark anymore, we could even deploy an interactive and stunning dashboard using Shiny by RStudio.
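One possible way to wire such a hybrid pipeline, shown here purely as a sketch (the paths and the score.py script are hypothetical, not taken from any real project), is Spark's RDD.pipe, which streams every partition through an external process:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object HybridScoringJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hybrid-scoring"))

    // Features engineered in Scala, serialised as one CSV line per record.
    // The HDFS paths are placeholders for illustration.
    val features: RDD[String] = sc.textFile("hdfs:///data/features")

    // Hypothetical Python script: reads rows on stdin, prints one prediction per line.
    // Spark streams each partition through it, so the learning/scoring stays in Python.
    // score.py must be shipped to the workers, e.g. via spark-submit --files.
    val predictions: RDD[String] = features.pipe("python score.py")

    predictions.saveAsTextFile("hdfs:///output/predictions")
    sc.stop()
  }
}
```

The collected predictions can then be loaded into a Jupyter notebook (or a Shiny app) for the presentation stage, where Spark is no longer needed.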

My motto is “the best tool for each task”. Whatever balance you choose, avoid splitting into two teams: Data Science Engineers (the Big Data/Scala guys) and Data Science Analysts (the Python and SQL folks). Aim to build a cross-functional team with the full skill set to operate on the whole end-to-end development of your product, from the raw data to the manual analysis and from the modelling to a scalable deployment.

I hope this article can be useful for both experienced data scientists and enthusiasts who want to start their career in this industry. Please consider that the above comparison is mainly specific to the Apache Spark use case, which I strongly recommend; in case you are using a different stack and/or choice of languages, I think many concepts are still valid and can be extended to the broader families of compiled vs. scripting languages.

***

Related links:

https://www.quora.com/Which-one-should-I-learn-Python-or-Scala

https://www.linkedin.com/pulse/build-tool-pain-why-data-science-isnt-going-typed-sam-savage

https://www.quora.com/Is-Scala-a-better-choice-than-Python-for-Apache-Spark

http://stackoverflow.com/questions/32464122/spark-performance-for-scala-vs-python

http://statrgy.com/2015/05/05/scala-vs-python/

http://datavirtualizer.com/popularity-vs-productivity-vs-performance/

Pro Python:

http://blog.mikiobraun.de/2013/11/how-python-became-the-language-of-choice-for-data-science.html

https://www.quora.com/Why-is-Python-a-language-of-choice-for-data-scientists

Apologies, but the majority of comparisons of Python with other languages for data science are mainly Python vs. R; I could not find many other pro-Python links comparing it with Scala.

Pro Scala:

https://tech.coursera.org/blog/2014/02/18/why-we-love-scala-at-coursera/

http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang

https://www.linkedin.com/pulse/data-science-technology-choice-case-study-harry-powell

WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

At the Advanced Data Analytics team at Barclays we solved a Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is recommending a sequence of WordPress blog posts that users may like, based on their historical likes and blog/post/author characteristics.
Details of the competition are available at https://www.kaggle.com/c/predict-wordpress-likes.

What we want to share is a mix of methodology and tools for:

  • Investigating the data interactively;
  • Writing quality code in a productive environment;
  • Embedding the developed functions into executable entry points;
  • Presenting the results in a clean and visual way;
  • Meeting the required acceptance criteria.

AKA: delivering a Data Science MVP quickly, in a completely Agile way!

The topics covered in this workshop are:

  • DataFrame/RDD conversions and I/O
  • Exploratory Data Analysis (EDA)
  • Scalable Feature Engineering
  • Modelling (MLlib and ML)
  • End-to-end Evaluation
  • Agile Methodology for Data Science

At the end of the workshop the lessons learnt are:

  • Spark, DataFrames, RDDs:
    • DataFrame is great for I/O and schema inference from the sources, and works well when you have flat schemas. Operations become more complicated with nested and array fields.
    • RDDs give you the flexibility of doing your ETL using the richness of the Scala language; on the other hand, you must be careful about optimising your execution plans.
    • Functional programming allowed us to express complex logic with simple, clear code, free of side effects.
    • Map joins with broadcast maps are very efficient, but we need to make sure the broadcast map is reduced to the minimum size before broadcasting, e.g. by filtering out unmatched keys before the join or by capping the size of each value in the case of size-variable structures such as hash maps (see the broadcast-join sketch after this list).
  • ML, MLlib
    • ETL and feature engineering are the most time-consuming parts; once you have the data you want in vector format, you can convert back to a DataFrame and use the ML APIs.
    • ML unfortunately does not wrap everything available in MLlib; sometimes you have to convert back to RDD[LabeledPoint] or RDD[(Double, Vector)] in order to use MLlib features such as the evaluation metrics (see the metrics sketch after this list).
    • The ML pipeline API (Transformer, Estimator, Evaluator) seems cool, but for an MVP it is a premature abstraction.
  • Modeling
    • Do not underestimate simple solutions. In the worst case they serve as a baseline for benchmarking.
    • Even though Logistic Regression was better at classifying true vs. false, the simple model outperformed it when running the end-to-end ranking evaluation.
    • Focus on solving problems rather than on models or algorithms.
    • Many Data Science problems can be solved with counts and divisions, e.g. Naïve Bayes.
    • Logistic Regression “raw scores” are NOT probabilities; treat them carefully!
  • Spark Notebook
    • The Spark Notebook is good for EDA and as an entry point for calling APIs and presenting results.
    • Developing in the notebook is not very productive: the more code you write, the harder it becomes to track and refactor previously developed code.
    • It is better to write code in IntelliJ and then either pack it into a fat jar and import it from the notebook, or copy and paste it every time into a dedicated notebook cell.
    • To keep normal notebook cells clean, they should not contain more than 4-5 lines of code or complex logic; ideally they should just contain queries in the form of functional processing and the entry points of a logic API.
  • Visualization
    • Plotting in the notebook with the built-in visualization is handy but very rudimentary: it can only visualize 25 points, so we created a pimp (implicit class) that takes any Array[(Double, Double)] and interpolates its values down to 25 points.
    • Tip: when you visualize a Scala Map with Double keys in the range 0.0 to 1.0, the take(25) method already returns uniform samples in that range, and since the x-axis is numerical, the built-in visualization will automatically sort it for you.
    • We probably should have investigated more advanced libraries like Bokeh or D3 that are already supported in the notebook.
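As a rough illustration of the broadcast map join mentioned in the Spark/RDD lessons above, here is a minimal sketch; the types, the metadata map and the cap of 10 tags per value are made up for the example and are not taken from the actual solution:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {
  // Hypothetical types for illustration:
  // likes: large RDD of (blogId, userId); blogMeta: small lookup of blogId -> tags
  def joinWithMeta(sc: SparkContext,
                   likes: RDD[(Long, String)],
                   blogMeta: Map[Long, Seq[String]]): RDD[(Long, String, Seq[String])] = {
    // Cap size-variable values before broadcasting so the map stays as small as possible
    val capped = blogMeta.map { case (blogId, tags) => blogId -> tags.take(10) }
    val bc = sc.broadcast(capped)
    likes.flatMap { case (blogId, userId) =>
      // Map-side join: unmatched keys are simply dropped, no shuffle involved
      bc.value.get(blogId).map(tags => (blogId, userId, tags))
    }
  }
}
```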
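And a minimal sketch of the ML-to-MLlib round trip mentioned in the ML/MLlib lessons, converting an ML prediction DataFrame into the (score, label) pairs that MLlib's BinaryClassificationMetrics expects. The "probability" and "label" column names are the ML defaults and are assumed here (Spark 1.x, where the probability column is an MLlib Vector), not taken from the project code:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

object EvaluationSketch {
  // predictions: output of an ML classifier (e.g. LogisticRegression) with the
  // default "probability" and "label" columns (assumed, Spark 1.x)
  def areaUnderROC(predictions: DataFrame): Double = {
    val scoreAndLabels = predictions.select("probability", "label").rdd.map { row =>
      // Take the probability of the positive class as the score
      (row.getAs[Vector]("probability")(1), row.getAs[Double]("label"))
    }
    new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
  }
}
```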

Check the source code on the GitHub page: https://github.com/gm-spacagna/wordpress-posts-recommender.


The complete 18 steps to start a new Agile Data Science project

Introduction

It is a very common pattern in software development to start a new project in a highly uncertain and chaotic scenario, surrounded by plenty of ideas of what features we might want to implement. In Data Science the problem is amplified even further by its non-deterministic nature. At the start of a Data Science project we not only don't know what we are trying to implement, we also don't know how to implement it, nor under which circumstances that would be possible and correct.

This initial lack of structure often manifests itself as an early spike of unnecessary development and, later in the project, as technical debt and unexplained inconsistencies. You might spend a lot of resources before finding out that the delivered solution simply does not fit the business nature of the problem.

In Agile Data Science the goal should not be producing charts and reports or hacky scripts calling some machine learning library. In Agile Data Science we want to iteratively build production-quality applications that solve the true business needs by extracting hidden knowledge from the data.

This is the final summarising post of the Agile Data Science Iteration 0 series:

The Complete Checklist

  1. Rigorous definition of the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather initial dataset into proper infrastructure
  6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
  7. Define and quickly develop the simplest solution to the problem
  8. Release/demo first basic solution
  9. Research of background for ways to improve the basic solution
  10. Gather additional data into proper infrastructure (if required)
  11. Ad-hoc Exploratory Data Analysis (EDA)
  12. Propose better solution minimising potential risks and marginal gain
  13. Develop the Data Sanity check
  14. Define the Data Types of your application domain
  15. Develop the ETL and output the normalised data into a proper infrastructure
  16. Clearly state all of the assumptions/hypothesis and document whether they have been verified or not and how they can be verified
  17. Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data
  18. Analyse the output of the automated HDA to adjust/revise the proposed solution

At the end of Iteration 0 you have a very solid starting point for your project and you can now follow the typical Agile development cycle, whether you prefer SCRUM, Kanban, a mix of the two, or your own ad-hoc methodology.

Regardless of whether you use a strict or flexible workflow, keep in mind that the main difference from Agile iterations for software development is that a ticket is typically broad and open-ended. You should not be surprised if the majority of your tickets are split into multiple sub-tickets after the initial investigation of the problem. You should allow subtasks to be created even after sprint planning. In some cases you may prefer to mark them as blockers and re-scope them into the next sprint; in other cases you may want to allow them to affect the current sprint.
What is important is that you start implementing production-quality code only when the requirements and the acceptance test are well defined. In Data Science this is unlikely to happen every time. Whenever you are presented with an open problem to investigate and solve, you should try to break it into research/analysis and development subtasks.

What not to do?

  • Do not start any development without having done prior, detailed research/investigation
  • Do not just deliver analysis code in notebooks; after your investigation, move the code to production-quality standards
  • Do not blindly trust external libraries or APIs if you don't know exactly what they do and return; run some tests if needed
  • Do not generate manual reports of your findings until the experiments are reproducible and automated
  • Do not deploy any model if all of the assumptions haven’t been stated and verified
  • Do not be lazy to learn better technologies and methodologies!

To conclude, in this series of posts I just wanted to share some of my experience of starting new Data Science projects and common problems that I have seen addressed in a confused and chaotic way. I hope that by following these guidelines you can reduce the project's technical debt and the risk of working for several months without ever delivering a correct and working solution.

More details of the Agile cycle for Data Science applications, and in particular how to time-box open-ended questions, will be covered in another post. Stay tuned and get ready to run!

***



Agile Data Science Iteration 0: The Hypothesis-Driven Analysis

This is the fifth post of the Agile Data Science Iteration 0 series:

Previously

What we have achieved so far (see previous posts above):

  1. Rigorous definition of the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather initial dataset into proper infrastructure
  6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
  7. Define and quickly develop the simplest solution to the problem
  8. Release/demo first basic solution
  9. Research of background for ways to improve the basic solution
  10. Gather additional data into proper infrastructure (if required)
  11. Ad-hoc Exploratory Data Analysis (EDA)
  12. Propose better solution minimising potential risks and marginal gain
  13. Develop the Data Sanity check
  14. Define the Data Types of your application domain
  15. Develop the ETL and output the normalised data into a proper infrastructure

At this stage you have already modelled some entities of your application logic. You know the raw data well and have already produced a normalised and cleaned version of your dataset. Your data is now sanitised and stored in a proper analytical infrastructure. Ask yourself: what assumptions have I made so far, and what assumptions am I going to make? Agile Data Science, even though it is production and engineering oriented, is not just software engineering. Agile Data Science is Science, thus it must comply with the scientific method.

The Oxford dictionary defines “scientific method” as:

“a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.”

And this is by no means different in the Data Science methodology.

The Hypothesis-Driven Analysis

16. Clearly state all of the assumptions/hypothesis and document whether they have been verified or not and how they can be verified

In Data Science we implement models and applications in a highly non-deterministic context where we often make assumptions to simplify the problem. Assumptions are generally made based on intuition, common sense, previous experience, domain knowledge or sometimes simply because the model requires them.

Even though they might seem appropriate, they are dangerous! Unverified assumptions can easily lead to inconsistencies or, even worse, silently produce wrong results.

We can't get rid of all of our assumptions and build an assumption-free model, but we should try to document them, verify them as soon as possible and track them over time. It is fine to have not-yet-fully-verified assumptions at this early stage, but they should not be forgotten and their verification should be planned for the immediately following iterations.

Every time we present any result, we should clearly state all of the assumptions that have been made and whether or not they have been verified.
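As a purely illustrative sketch (the assumptions listed are hypothetical examples, not ones taken from a real project), assumptions can be tracked as first-class objects and attached to every report, so that unverified ones are never silently dropped:

```scala
// Hypothetical sketch of an assumptions registry; the entries are made-up examples.
case class Assumption(description: String, verified: Boolean, howToVerify: String)

object AssumptionsRegistry {
  val all: Seq[Assumption] = Seq(
    Assumption("Likes are unique per (user, post) pair",
               verified = true,
               howToVerify = "count duplicate (user, post) rows in the normalised likes table"),
    Assumption("Post timestamps are expressed in UTC",
               verified = false,
               howToVerify = "check against the source system documentation and sample records")
  )

  // Appended to every result/demo so the state of each assumption is always visible
  def report(): String =
    all.map(a => s"[${if (a.verified) "VERIFIED" else "UNVERIFIED"}] ${a.description} -> ${a.howToVerify}")
       .mkString("\n")
}
```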

17. Develop the automated Hypothesis-Driven Analysis (HDA) consisting of hypothesis validation + statistics summary, on top of the normalised data

What if the underlying dataset or the observed environment has changed? Are our hypotheses still valid?
It is extremely important to develop an automated framework for running tests and experiments that validate all of the existing hypotheses.
We cannot be confident about our deliverables if we are not sure that our hypotheses are correct, and if anything has changed we must be able to find out immediately.

Yet, it is often hard to have tests with a boolean outcome: success or failure. It is good practice, though, to have at least an automated job that calculates some key descriptive statistics to help us understand the underlying dataset and guide the validation of our hypotheses. Think carefully about which measures you would need in order to understand whether the proposed solution makes sense or not.
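A minimal sketch of such a job is shown below; it assumes a hypothetical normalised table with likes, userId and postId columns (none of these names come from a real schema) and mixes a descriptive summary with a couple of hard hypothesis checks:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object HypothesisChecks {
  // "likes", "userId" and "postId" are illustrative column names, not a real schema
  def run(normalised: DataFrame): Unit = {
    // Key descriptive statistics (count, mean, stddev, min, max) for all numeric columns
    normalised.describe().show()

    // Hypothesis: the "likes" count is never negative
    val negatives = normalised.filter(col("likes") < 0).count()
    require(negatives == 0, s"Hypothesis violated: $negatives rows with negative likes")

    // Hypothesis: there are no duplicate (userId, postId) pairs
    val duplicates = normalised.groupBy("userId", "postId").count().filter(col("count") > 1).count()
    require(duplicates == 0, s"Hypothesis violated: $duplicates duplicated (userId, postId) pairs")
  }
}
```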

18. Analyse the output of the automated HDA to adjust/revise the proposed solution

The output of your HDA framework is your best friend for going back and making the first changes to the proposed solution. You want to account for what the real phenomena are, regardless of what your original thoughts were.
If you manage to get all of your hypotheses right at the first shot, think twice!

Now you have a very detailed picture of what your solution proposal is and what all of the requirements are. You have gained a deep understanding of every detail you will need during the development and evaluation of your model, and you have already built all of the tools to support you in that. You can feel safe trying out whatever you want because you know that your tests will check its validity. You have reduced the risks on this project to a minimum before even implementing the first line of code for your model.

Align with your stakeholders and product owners and define the initial roadmap and expectations you want to meet for the first MVP.

***

A summary of the complete Agile Data Science Iteration 0 series will be published soon, stay tuned.
Meanwhile, why not share it or comment below?



What is Spark? Six reasons why CIOs should find out (and one why they shouldn’t) – 02 Nov 2015 – Computing Analysis

via What is Spark? Six reasons why CIOs should find out (and one why they shouldn’t) – 02 Nov 2015 – Computing Analysis.
