Functional Data Validation using monads and applicative functors


ETL is probably the most time-consuming part of every Data Science project. The quality of the extracted and crunched data is one of the major factors affecting the final results. In fact, real-world data is always messy and inconsistent. Data validation is a must for enforcing the correctness of the proposed solution and for making sure the underlying data represents the true business scenario.

When performing a data validation, the following issues often arise:

  • We want to keep track of how much information we lose, for debugging and reporting purposes.
  • Sometimes we want to cleanse invalid data instead of filtering it out.
  • Part of the validation logic depends on the project requirements and/or model assumptions. These change often, and the refactoring may introduce bugs.

In this tutorial we show how to use monads, applicative functors and other functional programming concepts to safely and elegantly define the validation logic using a modular pattern. Each rule is defined individually and the final logic is built by using two types of composition:

  • Monad composition. One rule after the other; if one fails, the next rule is not applied.
  • Applicative composition. All of the rules are applied independently and the validation results are collected and merged together.

Moreover, the data that does not pass the validation tests is not thrown away but moved into a separate pipeline, with all of the needed metadata attached to it explaining why that particular record was rejected. This allows us to:

  • Log all of the specific causes of data loss.
  • Easily recover previously invalidated data if the validation rules change.
  • Re-use part of the discarded data further down in the data pipeline. The whole ETL workflow is aware of what has been discarded before.

Spark, Scala, Scalaz and Sparkz

The tutorial is part of the open source project Sparkz, which aims to extend the Apache Spark framework by providing more functional APIs. The implementation is in Scala and leverages Scalaz, the framework that inspired Sparkz.

Scalaz provides a Validation data structure that is similar to Either (an object can either be a Left/Failure or a Right/Success), but it is not a monad: it is an applicative functor, because instead of chaining the result of one check into the next, Validation applies all of the checks and accumulates the failures. Since in case of a failure there must be at least one error message, we enforce the failure type to be a non-empty list of error messages. For this purpose Scalaz already provides the data structure ValidationNel, which accumulates all of the error messages into a Nel (non-empty list).

See this page for documentation: http://eed3si9n.com/learning-scalaz/Validation.html
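To make the accumulating behaviour concrete, here is a minimal sketch (not taken from this post) of two independent checks combined with the Scalaz applicative builder |@|; both failures end up in the same non-empty list:

import scalaz._, Scalaz._

def positive(n: Int): ValidationNel[String, Int] =
  if (n > 0) n.successNel else "not positive".failureNel

def even(n: Int): ValidationNel[String, Int] =
  if (n % 2 == 0) n.successNel else "not even".failureNel

// Both checks run and both failures are collected:
// (positive(-3) |@| even(-3)) { (a, _) => a }
//   == Failure(NonEmptyList("not positive", "not even"))
// (positive(4) |@| even(4)) { (a, _) => a }
//   == Success(4)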

The concepts and methodology can be applied to any data computation framework and programming language; you will just have to re-implement part of the boilerplate code shown in this tutorial that does all of the magic for you.

The user/events data validation use case

The use case we are using for this example is a simple data type consisting of a triple of userId, eventCode and timestamp:


case class UserEvent(userId: Long, eventCode: Int, timestamp: Long)

Each UserEvent can be marked either as correct or as invalid. In the latter case we wrap it into another case class, InvalidEvent, containing the invalid event as well as some meta information regarding the error cause:

sealed trait InvalidEventCause

case class InvalidEvent(event: UserEvent, cause: InvalidEventCause)

The goal is to build a function that takes a UserEvent and returns a ValidationNel of either the correct event or the non-empty list of all of the failure causes:


UserEvent => ValidationNel[InvalidEvent, UserEvent]

The reason why we want to return ValidationNel[InvalidEvent, UserEvent] instead of ValidationNel[InvalidEventCause, UserEvent] is that we want to keep the original datum for possible later recovery instead of only storing the error causes. This implies that the same object is duplicated multiple times, which is not efficient, but we are not addressing optimisation issues in this tutorial; we will leave them for future posts.

Validation Rules

The easiest way to define each rule is via a partial function that maps a UserEvent into an InvalidEventCause. A partial function is a function that is only defined for a subdomain of its input arguments; in our case it is a function that tries to invalidate a datum and is not defined for correct records. The full validation logic will be expressed as a List of partial functions such as:


val validationRules: List[PartialFunction[UserEvent, InvalidEventCause]]
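For instance, here is a tiny sketch of what a single rule looks like (using a hypothetical NegativeTimestamp cause that is not one of the rules defined below):

case object NegativeTimestamp extends InvalidEventCause // hypothetical cause, for illustration only

val negativeTimestamp: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if event.timestamp < 0 => NegativeTimestamp
}

// The rule is only defined on the "invalid" subdomain:
// negativeTimestamp.isDefinedAt(UserEvent(1L, 2, -5L)) == true
// negativeTimestamp.isDefinedAt(UserEvent(1L, 2, 5L))  == false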

In order to reduce the boilerplate code, the PartialFunction returns only an InvalidEventCause; our implicit logic will then wrap it together with the original UserEvent object into an InvalidEvent container.

Some rules are simply pre-defined, such as checking that a timestamp is in a min-max range or that the userId is in a white list, and so on. Others are more complicated and are derived from the underlying raw data (before validation).
The method generating the final validation function takes as arguments the RDD with the raw data plus a bunch of parameters and objects used for defining the single rules:

def validationFunction(events: RDD[UserEvent],
                       eligibleUsers: Set[Long],
                       validEventCodes: Set[Int],
                       blackListEventCodes: Set[Int],
                       minDate: String, maxDate: String): UserEvent => ValidationNel[InvalidEvent, UserEvent]

In order to compile the snippets of code you will have to add some dependencies in the imports:

 
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.joda.time.{DateTime, Interval, LocalDate}
import sparkz.utils.Pimps._
import scalaz.Scalaz._
import scalaz.ValidationNel

First of all we grab the SparkContext from the events RDD:


val sc = events.context

Valid event code

We want to filter out all of the events whose code does not belong to the validEventCodes set.

 

case object NonRecognizedEventType extends InvalidEventCause

val validEventCodesBV: Broadcast[Set[Int]] = sc.broadcast(validEventCodes)
val notRecognizedEventCode: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if !validEventCodesBV.value.contains(event.eventCode) => NonRecognizedEventType
}

We could have enclosed the set directly in the partial function, but we preferred to broadcast it and retrieve it using the value API.

Why Broadcast variables should be used in Spark is explained here: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

Eligible users

Just like event codes we want to make sure that we only select data from a pool of predefined eligible users:


case object NonEligibleUser extends InvalidEventCause

val eligibleUsersBV: Broadcast[Set[Long]] = sc.broadcast(eligibleUsers)
val customerNotEligible: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if !eligibleUsersBV.value.contains(event.userId) => NonEligibleUser
}

N.B. if the eligibleUsers set is large, you cannot broadcast it as a shared variable; you should rather turn it into a pair RDD and use the userId as the key for a join. If you want to preserve the information about why you discarded a particular user, you will have to perform an outer join instead of an inner join, as in the sketch below.
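For illustration, a hedged sketch of that join-based alternative, assuming a hypothetical eligibleUsersRDD: RDD[Long] containing the eligible user ids:

// Tag each event via a left outer join instead of a broadcast variable
val keyedEvents: RDD[(Long, UserEvent)] = events.keyBy(_.userId)
val keyedEligible: RDD[(Long, Unit)] = eligibleUsersRDD.map(id => (id, ()))

val tagged: RDD[Either[InvalidEvent, UserEvent]] =
  keyedEvents.leftOuterJoin(keyedEligible).values.map {
    case (event, Some(_)) => Right(event)                               // eligible user, event survives
    case (event, None)    => Left(InvalidEvent(event, NonEligibleUser)) // keep the cause for later recovery
  }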

Blacklist users

This logic is slightly more complicated. We want to filter out all of the events of those users for which we observed at least one event in the black list. We will have to first scan the raw dataset in order to create the blacklist user ids.


case object BlackListUser extends InvalidEventCause

val blackListEventCodesBV: Broadcast[Set[Int]] = sc.broadcast(blackListEventCodes)
// Users for which we observed a black list event
val blackListUsersBV: Broadcast[Set[Long]] = sc.broadcast(
  events.filter(event => blackListEventCodesBV.value.contains(event.eventCode))
  .map(_.userId).distinct().collect().toSet
)
val customerIsInBlackList: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if blackListUsersBV.value.contains(event.userId) => BlackListUser
}

We first run distinct() to reduce the size of the RDD before calling collect() and turning it into a set.

Timestamp out of global interval

We specified minDate and maxDate as strings in ISO format and we want to filter out all of the timestamps outside this range. For this logic we don't need any pre-computation; we can directly implement it as:


case object OutOfGlobalIntervalEvent extends InvalidEventCause

val eventIsOutOfGlobalInterval: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if !new Interval(DateTime.parse(minDate), DateTime.parse(maxDate)).contains(event.timestamp) =>
    OutOfGlobalIntervalEvent
}

 

We use joda-time to parse strings and timestamp epoch numbers into more manageable classes.

First day to consider for the user

This time we want a stricter rule regarding the event timestamps. We want to avoid border effects by removing all of the events that fall on the date on which we first observed an event for a particular user: that first date may be incomplete, not contain all of the events, and invalidate our data assumptions. Since we also introduced the concept of a global min/max interval, the first date to consider for a particular user is the max between the first date we observed in the data and the start of the global interval.


case object FirstDayToConsiderEvent extends InvalidEventCause

// max between first date we have ever seen a customer event and the global min date
val customersFirstDayToConsiderBV: Broadcast[Map[Long, LocalDate]] =
  sc.broadcast(
    events.keyBy(_.userId)
    .mapValues(personalEvent => new DateTime(personalEvent.timestamp).toLocalDate)
    .reduceByKey((date1, date2) => List(date1, date2).minBy(_.toDateTimeAtStartOfDay.getMillis))
    .mapValues(firstDate => List(firstDate, LocalDate.parse(minDate)).maxBy(_.toDateTimeAtStartOfDay.getMillis))
    .collect().toMap
  )
val eventIsFirstDayToConsider: PartialFunction[UserEvent, InvalidEventCause] = {
  case event if customersFirstDayToConsiderBV.value(event.userId).isEqual(new DateTime(event.timestamp).toLocalDate) =>
    FirstDayToConsiderEvent
}

We reduced the events RDD into the minimum date for each userId using the reduceByKey monoidal aggregation. Then we applied the max function between the per-user minimum date and the global one.

Validation rules composition

All we have to do is take the individually defined partial functions (that in Scala are objects like everything else) and put them into a List.

We realised that eventIsFirstDayToConsider depends on eventIsOutOfGlobalInterval: if an event is filtered out because it is outside the global interval, there is no need to put it through the first-day-to-consider rule. Thus we can compose the two rules monadically by using the orElse method on the first partial function, which takes as argument the second partial function to be applied if and only if the first one is not defined (in our case, when the event is not already outside the global min-max range). The orElse method returns a new partial function which will be treated as a single validation rule and internally combines the two of them.

val validationRules: List[PartialFunction[UserEvent, InvalidEventCause]] =
  List(customerNotEligible, notRecognizedEventCode, customerIsInBlackList,
    eventIsOutOfGlobalInterval.orElse(eventIsFirstDayToConsider)
  )

The order of the validation rules does not matter since all of them will be applied independently, even if computationally they are applied sequentially. If you want to apply them in parallel, you can use the par method of a list, which turns it into a parallel collection (a sketch follows the next snippet).

The final validation function will be generated by a couple of syntactic sugar implicits that we implemented in Sparkz.


(event: UserEvent) => validationRules.map(_.toFailureNel(event, InvalidEvent(event, _))).reduce(_ |+++| _)
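If, as mentioned above, you want the rules evaluated in parallel, here is a minimal, hedged sketch using Scala parallel collections (the result is identical; only the map over the rules runs in parallel):

(event: UserEvent) =>
  validationRules.par
    .map(_.toFailureNel(event, InvalidEvent(event, _)))
    .reduce(_ |+++| _)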

The magic operators

In the list of imports we specified:

import sparkz.utils.Pimps._

Pimps are a nice pattern used in Scala to implicitly add methods to existing classes (the action of "pimping"). It is particularly useful when you want to use the postfix notation to call a method on a class that does not expose that method.

In order to implement our validation pattern we had to pimp two classes: PartialFunction and ValidationNel. The pimped methods hide the boilerplate logic computed behind the scenes.

implicit class PimpedPartialFunction[X, E](pf: PartialFunction[X, E]) {
  def toFailureNel[W](x: X, toW: E => W = identity _): ValidationNel[W, X] =
    pf.andThen(e => toW(e).failureNel[X]).applyOrElse(x, (_: X).successNel[W])

  def toFailureNel(x: X): ValidationNel[E, X] = toFailureNel(x, identity)
}

The toFailureNel method attempts to apply the partial function (X => E) to the element x; if the function is defined, it returns a failureNel (a non-empty list containing a single failure of type E) where the failure object is the one returned by the original function. If the function is not defined, the pimped method creates a successNel of type X.

The more general method also takes an extra argument toW that converts the error object e returned by the original function (if defined) and wraps it into another type W. It acts as a functor on the failure case. This method allows us to encapsulate the information of the original event that generated the error together with its cause.

In our use case the generic types X, E and W are:

X: UserEvent
E: InvalidEventCause
W: InvalidEvent
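As a hedged usage sketch, for a single rule and a single event (values chosen purely for illustration):

val event = UserEvent(userId = 42L, eventCode = 999, timestamp = 1000L)

// If the rule is defined at `event` (the event is invalid for this rule):
//   notRecognizedEventCode.toFailureNel(event, InvalidEvent(event, _))
//     == Failure(NonEmptyList(InvalidEvent(event, NonRecognizedEventType)))
// If the rule is not defined (the event passes this check):
//   notRecognizedEventCode.toFailureNel(event, InvalidEvent(event, _))
//     == Success(event)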

Once we have converted the partial functions into ValidationNel instances, we need to reduce them into a single one. Scalaz provides a binary operator +++ that takes two ValidationNel instances and merges them together: in case of failures it appends all of the failures into a single non-empty list of type E; in case of two successes it applies a Semigroup operator on the type X. In other words, the operator knows how to reduce failures by concatenating the errors, but it requires us to specify how to combine the correct results when both of them are Success.

Since in our use case we are not transforming the correct data, but either discarding it or keeping it as it is, the combine operation is straightforward. We assume that the success results always contain the same original object, i.e. they never transform it. Thus the operator simply returns the object itself, since the two objects to reduce are just copies of each other.

Our PimpedValidationNel provides a simplified operator that does not require a Semigroup instance for the type X:


import scalaz.{Failure, Success, ValidationNel}

implicit class PimpedValidationNel[E, X](x1: ValidationNel[E, X]) {
  // Extension of the scalaz.Validation.+++ operator; does not require a Semigroup for X
  def |+++|(x2: ValidationNel[E, X]) = x1 match {
    case Failure(a1) => x2 match {
      case Failure(a2) => Failure(a1 append a2)
      case Success(b2) => x1
    }
    case Success(b1) => x2 match {
      case b2@Failure(_) => b2
      case Success(b2) if b1 == b2 => Success(b1)
      case Success(b2) => throw new IllegalArgumentException(s"$b1 not equals to $b2")
    }
  }
}
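A small usage sketch of the operator (values assumed for illustration): failures accumulate their causes, while two successes carrying the same original object collapse into one.

val e = UserEvent(userId = 1L, eventCode = 999, timestamp = 0L)

val notRecognized: ValidationNel[InvalidEvent, UserEvent] =
  InvalidEvent(e, NonRecognizedEventType).failureNel
val notEligible: ValidationNel[InvalidEvent, UserEvent] =
  InvalidEvent(e, NonEligibleUser).failureNel
val valid: ValidationNel[InvalidEvent, UserEvent] = e.successNel

// notRecognized |+++| notEligible => Failure with both causes in the NonEmptyList
// notRecognized |+++| valid       => Failure(NonEmptyList(InvalidEvent(e, NonRecognizedEventType)))
// valid |+++| valid               => Success(e)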

This operator also allows us to merge results coming from different validation pipelines. If we had two RDDs of the same ValidationNel generic types we could theoretically join them and merge the results for the same objects. That operation is very expensive, though; it is much cheaper to compose the lazy functions that generate the validation objects and apply the final combined function to each record in a single pass over the dataset.
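For example, instead of joining two RDDs of ValidationNel results, one can compose the two generating functions and run them in a single pass. A hedged sketch, assuming the |+++| pimp from sparkz.utils.Pimps is in scope and validationFunc1/validationFunc2 are two hypothetical validation functions:

// Compose two lazily-defined validation functions into a single one
def combine[E, X](f: X => ValidationNel[E, X],
                  g: X => ValidationNel[E, X]): X => ValidationNel[E, X] =
  x => f(x) |+++| g(x)

// events.map(combine(validationFunc1, validationFunc2)) // one pass over the data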

What you can do with the ValidationNel API

The simplest thing would be to get only the valid records:

def onlyValidEvents(events: RDD[UserEvent],
                    validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): RDD[UserEvent] =
  events.map(validationFunc).flatMap(_.toOption)

Or the opposite: getting only the invalid events and flat-mapping them into their corresponding error-wrapping class:

def invalidEvents(events: RDD[UserEvent],
                  validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): RDD[InvalidEvent] =
  events.map(validationFunc).flatMap(_.swap.toOption).flatMap(_.toList)

Suppose we want to extract all of the original events that failed because of a particular cause, for instance when their timestamp was out of range:

def outOfRangeEvents(events: RDD[UserEvent],
                     validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): RDD[UserEvent] =
  events.map(validationFunc).flatMap(_.swap.toOption).flatMap(_.toSet).flatMap {
    case InvalidEvent(event, OutOfGlobalIntervalEvent) => event.some
    case _ => Nil
  }

N.B. We are exploiting the implicit conversion from an Option to an Iterable to apply a combination of filter and map in a single flatMap operation.
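A one-line sketch of that trick on plain Scala collections:

// Some(x) keeps (and maps) the element, None drops it: filter + map in one flatMap
// List(1, 2, 3, 4).flatMap(n => if (n % 2 == 0) Some(n * 10) else None) == List(20, 40)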

Now, suppose we would like to print a debug message with the count of invalid events by the set of their error causes.

def causeSetToInvalidEventsCount(events: RDD[UserEvent],
                                 validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): Map[Set[InvalidEventCause], Int] =
  events.map(validationFunc)
  .map(_.swap).flatMap(_.toOption).map(_.map(_.cause).toSet -> 1)
  .reduceByKey(_ + _)
  .collect().toMap

The above method will return a map that looks like:

Map(Set(NonEligibleUser, NonRecognizedEventType) -> 36018450,
Set(NonEligibleUser) -> 9037691,
Set(NonEligibleUser, BlackListUser, NonRecognizedEventType) -> 137816,
Set(NonEligibleUser) -> 464694973,
Set(FirstDayToConsiderEvent, NonRecognizedEventType) -> 5147475,
Set(OutOfGlobalIntervalEvent, NonRecognizedEventType) -> 983478).

Please note that we are not counting by the individual cause but grouping by the combination of causes that co-occurred together. This gives us much more debugging power with no loss of information, as opposed to the traditional monadic sequential validation where only the first cause would be recorded.

Moreover, what we might really be interested in is the count of how many users we lost as an effect of the events validation; in other words, how many users we lost because they had no events left after validation.

def causeSetToUsersLostCount(events: RDD[UserEvent],
                             validationFunc: UserEvent => ValidationNel[InvalidEvent, UserEvent]): Map[Set[InvalidEventCause], Int] = {
  val survivedUsersBV: Broadcast[Set[Long]] =
    events.context.broadcast(events.map(validationFunc).flatMap(_.toOption).map(_.userId).distinct().collect().toSet)

  events.map(validationFunc).flatMap(_.swap.toOption)
  .keyBy(_.head.event.userId)
  .filter(_._1 |> (!survivedUsersBV.value(_)))
  .mapValues(_.map(_.cause).toSet)
  .mapValues(Set(_))
  .reduceByKey(_ ++ _)
  .flatMap(_._2)
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .collect().toMap
}

What the above code does is the following:

  1. Compute the set of “survived” users from the correct events after validation.
  2. Filter only the invalid events belonging to the users who did not survive.
  3. Turn each list of failures into a Set of failures (so that the order does not matter).
  4. Group all of the cause sets by userId and deduplicate them, so that each combination of causes appears only once per userId.
  5. Count for how many non-survived users each cause set appears.
  6. Return a map from cause set to an integer representing the count of lost users.

It should return something like:

Map(Set(NonEligibleUser, NonRecognizedEventType) -> 1545,
Set(NonEligibleUser) -> 122,
Set(NonEligibleUser, BlackListUser, NonRecognizedEventType) -> 3224,
Set(NonEligibleUser) -> 4,
Set(FirstDayToConsiderEvent, NonRecognizedEventType) -> 335,
Set(OutOfGlobalIntervalEvent, NonRecognizedEventType) -> 33)

Conclusions

In this tutorial we showed how we can extend Scalaz to provide a functional and elegant way of applying validation rules to clean a raw dataset. We showed how the whole logic is parallelizable and scalable using the Apache Spark framework. The main advantage of this approach is that we never lose information: we split the data into two pipelines (valid/invalid) and mark each invalid record with metadata about the error causes. We presented a simple tutorial for a common use case, showing how to define validation rules and how to use the API of the generalized ValidationNel objects in order to perform debugging and cleansing tasks.

The source code is available at: https://github.com/gm-spacagna/sparkz/blob/master/src/examples/scala/sparkz/DataValidation.scala.

The whole procedure is not fully optimized for efficiency: the creation of immutable objects may introduce unnecessary overhead for the garbage collector, and the way we wrap the original datum in case of failure creates many duplicated clones of the same object. We leave the task of optimizing it to future blog posts.

We hope that this tutorial will inspire Data Scientists and Engineers, regardless of the language and/or technology stack, to approach their coding in a more functional way. Functional programming offers the elegance and conciseness of implementing arbitrarily complicated logic as a simple combination of reusable higher-order functions, as opposed to the classic imperative programming paradigm. We found this way of writing code to be much more suitable for implementing math and data transformation algorithms.

***

Similar articles about Data Validation using Scalaz:

https://github.com/FranklinChen/data-validation-demo
https://www.innoq.com/en/blog/validate-your-domain-in-scala/


About Gianmario

Data Scientist with experience on building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geeks community.