Mapping DataFrame to a typed RDD

I have recently published a blog post on DZone “Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds” which describes the workflow and methodology that we use at Barclays to load data from the raw source (relational database) into the Data Science cluster (Spark). One of the described components is the mapping between DataFrame to a typed RDD of a custom case class.

There are a bunch of reasons why you would like to make your DataFrame typed, the following is a summary:

dataframe-vs-rdd

Examples of when is more convenient to use DataFrame Vs. RDD can be found in this workshop: WordPress Blog Posts Recommender

In this tutorial I have pulled out from the Tachyon blog post the part related to the conversion from DataFrame to RDD. The inverted conversion RDD to DataFrame is straightforward and can be found in the same recommender workshop above mentioned.

Typed Case Class Mapping

After we have constructed the DataFrame collection from the raw source we can now map it into an RDD of our ad-hoc case classes. Since a DataFrame is also an RDD of type org.apache.spark.sql.Row, it already provides the map/flatMap methods.

If there are no null values in any row, we could use pattern matching to extract each column from the Row object:

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)
dataframe.map {
  case Row(a: java.math.BigDecimal, b: String, c: Int, _: String, _: java.sql.Date,
           e: java.sql.Date, _: java.sql.Timestamp, _: java.sql.Timestamp, _: java.math.BigDecimal,
           _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
}

This approach will fail for null values due to the casting of the explicit types of each single field in the unapply method of the class Row. You can discard all the rows containing null values by doing:

dataframe.na.drop()

But that will drop records even if the null fields are not the ones we use in our case class.

If you want to handle it using Scala options you could turn the Row object into a List and then use the following pattern:

case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)
dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int, _: String, _: java.sql.Date,
            e: java.sql.Date, _: java.sql.Timestamp, _: java.sql.Timestamp, _: java.math.BigDecimal,
            _: String) => MyClass(a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
}

If the columns you are interested are sparse, then you could fetch them individually either by index or by column name:

row.getAs[SQLPrimitveType](columnIndex: Int)
row.getAs[SQLPrimitveType](columnName: String)

For the list of mapping of SQL primitive types and their corresponding Java/Scala classes, see: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html.

N.B. The described procedure does not take advantage of the recently released DataSet API (http://spark.apache.org/docs/1.6.0/sql-programming-guide.html#datasets) which should automate the whole process of converting between DataFrames and RDDs. At the time we wrote this note we had not yet tested DataSet. Also there are open-source projects like Frameless (https://github.com/adelbertc/frameless) and an ongoing discussion on its gitter channel of how to leverage the awesome Shapeless (https://github.com/milessabin/shapeless) library to make Spark more functional and compile-time type-safe.

Similar articles:

Type safety on Spark Dataframes: http://www.51zero.com/blog/2016/2/24/type-safety-on-spark-dataframes-part-1

Advertisements

About Gianmario

Data Scientist with experience on building data-driven solutions and analytics for real business problems. His main focus is on scaling machine learning algorithms over distributed systems. Co-author of the Agile Manifesto for Data Science (datasciencemanifesto.com), he loves evangelising his passion for best practices and effective methodologies amongst the data geeks community.
This entry was posted in Spark, Uncategorized and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s