In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio


Legacy enterprise architectures still rely on relational data warehouse and require moving and syncing with the so-called “Data Lake” where raw data is stored and periodically ingested into a distributed file system such as HDFS.

Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, such as for regulations or reducing latency, in which case Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.

We propose an Agile workflow by combining Spark, Scala, DataFrame (and the recent DataSet API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop high quality machine learning pipelines that can then be deployed straight into production.

In this talk we will:

* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet

* Explain the iterative exploratory process and advantages of adopting functional programming

* Make a crucial analysis on the issues faced with the existing methodology

* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds

* Discuss some future improvements to the overall architecture

Original meetup event:

Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are:

  1. Data stored in non scalable infrastructure for analysis and processing
  2. Data governance and security policies

1. Data often resides into central data warehouse and RDBMS of which many legacy applications and analysts depends on.
Data Scientists insteads cannot build their models or perform exploratory analysis by using SQL queries. They need the data to be available into a scalable, programmatic and reactive stack such as Hadoop and Apache Spark and develop their logic using languages such as Python, R, Scala… (for comparison of how Python and Scala compare for Spark, see this post: 6 points to compare Python and Scala for Data Science using Apache Spark).
2. Nevertheless, data cannot just be transferred (in technical terms sqoop-ed) to an Hadoop cluster without incurring into tedious bureaucracy,  ingestion inconsistencies and strict policies. In big corporations that translates to at least a month to decide what tables are interesting and a few more months to write the ETL logic, move the data and test the consistency.

At Barclays we developed a stack to logically map the raw data from the central data warehouse into Spark and use Tachyon for in-memory saving the data for long-term availability. In such stack, we are able to iterate fast with immediate data availability from a scalable Big Data cluster by skipping the data ingestion process and still complying with all of the data policies.

Tachyon was the key enabling technology for us.

Our workflow iteration time decreased from hours to seconds. Tachyon enabled something that was impossible before.

You can find the original article published on DZone in collaboration with Gene Pang, Software Engineer at Tachyon Nexus and Haoyuan Li, CEO of Tachyon Nexus:
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds