Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are:

  1. Data stored in non-scalable infrastructure for analysis and processing
  2. Data governance and security policies

1. Data often resides in central data warehouses and RDBMSs on which many legacy applications and analysts depend.
Data Scientists, on the other hand, cannot build their models or perform exploratory analysis using SQL queries alone. They need the data to be available in a scalable, programmatic and reactive stack such as Hadoop and Apache Spark, and to develop their logic in languages such as Python, R, Scala… (for a comparison of Python and Scala for Spark, see this post: 6 points to compare Python and Scala for Data Science using Apache Spark).
2. Nevertheless, data cannot just be transferred (in technical terms, sqoop-ed) to a Hadoop cluster without incurring tedious bureaucracy, ingestion inconsistencies and strict policies. In big corporations that translates to at least a month to decide which tables are interesting and a few more months to write the ETL logic, move the data and test its consistency.

At Barclays we developed a stack that logically maps the raw data from the central data warehouse into Spark and uses Tachyon to keep the data in memory for long-term availability. With this stack we can iterate fast, with immediate data availability from a scalable Big Data cluster, by skipping the data ingestion process while still complying with all of the data policies.

Tachyon was the key enabling technology for us.

Our workflow iteration time decreased from hours to seconds. Tachyon enabled something that was impossible before.

You can find the original article published on DZone in collaboration with Gene Pang, Software Engineer at Tachyon Nexus and Haoyuan Li, CEO of Tachyon Nexus:
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

Agile Data Science Iteration 0: The ETL

This is the fourth post of the Agile Data Science Iteration 0 series:


What we have achieved so far (see previous posts above):

  1. Rigorous definition of the business problem we are attempting to solve and why it is important
  2. Define your objective acceptance criteria
  3. Develop the validation framework (ergo, the acceptance test)
  4. Stop thinking, start googling!
  5. Gather initial dataset into proper infrastructure
  6. Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
  7. Define and quickly develop the simplest solution to the problem
  8. Release/demo first basic solution
  9. Research of background for ways to improve the basic solution
  10. Gather additional data into proper infrastructure (if required)
  11. Ad-hoc Exploratory Data Analysis (EDA)
  12. Propose better solution minimising potential risks and marginal gain

At this stage you already have a benchmark reference from the simple solution. You have done your research on how to improve it and meet the business requirements. You have a good overview of the initial dataset used for solving this problem. You can now start the engineering stage and produce the right dataset according to your application domain and the required data quality.


13. Develop the Data Sanity check

There is no dataset on Earth that does not require a sanity check. Filter out all the malformed, invalid or irrelevant records. Sometimes a cleansing step is also worthwhile: instead of throwing bad records away, you may try to sanitise them.

Make sure to repeat this process every time you are running your model with a different dataset. Make sure this process is:

  • automated
  • logging error messages
  • stopping the execution of your job, in case you hand your application over to someone else who might use it with the wrong dataset

As a data scientist you don’t want to be blamed for implementing a non-working model simply because someone else used it the wrong way.
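As a sketch of such an automated, logging, fail-fast check (the record fields `account_id` and `amount` are hypothetical, purely for illustration, not an actual schema from the post):

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("sanity")

def sanity_check(records, min_valid_ratio=0.9):
    """Filter out malformed records, logging every rejection.

    Raises ValueError when too few records survive, so a job handed
    over to someone else fails fast on the wrong dataset instead of
    silently producing a broken model.
    """
    valid = []
    for i, rec in enumerate(records):
        if rec.get("account_id") is None:
            log.warning("record %d: missing account_id", i)
            continue
        if not isinstance(rec.get("amount"), (int, float)):
            log.warning("record %d: non-numeric amount %r", i, rec.get("amount"))
            continue
        valid.append(rec)
    if len(valid) < min_valid_ratio * len(records):
        raise ValueError("too many invalid records: only %d of %d survived"
                         % (len(valid), len(records)))
    return valid
```

The threshold is just one possible fail-fast policy; the point is that the check stops the job rather than letting bad data flow downstream.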

14. Define the Data Types of your application domain

Your data types are the first-class citizens of your application. Define them carefully, accounting for how you would like to model your data in your domain rather than how the data currently looks. It might be worth considering optional fields, structured fields (for example, a postcode might be represented as a string or as a triple of district code, sector code and unit code), identifiers that may require a long instead of an integer, categorical values hard-coded as enumerations, timestamps stored as epoch time, and so on. Avoid duplicating information in your data types; use primary keys to join your data collections later on.
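A minimal sketch of such domain types in Python (the `Transaction` fields are invented examples to illustrate the points above, not an actual schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import NamedTuple, Optional

class Channel(Enum):            # categorical values as an enumeration
    ONLINE = "online"
    BRANCH = "branch"

class Postcode(NamedTuple):     # structured field: district/sector/unit triple
    district: str
    sector: str
    unit: str

@dataclass(frozen=True)
class Transaction:
    account_id: int             # identifier: a wide integer, not a string
    timestamp: int              # epoch seconds, not a formatted date string
    amount: float
    channel: Channel
    postcode: Optional[Postcode]  # optional field: may be absent
```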

Premature optimisations are discouraged, but as a rule of thumb try to keep your types light. That is, do not use strings to represent numbers, or any other expensive data structure. If you need to combine multiple fields into a single identifier, use tuples instead of concatenating them into a single expensive object. It might make no difference now, but refactoring code to accommodate a different data type is one of the most expensive and painful tasks, and expensive types will cause scalability issues pretty soon. There will always be time to fix it later, but if you have to make a choice now and it requires the same effort, why not do it well?
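For instance, a composite identifier built from a sort code and account number (hypothetical fields, just to illustrate the tuple-versus-concatenation point) can be kept as a tuple key:

```python
from collections import defaultdict

def total_by_account(rows):
    """Sum amounts per account, keyed by a (sort_code, account_number) tuple.

    The tuple is hashable and cheap; a concatenated string key would cost
    an extra allocation per record and would need parsing to recover its parts.
    """
    totals = defaultdict(float)
    for sort_code, account_number, amount in rows:
        totals[(sort_code, account_number)] += amount
    return dict(totals)
```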

15. Develop the ETL and output the normalised data into a proper infrastructure

The goal of your ETL is now to produce the desired output according to the previously defined data types, so that no additional pre-processing is needed in your application and all of the data format and quality requirements are verified.

If the raw data does not match the desired output format, this is where you perform all of your transformations.

Any ETL job should always end with persistence to some data storage. Doing the ETL as an on-the-fly pre-processing step for your application is discouraged: you want to quickly repeat all of your analysis on top of the normalised data rather than re-run the ETL every single time.

You can now forget about the original raw data and move your focus onto the high-quality dataset meeting your application requirements. Time to develop the model? Not yet. How many assumptions have you made so far, and how many are you going to make in your model? Some data assumptions can be verified during the sanity check, but what about your formulated hypotheses? The goal of delivering a data product is to solve the business problem in its real context; unverified assumptions can easily invalidate your solution.


Details of how to perform the Hypothesis-Driven Analysis will follow on the next post of the “Agile Data Science Iteration 0” series, stay tuned.
Meanwhile, why not share it or comment below?

The Simple Solution << prev | next >> The Hypothesis-Driven Analysis