Tag Archives: ETL

Logical Data Warehouse for Data Science: map raw data directly from source to Spark in-memory with Tachyon

Common problems for large organizations dealing with Big Data and Data Science applications are: data stored in non-scalable infrastructure for analysis and processing; data governance and security policies. 1. Data often resides in a central data warehouse and RDBMS, of which many legacy applications … Continue reading

Posted in Agile, Big Data, Scala, Spark

Agile Data Science Iteration 0: The ETL

There is no dataset on Earth that does not require a sanity check.
Your data types are the first-class citizens of your application. Define them carefully, according to how you would like to model your data in your application rather than how the data currently looks.
Your ETL goal is then to produce the desired output according to the previously defined data types, so that your application needs no additional pre-processing and all requirements on data format and quality are verified. Continue reading
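
The idea of defining the target types first and letting the ETL enforce them can be sketched as follows. This is only an illustration under assumed names: the `Purchase` record, its fields, the date format, and the `parse_purchase` and `run_etl` helpers are hypothetical and do not come from the post.

```python
import math
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Purchase:
    """Hypothetical target type: how the application wants to see the data."""
    user_id: int
    amount: float
    day: date


def parse_purchase(raw: Dict[str, str]) -> Purchase:
    """Convert one raw row, raising ValueError when a sanity check fails."""
    try:
        user_id = int(raw["user_id"])
        amount = float(raw["amount"])
        day = datetime.strptime(raw["day"], "%Y-%m-%d").date()
    except KeyError as missing:
        raise ValueError(f"missing field: {missing}") from None
    if user_id <= 0:
        raise ValueError(f"non-positive user_id: {user_id}")
    if not math.isfinite(amount) or amount < 0:
        raise ValueError(f"invalid amount: {amount}")
    return Purchase(user_id, amount, day)


def run_etl(
    rows: List[Dict[str, str]],
) -> Tuple[List[Purchase], List[Tuple[Dict[str, str], str]]]:
    """Split raw rows into typed records and rejected rows with a reason."""
    clean: List[Purchase] = []
    rejected: List[Tuple[Dict[str, str], str]] = []
    for row in rows:
        try:
            clean.append(parse_purchase(row))
        except ValueError as err:
            rejected.append((row, str(err)))
    return clean, rejected
```

Everything downstream then works only with `Purchase` values, while the rejected rows and their reasons can be inspected separately as part of the sanity check.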

Posted in Agile, Big Data, Data Munging | 5 Comments