Category Archives: Data Munging

Functional Data Validation using monads and applicative functors

ETL is probably the most time-consuming part of every Data Science project. The quality of the extracted and crunched data is one of the major factors affecting the final results. In fact, real-world data is always messy and inconsistent. Data Validation … Continue reading

Posted in Big Data, Data Munging, Scala, Spark | 2 Comments

Agile Data Science Iteration 0: The ETL

There is no dataset on Earth that does not require a sanity check.
Your data types are the first-class citizens of your application. Define them carefully, according to how you would like to model the data in your application rather than what the data currently looks like.
Your ETL goal is then to produce the desired output according to the previously defined data types, so that your application needs no additional pre-processing and all requirements on data format and quality are verified. Continue reading

Posted in Agile, Big Data, Data Munging | 5 Comments