In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio

Abstract:

Legacy enterprise architectures still rely on relational data warehouses and require moving data to, and keeping it in sync with, the so-called “Data Lake”, where raw data is stored and periodically ingested into a distributed file system such as HDFS.

Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, for example to comply with regulations or to reduce latency. In such cases Alluxio (previously known as Tachyon) can make this data available in memory and share it among multiple applications.

We propose an Agile workflow that combines Spark, Scala, the DataFrame API (and the recent Dataset API), JDBC, Parquet, Kryo and Alluxio into a scalable, in-memory, reactive stack for exploring data directly from the source and developing high-quality machine learning pipelines that can then be deployed straight into production.

In this talk we will:

* Present how to load raw data from an RDBMS and use Spark to make it available as a Dataset

* Explain the iterative exploratory process and advantages of adopting functional programming

* Offer a critical analysis of the issues faced with the existing methodology

* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds

* Discuss some future improvements to the overall architecture

Original meetup event: http://www.meetup.com/Alluxio/events/233453125/

A Distributed Genetic Evolutionary Tuning for Data Clustering: Part 1

This is my original post, published on the AgilOne blog in June 2013, about a framework I developed for self-tuning data clustering algorithms.

For any data analytics service provider to run a high-margin, sustainable business, it has to deal with scalability, multi-tenancy and self-adaptability. Machine learning is a very powerful instrument for Big Data applications, but a bad choice of algorithm can lead to poor results in the intended analysis. One way to mitigate this is to automate the tuning process. Such a tuning process should require neither a priori knowledge of the data nor human intervention. As a Big Data Engineer at AgilOne, I worked on solutions for the open problem of self-tuning. The work led to the development of TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering. The result was a solution that automatically evaluates and tunes data clustering algorithms, so that clustering-based analytics services can self-adapt and scale in a cost-efficient manner.

Evaluating clusters

For the initial work we chose K-Means as our clustering algorithm. K-Means is a simple but popular algorithm, widely used in many data mining applications.
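To give a rough feel for what automatically evaluating a K-Means result can look like, here is a minimal sketch in plain Python. It is not TunUp's code: the toy points, the function names `k_means` and `wcss`, and the choice of within-cluster sum of squares as the score are all assumptions made for illustration, and the excerpt above does not say which evaluation measures TunUp actually uses.

```python
import random
from math import dist  # requires Python 3.8+


def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm (illustrative only); returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct starting points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances; lower means tighter clusters."""
    return sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Two well-separated toy blobs (made-up data, not from the original post).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Score each candidate k without any human inspection of the clusters.
scores = {}
for k in (1, 2):
    centroids, labels = k_means(points, k)
    scores[k] = wcss(points, centroids, labels)
```

A score like this can be computed without a human looking at the clusters, which is what makes an automated tuning loop possible. One caveat of this particular measure is that it keeps shrinking as k grows, so on its own it cannot pick the number of clusters.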

TunUp is open source and available on its GitHub page: https://github.com/gm-spacagna/tunup

The original report is available at: http://www.academia.edu/5082681/TunUp_A_Distributed_Cloud-based_Genetic_Evolutionary_Tuning_for_Data_Clustering

Data Clustering? don’t worry about the algorithm.

Introductory post on data clustering tuning, published on the AgilOne blog in May 2013.

We are constantly pushing to improve our underlying algorithms and make them as adaptive as possible. Taking a step back, our problem generally is to fit classes of models and algorithms to customer data sets of varying data quality. In addition, we need to automate this so that we can scale delivery of our offerings from a business perspective.

This high-level business goal boils down to a number of technical requirements. It means we need to find ways of automatically evaluating results based on customer data and adaptations, and we need to do this in many different contexts.

One of our engineers, Gianmario Spacagna[1], took a fresh look at how to tune clustering algorithms. In this blog post, I will briefly introduce the validation of clustering algorithms so that you can more easily appreciate and understand Gianmario’s upcoming blog post.