Category Archives: Big Data
Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working with large datasets. It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach. I personally believe that Spark will inevitably become the de facto Big Data framework for Machine Learning and Data Science.
Despite the differing opinions about Spark, let’s assume that a data science team wants to adopt it as its main technology. The choice of programming language is then often a dilemma. Should we build our models in Python or in Scala? Should we run the exploratory analysis in the IPython notebook or in IScala? Continue reading
At the Advanced Data Analytics team at Barclays we solved a Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is to recommend a sequence of WordPress blog posts that users may like, based on their historical likes and on blog, post and author characteristics. Continue reading
What is Spark? Six reasons why CIOs should find out (and one why they shouldn’t) – 02 Nov 2015 – Computing Analysis
There is no dataset on Earth that does not require a sanity check.
Your data types are the first-class citizens of your application. Define them carefully, based on how you would like to model your data in your application rather than on what the data currently looks like.
Your ETL goal is now to produce the desired output according to the previously defined data types, so that you don’t need to do any additional pre-processing in your application and all of the requirements on data format and quality are already verified. Continue reading
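As a rough illustration of the idea above, here is a minimal Python sketch of a typed ETL step. The `Transaction` type, its fields, and the `parse_transaction` helper are hypothetical names invented for this example, not taken from the original post; the point is only that the target type is defined first and every raw record is either converted into it or rejected by a sanity check.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """Hypothetical target type: how the application wants to see the data."""
    customer_id: str
    amount: float
    day: date


def parse_transaction(raw: dict) -> Optional[Transaction]:
    """Return a validated Transaction, or None if the raw record fails a check."""
    try:
        customer_id = str(raw["customer_id"]).strip()
        amount = float(raw["amount"])
        day = datetime.strptime(raw["date"], "%Y-%m-%d").date()
    except (KeyError, ValueError, TypeError):
        # Missing field or unparseable value: reject the record.
        return None
    if not customer_id or amount < 0:
        # Assumed business rules for this sketch: non-empty id, non-negative amount.
        return None
    return Transaction(customer_id, amount, day)


# Made-up raw records, as they might arrive from a source system.
raw_records = [
    {"customer_id": " c1 ", "amount": "12.50", "date": "2015-11-02"},
    {"customer_id": "c2", "amount": "not-a-number", "date": "2015-11-02"},
    {"customer_id": "", "amount": "3", "date": "2015-11-02"},
]

parsed = [parse_transaction(r) for r in raw_records]
clean = [t for t in parsed if t is not None]
rejected = len(parsed) - len(clean)
```

Downstream code can then work only with `clean`, relying on the types and checks having been enforced once, in the ETL step.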
HBase is a great technology for real-time querying of data using rowkey prefix matching as an index, but sometimes secondary indexes are required. We can organize our data by placing some columns in the rowkey and the remaining ones as column … Continue reading
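To make the rowkey idea concrete, here is a small pure-Python sketch that only simulates the behaviour; it does not use any HBase client. The `|` separator, the `make_rowkey` and `prefix_scan` helpers, and the country/city/user fields are assumptions made up for this example. Because keys are kept in sorted order, all rows sharing a prefix are contiguous and can be found without a full scan, while a lookup on a field that is not a leading part of the key cannot use that ordering.

```python
from bisect import bisect_left

SEP = "|"  # assumed separator between rowkey components


def make_rowkey(country: str, city: str, user_id: str) -> str:
    """Build a composite rowkey from the columns we want to query by prefix."""
    return SEP.join((country, city, user_id))


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a lexicographically sorted list."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        matches.append(key)
    return matches


# Made-up rows; a sorted list stands in for the store's ordering by rowkey.
keys = sorted([
    make_rowkey("UK", "London", "u1"),
    make_rowkey("UK", "Leeds", "u2"),
    make_rowkey("IT", "Rome", "u3"),
    make_rowkey("UK", "London", "u4"),
])

# Efficient: country and city are leading components of the key.
london = prefix_scan(keys, make_rowkey("UK", "London", ""))

# Inefficient: city alone is not a prefix, so every key must be inspected.
rome = [k for k in keys if k.split(SEP)[1] == "Rome"]
```

The second query is the kind of access pattern for which a secondary index becomes useful.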
Many of you are familiar with HBase. If you are not, HBase is a NoSQL database modeled after Google’s BigTable paper that aims to provide a key-value columnar database on top of HDFS, the Hadoop Distributed File System. HBase … Continue reading
What happens when a python eats a pig? Or better, when we embed a pig into a python? Well, we are not talking about real animals but about two very powerful technologies: Apache Pig and Python! In this post we … Continue reading