Category Archives: Big Data

6 points to compare Python and Scala for Data Science using Apache Spark

Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working with large datasets. It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach. I personally believe that Spark will inevitably become the de facto Big Data framework for Machine Learning and Data Science.

Despite the differing opinions about Spark, let’s assume that a data science team wants to adopt it as its main technology. The choice of programming language is often a dilemma. Shall we build our models in Python or in Scala? Shall we run the exploratory analysis using the IPython notebook or IScala? Continue reading

Posted in Agile, Machine Learning, Python, Scala, Spark | 15 Comments

WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

At the Advanced Data Analytics team at Barclays we solved a Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
The case study is recommending a sequence of WordPress blog posts that users may like, based on their historical likes and on blog, post and author characteristics. Continue reading

Posted in Agile, Classification, Machine Learning, Scala, Spark | 1 Comment

What is Spark? Six reasons why CIOs should find out (and one why they shouldn’t) – 02 Nov 2015 – Computing Analysis


Posted in Big Data, Scala

Agile Data Science Iteration 0: The ETL

There is no dataset on Earth that does not require a sanity check.
Your data types are the first-class citizens of your application. Define them carefully, according to how you would like to model your data in your application rather than how the data currently looks.
Your ETL goal is then to produce the desired output according to those previously defined data types, so that your application needs no additional pre-processing and all requirements on data format and quality are already verified. Continue reading
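
The three points above can be illustrated with a minimal Python sketch. It is not taken from the post itself: the `Purchase` type, its fields, and the `parse_row` and `run_etl` helpers are hypothetical names chosen for illustration. The idea is that the target type is defined first, and the ETL step either produces a valid instance of it or sets the raw record aside with a reason.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Purchase:
    """Target type, modelled on what the application needs (hypothetical fields)."""
    customer_id: int
    amount: float
    day: date


def parse_row(row: dict) -> Purchase:
    """Convert one raw record into a Purchase, raising ValueError on bad data."""
    try:
        customer_id = int(row["customer_id"])
        amount = float(row["amount"])
        day = datetime.strptime(row["day"], "%Y-%m-%d").date()
    except (KeyError, TypeError) as exc:
        # Missing or null fields are reported the same way as malformed ones.
        raise ValueError(f"missing or null field: {exc}") from exc
    # Sanity check on the value itself, not only on its type.
    if amount < 0:
        raise ValueError(f"negative amount: {amount}")
    return Purchase(customer_id, amount, day)


def run_etl(rows: Iterable[dict]) -> Tuple[List[Purchase], List[Tuple[dict, str]]]:
    """Split raw rows into typed, validated records and rejected rows with a reason."""
    clean: List[Purchase] = []
    rejected: List[Tuple[dict, str]] = []
    for row in rows:
        try:
            clean.append(parse_row(row))
        except ValueError as exc:
            rejected.append((row, str(exc)))
    return clean, rejected
```

With this shape, the application only ever receives `Purchase` values, and the rejected list gives the sanity check something concrete to report on.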

Posted in Agile, Big Data, Data Munging | 5 Comments

HBase Secondary Indexes using Fuzzy Filter

HBase is a great technology for real-time querying of data using rowkey prefix matching as an index, but sometimes secondary indexes are required. We can organize our data by inserting some columns in the rowkey and the remaining ones as column … Continue reading

Posted in Big Data, HBase | 1 Comment

Hive mapping of HBase columns containing colon ‘:’ character

Many of you are familiar with HBase. If you are not, HBase is a NoSQL database modeled after Google’s Bigtable paper that aims to provide a key-value columnar database on top of HDFS, the Hadoop Distributed File System. HBase … Continue reading

Posted in Big Data, HBase, Hive

Embedding Latin Pig into Python, the third millennium dinosaur!

What happens when a python eats a pig? Or better, when we embed a pig into a python? Well, we are not talking about real animals but about two very powerful technologies: Apache Pig and Python! In this post we … Continue reading

Posted in Big Data, Pig, Python, Software Development

Ubiquitous Computing for Big Data Insight: Helpful Tool or Privacy Breaker?

The next generation of devices able to access the Internet and the Web may not be characterized by computers, smartphones, tablets or appliances designed for this specific purpose. … Continue reading
