Latent Panelists Affinities: a Social Science case study

As part of the IBM PartyCloud held in Milan on 20 September 2018, I gave a talk, “A Journey into Data Science & AI”, presenting a case study about estimating panelists’ latent affinities. I showed the components needed to develop an intelligent social agent able to classify entities and estimate latent affinities. The session also covered good practices and common challenges faced by R&D organizations dealing with Machine Learning products.

If you would like to discuss how AI technologies can be applied to social science, get in touch!

Thoughts about data operations

This is part 4 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Infrastructure

Get the IT team very close to the data scientists. Ideally, one member of the team should be a DevOps engineer, or, to use the newer title, a DataOps engineer. Read this article from InfoWorld: DevOps can take data science to the next level.
Finding the right balance between IT workarounds and clean solutions is difficult, especially when it involves long, tedious processes. It is good practice to “sign” contracts with the IT team stating what you are about to deliver and what requirements you need in order to do so.
As general advice, you want to operate in your familiar environment, where all of the tools you like and proper cluster resources are available. Unfortunately, data is always fragmented across multiple systems. Try to get the data periodically ingested into your Data Lake (typically a Hadoop cluster). When this is not possible, make sure you have the permissions to Sqoop it yourself. Data virtualization technologies also come in particularly handy for creating views of a dataset in your Big Data environment.
Don’t implement solutions that are tied to the underlying infrastructure. The Spark DataFrame API, for example, does an excellent job of abstracting away the I/O operations. See this blog post on how to logically map tables from a relational database into a Spark cluster: https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon.
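As an illustration of that abstraction, here is a minimal sketch of pulling a relational table through Spark’s JDBC data source and persisting it into the Data Lake; the connection details, table name and paths are hypothetical, and it uses the SparkSession API of later Spark versions:

import org.apache.spark.sql.SparkSession

// Hypothetical session and connection details; the point is that the DataFrame API hides the I/O.
val spark = SparkSession.builder().appName("jdbc-ingest").getOrCreate()

val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
  .option("dbtable", "public.customers")
  .option("user", "analyst")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

// Persist as Parquet in the Data Lake so downstream jobs never touch the source database.
customers.write.mode("overwrite").parquet("/data/lake/customers")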
Granting the team admin rights on the dev cluster will massively boost productivity and will let them master their Unix skills. Trust and transparency are essential. Security should be enforced during the interview process, by hiring competent and smart people, and through personnel training, instead of by killing productivity with nonsensical restrictions.

Release process

A data product, at the end of the day, is a piece of software that takes data as input and produces data as output, containing insights consumable either via a visual dashboard or by integrating them into an existing IT system.

A few options we recommend for releasing are:

  • Continuous delivery. Ideally one pull request per project per day.
  • Continuous integration. Ideally, a Jenkins box that runs your tests and automated scripts every single time a new ticket is merged into develop, and especially every single time a new release is cut in master. If the box can access a data cluster, it can even run the end-to-end evaluation and store the results for you.
  • Every end of sprint should be matched with a new release consisting of:
    • merging the develop branch back into master (either manually or through an automated script such as the git-flow command line).
    • publishing your package containing source code and scripts to a common repository like Nexus.
    • reporting the latest results in Confluence (see the documentation section).
    • releasing all of the merged tickets from Jira so that they don’t show up in the board but are still accessible for reference.
    • demoing inside the team and/or to your stakeholders if the changes are relevant.
    • celebrating in a pub.

It would make sense to plan the release for the afternoon of the last day of the sprint (typically Friday), but sometimes it might be advantageous to release on Thursday so that you have Friday for hot-fixes if something goes wrong.

Deployment

It is very hard to give guidelines here, since each project has its own deployment process that depends on many factors, such as the business context and the practical issues associated with it.

If your application is deployed end-to-end by external teams whose workflow and data sources you don’t control, you will find it extremely helpful to have data sanity checks performed at every single run. Those checks make sure that the people running your application don’t accidentally feed in data that does not conform to the schema and/or model assumptions. Throwing an exception with some context information is fundamental to making your system production-ready.
A typical example is validating the values of categorical fields. We packed into our jars the reference files containing all of the possible values and their descriptions. If the specified dataset contains values that don’t find any match, the data sanity check will throw an exception.
This handling of incorrect data may also be done during the ETL process, and it is generally not needed if the training is done by the data science team itself; in that case the deployment concerns only the trained model.
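As a concrete illustration, here is a minimal sketch of such a categorical sanity check; the column name, reference values and input DataFrame name are hypothetical (in our case the reference file was packaged inside the jar):

import org.apache.spark.sql.DataFrame

// Hypothetical reference values for a categorical column, e.g. loaded from a file shipped in the jar.
val validChannels = Set("BRANCH", "ONLINE", "TELEPHONE", "MOBILE")

// Throws with context if the column contains values outside the reference set.
def checkCategorical(df: DataFrame, column: String, valid: Set[String]): Unit = {
  val unexpected = df.select(column).distinct().collect()
    .map(_.getAs[String](0))
    .filterNot(v => v == null || valid.contains(v))
  if (unexpected.nonEmpty)
    throw new IllegalArgumentException(
      s"Data sanity check failed: column '$column' contains unexpected values: ${unexpected.mkString(", ")}")
}

checkCategorical(inputData, "channel", validChannels)  // inputData: the DataFrame supplied at run time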

Deployment is the stage with the highest number of blockers and technical issues. The final measure of success is, however, only determined upon deployment in production, thus deployment issues should be top priority.

The balance of exploratory analysis and development

This is the part 3 of 4 of the “Lessons learnt from building Data Science systems at Barclays” series.

Exploratory Data Analysis / Research

Exploratory analysis should precede and follow any task, from modelling, design and development to benchmarking. The major problem is: how do you share, track and monitor your findings? How do you make your analysis repeatable and open to scrutiny from the outside? This is still an open problem.

Notebooks tend to be the best tools for the job, but be careful. EDA is an open research/investigation task, thus you need criteria for drawing the line on when to stop. My suggestion is to avoid scope-free analysis and always accompany EDA with a well-defined goal/task. In a small and clearly defined task you know what you want to achieve; you just don’t know how to, and what obstacles you may find.

A proposed sequential but iterative workflow is:

  1. The planned story defines the high-level goal you are working towards.
  2. Then start your EDA and as soon as you find something interesting you can stop investigating and define a development sub-ticket.
  3. You now start developing the minimum amount of code that implements the specified requirements defined during the analysis step.
    Those requirements should not change after the first definition; you want to complete that work and then refactor it later in another sub-ticket or in a different iteration.
  4. Before sending it for review and/or resolving the sub-ticket, you should perform another EDA step to verify that the newly created branch meets the intended requirements. You are not solving the greater problem; you only care about the just-defined sub-problem. It is very dangerous to mix development and analysis at the same time, since you may end up in an infinite loop where you keep changing your requirements as you analyse and never get to an end.
  5. After completion of the subtask, you can switch back to the main workflow thread.

My suggestion is to time-box any open-ended task. Say you are going to spend no more than X hours/days on this research, and before then you will come up with some development requirements or insight reports that move the project towards the final story goal. Remember, you will have to solve the story by the end of the sprint. Scope the problems small enough that you reduce the risk of not meeting expectations.

Get to an end-to-end solution as quickly as possible and postpone any complications, ideas or new features to the next iterations. EDA/Research is generally a good place for filling your backlog for future scoping.

I leave it as an open question what to do with those notebooks after the investigation is completed. They are a bit tricky to maintain: when you make a change to the codebase or a new dataset comes in, the notebooks become obsolete, and we don’t want to refactor them every time to make sure they still work. I personally see notebooks more as one-off analyses that are archived after being used.

I tend to translate all of my findings and assumptions into project requirements so that they don’t get lost. In my opinion, only the automated tasks should be maintained over time. Results from manual tasks that cannot be automated should be documented, stamped and archived in the wiki.

Evaluation

Unit tests make sure that the code does what it is meant to do, but that does not imply it solves the right problem in an acceptable way. The evaluation strategy typically reflects the real business scenario in which the model will be used. The choice of performance metrics must have a meaningful explanation within the business context. Metrics should be easy to interpret for your stakeholders, who generally are not data scientists and only speak the company’s business language.

A good tip is to create a Kaggle-like framework that does the following (a minimal sketch is given after the list):

  • defines the APIs reflecting your custom data types
  • uses an abstract interface representing the particular implementation (which could be split into multiple components, e.g. transformer, trainer, model)
  • knows how to robustly validate the given implementation (e.g. cross-fold validation, domain-specific splitting that avoids data leakage, a mix of timestamp and customerId partitioning…)
  • produces one or a pool of interpretable performance metrics, such as mean average precision @ N, uplift, spam rate, loss rate, retention rate. Avoid abstract concepts like area under the curve or F-score.
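Here is that minimal sketch; the domain types, field names and splitting strategy are hypothetical placeholders, not our actual implementation:

// Hypothetical domain types.
case class Sample(customerId: String, features: Map[String, Double], label: Double)
case class Prediction(customerId: String, score: Double)

// Abstract interface representing the particular implementation under evaluation.
trait Model {
  def train(data: Seq[Sample]): Unit
  def predict(data: Seq[Sample]): Seq[Prediction]
}

// The framework owns the validation logic (e.g. splitting by customerId to avoid leakage)
// and produces business-interpretable metrics keyed by name.
trait Evaluator {
  def split(data: Seq[Sample]): Seq[(Seq[Sample], Seq[Sample])]
  def metrics(predictions: Seq[Prediction], holdout: Seq[Sample]): Map[String, Double]

  def evaluate(model: Model, data: Seq[Sample]): Map[String, Double] = {
    val perFold = split(data).map { case (train, test) =>
      model.train(train)
      metrics(model.predict(test), test)
    }
    // Average each metric across folds.
    perFold.flatten.groupBy(_._1).map { case (name, values) => name -> values.map(_._2).sum / values.size }
  }
}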

Soon you will find a blog post from our team about an offsite in Lanzarote where, following this Kaggle-like structure, we prototyped 6 different models for a recommender system in less than a week.

When building the evaluation framework, a few questions you want to ask are:

  • What does a positive/negative sample represent in this business scenario?
  • Is recall important? Why do you care about accuracy?
  • What actions can be taken upon prediction?
  • In what form can the model be used? How can the insights be presented/visualized? Can it be integrated into an existing IT system?
  • What are the capabilities/practical issues of following the decisions suggested by the model?
  • What is the uplift of the data-driven solution compared to traditional business-as-usual performance?
  • How can you test the trained model in the live environment (is A/B testing possible, or would the bad scenario cause a lot of damage)?
  • Does the effectiveness of your solution depend only on your model, or also on other parties? (e.g. predicting which customers to contact for marketing purposes also relies on the conversion rate of the marketing team)
  • How can you feed the results back to update the model? At what rate? Is the model easy to update, or must it be re-trained on every batch of newly collected data? Can you re-train it within the update interval?
  • Will the triggered actions influence the upcoming data (e.g. a recommender system can change the distribution of the future population)? Are there any amplification effects (if you recommend the most popular items, those will become even more popular, and so on)?

My experience suggests that the more time you spend implementing a robust and exhaustive evaluation framework, the easier and more reliable maintaining and improving the system will be later on. Time spent here is a good investment and requires a lot of thinking from all three data science aspects: business, statistics and engineering.

Demo

It is good practice to demo advances and new results to the team and/or stakeholders at the end of the sprint. Feeling the continuous pace of delivery and improvement is an excellent psychological element and increases trust and confidence.

Moreover, it is the place where scrutiny comes in and you can have your methodology and interpretations challenged. Any deliverable or document presented during the demo should be stored in the wiki with a date associated with it.

Mapping DataFrame to a typed RDD

I have recently published a blog post on DZone, “Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds”, which describes the workflow and methodology that we use at Barclays to load data from the raw source (a relational database) into the Data Science cluster (Spark). One of the described components is the mapping from a DataFrame to a typed RDD of a custom case class.

There are a bunch of reasons why you would like to make your DataFrame typed; the figure below summarizes them:

[Figure: DataFrame vs. typed RDD comparison]

Examples of when it is more convenient to use a DataFrame vs. an RDD can be found in this workshop: WordPress Blog Posts Recommender

In this tutorial I have pulled out of the Tachyon blog post the part related to the conversion from DataFrame to RDD. The inverse conversion, RDD to DataFrame, is straightforward and can be found in the recommender workshop mentioned above.

Typed Case Class Mapping

After we have constructed the DataFrame collection from the raw source we can now map it into an RDD of our ad-hoc case classes. Since a DataFrame is also an RDD of type org.apache.spark.sql.Row, it already provides the map/flatMap methods.

If there are no null values in any row, we could use pattern matching to extract each column from the Row object:

import org.apache.spark.sql.Row

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)

dataframe.map {
  // Bind the columns we need (a, b, c, d, e) and ignore the others with `_`.
  case Row(a: java.math.BigDecimal, b: String, c: Int, _: String, d: java.sql.Date,
           e: java.sql.Date, _: java.sql.Timestamp, _: java.sql.Timestamp, _: java.math.BigDecimal,
           _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
}

This approach will fail on null values, due to the casting to the explicit types of each field in the unapply method of the Row class. You can discard all the rows containing null values by doing:

dataframe.na.drop()

But that will drop records even if the null fields are not the ones we use in our case class.
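One way around this, if it fits your case, is to restrict the drop to the columns the case class actually uses (the column names here are hypothetical):

// Drop only the rows where the columns we actually map are null.
dataframe.na.drop(Seq("a", "b", "c", "d", "e"))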

If you want to handle nulls using Scala Options, you could turn the Row object into a List and then use the following pattern:

case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)

dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int, _: String, d: java.sql.Date,
            e: java.sql.Date, _: java.sql.Timestamp, _: java.sql.Timestamp, _: java.math.BigDecimal,
            _: String) => MyClass(a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
})

If the columns you are interested in are sparse, you could fetch them individually, either by index or by column name:

row.getAs[SQLPrimitiveType](columnIndex: Int)
row.getAs[SQLPrimitiveType](columnName: String)

For the list of mapping of SQL primitive types and their corresponding Java/Scala classes, see: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html.
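For example, here is a sketch of building the case class by fetching only the needed columns and wrapping the nullable one in an Option; the column names and positions are hypothetical and should match your actual schema:

import org.apache.spark.sql.Row

// Assumes the layout of the earlier examples: column 0 is a BigDecimal,
// "c" is a nullable Int, "d" and "e" are dates.
def toMyClass(row: Row): MyClass = MyClass(
  a = row.getAs[java.math.BigDecimal](0).longValue(),
  b = row.getAs[String]("b"),
  c = if (row.isNullAt(row.fieldIndex("c"))) None else Some(row.getAs[Int]("c")),
  d = row.getAs[java.sql.Date]("d").toString,
  e = row.getAs[java.sql.Date]("e").toString
)

val typedRdd = dataframe.map(toMyClass)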

N.B. The described procedure does not take advantage of the recently released Dataset API (http://spark.apache.org/docs/1.6.0/sql-programming-guide.html#datasets), which should automate the whole process of converting between DataFrames and RDDs. At the time we wrote this note we had not yet tested Datasets. There are also open-source projects like Frameless (https://github.com/adelbertc/frameless), and an ongoing discussion on its Gitter channel about how to leverage the awesome Shapeless (https://github.com/milessabin/shapeless) library to make Spark more functional and compile-time type-safe.
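For completeness, here is a minimal sketch of what that conversion looks like with the Dataset API; this is not something we have validated in this workflow, and it assumes the DataFrame column names and types line up with the case class fields:

// The implicit Encoder derives the mapping automatically when names and types match.
import sqlContext.implicits._   // or spark.implicits._ on Spark 2.x

val typed = dataframe.as[MyClass]   // Dataset[MyClass]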

Similar articles:

Type safety on Spark Dataframes: http://www.51zero.com/blog/2016/2/24/type-safety-on-spark-dataframes-part-1

Reasoning Under Uncertainty: Do the right thing!

The amount of digital data in the new era has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before. Simply recording data is one thing, whereas the ability to utilize it and turn it into a profit is another. Supposing we want to collect as many pieces of information as we can gather from any source, our database will be populated with a lot of sparse, unstructured data whose correlations are not explicitly clear. In this essay we summarize the approach proposed in Chapter IV, “Uncertain Knowledge and Representation”, of the book “Artificial Intelligence: A Modern Approach” by Russell S. and Norvig P., showing how the problem of reasoning under uncertainty is applied in data science, and in particular in the recent data revolution scenario. The proposed approach analyzes an extension of Bayesian networks called decision networks, which turn out to be a simple but elegant model for reasoning in the presence of uncertainty.

Ubiquitous Computing for Big Data Insight: Helpful Tool or Privacy Breaker?

The next generation of devices able to access the Internet and the Web may not be characterized by computers, smartphones, tablets or appliances designed for this specific purpose. Every common item of daily life might be able to connect to online services while the people using them are completely unaware of it. The Apple Research center study describes the future of computing and of human interaction in a networked society, giving rise to a new computing concept. The Apple vision is briefly explained by Frank Casanova, who says:
“The Concept of computers as things that you walk up to, sit in front of and turn on will go away. In fact, our goal is to make the computer disappear. We are moving towards a model we think of as a ‘personal information cloud’. That cloud has already begun to coalesce in the form of the Internet. The Internet is the big event of the decade […]. We’ll spend the next 10 years making the Net work as it should, making it ubiquitous.” [Frank Casanova]
This new concept was named Ubiquitous Computing (UC) by Mark Weiser. The original idea can be summarized as follows:
“Dwelling with computers means that they have their place, and we ours, and we co-exist comfortably. Unfortunately, our existing metaphors for computers [are] inadequate to describe the ‘dwelling’ relationship. Over the next twenty years computers will inhabit the most trivial things: clothes labels (to track washing), coffee cups (to alert cleaning staff to moldy cups), light switches (to save energy if no one is in the room), and pencils (to digitize everything we draw). In such a world, we must dwell with computers, not just interact with them.” [Mark Weiser, 1996]
The aim of UC is to move computation into many small, embedded and specialized devices present in our dynamic environments while remaining totally transparent to the users around them. Ubiquitous computing is a revolutionary new paradigm that will spread information systems throughout our society, creating a cutting-edge networked society.
To understand possible candidates for UC devices, it may be useful to mention Dick Rijken’s own vision of UC:
“It can drag interactivity away from technological fascination and wizardry into the realm of human experience and action. What is being designed is no longer a medium or a tool in the traditional sense, but something far more intangible, embedded in a continuously changing environment where everything is connected to everything else.” [Dick Rijken, 1994]
In short, to be defined as a UC product, a product has to be present everywhere, to be small, and to be aware of the context where it is located. These three characteristics permit the user to interact with the devices with complete freedom of movement and without depending on technical knowledge.
All of this is sustained by industry interest in ubiquitous applications, their business opportunities, and the growth of related technologies such as the Internet of Things, smart spaces, sensor networks, and so on. Yet there are some limitations due to hardware costs, maintenance and low scalability, but also, and foremost, due to distrustful public opinion regarding the massive adoption of these new technologies.
This requires developers and manufacturers of ubiquitous technologies to carefully consider the social and ethical impact, which may negatively influence the business model and market value of the product. Nevertheless, UC technologies applied in a non-invasive manner can lead to important new instruments.
This essay concentrates on the treatment of huge amounts of data, recently referred to as Big Data. It will first give an idea of the size and worth of these data, then focus in depth on the Ubiquitous Computing concept as a new channel for data gathering and its impact on society and on new business opportunities based on customer insight data. Finally, we will conclude with a careful examination of the related privacy issues and proposed solutions.