Aircloak - anonymized analytics

sasajuric · December 18, 2017, 9:20am

Hi everyone,

Here’s an overview of what we’re doing with Elixir at Aircloak. Our goal is to tackle the challenge of privacy-sensitive data analytics. This sounds a bit vague, so let’s first explore what this is all about.

Data can be very useful, because we can draw various conclusions from it. Consider a huge amount of geolocation data from people’s phones. By analyzing such data, we can, for example, discover traffic bottlenecks, and use that insight to select the optimal (shortest) navigational path. On the other hand, such data is also very dangerous. If we have access to the geolocation data of many people, we can easily track the movements of particular individuals, which is clearly a huge violation of privacy.

Our aim is to empower companies to get the analytical benefits of their data without leaking sensitive information. Our system acts as a proxy between analysts and sensitive data. The analysts issue their queries through the Aircloak system, which in turn retrieves the sensitive data, passes it through our anonymization filter, and finally returns a query result that doesn’t reveal anything sensitive about any individual.

What allows us to do this is our anonymization algorithm. While most companies reach for naive masking of sensitive fields (such as first/last name, social security number, …), we fetch the raw data required by the query, evaluate the query ourselves, and then decide which results we can safely emit without leaking sensitive data. The details of the algorithm are published in this paper.
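
To make the idea of "deciding which results can be safely emitted" more concrete, here is a minimal Python sketch. It is not the algorithm from the paper: it only illustrates the general principle with a simple low-count filter, and the function name `anonymized_counts` and the fixed `min_users` threshold are assumptions made for this example.

```python
from collections import defaultdict


def anonymized_counts(rows, min_users=5):
    """Count distinct users per bucket and drop buckets with too few users.

    rows: iterable of (user_id, bucket) pairs, e.g. (user id, city).
    min_users: hypothetical fixed threshold, chosen only for illustration.
    """
    users_per_bucket = defaultdict(set)
    for user_id, bucket in rows:
        users_per_bucket[bucket].add(user_id)

    # Emit only buckets backed by enough distinct users, so that a
    # result row never describes just one or two individuals.
    return {
        bucket: len(users)
        for bucket, users in users_per_bucket.items()
        if len(users) >= min_users
    }


rows = [
    (1, "Berlin"), (2, "Berlin"), (3, "Berlin"),
    (4, "Zagreb"),
]
print(anonymized_counts(rows, min_users=2))
```

In this example the single-user "Zagreb" bucket is suppressed, while the "Berlin" bucket is reported. Masking the name column alone would not have protected that single user, which is why the decision is made on the evaluated result rather than on individual fields.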

The implementation of our system has quite a lot of interesting technical challenges. Our backend, implemented completely in Elixir, is installed on client premises. The system itself is split into two components, called Air and Cloak.

Cloak is the core component of our system. This component can accept analytical queries, and return anonymized results. In order to do its job, the Cloak component needs access to the underlying database. To minimize security risks, the Cloak component doesn’t act as a server (no network port needs to be open). Instead, the Cloak establishes a websocket connection to the Air component, which is the frontend system that can be used by analysts to submit their queries. The Air component has no access to the database, and doesn’t even need to be in the same network. This reduces the risk that a vulnerability in the Air (which is an externally accessible component) could lead to a leak of all sensitive data.

When an analyst submits a query, the Air component will send a message over the established connection, and Cloak will asynchronously send the result back. This is the first technical curiosity of our system (and one of the first things we’ve implemented): we have two Elixir systems chatting over the websocket connection. For this purpose, we’ve implemented an Elixir client for the Phoenix channels protocol, which we’ve open sourced here.
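
As a rough illustration of what travels over such a connection, the following Python sketch encodes and decodes channel messages. It assumes the array layout `[join_ref, ref, topic, event, payload]` used by the V2 JSON serializer in Phoenix; the topic `"cloak:main"`, the event `"run_query"` and the payload fields are hypothetical names invented for this example, not the ones Air and Cloak actually use.

```python
import json


def encode_message(join_ref, ref, topic, event, payload):
    """Serialize one channel message as a JSON array (assumed V2 layout)."""
    return json.dumps([join_ref, ref, topic, event, payload])


def decode_message(text):
    """Parse a JSON array message back into a dictionary of named fields."""
    join_ref, ref, topic, event, payload = json.loads(text)
    return {
        "join_ref": join_ref,
        "ref": ref,
        "topic": topic,
        "event": event,
        "payload": payload,
    }


# The client first joins a topic, then messages flow in both directions.
join = encode_message("1", "1", "cloak:main", "phx_join", {})

# Hypothetical query request pushed to the connected client.
request = encode_message(
    "1", "2", "cloak:main", "run_query",
    {"query_id": "q-42", "statement": "SELECT COUNT(*) FROM users"},
)

print(decode_message(join)["event"])
print(decode_message(request)["payload"]["statement"])
```

Because each message carries a `ref`, the receiving side can answer asynchronously and the sender can still match the eventual reply to the original request.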

The Cloak component itself consists of a bunch of interesting technical challenges. First, the queries submitted by analysts are written in SQL. Using the Combine library, we’ve implemented an SQL parser, which converts a textual SQL query into an AST describing what the analyst wants to retrieve. We then issue a query (or a set of queries) to the database, to fetch the desired data.
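
The parser-combinator approach can be sketched in a few lines of Python. This is a toy, not the Combine-based Elixir parser: it handles only `SELECT col, ... FROM table`, and the helper names (`token`, `sequence`, `sep_by1`, `parse_select`) and the AST shape are assumptions made for this example.

```python
import re


def token(pattern):
    """Parser matching a regex after optional leading whitespace."""
    regex = re.compile(r"\s*(" + pattern + r")", re.IGNORECASE)

    def parse(text, pos):
        match = regex.match(text, pos)
        return (match.group(1), match.end()) if match else None

    return parse


def sequence(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos

    return parse


def sep_by1(item, separator):
    """Parse one or more items separated by the given separator."""
    def parse(text, pos):
        result = item(text, pos)
        if result is None:
            return None
        value, pos = result
        values = [value]
        while True:
            sep = separator(text, pos)
            nxt = item(text, sep[1]) if sep else None
            if nxt is None:
                return values, pos
            values.append(nxt[0])
            pos = nxt[1]

    return parse


identifier = token(r"[A-Za-z_][A-Za-z0-9_]*")
select_stmt = sequence(
    token(r"SELECT\b"), sep_by1(identifier, token(",")),
    token(r"FROM\b"), identifier,
)


def parse_select(sql):
    """Turn a minimal SELECT statement into a small dictionary AST."""
    result = select_stmt(sql, 0)
    if result is None or sql[result[1]:].strip():
        raise ValueError("unsupported statement: " + sql)
    (_, columns, _, table), _ = result
    return {"type": "select", "columns": columns, "from": table}


print(parse_select("SELECT name, age FROM users"))
```

Small parsers such as `token` are combined into larger ones with `sequence` and `sep_by1`, so the grammar reads almost like the SQL it describes. A real SQL parser needs many more rules (expressions, joins, subqueries, reserved words), but it grows by composing the same kind of building blocks.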

The Cloak can work with different databases, such as PostgreSQL, MySQL, SQL Server, MongoDB, SAP HANA, … We offload as much of the query as possible to the underlying database, but we also need to perform some post-processing in the Elixir part. This can happen if the underlying database doesn’t support the features we support in our SQL language. The most obvious example of this is MongoDB, where many of the SQL features (for example joins and subqueries) are not possible, so we need to emulate them in the Elixir layer. In addition, our anonymization algorithm requires that we process some functions (for example aggregation functions) in a special way, which again leads to offloading some parts of the query execution to Elixir.
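
Emulating a join outside the database typically means fetching both row sets and combining them in application code. The Python sketch below shows one common way to do that, a hash join; it is an illustration of the technique rather than the Cloak's Elixir implementation, and the function name `inner_join` and the sample tables are made up for this example.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, left_key, right_key):
    """Hash join over two collections of dictionaries.

    Builds an index over the right-hand rows, then streams the left-hand
    rows and yields one merged row per matching pair.
    """
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_key]].append(row)

    for left in left_rows:
        for right in index.get(left[left_key], []):
            # On a column name clash the right-hand value wins here;
            # a real engine would qualify column names instead.
            yield {**left, **right}


users = [
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bob"},
]
purchases = [
    {"buyer_id": 1, "item": "book"},
    {"buyer_id": 1, "item": "pen"},
]

for row in inner_join(users, purchases, "user_id", "buyer_id"):
    print(row)
```

Only the smaller side needs to be held in memory as an index, while the other side can be streamed, which matters once the row counts get large.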

Supporting different data sources, and executing parts of the query in Elixir is quite challenging. Adding to that challenge is the fact that we deal with large data. We might need to query millions of rows in various ways, fetch the data, perform additional emulation in Elixir, and finally pass everything through our anonymization filter. This can incur a significant performance penalty, so tuning our query engine is something we frequently need to do. Our query execution engine is certainly the most complex part of our whole system, but that’s not surprising, given that it implements the core part of our system.

Compared to the Cloak, the Air component is a more conventional Phoenix application. It powers a typical web server which allows analysts to issue cloak queries. We have a simple UI which you can use to manage analysts and connected cloaks. The administrator can grant permissions to analysts, and they can then issue their queries through the component UI. We also expose a simple REST API for submitting queries from third-party software.

As a particularly interesting technical aspect, the Air component pretends to be a PostgreSQL database. By understanding the server side of the PostgreSQL protocol, the Air can accept connections from PostgreSQL-aware tools, such as psql. This trick allows us to integrate with analytical tools, such as Tableau, which in turn improves the usability of our system. Our CTO, Sebastian, wrote more about this feature here.
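
To give a feel for what "understanding the server side of the protocol" involves, here is a minimal Python sketch that parses the first message a PostgreSQL client sends. It follows the startup message layout of protocol version 3.0 (a 4-byte length, a 4-byte version or request code, then null-terminated name/value pairs); the function names and the sample parameters are assumptions for this example, and authentication, TLS negotiation and query handling are left out entirely.

```python
import struct

PROTOCOL_V3 = 196608          # major version 3 in the upper 16 bits, minor 0
SSL_REQUEST_CODE = 80877103   # sent by clients asking to upgrade to TLS


def build_startup_message(params):
    """Build a startup message the way a client would (used for the demo)."""
    body = struct.pack("!i", PROTOCOL_V3)
    for name, value in params.items():
        body += name.encode() + b"\x00" + value.encode() + b"\x00"
    body += b"\x00"  # terminator after the last name/value pair
    return struct.pack("!i", len(body) + 4) + body


def parse_startup_message(data):
    """Parse the first client message into a dictionary."""
    length, code = struct.unpack("!ii", data[:8])
    if length != len(data):
        raise ValueError("length field does not match message size")
    if code == SSL_REQUEST_CODE:
        return {"type": "ssl_request"}
    if code != PROTOCOL_V3:
        raise ValueError("unsupported protocol version")

    raw = data[8:]
    if not raw.endswith(b"\x00"):
        raise ValueError("missing terminator")
    # Drop the final terminator and the last value's null byte, then split.
    parts = raw[:-2].split(b"\x00") if len(raw) > 1 else []
    names = [part.decode() for part in parts[0::2]]
    values = [part.decode() for part in parts[1::2]]
    return {"type": "startup", "params": dict(zip(names, values))}


message = build_startup_message({"user": "alice", "database": "analytics"})
print(parse_startup_message(message))
```

A server that wants to be usable from tools such as psql has to answer this message with the appropriate authentication and status messages before any query can be sent, and then keep speaking the same binary protocol for the rest of the session.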

Both components are hosted on client premises. To make the installation as easy as possible for them, we ship them as Docker images. We’ve created a simple ad-hoc build pipeline, which builds each image on the build server, tags it, and pushes it to the private Docker repository. The clients are granted access to the repository, and they get a set of simple instructions. In its simplest form, the installation boils down to pulling the image, providing some configuration file, and starting it.

As you can see, there are many interesting technical aspects in our system. We worked hard, but also had a lot of fun building it, and we believe we’ve developed something unique and useful. In the process, we’ve relied extensively on many great libraries and frameworks from the BEAM ecosystem, such as Phoenix, Cowboy/Ranch, Ecto, Combine, and others. These libraries have helped us tremendously with our work, and we feel very grateful for that.

Aircloak · December 18, 2017, 9:27am

Saša, thank you for this intro. And thank you to the Elixir (Forum) community for the lively exchange and input over the years! We hope you all find something interesting in what we’re doing.

If you have any questions, feel free to hit us up here on the forum or via solutions@aircloak.com.

Qqwy · December 18, 2017, 9:42am

This sounds absolutely insane! Both as a wonderful application in general, as well as something that Elixir is very well suited for.

One thing I am currently wondering about is how you’ll handle updates of the system: Would people then need to re-install the Docker images? Where is the data used by Cloak and Air stored in this case?

Absolutely wonderful that the application is separated in this way (very smart!) and also that the application is hosted on the Client’s machine; immediately after the introduction I was wondering how you would solve the privacy-problems, and this is probably the only viable solution (bar homomorphic encryption which is still in its infancy, that is).

A huge thumbs-up from me! I am definitely going to track this project, and possibly try it out with one of the projects I am doing, if the opportunity arises.

sasajuric · December 18, 2017, 11:10am

Thank you for the positive feedback!

Yes, we currently don’t support a live upgrade scenario. You need to take the system down, pull the new versions of the images, and restart it. Alternatively, you could do something like a blue/green deployment, but so far no one has tried it out in practice.

Cloak persists no data, so you can safely restart it.

Air uses its own PostgreSQL database, and it is the responsibility of the client to manage it. Looks like I skipped that part when talking about the setup.

AstonJ · December 18, 2017, 6:39pm

Thanks for sharing what you do and how you’re using Elixir, Saša.

It’s evident that you and the @Aircloak team are a talented bunch, and so many of us have benefited from your wisdom and knowledge over the years - whether that be via your book, posts on the forum, talks, blog posts and even tweets.

I’m sure I speak for many of us when I say we are very lucky to have you in the Elixir community and grateful for all the time and effort you put into not just evangelising Elixir… but helping so many of us too - thank you!

And thank you Aircloak for supporting and nurturing such talent - I look forward to seeing the rest of the team on the forum soon!

sasajuric · December 18, 2017, 7:58pm

Thank you for the kind words Aston.

I feel lucky myself for being able to work with smart and nice people on such interesting challenges, and I even get to use my favourite tech. And it goes without saying that I’m very happy to be a part of the Elixir community. Everyone is super nice and helpful, and I get to learn a lot myself on this forum.