Hi everyone,
Here’s an overview of what we’re doing with Elixir at Aircloak. Our goal is to tackle the challenge of privacy-sensitive data analytics. This sounds a bit vague, so let’s first explore what this is all about.
Data can be very useful, because we can draw various conclusions from it. Consider a huge set of geolocations of people’s phones. By analyzing such data, we can, for example, discover traffic bottlenecks, and use that insight to select the optimal (shortest) navigational path. On the other hand, such data is also very dangerous. If we have access to the geolocation data of many people, we can easily track the movements of particular individuals, which is clearly a huge violation of privacy.
Our aim is to empower companies to get the analytical benefits of their data without leaking sensitive information. Our system acts as a proxy between analysts and sensitive data. The analysts issue their queries through the Aircloak system, which in turn retrieves the sensitive data, passes it through our anonymization filter, and finally returns a query result that doesn’t reveal anything sensitive about an individual.
The core of what allows us to do this is our anonymization algorithm. While most companies reach for naive masking of sensitive fields (such as first/last name, social security number, …), we fetch the raw data required by the query, evaluate the query ourselves, and then decide which results we can safely emit without leaking sensitive data. The details of the algorithm are published in this paper.
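To make the idea of "deciding which results can be safely emitted" a bit more concrete, here is a minimal sketch of one very simple technique: suppressing result buckets that are backed by too few distinct users. This is not the Aircloak algorithm (that one is described in the paper); the module name, the `{user_id, value}` row shape, and the threshold are all assumptions made up for illustration.

```elixir
defmodule LowCountFilter do
  # Illustration only, NOT the Aircloak algorithm.
  # Hypothetical default: a bucket must be backed by at least 5 distinct users.
  @min_users 5

  # `rows` is a list of {user_id, value} tuples (an assumed shape).
  # Returns a map of value => number of distinct users, keeping only
  # the values reported by at least `min_users` distinct users.
  def anonymized_counts(rows, min_users \\ @min_users) do
    rows
    |> Enum.group_by(
      fn {_user_id, value} -> value end,
      fn {user_id, _value} -> user_id end
    )
    |> Enum.map(fn {value, user_ids} ->
      {value, user_ids |> Enum.uniq() |> length()}
    end)
    |> Enum.filter(fn {_value, count} -> count >= min_users end)
    |> Map.new()
  end
end
```

With rows such as `[{1, "Berlin"}, {2, "Berlin"}, {3, "Berlin"}, {4, "Paris"}]` and a threshold of 3, only the `"Berlin"` bucket would be reported, because `"Paris"` is backed by a single user.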
The implementation of our system has quite a lot of interesting technical challenges. Our backend, implemented completely in Elixir, is installed on client premises. The system itself is split into two components, called Air and Cloak.
Cloak is the core component of our system. This component accepts analytical queries and returns anonymized results. In order to do its job, the Cloak component needs access to the underlying database. To minimize security risks, the Cloak component doesn’t act as a server (no network port needs to be open). Instead, the Cloak establishes a websocket connection to the Air component, which is the frontend system that analysts use to submit their queries. The Air component has no access to the database, and doesn’t even need to be in the same network. This mitigates the risk that a vulnerability in the Air (which is an externally accessible component) could lead to a leak of all sensitive data.
When an analyst submits a query, the Air component sends a message over the established connection, and the Cloak asynchronously sends the result back. This is the first technical curiosity of our system (and one of the first things we’ve implemented): we have two Elixir systems chatting over a websocket connection. For this purpose, we’ve implemented an Elixir client for the Phoenix channels protocol, which we’ve open sourced here.
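The request/response flow described above can be modelled with plain Elixir processes. The sketch below is only a stand-in: it uses message passing between two local processes instead of a websocket and Phoenix channels, and the module names, message shapes, and the `run_query` callback are hypothetical.

```elixir
defmodule CloakSketch do
  # Hypothetical stand-in for the Cloak side. `run_query` is a function
  # that takes the query text and returns a result.
  def start(run_query) do
    spawn(fn -> loop(run_query) end)
  end

  defp loop(run_query) do
    receive do
      {:run_query, from, query_id, sql} ->
        # Each query runs in its own process, so a slow query
        # doesn't block the ones that arrive after it.
        spawn(fn -> send(from, {:result, query_id, run_query.(sql)}) end)
        loop(run_query)
    end
  end
end

defmodule AirSketch do
  # Hypothetical stand-in for the Air side: fire the request, collect later.
  def submit(cloak, query_id, sql) do
    send(cloak, {:run_query, self(), query_id, sql})
  end

  # Waits for the result tagged with this particular query id.
  def await(query_id, timeout \\ 1_000) do
    receive do
      {:result, ^query_id, result} -> {:ok, result}
    after
      timeout -> {:error, :timeout}
    end
  end
end
```

Tagging every message with a query id is what lets results arrive in any order: the caller can submit several queries and then await each one by its id.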
The Cloak component itself poses a number of interesting technical challenges. First, the queries submitted by analysts are written in SQL. Using the Combine library, we’ve implemented an SQL parser, which converts a textual SQL query into an AST describing what the analyst wants to retrieve. We then issue a query (or a set of queries) to the database, to fetch the desired data.
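As a rough illustration of what "text in, AST out" means, here is a deliberately tiny sketch. It does not use Combine or parser combinators at all; it relies on a single regular expression and only understands `SELECT col[, col...] FROM table`. The module name and the shape of the AST map are assumptions for the sake of the example.

```elixir
defmodule TinySql do
  # Illustration only: a regex-based toy, not a real SQL parser.
  # Supported (hypothetical) grammar: SELECT col[, col...] FROM table
  def parse(sql) do
    pattern = ~r/^\s*select\s+(?<columns>.+?)\s+from\s+(?<table>\w+)\s*$/i

    case Regex.named_captures(pattern, sql) do
      %{"columns" => columns, "table" => table} ->
        {:ok,
         %{
           command: :select,
           columns: columns |> String.split(",") |> Enum.map(&String.trim/1),
           from: table
         }}

      nil ->
        {:error, :syntax_error}
    end
  end
end
```

A real parser additionally has to deal with expressions, aliases, subqueries, joins, and useful error messages, which is where a parser combinator library becomes much more practical than regular expressions.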
The Cloak can work with different databases, such as PostgreSQL, MySQL, SQL Server, MongoDB, SAP HANA, … We offload as much of the query as possible to the underlying database, but we also need to perform some post-processing in the Elixir part. This can happen if the underlying database doesn’t support the features we support in our SQL language. The most obvious example of this is MongoDB, where many of the SQL features (for example joins and subqueries) are not possible, so we need to emulate them in the Elixir layer. In addition, our anonymization algorithm requires that we process some functions (for example aggregation functions) in a special way, which again means that some parts of the query execution end up in Elixir.
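To give a feeling for what emulating a join in the Elixir layer involves, here is a minimal in-memory hash join over rows represented as maps. It is a sketch under assumptions, not our engine: the module name, the row representation, and the idea of having both sides fully loaded in memory are chosen purely for illustration.

```elixir
defmodule JoinEmulation do
  # Inner join of two lists of maps, matching left_row[left_key]
  # against right_row[right_key].
  def inner_join(left, right, left_key, right_key) do
    # Build a lookup table for the right side once, so that each left
    # row is matched by a map lookup instead of a scan of all right rows.
    index = Enum.group_by(right, &Map.fetch!(&1, right_key))

    for left_row <- left,
        right_row <- Map.get(index, Map.fetch!(left_row, left_key), []) do
      # On a column name clash, the value from the right row wins.
      Map.merge(left_row, right_row)
    end
  end
end
```

For example, joining `[%{user_id: 1, name: "Ann"}]` with `[%{uid: 1, city: "Berlin"}]` on `:user_id` and `:uid` yields a single merged row, while left rows without a match are dropped.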
Supporting different data sources and executing parts of the query in Elixir is quite challenging. Adding to that challenge is the fact that we deal with large data. We might need to query millions of rows in various ways, fetch the data, perform additional emulation in Elixir, and finally pass everything through our anonymization filter. This can incur a significant performance penalty, so tuning our query engine is something we frequently need to do. Our query execution engine is certainly the most complex part of our whole system, but that’s not surprising, given that it implements the core of what we do.
Compared to the Cloak, the Air component is a more conventional Phoenix application. It powers a typical web server which allows analysts to issue cloak queries. We have a simple UI which you can use to manage analysts and connected cloaks. The administrator can grant permissions to analysts, who can then issue their queries through the component UI. We also expose a simple REST API for submitting queries from third-party software.
As a particularly interesting technical aspect, the Air component pretends to be a PostgreSQL database. By understanding the server-side of the PostgreSQL protocol, the Air can accept connections from PostgreSQL aware tools, such as psql. This trick allows us to integrate with analytical tools, such as Tableau, which in turn improves the usability of our system. Our CTO, Sebastian, wrote more about this feature here.
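Speaking the server side of the PostgreSQL protocol starts with decoding what the client sends, and Elixir’s binary pattern matching is a good fit for that. The sketch below decodes a version 3.0 startup message: a 32-bit length (which counts itself), the protocol version, and then NUL-terminated name/value pairs followed by a final NUL byte. It is an illustration only, not the code used in the Air, and the module and function names are made up.

```elixir
defmodule PgStartup do
  # Protocol 3.0 is encoded as major version 3 and minor version 0,
  # each as a 16-bit integer. `size` includes its own four bytes and
  # the four version bytes, hence the `size - 8` check.
  def parse(<<size::32, 3::16, 0::16, rest::binary>>)
      when byte_size(rest) == size - 8 do
    params =
      rest
      |> :binary.split(<<0>>, [:global])
      |> Enum.chunk_every(2, 2, :discard)
      # An empty parameter name marks the terminating NUL byte.
      |> Enum.take_while(fn [name, _value] -> name != "" end)
      |> Map.new(fn [name, value] -> {name, value} end)

    {:ok, params}
  end

  # Anything else (for example an SSLRequest) is not handled in this sketch.
  def parse(_other), do: {:error, :unsupported_message}
end
```

A real server has to handle more than this. Clients such as psql may first send an SSLRequest message, and after the startup message the server still needs to drive authentication and answer queries in the wire format the client expects.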
Both components are hosted on client premises. To make the installation as easy as possible for our clients, we ship the components as Docker images. We’ve created a simple ad-hoc build pipeline, which builds each image on the build server, tags it, and pushes it to the private Docker repository. The clients are granted access to the repository, and they get a set of simple instructions. In its simplest form, the installation boils down to pulling the image, providing a configuration file, and starting it.
As you can see, there are many interesting technical aspects in our system. We worked hard, but also had a lot of fun building it, and we believe we’ve developed something unique and useful. In the process, we’ve relied extensively on many great libraries and frameworks from the BEAM ecosystem, such as Phoenix, Cowboy/Ranch, Ecto, Combine, and others. These libraries have helped us tremendously with our work, and we feel very grateful for that.