Pseudonymization by storing PII in a long-living process

I’m pondering a new business idea involving sensitive data. It will use an authentication service, and I would like to pseudonymize all PII. The authentication service will provide me with a person’s name and other PII that I will use for presentational purposes.

I got this idea, probably a stupid one, but I am still curious about your thoughts. I’m new to Elixir and OTP, and want to explore what you can do with it.

I read about someone who claimed you can avoid using databases in Erlang and use processes that hold their data instead. What if I saved all the PII from the authentication service in a process? I could map a person’s PII to a UUID, and the UUID could be used in conjunction with the rest of the data in my database.
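To make the idea concrete, here is a minimal sketch of such a process, assuming a single GenServer that keeps a map from generated UUIDs to PII. The module name `PiiVault` and its functions are hypothetical, and the UUID is built by hand from `:crypto.strong_rand_bytes/1` because Elixir’s standard library has no UUID generator. Everything is lost when the process dies, so this only illustrates the shape of the idea.

```elixir
defmodule PiiVault do
  # Hypothetical module: a sketch of the idea, not a recommended design.
  use GenServer

  ## Client API

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, opts)
  end

  # Stores the PII map and returns a freshly generated pseudonym (UUID v4 string).
  def put(vault, pii) when is_map(pii) do
    GenServer.call(vault, {:put, pii})
  end

  # Returns {:ok, pii}, or :error if the pseudonym is unknown.
  def fetch(vault, uuid) do
    GenServer.call(vault, {:fetch, uuid})
  end

  ## Server callbacks

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:put, pii}, _from, state) do
    uuid = uuid4()
    {:reply, uuid, Map.put(state, uuid, pii)}
  end

  def handle_call({:fetch, uuid}, _from, state) do
    {:reply, Map.fetch(state, uuid), state}
  end

  # Random (version 4) UUID built from 16 random bytes.
  defp uuid4 do
    <<a::48, _::4, b::12, _::2, c::62>> = :crypto.strong_rand_bytes(16)
    hex = Base.encode16(<<a::48, 4::4, b::12, 2::2, c::62>>, case: :lower)

    <<p1::binary-size(8), p2::binary-size(4), p3::binary-size(4),
      p4::binary-size(4), p5::binary-size(12)>> = hex

    Enum.join([p1, p2, p3, p4, p5], "-")
  end
end

# Usage: the rest of the database would only ever see `uuid`.
{:ok, vault} = PiiVault.start_link()
uuid = PiiVault.put(vault, %{name: "Ada Lovelace"})
{:ok, %{name: "Ada Lovelace"}} = PiiVault.fetch(vault, uuid)
```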

Is this realistic? Is anyone doing something similar? I guess it would involve hot-swapping and fault tolerance involving several processes. It sounds like a big (fun) challenge.

So basically you want to encrypt the user’s data with a key that they control? I think it doesn’t matter in this case whether you are storing the data in-memory or in an actual database.

When I worked for a company that was developing apps for banks, on mobile they were using Realm to encrypt the user’s cached data with their PIN. Even though I don’t think that technology is a good choice for your specific case, there should be others out there with the set of features you need.

I would personally never go with pure in-memory storage, as it is too dangerous in case of outages or synchronization conflicts, unless you can afford to lose that data.

It is possible. There are already libraries that let you do this across different nodes, though I don’t remember their names.

No, I want to associate personal data with a UUID, and the associated data should only live in memory. This way, personal data will not be part of the database.

(Btw, I’ve used Realm before and will never use it again.)

I agree; it sounds like a completely nutty solution to me as well. But I want to explore whether it sounds nutty for people who actually know OTP. If it’s possible to save data reliably, in memory, in redundant processes, it might open up novel ways to store sensitive data. Data that would only be possible to retrieve following the logic of the process.

Maybe I don’t understand, but what is the actual novelty in this? RAM, just like non-volatile storage, is just hardware for storing data. When it comes to data security, both carry more or less the same risks and are compromised if a bad actor gets remote or physical access to the machine. The same goes for hardening: you can encrypt both the RAM and the storage.

The solution you propose is inherently more dangerous than having a single machine in a safe location with redundant storage, as it involves networking due to the need to distribute data over multiple datacenters in different geographical regions. At least, this is what you want to do if you want any reliability.

You can do that, but I am not sure why you would. I assume you’re going to have to save that PII somewhere anyway. If you’re thinking about separating that data, a separate database will do just fine.

If you’re thinking about storing it only in memory, that’s asking for trouble. Power will go out. Cloud network will mess up. Cluster will decide to update. Hardware will fail. You’re going to get the PII erased entirely if you only keep it in RAM; this will simply happen no matter how hard you try.

It sounds like it’s a bad idea then :)

Look, this is just a thought experiment. I like doing them because they can teach a lot.

I am not a security expert at all. Is it easy to look at the data in a process, without using messages? If it is, my reasoning is off. Otherwise, one could imagine different ways to protect the data.
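For what it’s worth, a small sketch suggests the state is not hard to reach for anyone who can run code on the node. It assumes an `Agent` standing in for the PII-holding process, with made-up data. `:sys.get_state/1` is part of OTP and works on OTP-compliant processes such as GenServers and Agents; under the hood it still sends a system message, but one handled by OTP itself, so the process’s own message handlers never get a say.

```elixir
# An Agent standing in for any process that keeps PII in its state.
{:ok, pid} = Agent.start_link(fn -> %{"some-uuid" => %{name: "Ada Lovelace"}} end)

# The Agent's public API is not involved here. Anyone with a shell on the
# node (or a remote shell attached to it) can make this call.
state = :sys.get_state(pid)

IO.inspect(state, label: "leaked state")
```

So the "logic of the process" only guards the data against callers that play by the rules, not against someone with access to the running node.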

You are probably right.

The whole idea was to not have to save the data anywhere and hide it from myself. I don’t want to see it. As soon as it is in a database I can view it, and I don’t want to.

I can encrypt it, but using brute-force I could still potentially read it. If not now, maybe in the future.

But ok, bad, bad idea.

So basically, a process could only be trusted to hold cached, derived data?

This actually sounds like a good use case, like in a data-pipe. Is this a common way to do things in the OTP world?

No. It’s a tempting one, but I would call it an antipattern. Much can be said, but OTP primitives and processes in principle are there to model the runtime of your application. Not the data.

If you want to hold data in memory, in most cases you’re better off using something like ETS rather than processes.
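As a rough illustration, here is a minimal sketch of keeping derived data in an ETS table. The table name `:derived_cache` and the keys are made up for the example; the `:ets` calls themselves are part of OTP.

```elixir
# Create an in-memory table. It is owned by the calling process and
# disappears when that process exits.
table = :ets.new(:derived_cache, [:set, :public, read_concurrency: true])

# Store a derived value under a key.
:ets.insert(table, {"report:42", %{total: 1234}})

# Lookups return a list of matching tuples (empty on a miss).
case :ets.lookup(table, "report:42") do
  [{_key, value}] -> IO.inspect(value, label: "hit")
  [] -> IO.puts("miss")
end

# Invalidate the entry.
:ets.delete(table, "report:42")
```

Because reads go straight to the table, many processes can look values up concurrently without queueing behind a single process’s mailbox.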

If you do want to hide PII from yourself, I’d do it in two steps:

  1. Verify that you really want to do it, and that it indeed brings you some legal benefit rather than risk. NOT knowing your customers can be a liability, and usually is, even under regimes like GDPR.

  2. Encrypt the data at the database/Ecto layer. It gets encrypted whenever you write it and decrypted when you fetch it. I know people do that on a per-customer basis, with encryption keys unique to each customer, and this is doable. I don’t do it like that, but in some projects I use Google KMS + an Ecto type to encrypt/decrypt data at rest.
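To show roughly what the encrypt-on-write, decrypt-on-read step could look like, here is a minimal sketch using AES-256-GCM from OTP’s `:crypto` module. This is not the API of Ecto, Cloak, or Google KMS: the `FieldCrypto` module, its function names, and the `"pii-v1"` associated data are assumptions for illustration. A custom Ecto type would call something like this from its dump and load callbacks, and the key would come from a KMS or secret store rather than being generated inline.

```elixir
defmodule FieldCrypto do
  # Hypothetical helper, not part of any library.
  # Associated data is authenticated but not encrypted (illustrative value).
  @aad "pii-v1"

  # Returns iv <> tag <> ciphertext as one binary, suitable for a binary column.
  def encrypt(plaintext, key) when is_binary(plaintext) and byte_size(key) == 32 do
    # A fresh random 96-bit IV for every encryption.
    iv = :crypto.strong_rand_bytes(12)

    {ciphertext, tag} =
      :crypto.crypto_one_time_aead(:aes_256_gcm, key, iv, plaintext, @aad, true)

    iv <> tag <> ciphertext
  end

  # Returns {:ok, plaintext}, or :error if the key is wrong or the data was altered.
  def decrypt(<<iv::binary-size(12), tag::binary-size(16), ciphertext::binary>>, key)
      when byte_size(key) == 32 do
    case :crypto.crypto_one_time_aead(:aes_256_gcm, key, iv, ciphertext, @aad, tag, false) do
      :error -> :error
      plaintext -> {:ok, plaintext}
    end
  end
end

# Usage with a throwaway key. A per-customer scheme would pick the key per row.
key = :crypto.strong_rand_bytes(32)
blob = FieldCrypto.encrypt("Ada Lovelace", key)
{:ok, "Ada Lovelace"} = FieldCrypto.decrypt(blob, key)
```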

The question is what kind of attacks you are expecting. If your attacker has full control over the machine (or VM), there is nothing that can stop them from dumping RAM contents and eventually decoding them, even if the data is encrypted at the application level.

In the IoT world this is one of the first problems you have to deal with. The reason is that bad actors can get hold of your hardware, and you want to make sure that they cannot dump the firmware, or at least the most critical part of the data on the storage: the keys used to authenticate and fetch firmware updates.

If this is interesting to you, you can read a great article by the Nerves team on how to secure storage that doesn’t feature security settings by default by adding additional hardware: NervesKey for Raspberry Pi | NervesHub

Industrial servers provide such capabilities without needing additional hardware. So if you plan on keeping the data in a secure environment, you will need your own hardware; cloud providers, and especially their machines, will never give you any security guarantees, no matter what you do at the application level.

I don’t know a lot about ETS, but from a quick read, it sounds like a tool for storing process state.

Is it common to use ETS to store cached or derived data?

I don’t know exactly what kind of attacks I’m expecting.

I’m in an early stage. I need to level up my security thinking. For now, I’m focusing on ways to keep PII from the rest of the data, which I learned recently is called pseudonymization.

Solid advice.

I will research legal stuff in parallel, and will eventually hire lawyers as well.

I’ve started to look at encrypting data in Ecto as well. I’m using Ash, and there is support for Cloak that I will investigate further.

Right now I’m thinking of storing the data from the authentication service in a table in a separate database and referring to it by the id from the other database. The databases would then be stored separately from each other. I will also encrypt all fields that could be considered sensitive in any way.

I plan to use Fly.io, which seems to encrypt data at rest.

Yes, that’s a very common use case. Cachex uses it, for example.

Nice, I’ll take a closer look at it.

I would say that database encryption is only useful if you plan on storing the database in a place you know is not secure, for example a managed database service, while keeping the keys in a secure place. If you store your decryption keys in the same place as the database, this is obviously not doing anything.

There is also the question of who needs to access the data. If the server doesn’t need to see the contents of that encrypted data, then the security model looks quite different. Even for a web app, you can have your server send encrypted data to the client and then have it decrypted on the client’s device using their personal decryption keys.

I’m not sure how to think of this.

Let’s say I have a machine/cluster for the application and one for the database. I store secrets (like encryption keys) in environment variables.
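As a small sketch of the environment-variable part, assuming the key is stored Base64-encoded under a made-up variable name, `PII_ENCRYPTION_KEY`: the `Secrets` module is hypothetical, while `System.fetch_env!/1` and `Base.decode64!/1` are standard library functions that raise on missing or malformed input, so a misconfigured deployment fails at startup rather than at first use.

```elixir
defmodule Secrets do
  # Hypothetical helper. The default variable name is a made-up example.
  # Reads a Base64-encoded 256-bit key from the environment.
  def fetch_key!(var \\ "PII_ENCRYPTION_KEY") do
    # fetch_env! raises if the variable is unset; decode64! raises on bad Base64.
    key = var |> System.fetch_env!() |> Base.decode64!()

    if byte_size(key) != 32 do
      raise ArgumentError, "#{var} must decode to exactly 32 bytes"
    end

    key
  end
end

# Usage: set the variable here only so the example is self-contained.
# In a deployment it would be injected by the platform's secret mechanism.
System.put_env("PII_ENCRYPTION_KEY", Base.encode64(:crypto.strong_rand_bytes(32)))
key = Secrets.fetch_key!()
IO.puts("loaded a #{byte_size(key)}-byte key")
```

Note that the key only lives on the application machine in this setup; the database machine never needs it.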

What is the likelihood someone gets access to only one of the parts? If I take care to store things in separate databases, on different machines, what is the likelihood that someone gets access to all databases if I use one cloud provider?

Should I use several cloud providers?

And what is the legal view of it?

There are so many questions …

This can vary a lot depending on the requirements. For example in Moldova, banks have the following requirements:

  1. All client cloud data must be stored on hardware owned and hosted by the bank. There are additional rules when it comes to data encryption, personnel access, storage replacement, etc.;
  2. All client local data (for example, the mobile banking cache) must be encrypted on the device. Phones now come with secure storage capabilities, and those can be used to “safely” store encryption keys. That is what we were using Realm for.

The legal view is the first thing you must consult, as it dictates the rules on what you can and can’t do.

If, for example, you are storing users’ medical records, I doubt any storage services are allowed (even from local companies). Usually the first hard requirement is that the data is stored on owned hardware in a secured, certified datacenter.

I guess I have to consult legal expertise, but I am pretty sure I am allowed to store medical data in the cloud in Sweden.