Ecto-less migrations?

ecto
mnesia
distributed-systems
migrations
riak

#1

Hello everyone!

As you might or might not know, Planga is a thing we’re building: It’s a seamless chat integration service.

Currently, I am busy working out the details of making Planga as scalable as it can be.
As discussed before in “Mnesia vs Cassandra (vs CouchDB vs ...) - your thoughts?”, we’re currently moving away from Mnesia to some other type of (decentralized!) datastore.

However, this brings two problems with it:

  1. It’s nearly impossible to use Ecto. On the one hand, a couple of Ecto wrappers exist for these datastores, but they are experimental at best. (We are already experiencing this with the ecto_mnesia wrapper… migrations are a pain, and setting up the database in the testing environment is a mystery I haven’t solved yet.) On the other hand, distributed datastores are quite far removed from what Ecto (2.0) is commonly used for: queries (some are supported, many are not because of the data format of these decentralized datastores) and migrations (if your database lives on multiple nodes, it is near-impossible to migrate ‘tables’ everywhere at the same time).

We will still be using Ecto for its validations, but we probably will have to manage our database insertions/lookups/updates manually.

  2. Migrations. Planga is supposed to scale. We try to plan ahead as well as we can, but I am certain that more fields will be added to our data structures in the future.

However, this poses a problem: Once you have millions of messages stored in a (distributed) datastore, data migrations are near-impossible to run.
Instead, I currently think it makes more sense to version the data structures and transform older versions into the new expected format once they have been fetched from the datastore.
That way, iterating over every data structure in the datastore becomes unnecessary, except for some advanced features where we need to add new indexes to the datastore or something similar.

But now my question: What would be the best way to set up something like this? It feels like something that belongs in the Schema file at least. I am currently thinking about adding multiple heads for a function, one for every ‘version’ of the data structure, which are then called incrementally (the version being an integer) until the ‘current’ version is reached.
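To make the idea a bit more concrete, here is a minimal sketch of the "one function head per version" approach. Everything in it is an assumption made up for illustration: the module name `MessageUpgrade`, the plain-map representation, the integer `:version` key and the fields `:text`, `:content` and `:edited_at` are hypothetical and not Planga's actual schema.

```elixir
defmodule MessageUpgrade do
  # Hypothetical example: a stored chat message is a plain map carrying an
  # integer :version. Field names are illustrative only.
  @current_version 3

  def current_version, do: @current_version

  # Data that is already at the current version is returned untouched.
  def upgrade(%{version: @current_version} = data), do: data

  # Older data is moved forward one version at a time until it is current.
  def upgrade(%{version: v} = data) when is_integer(v) and v < @current_version do
    data |> step() |> upgrade()
  end

  # v1 -> v2: a new optional field is added with a default value.
  defp step(%{version: 1} = data) do
    data
    |> Map.put(:edited_at, nil)
    |> Map.put(:version, 2)
  end

  # v2 -> v3: the :text field is renamed to :content.
  defp step(%{version: 2} = data) do
    {text, rest} = Map.pop(data, :text)

    rest
    |> Map.put(:content, text)
    |> Map.put(:version, 3)
  end
end
```

Calling `MessageUpgrade.upgrade/1` right after a fetch would mean the rest of the application only ever sees the current shape. Data with a version newer than `@current_version` matches no clause and raises a `FunctionClauseError`, which seems preferable to silently passing along a shape that the running code does not understand.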

Does anyone have experience with this kind of thing? I think it is somewhat similar to the code_change callback that is part of the GenServer behaviour, except that it lives in the data layer rather than the code layer.

Help is greatly appreciated!


#2

I think your idea is a good one. It reminds me a little bit of the Haskell library SafeCopy. They use Template Haskell (TH, i.e. macros) to generate typeclass implementations, and I suppose you could do something similar with protocols, but I think a new function head that just knows how to migrate to the next version would make more sense. I’d probably make separately named, versioned structs for each version so it is clear what kind of structure each migration clause is operating on. This may seem like a lot of overhead, but you could tuck older versions away in a separate file.
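Roughly what I have in mind, as a sketch only. The module names `Message.V1`, `Message.V2` and `Message.Migrate`, as well as all the fields, are invented for the example:

```elixir
# Hypothetical older shape of a message. This could live in a separate file.
defmodule Message.V1 do
  defstruct [:sender, :text]
end

# Hypothetical current shape: :text became :content, :edited_at was added.
defmodule Message.V2 do
  defstruct [:sender, :content, :edited_at]
end

defmodule Message.Migrate do
  alias Message.{V1, V2}

  # The latest version is returned as-is.
  def latest(%V2{} = msg), do: msg

  # Anything older takes one step forward and tries again.
  def latest(%V1{} = msg), do: msg |> next() |> latest()

  # Each clause only knows how to reach the next version, and the struct
  # name in the pattern documents exactly which shape it operates on.
  defp next(%V1{sender: sender, text: text}) do
    %V2{sender: sender, content: text, edited_at: nil}
  end
end
```

When a V3 comes along, you would add a `Message.V3` struct, a `next/1` clause for `%V2{}`, and move the "return as-is" clause of `latest/1` to `%V3{}`. The existing V1 clause stays untouched.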