I work almost entirely from tests, since I rarely mess with UIs and have the luxury of pure backend bliss. If I’m messing around in iex, it’s almost always as part of debugging a test.
An issue I run into often enough that it’s become a headache is creating migrations that fix existing data. Tests always work from a nice clean slate, but migrations that need to fix issues in deployed environments don’t have it so simple. You still need to test that your migration works, even if just on a one-off basis.
What I want to be able to do in most cases, to get the state in place, is execute a test and have the data remain, either in the test database or in another one I point to. Whatever works. Is there an easy way to say “don’t clean up!”, either from the terminal when I execute the test or with some snazzy @tag or something?
Using tests to set up state has other applications too. Anyone got a solution?
The test database is not explicitly cleaned up, AFAIK; the tests are just run in a sandbox, which wraps each one in a transaction and rolls it back at the end. As a result, each test sees the database in whatever state it happens to be in when the tests run (which is generally completely empty).
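For reference, this is roughly what a freshly generated Phoenix DataCase sets up (a minimal sketch; MyApp.Repo stands in for your actual repo):

```elixir
# test/test_helper.exs
ExUnit.start()
Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, :manual)

# test/support/data_case.ex (inside the case template)
# Each test checks out its own connection; everything it writes is rolled
# back when the owner process stops at the end of the test.
setup tags do
  pid = Ecto.Adapters.SQL.Sandbox.start_owner!(MyApp.Repo, shared: not tags[:async])
  on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
  :ok
end
```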
You could disable the sandbox, but then the state from different tests will overlap. Depending on how your application is structured this might be fine, but any issues that do crop up are likely to be quite nondeterministic if your tests are async (lots of race conditions, probably).
Personally, I don’t think this sounds like a great idea. The general mechanism for seeding the database is to create seed scripts and run them with mix run (new Phoenix apps come with a default priv/repo/seeds.exs). You could also write helper functions which bring the DB to a particular state and call them within each test, or seed the test database up front and run the tests against that; then you don’t have to worry about complex interdependencies between tests.
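Something along these lines, for example (a sketch; the Fixtures module, table, and column names are made up, so adjust them to your schema):

```elixir
# test/support/fixtures.ex (hypothetical module)
defmodule MyApp.Fixtures do
  alias MyApp.Repo

  # Put the DB into the "broken" shape the data fix is supposed to repair.
  # Assumes default naive_datetime timestamps on the accounts table.
  def seed_legacy_accounts do
    now = NaiveDateTime.utc_now() |> NaiveDateTime.truncate(:second)

    Repo.insert_all("accounts", [
      %{email: "old@example.com", status: "unverified", inserted_at: now, updated_at: now},
      %{email: "older@example.com", status: "unknown", inserted_at: now, updated_at: now}
    ])
  end
end
```

Then each test (or the seeds script) just calls MyApp.Fixtures.seed_legacy_accounts/0 in its setup block, and the state is reproducible instead of being left over from whichever test ran last.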
Also, for the record, I believe the recommended best practice is to keep data migrations out of Ecto migrations and put them in scripts instead. This is something you can decide on a case-by-case basis, but anything sufficiently complicated or slow (e.g. backfilling a new column based on existing data) is obviously better off in some sort of script or job.
Migrations should describe table structure changes, not deal with data. Not only is it extremely dangerous to run automated data migrations, but your migrations are also no longer atomic and cannot be reverted.
The way I’ve seen this done, in both small and very big projects, is by writing custom functions that you can later run in your environments via a remote console or some other manual trigger. This lets you test those functions beforehand and even do a dry run by rolling back the transaction the first time you run them.
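A sketch of the dry-run idea (module, table, and field names here are hypothetical; the point is that the transaction is rolled back unless you explicitly ask it to commit):

```elixir
defmodule MyApp.DataFixes do
  import Ecto.Query
  alias MyApp.Repo

  # Run with dry_run?: true (the default) to see what would change,
  # then with dry_run?: false to actually commit.
  def backfill_display_names(opts \\ []) do
    dry_run? = Keyword.get(opts, :dry_run?, true)

    Repo.transaction(fn ->
      {count, _} =
        from(u in "users",
          where: is_nil(u.display_name),
          update: [set: [display_name: u.email]]
        )
        |> Repo.update_all([])

      IO.puts("updated #{count} rows#{if dry_run?, do: " (dry run, rolling back)"}")

      if dry_run?, do: Repo.rollback(:dry_run), else: count
    end)
  end
end
```

Calling it from a remote console with dry_run?: false is then an explicit, deliberate step rather than something that fires automatically on deploy.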
This is not strictly true; I’ve seen it done successfully a good number of times, and I’ve done it myself on occasion. It’s doable, but it’s difficult and almost always a multi-step process (as in, multiple migrations spread out over a week or two).
And I’ve seen what you describe as well. I had customers where we did both: migrations and helper functions to repair / normalize already existing data. One clever trick was to invoke those functions on user data the first time it was referenced past a certain date – a hack, but a hugely successful one. Then we found out a month later that about 11% of the users still hadn’t signed in, so their data still hadn’t been migrated. I authored an Oban job that fetched a single non-normalized record, changed it, and then rescheduled itself to run again in 5 seconds. Within an hour all the data was migrated.
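Roughly what that Oban job looked like (reconstructed sketch; the queue, table, and field names are placeholders, and the real fix did more than flip a flag):

```elixir
defmodule MyApp.Workers.NormalizeUserData do
  use Oban.Worker, queue: :maintenance, max_attempts: 3

  import Ecto.Query
  alias MyApp.Repo

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # Fetch a single not-yet-normalized record, fix it, then reschedule ourselves.
    case Repo.one(from(u in "users", where: not u.normalized, limit: 1, select: u.id)) do
      nil ->
        # Nothing left to migrate; stop rescheduling.
        :ok

      id ->
        Repo.update_all(from(u in "users", where: u.id == ^id), set: [normalized: true])

        # Queue the next run in 5 seconds.
        %{}
        |> __MODULE__.new(schedule_in: 5)
        |> Oban.insert()

        :ok
    end
  end
end
```

Kicking it off once from a remote console was enough; it kept rescheduling itself until there was nothing left to fix.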
So I’ll agree it’s a bit frightening to resort to migrations for data changes, but it’s feasible some of the time.
One thing I did a few months ago was run my DB in a local Docker container and include an initialization script for it (the official PostgreSQL image supports this: if you put a shell script inside a special directory, it runs it before reporting that the database has started – I used this to run psql with seeding test data).