Should I stop my application to run a DB migration in prod?

Hello!

Should I stop my application to run a DB migration in prod?

What is the good practice?

Thanks!

It depends on many factors. I think you should Google “migrations and downtime” or something similar.

For example, let’s say you want to rename a database column. If you deploy the code first, things will crash because the new column doesn’t exist yet. If you migrate first, things will still crash because the currently running code will be trying to use the old column.

From personal experience, we’ve mostly opted to let our application crash for 1-2 minutes while we’re deploying/migrating. But if that’s not acceptable, then you gotta do things like deploy defensive code first (that can handle both situations), then migrate, then deploy again to remove the defensive code.
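
To make the "defensive code" step a bit more concrete, here is a minimal sketch in Python with the standard-library `sqlite3` module. It is only an illustration, not Ecto code: the `users` table, the `name` and `full_name` columns, and the helper names are all made up for the example, and the rename is done as "add the new column and copy the data" so the old column is still there for the old code.

```python
import sqlite3

def column_names(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def read_display_name(conn, user_id):
    """Defensive read: works both before and after `full_name` exists."""
    cols = column_names(conn, "users")
    column = "full_name" if "full_name" in cols else "name"
    row = conn.execute(
        f"SELECT {column} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")

# Step 1: the defensive code is already deployed against the old schema.
before = read_display_name(conn, 1)

# Step 2: the migration (hypothetical): add the new column and copy the data.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
conn.execute("UPDATE users SET full_name = name")

# Step 3: the same code keeps working against the new schema.
after = read_display_name(conn, 1)
```

Once every node runs code that only uses `full_name`, a later deploy can remove the fallback and a later migration can drop the old column. Checking the schema on every read is just the simplest way to show the idea here; a real app would probably decide once at startup or behind a feature flag.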

For example, let’s say you want to rename a database column. Well, if you deploy code first, then things will crash because that column doesn’t exist.

Yeah, or you could migrate the database when the application supervisor starts, before Ecto starts.

From personal experience, we’ve mostly opted to let our application crash for 1-2 minutes while we’re deploying/migrating.

That is a very long time, at least if we are talking about continuous deployment for applications where availability matters.

In general there is no silver bullet. There is one important thing to consider, though: if the migration locks a resource or causes problems at runtime while it runs, you should avoid automating it. It is better to run a big migration step by step manually than to have a long downtime on the system.
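
One common way to do a big migration "step by step" is to backfill in small batches, so each transaction is short and nothing stays locked for long. Below is a rough sketch of that idea in Python with `sqlite3`. The table, the columns, and the `backfill_in_batches` helper are assumptions made up for the example, and real databases differ a lot in how they lock.

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Copy name -> full_name one batch at a time; returns rows updated."""
    total = 0
    while True:
        with conn:  # commits (or rolls back) one short transaction per batch
            cur = conn.execute(
                "UPDATE users SET full_name = name "
                "WHERE id IN ("
                "  SELECT id FROM users "
                "  WHERE full_name IS NULL AND name IS NOT NULL "
                "  LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user{i}",) for i in range(2500)],
)
conn.commit()

updated = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
```

The `name IS NOT NULL` filter is there so the loop always terminates: rows with nothing to copy are never selected again. Running this by hand also lets you pause between batches if the system starts to struggle.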

Given how the question was asked, I would recommend reviewing 12factor.net and weighing each point against your own experience.

Otherwise make the decision based on the trade-offs.

Didn’t consider that. Thanks for the recommendation

Yes, and in our case, it didn’t matter. In the beginning, our app was pretty much only used during business hours, so we opted to deploy migrations at night and didn’t care that the app crashed for a while during the deploy. Like we didn’t even care to put a maintenance page up… :joy: We almost always opted to forgo the engineering effort to do true zero downtime migrations.

Now our app is used 24/6.5 and we’re faced with terabyte-scale migrations… and we still mostly opt to take the site offline to do so, rather than deal with the headache of dual-write, multi-deploy style migrations.

Totally depends on use case. Our clients don’t mind scheduled outages on weekends. Lucky us! :slight_smile:

This is a good approach: never complicate things without a good reason, and focus on what is important.

I really liked @dbern’s Safe Ecto Migrations article on fly.io. Lots of smart advice about how things can be done to minimise impact.

“Good practice” generally means running migrations in a way that doesn’t affect your end users. In some applications that means going into “maintenance mode” (turning the app off) in the middle of the night, when no one is using it, and running the migrations then. Or it means using database techniques that avoid any downtime at all, which is what I see more often.

The Safe Ecto Migrations article is a resource that collects no-downtime migration techniques (thanks com for the reference!). If you’d rather watch a video, there’s some of it here: ElixirConf 2022 - David Bernheisel - Ecto in Production: Migration Edition - YouTube

If you’re migrating a table with 10 rows in it, there’s virtually no downtime regardless of whether you use “good” or “bad” techniques. If you’re migrating a table with 10 million rows, you’ll need to consider no-downtime techniques more seriously.

Maybe more important than any of that ^ is having a boot-up process that runs the migrations, checks that they succeeded, only then boots the rest of the app, and lastly shifts traffic to the new nodes. (See part 2 of the Safe Ecto Migrations guide.)
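
As a rough illustration of that boot order (migrate, check, then start), here is a small Python sketch using `sqlite3`. It is not how Ecto does it: the `schema_migrations` bookkeeping table, the `MIGRATIONS` list, and the `run_migrations` / `boot` helpers are all hypothetical names for the example, and shifting traffic is left to whatever load balancer or orchestrator you use.

```python
import sqlite3

# Hypothetical migrations: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN full_name TEXT"),
]

def run_migrations(conn, migrations):
    """Apply pending migrations in order; return True only if all succeeded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for version, sql in migrations:
        if version in applied:
            continue  # already applied on an earlier boot
        try:
            with conn:
                conn.execute(sql)
                # Record the version once the statement has succeeded.
                conn.execute(
                    "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
                )
        except sqlite3.Error:
            return False  # stop at the first failure
    return True

def boot(conn, migrations, start_app):
    """Start the app only when the schema is known to be up to date."""
    if not run_migrations(conn, migrations):
        return "aborted"  # the old nodes keep serving traffic
    start_app()
    return "serving"

started = []
conn = sqlite3.connect(":memory:")
status = boot(conn, MIGRATIONS, lambda: started.append("app"))
```

The point of returning early is that a failed migration never leads to a half-started node: the new release simply doesn't come up, and traffic stays where it was.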

:point_up_2: Yes!! And I feel like “Rails Migrations circa 2010” kind of assumed the former.

And it sucks giving an “it depends” kind of answer… but… :man_shrugging: