Create a database seed for PR preview environments 🌱

I want to set up preview environments (ad-hoc server + database for pending pull requests).
The main issue we will have to tackle is database seeding: I want it to be minimal to prevent huge bills :money_with_wings: and lengthy deployments :zzz:

I guess the best option would be a script that creates a purged, anonymized seed from our production database. Is there any tooling to make this easier (dealing with foreign keys, …)?

Thanks for sharing your thoughts :pray:


My Google results so far:

I will try these tools and let you know how it goes


I personally prefer avoiding production-DB imitation and creating data from scratch, but I have never executed on this ideal.

Prod data keeps growing, which will make preview DB creation slower over time, unless you trim the data as well as anonymizing it. Maintaining a data set that is similar to production will keep you sharp about production data usage patterns.

On the flip side, prod imitation gives you better testing ground for schema and data migrations.


That’s why I’m looking for data subsetters that would be able to trim production data.

I’m already getting some interesting results with condenser, which was able to create a 6 MB dump file from my 80 GB database.


Here’s another approach you may want to consider: writing fake-data functions (which I also use in tests anyway) that are simple wrappers around functions in contexts, e.g.:

  def fake_user!(opts \\ %{}) do
    custom_username = opts[:username]

    attrs =
      opts
      |> Map.put_new_lazy(:email, &Faker.Internet.email/0)
      |> Map.put_new_lazy(:username, &Faker.Internet.user_name/0)

    with {:ok, user} <- Users.create(attrs) do
      user
    else
      # If creation failed (e.g. the username was already taken), fall back
      # to the existing user; this only applies when a username was given.
      {:error, %Ecto.Changeset{}} when is_binary(custom_username) ->
        Users.by_username!(custom_username)
    end
  end

And then I use the phil_columns package (on Hex) to write data migrations calling those functions.
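For illustration, here is how a plain `priv/repo/seeds.exs` (rather than a phil_columns migration, whose DSL I won't reproduce here) might call such helpers — `MyApp.Fake` is a hypothetical module holding the `fake_user!/1` function above:

```elixir
# priv/repo/seeds.exs — run with: mix run priv/repo/seeds.exs
# MyApp.Fake is an assumed home for fake_user!/1 and similar helpers.
alias MyApp.Fake

# One user with a fixed, findable username; reseeding is idempotent
# thanks to the fallback lookup inside fake_user!/1.
Fake.fake_user!(%{username: "preview-admin"})

# A handful of fully randomized users for populating lists and screens.
for _ <- 1..20, do: Fake.fake_user!(%{})
```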


It would be interesting if you could keep a randomized subset.


:joy:

PhilColumns


There are a few reasons why I like scripting explicit test seed data for your use case (and similar ones):

  1. It’s minimal, so it’s both financially cheap to host (e.g. on a minimal/small-sized DB server/instance) and quick to seed.
  2. The seed data can be as friendly as you want. You can give your test objects meaningful, distinguishable names. You can create example objects for specific scenarios (e.g. common issues), for specific bugs or features, or for testing ‘plumbing’ changes.
  3. Because you’re relying on this seed data, there are obvious points during development and maintenance where it makes sense to add, update, or even delete seed data, and that is much easier in scripts you wrote yourself than in a ‘DB backup’ script (IME).

I’ve often run into the problem of not having good test/seed/example data when fixing bugs, adding features, etc. If that data wasn’t already in some purged/anonymized seed data – and findable (or knowable) in it – then I’d either have to do without, manually mangle some data to fit my needs (and then lose all of my changes when/if I reseeded my local dev/test DBs), or write seed scripts anyway.

I’ve found it to be very helpful to know – off the top of my head – exactly where/how to find good example data in my test/dev/preview DBs. I know the user info and the key example objects and they have ‘big dumb’ names like “Happy Patterson” (for a ‘happy path’ customer). (For local DBs, all the user passwords are “password”.)
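As a sketch of what that kind of ‘big dumb’, findable seed data can look like (a hypothetical `priv/repo/seeds.exs`; the `MyApp.Users` context and the attribute names are assumptions):

```elixir
# priv/repo/seeds.exs (hypothetical) — a few recognizable accounts.
alias MyApp.Users

# The 'happy path' customer: easy to spot in any user list.
{:ok, _happy} =
  Users.create(%{
    name: "Happy Patterson",
    email: "happy.patterson@example.com",
    password: "password"
  })

# An account set up to reproduce a specific known bug or scenario.
{:ok, _edge_case} =
  Users.create(%{
    name: "Empty Cart Ernie",
    email: "empty.cart@example.com",
    password: "password"
  })
```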

Another very helpful aspect of scripting seed data is that you can leverage your existing app/site/system code. Not only does that exercise the code itself, but changes to that code are automatically ‘incorporated’ into the seed data whenever you reseed a DB. If you change your ‘data schemas’, you don’t also need to ‘manually’ modify your seed data scripts to match (or risk forgetting to).

Scripts also work nicely if your app/site/system needs to interface with test/demo environments of any third-party services. I’m not sure how you’d handle that with a ‘DB backup’ data script – probably another script.
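For instance, a preview environment could be pointed at a third-party sandbox via runtime config — this is a hypothetical `config/runtime.exs` fragment, and the env-var names, app name, and endpoint are all made up:

```elixir
# config/runtime.exs — sketch: send preview environments to a
# third-party sandbox instead of the production service.
import Config

if System.get_env("PREVIEW_ENV") == "true" do
  config :my_app, :payment_api,
    base_url: "https://sandbox.payments.example.com",
    api_key: System.fetch_env!("PAYMENT_SANDBOX_API_KEY")
end
```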

Ironically, I started writing seed scripts because I didn’t think I had enough time to work on purging and anonymizing a copy of the production DB.


Thank you @kenny-evitt for your thorough answer! :pray:

I agree, and I think that we will eventually evolve toward our own custom seeding code, based on our existing application code.

But we have quite a large codebase/domain model, and writing those seeds from scratch will be laborious! So for now I think we will stick with an automated subset of our production database until we write our own seeding code.

I will probably publish a blog post on the topic. I found out that deploying pull requests on ad-hoc environments was way more complicated than what PaaS vendors are telling us :wink:


You’re welcome! :slight_smile:

Funnily enough, I really only got started writing custom seed code for my own (big) Elixir project after I couldn’t restore a backup of my coworker’s own ‘subset of prod’!

I found that only the first ‘big slice’ of custom seed data was particularly time-consuming (and even then only took a few hours over 1-2 days maybe). While adding that first slice, I added helpers (and refactored a few things), and since then adding new seed data has been much easier.

I would definitely suggest just starting with enough seed data for your current projects (i.e. issues/tickets/stories/whatever) – not writing ‘all the seeds you (think you’ll) need’ all at once.

> I will probably publish a blog post on the topic. I found out that deploying pull requests on ad-hoc environments was way more complicated than what PaaS vendors are telling us :wink:

GitLab’s version of this uses Docker and K8s and, so I’m guessing, assumes you can deploy an arbitrary environment already.