Create a database seed for PR preview environments 🌱

I want to set up preview environments (ad-hoc server + database for pending pull requests).
The main issue we will have to tackle is database seeding: I want it to be minimal to prevent huge bills :money_with_wings: and lengthy deployments :zzz:

I guess the best option would be a script that creates a purged, anonymized seed from our production database. Is there any tooling to make this easier (dealing with foreign keys, …)?

Thanks for sharing your thoughts :pray:


My Google results so far:

I will try these tools and let you know how it goes


I personally prefer avoiding production-DB imitation and creating data from scratch, but I have never executed on this ideal.

Prod data keeps growing, which will make preview DB creation slower over time, unless you trim the data as well as anonymizing it. Maintaining a data set that is similar to production will keep you sharp about production data usage patterns.

On the flip side, prod imitation gives you better testing ground for schema and data migrations.


That’s why I’m looking for data subsetters that would be able to trim production data.

I’m already getting some interesting results with condenser, which was able to create a 6 MB dump file from my 80 GB database.


Here’s another approach you may want to consider: writing fake-data functions (which I also use in tests anyway) that are simple wrappers around functions in contexts, e.g.:

  def fake_user!(opts \\ %{}) do
    custom_username = opts[:username]

    attrs =
      opts
      |> Map.put_new_lazy(:email, &Faker.Internet.email/0)
      |> Map.put_new_lazy(:username, &Faker.Internet.user_name/0)

    with {:ok, user} <- Users.create(attrs) do
      user
    else
      # If creation failed (e.g. the username was already taken), fall back
      # to the existing user; this only applies when a username was given.
      {:error, %Ecto.Changeset{}} when is_binary(custom_username) ->
        Users.by_username!(custom_username)
    end
  end

And then I use the phil_columns package (on Hex) to write data migrations calling those functions.
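For illustration, here is how a plain `priv/repo/seeds.exs` (rather than a phil_columns migration, whose DSL I won't reproduce here) might call such helpers — `MyApp.Fake` is a hypothetical module holding the `fake_user!/1` function above:

```elixir
# priv/repo/seeds.exs — run with: mix run priv/repo/seeds.exs
# MyApp.Fake is an assumed home for fake_user!/1 and similar helpers.
alias MyApp.Fake

# One user with a fixed, findable username; reseeding is idempotent
# thanks to the fallback lookup inside fake_user!/1.
Fake.fake_user!(%{username: "preview-admin"})

# A handful of fully randomized users for populating lists and screens.
for _ <- 1..20, do: Fake.fake_user!(%{})
```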


It would be interesting if you could keep a randomized subset.


:joy:

PhilColumns


There are a few reasons why I like scripting explicit test seed data for your use case (and similar ones):

  1. It’s minimal, so it’s both financially cheap to host (e.g. on a minimal/small-sized DB server/instance) and quick to seed.
  2. The seed data can be as friendly as you want. You can give your test objects meaningful, distinguishable names. You can create example objects for specific scenarios (e.g. common issues), for specific bugs or features, or for testing ‘plumbing’ changes.
  3. Because you’re relying on this seed data, there are obvious points during development and maintenance where it makes sense to add, update, or even delete seed data, and that is much easier in scripts you wrote yourself than in a ‘DB backup’ script (IME).

I’ve often run into the problem of not having good test/seed/example data when fixing bugs, adding features, etc. If that data wasn’t already in some purged/anonymized seed data – and findable (or knowable) in it – then I’d either have to do without, manually mangle some data to fit my needs (and then lose all of my changes when/if I reseeded my local dev/test DBs), or write seed scripts anyway.

I’ve found it to be very helpful to know – off the top of my head – exactly where/how to find good example data in my test/dev/preview DBs. I know the user info and the key example objects and they have ‘big dumb’ names like “Happy Patterson” (for a ‘happy path’ customer). (For local DBs, all the user passwords are “password”.)
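As a sketch of what that kind of ‘big dumb’, findable seed data can look like (a hypothetical `priv/repo/seeds.exs`; the `MyApp.Users` context and the attribute names are assumptions):

```elixir
# priv/repo/seeds.exs (hypothetical) — a few recognizable accounts.
alias MyApp.Users

# The 'happy path' customer: easy to spot in any user list.
{:ok, _happy} =
  Users.create(%{
    name: "Happy Patterson",
    email: "happy.patterson@example.com",
    password: "password"
  })

# An account set up to reproduce a specific known bug or scenario.
{:ok, _edge_case} =
  Users.create(%{
    name: "Empty Cart Ernie",
    email: "empty.cart@example.com",
    password: "password"
  })
```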

Another very helpful aspect of scripting seed data is that you can leverage your existing app/site/system code. Not only does that exercise the code itself, but changes to that code are automatically ‘incorporated’ into the seed data whenever you reseed a DB. If you change your ‘data schemas’, you don’t also need to ‘manually’ modify your seed data scripts to match (or risk forgetting to).

Scripts also work nicely if your app/site/system needs to interface with test/demo environments of any third-party services. I’m not sure how you’d handle that with a ‘DB backup’ data script – probably another script.
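For instance, a preview environment could be pointed at a third-party sandbox via runtime config — this is a hypothetical `config/runtime.exs` fragment, and the env-var names, app name, and endpoint are all made up:

```elixir
# config/runtime.exs — sketch: send preview environments to a
# third-party sandbox instead of the production service.
import Config

if System.get_env("PREVIEW_ENV") == "true" do
  config :my_app, :payment_api,
    base_url: "https://sandbox.payments.example.com",
    api_key: System.fetch_env!("PAYMENT_SANDBOX_API_KEY")
end
```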

Ironically, I started writing seed scripts because I didn’t think I had enough time to work on purging and anonymizing a copy of the production DB.


Thank you @kenny-evitt for your thorough answer! :pray:

I agree, and I think that we will eventually evolve toward our own custom seeding code, based on our existing application code.

But we have quite a large codebase/domain model, and writing those seeds from scratch will be laborious! So for now I think we will stick with an automated subset of our production database until we write our own seeding code.

I will probably publish a blog post on the topic. I found out that deploying pull requests on ad-hoc environments was way more complicated than what PaaS vendors are telling us :wink:


You’re welcome! :slight_smile:

Funnily enough, I really only got started writing custom seed code for my own (big) Elixir project after I couldn’t restore a backup of my coworker’s own ‘subset of prod’!

I found that only the first ‘big slice’ of custom seed data was particularly time-consuming (and even then only took a few hours over 1-2 days maybe). While adding that first slice, I added helpers (and refactored a few things), and since then adding new seed data has been much easier.

I would definitely suggest just starting with enough seed data for your current projects (i.e. issues/tickets/stories/whatever) – not writing ‘all the seeds you (think you’ll) need’ all at once.

> I will probably publish a blog post on the topic. I found out that deploying pull requests on ad-hoc environments was way more complicated than what PaaS vendors are telling us :wink:

GitLab’s version of this uses Docker and K8s and, so I’m guessing, assumes you can deploy an arbitrary environment already.