Create a database seed for PR preview environments 🌱

I want to set up preview environments (ad-hoc server + database for pending pull requests).
The main issue we will have to tackle is database seeding: I want it to be minimal to prevent huge bills :money_with_wings: and lengthy deployments :zzz:

I guess the best option would be a script to create a purged, anonymized seed from our production database. Is there any tooling to make this easier (dealing with foreign keys, …)?

Thanks for sharing your thoughts :pray:

5 Likes

My Google results so far:

I will try these tools and let you know how it goes.

3 Likes

I personally like the idea of avoiding production-DB imitation and creating data from scratch, but I have never executed on this ideal.

Prod data keeps growing, which will make preview DB creation slower over time unless you trim some data as well as anonymize it. Maintaining a data set that is similar to production will keep you sharp about production data usage patterns.

On the flip side, prod imitation gives you better testing ground for schema and data migrations.

2 Likes

That’s why I’m looking for data subsetters that would be able to trim production data.

I’m already getting some interesting results with condenser, which was able to create a 6 MB dump file from my 80 GB database.
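As a lower-tech complement (assuming Postgres), plain pg_dump can also shrink a dump a lot by keeping the full schema while skipping row data for the biggest tables (the table names below are hypothetical):

```shell
# Keep the full schema, but drop row data for big append-only tables.
# The table name patterns are made-up examples.
pg_dump "$DATABASE_URL" \
  --format=custom \
  --exclude-table-data='events_*' \
  --exclude-table-data='audit_log' \
  --file=preview_seed.dump

# Restore into the preview environment's database:
pg_restore --no-owner --dbname="$PREVIEW_DATABASE_URL" preview_seed.dump
```

This doesn’t anonymize anything or subset rows within a table, so it only covers part of the problem.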

3 Likes

Here’s another approach you may want to consider: writing fake-data functions (which I also use in tests anyway) that are simple wrappers around functions in contexts, e.g.:

  def fake_user!(opts) do
    custom_username = opts[:username]

    attrs =
      opts
      |> Map.put_new_lazy(:email,    &Faker.Internet.email/0)
      |> Map.put_new_lazy(:username, &Faker.Internet.user_name/0)

    with {:ok, user} <- Users.create(attrs) do
      user
    else
      # Creation can fail when an explicitly-requested username is already
      # taken; in that case, return the existing user instead of raising.
      {:error, %Ecto.Changeset{}} when is_binary(custom_username) ->
        Users.by_username!(custom_username)
    end
  end
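To illustrate (my reading of the helper above, untested): the fallback clause makes calls with an explicit username effectively idempotent:

```elixir
# Assuming the fake_user!/1 helper above: the first call creates the user,
# and the second call hits the {:error, changeset} fallback (username
# already taken) and returns the existing user instead of raising.
alice = fake_user!(%{username: "alice"})
alice_again = fake_user!(%{username: "alice"})
# alice_again should be the same record as alice
```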

And then I use phil_columns (on Hex) to write data migrations calling those functions.

4 Likes

It would be interesting if you could keep a randomized subset.

2 Likes

:joy:

PhilColumns

4 Likes

There are a few reasons why I like scripting explicit test seed data for your use case (and similar ones):

  1. It’s minimal, and therefore both financially cheap to host (e.g. on a minimal/small DB server/instance) and quick to seed.
  2. The seed data can be as friendly as you want. You can give your test objects meaningful, distinguishable names. You can create example objects for specific scenarios (e.g. common issues), for specific bugs or features, or for testing ‘plumbing’ changes.
  3. Because you’re relying on this seed data, there are obvious points during development and maintenance where it makes sense to add, update, or even delete seed data, and this is much easier in these kinds of scripts (that you wrote yourself) than in a ‘DB backup’ script (IME).

I’ve often run into the problem of not having good test/seed/example data when fixing bugs, adding features, etc. If that data wasn’t already in some purged/anonymized seed data – and findable (or knowable) in it – then I’d either have to do without, manually mangle some data to fit my needs (and then lose all of my changes when/if I reseeded my local dev/test DBs), or write seed scripts anyway.

I’ve found it to be very helpful to know – off the top of my head – exactly where/how to find good example data in my test/dev/preview DBs. I know the user info and the key example objects, and they have ‘big dumb’ names like “Happy Patterson” (for a ‘happy path’ customer). (For local DBs, all the user passwords are “password”.)

Another very helpful aspect of scripting seed data is that you can leverage your existing app/site/system code. Not only does that exercise that code, but changes to it are automatically ‘incorporated’ into the seed data (when you reseed a DB). If you change your ‘data schemas’, you don’t also need to manually modify your seed data scripts to match (or risk forgetting to).

Scripts also work nicely if your app/site/system needs to interface with test/demo environments of any third-party services. I’m not sure how you’d handle that with a ‘DB backup’ data script – probably with another script.

Ironically, I started writing seed scripts because I didn’t think I had enough time to work on purging and anonymizing a copy of the production DB.

5 Likes

Thank you @kenny-evitt for your thorough answer! :pray:

I agree, and I think that we will eventually evolve toward our own custom seeding code, based on our existing application code.

But we have quite a large codebase/domain model, and writing those seeds from scratch will be laborious! So I think for now we will stick with an automated subset of our production database until we write our own seeding code.

I will probably publish a blog post on the topic. I found out that deploying pull requests on ad-hoc environments was way more complicated than what PAAS vendors are telling us :wink:

2 Likes

You’re welcome! :slight_smile:

Funny enough, I really only got started writing custom seed code for my own (big) Elixir project after I couldn’t restore a backup of my coworker’s own ‘subset of prod’!

I found that only the first ‘big slice’ of custom seed data was particularly time-consuming (and even then it only took a few hours over 1–2 days, maybe). While adding that first slice, I added helpers (and refactored a few things), and since then adding new seed data has been much easier.

I would definitely suggest just starting with enough seed data for your current projects (i.e. issues/tickets/stories/whatever) – not writing ‘all the seeds you (think you’ll) need’ all at once.

I will probably publish a blog post on the topic. I found out that deploying pull requests on ad-hoc environments was way more complicated than what PAAS vendors are telling us :wink:

GitLab’s version of this uses Docker and K8s, and (so I’m guessing) assumes you can already deploy an arbitrary environment.

Hey @cblavier, just wondering if you ever got around to that blog post? I’d be very interested in reading it!

1 Like

Hey @kenny-evitt would you by any chance have an example script you could share of how to go about this?

1 Like

I can’t share the examples as-is, but I can share examples that I think convey the gist.

I have a Mix alias in mix.exs like this:

  defp aliases do
    [
      ...
      "ecto.setup": [
        ...
        "run priv/repo/seeds.exs"
      ],
      ...
    ]
  end

You could also run the seeds script by itself:

$ mix run priv/repo/seeds.exs

My priv/repo/seeds.exs file looks like:

Code.eval_file("priv/repo/seed/something.exs")
Code.eval_file("priv/repo/seed/something-else.exs")
...

But it’s totally reasonable to just start with everything in one file.

priv/repo/seed/something.exs would look like:

something_1 =
  %MyApp.Something{
    name: "Blah blah blah",
    ...
  }
  |> Repo.insert!()

...

Note that you can access your app’s code, so you can, e.g., call ‘regular’ functions in your app’s modules or its dependencies, or extract code or data into new modules:

password = RepoSeedsInfo.password()
user_1_email = RepoSeedsInfo.user_1_email()

user_1 =
  MyApp.User.changeset(
    %MyApp.User{},
    %{
      email:    user_1_email,
      password: password
    }
  )
  |> Repo.insert!()

My main use case is for populating :dev and :test DBs but you could split up or organize your code to handle whatever else you want.
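Since Mix.env/0 is available when the script runs under mix run, one way to split things up (the individual seed file names here are made up) is to branch in the top-level seeds.exs:

```elixir
# Sketch only — the seed file names are hypothetical examples.
base = ["priv/repo/seed/users.exs", "priv/repo/seed/catalog.exs"]

extra =
  case Mix.env() do
    :dev -> ["priv/repo/seed/dev_demo_data.exs"]
    _ -> []
  end

Enum.each(base ++ extra, &Code.eval_file/1)
```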

Thank you @kenny-evitt. How do you handle wanting to generate a lot of Somethings? Or do you not often find yourself needing larger samples to operate on?

1 Like

Youā€™re welcome!

Pragmatically, I’ve mostly been slowly accumulating seed data for specific purposes, so I’ve mostly been adding small ‘bespoke’ samples.

Generating larger samples is, in general, arbitrarily hard (or can be), e.g. ‘mimicking production’.

But again, in practice, it’s been easy enough to write a function like MyApp.Test.AThing.random_thing/x if you do need lots of seed items. I’ve found a lot of overlap between code like that for tests and for generating seed data. It’s mostly been helpful for, e.g., testing UI lists or the bounds/filters of complex DB queries, and not (much at all) for ‘simulating production’.
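A sketch of what such a generator might look like (the module name, the context function, and the specific Faker calls are all assumptions on my part):

```elixir
defmodule MyApp.Test.AThing do
  # Hypothetical bulk generator; assumes the Faker package and a
  # MyApp.Things context exposing create!/1.
  def random_thing(overrides \\ %{}) do
    %{
      name: Faker.Commerce.product_name(),
      price: Faker.Commerce.price()
    }
    |> Map.merge(overrides)
    |> MyApp.Things.create!()
  end
end

# e.g. 500 rows for exercising UI lists and query filters:
for _ <- 1..500, do: MyApp.Test.AThing.random_thing()
```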