Experimenting with running MySQL tests in parallel

Hey all,

At my day job, we have a big Elixir mono repo and a large test suite. Most of our apps use MySQL as the database, and since the Ecto sandbox does not support async tests for MySQL, our test suite takes a long time to complete. Basically, async: true won’t work because of the way MySQL transactions work.

The approach we are experimenting with

We are trying to work around the MySQL transaction issue by using the database itself as the isolation layer. That is, we create multiple databases with the same structure in a single MySQL instance and configure each connection in the pool to connect to a different database. For example, if pool_size is 10, we point the 10 connections at 10 different databases with the same structure. Since each connection is isolated by its own database instead of by a transaction, there are no deadlocks or similar issues.

We wrote a wrapper around ecto.create and ecto.migrate to bring up the multiple databases for the test env, and we made it run in parallel so that the CI pipeline stays fast. A rough sketch of what such a wrapper can look like is below.
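Purely as an illustration (this is not our exact wrapper): a minimal sketch of such a Mix task, assuming the :my_app / DummyRepo names from the config further down, a dummy_N naming scheme, and Ecto 3’s Ecto.Migrator with its :dynamic_repo option:

defmodule Mix.Tasks.Test.SetupDbs do
  use Mix.Task

  @shortdoc "Creates and migrates dummy_1..dummy_N for the test env"
  def run(_args) do
    Mix.Task.run("app.config")
    Enum.each([:ecto_sql, :myxql], &Application.ensure_all_started/1)

    base = Application.fetch_env!(:my_app, DummyRepo)
    pool_size = Keyword.fetch!(base, :pool_size)

    1..pool_size
    |> Task.async_stream(fn i -> setup_db(base, i) end, timeout: :infinity)
    |> Stream.run()
  end

  defp setup_db(base, i) do
    # Neutralize the per-connection :configure override (shown below) so this
    # instance really talks to dummy_#{i}, and set the database explicitly.
    config =
      base
      |> Keyword.put(:configure, nil)
      |> Keyword.put(:database, "dummy_#{i}")

    # Create the database if it does not exist yet.
    case Ecto.Adapters.MyXQL.storage_up(config) do
      :ok -> :ok
      {:error, :already_up} -> :ok
      {:error, reason} -> raise "could not create dummy_#{i}: #{inspect(reason)}"
    end

    # Start a private repo instance pointed at this database and migrate it.
    {:ok, pid} = DummyRepo.start_link(Keyword.merge(config, name: nil, pool_size: 2))
    path = Ecto.Migrator.migrations_path(DummyRepo)
    Ecto.Migrator.run(DummyRepo, path, :up, all: true, dynamic_repo: pid)
    Supervisor.stop(pid)
  end
end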

Now we configure a separate database for each connection using configure/1

config :my_app, DummyRepo,
  adapter: Ecto.Adapters.MyXQL,
  ....
  hostname: "localhost",
  port: 3306,
  pool: Ecto.Adapters.SQL.Sandbox,
  pool_size: 10,
  configure: fn args ->
    Keyword.put(args, :database, "dummy_#{args[:pool_index]}") # dummy_1, dummy_2 ... dummy_10
  end

Then we switched our tests to async: true

This is working fine, and we are seeing the speedup we were aiming for.

The concern here, though, is that pool_index is undocumented, and I think it is not meant to be used. But replacing it would be tricky. So I wanted to know if there is any suggestion for handling this better. Perhaps the Ecto team can help?

Or are you guys following a totally different approach altogether to run MySQL tests in parallel?


There’s a similar approach baked into mix test, but it’s partitioned at the level of OS processes instead of per-connection. The default configuration generated by Phoenix in config/test.exs appends the partition to the database name:
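From memory it looks roughly like this (exact defaults vary by Phoenix version, and the generated example targets Postgres, but the same idea applies to MyXQL):

config :my_app, MyApp.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "my_app_test#{System.get_env("MIX_TEST_PARTITION")}",
  pool: Ecto.Adapters.SQL.Sandbox,
  pool_size: 10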

However, this process-level strategy seems like it only makes sense for splitting tests in a CI-like environment (where each MIX_TEST_PARTITION is a separate node that runs migrations, etc.) versus running them locally.

Or are you guys following a totally different approach altogether to run MySQL tests in parallel?

In another programming language, I created a new Postgres database for every single test. It made test parallelization very easy. While I’m sure there was some overhead, the speedup from parallelization clearly made up for it. I used Postgres’ “template database” feature; maybe something similar exists in MySQL.
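In Elixir terms the idea would look roughly like this (a loose Postgrex sketch with made-up names, not my original code):

{:ok, conn} =
  Postgrex.start_link(
    hostname: "localhost",
    username: "postgres",
    password: "postgres",
    database: "postgres"
  )

# Clone a pre-migrated template database for a single test. Note that
# CREATE DATABASE ... TEMPLATE requires that nothing else is connected
# to the template database at that moment.
db = "test_#{System.unique_integer([:positive])}"
Postgrex.query!(conn, ~s(CREATE DATABASE "#{db}" TEMPLATE my_app_test_template), [])

# ... run the test against `db` ...

# Drop the per-test database afterwards.
Postgrex.query!(conn, ~s(DROP DATABASE "#{db}"), [])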

This doesn’t help with your Ecto question though :crazy_face:

We looked into MIX_TEST_PARTITION. But as you mentioned, it depends on the user invoking multiple instances, so it is hard to make use of it for local development. And it is comparatively heavyweight: running 10 BEAM VMs vs running one BEAM VM with 10 databases. We do use a process-level strategy, but a bit differently, without using MIX_TEST_PARTITION: we split all of our apps into two groups and run the tests for these two groups in parallel using separate CI jobs.

I used Postgres’ “template database” feature; maybe something similar exists in MySQL.

We looked into this as well, but unfortunately there is no equivalent in the MySQL space. The only way to do this is to dump and restore the MySQL structure, and DDL commands in MySQL are slow. There are some interesting ways to make it faster, but as it stands, I think it is hard to get this benefit without Postgres’ template secret sauce.

Out of curiosity, do you know why that is the case? What characteristic of MySQL transactions makes them unsuitable for async tests?

Even though we are using Postgres at work, we are limited by the Ecto sandbox as well.

Is there any chance your current solutions/wrappers/scripts might be released as a Hex package?

If you don’t want to rely on undocumented behaviour (I personally wouldn’t for anything serious), you could create multiple repos with different values for :database.

I think there are multiple scenarios which can cause a deadlock. I don’t remember all the details, but one such case is inserting multiple rows into a column that has a unique index: in InnoDB, inserts in a transaction take a gap lock on the index value range if the column has a unique index.

    # Two concurrent transactions inserting overlapping :title values can block
    # each other on InnoDB gap locks and deadlock:
    Task.async_stream(1..2, fn _ ->
      Repo.transaction(fn ->
        for num <- 9..1//-1 do
          Repo.insert!(%Post{title: "foo_#{num}"}) # :title has a unique index
        end

        Repo.rollback(:ok)
      end)
    end)
    |> Stream.run()

@NobbZ I would be happy to share it. But I’m not sure how to package it (perhaps Ecto is the right place for something like this?), and the question about pool_index still remains. I’ll just create a draft on GH; let’s see from there.

@joey_the_snake can you expand on this? You mean there would be multiple Ecto.Repo instances connecting to different databases with the same structure? If so, then the code under test would need to dynamically choose one of these repos, right?

Btw, we only need to do this for the test env. So the worst case is that it breaks the pipeline and we switch back to the old approach.


Yeah you got it. It would be similar to the replica setup here: Replicas and dynamic repositories — Ecto v3.11.1.

TBH I don’t understand your current method too well because I’m not sure how configure/1 or :pool_index work. But this is one way to have multiple databases ready to be queried.
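Very roughly, something like this (just a sketch with made-up names, in the spirit of the replica example from the Ecto docs; each generated repo also needs its own entry in config/test.exs pointing at a different :database):

defmodule MyApp.TestRepos do
  # One repo module per test database. Each needs config along the lines of:
  #
  #   config :my_app, MyApp.Repo.Partition1, database: "my_app_test_1", ...
  #
  @partitions for i <- 1..10, do: Module.concat(MyApp.Repo, "Partition#{i}")

  for repo <- @partitions do
    defmodule repo do
      use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.MyXQL
    end
  end

  def all, do: @partitions

  # A test (or a setup hook) picks one of the repos, e.g. at random.
  def random, do: Enum.random(@partitions)
end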

IMO you already handle it well enough.

The only thing I’d change, if I had a day to burn on this, would likely be to make my own docker-compose.yml file that spins up 20-50 separate Elixir containers, each with its own DB, control them with environment variables (or indeed with MIX_TEST_PARTITION), and see if that accelerates things even further. I doubt it, but it could be fun to try, because I/O-bound code usually parallelizes super well at something like 5x your CPU threads.

And I wouldn’t worry about using undocumented mechanisms here. Just use that until an official option is there, and if a version breaks it you can then decide to either stick with the older version for a while or move to Docker [Compose] based testing.

If I were on your team, I’d simply tell you “good job, you did really well”.