If you're using `Oban.Notifiers.Postgres`, use `postgrex >= 0.20.0`

nathanl · February 5, 2025, 6:23pm

We were seeing bugs where 1) jobs would sit indefinitely in “available” and 2) the Oban Web dashboard would sometimes display a count of 0 for all queues.

It seems the root cause for both was:

Oban.Notifiers.Postgres uses Postgrex.SimpleConnection
If the connection to PostgreSQL “goes bad” (stays connected, but stops
responding)…
Then Postgrex.SimpleConnection would seem to reconnect but
actually would not establish a new connection to PostgreSQL (as shown
in the pg_stat_activity table)…
and from then on, that node would not run Oban jobs and its Oban Web
UI would show count 0 for every queue

This issue has been fixed in postgrex as of v0.20.0, released today.

nathanl · February 5, 2025, 6:31pm

Details on how we reproduced the bug and confirmed the fix:

Modify Oban.Notifiers.Postgres to look for an ENV var, and, if present, connect to PostgreSQL via a different port, proxied by GitHub - Shopify/toxiproxy: ⏰ A TCP proxy to simulate network and system conditions for chaos and resiliency testing
Start the Phoenix app locally in a 3-instance cluster, each listening on a different HTTP port, and give 2 instances the “use toxiproxy” ENV var
In psql, see that there are 3 connections to the database whose last executed query was a LISTEN using SELECT state_change, client_addr, client_port, pid FROM pg_stat_activity WHERE state = 'idle' and datname = 'gridpoint_dev' AND query LIKE '%LISTEN%' ORDER BY state_change DESC;
Insert many jobs
As they begin to be processed, turn on a “toxic” in toxiproxy, telling it to leave those connections open but block all data from flowing to the client: curl -X POST -H "Content-Type: application/json" -d '{ "name": "immediate_timeout", "type": "timeout", "toxicity": 1.0, "stream": "downstream", "attributes": { "timeout": 0 } }' http://localhost:8474/proxies/postgres/toxics
After a few seconds, see that the pg_stat_activity query now shows only 1 connection
See that Oban Web on the host not using toxiproxy shows jobs waiting in “available”, but on one of the hosts using toxiproxy, shows 0 counts for all queues. (Note: in a deployed environment with hosts behind a load balancer this is not apparent.)
Turn off the “toxic” in toxiproxy: curl -X DELETE http://localhost:8474/proxies/postgres/toxics/immediate_timeout
Observe different behavior:
- With postgrex v0.19.3 (2024-11-12), pg_stat_activity does not show the other 2 connections get reestablished, the “available” jobs remain available, and Oban Web on the formerly toxiproxied host keeps showing 0 for queue counts until/unless the Oban.Notifiers.Postgres process is killed and restarted, at which point that host’s jobs are processed and Oban Web on that host shows correct queue counts
- With postgrex v0.20.0 (2025-02-05), pg_stat_activity shows the other 2 connections get reestablished, the “available” jobs get processed, and Oban Web on the formerly toxiproxied hosts show correct queue counts

To set up toxiproxy for the usage above:

# start toxiproxy
# Make the control API available on port 8474 and the proxy port on port 8666:
docker run -d --rm --name toxiproxy -p 8474:8474 -p 8666:8666 shopify/toxiproxy

# verify it's running
docker ps | grep toxiproxy

# create a proxy for postgres
curl -X POST -H "Content-Type: application/json" \
  -d '{
    "name": "postgres",
    "listen": "0.0.0.0:8666",
    "upstream": "host.docker.internal:5432"
  }' \
  http://localhost:8474/proxies

# verify you can connect to postgres through it
psql -h localhost -p 8666 -U postgres