We were seeing bugs where 1) jobs would sit indefinitely in “available” and 2) the Oban Web dashboard would sometimes display a count of 0 for all queues.
It seems the root cause for both was:
Oban.Notifiers.Postgres
uses Postgrex.SimpleConnection
- If the connection to PostgreSQL “goes bad” (stays connected, but stops
responding)…
- Then
Postgrex.SimpleConnection
would seem to reconnect but
actually would not establish a new connection to PostgreSQL (as shown
in the pg_stat_activity
table)…
- and from then on, that node would not run Oban jobs and its Oban Web
UI would show count 0 for every queue
This issue has been fixed in postgrex as of v0.20.0, released today.
3 Likes
Details on how we reproduced the bug and confirmed the fix:
- Modify
Oban.Notifiers.Postgres
to look for an ENV var, and, if present, connect to PostgreSQL via a different port, proxied by GitHub - Shopify/toxiproxy: ⏰ A TCP proxy to simulate network and system conditions for chaos and resiliency testing
- Start the Phoenix app locally in a 3-instance cluster, each listening on a different HTTP port, and give 2 instances the “use toxiproxy” ENV var
- In
psql
, see that there are 3 connections to the database whose last executed query was a LISTEN
using SELECT state_change, client_addr, client_port, pid FROM pg_stat_activity WHERE state = 'idle' and datname = 'gridpoint_dev' AND query LIKE '%LISTEN%' ORDER BY state_change DESC;
- Insert many jobs
- As they begin to be processed, turn on a “toxic” in toxiproxy, telling it to leave those connections open but block all data from flowing to the client:
curl -X POST -H "Content-Type: application/json" -d '{ "name": "immediate_timeout", "type": "timeout", "toxicity": 1.0, "stream": "downstream", "attributes": { "timeout": 0 } }' http://localhost:8474/proxies/postgres/toxics
- After a few seconds, see that the
pg_stat_activity
query now shows only 1 connection
- See that Oban Web on the host not using toxiproxy shows jobs waiting in “available”, but on one of the hosts using toxiproxy, shows 0 counts for all queues. (Note: in a deployed environment with hosts behind a load balancer this is not apparent.)
- Turn off the “toxic” in toxiproxy:
curl -X DELETE http://localhost:8474/proxies/postgres/toxics/immediate_timeout
- Observe different behavior:
- With postgrex v0.19.3 (2024-11-12),
pg_stat_activity
does not show the other 2 connections get reestablished, the “available” jobs remain available, and Oban Web on the formerly toxiproxied host keeps showing 0 for queue counts until/unless the Oban.Notifiers.Postgres
process is killed and restarted, at which point that host’s jobs are processed and Oban Web on that host shows correct queue counts
- With postgrex v0.20.0 (2025-02-05),
pg_stat_activity
shows the other 2 connections get reestablished, the “available” jobs get processed, and Oban Web on the formerly toxiproxied hosts show correct queue counts
To set up toxiproxy for the usage above:
# start toxiproxy
# Make the control API available on port 8474 and the proxy port on port 8666:
docker run -d --rm --name toxiproxy -p 8474:8474 -p 8666:8666 shopify/toxiproxy
# verify it's running
docker ps | grep toxiproxy
# create a proxy for postgres
curl -X POST -H "Content-Type: application/json" \
-d '{
"name": "postgres",
"listen": "0.0.0.0:8666",
"upstream": "host.docker.internal:5432"
}' \
http://localhost:8474/proxies
# verify you can connect to postgres through it
psql -h localhost -p 8666 -U postgres
3 Likes