Automatic Ecto/Postgrex recovery after Aurora failover

cjbottaro · November 1, 2023, 6:38pm

Are there any callbacks that I can hook into to reconnect, including redoing DNS resolution?

We’re having some strange issues with Aurora. Not sure exactly what’s going on, but it looks like it’s failing over. When this happens, the primaries become read-only and all our Ecto connections start raising exceptions about the database being read-only.

ERROR 25006 (read_only_sql_transaction) cannot execute UPDATE in a read-only transaction

Restarting all our pods fixes the issue because we get fresh new connections.

Looking for a way to automatically handle this scenario. Currently we get alerted to all the failures, then we manually cycle out our pods.

Thanks for the help.

christhekeele · November 1, 2023, 7:27pm

It looks like module-based Ecto.Repos only have a few callbacks. I would have expected to see something analogous to the traditional terminate/1 callback here if this were possible, but perhaps I’m looking in the wrong place?

adw632 · November 1, 2023, 8:33pm

Postgrex disconnect_all might help.

Also review your setup using pgbouncer:

cjbottaro · November 2, 2023, 4:11pm

We used RDS Proxy, but since Rails sets state on connections, it completely broke the proxy.

I’m not sure if Postgrex/Ecto sets any state on the connections that would cause that pinning. If not, then maybe we could use RDS Proxy for our Elixir apps only.

maartenvanvliet · November 3, 2023, 7:01am

Did you explore the options Postgrex has to handle failovers?

E.g. :disconnect_on_error_codes, which can be combined with multiple endpoints for dealing with failover.

See Postgrex — Postgrex v0.17.3
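
For illustration, here is a rough sketch of what that could look like in a repo config. The app name `:my_app`, the repo module `MyApp.Repo`, the environment variable names, and the endpoint hostnames are all placeholders, not anything from this thread. `:target_server_type` isn't mentioned above; it's another Postgrex option that I believe pairs with `:endpoints`, so check the docs for your Postgrex version before relying on it.

```elixir
# config/runtime.exs (sketch only; all names and hosts are placeholders)
import Config

config :my_app, MyApp.Repo,
  username: System.get_env("DB_USER"),
  password: System.get_env("DB_PASSWORD"),
  database: System.get_env("DB_NAME"),
  # Postgrex tries each endpoint in order until a connection succeeds.
  # Use this instead of :hostname/:port.
  endpoints: [
    {"db-node-1.example.internal", 5432},
    {"db-node-2.example.internal", 5432}
  ],
  # Assumption: only accept an endpoint that is currently a writable primary.
  target_server_type: :primary,
  # Disconnect any connection that receives this error code, so the pool
  # opens a fresh connection instead of reusing one stuck on a read-only node.
  disconnect_on_error_codes: [:read_only_sql_transaction]
```

With something like this, the query that hits the 25006 error should still fail, but that connection gets dropped and replaced rather than staying in the pool, which is roughly what cycling the pods does today.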