Database connection errors: ssl send failed

We are using Elixir 1.7.4 with Ecto 3.0.0 on OTP 19, with PostgreSQL 9.6 hosted on Azure (Azure Database for PostgreSQL).

We face frequent DB connection issues: after a while the database stops accepting connections and the application becomes unresponsive. Some of the errors that get logged are shown below:


GenServer Cluster.Strategy.Postgres.Notifications terminating
** (stop) %DBConnection.ConnectionError{message: "ssl async recv: closed", severity: :error}
Last message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.121>, :tls_connection, :undefined}, [#PID<0.3474.0>, #PID<0.3473.0>]}}
State: %Postgrex.Notifications{idle_timeout: 5000, listener_channels: %{"cluster" => %{#Reference<0.3015757464.3877371905.67662> => #PID<0.3481.0>}}, listeners: %{#Reference<0.3015757464.3877371905.67662> => {"cluster", #PID<0.3481.0>}}, parameters: nil, protocol: %Postgrex.Protocol{buffer: :active_once, connection_id: 585712, connection_key: 419216344, disconnect_on_error_codes: [], null: nil, parameters: #Reference<0.3015757464.3877371905.67659>, peer: {{191, 238, 6, 43}, 5432}, postgres: :idle, queries: nil, sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.121>, :tls_connection, :undefined}, [#PID<0.3474.0>, #PID<0.3473.0>]}}, timeout: 15000, transactions: :naive, types: nil}}


[libcluster:titan] notify failed: %Protocol.UndefinedError{description: "", protocol: String.Chars, value: %DBConnection.ConnectionError{message: "ssl send: closed", severity: :error}}


Postgrex.Protocol (#PID<0.3470.0>) disconnected: ** (DBConnection.ConnectionError) ssl send: closed


GenServer #PID<0.3481.0> terminating
** (stop) exited in: :gen_server.call(Cluster.Strategy.Postgres.Notifications, {:listen, "sync"}, 5000)
** (EXIT) %DBConnection.ConnectionError{message: "ssl recv: closed", severity: :error}
(stdlib) gen_server.erl:223: :gen_server.call/3
(titan) lib/libcluster/strategy/postgres/worker.ex:48: Cluster.Strategy.Postgres.Worker.handle_info/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: :sync
State: %Cluster.Strategy.State{config: [hostname: "titan.postgres.database.azure.com", database: "titan", username: "xxx", password: "xxx", port: 5432, ssl: true], connect: {:net_kernel, :connect_node, []}, disconnect: {:erlang, :disconnect_node, []}, list_nodes: {:erlang, :nodes, [:connected]}, meta: %{channel: "cluster", connection: Cluster.Strategy.Postgres.Connection, notifications: Cluster.Strategy.Postgres.Notifications, subscription: #Reference<0.982578244.1168113666.181066>}, topology: :titan}


GenServer #PID<0.3481.0> terminating
** (stop) exited in: :gen_server.call(Cluster.Strategy.Postgres.Notifications, {:listen, "sync"}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(stdlib) gen_server.erl:223: :gen_server.call/3
(titan) lib/libcluster/strategy/postgres/worker.ex:48: Cluster.Strategy.Postgres.Worker.handle_info/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: :sync
State: %Cluster.Strategy.State{config: [hostname: "titan.postgres.database.azure.com", database: "titan", username: "xxx", password: "xxx", port: 5432, ssl: true], connect: {:net_kernel, :connect_node, []}, disconnect: {:erlang, :disconnect_node, []}, list_nodes: {:erlang, :nodes, [:connected]}, meta: %{channel: "cluster", connection: Cluster.Strategy.Postgres.Connection, notifications: Cluster.Strategy.Postgres.Notifications, subscription: #Reference<0.3015757464.3877371905.67662>}, topology: :titan}

This has been happening too frequently, and we are not sure what is going wrong. Is it an issue with the versions of the software we are using (they have not been updated in a while), the connection pool settings, or something else? Can someone advise on how we can debug and resolve such issues?

Erlang’s SSL implementation has evolved over time, so you may need to upgrade Erlang/OTP; it’s likely that something changed on the Azure side.

Just a blind guess.
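
If it helps narrow this down, a quick way to confirm which Erlang/OTP release and which version of the ssl application the node is actually running is sketched below. It is only a diagnostic to run in an IEx session on the affected node, and it assumes the ssl application is available in the release; it does not change any connection behaviour.

```elixir
# Run in an IEx session on the affected node (e.g. `iex -S mix`).

# OTP release the VM is running, returned as a charlist.
otp_release = :erlang.system_info(:otp_release)

# Load the :ssl application's metadata so its version can be read.
# Returns :ok, or an error tuple if it is already loaded; both are fine here.
_ = Application.load(:ssl)

# Version of the :ssl application, or nil if it is not available.
ssl_vsn = Application.spec(:ssl, :vsn)

IO.puts("Elixir:  #{System.version()}")
IO.puts("OTP:     #{otp_release}")
IO.puts("ssl app: #{ssl_vsn}")
```

Comparing these values against the versions you expect to be deployed shows whether the node is really on OTP 19 and how old its ssl application is, which is the first thing to know before deciding on an upgrade.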