Gen_tcp connection goes "reconnect crazy"?

Hi all,

I’m working on an Elixir syslog client library that supports a dynamic number of destination servers. I’m using a GenServer as a thin wrapper around the TCP and SSL connections to the destination servers. Things are mostly working; however, I’m running into some issues when a gen_tcp connection is closed.

The behavior I see is that when I shut down a destination syslog server, the GenServer connected to that server goes “reconnect crazy”: it reconnects to the server immediately, then gets another :tcp_closed message. This repeats until the connect attempts start returning econnrefused errors. From my logs it looks a little like:


12:46:17.457 [warn]  Lost connection to localhost:514
 
12:46:17.459 [info]  Connected to localhost:514
 
12:46:17.463 [warn]  Lost connection to localhost:514

(The preceding log lines repeat ~50 times)

12:46:17.465 [warn]  Could not connect to localhost:514 -- econnrefused.
 
12:46:17.465 [warn]  Waiting 0 seconds for localhost:514.
 
12:46:17.466 [warn]  Could not connect to localhost:514 -- econnrefused.
 
12:46:17.466 [warn]  Waiting 1 seconds for localhost:514.
 
12:46:18.467 [info]  Connected to localhost:514

I have another abstraction around an SSL connection which handles this case just fine: when those connections are closed, the GenServer doesn’t show this behavior. Maybe the ssl_connection_sup handles something that I don’t?

I’m using these options when opening the TCP connections:

    conn_opts = [
      # Use active: :once to get a :tcp_closed message if the connection is closed
      active: :once,
      keepalive: true,
      reuseaddr: true,
      send_timeout: 1000,
      send_timeout_close: true
    ]

And some other sections of the code related to this code path:

{:tcp, :ssl}_closed message handling

Connection handling
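
Boiled down, those two sections look roughly like this (heavily simplified from the linked code; @conn_opts stands for the options above, and the retries field and backoff/1 helper are just stand-ins for what the real modules do):

    # Heavily simplified sketch of the linked code -- names are stand-ins.
    def handle_continue({host, port}, state) do
      case :gen_tcp.connect(to_charlist(host), port, @conn_opts, 1000) do
        {:ok, socket} ->
          Logger.info("Connected to #{host}:#{port}")
          {:noreply, %{state | socket: socket, retries: 0}}

        {:error, reason} ->
          Logger.warn("Could not connect to #{host}:#{port} -- #{reason}")
          wait = backoff(state.retries)
          Logger.warn("Waiting #{wait} seconds for #{host}:#{port}.")
          Process.sleep(wait * 1000)
          {:noreply, %{state | retries: state.retries + 1}, {:continue, {host, port}}}
      end
    end

    # active: :once means the socket delivers {:tcp_closed, socket} when the
    # peer goes away; that message is what kicks off the reconnect.
    def handle_info({:tcp_closed, _socket}, state) do
      Logger.warn("Lost connection to #{state.host}:#{state.port}")
      {:noreply, state, {:continue, {state.host, state.port}}}
    end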

Anyone see anything I’m doing wrong here? Or some debugging tips? Thanks for looking!

Try having handle_close/1 return {:noreply, state} instead of {:noreply, state, {:continue, {state.host, state.port}}}. As it is, I think you’re telling the GenServer to run handle_continue/2 immediately after it returns.
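
In other words, something like this (I’m guessing at the shape of handle_close/1; the only part that matters is the return value):

    # Guessing at handle_close/1's shape -- the point is the return value.
    defp handle_close(state) do
      # Returning {:noreply, state, {:continue, {state.host, state.port}}}
      # instead queues handle_continue/2 to run immediately, before any other
      # message, which is what makes the reconnect instantaneous.
      {:noreply, state}
    end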

If that’s not it, maybe process restarts explain it. To be sure whether that’s the case you could include #{inspect self()} in the repeating log messages and check if the PID changes each time.

Another thing I wonder: how long between repeated log messages?

EDIT: Ah, never mind my last question. I see now that the elided ~50 iterations of messages are followed by a timestamp only 0.002s later, so it must be a rapid-fire loop.

1 Like

If handle_close/1 doesn’t trigger handle_continue/2, then the reconnect never happens. The way it’s set up now, all of the connection logic (including backoff) happens in handle_continue/2. I set it up this way so that callers of start_link/1 aren’t blocked waiting for the connection. I thought this was a pretty nice way to handle the reconnect logic, since we can just jump back into the handle_continue/2 loop when the GenServer gets a :tcp_closed or :ssl_closed message.
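
Roughly, init/1 just hands off to the same continue (simplified, with field names abbreviated):

    # Simplified: init/1 returns right away and defers the connect to
    # handle_continue/2, so whoever calls start_link/1 isn't blocked.
    def init(opts) do
      state = %{host: opts[:host], port: opts[:port], socket: nil, retries: 0}
      {:ok, state, {:continue, {state.host, state.port}}}
    end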

I posted the full output here which shows that the PID never changes as the reconnect loop happens. The GenServer isn’t currently configured to be restarted if the connection goes down. The supervisor only restarts the GenServer if it exits with an abnormal reason.

2 Likes

That does sound like a nice approach.

It looks like the backoff logic only kicks in when the internal connect call returns {:error, reason}, but it sounds like you also want it to kick in when an existing connection closes unexpectedly. To accomplish that, you could check whether the state passed to handle_continue/2 has a non-nil :socket, and in that case treat it the same way you treat errors.
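
Something along these lines, sketched against a guessed state shape (handle_backoff/3 is a made-up name for whatever your existing error path does):

    # Sketch: re-entering handle_continue/2 while still holding a socket means
    # the previous connection died unexpectedly -- run the same backoff path
    # you already use for connect errors instead of dialing again right away.
    # (This clause needs to sit above the regular connect clause.)
    def handle_continue({host, port}, %{socket: socket} = state) when socket != nil do
      :gen_tcp.close(socket)
      handle_backoff(host, port, %{state | socket: nil})
    end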

That’s a good question. I can’t explain why the TCP and SSL connections would behave differently. I haven’t used the :ssl module yet.

For anyone else following along, I noticed the SSL abstraction applies active: false whereas the TCP abstraction sets active: :once.

EDIT: Is active: false for SSL preventing the {:ssl_closed, socket} message from being delivered? That would mean the corresponding handle_info/2 clause never gets called, while the TCP one does.

1 Like

Is it a bad idea to try to reconnect immediately when the TCP socket is closed? I don’t want to start backing off before I have to. To me it seems like the TCP connection to the down server is somehow being opened successfully in handle_continue/2, since the connection error path isn’t hit for ~1 second.

I wonder if this has something to do with Docker… to test this I’m just restarting the Docker container running syslog-ng. Not sure if it’s possible for the connection to be accepted while the container is shutting down? I played around with this a bit, and adding a sleep before trying to reconnect avoids the reconnection loop: the GenServer immediately gets the econnrefused error and starts backing off as expected.
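
The sleep was just a throwaway change in the :tcp_closed handler, something like:

    # Throwaway experiment: pause before jumping back into the connect loop.
    def handle_info({:tcp_closed, _socket}, state) do
      Logger.warn("Lost connection to #{state.host}:#{state.port}")
      Process.sleep(1000)
      {:noreply, state, {:continue, {state.host, state.port}}}
    end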

I don’t know exactly what the ssl_connection_sup in the :ssl module does to monitor the connection. What I’ve noticed is that it successfully handles the connection being closed and sends the GenServer a :ssl_closed message (note this is with the sleep before reconnecting removed):

iex(6)> 
12:57:09.158 [warn]  Lost connection to localhost:6514
 
12:57:09.160 [warn]  Could not connect to localhost:6514 -- closed.
 
12:57:09.160 [warn]  Waiting 0 seconds for localhost:6514.
 
12:57:09.162 [warn]  Could not connect to localhost:6514 -- closed.
 
12:57:09.162 [warn]  Waiting 1 seconds for localhost:6514.
 
12:57:11.190 [warn]  Could not connect to localhost:6514 -- closed.
 
12:57:11.190 [warn]  Waiting 2 seconds for localhost:6514.
 
12:57:13.202 [info]  Connected to localhost:6514

Maybe it’s time for me to read the source code for the connection monitoring in the :ssl module…

Retrying once without delay makes sense, but if it doesn’t work I’d expect backoff to kick in right away. The log you shared shows that immediately after the connection is made it gets dropped, and then immediately after that the connection is re-established. This rapid-fire connect-disconnect pattern continues for 0.314 ms after the first “Lost connection” message, at which point you finally get an econnrefused error instead of a new connection.

I’m frankly puzzled. A couple ideas, though:

  1. Although there’s an NSyslog.Protocol.TCP.close/1 function which closes the :gen_tcp socket, I don’t see it used. Does explicitly closing the socket in the {:tcp_closed, _} info handler before continuing to reconnect change anything? (See the sketch after this list.)
  2. For each connect-disconnect iteration, are packets actually being sent and received? More specifically, does the server receive a SYN, reply with a SYN-ACK, and then receive an ACK? Tools like Wireshark should help here.
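
For the first idea, I mean something like this (sketch only; I’m using :gen_tcp.close/1 directly since I’m not sure what NSyslog.Protocol.TCP.close/1 expects as its argument):

    # Sketch for idea 1: make sure our side of the socket is fully closed
    # before re-entering the connect loop.
    def handle_info({:tcp_closed, socket}, state) do
      :gen_tcp.close(socket)
      {:noreply, %{state | socket: nil}, {:continue, {state.host, state.port}}}
    end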

Hope that helps too. Good luck!

1 Like