Unexpected :ssh_sftp :timeout after SIGINT

mashton · May 15, 2024, 5:34pm

I’m trying to understand an unexpected :timeout during :ssh_sftp.start_channel/3. This is the flow of events:

A Broadway pipeline is running, using a BroadwayKafka producer
The application receives SIGINT and starts to shutdown
Broadway batchers execute Broadway.handle_batch/4, kicking off a series of synchronous events
The last of which is writing a file to a remote SFTP server

The precise timing:

SIGINT received: 15:39:26.382
First SFTP connection attempted: 15:39:27.618
First SFTP connection timeout: 15:39:32.620
The last Kafka consumer leaves the group and the app exits with 0: 15:39:52.711

So the connection “times out” before the app fully shuts down.

I have two threads I’m not sure how to pull on any further:

The fact that the timeouts occur about exactly 5s after the attempt makes me think there’s some other underlying timeout. But why would we only hit this timeout on SIGINT?
Maybe :timeout is a misleading error here and there’s something else going on entirely

Anybody have any ideas? I’m hoping there’s a magic “Ah you forgot about !” solution here.

D4no0 · May 15, 2024, 7:13pm

There are some more timeouts involved in the ssh client, like for example send_timeout for gen_tcp that is used by ssh client.

Have you tried running ssh with debug messages logging to console? Erlang -- ssh

mashton · May 15, 2024, 7:57pm

I’ll give the debug func a shot. My understanding is that the default for send_timeout here is :infinity, so I wouldn’t expect that to be the issue here. Is that correct?

D4no0 · May 15, 2024, 7:57pm

One more thing, I always was using this config option by default for all sftp connections (don’t remember what it is for, copied it over from a sftp wrapper in elixir):

rekey_limit: 1000000000000

D4no0 · May 15, 2024, 8:08pm

Cannot confirm that for sure, I can say though that I used to have problems with sftp connections hanging randomly on some specific servers.

I would also check if the timeout is not coming from OTP when it tries to shutdown the VM gracefully.

mashton · May 16, 2024, 12:32pm

One of the subtleties of this problem is that these timeouts do not occur during normal operation. They only occur during the shutdown.

I’ve added the ssh debug logging as suggested, and I’ll hopefully get some more info from that today.
@D4no0 , as you suggest:

I would also check if the timeout is not coming from OTP when it tries to shutdown the VM gracefully.

Where would I start digging to find out about this?

I’d also note that the fact that the timeout seems to be 5s, and the default timeout for GenServer.call is 5s, makes me suspicious. But then, why would this only occur during shutdown?

mashton · May 16, 2024, 12:46pm

Sorry, I should have included this in the first place. Here’s the stack trace for the timeout:

[
  {:gen_server, :call, 2, [file: ~c"gen_server.erl", line: 370]}, 
  {:ssh_transport, :default_algorithms, 0, [file: ~c"ssh_transport.erl", line: 96]}, 
  {:ssh_options, :default, 1, [file: ~c"ssh_options.erl", line: 727]}, 
  {:ssh_options, :default, 1, [file: ~c"ssh_options.erl", line: 602]}, 
  {:ssh_options, :handle_options, 3, [file: ~c"ssh_options.erl", line: 184]}, 
  {:ssh, :connect, 4, [file: ~c"ssh.erl", line: 187]}, 
  {:ssh_sftp, :start_channel, 3, [file: ~c"ssh_sftp.erl", line: 200]}, 
  {MyApp.SftpClient, :start, 1, [file: ~c"lib/sftp_client.ex", line: 59]}
]

and the beginning of the error:

{:timeout, {:gen_server, :call, [:application_controller, {:set_env, :ssh, :"$def-algs$",  ... (a long list of encryption algs, presumably configuration)...