Unexpected :ssh_sftp :timeout after SIGINT

I’m trying to understand an unexpected :timeout during :ssh_sftp.start_channel/3. This is the flow of events:

  1. A Broadway pipeline is running, using a BroadwayKafka producer
  2. The application receives SIGINT and starts to shutdown
  3. Broadway batchers execute Broadway.handle_batch/4, kicking off a series of synchronous events
  4. The last of which is writing a file to a remote SFTP server

The precise timing:

  • SIGINT received: 15:39:26.382
  • First SFTP connection attempted: 15:39:27.618
  • First SFTP connection timeout: 15:39:32.620
  • The last Kafka consumer leaves the group and the app exits with 0: 15:39:52.711

So the connection “times out” before the app fully shuts down.

I have two threads I’m not sure how to pull on any further:

  1. The fact that the timeouts occur about exactly 5s after the attempt makes me think there’s some other underlying timeout. But why would we only hit this timeout on SIGINT?
  2. Maybe :timeout is a misleading error here and there’s something else going on entirely

Anybody have any ideas? I’m hoping there’s a magic “Ah you forgot about !” solution here.

There are some more timeouts involved in the ssh client, like for example send_timeout for gen_tcp that is used by ssh client.

Have you tried running ssh with debug messages logging to console? Erlang -- ssh

I’ll give the debug func a shot. My understanding is that the default for send_timeout here is :infinity, so I wouldn’t expect that to be the issue here. Is that correct?

One more thing, I always was using this config option by default for all sftp connections (don’t remember what it is for, copied it over from a sftp wrapper in elixir):

rekey_limit: 1000000000000

Cannot confirm that for sure, I can say though that I used to have problems with sftp connections hanging randomly on some specific servers.

I would also check if the timeout is not coming from OTP when it tries to shutdown the VM gracefully.

One of the subtleties of this problem is that these timeouts do not occur during normal operation. They only occur during the shutdown.

I’ve added the ssh debug logging as suggested, and I’ll hopefully get some more info from that today.
@D4no0 , as you suggest:

I would also check if the timeout is not coming from OTP when it tries to shutdown the VM gracefully.

Where would I start digging to find out about this?

I’d also note that the fact that the timeout seems to be 5s, and the default timeout for GenServer.call is 5s, makes me suspicious. But then, why would this only occur during shutdown?

Sorry, I should have included this in the first place. Here’s the stack trace for the timeout:

  {:gen_server, :call, 2, [file: ~c"gen_server.erl", line: 370]}, 
  {:ssh_transport, :default_algorithms, 0, [file: ~c"ssh_transport.erl", line: 96]}, 
  {:ssh_options, :default, 1, [file: ~c"ssh_options.erl", line: 727]}, 
  {:ssh_options, :default, 1, [file: ~c"ssh_options.erl", line: 602]}, 
  {:ssh_options, :handle_options, 3, [file: ~c"ssh_options.erl", line: 184]}, 
  {:ssh, :connect, 4, [file: ~c"ssh.erl", line: 187]}, 
  {:ssh_sftp, :start_channel, 3, [file: ~c"ssh_sftp.erl", line: 200]}, 
  {MyApp.SftpClient, :start, 1, [file: ~c"lib/sftp_client.ex", line: 59]}

and the beginning of the error:

{:timeout, {:gen_server, :call, [:application_controller, {:set_env, :ssh, :"$def-algs$",  ... (a long list of encryption algs, presumably configuration)...