Hackney client issue/error when download files from S3

Hello,

I have an issue getting error from Hackney client when app downloading bunch of files in parallel from S3

Here error I am getting:

2020-07-22T01:40:22.622 [error] Task #PID<0.25550.1> started from #PID<0.1576.2> terminating
** (stop) exited in: :gen_server.call(:hackney_connections, {:lookup, {:hackney_ssl, 's3.amazonaws.com', 443, [customize_hostname_check: [match_fun: #Function<5.13117337/2 in :public_key.pkix_verify_hostname_match_fun/1>], server_n
ame_indication: 's3.amazonaws.com', verify: :verify_peer, depth: 100, cacerts: [<<48, 130, 6, 75, 48, 130, 4, 51, 160, 3, 2, 1, 2, 2, 17, 0, 217, 181, 67, 127, 175, 169, 57, 15, 0, 0, 0, 0, 85, 101, 173, 88, 48, 13, 6, 9, 42, 134, 
...>>, <<48, 130, 5, 207, 48, 130, 3, 183, 160, 3, 2, 1, 2, 2, 20, 8, 22, 95, 138, 76, 165, 236, 0, 201, 147, 64, 223, 196, 198, 174, 35, 184, 28, 90, 164, 48, 13, ...>>, <<48, 130, 2, 43, 48, 130, 1, 177, 160, 3, 2, 1, 2, 2, 10, 1
23, 113, 182, 130, 86, 184, 18, 124, 156, 168, 48, 10, 6, 8, 42, 134, 72, 206, 61, 4, 3, ...>>, <<48, 130, 3, 115, 48, 130, 2, 91, 160, 3, 2, 1, 2, 2, 11, 0, 174, 207, 0, 186, 196, 207, 50, 248, 67, 178, 48, 13, 6, 9, 42, 134, 72, 134, 247, ...>>, <<48, 130, 2, 78, 48, 130, 1, 211, 160, 3, 2, 1, 2, 2, 10, 60, 246, 7, 169, 104, 112, 14, 218, 139, 132, 48, 10, 6, 8, 42, 134, 72, 206, 61, ...>>, <<48, 130, 3, 148, 48, 130, 2, 124, 160, 3, 2, 1, 2, 2, 10, 49, 245, 228, 98, 12, 108, 88, 237, 214, 216, 48, 13, 6, 9, 42, 134, 72, 134, ...>>, <<48, 130, 6, 91, 48, 130, 4, 67, 160, 3, 2, 1, 2, 2, 17, 0, 202, 233, 27, 137, 241, 85, 3, 13, 163, 230, 65, 109, 196, 227, 166, 225, ...>>, <<48, 130, 5, 90, 48, 130, 3, 66, 160, 3, 2, 1, 2, 2, 16, 79, 210, 43, 143, 245, 100, 200, 51, 158, 79, 52, 88, 102, 35, 112, 96, ...>>, <<48, 130, 5, 70, 48, 130, 3, 46, 160, 3, 2, 1, 2, 2, 16, 93, 223, 177, 218, 90, 163, 237, 93, 190, 90, 101, 32, 101, 3, 144, ...>>, <<48, 130, 2, 10, 48, 130, 1, 145, 160, 3, 2, 1, 2, 2, 16, 110, 71, 169, 200, 139, 148, 182, 232, 187, 59, 42, 216, 162, 178, ...>>, <<48, 130, 2, 12, 48, 130, 1, 145, 160, 3, 2, 1, 2, 2, 16, 110, 71, 169, 199, 108, 169, 115, 36, 64, 137, 15, 3, 85, ...>>, <<48, 130, 5, 90, 48, 130, 3, 66, 160, 3, 2, 1, 2, 2, 16, 110, 71, 169, 198, 90, 179, 231, 32, 197, 48, 154, 63, ...>>, <<48, 130, 5, 90, 48, 130, 3, 66, 160, 3, 2, 1, 2, 2, 16, 110, 71, 169, 197, 75, 71, 12, 13, 236, 51, 208, ...>>, <<48, 130, 2, 105, 48, 130, 1, 239, 160, 3, 2, 1, 2, 2, 16, 33, 42, 86, 12, 174, 218, 12, 171, 64, 69, ...>>, <<48, 130, 5, 131, 48, 130, 3, 107, 160, 3, 2, 1, 2, 2, 14, 69, 230, 187, 3, 131, 51, 195, 133, 101, ...>>, <<48, 130, 2, 148, 48, 130, 2, 26, 160, 3, 2, 1, 2, 2, 8, 44, 41, 156, 91, 22, 237, 5, 149, ...>>, <<48, 130, 5, 235, 48, 130, 3, 211, 160, 3, 2, 1, 2, 2, 8, 86, 182, 41, 205, 52, 188, 120, ...>>, <<48, 130, 2, 141, 48, 130, 2, 20, 160, 3, 2, 1, 2, 2, 8, 117, 230, 223, 203, 193, 104, ...>>, <<48, 130, 5, 221, 48, 130, 3, 197, 160, 3, 2, 1, 2, 2, 8, 123, 44, 155, 211, 22, ...>>, <<48, 130, 4, 32, 48, 130, 3, 8, 160, 3, 2, 1, 2, 2, 9, 0, 132, 130, 44, ...>>, <<48, 130, 6, 47, 48, 130, 4, 23, 160, 3, 2, 1, 2, 2, 8, 37, 161, 223, ...>>, <<48, 130, 4, 48, 48, 130, 3, 24, 160, 3, 2, 1, 2, 2, 9, 0, 218, ...>>, <<48, 130, 5, 136, 48, 130, 3, 112, 160, 3, 2, 1, 2, 2, 8, 125, ...>>, <<48, 130, 4, 99, 48, 130, 3, 75, 160, 3, 2, 1, 2, 2, 1, ...>>, <<48, 130, 5, 195, 48, 130, 3, 171, 160, 3, 2, 1, 2, 2, ...>>, <<48, 130, 1, 242, 48, 130, 1, 120, 160, 3, 2, 1, 2, ...>>, <<48, 130, 1, 182, 48, 130, 1, 91, 160, 3, 2, 1, ...>>, <<48, 130, 5, 65, 48, 130, 3, 41, 160, 3, 2, ...>>, <<48, 130, 3, 65, 48, 130, 2, 41, 160, 3, ...>>, <<48, 130, 5, 131, 48, 130, 3, 107, 160, ...>>, <<48, 130, 5, 107, 48, 130, 3, 83, ...>>, <<48, 130, 2, 195, 48, 130, 2, ...>>, <<48, 130, 6, 11, 48, 130, ...>>, <<48, 130, 5, 210, 48, ...>>, <<48, 130, 3, 114, ...>>, <<48, 130, 3, ...>>, <<48, 130, ...>>, <<48, ...>>, <<...>>, ...], partial_chain: #Function<0.99272334/1 in :hackney_ssl.check_hostname_opts/1>, verify_fun: {&:ssl_verify_hostname.verify_fun/3, [check_hostname: 's3.amazonaws.com']}]}})
    ** (EXIT) time out
    (stdlib 3.11) gen_server.erl:215: :gen_server.call/2
    (hackney 1.16.0) /var/remo3_prototype/deps/hackney/src/hackney_connection.erl:82: :hackney_connection.new_connection_id/4
    (hackney 1.16.0) /var/remo3_prototype/deps/hackney/src/hackney_connection.erl:34: :hackney_connection.new/1
    (hackney 1.16.0) /var/remo3_prototype/deps/hackney/src/hackney_pool.erl:64: :hackney_pool.checkout/4
    (hackney 1.16.0) /var/remo3_prototype/deps/hackney/src/hackney_connect.erl:214: :hackney_connect.socket_from_pool/4
    (hackney 1.16.0) /var/remo3_prototype/deps/hackney/src/hackney_connect.erl:44: :hackney_connect.connect/5
    (hackney 1.16.0) /var/remo3_prototype/deps/hackney/src/hackney.erl:333: :hackney.request/5
    (ex_aws 2.1.3) lib/ex_aws/request/hackney.ex:21: ExAws.Request.Hackney.request/5
Function: &:erlang.apply/2
    Args: [#Function<0.39970933/1 in ExAws.Operation.ExAws.S3.Download.download_to/3>, [%{end_byte: 3145727, start_byte: 2097152}]]

Here the download function:

  def download_part_file(path, object_key) do
    file_path = Path.join(path, split_get_last(object_key, "/"))
    Logger.info("Dowloading object: #{object_key} to: #{file_path}")

    ExAws.S3.download_file(Remo3.Constant.importer_bucket(), object_key, file_path)
    |> ExAws.request!()

    file_path
  end

Here Hackney/ExAws config

config :ex_aws, :hackney_opts,
  follow_redirect: true,
  recv_timeout: 30_000,
  timeout: 300_000,
  checkout_timeout: 50_000,
  max_connections: 500

I call(ed) download function in parallel like this:

  def download_s3_objects(dest_path, s3_keys) do
     Task.async_stream(
      s3_keys,
      fn s3_key -> Remo3.Aws.S3.download_part_file(dest_path, s3_key) end,
      maximum_concurrency: 3,
      timeout: @flow_timeout,
      ordered: false
    )
    |> Enum.to_list()
  end

and this function is also called in parallel like this:

  def process_account_partitions(hour, partitions) do
    Task.async_stream(
      partitions,
      fn partition -> process_account_partition(hour, partition) end,
      maximum_concurrency: @max_concurrency,
      timeout: @flow_timeout,
      ordered: false
    )
    |> Enum.to_list()
  end

My input to this logic is list of lists where first list is partitions and nested list is S3 objects per partition.

In total one node may get ~5000 S3 files to download. This issues happens not emidiatly but after some time when some of files have been dowloaded sucessfully.

Questions:

  • what this issue means?
  • can I retry it and how?
  • do I have correct hackney client config?

Action I did.

I tried to enbale Hackney trace log to get more details with:

:hackney_trace.enable(:max, '/home/ec2-user/data/ebs/elixir/remo3/logs/hackney.log')

but my app crashed on starts with:

[ec2-user@ip-xxx-xx-xx-180 remo3_prototype]$ _build/prod/rel/kona/bin/kona start
{"Kernel pid terminated",application_controller,"{application_start_failure,kona,{bad_return,{{'Elixir.Importer.App',start,[normal,[]]},{'EXIT',{undef,[{dbg,tracer,[process,{#Fun<hackney_trace.0.64280899>,{hackney,<0.1207.0>}}],[]},{hackney_trace,do_enable,3,[{file,\"/var/remo3_prototype/deps/hackney/src/hackney_trace.erl\"},{line,46}]},{'Elixir.Importer.App',start,2,[{file,\"lib/importer.ex\"},{line,14}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,277}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,kona,{bad_return,{{'Elixir.Importer.App',start,[normal,[]]},{'EXIT',{undef,[{dbg,tracer,[process,{#Fun<hackney_trace.0.642808

Crash dump is being written to: erl_crash.dump...done

Thank you for your time looking at this issue.