Heavy async HTTP outbound and %Mint.TransportError{}

Howdy! I’m trying to do a ton of concurrent outbound HTTP efficiently. I’m doing this:

    task = fn url ->
      Req.new(
        url: url,
        finch: MyFinch,
        max_redirects: 2,
        retry: false
      )
      |> Req.Request.put_header("connection", "close")
      |> Req.head()
      |> case do
        {:ok, response} ->
          response.status

        {:error, _exception} ->
          :error
      end
    end

    Task.Supervisor.async_stream_nolink(
      {:via, PartitionSupervisor, {MyApp.TaskSupervisors, self()}},
      urls,
      task,
      max_concurrency: 1000,
      on_timeout: :kill_task,
      ordered: false,
      timeout: 60_000
    )
    |> Enum.to_list()

At first, processing output looks good, but after ~5 seconds, I start ending up with a ton of:

    %Mint.TransportError{reason: :nxdomain}
    %Mint.TransportError{reason: :nxdomain}
    %Mint.TransportError{reason: :timeout}
    %Mint.TransportError{reason: :nxdomain}
    %Mint.TransportError{reason: :nxdomain}
    %Mint.TransportError{reason: :nxdomain}
    %Mint.TransportError{reason: :timeout}
    %Mint.TransportError{reason: :timeout}
    %Mint.TransportError{reason: :timeout}

Does anything bad leap out? The error seems to indicate Mint failing to connect…but why? At lower volume, or at the start of a big batch, everything works fine. It only goes sideways as the load cranks up and stays heavy.

My Finch instance looks like this:

    {Finch, name: MyFinch, pools: %{:default => [size: 400, count: 4, protocol: :http1]}}
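
Spelled out, my reading of those options goes something like this (the comments are my own interpretation of the Finch docs; :default covers any origin I haven't configured explicitly):

    {Finch,
     name: MyFinch,
     pools: %{
       # count: 4 pools x size: 400 connections = up to 1_600 connections
       # per origin host, and as far as I can tell each fresh connection
       # does its own DNS lookup when it connects
       :default => [size: 400, count: 4, protocol: :http1]
     }}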

Is the app containerized? The nxdomain error often manifests due to problems with the container’s network configuration.

Not containerized.

It’s an Elixir 1.14-rc0/OTP 25.0-rc3 mix release built locally on hexpm/elixir:1.14.0-rc.0-erlang-25.0-rc3-debian-stretch-20210902-slim and deployed on a modest Linux VM (4 GB RAM, 2.40 GHz CPU, Debian Stretch).

I have to cross-compile as I’m on an M2 MBP. That was the last hexpm/elixir Docker image for Stretch and, since this is proof-of-concept stuff at the moment, I reached for something off the shelf rather than fiddling with building my own Stretch image.

I’ve also run the code natively on my MBP via iex and observed the same behavior.

Not sure what to recommend; maybe your firewall / router / PiHole? I assume you have tried curl successfully?

I mean, the work VM definitely isn’t behind a PiHole. And both start out fine but buckle once the rate stays high for a while.

I’m starting to mull DNS lookup throttling. :thinking:

Even Cloudflare might treat thousands of lookups bunched up from a single IP as malicious.
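
Nothing wired up yet, but the rough shape I have in mind is to pre-resolve the distinct hosts under a much smaller concurrency cap before firing the HTTP batch. Purely a sketch; the module name, the cap of 50, and :inet.gethostbyname/1 as the warm-up call are my own choices, not anything Req or Finch provides:

    defmodule DnsWarmup do
      # Sketch only: resolve each distinct host ahead of the batch, throttled,
      # so the HTTP burst doesn't double as a DNS burst.
      def prewarm(urls) do
        urls
        |> Enum.map(fn url -> URI.parse(url).host end)
        |> Enum.reject(&is_nil/1)
        |> Enum.uniq()
        |> Task.async_stream(
          fn host -> :inet.gethostbyname(String.to_charlist(host)) end,
          max_concurrency: 50,
          ordered: false,
          timeout: 5_000,
          on_timeout: :kill_task
        )
        |> Enum.to_list()
      end
    end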

Yeah, this seems to be outside the BEAM.

Thread starvation + DNS maybe, like this thread talked about:

I’m seeing improvement with an inet configuration file tweaked to {lookup, [dns, native]}.
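
For anyone else landing here, this is roughly what that looks like: an erl_inetrc file the VM picks up via the ERL_INETRC environment variable or a -kernel inetrc flag in vm.args (the /etc path below is just an example):

    %% /etc/erl_inetrc -- example location
    %% Point the VM at it with ERL_INETRC=/etc/erl_inetrc, or in vm.args:
    %%   -kernel inetrc '"/etc/erl_inetrc"'
    %% Try the Erlang DNS client first, then fall back to the OS resolver.
    {lookup, [dns, native]}.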

Hmm, maybe it’s more on the BEAM’s internal timeslicing side rather than outside in the OS or DNS processing. Like, the VM doesn’t process the burst quickly enough, so a lot of the DNS resolutions get mangled.

I can run the same chunk of data through some comparable Rust code on the same host and it resolves everything fine. :thinking: