Strange EADDRNOTAVAIL errors when fetching external pages after dockerizing Phoenix app

I’m having a strange problem with Elixir combined with Docker, and I can’t find a similar instance of it anywhere else. I built a web app using Phoenix and dockerized it with docker-compose. It connects to a database and contacts various services on the public internet. Recently I restarted my server after updating, and now I’m seeing errors like these in the logs:

** (exit) an exception was raised:
    ** (Protocol.UndefinedError) protocol Enumerable not implemented for {:error, :eaddrnotavail} of type Tuple. This protocol is implemented for the following type(s): Ecto.Adapters.SQL.Stream, Postgrex.Stream, DBConnection.Stream, DBConnection.PrepareStream, Floki.HTMLTree, Function, Range,
, Stream, List, GenEvent.Stream, HashDict, IO.Stream, File.Stream, HashSet
        (elixir 1.10.3) lib/enum.ex:1: Enumerable.impl_for!/1
        (elixir 1.10.3) lib/enum.ex:141: Enumerable.reduce/3
        (elixir 1.10.3) lib/enum.ex:3383: Enum.map/2
        (to_booru 0.1.0) lib/to_booru.ex:58: ToBooru.extract_uploads/2
        (szurupull 0.1.0) lib/szurupull_web/controllers/upload_controller.ex:29: Szurupull.UploadController.extract/2
        (szurupull 0.1.0) lib/szurupull_web/controllers/upload_controller.ex:1: Szurupull.UploadController.action/2
        (szurupull 0.1.0) lib/szurupull_web/controllers/upload_controller.ex:1: Szurupull.UploadController.phoenix_controller_pipeline/2
        (phoenix 1.5.7) lib/phoenix/router.ex:352: Phoenix.Router.__call__/2

As it turns out, the Phoenix process in the container is no longer able to fetch any webpages from the external internet. Instead it gets back the error code :eaddrnotavail. However, it is still connected to the database container, and I can still visit the Phoenix app’s web page in my browser, so the connection between the Docker containers appears to be functioning properly.
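As an aside, the Protocol.UndefinedError in the trace is only a secondary symptom: the {:error, :eaddrnotavail} tuple returned by the failed fetch ends up being passed to Enum.map/2. A minimal sketch of guarding against that is shown here. The SafeExtract module and map_results/2 are hypothetical names for illustration and are not part of to_booru, and the {:ok, list} success shape is an assumption.

```elixir
defmodule SafeExtract do
  # Hypothetical helper (not part of to_booru): only enumerate the result
  # when the fetch succeeded, and pass an error tuple through unchanged.
  def map_results({:ok, items}, fun) when is_list(items) do
    {:ok, Enum.map(items, fun)}
  end

  def map_results({:error, reason}, _fun), do: {:error, reason}
end

# The network failure now surfaces as a plain error tuple instead of
# raising inside Enum.map/2.
SafeExtract.map_results({:error, :eaddrnotavail}, & &1)
```

This does not fix the underlying network problem; it only makes the failure easier to read in the logs.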

My application uses Tesla underneath to fetch webpages, so I attached to the Docker container and tried fetching a page from iex (by running /app/bin/my_app remote). I got the same :eaddrnotavail error. For some reason it always fails after almost exactly 16 seconds.

iex> Tesla.client([Tesla.Middleware.Logger], {Tesla.Adapter.Hackney, [recv_timeout: 30000]}) |> Tesla.get("https://www.youtube.com")
{:error, :eaddrnotavail}
11:49:46.210 [error] GET https://www.youtube.com -> error: :eaddrnotavail (16026.201 ms)

11:49:46.214 [debug]
>>> REQUEST >>>
(no query)
(no headers)
(no body)

<<< RESPONSE ERROR <<<
:eaddrnotavail

If I use httpc then it gives :econnrefused as an error code instead. This time it always fails after nearly 8 seconds.

iex> Tesla.client([Tesla.Middleware.Logger], {Tesla.Adapter.Httpc, [recv_timeout: 30000]}) |> Tesla.get("https://www.youtube.com")

11:50:36.891 [info]  [73, 110, 118, 97, 108, 105, 100, 32, 111, 112, 116, 105, 111, 110, 32, [123, ['recv_timeout', 44, '30000'], 125], 32, 105, 103, 110, 111, 114, 101, 100, 32, 10]

11:50:44.878 [error] GET https://www.youtube.com -> error: :econnrefused (8002.325 ms)
{:error, :econnrefused}

11:50:44.879 [debug]
>>> REQUEST >>>
(no query)
(no headers)
(no body)

<<< RESPONSE ERROR <<<
:econnrefused

It doesn’t work if I provide the IP address directly, either. The error becomes :econnrefused.

iex> Tesla.client([Tesla.Middleware.Logger], {Tesla.Adapter.Hackney, [recv_timeout: 30000]}) |> Tesla.get("https://127.217.14.206")
{:error, :econnrefused}
iex(szurupull@e5e30164a29a)2>
12:20:10.467 [error] GET https://127.217.14.206 -> error: :econnrefused (2.721 ms)

12:20:10.469 [debug]
>>> REQUEST >>>
(no query)
(no headers)
(no body)

<<< RESPONSE ERROR <<<
:econnrefused

Of course, this works if I use iex -S mix from my host machine outside the container:

iex(1)> Tesla.client([Tesla.Middleware.Logger], {Tesla.Adapter.Hackney, [recv_timeout: 30000]}) |> Tesla.get("www.youtube.com")

[warn] GET www.youtube.com -> 301 (48.126 ms)
[debug]
>>> REQUEST >>>
(no query)
(no headers)
(no body)

<<< RESPONSE <<<
content-type: application/binary
x-content-type-options: nosniff
cache-control: no-cache, no-store, max-age=0, must-revalidate
pragma: no-cache
expires: Mon, 01 Jan 1990 00:00:00 GMT
date: Sun, 27 Dec 2020 12:23:20 GMT
location: https://www.youtube.com/
x-frame-options: SAMEORIGIN
server: ESF
content-length: 0
x-xss-protection: 0


{:ok,
 %Tesla.Env{
   __client__: %Tesla.Client{
     adapter: {Tesla.Adapter.Hackney, :call, [[recv_timeout: 30000]]},
     fun: nil,
     post: [],
     pre: [{Tesla.Middleware.Logger, :call, [[]]}]
   },
   __module__: Tesla,
   body: "",
   headers: [
     {"content-type", "application/binary"},
     {"x-content-type-options", "nosniff"},
     {"cache-control", "no-cache, no-store, max-age=0, must-revalidate"},
     {"pragma", "no-cache"},
     {"expires", "Mon, 01 Jan 1990 00:00:00 GMT"},
     {"date", "Sun, 27 Dec 2020 12:23:20 GMT"},
     {"location", "https://www.youtube.com/"},
     {"x-frame-options", "SAMEORIGIN"},
     {"server", "ESF"},
     {"content-length", "0"},
     {"x-xss-protection", "0"}
   ],
   method: :get,
   opts: [],
   query: [],
   status: 301,
   url: "www.youtube.com"
 }}

But if I connect to the container from the host with docker-compose exec my_container sh and use curl, I am still able to retrieve the site normally. This makes me suspect this is an issue with the Elixir side somehow.

/app # curl -v "https://www.youtube.com"
...
> GET / HTTP/2
> Host: www.youtube.com
> User-Agent: curl/7.64.0
> Accept: */*
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [264 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
} [5 bytes data]
< HTTP/2 200
< content-type: text/html; charset=utf-8
< x-content-type-options: nosniff
< cache-control: no-cache, no-store, max-age=0, must-revalidate
< pragma: no-cache
< expires: Mon, 01 Jan 1990 00:00:00 GMT
< date: Sun, 27 Dec 2020 11:47:34 GMT
< x-frame-options: SAMEORIGIN
< strict-transport-security: max-age=31536000
< p3p: CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=en for more info."
< server: ESF
< x-xss-protection: 0
< set-cookie: YSC=WWanzXxZmHI; Domain=.youtube.com; Path=/; Secure; HttpOnly; SameSite=none
< set-cookie: VISITOR_INFO1_LIVE=kFXTUXRePU4; Domain=.youtube.com; Expires=Fri, 25-Jun-2021 11:47:34 GMT; Path=/; Secure; HttpOnly; SameSite=none
< alt-svc: h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
< accept-ranges: none
< vary: Accept-Encoding
...

And some sites do not give any errors at all from the Elixir side, like example.com. But the ones that are important for my service to run all return errors. I’m not sure what the differentiating factor is between the sites that work and the ones that don’t.

iex> Tesla.client([Tesla.Middleware.Logger], {Tesla.Adapter.Hackney, [recv_timeout: 30000]}) |> Tesla.get("https://example.com")

11:54:36.278 [info]  GET https://example.com -> 200 (8045.135 ms)

11:54:36.278 [debug]
>>> REQUEST >>>
(no query)
(no headers)
(no body)

<<< RESPONSE <<<
age: 252244
cache-control: max-age=604800
content-type: text/html; charset=UTF-8
date: Sun, 27 Dec 2020 11:54:36 GMT
etag: "3147526947+ident"
expires: Sun, 03 Jan 2021 11:54:36 GMT
last-modified: Thu, 17 Oct 2019 07:18:26 GMT
server: ECS (sec/96EE)
vary: Accept-Encoding
x-cache: HIT
content-length: 1256

Nothing changes even after I stop/remove/rebuild/restart the container with docker-compose, or restart the docker daemon.

I should also mention that I use an internal DNS server that forwards to 8.8.8.8, but I’m still able to retrieve sites on it using curl from within the container.

Here are the relevant parts of docker-compose.yml:

version: '3.3'

services:
    szurupull:
      build:
        context: /home/ruin/build/work/szurupull
      ports:
        - 4000:4000
      networks:
        - misaka
      depends_on:
        - szurupull_db
      environment:
        - SECRET_KEY_BASE=${SECRET_KEY_BASE}
        - DATABASE_HOST=szurupull_db
        - DATABASE_URL=ecto://postgres:postgres@szurupull_db/postgres
        - VIRTUAL_HOST=<...>
        - VIRTUAL_PORT=4000
        - LETSENCRYPT_HOST=<...>
        - UID=1000
        - GID=1000

    szurupull_db:
      image: postgres:9.6
      volumes:
        - "/mnt/hibiki/config/szurupull/sql:/var/lib/postgresql/data"
      networks:
        - misaka
      environment:
        - POSTGRES_DB=postgres
        
networks:
    misaka:
        external: true

I run the app in release mode after compiling it with mix compile and mix release. (I followed this guide.) Here is the Dockerfile:

FROM elixir:1.10.3-alpine as build

# install build dependencies
RUN apk add --update git build-base nodejs npm yarn python

RUN mkdir /app
WORKDIR /app

# install Hex + Rebar
RUN mix do local.hex --force, local.rebar --force

# set build ENV
ENV MIX_ENV=prod

# install mix dependencies
COPY mix.exs mix.lock ./
COPY config config
RUN mix deps.get --only $MIX_ENV
RUN mix deps.compile

# build assets
COPY assets assets
RUN cd assets && npm install && npm run deploy
RUN mix phx.digest

# build project
COPY priv priv
COPY lib lib
RUN mix compile

# build release
# at this point we should copy the rel directory but
# we are not using it so we can omit it
# COPY rel rel
RUN mix release

# prepare release image
FROM alpine:3.9 AS app

# install runtime dependencies
RUN apk add --update bash openssl postgresql-client curl

EXPOSE 4000
ENV MIX_ENV=prod

# prepare app directory
RUN mkdir /app
WORKDIR /app

# copy release to app container
COPY --from=build /app/_build/prod/rel/szurupull .
COPY entrypoint.sh .
RUN chown -R nobody: /app
USER nobody

ENV HOME=/app
CMD ["bash", "/app/entrypoint.sh"]

And entrypoint.sh:

#!/bin/bash
# docker entrypoint script.

# assign a default for the database_user
DB_USER=${DATABASE_USER:-postgres}

# wait until Postgres is ready
while ! pg_isready -q -h $DATABASE_HOST -p 5432 -U $DB_USER
do
  echo "$(date) - waiting for database to start"
  sleep 2
done

bin="/app/bin/szurupull"
eval "$bin eval \"Szurupull.Release.migrate\""
# start the elixir application
exec "$bin" "start"

I tried changing the url: option that the app listens on in config.exs to {127, 0, 0, 1}, but it didn’t change anything. I also haven’t changed any of the application code since restarting the server.

config :szurupull, SzurupullWeb.Endpoint,
  url: [host: "localhost"]

Here is the result of running a few debugging commands from within the container.

/app # ulimit -a
-f: file size (blocks)             unlimited
-t: cpu time (seconds)             unlimited
-d: data seg size (kb)             unlimited
-s: stack size (kb)                8192
-c: core file size (blocks)        unlimited
-m: resident set size (kb)         unlimited
-l: locked memory (kb)             64
-p: processes                      unlimited
-n: file descriptors               1048576
-v: address space (kb)             unlimited
-w: locks                          unlimited
-e: scheduling priority            0
-r: real-time priority             0

/app # netstat -an | grep -e tcp -e udp | wc -l
27

/app # netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.11:34115        0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:40591           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:46545           0.0.0.0:*               LISTEN      216/beam.smp
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      -
tcp        0      0 :::4000                 :::*                    LISTEN      -
tcp        0      0 :::4369                 :::*                    LISTEN      -
udp        0      0 127.0.0.11:57260        0.0.0.0:*                           -

Is there some sort of Elixir or Docker configuration I need so the connections can succeed again?

I tore my hair out over this for ten hours, and never figured out why this happened. It went down to the level of :inet.getaddr returning an :nxdomain error at the Erlang layer. There seemed to be some DNS issues with the version of Alpine I used in the container (3.9) but no amount of upgrading containers or Elixir dependencies changed anything.

However, I later discovered that :inet_res.nslookup, using Erlang’s built-in DNS client, was succeeding in obtaining external IPs.
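For anyone wanting to reproduce this comparison, the two resolution paths can be checked side by side from a remote iex session. This is only a sketch: the ResolverCheck module and its function names are made up for illustration, while :inet.getaddr/2 and :inet_res.getbyname/2 are the standard Erlang calls.

```elixir
defmodule ResolverCheck do
  # Hypothetical diagnostic helper for use in a remote iex session.

  # Resolves through the lookup methods configured for :inet
  # (the operating system's resolver when the method is :native).
  def via_inet(host) when is_binary(host) do
    :inet.getaddr(String.to_charlist(host), :inet)
  end

  # Resolves with Erlang's built-in DNS client, which queries the
  # configured nameservers directly and bypasses the OS resolver.
  def via_inet_res(host) when is_binary(host) do
    :inet_res.getbyname(String.to_charlist(host), :a)
  end

  # Runs both lookups so the results can be compared at a glance.
  def compare(host) do
    %{inet: via_inet(host), inet_res: via_inet_res(host)}
  end
end
```

Calling ResolverCheck.compare("www.youtube.com") inside the affected container should show whether only the :inet path is failing, which is the situation described above.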

Eventually I found a workaround: by configuring :inet to use :dns lookup instead of :native lookup, everything works. There’s a small delay the first time :inet_res fetches the IP address, but after that it works as before.

First I created an erl_inetrc file and added it to the repo:

%% -- ERLANG INET CONFIGURATION FILE --
%% read the hosts file
{file, hosts, "/etc/hosts"}.
%% read and monitor nameserver config from here
{resolv_conf, "/etc/resolv.conf"}.
%% specify lookup method
{lookup, [dns, native]}.

The important part I added was {lookup, [dns, native]}. That’s what causes the Erlang runtime to use :inet_res for address lookups.

Next I set the environment of the container to use this erl_inetrc file as the config for :inet.

ENV ERL_INETRC=/app/erl_inetrc

After that I cleared and rebuilt the container.
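To confirm that the rebuilt container actually picked up the file, the current inet configuration can be inspected from a remote iex session with :inet.get_rc/0. The helper below is a rough sketch: InetrcCheck is a hypothetical name, and it assumes the lookup setting is reported as a {:lookup, methods} tuple in that list.

```elixir
defmodule InetrcCheck do
  # Hypothetical helper: pulls the lookup methods out of a configuration
  # list shaped like the return value of :inet.get_rc/0.
  # Assumption: the entry, when present, is a {:lookup, methods} tuple.
  def lookup_methods(rc \\ :inet.get_rc()) do
    case List.keyfind(rc, :lookup, 0) do
      {:lookup, methods} -> {:ok, methods}
      nil -> :not_reported
    end
  end
end

# With the erl_inetrc above loaded, the hope is to see :dns listed first.
InetrcCheck.lookup_methods()
```

System.get_env("ERL_INETRC") in the same session shows whether the environment variable reached the release at all.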

I really hope I never have to deal with this again.

That indeed does not sound fun to deal with! Thanks for reporting back on the issue; there’s a good chance it will help someone else avoid the same pain in the future. Welcome to the forum, and I hope you won’t encounter many more difficult problems like this :blush:

@Ruin0x11 you just saved my day!
I had the same problem since the beginning of June 2021. I couldn’t figure out what was happening: I was getting random request failures from different libraries, with only :nxdomain as the message. By different libraries, I mean some were using :hackney directly and some were using Tesla, while I was using :httpc directly. The only error message was something like

"level": "error", "message":":nxdomain"}
"level":"error","message":"RESPONSE: {:error, {:failed_connect, [{:to_address, {'some_domain_name', 443}}, {:inet, [:inet], :nxdomain}]}}"}

I tried the above fix and now everything is working.

Just to add a bit more information: I thought I would have to mess with vm.args.eex, but I did not have to.
I am running inside Docker; the only thing I did was add these two lines to my final image:

COPY ./erl_inetrc ./
ENV ERL_INETRC=/erl_inetrc

mix release was “smart” enough to detect the changes.

After 10 days, I was really getting worried.