Debugging a remote node - how was it and what have I learned

marcin · May 7, 2021, 3:39pm

This post is intended as a guide for others. I was running a remote debugger for the first time, and I would appreciate feedback on how I could have made this easier!

I’m aware of another post with a tutorial for remote debugging: Elixir Remote Debugging
What helped me especially is a comment from @m31271n that contains crucial information not mentioned anywhere else!

I had a weird bug that happened only in production and staging, and only when a user session existed, although this particular API request did not need authentication.
One of the GraphQL mutations, when it should have returned a validation error, returned an HTTP 500 error instead. It happened only when a session was set (not when signed out or in the browser's private mode) and only on production and staging (staging and production run identically, yay!).

I decided to use the debugger and the distribution functionality to see what’s up.
There are quite a lot of resources out there showing how to run Erlang debugging on production, so I was feeling quite assured, but soon I fell into a 12-hour rabbit hole trying to get it right.

So what are the pitfalls?

Connecting to a remote node via an SSH tunnel can fail, and I didn’t know why. Node.connect :'app@domain.com' would just return false and that’s it. I tried to follow the instructions carefully, but after failing again and again I started wondering which obvious things I was missing that the tutorials did not find worthwhile to mention - maybe I should run local iex with MIX_ENV=prod? (Actually not.) In the end:

I made sure the remote app is run with -name app@domain.com and not -sname app.
I used the real domain of the remote host, so -name app@realdomain.com,
but pointed it to 127.0.0.1 on the local machine in /etc/hosts.
I ran a tunnel to forward both the epmd port and the app's distribution port, and made a script so I didn’t have to repeat this procedure every time (attached below).
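
A minimal sketch of how the connection attempt could be made less opaque, assuming the tunnel is already running and local iex was started with --name and the right cookie. The module name ConnectSketch and the node name in the usage comment are hypothetical; Node.connect/1 itself returns true, false, or :ignored.

```elixir
defmodule ConnectSketch do
  # Translate the three possible results of Node.connect/1 into a hint.
  def explain(true), do: "connected"

  def explain(false),
    do: "connection failed: check -name vs -sname, the cookie, /etc/hosts and the tunnel"

  def explain(:ignored),
    do: "local node is not alive: start iex with --name me@<domain>"

  # Try to connect and return both the raw result and the hint.
  def connect(target) when is_atom(target) do
    result = Node.connect(target)
    {result, explain(result)}
  end
end

# Usage (hypothetical node name):
# ConnectSketch.connect(:"app@realdomain.com")
# Node.list() shows the remote node once the connection succeeded.
```

The :ignored case is easy to miss: it means the local shell was started without a node name, so nothing was even attempted.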

After I made the connection, using :debugger seemed like a simple 3-step process. However, it did not fire breakpoints at all on the remote node (it did on the local node, though). Again, I started wondering about possible causes. Should the BEAM VMs on the local and remote hosts have exactly the same version? (I did upgrade/downgrade them to match, but that was probably not necessary and did not help.) Do I need to have the :int module on the remote node too, even though the :int.ni() documentation tells me that it will instrument a module on all nodes? (Yes, I actually needed the whole :debugger app.)
Failing to use the debugger, I thought maybe I could use the :dbg module. It turned out to be very cryptic, and I could not make it trace my remote node. I found out that there is a GUI app called Erlyberly which makes tracing simpler, but I could not make it connect to the node - and found an issue stating that connecting over SSH will not work.
Finally I tried recon, because it is advertised as safe, and it DID show traces from the remote node, but it was way too erlangy for me to use. I tried the recon_ex package, but it failed to install (a conflict between the recon version it required and the one required by the rabbitmq client library). I gave up.
I watched the whole video on tracing by Gabi Zuniga and I would love to try his tracer library, but I didn’t. Instead I figured out that I needed to add debugging symbols to my remote node.
Adding debugging symbols to a remote node (I use the releases mechanism to deploy) is not easy, because of all the conflicting information out there:

Some say to run mix compile.elixir --debug-info. In my scripts I only had mix compile, so I added mix compile --debug-info. Did not work.
I also tried to put mix compile.elixir --debug-info AFTER mix compile, as a workaround, because I did not want to split mix compile into the 6 or so commands it runs. It did not work either.
Then I found out it is possible to add the debug_info: true option to the elixirc_options: key in mix.exs. It wasn’t easy, because the resources just say “add this to mix.exs” but do not say where. Finally I figured it out - it needs to go into the keyword list returned by the project function, alongside :app, :version and so on. Having done that, I discovered that it does create bigger beam files with debug info, but it did not work on the remote node - my beam files there seemed stripped.
By the way - how to tell whether debug info is in the module? Run ModuleFoo.module_info() and check that the value under the :compile key is not an empty list.
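
A more direct check is to read the debug info chunk of the .beam file with :beam_lib.chunks/2. The sketch below is only an illustration: the module name DebugInfoCheck is hypothetical, and the shapes it treats as "no debug info" (:no_debug_info, and a :none payload) are assumptions based on how the chunk is documented, so verify them against your OTP version.

```elixir
defmodule DebugInfoCheck do
  # Returns true when the .beam file of `module` seems to carry debug info.
  def debug_info?(module) when is_atom(module) do
    with path when is_list(path) <- :code.which(module),
         {:ok, {^module, [debug_info: info]}} <- :beam_lib.chunks(path, [:debug_info]) do
      usable?(info)
    else
      # Module not found on disk, preloaded, or chunk unreadable.
      _ -> false
    end
  end

  # Chunk missing entirely (for example, a stripped beam file).
  defp usable?(:no_debug_info), do: false
  # Chunk present but empty (assumed shapes for code compiled without debug info).
  defp usable?({:debug_info_v1, _backend, :none}), do: false
  defp usable?({:debug_info_v1, _backend, {:none, _opts}}), do: false
  # Anything else is treated as real debug info.
  defp usable?({:debug_info_v1, _backend, _data}), do: true
  defp usable?(_other), do: false
end

# Usage: DebugInfoCheck.debug_info?(ModuleFoo)
```

To check the remote node, the function has to run there, for example from a remote shell attached to the release.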
Only then I found the comment by @m31271n, to whom I owe beers:
1. Add strip_beams: false (again, where to??) - this I figured out: in mix.exs, in the project function, under the :releases key, then under the release name (:appname), next to steps: [:assemble, :tar].
2. Add :debugger to the :extra_applications list.
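
Put together, the relevant parts of mix.exs might look roughly like the sketch below. The application name :my_app, the module name MyApp.MixProject and the version are placeholders, not the real project. The compiler option is written as debug_info: true, which is the name the Elixir compiler options use.

```elixir
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      # Keep debug info in the compiled beam files.
      elixirc_options: [debug_info: true],
      releases: [
        # The key is the release name, here assumed to equal the app name.
        my_app: [
          # Do not strip debug chunks when assembling the release.
          strip_beams: false,
          steps: [:assemble, :tar]
        ]
      ]
    ]
  end

  def application do
    [
      # :debugger must be part of the release so :int is available remotely.
      extra_applications: [:logger, :debugger]
    ]
  end
end
```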

The debugger worked! I could break and attach to processes, and step, and see variables. This was awesome, although still a bit hard:

My laptop fans started flying away - no idea why running the debugger, and not the app, generated such a workload.
The debugger UI is also very slow (maybe related). You click something, then wait seconds for feedback.
I was debugging a GraphQL query, which would be killed after 15 seconds (timeout), so I had very little time to look into what was happening. Perhaps I could have made the timeout longer.
There is no way to show a stack trace (?). I knew what my Absinthe resolver module was called, but the error seemed to happen after the {:error, .. } tuple was returned to Absinthe. I wanted to break in a module higher in the call stack, but I did not know what the call stack was. I had to use the Absinthe docs and source code to figure out what to try.
It is super hard to investigate the Absinthe.Resolution structure passed around in controllers - it’s very big and all in Erlang syntax. I fired up an erl shell and tried to copy-paste the variable value from the debugger into it. It took very long, and then I discovered that the string value was not complete - it ended in the middle of the Resolution struct… The only thing I could do was text-search in it in Vim, but that did not help me understand anything.
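
For reference, the three debugger steps mentioned earlier can be sketched as below, assuming the local node is already connected to the remote one and has a working GUI. The module name DebuggerSketch, the resolver module and the line number are hypothetical; :debugger.start/0, :int.ni/1 and :int.break/2 are the OTP calls involved.

```elixir
defmodule DebuggerSketch do
  # Start the debugger, interpret `module` and set a breakpoint at `line`.
  def attach(module, line) when is_atom(module) and is_integer(line) do
    # 1. Open the debugger monitor window on the local machine.
    _ = :debugger.start()

    # 2. Interpret the module on this node and on all connected nodes.
    #    This needs debug info in the beam file and :int on the remote node.
    with {:module, ^module} <- :int.ni(module) do
      # 3. Set a breakpoint on the given source line.
      :int.break(module, line)
    end
  end
end

# Usage (hypothetical module and line):
# DebuggerSketch.attach(MyApp.Resolvers.Example, 42)
```

If :int.ni/1 returns :error, the module could not be interpreted, which is what missing debug info or stripped beams tend to look like.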

I repeated the stepping process and noted when the stepper just stopped and the process was marked as killed - I assumed an exception was thrown there. In the end, after around 30 runs, I nailed down the line where the stepper would stop:

%{
  ...
  user: get_in(resolution.context, [:user, :email]),
  ...
}

This was in a custom middleware that sends errors to Sentry when there are any API problems (even validation errors), and it would add the current user’s email to the metadata. It made sense for it to be the culprit - it runs after the resolver, it runs only with Sentry enabled (not enabled on my dev machine), and only when there is a user. But I still did not know why. This line looks perfectly fine.
This block of code was wrapped in a try block, so the error was not logged in the app but sent to Sentry. I checked again and indeed, there was an issue for this exception in Sentry, but it had been wrongly grouped with a different problem, and I did not see it.
It would say:

(UndefinedFunctionError function App.Users.User.fetch/2 is undefined (App.Users.User does not implement the Access behaviour))

There you have it! get_in does not work with structs, because structs do not implement the Access behaviour by default! When resolution.context.user was nil, the result simply defaulted to nil, but when it was a struct, get_in couldn’t access the email field! I did not know that! But the mystery was solved.
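
The failure is easy to reproduce in isolation. The sketch below uses a hypothetical DemoUser struct in place of App.Users.User, and shows one possible way to read the email safely with pattern matching; other fixes exist, this is just an illustration.

```elixir
defmodule DemoUser do
  defstruct [:email]
end

defmodule EmailLookup do
  # Matches maps and structs alike, so a user struct works here.
  def user_email(%{user: %{email: email}}), do: email
  # No user, nil user, or no email field: fall back to nil.
  def user_email(_context), do: nil
end

context = %{user: %DemoUser{email: "a@example.com"}}

# This raises UndefinedFunctionError, because DemoUser has no fetch/2:
# get_in(context, [:user, :email])

EmailLookup.user_email(context)
EmailLookup.user_email(%{user: nil})
```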

I’m really interested in whether I could have made my life easier here, apart from noticing the error in Sentry in the first place. Do you have some tips?
Maybe a tracing approach would have revealed the error more easily? Or not really? I guess it would not point me to the place where the exception happens?
I still have not tried the tracer framework from Gabi - it looks very easy to use and powerful! Perhaps another time.

Marcin

Here’s the epmd-tunnel bash script. Run it as ./epmd-tunnel server1 app2 and follow the instructions:

#!/bin/bash

if [ $# -lt 1 ]; then
  echo "usage: $0 host [app_name]"
  exit 1
fi

# With only a host given, list the node names registered in the remote epmd.
if [ $# -lt 2 ]; then
  ssh "$1" epmd -names
  echo "now run $0 $1 app_name"
  echo "make sure that remote node runs with -name (not -sname)"
  echo "and the name is with full domain of the host."
  exit 0
fi

HOST=$1
APP=$2

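# epmd prints lines like "name app at port 12345"; the port is the 5th field.
PORT=$(ssh "$HOST" epmd -names | grep "name $APP at port" | cut -f 5 -d ' ')
if [ -z "$PORT" ]; then echo "Can't figure out port for $APP"; exit 1; fi
echo "$APP distributed port is $PORT"

# Grab the remote BEAM command line to extract the cookie and the domain.
BEAM=$(ssh "$HOST" ps awux | grep -- "beam.*-name $APP")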

# echo $BEAM

COOKIE_ARGS=$(echo $BEAM | sed -n 's/.*-setcookie \([^ ]*\).*/--cookie \1/p')
DOMAIN=$(echo $BEAM | sed -n 's/.*-name [^@]*@\([^ ]*\).*/\1/p')

if [ -z "$COOKIE_ARGS" ]; then echo "Can't figure out cookie for $APP"; exit 1; fi

echo 1. Put into /etc/hosts:
echo 127.0.0.1 $DOMAIN
echo
echo 2. In another terminal run:
echo iex $COOKIE_ARGS  --name me@$DOMAIN -S mix run --no-start
echo
echo 3. Paste:
echo Node.connect ":'$APP@$DOMAIN'"
echo

echo Running tunnel now....
# Forward epmd (4369) and the node's distribution port to the remote host.
ssh -L 4369:localhost:4369 -L "$PORT:localhost:$PORT" "$HOST"

Sebb · May 7, 2021, 6:50pm

That was painful to read
Thank god there is a happy end!
I have nothing to contribute, just my compassion.

I have no Elixir code in production yet, but I think I’ve learned from your report that I should practice remote debugging. I had a look into debugging and tracing some time ago. The debugger does not work for me, it’s just way too slow.
Tracing with :dbg seems nice, but I hear it’s dangerous if you don’t know what you’re doing (like me).
All Elixir tracing libraries seemed abandoned at the time I looked, but I just saw that rexbug merged a PR fixing a bug that prevented it from running with Elixir 1.11.

lukaszsamson · July 11, 2024, 8:22pm

Support for remote debugging will be added in the next version of ElixirLS. See elixir-ls/README.md at master · elixir-lsp/elixir-ls · GitHub