This post is intended as a guide for others. I was running a remote debugger for the first time and would appreciate feedback on how I could have made this easier!
I’m aware of another post with a tutorial for remote debugging: Elixir Remote Debugging
What especially helped me was a comment from @m31271n that contains crucial information not mentioned anywhere else!
I had a weird bug that happened only in production and staging, and only when a user session existed, although this particular API request did not need authentication.
One of the GraphQL mutations, when returning a validation error, would return an HTTP 500 error instead of the validation error. This happened only when a session was set (not when signed out or in the browser’s private mode) and only on production and staging (which run identical setups, yay!).
I decided to try the debugger and the distribution functionality to see what was up.
There are quite a lot of resources out there showing how to run Erlang debugging on production, so I was feeling quite assured, but soon I fell into a 12-hour rabbit hole trying to get it right.
So what are the pitfalls?
- Connecting to a remote node via an SSH tunnel can fail, and I didn’t know why. `Node.connect :'app@domain.com'` would just return false and that’s it. I tried to follow the instructions carefully, but failing again and again I started wondering about obvious things I was missing that the tutorials did not find worthwhile to mention. Maybe I should run local iex with `MIX_ENV=prod`? (Actually not.) In the end:
  - I made sure the remote app is run with `-name app@domain.com` and not `-sname app`.
  - I used the real domain of the remote host, so `-name app@realdomain.com`, but pointed it to 127.0.0.1 on the local machine in /etc/hosts.
  - I ran a tunnel to forward both the epmd port and the app port, and made a script so I didn’t have to repeat this procedure every time (attached below).
- After I made the connection, using `:debugger` seemed like a simple 3-step process. However, it did not fire breakpoints at all on the remote node (it did on the local node, though). Again, I started wondering about possible causes. Should the BEAM VMs on the local and remote host have the exact same version? (I did upgrade/downgrade them to be the same, but that was probably not necessary and did not help.) Do I need to have the `:int` module on the remote node as well, even if the `:int.ni()` documentation tells me that it will instrument a module on all nodes? (Yes, I actually needed the whole `:debugger` app.)
- Failing to use the debugger, I thought maybe I could use the `:dbg` module. It turned out to be very cryptic, and I could not make it trace my remote node. I found out that there is a GUI app called Erlyberly which makes tracing simpler, but I could not make it connect to the node, and I found an issue stating that connecting over SSH will not work.
- Finally I tried recon, because it is advertised as safe, and it DID show traces of the remote node, but it was way too erlangy for me to use. I tried the recon_ex package, but it failed to install (a conflict between the recon version it required and the one required by the rabbitmq client library). I gave up.
- I watched the whole video on tracing by Gabi Zuniga and I would love to try his `tracer` library, but I didn’t. Instead I figured out that I needed to add debugging symbols to my remote node.
- Adding debugging symbols to a remote node (I use the releases mechanism to deploy) is not easy, because of all the different information out there:
  - Some tell you to run `mix compile.elixir --debug-info`. In my scripts I only had `mix compile`, so I added `mix compile --debug-info`. Did not work.
  - I also tried to put `mix compile.elixir --debug-info` AFTER `mix compile`, as a workaround. I did not want to split `mix compile` into the 6 or so commands it runs. And it did not work.
  - Then I found out it is possible to add the `debug_info: true` option to the `elixirc_options:` key in `mix.exs`, under the `project` keyword list. It wasn’t easy because the resources just say “add this to mix.exs”, but do not say where to add it. Finally I figured it out: it needs to be added to the project keyword list, alongside `:app`, `:version` and so on. Having done that, I discovered it does create bigger beam files with debug info, but it did not work on the remote node, as my beam files seemed stripped.
  - By the way, how to tell that debug info is in the module? Run `ModuleFoo.module_info` and see if the `:compile` key value is not an empty list.
  - Only then I found the comment by @m31271n, whom I owe beers:
    - Add `strip_beams: false` (again, where to??). This one I figured out: in mix.exs, in the `project` function, under the keys `:releases`, then `:appname`, next to `steps: [:assemble, :tar]`.
    - Add `:debugger` to the `:extra_applications` list.
- The debugger worked! I could break and attach to processes, step, and see variables. This was awesome, although still a bit hard:
  - My laptop fans started flying away. No idea why running the debugger, and not the app, generated such a workload.
  - The debugger UI is also very slow (maybe related). You click something, then wait seconds for feedback.
  - I was debugging a GraphQL query, which would be killed after 15 seconds (timeout), so I had very little time to look into what was happening. Perhaps I could make the timeout longer.
  - There is no way to show a stack trace (?). I knew what my Absinthe resolver module was called, but the error seemed to happen after the `{:error, .. }` tuple is returned to Absinthe. I wanted to break on a module higher in the call stack, but I did not know what the call stack was. I had to use the Absinthe docs and source code to figure out what to try.
  - It is super hard to investigate the `Absinthe.Resolution` structure passed around in controllers: it’s very big and all in Erlang. I fired up an `erl` REPL and tried to copy-paste the variable value from the debugger into it. It took very long, and then I discovered the string value is not complete: it ends in the middle of the `Resolution` struct… The only thing I could do was text-search it in Vim, but that did not help me understand anything.
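For anyone wondering where these options go, here is a minimal sketch of the relevant parts of mix.exs. The module name, the `:my_app` app and release name, and the version are placeholders, not taken from my project. The compiler option is spelled `debug_info` here, which is the name the Elixir compiler options documentation uses. Treat it as a starting point under those assumptions rather than a verified production config.

```elixir
# mix.exs (sketch; MyApp / :my_app / "0.1.0" are hypothetical placeholders)
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      # Keep debug info in the compiled .beam files.
      elixirc_options: [debug_info: true],
      releases: [
        # The key is the release name, usually the same as the app name.
        my_app: [
          # Do not strip debug chunks from beams when assembling the release.
          strip_beams: false,
          steps: [:assemble, :tar]
        ]
      ]
    ]
  end

  def application do
    [
      # Ship the :debugger application (which contains :int) with the release.
      extra_applications: [:logger, :debugger]
    ]
  end
end
```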
I repeated the stepping process and noted where the stepper just stops and the process is marked as killed. I assumed an exception was thrown there. In the end, after around 30 runs, I nailed down the line where the stepper would stop:
%{
...
user: get_in(resolution.context, [:user, :email]),
...
}
This was in a custom middleware that sends errors to Sentry when there are any API problems (even validation errors), and it would add the current user’s email to the metadata. It made sense for it to be the culprit: it runs after the resolver, it runs only with Sentry enabled (not enabled on my dev machine), and only when there is a user. But I still did not know why. This line looks perfectly fine.
This block of code was wrapped in a `try` block, so the error was not logged in the app but sent to Sentry. I checked again and indeed, there was an issue with this exception in Sentry, but it was wrongly assigned to a different problem, and I had not seen it.
It would say:
(UndefinedFunctionError function App.Users.User.fetch/2 is undefined (App.Users.User does not implement the Access behaviour))
There you have it! `get_in` with plain atom keys does not work with structs, because structs do not implement the Access behaviour! When resolution.context.user was nil, it defaulted to nil, but when it was a struct, it couldn’t access the `email` field! I did not know that! But the mystery was solved.
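Here is a small sketch that reproduces the behaviour outside of the app. `Demo.User` is a hypothetical stand-in for `App.Users.User`, and the pattern-matching variant is just one possible nil-safe alternative, not necessarily the fix that went into the middleware.

```elixir
defmodule Demo.User do
  # Hypothetical stand-in for App.Users.User.
  defstruct [:email]
end

defmodule Demo do
  # Same shape as the middleware line: fine for nil, raises for a struct.
  def email_via_get_in(context), do: get_in(context, [:user, :email])

  # Pattern matching works for both maps and structs, and falls back to nil.
  def email_via_match(%{user: %{email: email}}), do: email
  def email_via_match(_context), do: nil

  def run do
    signed_out = %{user: nil}
    signed_in = %{user: %Demo.User{email: "someone@example.com"}}

    IO.inspect(email_via_get_in(signed_out), label: "get_in, signed out")

    try do
      email_via_get_in(signed_in)
    rescue
      e in UndefinedFunctionError ->
        IO.puts("get_in, signed in: " <> Exception.message(e))
    end

    IO.inspect(email_via_match(signed_out), label: "match, signed out")
    IO.inspect(email_via_match(signed_in), label: "match, signed in")
  end
end

Demo.run()
```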
Really interested if I could have made my life easier here, other than noticing the error in Sentry in the first place. Do you have some tips?
Maybe a tracing approach would have revealed the error more easily? Or not really? I guess it would not point me to the place where the exception happens?
I still have not tried the `tracer` library from Gabi. It looks very easy to use and powerful! Perhaps another time.
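On the tracing question, one thing that might be worth trying is the `exception_trace` match spec action of `:dbg`, which reports when a traced function exits with an exception, even if a caller rescues it later. Below is a local-only sketch under heavy assumptions: `TraceDemo`, `TraceDemo.User`, `report/1` and `email/1` are hypothetical stand-ins for the middleware, and I have not tried this against a remote node.

```elixir
defmodule TraceDemo.User do
  # Hypothetical stand-in for App.Users.User.
  defstruct [:email]
end

defmodule TraceDemo do
  def email(context), do: get_in(context, [:user, :email])

  # Stand-in for the middleware: the exception is swallowed by try/rescue.
  def report(context) do
    try do
      email(context)
    rescue
      _ -> :swallowed
    end
  end

  # Runs fun with a :dbg tracer attached to the calling process and returns
  # the first exception_from trace event for TraceDemo.email/1, or nil.
  def first_exception(fun) do
    parent = self()

    # Forward every trace message to the calling process.
    handler = fn msg, acc ->
      send(parent, {:trace_event, msg})
      acc
    end

    {:ok, _tracer} = :dbg.tracer(:process, {handler, nil})
    {:ok, _} = :dbg.p(self(), :c)
    {:ok, _} = :dbg.tpl(TraceDemo, :email, 1, [{:_, [], [{:exception_trace}]}])

    fun.()

    event =
      receive do
        {:trace_event, {:trace, _pid, :exception_from, _mfa, _reason} = ev} -> ev
      after
        1_000 -> nil
      end

    :dbg.stop()
    event
  end

  def demo do
    signed_in = %{user: %TraceDemo.User{email: "someone@example.com"}}
    first_exception(fn -> report(signed_in) end)
  end
end

IO.inspect(TraceDemo.demo(), label: "trace event")
```

The event names the function the exception escaped from, which would at least have narrowed the search to the right function without 30 stepping runs.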
Marcin
Here’s the epmd-tunnel bash script. Run it with `./epmd-tunnel server1 app2` and follow the instructions:
#!/bin/bash
if [ $# -lt 1 ]; then
echo $0 host app_name
exit 1
fi
if [ $# -lt 2 ]; then
ssh $1 epmd -names
echo "now run $0 $1 app_name"
echo "make sure that remote node runs with -name (not -sname)"
echo "and the name is with full domain of the host."
exit 0
fi
HOST=$1
APP=$2
PORT=$(ssh $HOST epmd -names| grep "name $2 at port" |cut -f 5 -d ' ')
if [ ! -n "$PORT" ]; then echo "Can't figure out port for $APP"; exit 1; fi
echo "$APP distributed port is $PORT"
BEAM=$(ssh $HOST ps awux | grep -- "beam.*-name $APP")
# echo $BEAM
COOKIE_ARGS=$(echo $BEAM | sed -n 's/.*-setcookie \([^ ]*\).*/--cookie \1/p')
DOMAIN=$(echo $BEAM | sed -n 's/.*-name [^@]*@\([^ ]*\).*/\1/p')
if [ ! -n "$COOKIE_ARGS" ]; then echo "Can't figure out cookie for $APP"; exit 1; fi
echo 1. Put into /etc/hosts:
echo 127.0.0.1 $DOMAIN
echo
echo 2. In another terminal run:
echo iex $COOKIE_ARGS --name me@$DOMAIN -S mix run --no-start
echo
echo 3. Paste:
echo Node.connect ":'$APP@$DOMAIN'"
echo
echo Running tunnel now....
ssh -L 4369:localhost:4369 -L $PORT:localhost:$PORT $HOST
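Once the tunnel is up and the local iex session from step 2 is running, the debugger part could look roughly like the sketch below. This is an assumption about what the "3 steps" amount to (start the debugger, interpret a module, set a breakpoint), not a transcript of my session. `RemoteDebug` and the module name and line number in the usage comment are hypothetical.

```elixir
defmodule RemoteDebug do
  # Connects to `node`, opens the debugger window, interprets `module`
  # on all connected nodes and sets a breakpoint at `line`.
  def attach(node, module, line) do
    case Node.connect(node) do
      true ->
        # Opens the debugger monitor window on the local machine.
        :debugger.start()

        # :int.ni/1 needs debug info in the beam file, and the remote
        # node needs the :debugger application available as well.
        case :int.ni(module) do
          {:module, ^module} -> :int.break(module, line)
          :error -> {:error, :cannot_interpret}
        end

      # false (connection failed) or :ignored (local node is not distributed)
      other ->
        {:error, other}
    end
  end
end

# Usage inside iex (hypothetical module and line number):
#   RemoteDebug.attach(:"app@realdomain.com", MyApp.SentryMiddleware, 42)
```

If `attach/3` returns `{:error, false}`, the connection itself failed, so the cookie, the /etc/hosts entry and the tunnel are the first things to recheck.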