I am using Rustler, and specifically a ResourceArc, to preload some data into memory (in this case an ML model).
The setup is as follows. Poolboy handles starting the GenServers:
```elixir
conf = [
  name: {:local, :my_model},
  worker_module: MyModelServer,
  size: 1,
  max_overflow: 0
]

:poolboy.child_spec(:my_model, conf, [some_model_arg])
```
These GenServers create and hold the ResourceArc refs. That's all they do.
```elixir
defmodule MyModelServer do
  use GenServer

  def start_link(model_arg), do: GenServer.start_link(__MODULE__, model_arg, [])

  def fetch_handle(pid), do: GenServer.call(pid, :fetch_handle)

  def init(model_arg), do: get_resource(model_arg)

  def handle_call(:fetch_handle, _from, state), do: {:reply, state, state}

  defp get_resource(model_arg) do
    case MyApp.Native.create_model(model_arg) do
      {:error, _} = err -> err
      ref -> {:ok, ref}
    end
  end
end
```
Then, to use the model/arc, I run a poolboy transaction that checks out the pid and uses it to fetch the ref/arc handle:
```elixir
:poolboy.transaction(:my_model, fn pid ->
  handle = MyModelServer.fetch_handle(pid)
  MyApp.Native.use_model(handle, some_args)
end)
```
That may be overly complex; maybe I don't need poolboy and can just start the GenServers directly. However, as is, it's working… mostly. It works just fine, then intermittently I get an argument error at the point where I call out to Rust with the ResourceArc:
```
[error] Task #PID<0.5475.0> started from #PID<0.5448.0> terminating
** (ArgumentError) argument error
    (my_app 0.1.0) MyApp.Native.use_model(#Reference<0.3453741534.1237975042.12785>, ["I'm the model input"])
```
I know the second argument is OK: a list of strings is what's expected here. My assumption is that the ref being passed references a Rust struct that no longer exists, but I'm not sure why it would no longer exist.
My question is not so much how I can recreate the model arc/ref; that part I know how to do. I'm more curious why this might be happening. Is there a way to debug it? Refs are a very opaque type that I'm not sure how to inspect or debug, much less when they're merely a handle to a struct in Rust, as is the case here. I'm also not that familiar with Rust/Rustler (beyond having set up a few NIFs/arcs before, so I have some familiarity, but not much with debugging or inspecting them).
The main reason I'm ruling out the setup/poolboy/Elixir side of things is that the GenServer is the one creating the handle and holding onto it, so even if poolboy restarted it, it would create a new, and therefore valid, handle. Also, according to the logs, the request preceding the one with the ArgumentError was successful, so it doesn't look like a previous request caused an error in Rust that put the Arc in a bad state, which was then checked back into poolboy in that failing state. It certainly could still be an issue on the Elixir side, or with my setup, but at first glance nothing obvious stands out there.
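One way to test the assumption that a restart always yields a fresh handle, and to take poolboy out of the picture entirely, is a single supervised GenServer that logs every handle it creates. This is a minimal sketch, assuming the pool really is just one worker (as with `size: 1, max_overflow: 0`). `SingleModelServer` is a hypothetical name, the server is registered under its module name, and the private `create_model/1` is a hypothetical stub returning `make_ref()` in place of `MyApp.Native.create_model/1`. If a creation log line shows up shortly before an ArgumentError, the worker restarted around that request.

```elixir
defmodule SingleModelServer do
  use GenServer
  require Logger

  # Registered under the module name, so callers need no pool checkout or pid.
  def start_link(model_arg),
    do: GenServer.start_link(__MODULE__, model_arg, name: __MODULE__)

  def fetch_handle, do: GenServer.call(__MODULE__, :fetch_handle)

  @impl true
  def init(model_arg) do
    case create_model(model_arg) do
      {:error, reason} ->
        {:stop, reason}

      ref ->
        # One log line per handle ever created; every restart produces a new line.
        Logger.info("model handle #{inspect(ref)} created by #{inspect(self())}")
        {:ok, ref}
    end
  end

  @impl true
  def handle_call(:fetch_handle, _from, ref), do: {:reply, ref, ref}

  # Hypothetical stand-in for MyApp.Native.create_model/1 so the sketch runs alone.
  defp create_model(_model_arg), do: make_ref()
end

# `use GenServer` defines child_spec/1, so the server can be supervised directly.
{:ok, _sup} =
  Supervisor.start_link([{SingleModelServer, :some_model_arg}], strategy: :one_for_one)

handle = SingleModelServer.fetch_handle()
```

With this in place, the transaction reduces to something like `MyApp.Native.use_model(SingleModelServer.fetch_handle(), some_args)`. One behavioral difference: poolboy with a single worker lets only one caller use the handle at a time, whereas here concurrent callers would enter the NIF with the same handle at the same time.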
Does anyone have insight into what may be the cause, or things I could do to introspect when this happens? Sadly it's very intermittent, so I don't know how to reproduce it, but I'd like to get to the root of it.
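For introspection at the moment of failure, one option is to wrap the NIF call, rescue the ArgumentError, and log the handle the caller used alongside the handle the worker holds at that moment, plus whether the worker is still alive. If the two handles match and the worker never restarted, the cause is more likely inside the Rust call than in the Elixir setup. In Rustler, a badarg (surfaced in Elixir as ArgumentError) can come from an argument failing to decode, including a term that is not a resource of the expected type, or from the NIF returning `Err(rustler::Error::BadArg)`. This is a minimal sketch: `ModelDebug.use_model_logged/3` is a hypothetical helper, and `StubWorker` and `StubNative` are hypothetical stand-ins for `MyModelServer` and `MyApp.Native`. `StubNative.use_model/2` always raises so that the failure path can be exercised.

```elixir
# Hypothetical stand-in for MyModelServer: hands out whatever ref it was started with.
defmodule StubWorker do
  use GenServer

  def start_link(ref), do: GenServer.start_link(__MODULE__, ref)
  def fetch_handle(pid), do: GenServer.call(pid, :fetch_handle)

  @impl true
  def init(ref), do: {:ok, ref}

  @impl true
  def handle_call(:fetch_handle, _from, ref), do: {:reply, ref, ref}
end

# Hypothetical stand-in for MyApp.Native: always raises, as a badarg from a NIF would.
defmodule StubNative do
  def use_model(_handle, _args), do: raise(ArgumentError, "argument error")
end

defmodule ModelDebug do
  require Logger

  # Calls the NIF; on ArgumentError, logs the surrounding context and re-raises unchanged.
  def use_model_logged(pid, handle, args) do
    StubNative.use_model(handle, args)
  rescue
    e in ArgumentError ->
      alive? = Process.alive?(pid)
      current = if alive?, do: StubWorker.fetch_handle(pid), else: :worker_not_alive

      Logger.error("""
      use_model raised: #{Exception.message(e)}
        worker: #{inspect(pid)} (alive? #{alive?})
        handle passed in: #{inspect(handle)}
        handle the worker holds now: #{inspect(current)}
        same handle? #{current == handle}
        args: #{inspect(args)}
      """)

      reraise e, __STACKTRACE__
  end
end
```

Inside the poolboy transaction, this would replace the direct call with `ModelDebug.use_model_logged(pid, handle, some_args)`, with the stubs swapped for the real modules. Because it re-raises with the original stacktrace, the Task still crashes exactly as before, but with extra context in the log.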
EDIT: One other thought I had while going through the Rustler docs: maybe it has something to do with dirty scheduling. The calls to Rust that use this ResourceArc are long running, in the >1 ms sense rather than minutes or hours: typically 0.25 to 0.75 seconds on average, but sometimes up to 1 to 2 seconds. Currently I'm just creating the ResourceArc the standard way and not marking anything as dirty. I'm not sure if that could be a factor.
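Dirty scheduling is configured per NIF function rather than on the ResourceArc itself. Before changing anything, it may help to confirm how long the calls actually run and whether the VM notices. The sketch below is hypothetical (`NifTiming` and its function names are made up for illustration). `measure/2` times any zero-arity function with `:timer.tc/1`. `log_long_schedules/1` uses `:erlang.system_monitor/2` with the `long_schedule` option to log any process that runs uninterrupted past a threshold, which is how a long NIF on a normal scheduler would tend to show up. The 1 ms threshold is the usual guideline for NIFs on regular schedulers. Note that the system monitor is VM-wide, so calling this replaces any existing monitor settings.

```elixir
defmodule NifTiming do
  require Logger

  # Common guidance: a NIF on a normal scheduler should return within about 1 ms.
  @threshold_us 1_000

  # Times a zero-arity function (e.g. fn -> MyApp.Native.use_model(handle, args) end),
  # warns if it ran past the threshold, and returns {microseconds, result}.
  def measure(label, fun) when is_function(fun, 0) do
    {micros, result} = :timer.tc(fun)

    if micros > @threshold_us do
      Logger.warning("#{label} took #{div(micros, 1_000)} ms, above the ~1 ms NIF guideline")
    end

    {micros, result}
  end

  # Spawns a process that logs {:monitor, pid_or_port, :long_schedule, info} messages
  # for anything that runs uninterrupted for more than `ms` milliseconds.
  # system_monitor is VM-wide: this replaces any existing system monitor.
  def log_long_schedules(ms \\ 100) do
    logger = spawn(fn -> loop() end)
    :erlang.system_monitor(logger, [{:long_schedule, ms}])
    logger
  end

  defp loop do
    receive do
      {:monitor, who, :long_schedule, info} ->
        Logger.warning("long_schedule: #{inspect(who)} #{inspect(info)}")
        loop()
    end
  end
end
```

If the timings confirm that calls routinely take hundreds of milliseconds, Rustler can run a NIF on a dirty CPU scheduler by annotating it with `#[rustler::nif(schedule = "DirtyCpu")]`.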