Flame terminates without visible error

Zsolt · February 15, 2024, 2:39pm

My FLAME process is failing, but I can’t find the cause. The process works fine on the local backend, and in production for smaller payloads.

I’m uploading and processing images, and the failure only occurs once I exceed a certain payload size. It is triggered at around 5 large files of 20-50 MB each.

The basic steps of the process are:

upload files with LiveView and copy them to a temp folder
create parent and local file streams so FLAME calls can access the file (based on the example in Rethinking Serverless with FLAME · The Fly Blog)
process the files with an image library
stream the files to S3 with ExAws
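To make the steps above concrete, here is a minimal sketch of how they might fit together. It is not the actual application code: the module name, `process_fun` and `upload_fun` are hypothetical placeholders for the image library and the ExAws upload, and the stream handling follows the pattern described in the linked blog post.

```elixir
defmodule ImagePipelineSketch do
  # Hypothetical module for illustration only.

  # Called on the parent node. File.stream!/3 is lazy, so nothing is read here.
  # Per the blog post, the runner reads the parent's file in 2048-byte chunks
  # when it enumerates the stream.
  # Note: on Elixir 1.16+ the preferred form is File.stream!(path, 2048).
  def process_remote(pool, upload_path, process_fun, upload_fun) do
    parent_stream = File.stream!(upload_path, [], 2048)

    FLAME.call(pool, fn ->
      local_path = copy_to_local(parent_stream)

      try do
        local_path |> process_fun.() |> upload_fun.()
      after
        # Always remove the runner-side temp file, even if processing fails.
        File.rm(local_path)
      end
    end)
  end

  # Copies any enumerable of binaries into a fresh temp file and returns its path.
  def copy_to_local(source_stream) do
    name = "flame-#{System.unique_integer([:positive])}"
    local_path = Path.join(System.tmp_dir!(), name)
    Enum.into(source_stream, File.stream!(local_path))
    local_path
  end
end
```

Keeping the copy step in its own function makes it possible to exercise it locally with plain lists of binaries, without booting a runner.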

I have enabled debug logs for the phoenix app and ExAws as well.
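For reference, a configuration along these lines is one way to get that level of logging. This is a sketch under assumptions: the file placement is a guess, and the thread does not show the actual settings used.

```elixir
import Config

# Assumed placement: config/runtime.exs or config/prod.exs.
# Raise the application log level so debug messages are emitted.
config :logger, level: :debug

# Ask ExAws to log each request it sends (URL, headers, body).
config :ex_aws, debug_requests: true
```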

One hint I can get from the logs is that ExAws is struggling with the upload; it seems like it’s not getting any input. Part of an upload request shows: BODY: "".

The only relevant flame log I get is:

2024-02-15T14:14:41Z app[0806250b612738] ams [info]14:14:41.353 [error] GenServer FLAME.Terminator.ChildPlacementSup terminating
2024-02-15T14:14:41Z app[0806250b612738] ams [info]** (stop) killed
2024-02-15T14:14:41Z app[0806250b612738] ams [info]Last message: {:EXIT, #PID<0.2625.0>, :killed}
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]14:14:41.352 [error] GenServer #PID<0.2913.0> terminating
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]** (stop) killed
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]Last message: {:DOWN, #Reference<0.3703568104.3533176833.226316>, :process, #PID<64302.2627.0>, :killed}
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]State: %{runner: #FLAME.Runner<id: nil, instance_id: nil, private_ip: nil, backend: FLAME.FlyBackend, terminator: #PID<64302.2627.0>, node_name: nil, single_use: true, timeout: 30000, status: :booted, log: :debug, boot_timeout: 30000, idle_shutdown_after: 30000, idle_shutdown_check: #Function<8.81159202/0 in FLAME.Runner.new/1>, ...>, checkouts: %{}, otp_app: :phoenix_albums, backend_state: #FLAME.FlyBackend<host: "https://api.machines.dev", local_ip: ["fdaa:3:e5fc:a7b:c207:6cdf:6e23:2"], cpu_kind: "performance", cpus: 1, memory_mb: 4096, gpu_kind: nil, image: "registry.fly.io/phoenix-albums:deployment-01HPPHKRM3RX1WQ2E7DJFTC0KT", app: "phoenix-albums", boot_timeout: 30000, runner_id: "0806250b612738", remote_terminator_pid: #PID<64302.2627.0>, runner_node_basename: "phoenix-albums-01HPPHKRM3RX1WQ2E7DJFTC0KT", runner_instance_id: "01HPPHV48EZPS6AMZJPSCJ4B2H", runner_private_ip: "fdaa:3:e5fc:a7b:252:9f38:8dec:2", runner_node_name: :"phoenix-albums-01HPPHKRM3RX1WQ2E7DJFTC0KT@fdaa:3:e5fc:a7b:252:9f38:8dec:2", ...>}

but this might just be the regular timeout shutdown.

First and foremost, I’m looking for ways to debug what is going on in the FLAME runners. The data processing definitely breaks somewhere, but I’m unable to catch any errors.
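One approach worth trying is to wrap the work done inside the runner so that anything raised, thrown, or exited there is logged on the runner and also returned to the parent as a value. The sketch below is only an illustration: `RunnerDebugSketch` and `guarded/2` are hypothetical names, and the memory figure is included because the failure depends on payload size.

```elixir
defmodule RunnerDebugSketch do
  # Hypothetical helper for illustration only.
  require Logger

  # Runs `fun` and converts any raise, throw, or exit raised inside it into
  # a tagged tuple, logging the failure together with the VM's memory use.
  def guarded(label, fun) when is_function(fun, 0) do
    {:ok, fun.()}
  rescue
    exception ->
      formatted = Exception.format(:error, exception, __STACKTRACE__)
      Logger.error("[#{label}] raised (#{mem_mb()} MB in use): #{formatted}")
      {:error, {:raised, Exception.message(exception)}}
  catch
    kind, reason ->
      Logger.error("[#{label}] #{kind} (#{mem_mb()} MB in use): #{inspect(reason)}")
      {:error, {kind, reason}}
  end

  # Total memory currently allocated by the Erlang VM, in whole megabytes.
  defp mem_mb, do: div(:erlang.memory(:total), 1_048_576)
end
```

The function passed to FLAME.call could then run its body through `RunnerDebugSketch.guarded("process", fn -> ... end)`. This only surfaces failures that originate inside the wrapped function. It cannot intercept a process that is killed from the outside, which is what the `:killed` reason in the log above points to, so an empty result here would itself be a hint that the runner is being shut down externally.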

Zsolt · February 16, 2024, 3:58pm

After further debugging, I think we can rule out the libraries mentioned above, since they work fine locally for larger payloads and in prod for smaller ones.

I made some runs with the following Pool config:

{
  FLAME.Pool,
  name: PhoenixAlbums.ImageProcessor,
  # All three timeouts are in milliseconds: 1_200_000 ms = 20 minutes.
  shutdown_timeout: 1_200_000,
  idle_shutdown_after: 1_200_000,
  timeout: 1_200_000,
  min: 1,
  max: 10,
  max_concurrency: 1,
  single_use: true,
  log: :debug
}

This is meant to ensure that we always have at least one machine, and that it can’t possibly time out during longer-running processes. The results are the same: the runner still exits without any visible errors. When uploading 5 images (~120 MB in total), it exits after about 75s, and usually the last 2 images are not processed. The same happens with a large number of small images: for 130 files (~100 MB in total), only 25 are processed, and the runner exits after ~15s.

One message that shows up in the logs every time is: Reaped child process with pid: 365 and signal: SIGUSR1, core dumped? false.

Also, the runner machine is always destroyed after it exits, despite the min: 1 option.

This feels more like a Fly.io issue at this point, but I would welcome any insight into how FLAME runners are handled, or any approach I could try to debug this.

Zsolt · February 21, 2024, 4:27pm

Resolved in version 0.1.10