FLAME on Fly.io fails: WARN could not unmount /rootfs: EINVAL: Invalid argument

Hello,

I’ve been trying to run a simple FLAME function on Fly.io. I have a Livebook up and running there, and when I run FLAME.call, I see another machine being provisioned, but it is immediately destroyed.

I’ve tried this multiple times with different CPU/MEM combinations but always end up with the logs below (newest entry first):

2024-10-31 13:34:58.293  machine restart policy set to 'no', not restarting
2024-10-31 13:34:58.063  [   25.712374] reboot: Restarting system
2024-10-31 13:34:58.061  WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-10-31 13:34:58.014  INFO Starting clean up.
2024-10-31 13:34:58.000  INFO Main child exited normally with code: 0
2024-10-31 13:34:36.779  [Livebook] starting :"product-hub-flame-ed8b724b389e6d2613e0@fdaa:1:2b33:a7b:f4:c3b:5424:2" in FLAME mode with parent: #PID<17563.1397.0>, backend: :flame
2024-10-31 13:34:36.747  Generated flame app
2024-10-31 13:34:35.418  Compiling 15 files (.ex)
2024-10-31 13:34:35.418  ==> flame
2024-10-31 13:34:35.236  * Getting flame (Hex package)
2024-10-31 13:34:35.231  flame 0.5.1
2024-10-31 13:34:35.231  New:
2024-10-31 13:34:35.230  Resolution completed in 0.025s
2024-10-31 13:34:35.201  Resolving Hex dependencies...
2024-10-31 13:34:33.974  WARN Reaped child process with pid: 361 and signal: SIGUSR1, core dumped? false
2024-10-31 13:34:33.098  2024/10/31 08:04:33 INFO SSH listening listen_address=[fdaa:1:2b33:a7b:f4:c3b:5424:2]:22 dns_server=[fdaa::3]:53
2024-10-31 13:34:33.019  Machine created and started in 2.437s
2024-10-31 13:34:32.972  INFO [fly api proxy] listening at /.fly/api
2024-10-31 13:34:32.968  INFO Preparing to run: `/app/bin/server` as root
2024-10-31 13:34:32.908  INFO Starting init (commit: 693c179a)...
2024-10-31 13:34:32.286  2024-10-31T08:04:32.286407518 [01JBGSNJ23TJ13R0BTTFG2GSH2:main] Running Firecracker v1.7.0
2024-10-31 13:34:32.217  Configuring firecracker
2024-10-31 13:34:30.808  Successfully prepared image ghcr.io/livebook-dev/livebook:0.14.5 (150.881379ms)
2024-10-31 13:34:30.657  Pulling container image ghcr.io/livebook-dev/livebook:0.14.5

Is there a config I need to specify so that the runner node doesn’t terminate after boot? I assume that’s something the FLAME lib takes care of with the min/max count.

Please correct me if I’m wrong, but the purpose of FLAME is to run short-lived tasks and then terminate, similar to AWS Lambda and Azure Functions. So it terminates by design.

You can probably fiddle with shutdown_timeout (see the FLAME v0.5.1 docs), but what are you trying to do?
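
For illustration, here is roughly where those timeouts would go in a pool started from Livebook. This is only a sketch: the pool name :example_pool is hypothetical, the values are arbitrary, and it assumes the Fly API token and app settings are already available to the backend through the environment.

```elixir
# Illustrative values only; :example_pool is a hypothetical name.
Kino.start_child(
  {FLAME.Pool,
   name: :example_pool,
   min: 0,
   max: 1,
   # How long a runner may sit idle before it is shut down.
   idle_shutdown_after: :timer.minutes(5),
   # Time allowed for a runner to shut down (see the FLAME.Pool docs for exact semantics).
   shutdown_timeout: :timer.seconds(30),
   backend: {FLAME.FlyBackend, env: %{"LIVEBOOK_COOKIE" => Node.get_cookie()}}}
)
```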

Yes, I understand that it’s short-lived but what’s happening here is that the machine is spun up, but before any work is sent to it, the instance is terminated so no computation is done. The default shutdown_timeout is 1 minute, I believe.

I’m trying the example from the first video here without the GPU computation:

Perhaps it has something to do with the way the cookie is set up? Do I need to set anything apart from:

env: %{"LIVEBOOK_COOKIE" => Node.get_cookie()}

in the FLAME.FlyBackend config?

That logged error is transient and won’t affect the FLAME machine. Can you share your entire logs? Thanks!

Hi Chris!

So I managed to get this working by spinning up a remote Fly instance connected to my local Livebook. From there I was successfully able to run a FLAME task.

I’m now trying to understand the difference between connecting to a remote instance from my local Livebook and running the task there, versus running a Livebook directly on Fly.io and setting up FLAME (which is the case that currently fails).

For the latter case, I’m getting a connection timeout when I try to run a task via FLAME.

I did check that the ENV and image used were the same in both cases, so it looks like the difference is in the way the machines are deployed. I’m not sure where to look beyond these.

Here is the config:

Kino.start_child(
  {FLAME.Pool,
   name: :runner,
   code_sync: [start_apps: true, sync_beams: Kino.beam_paths(), compress: false],
   min: 1,
   max: 1,
   max_concurrency: 10,
   boot_timeout: :timer.minutes(3),
   idle_shutdown_after: :timer.minutes(1),
   timeout: :infinity,
   track_resources: true,
   backend:
     {FLAME.FlyBackend,
      cpu_kind: "shared",
      cpus: 1,
      memory_mb: 1024,
      env: %{"LIVEBOOK_COOKIE" => Node.get_cookie()}}}
)
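
For anyone debugging the same setup, a quick check along these lines might help show whether the parent node and the runner can actually see each other. It is only a sketch: it assumes the :runner pool above has already started, and the 30-second timeout is an arbitrary choice.

```elixir
# Run in a Livebook cell on the parent node, after the pool above has started.
IO.inspect(Node.self(), label: "parent node")
IO.inspect(Node.alive?(), label: "distribution enabled")
IO.inspect(Node.list(:connected), label: "connected nodes (incl. hidden)")

# Trivial task: returns the name of the node it actually ran on.
:runner
|> FLAME.call(fn -> Node.self() end, timeout: :timer.seconds(30))
|> IO.inspect(label: "ran on")
```

If the last call times out while the Fly logs show the runner booting, the problem is more likely in node-to-node connectivity (node names, cookie, private networking) than in the pool options.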