Hi everyone, I recently took on a personal challenge of running Meta’s Segment Anything Model (SAM) in Elixir. Since it’s not supported by Bumblebee, I started using Ortex to run ONNX models.
Everything is mostly smooth, but my final masks are a bit distorted. Since it’s my first time using Nx/Ortex/ONNX, I’m struggling to figure out the problem. Could it be an issue with the ONNX model itself?
I think I can simplify some of the image pipeline using Image - any chance you can put the decoder model somewhere I can access? (I’m not comfortable regenerating it myself.)
Update
I cloned the SAM repo and followed the instructions to generate an ONNX model after downloading the default vit_h checkpoint:
% python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth --model-type default --output sam.onnx
Loading model...
Exporting onnx model to sam.onnx...
Model has successfully been run with ONNXRuntime.
But that gives me a single ONNX file, not separate encoder/decoder models. What am I missing? (he says, clearly illustrating he knows nothing about this domain)
Yes, their examples aren’t very comprehensive in terms of explanations. From what I understand, SAM is a “two-stage model.”
The first stage takes an image and transforms it into something the next model can understand (image embeddings). This stage, which I refer to as the encoder or vision encoder, is common in many image processing tasks, not unique to SAM. That’s why it’s not included in their export script.
The second stage, which you exported, takes the image embeddings plus the prompt inputs (point coordinates, point labels, and an optional previous mask) and produces the mask. This is what I refer to as the decoder.
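To make the two-stage flow concrete, here is a sketch of the inputs the exported decoder expects. The input names come from the upstream `export_onnx_model.py` script; the shapes and values below are illustrative (they assume the default 1024x1024 preprocessing, which yields a 64x64 embedding grid), not something copied from a working run:

```python
import numpy as np

# Illustrative inputs for the exported SAM ONNX decoder.
# Names follow the upstream export script; shapes assume the
# default 1024x1024 preprocessing (64x64 embedding grid).
embedding_dim, grid = 256, 64

decoder_inputs = {
    # output of the vision encoder (stage one)
    "image_embeddings": np.zeros((1, embedding_dim, grid, grid), dtype=np.float32),
    # one foreground click at (x, y) plus a required padding point
    "point_coords": np.array([[[512.0, 512.0], [0.0, 0.0]]], dtype=np.float32),
    # label 1 = foreground point, -1 = padding
    "point_labels": np.array([[1.0, -1.0]], dtype=np.float32),
    # no previous low-res mask to refine
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    # original image size (H, W) before resizing to 1024
    "orig_im_size": np.array([768.0, 1024.0], dtype=np.float32),
}

for name, arr in decoder_inputs.items():
    print(name, arr.shape)
```

On the Ortex side you would feed these same tensors (as `Nx` tensors, in this order of names) into the loaded decoder model; the encoder alone is responsible for producing `image_embeddings`.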
I can’t upload the decoder to a repository right now as I’m not at home, but you can find the vision encoder here. Since you’ve exported the decoder, you should be set with these two models!
Which I think is saying that the image data has shape {1024, 1024, 3} (height, width, channels) while the model expects {3, 1024, 1024} (channels first).
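That mismatch is the usual HWC-vs-CHW layout issue, and the fix is a transpose rather than a reshape (a reshape would scramble the pixels). A minimal NumPy illustration:

```python
import numpy as np

# A 1024x1024 RGB image laid out HWC (height, width, channels),
# which is what most image-decoding libraries hand you.
img_hwc = np.zeros((1024, 1024, 3), dtype=np.float32)

# The model wants CHW (channels first); transposing the axes
# reorders the layout without scrambling pixel values.
img_chw = np.transpose(img_hwc, (2, 0, 1))
print(img_chw.shape)  # (3, 1024, 1024)
```

In Nx the equivalent should be `Nx.transpose(tensor, axes: [2, 0, 1])`, followed by `Nx.new_axis/2` if the model also wants a leading batch dimension.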
Given you already have a working model, may I still ask whether you could upload the encoder and decoder you are using, so I can at least eliminate the models themselves as an issue in my experiment?