Hi everyone, I recently took on a personal challenge of running Meta’s Segment Anything Model (SAM) in Elixir. Since it’s not supported by Bumblebee, I started using Ortex to run ONNX models.
Everything is mostly smooth, but my final masks are a bit distorted. Since it’s my first time using Nx/Ortex/ONNX, I’m struggling to figure out the problem. Could it be an issue with the ONNX model itself?
I think I can simplify some of the image pipeline using Image - any chance you can put the decoder model somewhere I can access? (I’m not comfortable regenerating it myself.)
Update
I cloned the SAM repo and followed the instructions to generate an ONNX model after downloading the default vit_h checkpoint:
% python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth --model-type default --output sam.onnx
Loading model...
Exporting onnx model to sam.onnx...
Model has successfully been run with ONNXRuntime.
But that gives me a single ONNX file, not separate encoder/decoder models. What am I missing? (he says, clearly illustrating he knows nothing about this domain)
Yes, their examples aren’t very comprehensive in terms of explanations. From what I understand, SAM is a “two-stage model.”
The first stage takes an image and transforms it into something the next model can understand (image embeddings). This stage, which I refer to as the encoder or vision encoder, is common in many image processing tasks, not unique to SAM. That’s why it’s not included in their export script.
The second stage, which you exported, takes the image embeddings plus the prompt inputs (point coordinates, point labels, and an optional previous mask) and produces the mask. This is what I refer to as the decoder.
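To make the two-stage flow concrete, here is a sketch of the inputs the exported decoder expects. The input names come from the upstream `export_onnx_model.py` script; the shapes and values below are illustrative (they assume the default 1024x1024 preprocessing, which yields a 64x64 embedding grid), not something copied from a working run:

```python
import numpy as np

# Illustrative inputs for the exported SAM ONNX decoder.
# Names follow the upstream export script; shapes assume the
# default 1024x1024 preprocessing (64x64 embedding grid).
embedding_dim, grid = 256, 64

decoder_inputs = {
    # output of the vision encoder (stage one)
    "image_embeddings": np.zeros((1, embedding_dim, grid, grid), dtype=np.float32),
    # one foreground click at (x, y) plus a required padding point
    "point_coords": np.array([[[512.0, 512.0], [0.0, 0.0]]], dtype=np.float32),
    # label 1 = foreground point, -1 = padding
    "point_labels": np.array([[1.0, -1.0]], dtype=np.float32),
    # no previous low-res mask to refine
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    # original image size (H, W) before resizing to 1024
    "orig_im_size": np.array([768.0, 1024.0], dtype=np.float32),
}

for name, arr in decoder_inputs.items():
    print(name, arr.shape)
```

On the Ortex side you would feed these same tensors (as `Nx` tensors, in this order of names) into the loaded decoder model; the encoder alone is responsible for producing `image_embeddings`.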
I can’t upload the decoder to a repository right now as I’m not at home, but you can find the vision encoder here. Since you’ve exported the decoder, you should be set with these two models!
Which I think is saying that the image data has shape {1024, 1024, 3} (height, width, channels) while the model expects {3, 1024, 1024} (channels first).
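That mismatch is the usual HWC-vs-CHW layout issue, and the fix is a transpose rather than a reshape (a reshape would scramble the pixels). A minimal NumPy illustration:

```python
import numpy as np

# A 1024x1024 RGB image laid out HWC (height, width, channels),
# which is what most image-decoding libraries hand you.
img_hwc = np.zeros((1024, 1024, 3), dtype=np.float32)

# The model wants CHW (channels first); transposing the axes
# reorders the layout without scrambling pixel values.
img_chw = np.transpose(img_hwc, (2, 0, 1))
print(img_chw.shape)  # (3, 1024, 1024)
```

In Nx the equivalent should be `Nx.transpose(tensor, axes: [2, 0, 1])`, followed by `Nx.new_axis/2` if the model also wants a leading batch dimension.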
Given you already have a working model, may I still ask whether you could upload the encoder and decoder you are using, so I can at least eliminate the models themselves as an issue in my experiment?