Image-To-Text model recommendation

I tried the 3 “standard” Image-To-Text models for image captioning: “facebook/deit-base-distilled-patch16-224”, “microsoft/resnet-50” and “google/vit-base-patch16-224”.

The results are poor, about as low as the score : )

This car is predicted as:

%{predictions: [%{label: "speedboat", score: 0.136890709400177}]}

Among the 280 I2T models listed on Hugging Face, does anyone have a recommendation?

Are you looking for “Image classification” (pick one of a set of predefined labels) or “Image-to-text” (describe the image with text)? Also, how are you making the prediction? You can use the Neural Network smart cell in Livebook to explore the various models; here’s one example for each of the tasks:
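For the second task, a captioning serving in Bumblebee might look roughly like the sketch below. It assumes Bumblebee 0.4.0 and the “Salesforce/blip-image-captioning-base” checkpoint, which is my own pick and not one of the models mentioned in this thread. The grey placeholder tensor only stands in for a real image in {height, width, channels} layout.

```elixir
Mix.install([
  {:bumblebee, "~> 0.4.0"},
  {:exla, ">= 0.0.0"}
])

Nx.global_default_backend(EXLA.Backend)

# Assumption: a BLIP captioning checkpoint; any other image-to-text
# model supported by Bumblebee would be loaded the same way.
repo = {:hf, "Salesforce/blip-image-captioning-base"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Vision.image_to_text(model_info, featurizer, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )

# Placeholder input: a uniform grey 224x224 RGB image as a u8 tensor.
# Replace it with a real image tensor in {height, width, 3} layout.
image = Nx.broadcast(Nx.tensor(128, type: :u8), {224, 224, 3})

result = Nx.Serving.run(serving, image)
```

Unlike the classification servings, which return labels with scores, this one returns generated text, so there is no fixed label set to fall back on.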

I did not try the Neural Network smart cell, just image classification with Bumblebee from a Phoenix app.

But your Livebook prediction seems much better. I may need to import this into the Phoenix app. What is your setup? The following fails:

```elixir
Mix.install(
  [
    {:nx, "~> 0.6.2"},
    {:exla, "~> 0.6.1"},
    {:axon, "~> 0.6.0"},
    {:kino, "~> 0.11.0"},
    {:kino_bumblebee, "~> 0.4.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

Nx.global_default_backend(EXLA.Backend)
```

For kino_bumblebee 0.4.0 (released today) you want to use Livebook main (we will have a new release next week).

How did you arrive at that score? Using the example from the docs:

# Image classification

```elixir
Mix.install([
  {:bumblebee, "~> 0.4.0"},
  {:exla, ">= 0.0.0"},
  {:kino, "~> 0.11.0"}
])

Nx.global_default_backend(EXLA.Backend)
```

## 🐈‍⬛

```elixir
{:ok, resnet} = Bumblebee.load_model({:hf, "microsoft/resnet-50"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "microsoft/resnet-50"})
:ok
```

<!-- livebook:{"output":true} -->

```
:ok
```

```elixir
serving = Bumblebee.Vision.image_classification(resnet, featurizer)

image_input = Kino.Input.image("Image", size: {224, 224})
```

```elixir
image = Kino.Input.read(image_input)

# Build a tensor from the raw pixel data
image =
  image.file_ref
  |> Kino.Input.file_path()
  |> File.read!()
  |> Nx.from_binary(:u8)
  |> Nx.reshape({image.height, image.width, 3})

Nx.Serving.run(serving, image)
```

<!-- livebook:{"output":true} -->

```
%{
  predictions: [
    %{
      label: "beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon",
      score: 0.9962531328201294
    },
    %{label: "convertible", score: 0.0015192109858617187},
    %{label: "grille, radiator grille", score: 3.9919803384691477e-4},
    %{label: "passenger car, coach, carriage", score: 2.2876601724419743e-4},
    %{label: "car wheel", score: 2.235681313322857e-4}
  ]
}
```

With the Livebook, yes, but I used this in Phoenix:

```elixir
{:ok, resnet} = Bumblebee.load_model({:hf, model})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, model})

serving =
  Bumblebee.Vision.image_classification(resnet, featurizer,
    defn_options: [compiler: EXLA],
    top_k: 1,
    compile: [batch_size: 10]
  )

{:ok, image} = Vix.Vips.Image.new_from_file("/.../my-image.webp")

{:ok, %Vix.Tensor{data: data, shape: shape, names: names, type: type}} =
  Vix.Vips.Image.write_to_tensor(image)

t_img = Nx.from_binary(data, type) |> Nx.reshape(shape, names: names)

%{predictions: predictions} = Nx.Serving.run(serving, t_img)
predictions
```

This doesn’t compile either, or I don’t know how to use it: {:kino_bumblebee, git: "https://github.com/livebook-dev/kino_bumblebee", branch: "main", override: true}

I think this is a bug in Vix, the binary is laid out as {height, width, channels}, but Vix returns {width, height, channels}.

> This doesn’t compile either, or I don’t know how

You can use kino_bumblebee 0.4.0 and livebook main. What error are you getting?

I reported the bug as issue #126 in akash-akya/vix, “Wrong shape returned from write_to_tensor/1” : )
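Until that is resolved, one possible workaround is to swap the first two dimensions before reshaping. The sketch below assumes the installed Vix version still reports {width, height, channels} while the bytes are laid out row by row. `ShapeFix.to_hwc_tensor/3` is a hypothetical helper, not part of Vix or Nx, and the synthetic binary only stands in for the data returned by `write_to_tensor/1`.

```elixir
Mix.install([{:nx, "~> 0.6"}])

defmodule ShapeFix do
  # Hypothetical helper: takes the raw binary plus the shape as reported
  # ({width, height, channels}) and reshapes it to {height, width, channels},
  # which matches how the bytes are actually laid out.
  def to_hwc_tensor(data, {width, height, channels}, type) do
    data
    |> Nx.from_binary(type)
    |> Nx.reshape({height, width, channels}, names: [:height, :width, :channels])
  end
end

# Synthetic image, 2 pixels high and 3 pixels wide, stored row by row.
# Each byte encodes its own position as y * 100 + x * 10 + channel.
data = for y <- 0..1, x <- 0..2, c <- 0..2, into: <<>>, do: <<y * 100 + x * 10 + c>>

# The shape is passed in the swapped order that the bug report describes.
tensor = ShapeFix.to_hwc_tensor(data, {3, 2, 3}, {:u, 8})

shape = Nx.shape(tensor)
bottom_right_red = Nx.to_number(tensor[1][2][0])
```

If a later Vix release already returns {height, width, channels}, the swap has to be dropped again, so it is worth checking the reported shape against the image dimensions first.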

It was in fact the shape passed to Nx.reshape that I had not paid attention to… Thanks for finding this one.

```elixir
Mix.install(
  [
    {:nx, "~> 0.6.2"},
    {:exla, "~> 0.6.1"},
    {:axon, "~> 0.6.0"},
    {:kino, "~> 0.11.0"},
    {:kino_bumblebee, git: "https://github.com/livebook-dev/kino_bumblebee", branch: "main"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

Nx.global_default_backend(EXLA.Backend)
```

gives me:

```
Unknown output format: %{chunk: false, text: "{\e[34mEXLA.Backend\e[0m, []}", type: :terminal_text}. If you're using Kino,
you may want to update Kino and Livebook to the latest version.
```

Yeah, you need to either run Livebook main (until the release next week) or use kino ~> 0.10.0 and kino_bumblebee ~> 0.3.0, sorry for the inconvenience : )
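For the second option, the setup cell could look something like this. Only the kino and kino_bumblebee requirements come from the suggestion above. Leaving the exla requirement open, as in the docs example, is an assumption so that Mix can pick versions of Nx, Axon and Bumblebee that are compatible with kino_bumblebee 0.3.x.

```elixir
Mix.install(
  [
    # Pairing suggested for released Livebook versions
    {:kino, "~> 0.10.0"},
    {:kino_bumblebee, "~> 0.3.0"},
    # Assumption: left unpinned so the resolver chooses a matching release
    {:exla, ">= 0.0.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)
```

Not pinning nx and axon directly avoids fighting the resolver over versions that kino_bumblebee already constrains.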

Well, then you start messing around with axon and nx versions because every package wants a different one. So I stopped there and tried to find out what my code in Phoenix was doing wrong. The picture was OK, but the Nx part was indeed faulty. Anyway, I will wait for the update. Thanks again!