Using "selvakumarcts/sk_invoice_receipts" with Bumblebee

UlfAnger · July 24, 2024, 3:10pm

Hello,

I am trying to use the model selvakumarcts/sk_invoice_receipts from Hugging Face with Bumblebee, but when I try to load the model with

 Bumblebee.load_model({:hf, "selvakumarcts/sk_invoice_receipts"})

I get the error:

 could not match the class name "VisionEncoderDecoderModel" to any of the supported models, please specify the :module and :architecture options

As I understand it, I have to implement the module VisionEncoderDecoderModel, but I don’t know how. Is there a tutorial you can recommend for doing the implementation? I don’t know if this is important, but in the config.json, _name_or_path = “naver-clova-ix/donut-base”.

Thanks,
Ulf

jonatanklosko · July 24, 2024, 4:55pm

Hey, selvakumarcts/sk_invoice_receipts is a fine-tuned Donut model. It is actually a bit tricky: VisionEncoderDecoderModel is an abstraction in Python hf/transformers that combines a pair of vision and text models into a single model, and Bumblebee does not support it at the moment. The underlying models are Mbart (supported) and DonutSwin, which is basically Swin (not supported). So the first step would be implementing Swin, and then the abstractions on top.
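For reference, this is roughly what the `:module` and `:architecture` options from the error message look like when the module already exists in Bumblebee. It is only a sketch: the repository name below is an assumed example of a plain Mbart checkpoint and does not come from this thread, and passing these options does not make the Donut checkpoint loadable, since there is no Swin or vision-encoder-decoder module to point them at.

```elixir
Mix.install([{:bumblebee, "~> 0.5"}])

# Assumed example repository with a plain Mbart checkpoint (not from this thread).
repo = {:hf, "facebook/mbart-large-cc25"}

# Skip class-name detection by naming the Bumblebee module and
# architecture explicitly. Only the config is fetched here, not the weights.
{:ok, spec} =
  Bumblebee.load_spec(repo,
    module: Bumblebee.Text.Mbart,
    architecture: :for_conditional_generation
  )

IO.inspect(spec.architecture)
```

`Bumblebee.load_model/2` accepts the same two options, so once a Swin module exists it could be selected in the same way.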

@bosko was looking into this in "How to use Swin and Donut models with Bumblebee?". There were tricky parts with the model architecture, and I am not sure where this landed.

As for general notes on implementing Bumblebee models, you can read this post and look at PRs adding other models.

bosko · July 24, 2024, 5:15pm

Unfortunately, I am still struggling with Swin.

I still have to fix or implement some parts. For example, attention in Swin uses a relative position bias table, and I haven’t yet figured out how to do that in Bumblebee.
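As a rough illustration of the relative position bias lookup, here is a minimal sketch in plain Nx. It follows the index computation used by the reference Swin implementation, but the module and function names are hypothetical, nothing here is Bumblebee API, and wiring the table in as a trainable Axon parameter is not shown.

```elixir
Mix.install([{:nx, "~> 0.7"}])

defmodule RelativePositionBiasSketch do
  # Hypothetical helper module for illustration, not part of Bumblebee.

  # Returns an {n, n} integer tensor (n = wh * ww) that maps every pair of
  # positions inside a wh x ww window to a row of the bias table.
  def index(wh, ww) do
    n = wh * ww

    # Row and column coordinate of each position in the flattened window.
    rows = Nx.iota({wh, ww}, axis: 0) |> Nx.reshape({n})
    cols = Nx.iota({wh, ww}, axis: 1) |> Nx.reshape({n})

    # Pairwise offsets, shifted so that the smallest offset becomes 0.
    rel_rows =
      rows |> Nx.new_axis(1) |> Nx.subtract(Nx.new_axis(rows, 0)) |> Nx.add(wh - 1)

    rel_cols =
      cols |> Nx.new_axis(1) |> Nx.subtract(Nx.new_axis(cols, 0)) |> Nx.add(ww - 1)

    # Combine both offsets into one index in 0..(2 * wh - 1) * (2 * ww - 1) - 1.
    rel_rows |> Nx.multiply(2 * ww - 1) |> Nx.add(rel_cols)
  end

  # table has shape {(2 * wh - 1) * (2 * ww - 1), num_heads}.
  # Returns a {num_heads, n, n} tensor to add to the attention scores.
  def bias(table, wh, ww) do
    n = wh * ww
    {_rows, num_heads} = Nx.shape(table)
    flat_index = index(wh, ww) |> Nx.reshape({n * n})

    table
    |> Nx.take(flat_index, axis: 0)
    |> Nx.reshape({n, n, num_heads})
    |> Nx.transpose(axes: [2, 0, 1])
  end
end
```

The index depends only on the window size, so it can be computed once and reused. Only the table itself would need to be a learned parameter.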

Once Swin is done I hope implementing Donut will be much easier because Mbart is already supported. However, at the moment, I cannot estimate how much time I’ll need to finish Swin support.

jonatanklosko · July 24, 2024, 5:34pm

@bosko thanks for the update! And no pressure, I just wanted to outline all the context for reference.

UlfAnger · July 25, 2024, 3:36am

Thank you for the info and explanation. I will start by looking at the link and the PRs.