Using "selvakumarcts/sk_invoice_receipts" with Bumblebee

Hello,

I am trying to use the model selvakumarcts/sk_invoice_receipts from Hugging Face with Bumblebee, but when I try to load the model with

 Bumblebee.load_model({:hf, "selvakumarcts/sk_invoice_receipts"})

I get the error

 could not match the class name "VisionEncoderDecoderModel" to any of the supported models, please specify the :module and :architecture options

As I understand it, I have to implement the module VisionEncoderDecoderModel, but I don’t know how. Is there a tutorial you can recommend for doing the implementation? I don’t know if this is important, but in config.json, _name_or_path = “naver-clova-ix/donut-base”.

Thanks,
Ulf

Hey, selvakumarcts/sk_invoice_receipts is a fine-tuned Donut model. This is actually a bit tricky: VisionEncoderDecoderModel is an abstraction in the Python hf/transformers library for combining a vision model and a text model into a single model, and Bumblebee does not support it at the moment. The underlying models are Mbart (supported) and DonutSwin (basically Swin, not supported). So the first step would be implementing Swin, and then the abstraction on top.
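For context, this is roughly what the :module and :architecture options from the error message look like. It is only a sketch, and it assumes a plain Mbart checkpoint ("facebook/mbart-large-cc25" is picked purely for illustration and the version constraint is a guess). The options choose among modules Bumblebee already implements, so they would not make the Donut checkpoint load while a Swin module is missing.

```elixir
# Assumption: the version constraint is illustrative.
Mix.install([{:bumblebee, "~> 0.5"}])

# Assumption: "facebook/mbart-large-cc25" is used only as an example of a
# checkpoint whose architecture Bumblebee supports. load_spec only reads the
# configuration, so this does not download the model parameters.
{:ok, spec} =
  Bumblebee.load_spec({:hf, "facebook/mbart-large-cc25"},
    # Explicitly name the Bumblebee module and architecture instead of
    # relying on the class name stored in config.json.
    module: Bumblebee.Text.Mbart,
    architecture: :for_conditional_generation
  )

IO.inspect(spec.architecture)
```

The same two options are accepted by Bumblebee.load_model, which is where the error message points.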

@bosko was looking into this (How to use Swin and Donut models with Bumblebee?). There were some tricky parts with the model architecture, and I'm not sure where this landed.

As for general notes on implementing Bumblebee models, you can read this post and look at PRs adding other models.


Unfortunately, I am still struggling with Swin :(

I still have to fix or implement some parts. For example, attention in Swin uses a relative position bias table, which I still haven’t figured out how to implement in Bumblebee.
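To make the relative position bias part concrete, here is a minimal sketch in plain Nx. It assumes the indexing scheme of the reference Python Swin implementation: every (query, key) pair inside a window maps to one row of a table of shape {(2h-1)*(2w-1), heads}. The module name RelativePositionSketch and the dummy table are hypothetical and only for illustration; this is not Bumblebee code, and in a real model the table would be a trainable parameter.

```elixir
# Assumption: the version constraint is illustrative.
Mix.install([{:nx, "~> 0.7"}])

defmodule RelativePositionSketch do
  # Hypothetical helper module, not part of Bumblebee.

  # Returns an {n, n} integer tensor (n = window_h * window_w) whose entry
  # [i, j] is the bias table row for query token i and key token j.
  def index(window_h, window_w) do
    n = window_h * window_w

    # Row and column coordinate of every token in the flattened window.
    rows = Nx.iota({window_h, window_w}, axis: 0) |> Nx.reshape({n})
    cols = Nx.iota({window_h, window_w}, axis: 1) |> Nx.reshape({n})

    # Pairwise offsets, shifted so that they start at 0.
    rel_rows = pairwise_diff(rows) |> Nx.add(window_h - 1)
    rel_cols = pairwise_diff(cols) |> Nx.add(window_w - 1)

    # Combine the 2D offset into a single row index of the table.
    rel_rows |> Nx.multiply(2 * window_w - 1) |> Nx.add(rel_cols)
  end

  # table has shape {(2h-1)*(2w-1), heads}; the result has shape
  # {heads, n, n} and can be added to the attention scores of one window.
  def bias(table, index) do
    {n, _} = Nx.shape(index)
    {_, heads} = Nx.shape(table)

    table
    |> Nx.take(Nx.flatten(index))
    |> Nx.reshape({n, n, heads})
    |> Nx.transpose(axes: [2, 0, 1])
  end

  defp pairwise_diff(coords) do
    Nx.subtract(Nx.new_axis(coords, 1), Nx.new_axis(coords, 0))
  end
end

index = RelativePositionSketch.index(2, 2)
# Dummy table for a 2x2 window and 3 heads: 9 rows, one column per head.
table = Nx.iota({9, 3}, type: :f32)
bias = RelativePositionSketch.bias(table, index)
IO.inspect(Nx.shape(bias)) # prints {3, 4, 4}
```

The index depends only on the window size, so it can be computed once and reused; only the table lookup has to be part of the model graph.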

Once Swin is done I hope implementing Donut will be much easier because Mbart is already supported. However, at the moment, I cannot estimate how much time I’ll need to finish Swin support.

@bosko thanks for the update! And no pressure, I just wanted to outline all the context for reference.


Thank you for the info and the explanation. I will start by looking at the link and the PRs.