could not match the class name "VisionEncoderDecoderModel" to any of the supported models, please specify the :module and :architecture options
As I understand it, I have to implement the VisionEncoderDecoderModel module, but I don’t know how. Is there a tutorial you can recommend for this kind of implementation? I don’t know if this is important, but config.json has _name_or_path = “naver-clova-ix/donut-base”.
Hey, selvakumarcts/sk_invoice_receipts is a fine-tuned Donut model, and this is actually a bit tricky. VisionEncoderDecoderModel is an abstraction in Python hf/transformers for combining a vision model and a text model into a single model, and Bumblebee does not support it at the moment. The underlying models are Mbart (supported) and DonutSwin (basically Swin, not supported). So the first step would be implementing Swin, and then the abstraction on top.
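To see which two models a VisionEncoderDecoderModel checkpoint wraps, you can look at the nested sections of its config.json. Here is a minimal Python sketch, assuming the usual hf/transformers layout where `encoder` and `decoder` each carry their own `model_type`. The inline JSON is an illustrative stand-in with assumed values, not the actual file from the repo, and `submodel_types` is a hypothetical helper name.

```python
import json

# Illustrative stand-in for a VisionEncoderDecoderModel config.json.
# The values are assumptions; check the real file in the model repo.
EXAMPLE_CONFIG = """
{
  "architectures": ["VisionEncoderDecoderModel"],
  "model_type": "vision-encoder-decoder",
  "encoder": {"model_type": "donut-swin"},
  "decoder": {"model_type": "mbart"}
}
"""


def submodel_types(config_text):
    """Return (encoder_type, decoder_type) from a config.json string."""
    config = json.loads(config_text)
    return config["encoder"]["model_type"], config["decoder"]["model_type"]


encoder_type, decoder_type = submodel_types(EXAMPLE_CONFIG)
print(f"encoder: {encoder_type}, decoder: {decoder_type}")
```

Each of those two types needs its own implementation in Bumblebee before the combined model can be loaded.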
I still have to fix or implement some parts (for example, attention in Swin uses a relative position bias table, and I still haven’t figured out how to do that in Bumblebee).
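For reference, here is a rough pure-Python sketch of how the relative position bias lookup is indexed in the Swin design, as I understand it from the reference Python implementation. Within a window of size Wh x Ww, the learned table has (2Wh - 1)(2Ww - 1) rows (one per possible 2D offset), and each query/key pair of positions maps to one of those rows. The function names are hypothetical, and this only shows the indexing, not how it would be expressed in Bumblebee.

```python
def bias_table_rows(window_height, window_width):
    """Number of distinct 2D offsets between two positions in one window."""
    return (2 * window_height - 1) * (2 * window_width - 1)


def relative_position_index(window_height, window_width):
    """For every (query, key) pair in a window, return its bias table row."""
    coords = [
        (row, col)
        for row in range(window_height)
        for col in range(window_width)
    ]
    index = []
    for query_row, query_col in coords:
        line = []
        for key_row, key_col in coords:
            # Shift the offsets so they start at 0 instead of -(size - 1).
            d_row = query_row - key_row + window_height - 1
            d_col = query_col - key_col + window_width - 1
            # Flatten the 2D offset into a single row number.
            line.append(d_row * (2 * window_width - 1) + d_col)
        index.append(line)
    return index


for line in relative_position_index(2, 2):
    print(line)
```

The attention layer then gathers rows of the learned table with this index (once per head) and adds the result to the attention scores before the softmax. The index depends only on the window size, so it can be computed once and reused.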
Once Swin is done, I hope implementing Donut will be much easier, because Mbart is already supported. At the moment, however, I cannot estimate how much time I’ll need to finish Swin support.