Hey, selvakumarcts/sk_invoice_receipts is a fine-tuned Donut model. It is actually a bit tricky: VisionEncoderDecoderModel is an abstraction in Python hf/transformers for combining a vision model and a text model into a single model, and Bumblebee does not support it at the moment. The underlying models are Mbart (supported) and DonutSwin (not supported), which is basically Swin. So the first step would be implementing Swin, and then the abstractions on top.
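To make the dependency chain concrete, here is a rough sketch in plain Python (no transformers needed). The `config` dict is a hand-written, abbreviated stand-in for what I'd expect the checkpoint's config.json to look like, not a copy of the real file, and `SUPPORTED` and `missing_parts` are hypothetical names that only encode what is said above about Bumblebee support.

```python
# Hand-written, abbreviated stand-in for the checkpoint's config.json
# (assumed shape: a wrapper config with nested encoder/decoder configs).
config = {
    "model_type": "vision-encoder-decoder",
    "encoder": {"model_type": "donut-swin"},
    "decoder": {"model_type": "mbart"},
}

# Hypothetical set: only reflects the support status described in this post.
SUPPORTED = {"mbart"}


def missing_parts(config, supported):
    """Return (role, model_type) pairs that still need an implementation."""
    parts = [("wrapper", config["model_type"])]
    parts += [(role, config[role]["model_type"]) for role in ("encoder", "decoder")]
    return [(role, kind) for role, kind in parts if kind not in supported]


missing = missing_parts(config, SUPPORTED)
for role, kind in missing:
    print(f"{role}: {kind} is not implemented yet")
```

Under those assumptions the decoder is already covered, and what is left is the Swin-style encoder plus the wrapper that glues the two together.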
@bosko was looking into this (How to use Swin and Donut models with Bumblebee?). There were some tricky parts with the model architecture, and I'm not sure where this landed.
As for general notes on implementing Bumblebee models, you can read this post and look at PRs adding other models.