XLA reserves memory upfront and then allocates within that reservation as needed. This behaviour can be customized with the client options `preallocate: false` or `:memory_fraction`. However, I don't think this will help with the OOM error.
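For reference, those client options are normally set in the application config. A minimal sketch, assuming a client named `:cuda` and an arbitrary fraction of `0.5` (both are illustrative, pick whatever matches your setup):

```elixir
# config/config.exs
import Config

config :exla, :clients,
  # Assumed client name. Use either option, depending on what you want:
  #   preallocate: false     -> don't reserve GPU memory upfront
  #   memory_fraction: 0.5   -> reserve only this share of GPU memory (example value)
  cuda: [platform: :cuda, preallocate: false],
  # Host (CPU) client, handy for keeping parameters off the GPU.
  host: [platform: :host]
```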
We have yet to do more optimisations for Stable Diffusion, but there are two things you can try:
- Load the parameters onto the CPU with `Bumblebee.load_model(..., backend: {EXLA.Backend, client: :host})`
- Enable lazy transfers in the serving defn options: `defn_options: [compiler: EXLA, lazy_transfers: :always]`
This way, instead of placing all parameters on the GPU, they will be transferred as needed.
Also make sure to try with batch size 1.
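Put together, the options would look roughly like this. This is only a sketch of the option lists: the variable names are made up, and the `:compile` shape (including the `sequence_length` value) is an assumption about how your serving is configured, so adjust it to the serving you actually build.

```elixir
# Hypothetical variable names; only the option values come from the advice above.

# 1. Pass as the options to each Bumblebee.load_model/2 call so the
#    parameters are loaded on the host (CPU) client instead of the GPU.
load_opts = [backend: {EXLA.Backend, client: :host}]

# 2. Pass to the serving so parameters are transferred to the GPU lazily.
serving_opts = [
  # Assumed compile shape: batch size 1, sequence length is an example value.
  compile: [batch_size: 1, sequence_length: 60],
  defn_options: [compiler: EXLA, lazy_transfers: :always]
]
```

The trade-off is extra host-to-device transfers at run time, which is the price of keeping the full parameter set out of GPU memory.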