HuggingFace image format expected; converting "width x height" to "height x width"

I’m hoping to get some help gaining a fundamental understanding of the binary image format that is assumed by the models on Hugging Face that are now consumable by Bumblebee. This is to ensure I am converting image data from the Image library correctly.

My understanding of different in-memory formats

  • Data from libvips, which is what is used by Vix and therefore Image, is stored in “width by height” format with the bands interleaved (libvips calls them bands; other libraries call them channels). For an RGB image the data layout is r1g1b1r2g2b2..., representing scanlines (hence “width by height”).

  • Data for OpenCV is in “height by width” format (more closely matching a typical matrix), with the channels ordered BGR instead of RGB.

  • StbImage data is “height by width” with a channel order of RGB. This also appears to be the normal format for NumPy and PIL, so perhaps it is the canonical format for the HuggingFace models?
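To make the layout difference concrete, here is a small NumPy sketch (NumPy standing in for Nx, and using the “height by width” layout described above) showing how an RGB image interleaves its channels when flattened:

```python
import numpy as np

# A tiny 1-row, 2-column RGB image in (height, width, channels) layout,
# the format described above for StbImage / NumPy / PIL.
hwc = np.array([[[11, 12, 13],    # pixel (row 0, col 0): r, g, b
                 [21, 22, 23]]],  # pixel (row 0, col 1): r, g, b
               dtype=np.uint8)

# Flattened, the bytes interleave per pixel: r1 g1 b1 r2 g2 b2 ...
print(hwc.ravel().tolist())  # [11, 12, 13, 21, 22, 23]
```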


What image format is expected by HuggingFace models?

Above I speculate that it is “height by width” with a channel order of “RGB”. Is that correct? I assume a :u8 representation for most file formats, but float for some.

How to convert “width by height” to “height by width”?

Given an Nx tensor with data in “width by height” order, how do I convert that data into “height by width” format? This is what I need to do to convert images from the libvips memory model to the NumPy / Nx layout.
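One way this could work (a sketch in NumPy, assuming the flat buffer really is in “width by height” order): interpret the buffer as a `(width, height, channels)` tensor, then swap the first two axes to get `(height, width, channels)`. I believe the analogous Nx call would be `Nx.transpose(tensor, axes: [1, 0, 2])`, but I have not verified that against Bumblebee’s expectations:

```python
import numpy as np

# Flat RGB buffer for a 2-wide, 3-tall image stored column-first
# ("width by height"): 2 * 3 * 3 = 18 bytes.
width, height, channels = 2, 3, 3
flat = np.arange(width * height * channels, dtype=np.uint8)

# Interpret the buffer as (width, height, channels), then swap the
# first two axes to get the (height, width, channels) layout.
whc = flat.reshape(width, height, channels)
hwc = whc.transpose(1, 0, 2)

print(hwc.shape)  # (3, 2, 3)
```

Note that `transpose` changes the strides rather than copying; a later `np.ascontiguousarray(hwc)` (or serializing the tensor) would materialize the reordered bytes.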

What channel format is assumed by the HuggingFace models?

Is it RGB (as it is for NumPy) or is it BGR (as it is for OpenCV)?
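If a model did turn out to expect BGR, the fix would just be a reverse along the channel axis. A NumPy sketch (the Nx equivalent would, I think, be something like `Nx.reverse(tensor, axes: [2])`, though that is an assumption):

```python
import numpy as np

# One RGB pixel in (height, width, channels) layout.
rgb = np.array([[[255, 128, 0]]], dtype=np.uint8)

# Reversing the last (channel) axis converts RGB <-> BGR.
bgr = rgb[..., ::-1]

print(bgr.tolist())  # [[[0, 128, 255]]]
```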