When I run the axon/examples/vision/mnist example (after updating plugins and setting the EXLA backend), the 5 training epochs take ~30 seconds on each of two different computers: one is CPU-only and the other has an NVIDIA RTX 3060 GPU. nvidia-smi shows about 40% GPU utilization during training. I expected the GPU machine to be faster. Do I have something configured wrong? Thanks!

Log output from the GPU machine:
[info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[info] XLA service 0x7f7c68297ba0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[info] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
[info] Using BFC allocator.
[info] XLA backend allocating 10627212902 bytes on device 0 for BFCAllocator.
[info] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.