I’m training simple feed-forward neural networks on CPU, and I often see memory usage climb above 128 GB of RAM.
Is this normal? The training data is a 70000x4000 tensor; when it is created I see a bump in memory usage. But once the training loop starts, memory keeps increasing until an out-of-memory condition is reached. Should memory keep rapidly increasing during training?
My code seems simple:
{train_data, test_data} =
data
|> Nx.tensor(type: :u8)
|> Nx.divide(255.0)
|> Nx.reshape({count, 4000})
|> Nx.to_batched(batch_size, leftover: :discard)
|> Enum.split(train_batches_count)
{train_labels, test_labels} =
series["label"]
|> Explorer.Series.to_tensor()
|> Nx.new_axis(-1)
|> Nx.equal(Nx.tensor([0, 1]))
|> Nx.to_batched(batch_size, leftover: :discard)
|> Enum.split(train_batches_count)
optimizer = Polaris.Optimizers.adamw(learning_rate: learning_rate)
model_params =
model
|> Axon.Loop.trainer(:categorical_cross_entropy, optimizer)
|> Axon.Loop.run(Stream.zip(train_data, train_labels), %{}, epochs: 5, compiler: EXLA)
I believe you might be missing Nx.default_backend(EXLA.Backend)
Your data occupies around 280MB, if I calculated this right. If you didn’t set the default backend, that’ll be allocated in Nx.BinaryBackend. Then, at each iteration of the loop, Nx will copy the data over to EXLA due to your choice of compiler, at least doubling the memory usage.
Then, you might be running into the GC not doing its work fast enough.
Actually, the data occupies around 1GB by itself according to my calculations; I’d missed the implicit f32 conversion that Nx.divide performs.
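For reference, the back-of-envelope math behind those two estimates (assuming the 70000x4000 shape from the original post):

elements = 70_000 * 4_000
# As :u8, one byte per element:
elements          # 280_000_000 bytes, roughly 280 MB
# Nx.divide/2 promotes :u8 to :f32, four bytes per element:
elements * 4      # 1_120_000_000 bytes, roughly 1.12 GB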
My notebook starts with
Nx.Defn.default_options(compiler: EXLA)
Nx.global_default_backend(EXLA.Backend)
I can post the whole thing, but I see this frequently. How can I debug it and determine whether it’s a memory leak, so I at least know it’s a real issue?
Ah right. I thought that was the whole Nx code.
You could add a Stream.map call that runs :erlang.garbage_collect() at least before every epoch; that will rule out the possibility of the GC being too slow.
Also, which EXLA version are you using?
Everything is on the latest version, exla 0.7.1.
Sorry, I don’t follow: where should I put the Stream.map?
Put it in your Axon.Loop data input! That is, wrap the current input Enumerable and pass the wrapped stream instead.
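A minimal sketch of that wrapping, reusing the train_data/train_labels and model variables from the earlier snippet (the per-batch GC placement here is an assumption, not a tested recipe; per-epoch would require a custom event handler instead):

# Hypothetical: force a GC pass before each batch is yielded
gc_stream =
  Stream.zip(train_data, train_labels)
  |> Stream.map(fn batch ->
    :erlang.garbage_collect()
    batch
  end)

model
|> Axon.Loop.trainer(:categorical_cross_entropy, optimizer)
|> Axon.Loop.run(gc_stream, %{}, epochs: 5, compiler: EXLA)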
Thank you @polvalente for mentioning garbage collection.
If I set the :garbage_collect option to true in Axon.Loop.run, memory usage stays low and doesn’t increase. If I don’t set it, memory grows continuously during training.
This solves the issue.
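For anyone landing here later, the fix plugs into the training snippet from the original post like this (a sketch; garbage_collect: true is the only change from the code above):

model_params =
  model
  |> Axon.Loop.trainer(:categorical_cross_entropy, optimizer)
  |> Axon.Loop.run(Stream.zip(train_data, train_labels), %{},
    epochs: 5,
    compiler: EXLA,
    garbage_collect: true
  )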