How to execute Incremental PCA or similar batch tasks in Nx/Scholar?

Hello,

and thank you for the excellent libraries like Nx, Scholar, Axon…

We are in the process of transitioning some of our work from Python to Elixir. It’s heartening to see that many of our tasks can be accomplished in Elixir, thanks to Nx and the surrounding ecosystem. However, we have not yet found a way to process the data in partial fits/batches to conserve memory when performing tasks like PCA.

PCA in Scholar

Given our dataset, we would need to construct a tensor of s64[50000][10000] as input, which poses challenges for several reasons, not least that it is roughly 4 GB that would have to be materialized in memory at once.

In Python, we might use something like IncrementalPCA from scikit-learn to handle this. However, we’re unsure if we’ve overlooked a way to achieve this with Nx/Scholar.
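To make what we mean by partial/batch processing concrete, here is a rough sketch of accumulating per-batch statistics with plain Nx. This is not scikit-learn’s IncrementalPCA algorithm, just exact PCA computed from the column means and the d × d Gram matrix, and `batches` stands for a hypothetical enumerable that yields one chunk of rows at a time:

```elixir
defmodule BatchedPCA do
  # `batches` is any Enumerable of {batch_size, d} tensors, loaded chunk by chunk,
  # so the full n x d matrix never has to exist at once.
  def fit(batches, num_components) do
    {xtx, col_sum, n} =
      Enum.reduce(batches, {nil, nil, 0}, fn batch, {xtx, col_sum, n} ->
        batch = Nx.as_type(batch, :f32)
        gram = Nx.dot(Nx.transpose(batch), batch)
        sum = Nx.sum(batch, axes: [0])

        {
          if(xtx, do: Nx.add(xtx, gram), else: gram),
          if(col_sum, do: Nx.add(col_sum, sum), else: sum),
          n + Nx.axis_size(batch, 0)
        }
      end)

    mean = Nx.divide(col_sum, n)
    # Biased covariance estimate: E[x x^T] - mean mean^T (enough for PCA).
    cov = Nx.subtract(Nx.divide(xtx, n), Nx.outer(mean, mean))

    # Keep the eigenvectors belonging to the largest eigenvalues.
    {eigenvals, eigenvecs} = Nx.LinAlg.eigh(cov)
    top = Nx.argsort(eigenvals, direction: :desc) |> Nx.slice([0], [num_components])
    components = Nx.take(eigenvecs, top, axis: 1)

    {mean, components}
  end
end

# Hypothetical usage: project new rows onto the learned components.
# {mean, components} = BatchedPCA.fit(batches, 50)
# projected = Nx.dot(Nx.subtract(new_batch, mean), components)
```

This only ever holds a d × d matrix in memory (about 400 MB in f32 for d = 10,000), but it is not a drop-in replacement for IncrementalPCA, so we would still prefer a supported Scholar solution if one exists.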

We would appreciate any guidance or directions on this matter.
Thank you.

hey :wave:
I’m far from being an expert and I’m not familiar with PCA, so please take what I’m going to write with a pinch of salt.

Have you looked at Nx.Serving?
As far as I know, Nx.Serving can be used to batch requests and computations. Quoting the docs:

More specifically, servings are a mechanism to apply a computation on a Nx.Batch, with hooks for preprocessing input from and postprocessing output for the client. Thus we can think of an instance of Nx.Serving.t/0 (a serving) as something that encapsulates batches of Nx computations.

https://hexdocs.pm/nx/Nx.Serving.html

The Nx library comes with its own tensor serving abstraction, called Nx.Serving, allowing developers to serve both neural networks and traditional machine learning models within a few lines of code.
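Something like this minimal sketch (just doubling tensors, nothing PCA-specific; `Nx.Serving.jit/1`, `Nx.Batch.stack/1` and `Nx.Serving.run/2` come straight from those docs):

```elixir
defmodule Scale do
  import Nx.Defn

  # Any defn computation can be wrapped in a serving.
  defn double(x), do: Nx.multiply(x, 2)
end

serving = Nx.Serving.jit(&Scale.double/1)

# Requests (possibly from different callers) get stacked into one batch.
batch = Nx.Batch.stack([Nx.tensor([1, 2, 3]), Nx.tensor([4, 5, 6])])
Nx.Serving.run(serving, batch)
#=> s64[2][3] tensor with the doubled values
```

That said, as far as I understand it, Nx.Serving batches incoming requests rather than streaming one huge dataset through a computation, so it may or may not fit your PCA case.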

Hope it helps, let us know.
Cheers :v:

Thank you, I will take a closer look and let you know if I find a solution.
