Data storage for: machine learning / data mining

  1. I know about Elixir Nx. That’s not what I’m asking about.

  2. For many problems, “dumb algorithm + huge dataset” beats “clever algorithm + tiny dataset”.

  3. In the “dumb algorithm + huge dataset” model, we can informally divide into “data preparation” and “training”.

For “training”, Elixir Nx can be very useful.

I’m more curious about “data preparation” – collecting the dataset, figuring out how to store the dataset (which distributed DB to use – a huge dataset often implies more than one machine), cleaning the dataset (data can be noisy), and figuring out how to serve the dataset with low latency and high throughput (for training).

For those who have used Elixir for the “data preparation” part of “dumb algorithm + huge dataset” type problems, can you share insights on experiences / workflows / libraries used?
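For context on what the “cleaning” part can look like in plain Elixir, here is a minimal sketch using only the standard library’s lazy `Stream` module – no Flow, Broadway, or Nx. The record layout (`id`, `value`) and the cleaning rules are invented for illustration:

```elixir
defmodule Prep do
  # Parse one raw line into {:ok, map} or :error. The two-field
  # CSV-ish format here is a made-up example, not a real schema.
  def parse(line) do
    case String.split(String.trim(line), ",") do
      [id, value] ->
        case Integer.parse(value) do
          {v, ""} -> {:ok, %{id: id, value: v}}
          _ -> :error
        end

      _ ->
        :error
    end
  end

  # Lazily clean an enumerable of raw lines: drop rows that fail to
  # parse, drop negative values as "noise", and deduplicate by id.
  # Because everything is a Stream, this works unchanged whether the
  # source is a small list or File.stream!/1 over a file bigger than RAM.
  def clean(lines) do
    lines
    |> Stream.map(&parse/1)
    |> Stream.filter(&match?({:ok, _}, &1))
    |> Stream.map(fn {:ok, row} -> row end)
    |> Stream.reject(fn %{value: v} -> v < 0 end)
    |> Stream.uniq_by(& &1.id)
  end
end

Prep.clean(["a,1", "garbage", "b,-5", "a,1", "c,2"]) |> Enum.to_list()
# => [%{id: "a", value: 1}, %{id: "c", value: 2}]
```

The same shape scales up by swapping the list for `File.stream!/1` or a DB cursor, and swapping `Stream` for `Flow` when you want the stages parallelised.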



Probably doesn’t fit your description of “huge datasets” because it all fit into a single DB (millions of rows), but I’ve used Elixir to sync between external DBs, clean up that data as it was coming in, store it, and then do all sorts of processing on it to generate another couple million records.

The bottleneck in that case was always the DB/IO (on AWS RDS, especially the IOPS) – whenever the DB could handle it, the instance running the “pipeline” was maxed out at 90–100% CPU utilisation across 4 cores.
While there are probably plenty of languages that would be faster at the actual data preparation, I’m not sure they would be as easy to model and keep such a pipeline running – plenty of DB timeouts when doing certain aggregations while streaming a lot of records in and out, dependencies between different parts of the pipeline, etc.
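The “streaming a lot of records in and out” shape usually looks something like the sketch below in Ecto. `MyApp.Repo`, the table names, and `clean_row/1` are all stand-ins, not code from the post; the one real gotcha it illustrates is that `Repo.stream/2` must run inside a transaction, and it is that transaction’s timeout that bites on long aggregations, so it is raised explicitly:

```elixir
import Ecto.Query

MyApp.Repo.transaction(
  fn ->
    from(r in "records", select: %{id: r.id, value: r.value})
    # Cursor-based streaming: fetches 500 rows per round trip
    |> MyApp.Repo.stream(max_rows: 500)
    # clean_row/1 is a hypothetical cleaning function
    |> Stream.map(&clean_row/1)
    # Batch the writes so insert_all isn't called per row
    |> Stream.chunk_every(1_000)
    |> Enum.each(&MyApp.Repo.insert_all("clean_records", &1))
  end,
  # The whole stream lives inside this transaction, so the default
  # 15s DBConnection timeout would kill long runs
  timeout: :infinity
)
```

Batching the writes and bounding `max_rows` keeps memory flat, but the transaction still holds a DB connection for its whole duration – which is exactly why the DB, not the BEAM, ends up as the bottleneck.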

It would probably have been “easy” to split the workload across 3 or 4 machines (if the DB hadn’t been the bottleneck), as the orchestrating parts were all globally identified gen_servers, but I never got to do it.
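For readers who haven’t used that pattern: a “globally identified gen_server” just means registering under a `{:global, name}` tuple, so callers on any connected node reach it the same way and the workload can later be spread across machines without changing call sites. A minimal sketch (the module name and queue-of-batches state are invented for illustration):

```elixir
defmodule Pipeline.Orchestrator do
  use GenServer

  def start_link(opts \\ []) do
    # {:global, name} registers cluster-wide instead of node-local;
    # only one instance can hold the name across all connected nodes.
    GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__})
  end

  # Callers on any node in the cluster address it identically:
  def enqueue(batch), do: GenServer.call({:global, __MODULE__}, {:enqueue, batch})
  def pending, do: GenServer.call({:global, __MODULE__}, :pending)

  @impl true
  def init(_opts), do: {:ok, :queue.new()}

  @impl true
  def handle_call({:enqueue, batch}, _from, q), do: {:reply, :ok, :queue.in(batch, q)}
  def handle_call(:pending, _from, q), do: {:reply, :queue.len(q), q}
end
```

The trade-off versus local registration is that `:global` resolves names via cluster-wide consensus, so it is best for a handful of coarse orchestrators like this, not per-record processes.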