Anybody using Elixir/Erlang to create datasets for machine learning?

Sometimes it is kind of frustrating to use mostly Python-based tools to create a dataset. You have so many cores on machines for deep learning, and it takes forever to create a dataset if you can’t leverage them. For instance, I have been running this dataset creation task for more than a day and it only utilizes 400% CPU while there are 24 physical CPUs on the server.

Sometimes it is kind of frustrating to use mostly Python-based tools to create a dataset. You have so many cores on machines for deep learning, and it takes forever to create a dataset if you can’t leverage them.

I have no idea what “create dataset” means here. This is coming from a statistician/programmer who does sampling, collecting, cleaning, modeling, and web scraping.

More or less, you feed your neural network to produce a dataset.

Here are some open datasets

Hi :slight_smile:

I am also really interested in any answer regarding this topic. I have personally never done data collection on a large scale using Erlang, but I will have to in the near future. So if any experts out there are willing to share their experience, it would be much appreciated :slight_smile:

Regarding your example, if I understand correctly, you have a server rack with 24 processors, and I will assume they are 6-core/12-thread each if you have recent Xeon hardware. So 24 * 2 * 6 = 288 means that, under my assumption, you have 288 logical cores available for your task (the exact number does not actually matter).

So the maximum number of parallel processes that you could efficiently leverage is 288, while 400% usage suggests that only 4 of them are busy. I have no experience whatsoever with ML or dataset processing, but whatever task you are doing, unused CPU cores may be an indicator of non-concurrent parts in your program. I would suggest checking whether your data collection job allows at least N concurrent processes to run simultaneously, where N = (physical CPU cores) x 2.
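Since the original job is Python-based, here is a minimal sketch of what "N concurrent processes" could look like there, using the standard library's `multiprocessing.Pool`. It assumes each record can be processed independently of the others; `make_example` and `build_dataset` are hypothetical names standing in for whatever the real per-record step is, not anything from the actual pipeline.

```python
import os
from multiprocessing import Pool


def make_example(raw):
    """Hypothetical CPU-bound step: turn one raw record into one dataset row."""
    return sum(i * i for i in range(raw % 1000))


def build_dataset(raw_records, workers=None):
    """Process records in parallel, one worker process per logical core by default."""
    workers = workers or os.cpu_count() or 1
    with Pool(processes=workers) as pool:
        # chunksize batches records so inter-process overhead stays small
        return pool.map(make_example, raw_records, chunksize=64)


if __name__ == "__main__":
    rows = build_dataset(range(10_000))
    print(len(rows), "rows built using up to", os.cpu_count(), "logical cores")
```

If CPU usage still sits around 400% with something like this, the bottleneck is probably elsewhere (disk I/O, a serial loading step, or a library that limits its own thread count) rather than in the number of worker processes.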

Normally you can always reach full CPU load if the Erlang VM is allowed to run on all the cores of your machine.

Hope this helps,

Igor

More or less, you feed your neural network to produce a dataset.

Oh… is that the GAN thing?

I don’t think so. A GAN pits 2 neural networks against each other…

One generates data, the other tries to detect whether the data is fake. Training ends when the checking network can no longer differentiate between generated data and real data.

BTW there is this very impressive video about GAN…

Ah okay. I was thinking that one of the NNs in a GAN generates data to trick the other one, but that’s only the training part; it’s not necessarily for generating data for the sake of it.

Do you know what NN architecture the data-generating one uses?

I am sorry but I don’t know…

I have been interested in DL since Google Alpha(Go, Master, Zero) was able to defeat top human players at the game of Go, and top dedicated computer programs at the game of Chess.

I have been using Python recently to see what could be done.

Most Python libraries for DL are wrappers around C/C++, so I think it could be doable in Elixir too.

There is also tensorflex :slight_smile: