How many processes can I create on one CPU core?

I am new to Elixir and I want to confirm one thing. Let's say I have a VPS with 2 GB of RAM and 1 CPU core.

I have a CSV file with 2,000,000 records. To process it I figure I need, say, 10 processes so that the work goes faster. This is how I imagine I can process the CSV faster:

  1. Write a program that breaks the CSV into 1000 parts.

  2. Process each part of 2000 items with one Elixir process, since I have 10 processes.

Remember, I have 1 CPU core. How many processes can I create on one CPU core?

As many as you like (almost: there are limits in the VM, but you can raise them when you need to). However, remember that more processes do not mean faster processing (concurrency != parallelism). To be honest, extra processes can often slow you down, because the VM has to context-switch between them and you lose cache locality (i.e. the data you are processing gets evicted from the cache to make room for the data another process needs, which results in delays).
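To get a feel for those limits, here is a minimal sketch that asks the running VM for its process limit and scheduler count, then spawns a batch of idle processes. The count of 10,000 is an arbitrary number picked for illustration, not something from the question.

```elixir
# Ask the running VM about its process limit and scheduler count.
process_limit = :erlang.system_info(:process_limit)
schedulers = System.schedulers_online()
IO.puts("process limit: #{process_limit}, schedulers online: #{schedulers}")

# Spawn many idle processes; each one just waits for a :stop message.
# 10_000 is an arbitrary illustrative number.
count = 10_000

pids =
  for _ <- 1..count do
    spawn(fn ->
      receive do
        :stop -> :ok
      end
    end)
  end

alive = Enum.count(pids, &Process.alive?/1)
IO.puts("spawned #{count} processes, #{alive} still alive")

# Clean up so the idle processes do not linger.
Enum.each(pids, &send(&1, :stop))
```

If you ever do hit the limit, it can be raised with the `+P` emulator flag, for example `elixir --erl "+P 1000000" script.exs` (the script name is a placeholder).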

So for processing a single CSV file on a single-core machine I would go with a single process, as it should be not only easier but faster as well.
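A minimal sketch of that single-process approach, streaming the file line by line so it never has to fit in memory at once. The module name, the sample file, and the per-row work (summing an integer in the second column) are all made up for illustration, and the naive comma split does not handle quoted fields.

```elixir
defmodule SingleProcessCsv do
  # Hypothetical per-row work: sum the integer found in the second column.
  # Naive comma split; real CSV with quoted fields needs a proper parser.
  def sum_second_column(path) do
    path
    |> File.stream!()
    # Skip the header row.
    |> Stream.drop(1)
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Stream.map(fn line ->
      [_id, amount | _rest] = String.split(line, ",")
      String.to_integer(amount)
    end)
    |> Enum.sum()
  end
end

# Tiny sample file so the sketch can be run as-is.
path = Path.join(System.tmp_dir!(), "single_process_sample.csv")
File.write!(path, "id,amount\n1,10\n2,20\n3,30\n")

total = SingleProcessCsv.sum_second_column(path)
IO.puts("total: #{total}")
```

Everything here runs in the calling process; the stream only controls how much of the file is held in memory at a time.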

The number of processes you can create is not limited by the number of CPU cores. The number of processes that can actually run at exactly the same time, however, does depend on the number of CPU cores. But it is not something you usually have to think about very deeply: processes often have to wait on IO (reading/writing files) or on other calls to the operating system or the outside world. The scheduler knows when this happens and uses that time to switch to another process that is not currently waiting. So we usually run many more processes than we have CPU cores, and end up with a system that is still much faster than one limited to as many processes as there are cores.

In your case, many processes would be working on the same file, which means they would all be waiting on the same thing. Instead, it is more natural to have one process that extracts lines from the file and passes them on to a pool of workers that each work on individual lines. Those workers might pass their output on to other workers if there is more work to be done, until at some point you reach a stage where you want to combine the results, which probably means you are limited to a single process there again as well.
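That shape (one reader, a pool of workers, one combining step) can be sketched with the standard library's `Task.async_stream/3`. The module name, the batch size of 1,000, and the per-line work (counting fields) are placeholders chosen for illustration; treat this as a rough sketch rather than a tuned pipeline.

```elixir
defmodule PooledCsv do
  # Hypothetical per-line work; stands in for whatever each record needs.
  defp process_line(line) do
    line |> String.trim() |> String.split(",") |> length()
  end

  def count_fields(path, workers \\ System.schedulers_online()) do
    path
    # A single reader pulls lines lazily from the file.
    |> File.stream!()
    # Batch lines so each task gets a meaningful amount of work.
    |> Stream.chunk_every(1_000)
    # Pool of at most `workers` concurrent tasks, one batch per task.
    |> Task.async_stream(
      fn batch -> batch |> Enum.map(&process_line/1) |> Enum.sum() end,
      max_concurrency: workers,
      ordered: false
    )
    # Single combining stage in the calling process.
    |> Enum.reduce(0, fn {:ok, subtotal}, acc -> acc + subtotal end)
  end
end

# Tiny sample file so the sketch can be run as-is.
pooled_path = Path.join(System.tmp_dir!(), "pooled_sample.csv")
File.write!(pooled_path, "a,b,c\n1,2,3\n4,5,6\n")

field_count = PooledCsv.count_fields(pooled_path)
IO.puts("fields seen: #{field_count}")
```

On a one-core machine `System.schedulers_online()` will normally be 1, so the pool collapses to a single worker, which matches the advice above.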

I highly suggest using a higher-level library like Flow for this. It makes the hard decisions like ‘how many processes for each stage’ and ‘how to connect the stages’ for you (you can fine-tune them if you want, but the defaults are very sane).


Thanks. Regarding the question of splitting a CSV into 10 parts and using processes on a single core: is it guaranteed that the work will be done faster with processes?

I have tried using one process in Java and it's as slow as slow can be.

Thanks for the lengthy explanation. If I have a CSV, I want to create 1000 files with 2000 lines each. I would then create a process for each file. Is there something wrong with thinking about or solving the problem that way?

No, absolutely not.

If all you want is to quickly process a medium-sized CSV file, then Elixir isn’t the best choice. Either use pandas or use specialised tools like xsv (written in Rust).

Why? According to Elixir:

Elixir’s processes are isolated from one another, they do not share any memory and run concurrently. They are very lightweight and the BEAM is capable of running many thousands of them at the same time. That’s why Elixir exposes primitives for creating processes, communicating between them and various modules on process management.

Splitting the file and processing the pieces one at a time seems like it would move things along faster, unless I don't understand what “Elixir’s processes are isolated from one another, they do not share any memory and run concurrently” means.

The point you are missing is that you still have only one CPU. There is no magic in this world that will let you run things in parallel on a single CPU. In general, if you process linear data (like files), there is no point in doing so in more processes (system processes) than there are cores in your CPU. In Erlang terms: it does not pay off to process streams of data in more processes than you have schedulers enabled in your system (and it makes no sense to run more schedulers than you have physical cores).

So there is no magic in this world that will speed up processing that file by running the work in multiple threads on a single core.
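You can try this yourself with a rough experiment: restrict the VM to one scheduler to mimic a single-core machine, then time the same CPU-bound work done sequentially and spread over several processes. The module name, the toy workload, and the job sizes are made up for illustration, and timings will vary from machine to machine.

```elixir
defmodule CpuBound do
  # Deliberately CPU-heavy toy work: sum of squares from 1 to n.
  def work(n), do: Enum.reduce(1..n, 0, fn i, acc -> acc + i * i end)

  # All jobs in the calling process, one after another.
  def sequential(jobs), do: Enum.map(jobs, &work/1)

  # One process per job, results collected in order.
  def concurrent(jobs) do
    jobs
    |> Task.async_stream(&work/1, max_concurrency: length(jobs), timeout: :infinity)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# Restrict the VM to one scheduler to mimic a single-core machine.
# system_flag returns the previous value so it can be restored afterwards.
previous = :erlang.system_flag(:schedulers_online, 1)

jobs = List.duplicate(500_000, 8)
{seq_us, seq} = :timer.tc(fn -> CpuBound.sequential(jobs) end)
{con_us, con} = :timer.tc(fn -> CpuBound.concurrent(jobs) end)

IO.puts("sequential:  #{div(seq_us, 1000)} ms")
IO.puts("8 processes: #{div(con_us, 1000)} ms")

# Restore the original scheduler count.
:erlang.system_flag(:schedulers_online, previous)
```

With a single scheduler the two timings should come out in the same ballpark, since the eight processes simply take turns on the one scheduler. Restoring the scheduler count on a multi-core machine and rerunning is where the multi-process version can pull ahead.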


Thanks. I thought there was a free lunch in this joint with the notion of processes. So I don’t think Elixir offers any solution to my specific problem.

Your problem is absolutely solvable in Elixir, and I’d say with far fewer lines than in many other languages.

But the benefits of Erlang’s BEAM VM and OTP itself will not shine on a single-core system, as @hauleth said.

In this particular case I’d only recommend you use Elixir if: (1) you want the code for that task to be yours and thus small and readable and (2) are willing to learn for a bit.

However, if you don’t insist on code ownership then I think it’s much wiser to find an external tool that does what you want and utilize that.


No, there is no such thing as a free lunch. You cannot eat two apples at the same time, because you are limited by the number of mouths you have.

However, as I said, check out xsv, as it is based on Andrew Gallant’s awesome csv library. That should be fast enough (at the end of the article he tests his code against this file, which also has over 2M records).


Being very painfully pedantic (very precise in the use of terminology), there is a difference between “concurrency” and “parallelism”. “Concurrency” means structuring a solution as more than one independent process, composed together. You can create concurrent solutions and execute them on a system with only a single CPU.

Parallelism refers to the ability to execute two or more concurrent processes simultaneously. You must have more than one processing core to execute two processes in parallel.

Erlang is built for concurrency and will run concurrent solutions (even with a single CPU). Given multiple execution cores, it can also execute processes in parallel.


I believe that is what @hauleth is saying in his posts: while this work can be done concurrently on a single core, it will be slower than the single-threaded approach because there is no chance of parallelism. Because of this, you would instead need to improve single-threaded performance to complete the task sooner.
