Why is data deep-copied when sending to another process?

Hi everyone,

I’m just learning Elixir with the help of “Elixir in Action” by Sasa Juric. In Chapter 5, “Concurrency primitives”, there is something I don’t get. Maybe someone can help me?

It’s with regard to deep-copying data when sending it to another process.

“You may wonder about the purpose of shared-nothing concurrency. First, it simplifies the code of each individual process. Because processes don’t share memory, you don’t need complicated synchronization mechanisms such as locks and mutexes. Another benefit is overall stability: one process can’t compromise the memory of another.”

Why should there be a need for synchronization mechanisms? Data is immutable; it can’t be mutated by another process. So compromising data shouldn’t be possible.
What am I missing?

Thanks, Marcello

You’re missing that processes die, and when they do their heap is cleaned up. In other words: when state is shared, you need to make sure it’s only cleaned up once nobody needs it anymore. By copying data there’s no need to track which processes need access to a piece of data.

5 Likes

Yes, precisely that. However, what @marcello says is true mostly for small messages.

If a message contains binaries larger than 64 bytes (which is relatively small), those binaries are not copied; instead they are allocated in a designated shared area on the BEAM. So if you are sending large blobs between processes, they are not copied over.
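
A minimal sketch of that distinction (the sizes and the helper process are just for illustration): binaries under 64 bytes live on the sending process’s heap and are copied with the message, while larger ones are reference-counted (“refc”) binaries on the shared binary heap, so only a small reference crosses the process boundary.

```elixir
# Illustration only: heap binaries (< 64 bytes) are copied with the message;
# refc binaries (>= 64 bytes) live on the shared binary heap and only a
# reference is sent.
small = :binary.copy(<<0>>, 63)      # 63-byte heap binary, copied on send
large = :binary.copy(<<0>>, 1_000)   # 1 KB refc binary, shared rather than copied

pid =
  spawn(fn ->
    receive do
      bin -> IO.puts("got #{byte_size(bin)} bytes")
    end
  end)

send(pid, large)

# Process.info(pid, :binary) lists the refc binaries a process holds references
# to, which is one way to observe that `large` is shared rather than copied.
```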

7 Likes

Hey Marcello. To add on to what @LostKobrakai said, it also improves garbage collection performance while the processes are alive. Each process has its own heap that isn’t shared with anything, which allows each process to be garbage collected independently of any other. This is an important feature for soft realtime systems, because it means that small processes communicating with clients won’t be paused for long periods just because some other process loaded a ton of data from somewhere and will have a longer garbage collection.
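
As a small, hedged illustration of that independence (the spawned loop is only a placeholder worker): you can inspect or trigger garbage collection on a single process without touching any other process.

```elixir
# Sketch: each process owns its heap, so it can be inspected and collected on
# its own. The spawned process here is just a stand-in for a real worker.
pid =
  spawn(fn ->
    _ = Enum.map(1..50_000, &Integer.to_string/1)   # generate some garbage
    Process.sleep(:infinity)
  end)

:erlang.garbage_collect(pid)                        # collects only this process's heap
Process.info(pid, [:heap_size, :total_heap_size])   # per-process heap statistics
```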

3 Likes

Thanks for your replies! :slight_smile:

At first GC makes total sense to me, but I was wondering whether that’s worth copying the data. Also, as @hubertlepicki pointed out, binaries larger than 64 bytes are not copied and are kept on a special shared memory heap. So one could also check whether some data is still used by another process and, if so, not clean it up.

But now I remember that Erlang/OTP is designed for having thousands of processes, so checking for that may be a daunting task (also in terms of performance and responsiveness). Maybe that is a valid reason?

So, GC as an argument: I’m fine with that.
But compromising shared data between processes? Well, it’s immutable, isn’t it?
Synchronization? Again, it’s immutable…

Those don’t make sense to me as arguments. Are they wrong, or is there something else?

Thanks for helping me!

GC is enough of a reason to need to copy data. The data can be as immutable as you like; that doesn’t help if you can’t be sure whether it is still available or already gone. Also, the binary heap for example can quite quickly cause memory problems if a long-running process keeps a large binary (depending on the OTP version, even just a chunk of one) around, so that the large binary cannot be cleaned up and stays in memory.
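
A hedged sketch of that binary-heap problem (the sizes are made up): a small slice of a large refc binary still references the whole thing, so keeping the slice in a long-lived process’s state keeps the full binary alive; `:binary.copy/1` is the usual way out.

```elixir
# Illustration of the classic refc-binary leak: the 10-byte slice below still
# points into the full 10 MB binary, so the 10 MB cannot be reclaimed while
# the slice is held (for example in a GenServer's state).
big   = :binary.copy(<<0>>, 10_000_000)
slice = binary_part(big, 0, 10)   # sub-binary: keeps `big` alive
safe  = :binary.copy(slice)       # copies the 10 bytes out, letting `big` be reclaimed
```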

3 Likes

I’m no BEAM engineer but some speculation…

Aside from the deallocation issues there could also be a relocation issue (defragmenting the heap). Multiple references to the same memory location would make it difficult, if not impossible, to relocate the data, and sooner or later you could run into heap fragmentation issues.

But as I said I don’t even know if the BEAM attempts to do this.

Another possibility is the effect on memory latency. If all shared and non-shared values are stored exactly the same way, there shouldn’t be any difference. But I wouldn’t be surprised if the compiler could optimize more effectively on non-shared values.

Aside: The Value of Values

1 Like

This is what we mean by garbage collection. In order to do this globally you have to stop EVERY process when doing a garbage collection, like Java does. This is a problem.

Notably, the runtime will actually mutate underlying data in certain cases, if it can be sure that the old data is no longer accessible, in order to improve performance. These optimizations would become hard or impossible if data liveness had to be tracked across all processes that can touch the data.

Related to what peerreynders said, having each process with its own heap ensures that when it’s time for a particular process to do work, all of its data is physically co-located in memory, which lets the CPU caches work as designed. In fact, many processes are small enough to fit inside the CPU caches entirely, which dramatically speeds things up.

Really though, it’s all about garbage collection, either in the normal case or in the process-death case. If processes can share data, you either need to impose a refcounting penalty on ALL data or you need to impose “stop the world” garbage collection. Refcounting large binaries is an acceptable trade-off for that data type, but not for others. Stop-the-world garbage collection isn’t acceptable for a soft real-time system. Ergo, isolated process heaps.

5 Likes

Thanks everyone for explanations and spending your time on it!

I feel welcome here :slight_smile:

4 Likes

I did some tests on a real-life project, and avoiding GC kicking in during a request is a big win. It actually happens a lot without much trying in the app I was testing. Many times a request was served without a single GC run, which means nothing was paused to clear memory and the user didn’t see increased latency. Depending on your system, the processes then either go back to the Ranch worker pool and can be GCed while idle, or are dropped and restarted altogether, removing the need for any GC at all.
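
For what it’s worth, a rough way to check this yourself (the request-handling step is a placeholder) is to compare the process’s `:minor_gcs` counter before and after the work:

```elixir
# Sketch: Process.info(pid, :garbage_collection) exposes a :minor_gcs counter;
# if it doesn't change across the request, no GC ran in this process.
{:garbage_collection, gc_before} = Process.info(self(), :garbage_collection)

# ... handle the request here (placeholder) ...

{:garbage_collection, gc_after} = Process.info(self(), :garbage_collection)
gc_after[:minor_gcs] - gc_before[:minor_gcs]   # 0 means the request saw no GC run
```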

2 Likes

I think one thing to realise wrt GC is that if you share data between processes, then when you do a GC you have to GC all the processes and the whole heap. This means that you need a real-time collector, which complicates things. Running in multiple threads, which the BEAM does, complicates matters even more, as you need to make it thread-safe and/or pay a large cost in synchronisation. This is now done for the large binaries, with the result that it can sometimes take a long time to reclaim their memory. Basically, every process which has referenced the binary has to do a full GC before the binary can be reclaimed. This can take a comparatively long time, and overflowing memory with unreclaimed large binaries is actually a problem.

So not sharing data, and copying data when sending messages, is actually quite a good way of doing it, even if it sounds wrong. Way back I did some implementations of Erlang with shared memory and real-time collectors, and it is not trivial to get right. Fun, though, but not trivial.

In OTP 22 there are some new special memory areas which are shared, so some things are good and really fast, but you also pay a heavy price. Check out atomics, counters and persistent_term and you will see what I mean.
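
For anyone curious, a minimal sketch of what those look like (the key and values below are made up for illustration); note in particular that replacing or erasing a persistent_term key forces a global GC:

```elixir
# persistent_term: terms in a global area, read without copying, but any
# update or delete of a key triggers a GC of every process in the system.
:persistent_term.put({:my_app, :config}, %{pool_size: 10})   # hypothetical key/value
config = :persistent_term.get({:my_app, :config})

# counters/atomics: fixed-size arrays of integers in shared memory,
# updated with atomic operations instead of message passing.
ref = :counters.new(1, [:write_concurrency])
:counters.add(ref, 1, 1)
:counters.get(ref, 1)   # => 1
```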

18 Likes

The VM has special knowledge of its data structures though, right? I’m not entirely sure why a process couldn’t refcount itself and know which other processes have borrowed a reference, and on process death push a re-copy of the about-to-be-GC’d memory into the target (like a “lazy copy”). A lot of use cases are going to be transient tasks which are effectively read-only, with the parent (say a GenServer) being long-lived, so not copying seems to me like it would be useful. I just wish I had time to look at the internals and contribute more than just my armchairing.

  1. What happens when any process in this scenario needs to GC?

  2. Then the target has to be paused when the source of the data dies.

1 Like

Why? You could have a read-only pointer table in the target that gets written to by the source after the copy has been performed. You might have to make the pointer operation atomic, but that’s not terrible; it’s just one word.

Atomic operations across cores require the cores to synchronize. This is a non-trivial bottleneck when you have thousands of concurrent processes. What you’re proposing is the “refcount everything” option I mentioned earlier. The performance penalty of this on multicore systems is non-trivial.

That one word may also be more expensive than you think. Consider something like a list. A list isn’t a single thing; it’s a chain of [head | tail] cells. Refcounting each link in the chain (because, remember, any arbitrary part of the list could have been sent to a different process) would double the memory overhead of the list cell, which is only 1 word to begin with.

4 Likes

I’ll be honest, I don’t know how cross-core memory operations work on x86 with its potentially convoluted non-uniform memory access. The last system I wrote machine code for was the REX Neo, which had hierarchical scratchpad memory instead of a cache, so a write from one core to another was instantly guaranteed to be atomic without any synchronization primitive at all.

Yeah the x86 memory model in comparison is an absolute horror, lol… ^.^;

if anyone is curious:
http://www.rexcomputing.com/

The two 20-year-olds taped out a chip in six months, which is incredible, and the chip worked: its power profile was a fraction (I think 10x lower) of that of an Nvidia GPU for FFT, and it was 5x faster at 200 MHz on a 22 nm process. The intended target audience was 5G modem manufacturers, but they couldn’t find a buyer, since they didn’t really know how to do technical sales (they also burned six months because the person they contracted to do the motherboard screwed them over and they had to redo most of it themselves). They should have nabbed Ericsson, but Ericsson has a cosy relationship with Intel and got hosed when Intel’s 7 nm process failed.

I do think it’s interesting that the chip layout basically has you running concurrent cores for which the best processing model is the actor model. It seems like my whole computer industry “career” has been about nudging me toward the actor model over and over again.

1 Like

I think you will end up copying most of the data anyway, but in a much less controlled fashion. As someone has pointed out, you are viewing the data sent in a message as one big coherent chunk created in the sender and sent to the receiver, but this is generally not the case. Usually the data in a message is put together from smaller bits which come from many different processes. This means that either you have a right mess keeping track of what came from where, or you will be doing a lot of copying between processes, which is what you are trying to avoid.

A simple example of what I mean. Assume A sends a chunk of data to B, which sends it on to C, so A -> B -> C. From what I understand of your method, the data is not copied; A knows it has sent it to B, and B knows it has sent it to C. What happens if/when B dies? A does not know about C, so if A now dies it can’t copy the data to C. Or should B tell A about C? But that means a process has to keep track of where all its data has come from, which we don’t need to do now.

This is exactly the problem you get with atomics, counters and persistent_term. Atomics and counters solve the problem by only allowing integers, while persistent_term allows complex data structures but requires a global GC when you update the data.

9 Likes