How does Phoenix benefit from hardware caching?

“Programming Phoenix” includes this statement:

“Templates are precompiled. Phoenix doesn’t need to copy strings for each rendered template. At the hardware level, you’ll see caching come into play for these strings where it never did before.”

This sounds awesome, but I want to understand how it works. Precompiled templates are definitely more memory-efficient, but so far, I’ve been unable to see a caching effect in action. I had a theory:

  • If the same string is used for (e.g.) the header on each request, it will be found at the same location in memory (if it’s a refc binary)
  • The completed template will be given to Cowboy as an iolist, and Cowboy will call writev on the socket with some of the same memory addresses each time
  • Because writev gets some of the same memory addresses on each request, those strings should often still be sitting in the CPU cache (see the sketch after this list)
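
To make the theory concrete, here’s a rough sketch of what I have in mind (the module and strings are made up; this is not what Phoenix actually generates): the static chunks are literal binaries compiled into the module, so each render returns an iolist whose static entries point at the same constant data, and Cowboy can hand that iolist to writev without flattening it.

    # Illustrative sketch only, not Phoenix's actual generated code.
    defmodule DemoView do
      # The static parts are module literals; only the dynamic part changes per request.
      def render(assigns) do
        ["<header>site header</header>\n<p>", assigns[:body], "</p>\n"]
      end
    end

    # Each call returns an iolist whose static entries are the same literal binaries.
    iolist = DemoView.render(body: "hello")
    # Cowboy can send iodata like this without copying it into one flat binary,
    # so the socket write becomes a writev with (roughly) one entry per chunk.
    IO.iodata_length(iolist)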

I’ve written a small Phoenix app to test this theory, but so far it seems to be wrong. (There’s a detailed writeup in the README.)

Can anyone point me to a way that Phoenix benefits from hardware caching, or tell me what I’m doing wrong in my experiment?

(I hope to discuss this in my ElixirConf Talk with @JEG2.)

5 Likes

Hmm, maybe this timely tweet about seeing cache misses will be helpful.

1 Like

Hardware caching could simply mean that the iodata is generally laid out in contiguous memory, meaning that the L3 and other CPU caches will readily have that information on hand, because the CPU generally prefetches chunks of contiguous RAM when it accesses an address.

This would be distinct from an object-graph way of representing the template information, which may contain pointers all over the place.

In this interpretation of hardware caching it’s simply a property that makes any given request fast, it isn’t about caching across requests necessarily.
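
For what it’s worth, here’s a tiny made-up example of the two shapes being contrasted: nested iodata is a tree of pointers to separate binaries, while flattening it gives one contiguous binary that cache-line prefetching handles well.

    # Nested iodata: a tree of pointers to separate chunks (made-up data).
    nested = ["<html>", ["<body>", ["<p>", "hello", "</p>"], "</body>"], "</html>"]

    # Flattened: one contiguous binary.
    flat = IO.iodata_to_binary(nested)

    IO.iodata_length(nested) == byte_size(flat)  #=> true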

2 Likes

If you want different caching then you could do something that I do. For example, I have a few sub-templates in my main templates that are costly to render (conditionals and a large data chunk); I could easily have those stored in ETS for a given input and just clear the cache out every once in a while (in this case that one chunk is the same for everyone, so I’d only need to store it in the cache once and rebuild it when its data changes). In the general case, template rendering to iolists is so fast that it is going to be hard to beat; usually a cache lookup would be slower than just running render.
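
A minimal sketch of that ETS approach, assuming a named public table and a render function passed in as a closure (all of the names here are made up):

    defmodule FragmentCache do
      @table :rendered_fragments

      def init do
        :ets.new(@table, [:named_table, :public, read_concurrency: true])
      end

      # Return the cached iodata for `key`, rendering and storing it on a miss.
      def fetch(key, render_fun) do
        case :ets.lookup(@table, key) do
          [{^key, iodata}] ->
            iodata

          [] ->
            iodata = render_fun.()
            :ets.insert(@table, {key, iodata})
            iodata
        end
      end

      # Call this whenever the underlying data changes.
      def invalidate(key), do: :ets.delete(@table, key)
    end

A template would then call something like FragmentCache.fetch(:big_chunk, fn -> render_expensive_chunk(assigns) end) and pay the render cost only once per invalidation.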

2 Likes

Hardware caching could simply mean that the iodata is generally laid out in contiguous memory, meaning that the L3 and other CPU caches will readily have that information on hand, because the CPU generally prefetches chunks of contiguous RAM when it accesses an address.

Interesting. I’ll have to see if I can tell whether the addresses given to writev are contiguous in memory.

It’s complicated by the fact that the size of each specific string changes how the BEAM treats it. A string has to be bigger than 64 bytes to be refcounted rather than copied across processes (ERL_ONHEAP_BIN_LIMIT, from what I’ve read), and bigger than 512 bytes (in my testing) to not be combined with its neighbors when calling writev (probably ERL_SMALL_IO_BIN_LIMIT).
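
For example, test strings straddling those thresholds can be built like this (the thresholds in the comments restate the ones above, from my testing and reading rather than from official docs):

    small  = String.duplicate("a", 63)   # under 64 bytes: heap binary, copied when sent to another process
    shared = String.duplicate("a", 65)   # over 64 bytes: refc binary, shared across processes
    chunky = String.duplicate("a", 513)  # over 512 bytes: (in my tests) kept as its own writev entry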

1 Like

Maybe I’m just Doing It Wrong™ :slight_smile:, but I can’t tell that these addresses are contiguous in memory. Some of the addresses I see using dtrace on a Phoenix app:

(0 bytes): 0x0000000000000000
(0 bytes): 0x00000000b0ac4478
(104 bytes): 0x0000000019f43cae
(106 bytes): 0x0000000019f80d48
(1140 bytes): 0x000000001a003088
(125 bytes): 0x0000000019f45797
(125 bytes): 0x0000000019f45817
(125 bytes): 0x0000000019f45897
(17 bytes): 0x0000000019f420e8
(22 bytes): 0x0000000019f43c98
(22 bytes): 0x0000000019f80af8
(241 bytes): 0x0000000019f45917
(29 bytes): 0x0000000019ec02d0
(3 bytes): 0x0000000019f45794
(3 bytes): 0x0000000019f45814
(3 bytes): 0x0000000019f45894
(35 bytes): 0x0000000019f80b38
(460 bytes): 0x0000000019f455c8
(666 bytes): 0x000000001a003528

I figured that if they’re contiguous, I should be able to take one address, add the number of bytes written from there, and land on the start of another entry in the list. E.g.:

"19f45814" |> Integer.parse(16) |> elem(0) |> Kernel.+(3 * 8) |> Integer.to_string(16)

I don’t find anything that way, but the fact that the first couple are listed as writing “0 bytes” probably means the byte counts are rounded, so maybe this is the wrong way to go about it.
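
A more thorough check (just a sketch) would be to scan the whole list for entries that end exactly where another entry begins, using the addresses and byte counts from the dtrace output above:

    # {address, bytes written} pairs from the dtrace output; the two 0-byte entries are dropped.
    entries = [
      {0x19f43cae, 104}, {0x19f80d48, 106}, {0x1a003088, 1140},
      {0x19f45797, 125}, {0x19f45817, 125}, {0x19f45897, 125},
      {0x19f420e8, 17},  {0x19f43c98, 22},  {0x19f80af8, 22},
      {0x19f45917, 241}, {0x19ec02d0, 29},  {0x19f45794, 3},
      {0x19f45814, 3},   {0x19f45894, 3},   {0x19f80b38, 35},
      {0x19f455c8, 460}, {0x1a003528, 666}
    ]

    starts = MapSet.new(entries, fn {addr, _bytes} -> addr end)

    for {addr, bytes} <- entries, MapSet.member?(starts, addr + bytes) do
      IO.puts("0x#{Integer.to_string(addr, 16)} ends where 0x#{Integer.to_string(addr + bytes, 16)} begins")
    end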

2 Likes

It’s a bit more complex. There is a cache hierarchy (L1, L2, L3), and then you have the main memory. In between the CPU and the main memory, the TLB inside the CPU translates virtual addresses to physical addresses.

The L1 cache is very small, and it is split in two: one cache for instructions and one for data.

As benwilson512 mentioned, when there is a cache miss, the CPU asks the next cache level for the data, going all the way out to main memory (or even disk) if it needs to. When the data is found, it is copied into the other cache levels, but in chunks called cache lines.

So, if you want to create a cache-friendly program, you lay your data out contiguously, so that whenever there is a cache miss, each line fetched brings in as much useful information as possible.

Now, back to Phoenix: you may have ordinary code, or you may have precompiled template code. The difference is that the first executes at run time, while the second can do its work at compile time and store the results so they are ready at run time, with almost none of your code executing per request.
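
As a rough illustration of the precompiled half (plain EEx returns a binary here, whereas Phoenix’s engine builds iodata, but the idea is the same: the static parts are compiled in):

    defmodule Precompiled do
      require EEx

      # The template is turned into this function at compile time;
      # only the <%= @name %> part is evaluated per call.
      EEx.function_from_string(:def, :greet, "<h1>Hello, <%= @name %>!</h1>", [:assigns])
    end

    Precompiled.greet(name: "world")  #=> "<h1>Hello, world!</h1>"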

You should remember how Erlang and Elixir work. If you pass data between processes, it is copied. This is done to keep processes isolated from each other.

If you are really really interested in having full control of your data layout, then I would recommend coding in C or C++. These are the only two high-level languages that let you specify exactly how data will sit in memory, and then use all sorts of techniques to speed up execution. For example, you can pass references and pointers, so that there are no copies involved anywhere.

The whole point of Elixir and Erlang is not to have a language where you can tweak everything to squeeze out every CPU cycle. The point is to have a highly concurrent and reliable system.

3 Likes

I’ve been playing with Rust lately: fantastic C API support, and very good control of memory (though still not to the C/C++ extent; the next version should fix that). This gives me an idea for a mix module to support Rust compilation, more easily multi-platform than C NIFs… Hmm…

2 Likes

Have you seen https://github.com/hansihe/Rustler ?

2 Likes

Ooo an interop lib, awesome! I may have to play with that…

1 Like

Aaaaand it seems unmaintained: it does not work with the current version of Erlang/Elixir, and it suggests using the old multirust system instead of the modern rustup (in fact it requires multirust to be installed, which cannot happen anymore, as multirust now removes itself when installed and tells you to migrate to rustup…), etc.

1 Like

I wouldn’t say that it is inactive… You got a reply pointing to an upstream bug in less than an hour, and there was a release 2 days ago, which should also tackle the rustup/multirust issue as far as I can see.

1 Like

That is a good sign; however, that upstream bug is in another Rust library for the low-level Erlang API, and it has had no response to its issue report for over a week. :wink:

EDIT: Heh, I just noticed that the person that posted the link is the author of the other library, yet no updates over there for a long time. :stuck_out_tongue_winking_eye:

1 Like

OTP 19 is only 2 weeks old; give them some time to do the necessary research and adjustments, and use 18 until then.

1 Like

Eh, except I’ve been using 19 since well before its release; at work I needed something that was new in it (I forget what it was now…), so I have no choice.

1 Like

It’s a bit more complex. There is a cache hierarchy…

Thanks - as you can tell, I’m not very knowledgeable about hardware caches. :slight_smile:

You should remember how Erlang and Elixir work. If you pass data between processes, it is copied. This is done to keep processes isolated from each other.

Yes - usually. Except that binaries greater than 64 bytes in size are shared by all processes and refcounted, and those larger than 512 (in my tests) are not combined when calling writev, probably based on ERL_SMALL_IO_BIN_LIMIT (according to Elixir RAM and the Template of Doom – Evan Miller)

If you are really really interested in having full control of your data layout

I’m not. :slight_smile: I have simply been trying to understand a stated benefit of Phoenix.

I’ve exchanged emails with José Valim and Chris McCord about this. It sounds like the original statement in “Programming Phoenix” may have been based on some assumptions rather than specific measurements.

Update: I just learned about perf stat -e L1-dcache-load-misses from Julia Evans and wanted to try perf stat -e L1-dcache-load-misses -p $(pgrep beam) on Phoenix, but I’m on OS X and perf is Linux-only.

I tried anyway, running Phoenix under a Linux VM in Vagrant, but perf told me that L1-dcache-load-misses was not supported (for my CPU?), and I’m out of my depth there. But if anyone cares to try this on Linux hardware and gets results, I’d be interested.

1 Like

I took another shot at this: I set up a Linux box on linode.com, got Phoenix running on it, and ran perf stat -e L1-dcache-load-misses -p $(pgrep beam). That does show me data like this:

 Performance counter stats for process id '17462':

             1,714      L1-dcache-load-misses

       1.399432582 seconds time elapsed

However, that’s all noise - it’s what I see after running the command above, waiting a second or so, and pressing control + c, without making any requests to the app. It appears that there are more than 1k cache misses per second for that process even if I’m not asking it to do anything.

I don’t see a good way to separate the noise from the signal here. If I could, I could compare the performance of my two Phoenix endpoints. But it appears I’m stuck again.
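
If anyone wants to take a crack at it, one idea (just a sketch, assuming the app listens on localhost:4000) would be to hammer the endpoint from iex while perf stat is attached, and compare those counts against the idle baseline:

    # Generate load while `perf stat` is attached to the beam process.
    :inets.start()

    for _ <- 1..10_000 do
      :httpc.request(:get, {'http://localhost:4000/', []}, [], [])
    end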

(Probably nobody cares about this thread anymore but me, but having documented this much, I thought I might as well post an update.)

2 Likes

You are running on a box that, depending on the plan, might be sharing hardware with several hundred other VMs; that’s not exactly a good environment for gathering CPU cache stats.

3 Likes