Troubling gen_tcp.send() performance

Summary

I am trying to create a media streaming server in Elixir, with an initial focus on RTMP publishing and playback. I chose Elixir/Erlang because it seemed like a perfect candidate but I seem to be having trouble.

The testing setup is three applications: one RTMP publisher (OBS Studio, a third-party tool), one RTMP viewer (VLC), and my Elixir server. Both the publisher and the viewer connect to my Elixir server over localhost. The publisher sends the server video and audio data, and each packet gets relayed to the viewer, all over TCP. The publisher is currently set to send 2500kbps, and network traffic shows it pretty close to this.

When running the test I notice the video is stuttering a lot. VLC debug messages show it’s receiving frames inconsistently and trying to compensate for it.

After getting help from people in IRC and looking through observer, I think I have pretty much pinpointed the issue to the :gen_tcp.send() calls being slow. They are so slow, in fact, that I have observed a single send call take 5-10 seconds to complete.

Since I know Erlang is heavily used in switches, I can’t believe that the performance I’m getting is normal. Lowering my video’s bitrate to 500kbps does show smoother playback, but I can still tell there is an issue.

For reference, the code I have so far is up at https://github.com/KallDrexx/mmids-temp. Note that this is a temporary repository. I plan to split each of the apps up into their own repositories, slap an MIT license on them, then upload them to hex once I have this thing stabilized.

Based on diagnostics I coded, a 2500kbps video averages 200-250 messages per second going from the publisher to the viewing client.

What is the architecture?

The general architecture I have right now is that when any type of client connects I utilize ranch to spawn a gen_server. This server receives TCP binary (using active_once and raw flags), attempts to deserialize any RTMP messages contained in it, reacts to the messages that need a reaction, and sends any responses back to the client. This all occurs within a single gen_server and no other processes are involved.
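For future readers, here is a minimal sketch of the "one gen_server per connection, active once" pattern described above. It is not the code from the repository: ranch is left out, the module name `ActiveOnceSketch`, the `take_over/2` helper and the `sink` pid are all made up for illustration, and RTMP parsing is replaced by simply forwarding the raw bytes.

```elixir
defmodule ActiveOnceSketch do
  use GenServer

  # `socket` is an already accepted :gen_tcp socket opened with active: false.
  # `sink` is a hypothetical pid that receives whatever bytes arrive.
  def start_link(socket, sink), do: GenServer.start_link(__MODULE__, {socket, sink})

  # Must be called by the current socket owner (the acceptor). It hands the
  # socket to the server process and asks it to arm the socket once.
  def take_over(server, socket) do
    :ok = :gen_tcp.controlling_process(socket, server)
    GenServer.cast(server, :arm)
  end

  @impl true
  def init({socket, sink}), do: {:ok, %{socket: socket, sink: sink}}

  @impl true
  def handle_cast(:arm, state) do
    :ok = :inet.setopts(state.socket, active: :once)
    {:noreply, state}
  end

  @impl true
  def handle_info({:tcp, socket, data}, %{socket: socket} = state) do
    # A real server would deserialize RTMP messages here.
    send(state.sink, {:bytes, data})
    # Re-arm so that exactly one more TCP message is delivered.
    :ok = :inet.setopts(socket, active: :once)
    {:noreply, state}
  end

  def handle_info({:tcp_closed, socket}, %{socket: socket} = state) do
    {:stop, :normal, state}
  end
end
```

Because the socket is re-armed only after each message is handled, the process never has more than one pending `{:tcp, ...}` message from its own socket in the mailbox.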

For demonstration purposes when a viewing client requests playback I use pg2 to subscribe to a specific channel for audio and video data. Publishing clients that are publishing a/v data on that same stream key push that data to all subscribed clients. The viewing clients then receive the a/v data, serialize them into RTMP messages, serialize them into binary, then send them off across the network pipe.

What have I tried?

First I tried using :os.system_time(:milli_seconds) to determine how long each audio/video data packet took to get from deserialization on the publisher side to just before binary serialization on the viewer side. I noticed that it would start out extremely fast, then long pauses of 5-10 seconds would occur, then batches of packets would get processed, then another pause, etc…
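A rough sketch of that kind of per-packet timing is below. Note that it swaps in `System.monotonic_time/1` instead of the `:os.system_time/1` call mentioned above, since monotonic time is not affected by clock adjustments. The module name `RelayTiming` and the packet shape are hypothetical.

```elixir
defmodule RelayTiming do
  # Stamps a payload at the moment it is deserialized from the publisher.
  def stamp(payload) do
    %{payload: payload, received_at: System.monotonic_time(:millisecond)}
  end

  # Milliseconds elapsed between stamping and this call, meant to be
  # invoked right before the packet is serialized for the viewer.
  def elapsed_ms(%{received_at: received_at}) do
    System.monotonic_time(:millisecond) - received_at
  end
end
```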

Then I was reminded about observer, and I loaded it and saw the following graph: https://dl.dropboxusercontent.com/u/6753359/observer1.PNG. The I/O graph told me that while inbound traffic was smooth, outbound was being staggered.

I then opened the process for the server managing the viewing client. I noticed the message queue length was constantly increasing, never decreasing, and the process was constantly stuck in the prim_inet:send/3 function.

In doing some Googling I came across this thread talking about slow send() performance, and while it didn’t have a definite fix it did mention batching up the binary for the send() call so I wasn’t calling it 200 times every second.

The first thing I tried was to utilize a timer. Instead of calling send() every message I put the binary in an iodata queue held in the gen_server’s state. I then added :timer.send_interval(100, :send_queue) to my initialization thinking I could send data once every 100ms.
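For illustration, the timer approach could be sketched roughly as follows. This is a simplified stand-in rather than the actual server code: the module name `BatchedSender` and the `enqueue/2` function are assumptions, and a failed send simply crashes the process.

```elixir
defmodule BatchedSender do
  use GenServer

  @flush_every_ms 100

  def start_link(socket), do: GenServer.start_link(__MODULE__, socket)

  # Queues bytes (any iodata) instead of sending them right away.
  def enqueue(server, bytes), do: GenServer.cast(server, {:enqueue, bytes})

  @impl true
  def init(socket) do
    # Delivers :send_queue to this process every 100 ms.
    {:ok, _tref} = :timer.send_interval(@flush_every_ms, :send_queue)
    {:ok, %{socket: socket, queue: []}}
  end

  @impl true
  def handle_cast({:enqueue, bytes}, state) do
    # Nesting as [queue, bytes] preserves order without copying binaries.
    {:noreply, %{state | queue: [state.queue, bytes]}}
  end

  @impl true
  def handle_info(:send_queue, %{queue: []} = state), do: {:noreply, state}

  def handle_info(:send_queue, state) do
    # One send call for everything queued since the last tick.
    :ok = :gen_tcp.send(state.socket, state.queue)
    {:noreply, %{state | queue: []}}
  end
end
```

With 200-250 messages per second and a 100 ms interval, each flush would carry roughly 20-25 messages in a single send call.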

This did not give any better results beyond keeping the message queue under control. With observer and this timer I noticed something odd: as I kept pressing the refresh hotkey, I would see my queue keep growing for up to 5-10 seconds and then drop to zero again. This repeated over and over, and on every refresh the process was still stuck in prim_inet:send/3. This suggests to me that send is just taking a ridiculous amount of time. Changing the timer interval up or down did not noticeably help.

The last thing I tried was to drop the interval and instead flush after every X queued messages, which let me batch messages together but in smaller batches than the interval method produced. This didn’t help by a noticeable amount either, and was worse for managing the message queue.
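The count-based variant can be sketched as a small pure data structure, shown below under the assumption that the caller performs the actual send whenever a flush is returned. The module name `CountBatcher` and its return tuples are invented for this example.

```elixir
defmodule CountBatcher do
  # `limit` is the X from the description: flush after every X messages.
  def new(limit) when is_integer(limit) and limit > 0 do
    %{limit: limit, count: 0, queue: []}
  end

  # Returns {:flush, iodata, batcher} once `limit` messages are queued,
  # otherwise {:hold, batcher}.
  def push(%{limit: limit, count: count, queue: queue} = batcher, bytes) do
    queue = [queue, bytes]

    if count + 1 >= limit do
      {:flush, queue, %{batcher | count: 0, queue: []}}
    else
      {:hold, %{batcher | count: count + 1, queue: queue}}
    end
  end
end
```

One drawback of this shape, compared with the timer, is that a partially filled batch is only sent when the next message arrives.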

So what now?

I’m not quite sure how to proceed from here. I can’t believe that sending data via TCP is really that bad on a VM that I hear so much praise for regarding low latency and soft real-time behavior.

At the end of the day, when the final system is built, I am hoping to get 50 inputs sending data to 150 outputs (based on current performance I’ve seen from other third-party products). So it’s a bit disconcerting that I can’t even get 1 in, 1 out working reliably.

Does anyone have any advice on where I go from here?

Can’t help much more than pointing to http://erlang.org/doc/man/inet.html#setopts-2 and seeing whether the high/low watermark settings apply to your case…

Thanks for the suggestion, I had missed that option. Unfortunately, I set both watermark settings to 64k and didn’t see any appreciable difference :-/. I was still seeing my process stuck for upwards of 10 seconds waiting for prim_inet:send/3 to finish :(.
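For readers who want to try the same thing, setting both watermarks on a connected socket might look like the sketch below. The module name `Watermarks` and the function `raise_limits/1` are made up for illustration; `high_watermark` and `low_watermark` are the option names documented for inet:setopts/2.

```elixir
defmodule Watermarks do
  @sixty_four_kib 64 * 1024

  # Raises both the high and the low watermark of a :gen_tcp socket to 64k.
  # Returns :ok or {:error, reason}, exactly as :inet.setopts/2 does.
  def raise_limits(socket) do
    :inet.setopts(socket,
      high_watermark: @sixty_four_kib,
      low_watermark: @sixty_four_kib
    )
  end
end
```

The current values can be read back with `:inet.getopts(socket, [:high_watermark, :low_watermark])` to confirm the change took effect.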

Did you eliminate the possibility that the client is the issue btw?

Funny you should mention that. About 20 minutes ago I was in the process of setting up a Linux VM to make sure it wasn’t some weird localhost issue on Windows when I decided to try using FFMPEG for playback instead of VLC (still on Windows). Amazingly, that works perfectly fine: the message queue length is almost never more than 0 (and when it is, it’s at 1) and observer’s I/O load charts show output exactly matching input.

So VLC must be doing something funky with localhost sockets and I’ve been banging my head on a non-issue for the past week (well non-issue assuming I don’t see it later when I test non-localhost delivery).

Oh well…

Windows has a really nasty network stack (hit issues with it constantly at work, on Windows 10 even), so this would not surprise me at all as we’ve had experiences where Windows can take upwards of 1-full-second to send out a buffer on a localhost at times (unable to find any magic incantation that fixes it either).

Well, after discussions on the Erlang mailing list I have a really good handle on what’s going on, just not why. (I cross-posted there since this is more of a BEAM/Erlang stdlib question than an Elixir one.) Someone replied with a really good post that gave me a good understanding of how to optimize the stack better.

Unfortunately, at the end of the day there’s something fundamentally wrong with how I’m relaying audio/video data and without a good foundation of how video works I’m having trouble figuring this out and I might have to give up on this project if I can’t get through this wall.

[quote=“KallDrexx, post:7, topic:2719, full:true”]Unfortunately, at the end of the day there’s something fundamentally wrong with how I’m relaying audio/video data and without a good foundation of how video works I’m having trouble figuring this out and I might have to give up on this project if I can’t get through this wall.
[/quote]

That makes me wonder if there is a way to sendfile an output stream, so that your ffmpeg (or similar) process could output directly to the necessary socket(s), with a multiplexer or so… Not sure I’d recommend such an approach, but… ^.^

If I understand what you are saying then I don’t think so, because there’s still lower level handshaking and other protocol level things that have to take place before I can start relaying audio/video packets. You technically have to do some analysis of the A/V data so your first video packet is an actual key frame (so late joining viewers don’t start with a badly formatted feed).
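As a rough illustration of the key frame requirement, here is a sketch of a per-viewer gate. It assumes FLV-style video payloads, where the top four bits of the first byte are the frame type and a value of 1 means key frame. It also assumes audio is held back until video starts, which is a choice made for this example only. The module name `KeyframeGate` is hypothetical.

```elixir
defmodule KeyframeGate do
  # A new viewer starts out waiting for its first key frame.
  def new, do: %{started: false}

  # Once a key frame has gone through, everything is forwarded.
  def filter(%{started: true} = gate, _kind, _payload), do: {:forward, gate}

  # Frame type 1 in the high nibble marks a key frame: open the gate.
  def filter(gate, :video, <<1::4, _codec_id::4, _rest::binary>>) do
    {:forward, %{gate | started: true}}
  end

  # Anything arriving before the first key frame is dropped.
  def filter(gate, _kind, _payload), do: {:drop, gate}
end
```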

That could still be done, I’m talking about setting up a socket including whatever headers, then passing the socket back to the kernel to just take the output of a program and stream it over the socket. ^.^

Still would be tricky, as the RTMP spec is extremely over-complicated and tries to compress the message headers based on the previous header it sent over. So for example, if I send a message with a type 0 header with video data with a timestamp of 25, my next message with a timestamp of 30 will have a type 1 header with a timestamp of 5 (delta). If the client receives a type 1 header without having received a type 0 header first, then it fundamentally can’t parse the message with the type 1 header (because it’s missing information).

That means that there’s a real possibility of it sending garbage messages to the client if things aren’t timed just perfectly.
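The timestamp example above can be worked through in a few lines. This is a heavily simplified sketch: real RTMP headers carry more fields than a timestamp, and the module name `TimestampDelta` and the `{:type0, ts}` / `{:type1, delta}` tuples are invented stand-ins for the binary header formats.

```elixir
defmodule TimestampDelta do
  # `prev` is the last absolute timestamp sent, or nil if nothing was sent yet.
  # Returns {header, new_prev}.
  def encode(nil, ts), do: {{:type0, ts}, ts}
  def encode(prev, ts) when ts >= prev, do: {{:type1, ts - prev}, ts}

  # `prev` is the last absolute timestamp the receiver has seen, or nil.
  def decode(_prev, {:type0, ts}), do: {:ok, ts}
  def decode(nil, {:type1, _delta}), do: {:error, :missing_type0}
  def decode(prev, {:type1, delta}), do: {:ok, prev + delta}
end
```

A receiver that joins after the type 0 header went by has `prev` set to nil, so the first type 1 header it sees cannot be resolved to an absolute timestamp.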

Anyways, I honestly don’t think that will solve the problem. I have infrastructure in the project that allows me to do raw binary dumps of all TCP traffic it sends and receives (labelled by a session id I’m designating for each connection). I then have a CLI application I made that lets me parse the raw binary one RTMP message at a time, and I can verify that I’m sending exactly what I’m receiving (at least for A/V data packets).

Just as a final note in this, this is 100% not an issue with gen_tcp and this whole thread can be ignored.

I still see the same type of I/O graph as in the original post, but now that I have flawless video playback I think it’s just a matter of how VLC actually does TCP reads relative to its buffer (it probably stops reading once the buffer is full, maybe?).

Good news out of this is that I actually have a working RTMP server with full playback support, and once I clean some things up I’ll have proper hex packages up.

I still learned from the discussion, so I’d say both threads (this and the Erlang one) still make for good reading. :)

We are looking forward to the Hex packages, thanks for sharing.

I know this is really late, but I’m just leaving it for future readers. On macOS or Linux, if you want to send a large file, you see performance increase as you make your chunks larger and make fewer calls to send. It’s a pretty smooth curve, showing gradually decreasing benefit. On Windows, not so. Larger chunks can get you much worse performance. I tested this a long time ago, but it sounds like the issues have persisted.

Note that I am not talking about gen_tcp.send here, but C code, directly using the sockets interface.