I understand the argument of “erlang has its roots in telephony, of course erlang/elixir is a great fit for this”
I understand that voice chat is “basically just a router for udp packets, and elixir has great libraries for those”
For whatever reason, looking at old threads, I am not finding a good simple example:
Question: Is there any example of using erlang/elixir to build a simple voice chat server (I don’t care if the client is desktop software or browser/rtc), I’m just looking for something where erlang/elixir itself does the ‘heavy lifting’ (rather than offloading the work to some other server).
My understanding of such a project is: elixir/erlang is a great fit for “orchestration”… connecting people, setting up the pipeline of encoding/decoding/streaming via web/etc. But underneath, those encoders/decoders are still delegated to some low-level libs.
Though, please note that I’m not an expert in the subject, and all my experience with realtime audio/video chat is limited to one project where it was done via WebRTC and the Elixir/Phoenix server was just authorizing and connecting people.
Do you have any intuition on why this is? Given that audio is relatively “low bandwidth”, surely the problem is not copying bytes around efficiently. Is there some expensive step with encoding/mixing/decoding that Erlang/Elixir is not well suited for?
My vague memories from my “signal processing” class remind me that those algorithms usually require lots of numerical calculation. Simply put - if digital audio is represented as a list of numbers - any conversion of it requires some kind of number crunching, which is not the strongest side of the BEAM.
(I wonder if Elixir Nx might change that in the future, though right now it’s more focused on ML problems… also there is a library, pelemay, that could probably help (?))
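To make the “number crunching” point concrete: even the simplest server-side operation on audio - mixing two streams - means touching every sample. A minimal sketch (hypothetical module, assuming 16-bit little-endian PCM frames of equal length):

```elixir
defmodule NaiveMix do
  # Mix two equally sized s16le PCM frames sample by sample,
  # clamping to the 16-bit range to avoid overflow.
  # At 48 kHz mono that's 48_000 iterations per second per stream pair -
  # exactly the kind of tight numeric loop the BEAM isn't optimized for.
  def mix(<<a::little-signed-16, rest_a::binary>>,
          <<b::little-signed-16, rest_b::binary>>) do
    <<clamp(a + b)::little-signed-16, mix(rest_a, rest_b)::binary>>
  end

  def mix(<<>>, <<>>), do: <<>>

  defp clamp(s) when s > 32_767, do: 32_767
  defp clamp(s) when s < -32_768, do: -32_768
  defp clamp(s), do: s
end
```

This is why in practice the per-sample work is pushed down to C/Rust codecs via NIFs, while the BEAM handles the routing around them.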
Not sure there’s any perfectly justified reason, but at least my thinking would be that the BEAM has GC pauses (albeit very short ones, and per-process, which should limit their impact) that can affect latency – which would be a deal-breaker in audio streaming. And as @RudManusachi said, number crunching isn’t the BEAM’s strongest suit, so future scaling might suffer.
If you expect no more than 20 people streaming audio at any given time and have done your due-diligence measurements, then Elixir might be a perfect choice. But there are multiple stories I’ve read in blogs where moving a lot of binaries around eventually made the BEAM’s GC slower.
But from my side I’d opt for an Elixir orchestrator with a Rust implementation beneath, just to be sure that if the app scales it will still work without lagging.
Playing with media is not as simple as you’d expect, unfortunately
We’re building a WebRTC SFU server that handles both audio and video. We made it work, but it’s still experimental and under heavy development.
That’s mostly true, though Elixir is very convenient for handling protocols and containers too, because of its good support for binaries and bitstrings. Heavy numerical computations are indeed delegated to low-level, mostly C libraries, but since that’s done via simple NIFs or C nodes, it doesn’t involve much overhead.
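The “good support for binaries and bitstrings” part is easy to illustrate: the fixed RTP header (RFC 3550) can be picked apart with a single pattern match. A sketch (hypothetical module name; it ignores the CSRC list and header extensions for brevity):

```elixir
defmodule RTPHeader do
  # Parse the 12-byte fixed RTP header in one bitstring pattern match.
  # Field widths are in bits, exactly as laid out in RFC 3550.
  def parse(<<version::2, padding::1, extension::1, csrc_count::4,
              marker::1, payload_type::7, sequence_number::16,
              timestamp::32, ssrc::32, payload::binary>>) do
    %{version: version, padding: padding, extension: extension,
      csrc_count: csrc_count, marker: marker, payload_type: payload_type,
      sequence_number: sequence_number, timestamp: timestamp,
      ssrc: ssrc, payload: payload}
  end
end
```

No manual bit shifting or masking - the declarative match both documents the wire format and destructures it.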
If you deal with latencies around 2ms, then it may be a problem - I haven’t tested. But for usual media streaming it’s good enough. People even write streaming apps in Go, which has a stop-the-world GC - that one does happen to be problematic though, AFAIK.
We haven’t had time for big optimizations of Membrane yet, nor have we tried it with the OTP 24 JIT. Anyway, it’ll probably never be as fast as if we’d used Rust or C. Membrane focuses on reliability, scalability and maintainability instead.
@mat-hek : Thanks for the detailed and insightful reply.
Can you please verify that Membrane also supports client->server instead of just p2p chat? For example, if we are doing something like Minecraft/Among Us proximity chat, we want the server to be able to take the audio streams, take each user’s location into account, and remix them.
If it is not too much trouble, could you guide us through the “journey of 1ms of sound data”? Suppose person A is talking to person B via client-server (not p2p); I can get us started with:
1. Person A's vocal cords generate sound waves.
2. Sound waves hit the microphone, which converts them to a digital signal.
3. The laptop allows the Chrome browser to access the digital signal.
4. ??? Magic ???
5. ??? More Magic ???
6. Elixir server gets data.
7. ??? More Magic ???
8. ??? Magic ???
9. Data reaches Person B's browser.
10. Person B's laptop speakers generate sound.
Could you walk us through those intermediate steps (and possibly which parts of the Membrane codebase they hit)?
You can easily find a lot of information about WebRTC on the net, for example at https://webrtchacks.com/. In short, it goes like that:
before starting the transmission, the session is negotiated (network addresses, certificates, tracks, codecs etc) using the SDP offer-answer model and Interactive Connectivity Establishment (ICE) candidates, usually via WebSocket/Phoenix channel
media connection is established via ICE
media encryption keys are exchanged through the media connection via DTLS
then the browser gets the track (in this case audio samples) from the microphone, encodes it (usually with OPUS), packs it into Real-time Transport Protocol (RTP) and encrypts it (so it becomes SRTP)
the encrypted stream is sent through the ICE media connection (usually via UDP)
if the connected peer is the Membrane server, it unpacks the audio (OPUS) stream with Membrane.WebRTC.EndpointBin. That bin consists of the ICE endpoint with DTLS handshake, and the RTP bin, which handles SRTP, SRTCP and some of their extensions
at that point, you can do anything with the received stream, while the Membrane server just passes it to Endpoint Bins of all other peers in the room
each Endpoint Bin packs the stream to RTP, encrypts to SRTP and sends via ICE sink
each peer’s browser gets the stream, decrypts it, unpacks it, decodes it to raw audio and plays it out
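The middle of that journey - “the Membrane server just passes it to Endpoint Bins of all other peers in the room” - is plain SFU fan-out, which can be modeled in a few lines of OTP. A toy sketch (made-up module and message names, no real (S)RTP handling; Membrane’s Endpoint Bins do far more):

```elixir
defmodule ToyRoom do
  use GenServer

  # Each incoming packet from one peer is forwarded, untouched,
  # to every other peer in the room - the essence of an SFU.

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{})

  def join(room, peer_id, pid), do: GenServer.call(room, {:join, peer_id, pid})

  def packet(room, from_peer, data), do: GenServer.cast(room, {:packet, from_peer, data})

  @impl true
  def init(peers), do: {:ok, peers}

  @impl true
  def handle_call({:join, peer_id, pid}, _from, peers) do
    {:reply, :ok, Map.put(peers, peer_id, pid)}
  end

  @impl true
  def handle_cast({:packet, from_peer, data}, peers) do
    # Fan out to everyone except the sender; the payload is never decoded.
    for {peer_id, pid} <- peers, peer_id != from_peer do
      send(pid, {:media, from_peer, data})
    end

    {:noreply, peers}
  end
end
```

Note that the server only routes opaque binaries - the expensive encode/decode steps stay in the browsers at both ends.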
That’s true, in the case of audio the MCU would mix the audio together and send one mixed stream to each participant
Our server is SFU only, at least for now, so it doesn’t do reencoding, it only repacks/re-muxes streams. However, it should be totally possible to build an MCU with Endpoint Bins and other Membrane plugins - for audio, we should even have all the needed plugins available - though some are experimental.
I do not yet know the difference between “re-encode” and “re-pack/re-mux”. The following is my current best guess; is it correct?
Suppose we have input streams A, B, C, D, E and we want to output A+0.5B+0.1C
re-pack/re-mux: we send the client 3 streams A, B, C; the client does out[t] = A[t] + 0.5B[t] + 0.1C[t] // in particular note that the client can thus perfectly recover A, B, and C
re-encode: the server does out[t] = A[t] + 0.5B[t] + 0.1C[t]; the server encodes out to a stream, and sends the client only a single stream // the client (without assuming structural properties) can’t directly access A, B, or C
^-- is the above correct for ‘re-encode’ vs ‘re-pack/re-mux’?
You’re basically right. Encoding refers to codecs (like Opus or H264) and usually involves media compression, while packing/muxing refers to protocols or containers (like RTP or MP4), which only wrap the media, usually by adding some headers. Encoding usually involves some heavy computing, while muxing is just moving binaries around.
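That cost difference is visible even in a toy version of the example above (out = A + 0.5B + 0.1C): muxing just rewraps the same bytes, while an MCU-style mix has to run arithmetic over every decoded sample. A sketch (hypothetical module, assuming the streams are already decoded to lists of integer samples):

```elixir
defmodule ToyMCU do
  # Weighted mix of three already-decoded sample streams:
  # out[t] = a[t] + 0.5 * b[t] + 0.1 * c[t], rounded back to integers.
  # In a real MCU this is bracketed by a decode before and an encode after,
  # which is where the heavy computing lives.
  def mix(a, b, c) do
    Enum.zip_with([a, b, c], fn [x, y, z] ->
      round(x + 0.5 * y + 0.1 * z)
    end)
  end
end
```

An SFU skips all of this: it forwards each stream as-is (possibly with new RTP headers), which is “just moving binaries around”.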
Awesome thread, I have nothing to contribute except admiration for this community where a question can be asked, a library maintainer can readily expound a ton of domain knowledge, and everyone’s day is better!
Great topic, this is something I know nothing about, but it’s really interesting and I’d like to play around. Is there any tutorial material you can recommend for understanding the basics, so I can follow the Membrane guide and the webrtc-sfu-server @mat-hek mentioned?