Bandit is an HTTP server for Plug and WebSock apps.
Bandit is written entirely in Elixir and is built atop Thousand Island. It can serve HTTP/1.x, HTTP/2 and WebSocket clients over both HTTP and HTTPS. It is written with correctness, clarity & performance as fundamental goals.
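For readers who haven't used it yet, a minimal sketch of what serving a Plug with Bandit looks like (HelloPlug and the port here are placeholders):

```elixir
# Hypothetical example: serve a simple Plug under a supervisor.
defmodule HelloPlug do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    send_resp(conn, 200, "Hello, world!")
  end
end

# In your application's supervision tree:
children = [
  {Bandit, plug: HelloPlug, port: 4000}
]

Supervisor.start_link(children, strategy: :one_for_one)
```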
In ongoing automated performance tests, Bandit's HTTP/1.x engine is up to 4x faster than Cowboy depending on the number of concurrent requests. When comparing HTTP/2 performance, Bandit is up to 1.5x faster than Cowboy. This is possible because Bandit has been built from the ground up for use with Plug applications; this focus pays dividends both in performance and in the approachability of the code base.
Bandit also emphasizes correctness. Its HTTP/2 implementation scores 100% on the h2spec suite in strict mode, and its WebSocket implementation scores 100% on the Autobahn test suite, both of which run as part of Bandit's comprehensive CI suite. Extensive unit test, Credo, Dialyzer, and performance regression test coverage round out a test suite that ensures Bandit is and will remain a platform you can count on.
Lastly, Bandit exists to demystify the lower layers of infrastructure code. In a world where The New Thing is nearly always adding abstraction on top of abstraction, it’s important to have foundational work that is approachable & understandable by users above it in the stack.
Bandit's project goals are to:

- Implement comprehensive support for HTTP/1.0 through HTTP/2 & WebSockets (and beyond), backed by obsessive RFC literacy and automated conformance testing
- Aim for minimal internal policy and HTTP-level configuration. Delegate to Plug & WebSock as much as possible, and only interpret requests to the extent necessary to safely manage a connection and fulfill the requirements of protocol correctness
- Prioritize (in order): correctness, clarity, performance. Seek to remove the mystery of infrastructure code by being approachable and easy to understand
- Along with our companion library Thousand Island, become the go-to HTTP & low-level networking stack for the Elixir community by being reliable, efficient, and approachable
After several years of effort, I just published version 1.0.0 of both the Bandit and Thousand Island libraries. Folks who are depending on versions in the 0.x.y or 1.0.0-pre series of either library should update their dependencies to ~> 1.0.
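For anyone updating, a sketch of what the dependency bump looks like in mix.exs (list only the libraries your app depends on directly):

```elixir
# mix.exs
defp deps do
  [
    {:bandit, "~> 1.0"},
    # Thousand Island is pulled in transitively by Bandit; list it only
    # if you depend on it directly.
    {:thousand_island, "~> 1.0"}
  ]
end
```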
This has been a ton of work, and it has been made possible in large part by the help of many contributors. In particular, @moogle19, @ryanwinchester and @alisinabh have gone above and beyond on all fronts. The project wouldn't be the success it is without help from folks like them. Thanks all!
I put together a bit of a retrospective blog post about the whole journey here, if anyone cares to learn more!
I have tried Bandit only once. There was a simple bug in my app, but it was hard to debug, and for some reason my instinct directed me to this library. The funny thing is that I had only read about it once before and had forgotten its name, but somehow I managed to find it on the forum.
Anyway, Bandit's error handling was an amazing help that saved me a lot of energy and time. I've been waiting for a stable 1.x release so I could use it by default. Thanks!
Awesome effort! And I'll definitely upgrade to it right away! Thank you so much!
One question I still have: what are the typical performance improvements one can expect with an ordinary LiveView app? I have no good estimate of how much time per route is spent in the HTTP layer vs. other parts of the pipeline. Any ideas?
Excellent question and one I get a lot. The short answer is that in most cases you probably won’t see much of a difference between Bandit and Cowboy from a performance perspective; your plug’s implementation is going to be the dominant factor in overall performance, and switching out the underlying server won’t magically make that work go away.
That having been said, there are many workloads in which you could expect to see a benefit from Bandit. The ideal case would be large numbers of HTTP/1 clients doing lots of IO on very short-lived connections. In that case you could see some substantial benefits (see my latest benchmark for more).
Some workloads are going to be worse. In particular, HTTP/2 performance in Bandit is pretty awful at the moment, but is going to be getting a lot of attention as part of the work to add WebSockets over HTTP/2 (RFC 8441) support. This will be one of the next things I’m working on.
In terms of LiveView, Bandit’s WebSocket implementation is generally a little bit faster than Cowboy’s (around 10-20%). You might see some real-world benefit there; it really depends on your particular usage patterns.
@mtrudel Hi
I was digging into Absinthe today and noticed it has a code path that results in a process just exiting when a certain internal timeout is exceeded. From Bandit's perspective, this results in this kind of error:
My understanding is that there's not much Bandit can do here in terms of error handling, and in general using exit isn't the best way to handle this kind of case. But I was wondering: would it be a good idea to define a terminate callback for the handler and issue a telemetry event? WDYT?
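To make the suggestion concrete, a rough sketch of what consuming such an event could look like on the application side. The event name below is purely illustrative (Bandit does not necessarily emit it today); only the :telemetry API itself is real:

```elixir
require Logger

# Hypothetical: [:bandit, :websocket, :terminate] is a placeholder event name.
:telemetry.attach(
  "log-handler-terminate",
  [:bandit, :websocket, :terminate],
  fn _event, _measurements, metadata, _config ->
    Logger.warning("WebSock handler terminated: #{inspect(metadata)}")
  end,
  nil
)
```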
BLUF: Of course this is an apples-and-oranges comparison. I personally couldn't give two hoots how well we perform against other languages (especially natively compiled ones running on base libraries); that's not a game I have any interest in playing or one that has any winners. It's MONGODB IS WEB SCALE all over again, and I have better things to do with my time than to engage in the comparative aspects of this. The two contestants aren't even playing the same sport.
The PR's setup looks fine (it's not a matter of app configuration). I didn't look at any of the lower-level OS / BEAM tuning details.
He's using m7a.large instances, which at first glance look like they'd perform a smidge worse than the instances we use for microbenchmarks in CI. From that perspective, the results seem roughly consistent with what I'd expect in absolute terms.
I'm a little worried by the growth numbers that Bandit demonstrates. CPU usage shouldn't be growing without bound like that, and (as he states) that's likely the root cause of the lacklustre numbers elsewhere.
There aren't really a whole lot of actionable steps to take based on this data. We really do need a better benchmarking environment (ideally one that runs as part of CI); the microbenchmarking setup we use now just doesn't get to the absolute scale needed to reproduce these sorts of situations 'in the lab', which is a necessary precondition for being able to improve them in Bandit. If anyone is looking for a place to help, that's probably the highest-value way to do so.
I'd say that POST benchmark is a bit disingenuous given it is also testing Ecto. But for the GET I would suggest enabling the supercarrier, sized at 85% of the available memory; that could give Bandit a boost. I've been profiling some web servers using msacc recently, and quite a bit of time is spent in GC. I wonder if it is a similar scenario in this case.
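For reference, a sketch of what enabling the super carrier might look like; the size here assumes a 16 GB machine (85% ≈ 13900 MB) and is a placeholder, so check the erts_alloc docs for the flags before copying:

```
# Hypothetical vm.args / ERL_FLAGS sketch.
# +MMscs sets the super carrier size in MB; +MMsco true makes allocators
# create carriers only inside the super carrier.
ERL_FLAGS="+MMscs 13900 +MMsco true" mix run --no-halt
```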
Further, the benchmark application uses Jason.encode instead of Jason.encode_to_iodata. While the body is not large, I'm sure it would help. Using :erlang.byte_size on the encoded device lists shows that they breach the refc binary threshold. This has the potential to put pressure on the binary allocator and affect the system negatively. Keeping these terms on the local process heap might be a benefit.
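For illustration, a sketch of the kind of change being suggested (variable names are placeholders; the bang variants are used for brevity):

```elixir
# Before: builds one large binary; anything over 64 bytes is stored as a
# reference-counted (refc) binary off the process heap.
body = Jason.encode!(devices)
send_resp(conn, 200, body)

# After: returns iodata, which Plug/Bandit can write out as smaller
# fragments without first flattening into a single large binary.
body = Jason.encode_to_iodata!(devices)
send_resp(conn, 200, body)
```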
When you mention you've been profiling web servers lately, is there anything actionable you've seen with Bandit? That low a level of profiling is truthfully not something I have a lot of experience or expertise with, and I'd greatly appreciate it if you would be willing to share some insights with the project.
Sadly no; this was more general debugging of webapp production issues, not a comparison of performance across multiple web servers. What it comes down to (in my sample size of ~5) is that the overhead of the web server rarely dominates; rather, it's the overhead of encoding/decoding data (Absinthe / Jason / etc.). That makes application-based benchmarks difficult, because that's not where the BEAM shines.
That being said, if you link that microbenchmark I can try to run the usual suspects on it (perf / msacc / eprof / fprof) to see if anything actionable stands out.
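For anyone following along, a rough sketch of what running a couple of those tools against a server under load might look like (the rootset below is a deliberately broad placeholder):

```elixir
# Microstate accounting: where scheduler threads spend their time
# (emulator, gc, port I/O, sleep, ...), sampled while a load generator runs.
:msacc.start(10_000)   # collect for 10 seconds, then stop
:msacc.print()

# eprof: per-function call counts and time for a set of processes.
:eprof.start()
:eprof.start_profiling(Process.list())  # placeholder; narrow this in practice
# ... drive traffic against the server ...
:eprof.stop_profiling()
:eprof.analyze()
:eprof.stop()
```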
See, that’s part of the problem. The benchmarks that run in CI are mostly intended to be comparative, in order to evaluate the relative perf impact of a branch compared to main as a baseline (I also use them to run comparative benchmarks against Cowboy).
Whenever I've done eprof/fprof work it's always been against an ad hoc 'hello world' Plug instance and a small set of hand-rolled client connections. The ephemeral nature of these test setups (and my general amateur relationship with profiling tools) makes it hard for me to really assess performance in an absolute sense.
So, to answer your question, I suppose a good ask here would be how you as an SME would put together a process to profile a simple 'hello world' plug in a reproducible way that allows for flexible client access patterns (single vs. keepalive requests, HTTP/1 vs. HTTP/2, etc.). I'm happy to systematize it; I just generally don't know my way around the profiling tools well enough to not drown in the sheer volume of output.
Hm, I see. Let me stew on it a bit; I'll take a look at that suite. I've written suites that compare the output of two profiles to rank differences between them, but I find that approach less stable for "low-level" programs and more meaningful for "high-level" programs, such as ones that transform complex data structures.