Built with Elixir: Approximated.app - handling 200k+ custom domains for web apps

carterbryden · June 19, 2023, 10:01pm

Hey folks, I’ve been a long-time mostly-lurker on these forums, but I know we’re always looking for examples of Elixir running in production. For the last two years I’ve been building and running a product, Approximated.app. It runs on Elixir/Phoenix, and I thought it might be another good example.

What it does
It lets you easily automate custom domains at scale for any web application.
Example use cases: blog hosts, marketplaces, cloud dev environment providers, SaaS apps of all kinds.

The stack

Elixir, Phoenix, Ecto
Oban for queueing
LiveView for dashboard UI
Customized Caddy server builds for proxy server clusters
Fly.io for distributed instances
Postgres for cloud DB, SQLite or Postgres for on-prem enterprise version

Some interesting stats:

It currently serves over 200,000 custom domains
The main Elixir app manages hundreds of globally distributed proxy clusters (at least one per customer)
Millions of requests per day go through the clusters
Over a billion Oban jobs run so far (most tasks get queued, and I’ve got to monitor every domain too)
99.87% average uptime for each domain
- I suspect this is actually higher, but current monitoring runs in 5-minute increments, so a regular networking blip on a single request sometimes shows as 5 minutes of downtime on the charts
One developer so far (me)
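
To put the monitoring caveat in numbers, here is a small arithmetic sketch. The 30-day window and the idea that every failed check is charted as a full 5 minutes of downtime are assumptions for illustration, not details of the actual monitoring setup. Under those assumptions, roughly 11 single-request blips in a month would already chart as 99.87%.

```elixir
defmodule UptimeSketch do
  @moduledoc "Illustrative arithmetic only; the 30-day window is an assumption."

  @check_interval_min 5

  # Number of checks in `days` days at one check every 5 minutes.
  def checks_in(days), do: div(days * 24 * 60, @check_interval_min)

  # Uptime percentage as it would be charted if every failed check
  # counts as a full 5-minute window of downtime.
  def charted_uptime(days, failed_checks) do
    total = checks_in(days)
    (total - failed_checks) / total * 100
  end
end

IO.inspect(UptimeSketch.checks_in(30))
# 8640

IO.inspect(Float.round(UptimeSketch.charted_uptime(30, 11), 2))
# 99.87
```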

Engineering stories

Early on I deployed a bug that, under the right circumstances, pretty much guaranteed that all CPU would be maxed out no matter what resources were thrown at it. This manifested one night while I was sleeping. 8 cores were maxed out for 8 hours, but no one noticed because of the process scheduling built into the BEAM. To users, everything was just working as it should.
I’ll add some more when I have time!

I’m not sure what people might want to know, but if anyone has any particular questions they want answered just let me know and I can update this to reflect!

guzishiwo · June 20, 2023, 12:39am

Which cloud service provider did you choose for PostgreSQL, such as AWS or GCP?

carterbryden · June 20, 2023, 1:04am

DigitalOcean managed Postgres. I think it’s a 4 GB instance now.

tj0 · June 20, 2023, 7:54pm

Pretty awesome.

The first thought that came to my mind was that Elixir is really good for solo devs, but I just haven’t seen many financially successful projects.

plausible.io (2-3 people now I think)
https://savvycal.com
https://metamorphic.app/ (seems defunct now)
logflare (acquired)

Know of any others?

Also, any thoughts of moving from Caddy to Elixir for the reverse proxy? Not sure if it’s truly worth the headache if everything is working.

tommica · June 20, 2023, 8:12pm

I’d be interested in how you decided on the architecture of your code (contexts, etc.). It sounds like a complex project, but is the codebase complex too?

carterbryden · June 20, 2023, 9:19pm

I think it’s top tier for small teams and solo devs, and I suspect for larger teams too but I don’t have the background to really say. The closest thing I have to compare it to is Laravel, which is wildly productive, but Elixir just has a way better foundation for a lot of things (concurrency, distributed, realtime, etc, etc).

I guess it depends on what you’d consider financially successful. Revenue, profit, team size, etc. all factor in, but I don’t know too many out there talking about how they use Elixir. I know of a few that use Elixir to create products for non-tech industries, like eaglemms.

I daydream all the time about writing a really good reverse proxy in Elixir, it just seems SO well suited for it. If I can get the flywheel of Approximated going a bit more, I’m planning on giving it a shot. Ideally as two libraries that could be run as an integration to an Elixir app or on their own: an ACME client, and a reverse proxy (which would use the ACME client).

It’s just hard to justify at the moment business-wise because Matt Holt has done such a good job with Caddy, and since I sponsor him a little bit he even lets me bug him from time to time.

For now I’ll just have to be satisfied by coming up with goofy names for it:

ExConn
ServEx
ConnEx
NExusuxEN

carterbryden · June 20, 2023, 9:54pm

A lot of the contexts could probably do with a bit of reorganization, but I’m mostly pretty happy with how I handled things.

I don’t find the codebase too complex, but then again, I wrote it. Some things are “complex” in that there’s a lot of pieces to them, but generally I don’t get too clever with each piece. It’s mostly just “do this, then that, then that”. Pipes, if you will.

Without giving away too much secret sauce, there are contexts for:

users
managing proxy clusters
managing virtual hosts (e.g. custom domains)
managing Caddy configs
managing infrastructure
monitoring virtual hosts
running health checks of so many varieties
resolving infra issues
resolving Caddy config issues
sending admin notifications (for things like health checks)
calling various APIs (I usually just write a minimal little client module, like for Stripe)
adapters for different infra scenarios (went through a few infra designs before the current iteration)
- This way the rest of the code can just call the same functions, and those select the proper adapter for you
some others for more basic entities like subscriptions, teams, roles, API keys, etc.

There’s more but you get the general idea.
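
As an illustration of the adapter idea in the list above, here is a minimal sketch using an Elixir behaviour. Every module name, the `create_cluster/2` function, and the `:fly` / `:on_prem` scenarios are hypothetical and chosen only for the example; the real contexts are not shown in this thread.

```elixir
defmodule InfraSketch.Adapter do
  @moduledoc "Hypothetical behaviour that every infra adapter implements."
  @callback create_cluster(name :: String.t()) :: {:ok, map()} | {:error, term()}
end

defmodule InfraSketch.FlyAdapter do
  @behaviour InfraSketch.Adapter

  # A real adapter would call an infrastructure API here.
  @impl true
  def create_cluster(name), do: {:ok, %{name: name, provider: :fly}}
end

defmodule InfraSketch.OnPremAdapter do
  @behaviour InfraSketch.Adapter

  @impl true
  def create_cluster(name), do: {:ok, %{name: name, provider: :on_prem}}
end

defmodule InfraSketch do
  # The rest of the code calls this one function; the adapter is
  # selected from the scenario, so callers never name an adapter.
  def create_cluster(scenario, name), do: adapter_for(scenario).create_cluster(name)

  defp adapter_for(:fly), do: InfraSketch.FlyAdapter
  defp adapter_for(:on_prem), do: InfraSketch.OnPremAdapter
end

IO.inspect(InfraSketch.create_cluster(:fly, "customer-a"))
# {:ok, %{name: "customer-a", provider: :fly}}
```

Swapping infra designs then means adding an adapter module and one `adapter_for/1` clause, while callers stay unchanged.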

The most complex parts are the health checks, because whenever some new problem scenario comes up, I add to them. And they need to handle multiple levels of redundancy (“something is broken, try this. It’s still broken, try that. Now the state is this, but still sort of broken, try this strategy, then start from the top. Still broken? Okay wake someone up.”).

For these I start by coding what I would literally do as a human being looking at this problem, as much as possible. The idea is that instead of reporting “This is broken!”, I want it to report “This was broken for a hot second, but I fixed myself automatically due to your amazing foresight! Also, you look nice today.”
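
The "try this, re-check, try that, then wake someone up" flow could be sketched roughly as below. This is a simplified illustration, not the actual health check code: the module, the remediation names, and the simulated state are all made up, and it leaves out the "start from the top" retry loop described above.

```elixir
defmodule HealthCheckSketch do
  # `check` returns :ok or {:error, reason}.
  # `remediations` is an ordered list of {name, fun} fixes to try.
  # `page_human` is called only when every fix has failed.
  def run(check, remediations, page_human) do
    case check.() do
      :ok -> :healthy
      {:error, _reason} -> remediate(check, remediations, page_human)
    end
  end

  # Nothing left to try: escalate to a person.
  defp remediate(_check, [], page_human) do
    page_human.()
    :escalated
  end

  # Apply one fix, re-check, and move on to the next fix if still broken.
  defp remediate(check, [{name, fix} | rest], page_human) do
    fix.()

    case check.() do
      :ok -> {:self_healed, name}
      {:error, _reason} -> remediate(check, rest, page_human)
    end
  end
end

# Simulated cluster state: broken until the second remediation runs.
{:ok, state} = Agent.start_link(fn -> :broken end)

check = fn ->
  if Agent.get(state, & &1) == :fixed, do: :ok, else: {:error, :unreachable}
end

remediations = [
  {:reload_config, fn -> :ok end},
  {:restart_proxy, fn -> Agent.update(state, fn _ -> :fixed end) end}
]

result = HealthCheckSketch.run(check, remediations, fn -> IO.puts("waking someone up") end)
IO.inspect(result)
# {:self_healed, :restart_proxy}
```

Returning which remediation worked is what makes the "I fixed myself" style of notification possible.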

Basically I just stuck with my own made-up principle that each context should be a group of related verbs that work for a larger verb, i.e. I’m doing some larger task (managing a proxy cluster) and each function is a smaller action within that.

If enough functions within a context seemed even more closely related, I usually extracted them into their own context - i.e. managing infra was originally part of managing proxy clusters, but became multiple contexts that the proxy clusters context calls.

I’ve tried not to overthink it too much and don’t worry about being perfectly DRY or things like that - the really boneheaded mistakes become apparent sooner or later and I refactor them as needed.

I also make pretty heavy use of Oban and periodic jobs that spawn many one-off jobs. For example, a periodic job for generating health checks creates a one-off job for each proxy cluster. This has a lot of benefits because it lets each health check (in this example) run in its own job and be monitored and retried individually if it fails. This simple strategy has turned out to be pretty resilient at scale.
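
The shape of that fan-out can be sketched without any dependencies. In the sketch below the `enqueue` callback is a stand-in for a real queue insert (with Oban, presumably building a job changeset per cluster and inserting it); the module name, worker name, job fields, and retry count are assumptions for illustration only.

```elixir
defmodule FanOutSketch do
  # Body of the periodic job: build one small job per cluster and hand
  # each to `enqueue`, so every check can fail and retry on its own.
  def enqueue_health_checks(cluster_ids, enqueue) do
    Enum.map(cluster_ids, fn id ->
      enqueue.(%{worker: "HealthCheck", args: %{"cluster_id" => id}, max_attempts: 3})
    end)
  end
end

# Stand-in enqueue function that just echoes the job back.
jobs = FanOutSketch.enqueue_health_checks([1, 2, 3], fn job -> {:ok, job} end)

IO.inspect(length(jobs))
# 3
```

The periodic job itself stays tiny and fast, and a failure in one cluster's check never blocks the checks for the others.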

Another little strategy that I like is subscribing LiveViews and some other things to different PubSub topics. Then I have some functions in other contexts broadcast to the appropriate topic when they’re finished, so that the modules subscribing to that topic can do (or not) some action automatically.

Mostly it’s used to make the dashboard reactive to updates. For example, when a virtual host is updated it broadcasts a virtual_host_updated event and the LiveView can update that particular bit of UI (or not, according to circumstance). One of my favorite parts of Elixir/Phoenix is getting to use a reactive push strategy really easily, instead of constantly polling the API/DB/whatever.
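
Here is a dependency-free sketch of that subscribe/broadcast pattern. It uses the standard library `Registry` as a stand-in for Phoenix.PubSub, and the calling process plays the role of the LiveView; the module name, the topic name, and the message shape are assumptions for illustration.

```elixir
defmodule PubSubSketch do
  @registry PubSubSketch.Registry

  # Duplicate keys let many processes register under the same topic.
  def start, do: Registry.start_link(keys: :duplicate, name: @registry)

  # Subscribes the calling process to `topic`.
  def subscribe(topic), do: Registry.register(@registry, topic, [])

  # Sends `message` to every process subscribed to `topic`.
  def broadcast(topic, message) do
    Registry.dispatch(@registry, topic, fn entries ->
      for {pid, _value} <- entries, do: send(pid, message)
    end)
  end
end

{:ok, _registry} = PubSubSketch.start()
{:ok, _owner} = PubSubSketch.subscribe("virtual_hosts")

# A context function would broadcast like this after finishing an update.
PubSubSketch.broadcast("virtual_hosts", {:virtual_host_updated, 42})

# The subscriber decides what, if anything, to do with the event.
receive do
  {:virtual_host_updated, id} -> IO.puts("re-render virtual host #{id}")
after
  100 -> IO.puts("no update received")
end
```

In a LiveView the `receive` block would instead be a `handle_info/2` clause that updates the relevant assigns.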

vassilevsky · June 23, 2023, 12:15pm

Good job with the project.

I wonder how you deploy new versions to production. What tooling do you use?

tommica · June 23, 2023, 5:39pm

Thanks for the write-up, it has some interesting insights.

al2o3cr · June 23, 2023, 7:35pm

Cars.com is a pretty substantial one - there’s even an ElixirConf talk about it:

carterbryden · June 24, 2023, 11:52pm

For deployment of the main app, I’ve kept things pretty basic so far. It’s hosted on a DigitalOcean droplet and deployed with a reasonably simple bash script that compiles a release on the same host. If that goes okay, it runs any pending migrations, and if that all goes okay, it restarts the app with the new version.

I keep the last 10 releases on the droplet in case I want to revert (I have a script to revert back X releases), though I’ve never done that except to test it.
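
The actual deploy is a bash script that isn't shown here, but the control flow it describes (run each step only if the previous one succeeded, keep the newest 10 releases) can be sketched in Elixir. The module, step names, and release naming scheme below are hypothetical, and the retention helper assumes release names sort from oldest to newest.

```elixir
defmodule DeploySketch do
  # Runs steps in order and stops at the first failure.
  # Each step is {name, fun}, where fun returns :ok or {:error, reason}.
  def run(steps) do
    Enum.reduce_while(steps, :ok, fn {name, step}, :ok ->
      case step.() do
        :ok -> {:cont, :ok}
        {:error, reason} -> {:halt, {:error, name, reason}}
      end
    end)
  end

  # Returns the releases that fall outside the newest `keep`,
  # assuming names sort oldest to newest.
  def releases_to_delete(release_names, keep \\ 10) do
    release_names |> Enum.sort() |> Enum.drop(-keep)
  end
end

steps = [
  {:build_release, fn -> :ok end},
  {:migrate, fn -> {:error, :migration_failed} end},
  # Never reached, because the migration step fails first.
  {:restart, fn -> :ok end}
]

IO.inspect(DeploySketch.run(steps))
# {:error, :migrate, :migration_failed}
```

Because the restart step never runs after a failed build or migration, the previously running version stays up.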

For the proxy clusters, the main app manages that automatically using the Fly.io API(s). I’m not typically doing any deployment/updating of those manually, and usually if I am, it’s through a remote IEx session in the main app, calling the funcs it uses to automate things.