Hello lovely people of the Elixir community!
I’m working on a system to monitor Nerves devices to ensure they maintain constant internet connectivity. I’d love to get your thoughts on the architecture I’m considering.
The Problem
I need to:
- Monitor a very high number of Nerves devices (potentially thousands or tens of thousands)
- Ensure each device maintains constant internet connectivity
- Use WebSocket connections with periodic ping messages to detect connectivity issues
- Handle the scale efficiently without overwhelming the system
- Ensure that no heartbeat messages are lost (this is paramount)
Proposed Architecture
I’m thinking of using Kafka as a queueing buffer to decouple the WebSocket connection handling from the actual processing and storage. This would allow for a more robust and flexible architecture that can handle bursts and scale horizontally.
Here’s the architecture I’m envisioning:
┌─────────────┐
│   Nerves    │
│   Device    │
└──────┬──────┘
       │ WebSocket Connection
       │ (ping messages)
       ▼
┌─────────────────────┐
│       Phoenix       │
│  WebSocket Handler  │
│ (Phoenix Channels)  │
└──────┬──────────────┘
       │ publish_heartbeat()
       ▼
┌─────────────────────┐
│   Kafka Producer    │
│     (kafka_ex)      │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│     Kafka Topic     │
│  (device_heartbeat) │
│  ┌───────────────┐  │
│  │  Partition 0  │  │
│  │  Partition 1  │  │
│  │  Partition N  │  │
│  └───────────────┘  │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Broadway Consumer  │
│  (broadway_kafka)   │
│  - Processors       │
│  - Batchers         │
└──────┬──────────────┘
       │ batch_insert()
       │ detect_offline()
       ▼
┌─────────────────────┐
│     PostgreSQL      │
│   (device_status)   │
└─────────────────────┘
The Flow
- Nerves devices establish persistent WebSocket connections to a Phoenix application
- Devices send periodic ping/heartbeat messages (e.g., every 30 seconds)
- Phoenix Channels receive these messages and immediately publish them to Kafka (async, non-blocking; a rough sketch follows this list)
- Kafka acts as a buffer, handling bursts and providing durability
- Broadway consumers process heartbeats in batches, updating device status in PostgreSQL
- Missing heartbeats can be detected by checking timestamps in the database
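To make steps 2–3 concrete, here's a minimal sketch of the channel side as I'm imagining it. Everything named here is an assumption: a hypothetical MyAppWeb.DeviceChannel, a MyApp.TaskSupervisor for fire-and-forget publishing, a JSON payload shape, and a placeholder partition count.

```elixir
defmodule MyAppWeb.DeviceChannel do
  use Phoenix.Channel

  @topic "device_heartbeat"
  # Assumed partition count; would need to match the actual topic config.
  @partitions 8

  # Devices join "device:<device_id>" (one topic per device).
  def join("device:" <> device_id, _params, socket) do
    {:ok, assign(socket, :device_id, device_id)}
  end

  # Heartbeat handler: fire-and-forget publish so the channel process
  # never blocks on Kafka.
  def handle_in("heartbeat", _payload, socket) do
    device_id = socket.assigns.device_id
    payload = Jason.encode!(%{device_id: device_id, at: DateTime.utc_now()})

    # Keying the partition on device_id keeps per-device ordering.
    partition = :erlang.phash2(device_id, @partitions)

    Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
      KafkaEx.produce(@topic, partition, payload)
    end)

    {:reply, :ok, socket}
  end
end
```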
Questions for the Community
1. Is Kafka overkill for this use case? Given that I'm dealing with potentially tens of thousands of devices, I thought Kafka would help with:
- Handling bursts of heartbeat messages (e.g. the reconnection rush after a blackout)
- Decoupling WebSocket handling from processing
- Horizontal scaling of consumers
- Durability and replay capabilities
But I'm wondering if there's a simpler approach that would work just as well. Just Oban?
2. WebSocket connection management: For this scale, should I:
- Use Phoenix Channels with a single topic per device?
- Use a single topic with device IDs in the message payload?
- Consider connection pooling or other patterns?
3. Partitioning strategy: If I go with Kafka, how should I partition the device_heartbeat topic?
- By device ID (ensures ordering per device)?
- By event type?
- Round-robin (better load distribution)?
- Other ideas?
4. Offline detection: What's the best approach for detecting when a device goes offline?
- Timeout-based checks in the consumer?
- Presence lifecycle events?
- Kafka's consumer lag or other mechanisms?
5. Alternative architectures: Are there other patterns or libraries in the Elixir ecosystem that would be better suited for this use case? I've seen mentions of:
- Direct database writes from Phoenix (simpler but less resilient)
- GenStage/Flow for processing (lighter than Kafka)
- "Just" Oban?
- Other message queue systems
6. Performance considerations: For this scale, what should I be most concerned about?
- WebSocket connection limits per Phoenix node?
- Kafka producer throughput?
- Database write performance?
- Memory usage for connection state?
Context
I already have Kafka infrastructure set up in my project (using kafka_ex and broadway_kafka), so adding this wouldn’t require new infrastructure. However, I’m open to simpler solutions if they’re more appropriate.
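For reference, the consumer side I have in mind with broadway_kafka would look roughly like this. The hosts, group/topic names, batch sizes, and the MyApp.Repo / "device_status" target (with a unique index on device_id) are all placeholders to adapt:

```elixir
defmodule MyApp.HeartbeatPipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayKafka.Producer,
           hosts: [localhost: 9092],
           group_id: "heartbeat_consumers",
           topics: ["device_heartbeat"]},
        concurrency: 2
      ],
      processors: [default: [concurrency: 10]],
      batchers: [default: [batch_size: 500, batch_timeout: 2_000]]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # Assumes the producer publishes JSON payloads.
    Message.update_data(message, &Jason.decode!/1)
  end

  @impl true
  def handle_batch(:default, messages, _batch_info, _context) do
    # Simplification: uses batch-processing time rather than the payload timestamp.
    now = DateTime.utc_now()

    rows =
      messages
      |> Enum.map(fn %Message{data: %{"device_id" => id}} ->
        %{device_id: id, last_seen_at: now}
      end)
      |> Enum.uniq_by(& &1.device_id)

    # Upsert the latest heartbeat per device.
    MyApp.Repo.insert_all("device_status", rows,
      on_conflict: {:replace, [:last_seen_at]},
      conflict_target: :device_id
    )

    messages
  end
end
```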
Thank you for your input! 🙏
1 Like
Phoenix Channels already have a heartbeat to disconnect unresponsive clients, so you shouldn't really need to duplicate that one layer higher. The channel topic doesn't really matter as long as you're not broadcasting any messages on it – a topic by itself is just a string key, and it only costs as much as there are connected clients. With separate topics, however, you reduce the risk of accidentally broadcasting to every device.
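If it helps, the topic-per-device setup plus the built-in transport timeout looks roughly like this; the module names and the 45s timeout are placeholders, not a recommendation:

```elixir
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # The transport closes the connection if no client heartbeat arrives
  # within :timeout (Phoenix's default is 60_000 ms).
  socket "/device_socket", MyAppWeb.DeviceSocket,
    websocket: [timeout: 45_000]

  # ... rest of the endpoint (plugs, etc.)
end

defmodule MyAppWeb.DeviceSocket do
  use Phoenix.Socket

  # One topic per device: "device:<device_id>".
  channel "device:*", MyAppWeb.DeviceChannel

  @impl true
  def connect(%{"device_id" => device_id}, socket, _connect_info) do
    {:ok, assign(socket, :device_id, device_id)}
  end

  @impl true
  def id(socket), do: "device_socket:#{socket.assigns.device_id}"
end
```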
You could consider just a bare WebSocket connection using WebSock if you want to go lower level, or even bare TCP, but if there are other channels anyway there's not much benefit in that.
Generally I'd consider whether you really need to track connectivity globally by sending/tracking an event from the device itself all the way up to the DB. If you can go with per-node tracking of connected devices, you could e.g. use a per-node Registry and every X seconds batch-write to the DB which devices (Registry.select) are connected to that node. That's one write (or a handful) per node per 30 seconds, which should be quite fine in terms of DB load. In the DB you could then roll the data up to report on connectivity. If you don't trust the Phoenix heartbeat, you could always make the channel process on the server shut down / unregister if there hasn't been a ping within a provided window.
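A minimal sketch of that per-node reporter, assuming channels register themselves in a MyApp.DeviceRegistry (keys: :unique) on join and that a device_status table with a unique index on device_id exists:

```elixir
defmodule MyApp.ConnectionReporter do
  @moduledoc """
  Every 30 seconds, dumps the device IDs currently registered on this node
  into the database in a single upsert.
  """
  use GenServer

  @registry MyApp.DeviceRegistry
  @interval :timer.seconds(30)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:report, state) do
    # All keys (device IDs) registered on this node.
    device_ids = Registry.select(@registry, [{{:"$1", :_, :_}, [], [:"$1"]}])

    now = DateTime.utc_now()
    rows = Enum.map(device_ids, &%{device_id: &1, last_seen_at: now})

    if rows != [] do
      MyApp.Repo.insert_all("device_status", rows,
        on_conflict: {:replace, [:last_seen_at]},
        conflict_target: :device_id
      )
    end

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :report, @interval)
end

# Each DeviceChannel would register itself on join, e.g.:
#   Registry.register(MyApp.DeviceRegistry, device_id, nil)
```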
2 Likes
Yeah, that makes a lot of sense.
I want to use the channels for exchanging information with the Nerves devices, but a topic per device doesn't sound bad.
The devices' measurements/signals will need to be processed, so Kafka will be required for that either way, but your take on tracking presence is very reasonable.
Per-node connection reporting sounds reasonable, but I assume you're presenting it as an alternative to global Presence tracking because of the overhead penalty, right?
1 Like
Yes, especially given you only seem to need the data historically (since going through Kafka would be fine) and not in realtime.
1 Like
I would not go for Kafka; it forces you to process either an entire batch of messages or none at all – you cannot, as a feature of the platform itself, NACK a single message. There are ways around it, of course: if a message in a batch of 50 others cannot be processed, it can be pushed to another queue (forgot the technical term; was it a "dead letter queue" maybe?).
But that's just working around Kafka, IMO.
As surface-level advice, NATS with its JetStream (== persistence) extension enabled seems like a more viable solution. It supports NACK-ing single messages even when fetching them in batches, which allows you to [attempt to] reprocess a message almost immediately, f.ex. in the very next batch.
RE: partition strategy, I’d go for timestamp clusters I think. 
As for WebSockets, I'm ashamed to admit I've never researched how well 50k+ WebSockets scale on a single node that doesn't cost $1000 a day. Do WebSockets scale better than normal stateless connections? I'd research that if I were you.
1 Like
How many measurements are you talking total per second / minute?
1 Like
Max 1 per second.
Would it change anything if, let's say, it were 100/s? Just curious.
Even at 100 messages a second, Kafka is way overkill. You're still easily in Postgres territory, perhaps with some offloading of long-term timeseries data storage to something like ClickHouse. If you want to use some sort of queue/buffer, then I'd still reach for something like SQS or Google Cloud Pub/Sub instead of the headache that managing Kafka can be. Just shoving each payload into Oban is completely fine; we run thousands of jobs per second through Oban just fine.
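For the Oban route, a minimal sketch; the queue name, args shape, and the device_status upsert target are placeholders, not the OP's schema:

```elixir
defmodule MyApp.HeartbeatWorker do
  use Oban.Worker, queue: :heartbeats, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"device_id" => device_id, "at" => at}}) do
    {:ok, at, _offset} = DateTime.from_iso8601(at)

    # Upsert the latest heartbeat per device (assumes a unique index on device_id).
    MyApp.Repo.insert_all("device_status", [%{device_id: device_id, last_seen_at: at}],
      on_conflict: {:replace, [:last_seen_at]},
      conflict_target: :device_id
    )

    :ok
  end
end

# Enqueued from the channel, roughly:
#   %{device_id: device_id, at: DateTime.to_iso8601(DateTime.utc_now())}
#   |> MyApp.HeartbeatWorker.new()
#   |> Oban.insert()
```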
Regarding managing the devices: again, unless I'm missing something, 10k connections is well within the bounds of even a single normal-sized Phoenix node.
2 Likes
Sorry, I thought you meant per device, so it would be 10k to 1 million messages per second.
Oh!
Yes, that changes things; Kafka is a perfectly reasonable choice at that point. I do want to recommend ClickHouse again for storage, though: we started integrating it into our stack about a year ago and the compression + query performance is tremendous. There's definitely a learning curve, in that you're being handed a sort of high-powered "data structure as a database" tool, but it's been a delight to use.
On the WebSocket front, though, that still sounds quite doable on a node or two.
3 Likes
A simpler alternative to Kafka is a Kafka fork called AutoMQ. It decouples compute from storage, which simplifies partition rebalancing and scaling Kafka brokers.
If the only data you collect is heartbeats, and you're OK with a latency of several seconds, you might just collect all heartbeats in the API ingestion layer and every second dump them to S3 with a predefined file name. Then, on the consumer side, you could load them into ClickHouse directly from S3. This would eliminate Kafka/AutoMQ from your pipeline and rely on S3 alone as the "queue" between ingestion and consumption.
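A rough sketch of that ingestion-side dump, assuming ex_aws_s3, a process that buffers heartbeats in its state, and a bucket/key naming scheme you'd define yourself:

```elixir
defmodule MyApp.HeartbeatDumper do
  use GenServer

  @bucket "device-heartbeats"      # assumed bucket name
  @flush_every :timer.seconds(1)

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Called by the ingestion layer for every heartbeat.
  def record(device_id),
    do: GenServer.cast(__MODULE__, {:heartbeat, device_id, DateTime.utc_now()})

  @impl true
  def init(:ok) do
    Process.send_after(self(), :flush, @flush_every)
    {:ok, []}
  end

  @impl true
  def handle_cast({:heartbeat, device_id, at}, buffer),
    do: {:noreply, [%{device_id: device_id, at: at} | buffer]}

  @impl true
  def handle_info(:flush, buffer) do
    unless buffer == [] do
      # One JSONL object per node per second; ClickHouse can ingest these from S3.
      key = "heartbeats/#{node()}/#{System.os_time(:millisecond)}.jsonl"
      body = Enum.map_join(buffer, "\n", &Jason.encode!/1)

      ExAws.S3.put_object(@bucket, key, body) |> ExAws.request!()
    end

    Process.send_after(self(), :flush, @flush_every)
    {:noreply, []}
  end
end
```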
1 Like