I have a program that creates multiple connections to MQTT servers. After a certain number of connections is created (>300), the BEAM VM freezes. When starting Observer before starting the connections, we can see scheduler utilisation jump from ~30% to ~100%, and the BEAM VM is stuck badly enough that switching tabs in Observer or entering commands in IEx is almost impossible.
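For context, the connection setup looks roughly like the following. This is a hypothetical sketch, not the real program: the hosts, client ids, and the logger handler are placeholders.

```elixir
# Hypothetical sketch of the connection loop; each simulated device
# runs its own mosquitto, reachable on a per-namespace address.
for n <- 1..340 do
  {:ok, _pid} =
    Tortoise.Connection.start_link(
      client_id: "device_#{n}",
      server: {Tortoise.Transport.Tcp, host: ~c"10.0.#{n}.1", port: 1883},
      handler: {Tortoise.Handler.Logger, []}
    )
end
```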
I was able to generate an Erlang crash dump and read the states of the processes:
I have never tried opening 300 connections with Tortoise… I don't know if some limit in the BEAM is hit here. I could perhaps ask some of my colleagues tomorrow if they know what could cause this.
But first off: are you connecting 300 instances of Tortoise to the same MQTT server, or do you have 300 MQTT servers running?
Hi, it's great to have the author of Tortoise here :-). Thank you. I wanted to be sure I could pinpoint the problem with some accuracy before opening a GitHub issue.
I do have around 340 connections to 340 MQTT servers (running locally with mosquitto). While it may seem strange to read, each server is used to simulate a different IoT device. That's why I'm doing it this way.
I would be pleased if you could ask your colleagues. In the meantime I will dig further into the issue. Thank you!
Today I tried monitoring for long_gc and long_schedule, but it did not trigger any messages.
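In case it helps anyone reproduce this, setting up that monitor from IEx looks roughly like this (the 50 ms thresholds are arbitrary):

```elixir
# Ask the runtime to message us whenever any process garbage-collects
# or stays scheduled for longer than 50 ms.
:erlang.system_monitor(self(), [{:long_gc, 50}, {:long_schedule, 50}])

# Events arrive as {:monitor, pid, :long_gc | :long_schedule, info}
receive do
  {pid, event, info} = {:monitor, p, e, i} |> then(fn _ -> {p, e, i} end) -> IO.inspect({pid, event, info})
after
  5_000 -> IO.puts("no long_gc/long_schedule events within 5s")
end
```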
All scheduler traces (except one) end in the unregister_match function of the event module, called from the connect function of the connection module.
You may also be running into limits of your machine in general if you are running 340 Mosquitto servers.
To simulate load, I would have 1 MQTT server, a script that pushes messages to multiple topics, each topic could be a device/sensor combination, and then one Tortoise client connection to the MQTT Server, with wildcard topic subscriptions for each device id and sensor type.
You could write the script in Elixir and spawn a process for each "device" to simulate concurrent messages over the connection.
This would be closer to what you would be doing in a production environment, because your IoT devices won't be talking directly to your service, but to your MQTT server, which will forward the messages on to your service.
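A sketch of that single-broker setup, assuming Tortoise's documented options (the topic layout and client id are made up for illustration):

```elixir
# One connection with a wildcard subscription covering every
# device/sensor combination.
{:ok, _pid} =
  Tortoise.Connection.start_link(
    client_id: "aggregator",
    server: {Tortoise.Transport.Tcp, host: ~c"localhost", port: 1883},
    handler: {Tortoise.Handler.Logger, []},
    subscriptions: [{"devices/+/sensors/#", 0}]
  )

# each simulated "device" is then just a process publishing to its topic
for device <- 1..340 do
  spawn(fn ->
    Tortoise.publish("aggregator", "devices/#{device}/sensors/temp", "21.5", qos: 0)
  end)
end
```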
I see where you are pointing, but in our case each physical device has an embedded MQTT server to aggregate sensor data. The data are then read by a client on the device. So to simulate the cluster I need to simulate all of these MQTT servers.
Interesting, so you have a VPN or some static IP for each device?
Do you have control over the devices themselves? If so, I would still write an aggregator script that pushes the messages from the local server up to a "cloud" server.
One IP for each simulated device; it is done with Linux network namespaces. We don't have aggregators or MQTT bridges in production, so I don't want them in the simulation either.
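For the curious, the per-device setup is along these lines. This is an illustrative sketch driven from Elixir; the interface names and addresses are made up, and it needs root on Linux with iproute2 installed.

```elixir
# Illustrative only: create a namespace with a veth pair so the
# simulated device gets its own IP address.
defmodule Netns do
  def create(n) do
    ns = "dev#{n}"
    {_, 0} = System.cmd("ip", ["netns", "add", ns])
    {_, 0} = System.cmd("ip", ["link", "add", "veth#{n}", "type", "veth",
                               "peer", "name", "ceth#{n}"])
    {_, 0} = System.cmd("ip", ["link", "set", "ceth#{n}", "netns", ns])
    # `ip -n <ns>` runs the command inside the namespace
    {_, 0} = System.cmd("ip", ["-n", ns, "addr", "add", "10.0.#{n}.1/24",
                               "dev", "ceth#{n}"])
    {_, 0} = System.cmd("ip", ["-n", ns, "link", "set", "ceth#{n}", "up"])
    :ok
  end
end
```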
Wait, you are creating and destroying network namespaces dynamically at a large scale?
We had a similar setup for some simulated network testing a few years ago.
We created and destroyed many hundreds of namespaces per hour, and the system call that should create the next one simply stalled when it came to creating the nth namespace.
The OS thread responsible for this call just wouldn't be scheduled by the OS anymore and wasn't even kill -9-able. Even worse, when the parent was killed, the stale child "survived".
It was not possible to create any further namespaces once this occurred. n was constant per machine across reboots, but different across the handful of hosts we tested.
Back then my personal Funtoo machine survived the most namespaces, somewhere in the tens of millions, while most other systems stalled already in the single-digit millions.
We were not able to hunt down the root cause, as the affected client decided to just use a VM that gets thrown away after a certain number of iterations, while still in a safe range of already-created namespaces.
So if you really do work with a huge number of dynamically created and deleted namespaces, check if any of the described symptoms happen to you as well.
PS: We did not use Erlang for that project; it seemed to be a limitation of the OS, not of Erlang.
That's exactly what I am doing!
Funny that you have been very active on this forum since I registered and that we experienced the same problem.
You are right that there are system limitations with network namespaces that I still need to explore. In this particular case, however, I think it is due to something else. The traces show the code is stalled in the Tortoise lib, not in the code creating the namespaces.
Today I replaced Tortoise with emqtt and I was able to create a much higher number of connections (around 512). Then I got some errors with the network namespaces, BUT the BEAM VM was not stuck this time.
So it shows me that something strange is happening with Tortoise in this setup.
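For comparison, the emqtt side of the experiment was essentially this (host and client id are placeholders; emqtt is the Erlang MQTT client from the EMQX project):

```elixir
# Same per-namespace brokers, but driven by :emqtt instead of Tortoise.
{:ok, pid} = :emqtt.start_link(host: ~c"localhost", port: 1883, clientid: "sim_1")
{:ok, _props} = :emqtt.connect(pid)
:ok = :emqtt.publish(pid, "devices/1/sensors/temp", "21.5")
```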
It happens because programs have a few states they can be in:
Active_Scheduleable
Sleep_Scheduleable
Wait_Scheduleable
Wait
Scheduleable means the process is in a coherent enough state that it can receive signals. If it's not scheduleable, then the process is waiting on something deep in the kernel.
The "Wait" state generally only happens on a few specific kernel primitives; accessing low-level resources, like allocating new network data structures (not accessing them), is one. If the kernel can't create the resource because it has run out of space for it, then the program will wait until resources become available. If the program is the one that caused the kernel to run out of the resource, then it is a zombie, forever dead and inaccessible, as there isn't a single signal in the entire system that can reach it, not even kill -9.
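That "not even kill -9" behaviour matches Linux's uninterruptible sleep, shown as state "D" in ps. One way to spot such stuck tasks, sketched here in Elixir against a Linux /proc layout:

```elixir
# List PIDs whose state field in /proc/<pid>/stat is "D"
# (uninterruptible sleep), the state that ignores even SIGKILL.
defmodule DState do
  def pids do
    for dir <- Path.wildcard("/proc/[0-9]*"),
        # a process may vanish between wildcard and read; skip errors
        {:ok, stat} <- [File.read(Path.join(dir, "stat"))],
        # the state letter is the first field after the ")" that
        # closes the command-name field
        [_, rest] <- [String.split(stat, ") ", parts: 2)],
        String.starts_with?(rest, "D") do
      Path.basename(dir)
    end
  end
end
```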
@gausby did you get any info from your colleagues? I don't have the problem with emqtt, but I would still love to use Tortoise. I will try to write a reproducible example next week.
These lines are actually Port states, not Process states.
The PORT_LOCK refers to the fact that the Port is locked using a port specific lock instead of a driver lock. This is normal and nothing to worry about.