Need help with Exqlite segfaulting when timing out

Author of the exqlite library here. I need some help from experienced NIF devs in hunting down the culprit of this SIGSEGV error. This bug has been keeping me awake at night for a few weeks now.

Ya ya ya, it’s C, and I should be using Rust, but right now, I don’t want to make that change just yet for this library. Let’s ignore that for now.

What I am running into: I have an SQLite database and a pool of DBConnections to a single SQLite database file on disk. SQLite's write-ahead log (WAL) functionality works pretty well in most of the use cases I've been hammering it with.

Except for an issue where database connections time out. In order to simulate the timeouts in a reproducible manner, we shorten the timeout window to 1 second, write 1,000 new rows to a single table, and queue up 5,000 async tasks to execute the database query.

Failing test implemented here:

It can be run by checking out that branch and running the following:

mix test test/exqlite/timeout_segfault_test.exs

Is there a way I can have valgrind with an asdf installation or am I going to need to use some other erlang / elixir version manager to do this?

I’ve debugged my share of C NIFs in the past, so I hope I can provide some help. I pulled your branch and reproduced the segfault. Locally I added some fprintf-based logging to exqlite_close and noticed something that made me raise an eyebrow, so I thought I’d share. If I’m off base here, please let me know.

An fprintf statement in exqlite_close prevents the segfault on my system, so there is definitely a timing element here (a race condition). I printed the memory addresses for conn and conn->db and saw many duplicate addresses, which made me wonder whether exqlite_close is being called multiple times on the same resource, resulting in a double-free-style segfault under high concurrency.

BTW, thanks for this library. I am a fan. :slight_smile:


This was my observation as well, which makes me believe this has something to do with the garbage collector.

I’m not familiar with DBConnection, but I’ve written similar libraries in the past. When you do DBConnection.start_link, is it calling your connect callback one time and then sharing the result among the pool of 50? If that’s the case, I wonder whether creating a single enif resource and allowing it to be “disconnected” by multiple concurrent processes in the pool is the source of the race condition. The DBConnection README says it calls disconnect automatically:

DBConnection also takes all of the care necessary to handle
failures, and it shuts down the connection and the socket
whenever the client does not check in the connection to avoid
recycling sockets/connections in a corrupted state (such as a socket
that is stuck inside a transaction).

(It looks like DBConnection was written with network sockets in mind. Due to the nature of SQLite, obviously there is no network socket, so the disconnect behavior may not be optimal for your use case.)

In my experience, it’s safest to give an enif resource a dedicated BEAM process so that you can carefully control access to it. At the very least, I assume you only want a maximum of one exqlite_close call on a given resource; you can build that guarantee by serializing calls to Connection.close through a dedicated BEAM process. Otherwise, you can’t guarantee there aren’t two scheduler threads freeing the memory simultaneously.

We get spoiled by the BEAM scheduler and garbage collection, but when it comes to NIFs sadly we have to worry about all these issues again. I don’t believe the BEAM garbage collector is causing any problems in this case.

Edit: in my haste I probably jumped to some wrong conclusions about DBConnection. I’m going to spend some more time and try to get more concrete results.


The way I’ve written it, every time a new connection in the pool is created, a new SQLite instance is opened; there is no sharing of SQLite databases between connections.

Is db_connection even necessary in SQLite’s case? When I was trying to get my library off the ground a few years ago I planned on using :poolboy or, as I learned lately, the :jobs Erlang library.

Plus, if you open only one handle and set it up as fully multithreaded, you can have a single stateful OTP process per SQLite connection string and return it each time a connection is checked out, and only close it when the process terminates, which you can do manually with an API, e.g. Exqlite.close(handle), or simply when the app itself shuts down.

no sharing of sqlite databases between connections

Yes, I missed the mark on that detail. I see now that the pool is starting a process per enif resource.

However, the DBConnection docs sound as if they expose the “socket” (in your case, the enif resource handle) to client processes as an optimization. The lib is pretty sophisticated, so it’s challenging for me to grok quickly, but the docs suggest this is the case:

Other database libraries would send a request
to the connection process, perform the query in the connection
process, and then send it back to the client. This means a lot of
data copying in Elixir. DBConnection keeps the socket in the
holder and works on it directly.

If I have this right, then you may find that multiple processes (the client procs) are calling exqlite_close concurrently on the same enif resource.

Not necessarily, but I tied this pretty closely to being used by ecto_sqlite3.

I could use poolboy or nimble pool for pooled resources, but when it comes to Ecto, you have to implement the DBConnection interface in order to have it play nice with ecto_sql.


This certainly could be the case. I just don’t have a good way to prove that a double close is happening.

I get it and I’m not judging; I’m simply saying that if I encounter something like this and it takes me more than a few hours to troubleshoot, I’d likely just eliminate the problem by drastically changing approach.

I even have most of the Rust code that maintains a per-connection-string cache of SQLite handles (a concurrent hash map); IIRC I only needed to add tests, but that’s when I stopped everything due to personal and work-life turmoil.

I will be starting a new job soon and after I settle a bit I’ll finally return to making my library workable. Maybe you can take inspiration from my Rust code; I’ll be happy to help when I have some more free time as well.

Feel free to ping me anytime!

Oh. I haven’t looked into that. I definitely will now, thanks.

@jstimps every connection in the DBConnection pool handles one query at a time, which I assumed meant that I could exqlite_open the SQLite database once per connection, and when that connection disconnects, we could safely close and discard that reference.

Recently, I added a reference from the statement_t type to the connection_t resource, calling enif_keep_resource and enif_release_resource on it, to ensure that the database doesn’t get ref-counted away while a statement is still holding a pointer to it.

It should be possible to influence the kerl build of Erlang via asdf. I’ve never built valgrind support personally, but it appears to be supported:

when that connection disconnects, we could safely close and discard that reference.

Sound reasoning, as long as the disconnect is performed serially inside DBConnection’s implementation during timeout scenarios. It’s honestly hard to tell, and I definitely feel you on the challenge of proving it one way or the other.

If I had more time to devote, my next step would be to use a GenServer in Exqlite.Connection.connect (i.e., call GenServer.start_link in connect/1) and see if the problem disappears. You might see some performance impact, though.

I’ll give this a try and see if I can cause it to blow up.

It is. All the timeouts go through a single connection pool GenServer, which kills the connection process.

Hypothesis

The issue seems to be a race condition between exqlite_close and exqlite_prepare. It happens when we execute sqlite3_prepare_v3 on a connection while, in parallel, we close the connection with sqlite3_close_v2.

sqlite3_close_v2 does handle previously prepared statements, but here we are preparing the statement in parallel with closing the connection, which occasionally causes SQLITE_MISUSE.

How can we reach this state?

Given

  • the disconnect callback is executed by the connection process
  • handle_prepare and other callbacks are executed in the client process
  • a NIF call cannot be interrupted or cancelled once started, irrespective of any client-side process timeout, process kill, etc.

Scenario

  1. the client checks out a connection and makes a query
  2. the client’s query invocation makes the exqlite_prepare NIF call but fails to complete within the connection timeout (1 ms in the test case we are using); the NIF call is still running, or scheduled to run, even after the timeout
  3. due to the connection timeout, the connection process receives the timeout and initiates connection close (the disconnect callback)
  4. the connection process calls the exqlite_close NIF

Steps 2 and 4 run in parallel, leading to the race condition mentioned previously and causing the segfault.

Fix

I tried to fix the race condition using a mutex, and so far I haven’t seen a single failure.

This is a PoC; I guess we should handle several other cases as well, not just the race between close and prepare, if this is the actual issue.

This actually makes WAY more sense to me now as to what is happening. When I tried something similar (open a db, write a ton of rows at once in a large async task) without using DBConnection, I could not get it to segfault; but I was orchestrating when the connection was closed and garbage collected, so the segfault in the scenario you describe above wouldn’t happen.

Thanks a ton for all the help everyone.

Do I understand correctly that this fix will prevent the segfault, but a connection timeout still won’t interrupt the executing query?

That is correct. I still haven’t figured out how to cancel a currently running query. I think this will be near impossible with the way dirty NIFs are handled.

I could look into using an enif_cond or some sort of signal to attempt to cancel it. I’m still trying to figure out a solution for that.