Replacing stored procedures with Elixir processes?

Some of the apps where I work frequently suffer outages as a result of our stored procedures, and someone suggested that maybe we could take some of this load off the database by bringing the data into Elixir. Has anyone here seen something like this where they work?

The query-by-id parts would likely get translated into Phoenix topics. What would be a good way to back up the in-memory data in case a process crashes? I’m imagining that Elixir could spawn the needed processes on demand, so that any processes representing objects that haven’t gotten attention in a while could be put to sleep and left in the database, to be warmed up the next time someone asks for those objects. Our apps stall when specific records get a lot of attention, so the reads and writes tend to be concentrated around a specific object ID and all the records associated with it.
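By topics I mean something roughly like this (just a sketch; the PubSub name, topic format, and message shape are placeholders I made up):

    # Rough idea: all reads/writes for one object flow through its topic.
    # Assumes a PubSub named MyApp.PubSub is already started in the app's
    # supervision tree.
    object_id = 42
    topic = "object:#{object_id}"

    :ok = Phoenix.PubSub.subscribe(MyApp.PubSub, topic)
    :ok = Phoenix.PubSub.broadcast(MyApp.PubSub, topic, {:amount_added, 5})

    receive do
      {:amount_added, val} -> IO.puts("got #{val} on #{topic}")
    end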

1 Like

There’s a discussion on caching to reduce DB load here (particularly for heavy reads that are frequently hit). It is probably a bad idea to roll your own caching system using raw processes; it gets hard quite quickly (e.g. handling timeouts, knowing when to invalidate, etc.).

2 Likes

I’ve given this some thought, and I actually think what I want to build is not a cache. I think it’s more akin to a buffer for the DB. As such, there’s no need for cache invalidation.

When a request comes in for a record, it’s loaded into an Agent, and the Agent will persist for X minutes after the most recent read/write request. All reads/writes for the associated record will be handled by the Agent while it’s alive, and writes to the DB will be generated by the Agent at a configured time interval (which gives devops control over DB load). When the Agent reaches end of life, it writes its final state to the DB.
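Roughly what I have in mind, as a sketch (the module, field, and helper names here are all invented, and the X-minute expiry plus the interval flush would still need a timer or supervisor wired around it):

    defmodule RecordBuffer do
      use Agent

      # Load the record once, then serve reads/writes from memory.
      def start_link(record_id) do
        Agent.start_link(fn -> load_from_db(record_id) end)
      end

      def read(buffer), do: Agent.get(buffer, & &1)

      # e.g. an "add to this amount" style write
      def add(buffer, val) do
        Agent.update(buffer, fn record -> Map.update!(record, :col2, &(&1 + val)) end)
      end

      # Called on a timer while the Agent is alive, and once more at end of life.
      def flush(buffer), do: Agent.get(buffer, &write_to_db/1)

      defp load_from_db(_record_id), do: %{col2: 0}  # placeholder for the real query
      defp write_to_db(_record), do: :ok             # placeholder for the real write
    end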

Is data loss acceptable?

This is exactly the issue. If your database is slow and inefficient, batching data in memory before inserts just adds another layer of failure. It isn’t clear how this would even help with the stored procedure issue, because those will ultimately fire when the data is inserted anyway.

If you have valuable data, and you can’t tolerate data loss, then you can’t afford to store that data in memory, it must be written to a disk somewhere. If reads are your issue and not writes then you can look into caching.

2 Likes

Writes are the issue. I would remove the stored procedures and do those operations in Elixir, so then the Agents simply write their state to the DB.

Let’s separate the Agent, then, from the question of “stored procedures and do those operations in Elixir”. What do the stored procedures do?

1 Like

I think the core of the problem is this loop:

    BEGIN
      LOOP
        -- First, try to update an existing row.
        UPDATE table1
          SET col2 = col2 + val
          WHERE id = arg_id AND month = month_of;
        IF found THEN
          RETURN;
        END IF;
        -- Not there, so try to insert it. If another session inserts the
        -- same key concurrently, we get a unique_violation and loop again.
        BEGIN
          INSERT INTO table1 (id, month, col1, col2)
            VALUES (arg_id, month_of, 0, val);
          RETURN;
        EXCEPTION WHEN unique_violation THEN
          -- do nothing, and loop to try again
        END;
      END LOOP;
    END;

To persist the data to disk prior to the DB update, maybe DETS could handle it? I’ve tested writing Erlang terms to disk in binary format via :erlang.term_to_binary/1; it’s really simple, but I don’t know how fast it is.
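For DETS, I’m picturing something like this (a rough script; the table name, file path, and state shape are all made up):

    # Quick experiment with DETS as a crash-recovery store for the Agent state.
    {:ok, table} =
      :dets.open_file(:record_buffer,
        file: String.to_charlist("record_buffer.dets"),
        type: :set
      )

    record_id = 42
    state = %{col2: 1_000}

    # Persist the current in-memory state to disk.
    :ok = :dets.insert(table, {record_id, state})

    # After a crash/restart, recover whatever was written last.
    case :dets.lookup(table, record_id) do
      [{^record_id, saved}] -> IO.inspect(saved, label: "recovered")
      [] -> IO.puts("nothing saved for #{record_id}")
    end

    :ok = :dets.close(table)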

Currently devops kills this SP when our service goes down, so I assume data loss is already happening.

These aren’t multi-node solutions. If you run more than 1 server, how do you deal with conflicts? Your stored procedure is based entirely around a unique constraint violation, you can’t possibly replicate that outside of your database because you don’t have a consistent view of your database from Elixir.

I do think that you could move these stored procedures to Elixir, but by “to Elixir” I mean you could craft a database transaction that is perhaps more efficient.

I’m not sure that’s true. If a user submits a request and then devops kills stuff, the user just sees their request fail. The danger of using an Agent is that the user submits a request, your server says “ok, confirmed”, and then the Agent dies and their data is actually gone, even though you said it was confirmed.

Before we can evaluate things further though, we need to understand better what the purpose of that loop is, because at the moment you can’t straight up reimplement it in Elixir at all.

2 Likes

It looks like the purpose of this code is to track an amount that we are summing. Our DB has things, each is globally unique, and each has this amount field. Then events occur which add to this amount, and we want to update and report that amount accurately. That all seems doable via topics, and I would think we could spawn a single process per globally unique thing (using a DynamicSupervisor), so we wouldn’t need to worry about conflicts. The number of events we’re talking about is ~100k–200k over ~2 hours, and all of those events relate to just one of our globally unique things. I would think that’s quite manageable.
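Roughly what I’m picturing per globally unique thing, as a sketch (module names are invented, it assumes a Registry and a DynamicSupervisor are already started in the supervision tree, and it ignores the multi-node questions raised above):

    defmodule AmountTracker do
      use GenServer

      # One process per globally unique thing, looked up by its id.
      # Assumes AmountTracker.Supervisor (a DynamicSupervisor) and
      # AmountTracker.Registry (keys: :unique) are started elsewhere.
      def track(thing_id) do
        DynamicSupervisor.start_child(AmountTracker.Supervisor, {__MODULE__, thing_id})
      end

      def start_link(thing_id) do
        GenServer.start_link(__MODULE__, thing_id, name: via(thing_id))
      end

      def add(thing_id, val), do: GenServer.cast(via(thing_id), {:add, val})

      def current(thing_id), do: GenServer.call(via(thing_id), :current)

      defp via(thing_id), do: {:via, Registry, {AmountTracker.Registry, thing_id}}

      @impl true
      def init(thing_id), do: {:ok, %{id: thing_id, amount: 0}}

      @impl true
      def handle_cast({:add, val}, state), do: {:noreply, %{state | amount: state.amount + val}}

      @impl true
      def handle_call(:current, _from, state), do: {:reply, state.amount, state}
    end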

The only constraint I see on this table:

ALTER TABLE ONLY public.table1
    ADD CONSTRAINT table1_pkey PRIMARY KEY (id);

And there’s also this index:

CREATE INDEX index_table1_id_month ON public.table1 USING btree (id, month);

Does this generate a unique_violation when inserts are attempted with an existing ID & month pair?

Ah! I finally get it. Which database are you on? This stored procedure is basically a home-grown version of Postgres’s INSERT ... ON CONFLICT, so if you’re on Postgres you may be able to refactor this to just use that and move on.

Beyond that, I was referring to EXCEPTION WHEN unique_violation THEN. The whole thing is weird, though, because if id is the primary key, indexing on (id, month) shouldn’t actually help, since querying on just id is already covered by the primary key index. Where does arg_id come from in your stored procedure?

3 Likes

Ya, it’s postgres

Excellent! So before we do any fancy Elixir things, your current code should really try to use INSERT ON CONFLICT, which allows you to try to insert a value and, if one already exists, conditionally update it. Your stored procedure is basically inserting a summation row, but it first checks if one exists and then adds to it. I feel like the native version is likely to perform much better.
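And if/when you do move this into Elixir with Ecto, the same upsert maps onto the :on_conflict option. A rough sketch (MyApp.Repo and the literal values are placeholders, and the conflict target assumes the uniqueness really is the id primary key shown above):

    # Insert the row if it's new, otherwise add `val` to the existing col2.
    {arg_id, month_of, val} = {123, 1, 5}   # placeholder bindings

    MyApp.Repo.insert_all(
      "table1",
      [[id: arg_id, month: month_of, col1: 0, col2: val]],
      on_conflict: [inc: [col2: val]],
      conflict_target: [:id]
    )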

200k over 2 hours is ~30 updates a second, which really ought to be doable in the DB.

I don’t want to discourage you from an Elixir solution, but spawning a single process per globally unique thing will only work fine until there is a netsplit. Phoenix’s distributed tooling works great because (aside from Presence) it handles distributed communication but not distributed state. Distributed state is much, much harder because you have to deal with split brain, things coming apart and coming back together. Presence handles it, and it required some very fancy cutting-edge stuff to make it work.

I still think there are ways an Elixir layer could help, depending on whether these sums can be slightly out of date or not, but all of them are going to involve choosing when to do a database update, not trying to do the sums in Elixir.

2 Likes

It’s probably an ID in the URL from the client. In the Rails code, it’s present in the context wherever this function is being called.

OK. I’m not super familiar with how Rails and stored procedures work together, so I’m a bit lost on that point. Regardless, I’d still check out ON CONFLICT.

To explore this from another angle though, how important is it that these updates are exactly up to date? Can they be delayed by a few dozen seconds?

Ya, definitely. EDIT: Actually, it might be important to not be delayed. I think we need to prevent future events if the sum hits a limit.

Cool, so at a minimum you could replace the per-request summation with an update query that happens on an interval. You could also track which entries in the other table you’re summing have already been handled.
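Something shaped like this, for example (a sketch only; the repo, table, and interval are placeholders, and it knowingly accepts losing whatever has accumulated since the last flush if the node dies):

    defmodule SumFlusher do
      use GenServer

      # Buffers increments in memory and writes them out on an interval,
      # instead of one DB round trip per event.
      @flush_every :timer.seconds(30)

      def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

      # Events call this instead of hitting the DB directly.
      def add(id, month, val), do: GenServer.cast(__MODULE__, {:add, {id, month}, val})

      @impl true
      def init(acc) do
        Process.send_after(self(), :flush, @flush_every)
        {:ok, acc}
      end

      @impl true
      def handle_cast({:add, key, val}, acc), do: {:noreply, Map.update(acc, key, val, &(&1 + val))}

      @impl true
      def handle_info(:flush, acc) do
        # One upsert per dirty (id, month) pair since the last flush.
        Enum.each(acc, fn {{id, month}, sum} ->
          MyApp.Repo.insert_all("table1", [[id: id, month: month, col1: 0, col2: sum]],
            on_conflict: [inc: [col2: sum]],
            conflict_target: [:id]
          )
        end)

        Process.send_after(self(), :flush, @flush_every)
        {:noreply, %{}}
      end
    end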

I suppose my basic point here is this: this feels to me more like a database design problem than a language problem. You are trying to sum rows in a database, and the only way to do so and guarantee zero error is to do the summation in the database. The method y’all are using right now is definitely inefficient, and I’m not surprised there are scaling issues. But before you break out a new language to solve your problem, I’d explore other options with the tools you have.

Elixir is awesome, and it can help with a lot of problems. At the end of the day, though, as long as the problem statement is “I want the values in these rows in table X to represent the totals of rows found in table Y”, your database design and interaction pattern are going to matter far, far more than the language.

EDIT: In response to your edit: Try INSERT ON CONFLICT. In general my point about this being a database design issue remains important, particularly if you use those sums to make further database decisions.

Thanks @benwilson512. This SO answer indicates that the code we’re using probably came from “the manual”.

Does the INSERT ON CONFLICT syntax still support IF found THEN RETURN; END IF;?

It makes that syntax irrelevant. The postgres documentation on it is very good: PostgreSQL Upsert Using INSERT ON CONFLICT statement
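To make it concrete, the whole retry loop in your procedure collapses into a single statement. Here it is fired from Elixir, as a sketch (MyApp.Repo and the bindings are placeholders, and ON CONFLICT (id) assumes the uniqueness really is the id primary key):

    # Column names come from the procedure above; EXCLUDED refers to the row
    # we tried to insert, table1.col2 to the row that already exists.
    {arg_id, month_of, val} = {123, 1, 5}   # placeholder bindings

    sql = """
    INSERT INTO table1 (id, month, col1, col2)
    VALUES ($1, $2, 0, $3)
    ON CONFLICT (id) DO UPDATE
    SET col2 = table1.col2 + EXCLUDED.col2
    """

    MyApp.Repo.query!(sql, [arg_id, month_of, val])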

EDIT: The SO link you provided in fact shows ON CONFLICT in the top answer.

1 Like

Yup, I see it. I shared this link with the team.

I think I was confused about what the IF found does. I think I understand now.