How to avoid the DB being a bottleneck?

I’m writing what is basically a chat bot. I don’t need Phoenix, so I’m going with a Supervisor that spawns Plug.Adapters.Cowboy to listen for POST requests, uses a Plug.Router to capture the connection, passes it to a business logic layer, and sends the reply back to the user.
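Roughly, the setup looks like this (a sketch; the module names and port are placeholders):

    # Minimal sketch of the setup described above; Bot.Router and
    # Bot.Logic are placeholder names.
    defmodule Bot.Router do
      use Plug.Router

      plug :match
      plug :dispatch

      post "/message" do
        {:ok, body, conn} = Plug.Conn.read_body(conn)
        reply = Bot.Logic.handle_message(body)   # business logic layer
        send_resp(conn, 200, reply)
      end

      match _ do
        send_resp(conn, 404, "not found")
      end
    end

    # Under the supervisor:
    children = [
      Plug.Adapters.Cowboy.child_spec(:http, Bot.Router, [], port: 4000)
    ]
    Supervisor.start_link(children, strategy: :one_for_one)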

Right now, each POST request is independent and thus concurrent, which is a desired behavior. However, I’d like to store chat messages and user data to make the bot more intelligent, learning from the user’s requests.

I guess the basic solution is to make the business logic save all messages to a database using Ecto, but I don’t know whether that will cause concurrency problems (many processes sending INSERTs simultaneously) and eventually become a bottleneck. I was planning to use either a Task.Supervisor or a GenServer to insert the messages asynchronously: users spend a few seconds between messages, so I can trust that the last message has already been inserted when a new one comes in, while replying to the user instantly.
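Something like this sketch is what I have in mind (Bot.Repo, Bot.Message, Bot.TaskSupervisor, and build_reply/1 are placeholder names):

    # Inside the Plug.Router module, where send_resp/3 is imported.
    # Fire-and-forget insert: reply to the user immediately while a
    # supervised task persists the message in the background.
    def handle_message(conn, user_id, text) do
      Task.Supervisor.start_child(Bot.TaskSupervisor, fn ->
        Bot.Repo.insert!(%Bot.Message{user_id: user_id, text: text})
      end)

      send_resp(conn, 200, build_reply(text))
    end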

What do you think, is this design correct? How would you solve the DB layer?

Thanks!

4 Likes

You could always store them in memory via ETS and then insert them into the database later (buffer until size x is reached, then insert in one batch). It can also serve as a nice shared in-memory cache! :slight_smile:
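A rough sketch of what I mean (table name, batch size, and Bot.Repo are made up; under heavy concurrency a single owning process would be safer):

    defmodule MessageBuffer do
      @table :message_buffer
      @flush_at 100

      # Create the table once at startup; :public lets any process write.
      def init, do: :ets.new(@table, [:named_table, :public, :duplicate_bag])

      def add(message) when is_map(message) do
        :ets.insert(@table, {:msg, message})
        if :ets.info(@table, :size) >= @flush_at, do: flush()
        :ok
      end

      # Drain the buffer into one batched INSERT.
      def flush do
        rows = for {:msg, m} <- :ets.tab2list(@table), do: m
        :ets.delete_all_objects(@table)
        Bot.Repo.insert_all("messages", rows)
      end
    end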

2 Likes

Thanks thinkpadder1.

The Elixir guide mentions that ETS is a bit overkill for most projects. When is the right time to use an Agent to store this state, rather than ETS?

1 Like

Premature optimization is the root of all evil.

Do not complicate your own work now because you think that it might maybe become slow in the future.

Ecto/Postgres is designed to work with a pool of connections, so this already makes it less of a bottleneck. Also, as inserts usually happen in a transaction, they cannot conflict with each other.

The main thing that makes applications slow is that they follow this process:

  1. get some data from somewhere (such as the user).
  2. compare that to something in the database.
  3. store the resulting data in the database.
  4. read that data from the database.
  5. maybe do some other comparisons.
  6. store that data in the database.
  7. read that data from the database.
  8. return some data to the user.

(Steps 5, 6, and 7 happen zero or more times, but in growing projects they often occur many times during a single request-response cycle. In Rails: self.save, anyone?)

In a concurrent language such as Elixir, you don’t need the database as your ‘single source of truth’, because you can keep another process (or several) running in the background. The only reasons to have a database are to persist the information in case the whole system goes down, and because you might at some point gather more information than can reasonably be stored in memory.

So, a better loop would be:

  1. get some data from somewhere (such as the user)
  2. compare that to something you already have in memory.
  3. return the result to the user.
  4. (while in the background periodically persisting the results to the database)

This is a perfect use-case for an Agent or other GenServer.
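For example, as a sketch (Bot.Repo, Bot.Message, and the flush interval are made up):

    defmodule Conversation do
      use GenServer

      def start_link(user_id),
        do: GenServer.start_link(__MODULE__, user_id)

      def init(user_id) do
        :timer.send_interval(30_000, :persist)   # flush to the DB every 30s
        {:ok, %{user_id: user_id, history: [], unsaved: []}}
      end

      # Steps 1-3: compare against in-memory state and reply immediately.
      def handle_call({:message, text}, _from, state) do
        reply = compute_reply(text, state.history)
        {:reply, reply,
         %{state | history: [text | state.history], unsaved: [text | state.unsaved]}}
      end

      # Step 4: persist in the background; the user never waits on this.
      def handle_info(:persist, %{unsaved: unsaved} = state) do
        Enum.each(unsaved, &Bot.Repo.insert!(%Bot.Message{text: &1}))
        {:noreply, %{state | unsaved: []}}
      end

      defp compute_reply(_text, _history), do: "ok"   # the actual bot logic goes here
    end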
ETS tables are only useful if you find out at some point that using Agents/GenServers becomes too slow, which will only happen if your whole system depends on a single one. As you are dealing with bot processes where each instance manages its own conversation, I do not think you will hit this limit.

13 Likes

Thanks qqwy, outstanding reply!

2 Likes

So you only need to store data for machine learning? Then I would first consider how you want to store the data and what you will use for machine learning, and build everything on top of that (for example, Apache Cassandra for storage plus Apache Spark for machine learning).

I would definitely use a sharded NoSQL database to store large amounts of data.

I think building bots is a very interesting topic, e.g. Slack integration. But I think you will need more than regular expressions; you will need something like Natural Language Processing (NLP). See: Make a Self-Training, NLP-Driven Slack Bot (with code)

PS: Sorry if this reply is a little off-topic.

2 Likes

If the data is only to be accumulated right now for processing later, I’d think about simply appending it to a file. If you wrap the access in a module, you get an extremely simple solution that can be easily swapped for something more elaborate in the future. This also gives you an independent format that could be loaded anywhere.

You can just care about collecting the data right now and think about how to process it later, when you actually have the data.
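A minimal version of that wrapper might look like this (file name and line-per-message format are arbitrary; assumes messages contain no newlines):

    defmodule MessageLog do
      @file "messages.log"

      # Append one message as a line of text.
      def append(message) do
        File.write!(@file, message <> "\n", [:append])
      end

      # Load everything back for later processing.
      def all do
        case File.read(@file) do
          {:ok, data} -> String.split(data, "\n", trim: true)
          {:error, :enoent} -> []
        end
      end
    end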

3 Likes

If you’re using Ecto, then a pool of processes (of a configurable size) is used to talk to the database, so the level of concurrent access to the DB is already limited. If you want your data to be persisted so you can learn something from it, then the simplest case of running inserts through Ecto from the POST request handler is fine, and it’s probably the road I’d take.
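For reference, that level of concurrency is bounded by plain Repo configuration (app name, database, and pool size here are invented):

    # config/config.exs -- the pool bounds concurrent DB connections.
    config :bot, Bot.Repo,
      adapter: Ecto.Adapters.Postgres,
      database: "bot_db",
      username: "postgres",
      pool_size: 10          # at most 10 concurrent connections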

Appending to a file is also fine, but I think in this case you need to serialize all appends through the same process if you want to write to a single file. If you want to avoid running an external DB, you can consider using Mnesia.

Storing data in memory (through a GenServer, Agent, or ETS) is even simpler, but the downside is that data is not preserved on restart. If that’s acceptable, then that’s the approach I’d take.

Finally, as others have mentioned, if you don’t want to keep non-DB-related work waiting on the database, you can move the DB part to another process. That would speed up the non-DB-related part and allow you to actively manage the queue of DB operations, possibly discarding some store requests in times of overload. I talked a bit about that approach in my ElixirConf EU talk last year; see this part.
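A bare-bones sketch of that last idea (the names and queue limit are invented; the point is that one process owns the queue and can shed load):

    defmodule DbWriter do
      use GenServer

      @max_queue 1_000

      def start_link(_opts),
        do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

      # Callers never block on the database.
      def store(message), do: GenServer.cast(__MODULE__, {:store, message})

      def init(:ok), do: {:ok, :queue.new()}

      def handle_cast({:store, message}, queue) do
        if :queue.len(queue) >= @max_queue do
          {:noreply, queue}                   # overloaded: discard the request
        else
          send(self(), :drain)                # one :drain per accepted message
          {:noreply, :queue.in(message, queue)}
        end
      end

      def handle_info(:drain, queue) do
        case :queue.out(queue) do
          {{:value, message}, rest} ->
            Bot.Repo.insert!(message)         # assumed Ecto repo and schema struct
            {:noreply, rest}

          {:empty, empty} ->
            {:noreply, empty}
        end
      end
    end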

6 Likes

Hi Saša, thanks for your contribution. You’re right, after all it’s a design decision: databases always slow down code, so I have to choose between persistence and speed.

I’ll definitely take a look at your talk!

2 Likes

I am looking for the same kind of solution at the moment.

Here are my thoughts:
a) Since the application processes (Ecto) understand the queries very well (i.e., whether the data is static, temporarily static over a period, or dynamic), they could fetch it from a shared memory area visible to all Ecto processes, instead of going to the database for it.

b) Supervisor processes, triggered by stored procedures, would invalidate data in the shared memory, so that only up-to-date information remains there.

Any comment? Thanks.

Given how GenServers work, I do think you could get an even worse bottleneck when you ask the GenServer whether data is available in the cache before hitting the database.

But that’s only a feeling and needs to be (dis)proven.

(a) is basically ConCache.

(b) you don’t really need to go through stored procedures; you can just do the invalidation explicitly in your code (possibly adding helper functions to your Repo to make it easy).
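For illustration (cache name, schema, and helper names are made up; see the ConCache docs for start-up options):

    defmodule UserCache do
      # (a) read-through cache: hit the DB only on a miss.
      def get_user(id) do
        ConCache.get_or_store(:users_cache, id, fn ->
          Bot.Repo.get!(Bot.User, id)
        end)
      end

      # (b) explicit invalidation, e.g. called from a Repo helper after an update.
      def invalidate_user(id), do: ConCache.delete(:users_cache, id)
    end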

1 Like

For @NobbZ: I think IPC is faster than disk access, provided the data is not already in main memory.

For @dom, thank you for the pointer to ConCache.
The data may be updated by back-end processes; that’s why stored procedures are needed.

Thank you all.

I see, that sounds reasonable then. Postgrex supports LISTEN, so this should be quite straightforward if you’re on PostgreSQL.
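Sketched out (connection options, channel name, and the trigger body are placeholders):

    # Listen for invalidation events coming from the database.
    {:ok, pid} =
      Postgrex.Notifications.start_link(
        hostname: "localhost", database: "bot_db", username: "postgres")

    {:ok, _ref} = Postgrex.Notifications.listen(pid, "users_changed")

    # A trigger / stored procedure on the Postgres side runs e.g.
    #   PERFORM pg_notify('users_changed', NEW.id::text);
    # and the listening process then receives:
    receive do
      {:notification, _pid, _ref, "users_changed", user_id} ->
        ConCache.delete(:users_cache, String.to_integer(user_id))
    end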

This is something I’ve been giving a lot of thought to. I love Phoenix, yet I feel it kind of enforces the traditional fetch->update->persist->read->update->persist cycle. What I would love to see is a framework that allows you to keep basically everything needed for serving a request in memory, and avoid “blocking” the request until the DB has finished.

Sure, for some use cases you need that (an API to update a record), but many times we just want to have the system perform an action and eventually persist new state in the DB.

Maybe there is room for a framework, or at least an addition to Phoenix, that would allow us to work in a more OTP-centric and less Postgres-centric way?

I envision a way of retrieving a structure like a Phoenix “context” from the DB and putting it transparently in memory. Next time the data is requested, it comes from memory and the DB isn’t hit anymore. So far, we’ve got ourselves a cache. But what if we could also perform operations on this structure that would run in the background, like updating some information or sending an email? And, of course, next time we requested this structure, it would be up-to-date without a need to “expire” anything. It would be kind of like a stateful in-memory data structure.

I think that with a clever architecture design, one could avoid blocking DB calls in 90% of the cases.

That’s just what the generators do to get people started. Having an in-memory cache generated in bootstrap files would be quite a hefty complexity factor, which I wouldn’t want to present to any new users. I’d expect that to cause more confusion than anything else considering the difficulties of adding contexts to the generators. Also caching strategies depend highly on the individual use-case.

That’s essentially what contexts enable you to do, and even try to encourage. Blog.list_authors() might start out retrieving from the database like so: Repo.all(Author), but nobody says it has to stay that way. You can just as easily add logic to cache authors in a GenServer and only hit the DB when needed. As long as the function still returns a list of authors, the web layer doesn’t need to change, because it only interacts with the context functions and doesn’t care about the implementation.
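For instance, a sketch (Blog.AuthorCache is an invented Agent, started elsewhere with Agent.start_link(fn -> nil end, name: Blog.AuthorCache)):

    defmodule Blog do
      # Before: straight from the database.
      # def list_authors, do: Repo.all(Author)

      # After: serve from an in-memory cache, falling back to the DB.
      # The web layer calls Blog.list_authors() either way.
      def list_authors do
        case Agent.get(Blog.AuthorCache, & &1) do
          nil ->
            authors = Repo.all(Author)
            Agent.update(Blog.AuthorCache, fn _ -> authors end)
            authors

          authors ->
            authors
        end
      end
    end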

I mean, Phoenix even has a --no-ecto option, so the framework really does not care how you manage/persist data. Ecto/Postgres is just a nice default for all the people who do not have more complex requirements.

2 Likes

There are many reasons, but the major difference is concurrent access versus an Agent: multiple processes can read from and write to an ETS table concurrently, whereas access to an Agent / GenServer is serialized through that process’s message queue.
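For example, such a table can be tuned for exactly that kind of access:

    # A :public table with concurrency hints: many processes can read
    # and write it directly, without one process's mailbox in between.
    table =
      :ets.new(:shared_cache, [
        :set, :public,
        read_concurrency: true,
        write_concurrency: true
      ])

    :ets.insert(table, {:key, "value"})
    [{:key, "value"}] = :ets.lookup(table, :key)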

I think we need persistence (ACID), so that a process crash is not an issue for data integrity. I also agree with your vision, but it must be static or temporarily static data.