Confused about replacing DBs with processes and GenServers

I remember listening to a podcast with Joe Armstrong where he told a story about a guy trying a few different programming languages. When he came to Erlang, this guy asked the mailing list how he would go about modeling the DB for the project he was building, to which Joe replied along the lines of "You don't need to use a database in Erlang" (paraphrasing).

I also saw something similar in this Daily Drip video, where he says he doesn't really need the database and can instead just use GenServers.

I understand how you can replace database functionality with these kinds of processes/ETS; what I'm confused about is how to still persist the data across a crash/restart.

From my understanding, on crash/restart the state is lost.

Is it normal to still have a database backing the processes, where the state is written to the database on change and read back from it on restart?

Or is it typical to have replicated processes for backup?

4 Likes

If I remember this right, one of the points he seemed to be making was that you don't necessarily need a DB for a lot of the stuff that people tend to dump in a DB by default. For a given application there may, depending on what it does, be no need for one; you just keep the state in long-running processes. But if you need to persist state on disk somewhere, it's rather likely that you'll want to use a DB of some kind (even something as simple as DETS, or going further, Mnesia). I might be wrong, but I am fairly convinced the comment was in no way meant to be taken as a blanket rule.
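For the "something as simple as DETS" case, here is a minimal sketch of what that can look like. The module name, table name, and file name are all invented for illustration; DETS itself is part of OTP:

```elixir
# A minimal disk-backed key/value store using DETS.
# All names here (ScoreStore, :scores_table, scores.dets) are made up.
defmodule ScoreStore do
  @table :scores_table

  def open do
    {:ok, _} = :dets.open_file(@table, file: ~c"scores.dets", type: :set)
    :ok
  end

  def put(key, value), do: :dets.insert(@table, {key, value})

  def get(key) do
    case :dets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :not_found
    end
  end

  # Closing flushes the table to disk; the data survives a restart.
  def close, do: :dets.close(@table)
end
```

The point is just that "persistence" here is one `open_file` away, without reaching for an external database.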

3 Likes

For one of my side projects, we used GenServers to hold the current state as-is, while using a text file to save the steps necessary to reach the current state from scratch.

Consider it being a game (it wasn't). We dumped every move into the text file, and after a restart we just loaded all the text files and replayed the games that were still marked active.

After a game was finished, its result was stored in the database as well as the name of the corresponding history file.

This is one possible concept of replacing/extending the database.
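The append-and-replay idea above can be sketched roughly like this. This is not the original poster's code; the module name, file format (one base64-encoded Erlang term per line), and the fold function are all assumptions made for illustration:

```elixir
defmodule MoveLog do
  # Append one "move" to the log as a base64-encoded Erlang term,
  # one term per line, so the file can be replayed later.
  def append(path, move) do
    line = move |> :erlang.term_to_binary() |> Base.encode64()
    File.write!(path, line <> "\n", [:append])
  end

  # Rebuild the current state from scratch by folding every logged
  # move into the initial state. A missing file means a fresh start.
  def replay(path, initial_state, apply_fun) do
    case File.read(path) do
      {:ok, contents} ->
        contents
        |> String.split("\n", trim: true)
        |> Enum.map(&:erlang.binary_to_term(Base.decode64!(&1)))
        |> Enum.reduce(initial_state, apply_fun)

      {:error, :enoent} ->
        initial_state
    end
  end
end
```

After a crash, the GenServer's `init/1` would call `replay/3` to reconstruct its state before serving requests.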

PS: After the client moved that software for the first time and didn’t follow the instructions, we dumped the history into the database as well :wink:

3 Likes

It is not bad to use a DB as a unique source of truth. But the question stays the same: if the DB dies or errors, how do you persist the data?

2 Likes

A backup DB =]

1 Like

One of the main advantages of using a GenServer, for me, is that you can delegate persisting data to that GenServer. If for some reason the database is locked, or just to gain precious milliseconds, you have the current state in the GenServer, and it's that GenServer's job to properly persist the data. Your main process does not have to deal with I/O (your database will more often than not be on a different machine anyway), because it's the GenServer's job to handle saving data, or even to delegate the actual save to yet another process.

Also, when the database dies for some reason, you have a current copy of the data in the GenServer, which can automatically retry saving after a few minutes, switch to a secondary/backup database, or... whatever you think of. Meanwhile, your main process, which serves the request for data and possibly transforms it so it can be presented to the end user/machine/whatever, will not be blocked by I/O and does not have to know anything about how, or whether, the data is persisted.

I mean, you get encapsulation almost for free, and it's really hard to break that encapsulation. Because of that, it's very easy to switch from one persistence solution to another, since most parts of your application only have to worry about the data, not how to obtain it.
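A minimal sketch of that "the GenServer owns persistence" pattern: callers read and write the in-memory state immediately, while the server flushes to disk in the background and simply retries on the next tick if a flush fails. The module name, the file-based flush target, and the interval are all assumptions for illustration:

```elixir
defmodule StateKeeper do
  use GenServer

  @flush_every 5_000  # milliseconds between background flushes

  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)
  def put(key, value), do: GenServer.cast(__MODULE__, {:put, key, value})
  def get(key), do: GenServer.call(__MODULE__, {:get, key})

  @impl true
  def init(path) do
    # Restore the previous state from disk, if any.
    state =
      case File.read(path) do
        {:ok, bin} -> :erlang.binary_to_term(bin)
        _ -> %{}
      end

    Process.send_after(self(), :flush, @flush_every)
    {:ok, {path, state}}
  end

  @impl true
  def handle_cast({:put, key, value}, {path, state}),
    do: {:noreply, {path, Map.put(state, key, value)}}

  @impl true
  def handle_call({:get, key}, _from, {path, state}),
    do: {:reply, Map.get(state, key), {path, state}}

  @impl true
  def handle_info(:flush, {path, state}) do
    # If this write fails (disk full, DB locked, ...), the in-memory
    # copy survives and we retry on the next tick; callers never block.
    File.write(path, :erlang.term_to_binary(state))
    Process.send_after(self(), :flush, @flush_every)
    {:noreply, {path, state}}
  end
end
```

Swapping the `File.write/2` call for a database insert, or for a message to yet another process, is exactly the kind of change this encapsulation makes cheap.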

4 Likes

Then you have the same problem as having it in a GenServer ^^'

1 Like

You may be quoting me on this one. Joe asked me once: "why do you use the database so much?". And I don't think the answer is "do it yourself in Erlang/Elixir". If you really need the features provided by a database and you decide to do it yourself in Erlang/Elixir, then it means you will reimplement a database, with persistence, replication, and whatnot, which is no easy feat.

The lesson here is rather that with Elixir/Erlang you should be able to think about other patterns for persistence, instead of putting everything in the database and reading everything out of the database. For example, online games often store part of their state on S3, loading the data when the game starts and saving it back when the game is over; the data is kept in memory while the user plays. Moz has reached a similar design as well: Unlocking New Features in Moz Pro with a Database-Free Architecture - Moz

5 Likes

I also think it is linked to the way a lot of people handle concurrency in web services, by using database transactions as their main "poor man's STM" primitive. They end up using the database (or Redis...) as a messaging medium, something we can do "by default" in Elixir.

I think the lesson lies in thinking about systems and architecture, and I think Elixir pushes you to do it better and to learn about it. But I am still trying to find better ways to pass this type of thinking on to developers.

"The Art of Destroying Software" helps. But I think a lot of people are still confused about what boundaries and contract design are, especially in an "integrated" world like the BEAM.

3 Likes

I think it is very important to note that "database or no database" is not a black-and-white choice. There is actually a gradient of options:

Many people (that I know of) only think of relational databases when talking about databases: things like MySQL or PostgreSQL.
These are very nice, but only useful if you actually want all columns in all rows to be searchable and sortable.

For some applications, this is overkill and/or adds restrictions that are unnecessary or even harmful. Other types of databases, such as 'document-oriented' databases (e.g. MongoDB and other 'NoSQL' databases), make it a lot quicker to store large blobs of data that might take varying shapes per data instance, but make it harder/slower to search for certain parts of the stored information.

Somewhere here lies the dividing line between a database and a data store. I believe the difference is that a database is easier to search through, while a data store is not made for that, but don't quote me on that.
A simple data store would be writing a list of entries to a file in a format that can be parsed later, such as JSON or CSV. For some applications, this is definitely enough. DETS is a form of this.

Finally, there is a lot of ephemeral information that is only useful for a short while (such as for the longevity of a browser session), and persisting it would only result in bloating your database or data store. I believe this is what the original quote hinted at: There is a lot of information that you don’t actually need to persist to disk. ETS is a form of this.

Also, if you store information on multiple distributed nodes, it does not matter if one of them goes down for updates or because of a power outage: the data will still be preserved. Using in-memory data stores is also a lot faster than keeping a disk-based backup up to date.

So: Don’t think that ‘do I need a database’ is a question with a yes/no answer. Think:

  • Does all data follow the same structure?
  • Do all subparts of this data need to be searchable?
  • What is the longevity of this data?
  • Do I want to store the data in one place, or distribute it across multiple servers?
  • How much data will my application receive (if it becomes popular)? In what way will I make sure that it can handle this?
  • Is it important for the distributed data to have a defined order (requiring atomic database insertion), or does this not matter (a lot faster)?

And choose based on the answers to all of these questions :slight_smile: .
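To make the "ephemeral data" end of the gradient concrete, here's a minimal ETS sketch (table and key names are invented). Nothing here survives a VM restart, which is exactly the point for session-style data:

```elixir
# A plain in-memory ETS table for short-lived session data.
table = :ets.new(:sessions, [:set, :public])

# Store a session; read it back with a constant-time lookup.
true = :ets.insert(table, {"session-abc", %{user_id: 42}})
[{"session-abc", session}] = :ets.lookup(table, "session-abc")
IO.inspect(session.user_id)  # 42

# Expire it explicitly; lookups now return nothing.
true = :ets.delete(table, "session-abc")
[] = :ets.lookup(table, "session-abc")
```

Persisting this kind of data would only bloat a real database; letting it die with the node is a feature, not a bug.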

7 Likes

I gave a talk on this subject about a month ago at a conference in Warsaw/Kraków. I don't believe the videos are up just yet, but I'll link them here when they are.

In short: the answer to Joe’s question is - because of several factors.

  1. Our stacks have/had limitations. Poor concurrency and synchronization pushed people towards using relational databases as the default back-ends.
  2. Our tools tell us to do so. If you generate a Rails app, it requires, by default, that you set up a database. Again, the default.
  3. We've been told to do so: at university, by colleagues, everyone. This is something we no longer think about.

I hired a programmer a few months ago and gave her the task of building some software to help me manage VPS instances running in the cloud: a UI to start/stop/scale etc. I did not even think we needed a database here, but the next day she came to me with two sheets of paper with UML diagrams of 1) the database and 2) classes that largely map to the database tables. And she was in deep shock when I told her to simply use the API AWS provides as the back-end.

So yeah, it's a combination of factors that resulted in programmers defaulting to a relational (or NoSQL!) database.

There is also a false notion that using a relational database will be faster than storing stuff in memory. I suspect this is because, in languages that force you to create everything and drop everything during the HTTP request lifecycle, it actually is true.

But hardly anyone has proof that this is slower. People just assume, following the "don't map/select stuff in memory, use your database to filter records" mantra. Largely because, again, it is true in many, many cases, and because of the tools. Say you use ActiveRecord, where creation and destruction of objects is incredibly expensive and slow: in-memory filtering of stuff will be slow. And you have no really good way to persist those objects between requests, so you are forced to do it that way.

Getting rid of a database, or limiting its use, is only possible when you make a mental shift: a shift from having a stateless back-end to having a stateful back-end. And you need proper tools to do so. Ruby won't cut it. Elixir/Erlang will.

There are excellent examples of such architecture. You can orchestrate one yourself too with GenServers, storing/reading state using term_to_binary/binary_to_term and saving to files as you please. A ready-to-use "framework" would be here:

Good example: it actually does use PostgreSQL, but not in the usual way. Your domain is modeled in memory. The database is used to store events/an audit trail, and it can be used to generate projections for the read side. But it does not have to be. I was thinking the read projections could be exactly the same as the domain models, just kept in memory.

There's another false assumption people have about keeping state in memory. They think it will eat up a lot of memory and they won't be able to do it. I have heard it multiple times: "but we can't just keep it in memory, it won't fit".

Well, this is false because:
a) computers have crazy high amounts of RAM these days
b) if you want to use your SQL database in performant manner, you need to make sure the data fits into memory anyway
c) you don’t have that much data - do you?
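Point (c) is easy to sanity-check with a back-of-the-envelope measurement. This sketch (record shape invented for illustration) uses `:erts_debug.flat_size/1`, which returns a term's size in machine words, to estimate what a million records would cost:

```elixir
# Rough estimate: how big is one "record" in memory?
record = %{id: 12_345, name: "some user", score: 9_001}

words = :erts_debug.flat_size(record)
word_size = :erlang.system_info(:wordsize)  # 8 on 64-bit systems
bytes_per_record = words * word_size

# A million such records, in megabytes:
total_mb = bytes_per_record * 1_000_000 / 1_024 / 1_024
IO.puts("~#{bytes_per_record} bytes per record, ~#{Float.round(total_mb, 1)} MB for a million")
```

On typical hardware the answer tends to be "it fits comfortably", which is the whole point.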

9 Likes

Yes. It is definitely true that running map/select queries in interpreted languages will be slower than doing this in a compiled language. Therefore, for languages like Ruby or Python, it makes sense to do as much as possible inside the database itself. It is very interesting that this has created a notion of ‘databases are fast’, rather than ‘interpreted languages are slow’.

:thumbsup:

3 Likes

/me coughs

The last personal webserver I built for private use was averaging about 50 million new records a day, all of which had to be stored and compressed; the data was unique and timestamped. >.>

Overall it was fairly simple though, just HUGE amounts of data, that was only processed for reports once a month.

1 Like

This sounds to me like you should’ve stored a running tally instead of storing all of the records individually in a database. :smile:

1 Like

Not possible in this case, the data was entirely unique and arbitrary JSON.

2 Likes

I think you should check
http://doc.akka.io/docs/akka/current/scala/persistence.html

Also, you can store state in memory, but sometimes you may need to present this state in a different way (transform data, calculate, aggregate): https://www.confluent.io/blog/making-sense-of-stream-processing/

3 Likes

Thanks for everyone’s replies on this, there is some great information here.

Just to make sure I'm following correctly: the general consensus isn't "don't use a database", but rather "only use a database when there is no better alternative"?

In saying that, an alternative might be to store the state in a GenServer on the backend, which could handle its own persistence via some sort of datastore (text files etc.) that it can use to regenerate its state if needed.

So basically, there has to be a mental shift from a 'stateless' backend, where every restart gives you essentially the same thing as any other running instance, to a 'stateful' one, where the backend keeps state in a variety of processes, some of which is persisted and some of which is not?

So for example - was this the right way for me to do the below?

When I built https://sublid.com I used the con_cache package to store all of a user's subscriptions in memory, so all reads/writes went there. When there was a request to update a subscription, the cache would get updated, while a background task then went about updating the database.

1 Like

I'd say it's more like: use a database when it pulls enough weight. If you need to persist data reliably, ensure backups, control access permissions, or query the data in arbitrary ways, then a full-blown DB is better than reimplementing one from scratch. If your needs are simpler, then some home-grown persistence might be simpler and sufficient.

This is precisely the case for which I built con_cache :slight_smile: I persisted subscriptions into a non-distributed Mnesia database, and this allowed me to survive restarts properly. Now, in my case this data was not super critical, so if in some strange case I lost the disk-based data (which in fact happened a few times) the consequences wouldn't be dramatic. If I had stronger persistence needs, I'd likely use a more mature DB (although today's GitLab incident proves that even that's not a proper guarantee).

4 Likes

Thanks @sasajuric that makes sense!

Also thanks heaps for your package - I pretty much use it in every one of my phoenix apps because of how simple it is to use!

2 Likes

I’d suggest this talk https://www.youtube.com/watch?v=fkDhU-2NWJ8, it explains some basic principles and practices.

1 Like