"Do you really need a database?"

smedegaard · April 11, 2017, 9:03am

I’m pretty new to Elixir. I’ve read some books, done a Udemy course, done some exercism and hackerrank problems and started on a Phoenix app.
I was a happy camper, but following the Elixir buzz on this forum, Elixir Fountain and other places I started getting the feeling that the “1.2 way” of doing Phoenix apps was kinda frowned upon by some people.
So I stopped development on my Phoenix app and read a bit more about OTP, watched Chris’s presentation of Phoenix 1.3 a bunch of times, bought and started reading Lance’s new prag prog book and tried to apply those thoughts and ideas to my app.

Now I basically feel like a poor noob who is being told that

“You probably don’t need a database”

But I really can’t see how anything else but a live session kinda thing like a game or a chat could work without persisting data in a database.

Am I misunderstanding this totally? Are there any other poor noobs out there that are getting the same vibe?

tyro · April 11, 2017, 9:10am

Well, it really depends on what you are doing. There’s no one size fits all solution here. That said, I reckon most web apps do use, and do need, at least one database.

With regard to Phoenix, there are certainly some new ‘best practices’ in 1.3 but they don’t concern whether you should use a DB or not.

NobbZ · April 11, 2017, 9:11am

Well, most of these thoughts about not needing a database assume one basic thing: you never shut down your application.

If this is a fact and you can guarantee that it will never shut down, then leave DBs out completely.

But since you can’t guarantee zero downtime, you have to persist your data somehow.

To persist your data and get it back into your live system, you have a couple of options:

Put it into a database of any kind (SQL or not doesn’t matter) and reload it when necessary
Log changes of the system into a file. On restart load that file and “replay”
Combine both ways.
probably others

One of the most important things to remember in order to understand that mantra is that it means “you don’t need the database as primary source of truth, but it may be a valid backup strategy”.
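To make the second option above (log changes to a file, replay on restart) a bit more concrete, here is a minimal sketch. The module name `ReplayLog`, the length-prefixed file layout and the `apply_fun` callback are all made up for illustration; a real system would also want to think about fsync, log rotation and corrupt entries.

```elixir
defmodule ReplayLog do
  # Hypothetical helper, not part of any library: an append-only event log.

  # Append one event to the file as a 32-bit length prefix plus the encoded term.
  def append(path, event) do
    bin = :erlang.term_to_binary(event)
    File.write!(path, <<byte_size(bin)::32, bin::binary>>, [:append])
  end

  # Rebuild state by applying every logged event, oldest first, to `initial`.
  # `apply_fun` takes (event, state) and returns the new state.
  def replay(path, initial, apply_fun) do
    case File.read(path) do
      {:ok, data} -> fold(data, initial, apply_fun)
      {:error, :enoent} -> initial
    end
  end

  defp fold(<<size::32, bin::binary-size(size), rest::binary>>, state, fun) do
    fold(rest, fun.(:erlang.binary_to_term(bin), state), fun)
  end

  # End of file, or a truncated last entry from a crash mid-write: stop here.
  defp fold(_rest, state, _fun), do: state
end
```

On startup you would call something like `ReplayLog.replay(path, %{}, &MyApp.apply_event/2)` (where `MyApp.apply_event/2` is your own function) and hand the result to whatever process owns the live state.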

DianaOlympos · April 11, 2017, 9:19am

To follow on from the others: have an app that writes in multiple places, so there is always one that is never shut down.

Of course it is not the most practical thing, and in general you would not do it like that.

But the use case I see the most out there for databases (and I include Redis and co. in that) is to make them handle your distribution and concurrency problems.

gon782 · April 11, 2017, 9:24am

As the saying goes (I forget who originally said it):

“The database is not the truth: it’s a cache of a subset of the truth.”

It’s very valid to use a classic SQL database sparingly in a BEAM application, using it only for persisting in case of emergency reload of “start state”, because we have so many more choices between “no database” and “the database is everything” to work with.

Lots of other languages lack the abstractions for creating data store layers that can work both as the primary source of truth and also persist things to other sources, but they’re so cheap in terms of developer time because of OTP that it’s almost wasteful not to use them and treat everything as if it’s Ruby.

I guess what I want to say, in short, is the question should probably be “Do you really need to read/write to the database all the time, or can you just use it as a bucket where you pull starting state from?”.

smedegaard · April 11, 2017, 9:41am

Thanks for the feedback guys.

I guess what I want to say, in short, is the question should probably be “Do you really need to read/write to the database all the time, or can you just use it as a bucket where you pull starting state from?”.

What would be a “good” strategy for dumping stuff into that bucket?

In Joe Armstrong’s example of how he would rewrite Twitter today, each user and each tweet would be a process. How often would the app dump tweets into the database? At a set interval?

The Phoenix app…SORRY Elixir app … that I started on was a home brewing app. I had users and recipes that consisted of malt profiles, yeasts, hops and so on. All of these things can be represented as processes and communicate via messaging. A batch of a given recipe would have a brew date and so forth. Everything is fine (hopefully), until the server crashes.

So is the only question how vulnerable the app is to data loss?

gon782 · April 11, 2017, 10:07am

That seems to me like the fundamental question. With regard to “everything should be a process”, as in every tweet and so on, I guess I’m not really all that convinced everything needs to be a process.

In terms of setting time limits on when things should be cached to the DB and so on I’d say you have to make that judgment call on a case by case basis. There are things that you wouldn’t mind losing a minute of in case of emergency and things where you need absolute guarantees that they’ve been written to reliable storage immediately. Different data calls for different parameters.

Lately I’ve toyed with the idea of representing tournaments spawned on-demand as gen_statems and having these be restored from DB data on startup, meaning I’d probably have to store the last finished matches in the tournament, etc… In the end, it’s about exploring what data you need and if it’s a feasible idea at all, just like any other development. This involves a lot of decisions that aren’t necessarily 100% right/wrong. As an example maybe you’d say that any started matches that are disrupted by a node failure would have to be replayed from the start, as live game data isn’t saved.

All in all, decide what your limit of data loss is and follow the breadcrumbs to your solution.
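As a rough illustration of “different data calls for different parameters”, here is a sketch of a process that keeps state in memory, flushes it on a timer only if something changed, and also offers a write that persists immediately. Everything here is hypothetical: `IntervalStore`, `put/3`, `put_durably/3` and the `:persist` option (any one-argument function that writes the data somewhere durable) are names invented for the example, not an established pattern from a library.

```elixir
defmodule IntervalStore do
  use GenServer

  # opts: :persist (required, fn data -> ... end), :interval in ms, :initial map
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Fine to lose up to one interval of these writes.
  def put(pid, key, value), do: GenServer.cast(pid, {:put, key, value})

  # Returns only after the persist function has run.
  def put_durably(pid, key, value), do: GenServer.call(pid, {:put_durably, key, value})

  def get(pid, key), do: GenServer.call(pid, {:get, key})

  @impl true
  def init(opts) do
    state = %{
      data: Keyword.get(opts, :initial, %{}),
      persist: Keyword.fetch!(opts, :persist),
      interval: Keyword.get(opts, :interval, 60_000),
      dirty?: false
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_cast({:put, key, value}, state) do
    {:noreply, %{state | data: Map.put(state.data, key, value), dirty?: true}}
  end

  @impl true
  def handle_call({:put_durably, key, value}, _from, state) do
    data = Map.put(state.data, key, value)
    state.persist.(data)
    {:reply, :ok, %{state | data: data, dirty?: false}}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state.data, key), state}
  end

  @impl true
  def handle_info(:flush, state) do
    # Skip the write entirely when nothing changed since the last flush.
    if state.dirty?, do: state.persist.(state.data)
    schedule(state.interval)
    {:noreply, %{state | dirty?: false}}
  end

  defp schedule(interval), do: Process.send_after(self(), :flush, interval)
end
```

The interval is then your data-loss limit for ordinary writes, and `put_durably/3` is there for the things you can’t afford to lose at all.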

smedegaard · April 11, 2017, 10:54am

But a general pattern is to cache the state to a DB at some interval of time? Or is it more common to persist it at a given user or system event?

I’m asking because I’m having a hard time seeing a “natural” point where I would cache the state.

I mean, if every process caches to the DB I see little point in doing this. Rewriting the state of the whole app every few minutes seems like an expensive task. So would it be more natural to let supervisors cache whatever hasn’t been cached since the last cache?

There’s a lot to take in here, so sorry if the questions are not very well formulated.

AstonJ · April 11, 2017, 4:39pm

Great discussion!

Where did he say that? Any links?

One thing I’d worry about (if things weren’t in the DB) is running out of memory. I guess that’s not as big an issue when the app is something like Stack Overflow (who I believe keep all data in memory), but when you get to the size of a big social network it might be a different matter?

cdegroot · April 11, 2017, 4:49pm

What does that “DB” word do there? ;-).

I’ve had a couple of times in the past where we went full-out TDD and YAGNI, and ended up with no database. Just some code dumping data out to disk in whatever the native format was (in this world, :erlang.term_to_binary is your friend). As someone remarked, a database is especially useful if you need to coordinate multiple instances of your app, so in modern systems it often serves more as a distribution tool than a persistence tool, if you want that categorization. To that end, it is often the simplest (most readily available, best documented) tool for the job.
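For anyone who hasn’t used `:erlang.term_to_binary` this way, a minimal sketch of dumping state to disk and reading it back might look like the following. The `Snapshot` module and its function names are invented for the example; the write-to-temp-then-rename step is just one common way to avoid ending up with a half-written file.

```elixir
defmodule Snapshot do
  # Hypothetical helper: persist any Elixir term in the native external term format.

  def save(path, term) do
    tmp = path <> ".tmp"
    File.write!(tmp, :erlang.term_to_binary(term))
    # Rename last, so a crash mid-write never replaces the previous good snapshot.
    :ok = File.rename(tmp, path)
  end

  # Returns `default` when no snapshot exists yet.
  # Only load files you wrote yourself: decoding untrusted binaries is unsafe.
  def load(path, default) do
    case File.read(path) do
      {:ok, bin} -> :erlang.binary_to_term(bin)
      {:error, :enoent} -> default
    end
  end
end
```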

cdegroot · April 11, 2017, 4:52pm

AFAICT Pat Helland, e.g. http://queue.acm.org/detail.cfm?id=2884038. You’re welcome

(and yes, yes yes - databases throw so much information away while condensing the stream of events that created them down to a mere snapshot).

LostKobrakai · April 11, 2017, 4:55pm

Not to forget the things DBs do beyond pure persistence. I mean SQL might very well be easier/quicker for building reports compared to querying all that data from various GenServers. Also rolling your own data indexing might not be your thing.

NobbZ · April 11, 2017, 5:28pm

It’s been in an Elixir Fountain somewhere between issue 30 and 60…

cdegroot · April 12, 2017, 12:05am

But don’t underestimate the raw search power of modern CPUs. In 2006 I built a real estate site which basically had the whole dataset in memory; a search was basically just a traversal through the ~125k properties listed, including doing stupid stuff like substring matching as a proxy for text search. Queries came back in <<100ms, customer was happy, we never optimized it or considered adding a “proper” search engine.
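The kind of brute-force search described above is only a few lines in Elixir. This is purely an illustrative sketch: `PropertySearch` and the `:description` field are assumptions, not the code from that site (which wasn’t written in Elixir as far as the post says anything about it).

```elixir
defmodule PropertySearch do
  # Hypothetical: `listings` is a plain in-memory list of maps,
  # each with a :description string.

  # Case-insensitive substring match over every listing, no index involved.
  def search(listings, term) do
    needle = String.downcase(term)

    Enum.filter(listings, fn listing ->
      listing.description |> String.downcase() |> String.contains?(needle)
    end)
  end
end
```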

LostKobrakai · April 12, 2017, 7:31am

Sure, but building a custom search does need time and resources, whereas a DB comes with those capabilities. No need to reinvent the wheel without a use case for something custom.

smedegaard · April 12, 2017, 7:34am

Yeah, modern DBs have a lot of features that often come in handy I guess.

I’ve gone full circle in the one day this post has been up. I’m back at thinking user authentication, search capabilities and so on make it easier for me to just stick to the beaten path for now and use Postgres.

Thanks everybody for the great feedback

Qqwy · April 12, 2017, 8:17am

I think that another main reason people often use a database is that, when using an interpreted language, code (selection, filtering, ordering, reducing) that runs on the database is significantly faster than the same work done in the interpreted language, as relational databases are written in a compiled language.

However, when working in a language that already compiles, this speed difference is not nearly as significant. What @LostKobrakai says is true: databases have been fine-tuned to do what they are best at, and they probably will be faster. But there are two other factors not to forget:

Developer efficiency: Are the added complexity and the extra moving parts in my app worth the extra setup and maintenance they will take?
Communication: Because the database is in a separate OS process, you’re forced to use OS pipes or sockets to communicate. This is obviously a lot slower than in-process (again, talking about an OS process here) memory access. For many queries, especially larger ones that return a lot of results, the overhead of serialization+sending+receiving+deserialization might be significant.

Qqwy · April 12, 2017, 8:24am

As for the Twitter example: I do not think every tweet should be kept in random access memory indefinitely: At some point they could and should be archived, resulting in a more scalable system. Furthermore, I think that a tweet definitely should not be a process, as I believe it is not a machine, but only a mostly static piece of data. We can filter/sort/order pieces of data, but we cannot filter/sort/order processes directly in a meaningful way.

I see a tweet as a letter: a piece of paper with information written on it. It is not a creature, a clock or another automaton. It’s in the original Greek etymology of the word ‘automaton’: something that acts of its own will.

DianaOlympos · April 12, 2017, 8:45am

Something to keep in mind about that Joe quote: he probably means an “Actor” in the Actor Model sense far more than an “Erlang process”. From the “abstract” point of view of the Actor Model, making everything an actor can totally make sense, because it is your main way to deal with memory access.

In our real-life BEAM applications, we may want to colocate that into a single process for “optimisation” and “implementation” purposes.

easco · April 12, 2017, 5:27pm

A bit off topic, but -

All of these things can be represented as processes and communicate via messaging. A batch of a given recipe would have a brew date and so forth.

Be very careful here. If I’m reading your remarks correctly, it sounds like you are trying to model entities as processes (a practice with a “bad smell” common to Object Oriented programmers, myself included, who move into the functional world).

If that’s the case I would recommend reading To Spawn, or not to Spawn?. There are other, similar articles on the web.

I’m not saying “don’t do that” but give it some careful thought.