Database free applications using Files for storing and managing state. Please critique

sreyansjain · April 27, 2019, 4:39am

Can someone please give me an example of an application where erlang term_to_binary and binary_to_term have been used for state management using OTP instead of a database.
Generally the way I (and maybe most others) have been modelling applications is by thinking in terms of database tables and their relationships. This has been going on for a long time and is probably easier than thinking in DDD terms, especially for junior developers. The way hardware prices have come down, I think many applications (not internet scale) can easily fit their data in RAM. So it would be great if someone could advise me how to use OTP and files for state management of smaller applications.
Also I think it’s very difficult to rule out the utility of a database (mainly POSTGRES) because of the reporting requirements. If we go for the file-based approach, how do we tackle this?
I am more confused by reading Doing without database in the 21st century.
Wouldn’t going the databaseless route be very hard? ACID etc.
If someone can shed some light on this it would be really enlightening and helpful.
Thanks

tomekowal · April 27, 2019, 4:57am

Everything depends on the use case.
I am pretty sure you have some concrete examples in mind while asking. It might be useful to share your particular scenario.

In the book Functional Web Development with Elixir, OTP, and Phoenix, the author shows how to make an interactive game. When the game is ready and works in memory, he adds saving state in ETS in case of crashes and mentions that it could use DETS (disk-based term storage).

He does exactly the thing you describe as confusing: starts without touching the DB and ends up with an excellent model decoupled from storage. I highly encourage reading it!

[EDIT] I would also treat Why Relational Databases Are So Bad with a grain of salt. The timesheet example might ring a bell in people. Adding sequence ids to rows in data and work to produce a report from SQL is repetitive. The same goes with retrieving employee name from a different table.

What the author doesn’t mention is that storing employees and hours separately and normalising the database allows you to easily reverse the query: “who worked at a given time?” Try that with saving timesheets as a list of hours in a file.
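To make the reverse query concrete, here is a minimal sketch using Python's built-in sqlite3 module (Python only because it ships with SQLite). The table names, column names and sample rows are made up for illustration; they are not taken from the article.

```python
import sqlite3

# In-memory database; schema and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE shifts (
        employee_id INTEGER NOT NULL REFERENCES employees(id),
        started_at  TEXT NOT NULL,  -- ISO 8601, so string comparison orders correctly
        ended_at    TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO shifts VALUES (?, ?, ?)",
    [
        (1, "2019-04-26T09:00", "2019-04-26T17:00"),
        (2, "2019-04-26T13:00", "2019-04-26T21:00"),
    ],
)


def who_worked_at(conn, moment):
    """Return the names of employees whose shift covers the given moment."""
    rows = conn.execute(
        """
        SELECT DISTINCT e.name
        FROM shifts s JOIN employees e ON e.id = s.employee_id
        WHERE s.started_at <= ? AND ? < s.ended_at
        ORDER BY e.name
        """,
        (moment, moment),
    )
    return [name for (name,) in rows]
```

With hours stored as a per-employee list in a file, the same question would mean loading and scanning every timesheet.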

Again: everything depends on the use case. Maybe you will never need that reversed query?

hpopp · April 27, 2019, 5:51am

I recently tried this approach building a new production service. It managed lists of contacts with some additional business logic, each list getting its own GenServer with state backed up to S3. Each GenServer would gracefully shut down after a few minutes of inactivity to avoid memory leaks. In short, we scrapped the whole thing and reimplemented with postgres.

Building without a database allows you to write some of the most expressive code you’ll ever create, at the expense of having to write a lot of things you take for granted in traditional design.

Big issues I faced:

Needing sagas almost immediately. Simple pieces of information had to be duplicated in a few places, and these updates had to be atomic.
Data migrations. My initial version directly wrote the struct with term_to_binary, but obviously this gets hairy if data needs to change. Lacking the time to implement a proper migration strategy due to deadlines, I ultimately decided to abandon the whole approach.
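For what it's worth, one common way to keep serialized state migratable is to wrap it in an envelope that records the schema version and to upgrade old payloads on load. The sketch below is in Python with JSON rather than term_to_binary, and it is not what the service above did; the field names and the v1 to v2 change are hypothetical.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(data):
    # Hypothetical change: v1 stored a single "name", v2 splits it in two.
    first, _, last = data.pop("name").partition(" ")
    data["first_name"] = first
    data["last_name"] = last
    return data


# One upgrade step per old version, applied in sequence on load.
MIGRATIONS = {1: _v1_to_v2}


def dump_state(data):
    """Serialize state together with the schema version that wrote it."""
    return json.dumps({"version": CURRENT_VERSION, "data": data}).encode("utf-8")


def load_state(blob):
    """Deserialize state, upgrading old payloads one version at a time."""
    envelope = json.loads(blob.decode("utf-8"))
    version, data = envelope["version"], envelope["data"]
    while version < CURRENT_VERSION:
        data = MIGRATIONS[version](data)
        version += 1
    return data
```

This only covers upgrading on read. Rewriting every stored blob, or rolling back, is extra work that a database migration tool would otherwise handle.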

It’s entirely possible I implemented it wrong, and I do intend to try it again in the future. Ultimately I felt like the system I had built wasn’t nearly as stable or simple as a database, and given the scheduling deadlines, stability and simplicity had to come first.

As a side note, the original business requirements had us under the impression that contact lists would be no greater than 10K entries. We had a 45K list day one in production, with the expectation to have multiple 100K+ lists in the coming weeks.

idi527 · April 27, 2019, 6:13am

I tried a similar approach last summer, but with sqlites stored on google storage. Each user had an sqlite database. And I also switched back to Postgres since the ecto adapter for sqlite wasn’t quite as nice to work with as the one for Postgres. I also worried a bit about race conditions, where two clients would start at the same time and both download and start writing to the same database.

I mostly did that to try out hosting the application entirely on preemptible/spot instances in the cloud.

dimitarvp · April 27, 2019, 6:20am

The no-DB approach sounds awesome when you’re building exercise apps, but unless you need barely any reporting or joining of data, it doesn’t scale beyond the first two weeks of active development.

The problem is never “what is the exact way my app is persisting information?”. The problem always has been “how do I query and aggregate it?”.

IMO the way relational DBs build indices and joins – and the internal storage mechanism in general, including transactions – needs to be ported to an embedded DB. People around here periodically attempt to write apps using only (D)ETS as a storage and the conclusion always seems to be “it’s too hard to arbitrarily query data”.

If I had the time and was paid for it I’d seriously attempt writing such an embedded DB engine.

…Or, it might be worth it to contribute to the sqlite Elixir adapter. But then again, sqlite only allows a single write operation at a time.

idi527 · April 27, 2019, 6:29am

I know you didn’t reply to me, but let me leave some notes on how I’d handle some of this:

I planned to use some ad-hoc approach (and later switch to spark) to build materialised views. I never got to do it, but I don’t see much reason why it wouldn’t work.

“…Or, it might be worth it to contribute to the sqlite Elixir adapter. But then again, sqlite only allows a single write operation at a time.”

With WAL enabled, so as not to block reads, it’s enough if used carefully. For example, since in my setup all users had their own databases (colocated with the user processes at execution time), all requests took less than 1ms (but it could eventually get worse as more users → more databases → more files → more filesystem thrashing, there’s a way around it with storing multiple sqlite databases in a single lmdb, but I never really looked into it), whereas after switching to Postgres it was about ~30ms. The app was a simple voice messaging app, so the requests were “do I have new messages?”, “get message history with X”, “send message to Y”, “get prekeys for convo with Z”.
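For anyone who hasn't used it, WAL is switched on per database file with a pragma. A minimal sketch with Python's built-in sqlite3 module; the one-file-per-user naming is an assumption for illustration, not the actual layout of the app described above.

```python
import os
import sqlite3
import tempfile


def open_user_db(path):
    """Open (or create) one user's SQLite file and switch it to WAL mode."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode that is now in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


# Hypothetical file name; a temp directory keeps the example self-contained.
tmp_dir = tempfile.mkdtemp()
conn, mode = open_user_db(os.path.join(tmp_dir, "user_42.sqlite3"))
```

In WAL mode readers and the writer don't block each other, but there is still only one writer at a time per database file.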

benwilson512 · April 27, 2019, 12:48pm

IMHO If the purpose of the application is to store and manage data, then you can either use a database or build a database, and you probably don’t want to build a database.

However, there are applications whose job isn’t storing and managing data. For example, we have applications which route data and others that serve as kiosk systems. For these, a database was superfluous. Just my $0.02.

keathley · April 27, 2019, 6:16pm

This is exactly right. Unless your company sells a database don’t build a database.

To add on to this, I don’t know of many companies that don’t need to store something. My advice is to put your state into a reliable database (postgres is a good default). As other people have said this empowers reporting, etl, and a whole bunch of other benefits.

It’s not the only reason, but the main reason to build stateful systems - meaning bringing your application’s state into processes - is to reduce latency. I have a lot of empirical evidence to support that a “stateless” elixir service backed by a database will take you a really long way. IMO you need to prove that postgres or your db of choice won’t be fast enough before you start bringing more state into your application.

peerreynders · April 27, 2019, 8:15pm

Unless your company develops databases, the database probably isn’t your application. That’s the basis of opinion pieces like No DB:

The database is just a detail that you don’t need to figure out right away.
The center of your application are the use cases of your application. (not the database)

i.e. relational databases can be very useful but their existence shouldn’t dominate the architecture - possibly to the point where the UI is assembling dynamic SQL.

And in some circumstances other approaches to managing/handling data can be more effective:

“Turning the database inside out with Apache Samza” by Martin Kleppmann

keathley · April 27, 2019, 8:46pm

I’ve worked on multiple systems that democratized their data through kafka and other mechanisms. Based on those anecdotes I’m more than comfortable suggesting that most companies should not do this.

All that aside, most companies are information systems. Meaning they take data, store data, and present that data to users as information. That being the case the database absolutely does matter. It’s not “just a detail”. It’s an integral part of your business and you should choose databases that have the tradeoffs your business needs.

peerreynders · April 27, 2019, 9:22pm

The message is that your core problem should dictate which data handling technology is appropriate.

Before the emergence of NewSQL and stream processing it wasn’t uncommon for “the” relational database to be the foundation and crown jewel of all the business processing - the core of a brittle and tightly coupled BBoM. Unfortunately technology-centric design can happen all too easily with any technology.

sreyansjain · April 28, 2019, 3:32am

Thanks for your response.

I am trying to create a small feedback application where students can give feedback/rating to their teachers and need to show average feedback/rating for each teacher etc.

I read Functional Web Development with Elixir, OTP, and Phoenix and was wondering whether this strategy can be used for something which is not a game. I get the use case of the game, where sometimes you just need to persist the final score/result and the running game can be managed in memory.
But can we use a similar strategy for apps where everything needs to be persisted permanently?

sreyansjain · April 28, 2019, 3:35am

Thank you so much for sharing the practical problems you faced.
Using a database gives us so many nice guarantees that we appreciate deeply when they are not available.

peerreynders · April 28, 2019, 11:51am

At the most basic level that would only require that the individual feedback/rating submissions are persisted (preferably as structured records) to an append only file. That file can then later be processed by a separate program to aggregate the per teacher results.
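A minimal sketch of that idea in Python, assuming one JSON object per line as the "structured record". The field names and file layout are illustrative only.

```python
import json
from collections import defaultdict


def append_rating(path, teacher, student, rating):
    """Append one submission as a single JSON line; earlier lines are never rewritten."""
    record = {"teacher": teacher, "student": student, "rating": rating}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def average_per_teacher(path):
    """Separate pass over the log: average rating per teacher."""
    totals = defaultdict(lambda: [0, 0])  # teacher -> [sum of ratings, count]
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            entry = totals[record["teacher"]]
            entry[0] += record["rating"]
            entry[1] += 1
    return {teacher: total / count for teacher, (total, count) in totals.items()}
```

The aggregation could just as well live in a separate script run on a schedule, since it only ever reads the file.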

Keeping an in memory representation of the data and processing the submissions in realtime only makes sense if all feedback submissions are made in a fairly short time period (hours to a day, rather than weeks) and there is a need to amend the displayed results as the submissions are coming in (like for an election coverage).

Things get a bit more interesting once you are trying to ensure that all submissions are authorized and not duplicated - but again there are different ways to go about that.

So the priority is to identify your application’s use cases in detail - not which technology will be used to persist your data.

tomekowal · April 28, 2019, 9:05pm

That use case is compelling! If you have time to try, you could follow the development process like in the book:
a) Write the logic for use cases like rating, querying averages and so on in pure Elixir
b) Try storing the results in the GenServer for each teacher (similarly to having one game process)
c) Figure out in the end what you want from the storage.

I assume you would want to do some aggregation in the end: how many votes there were, who was the best teacher. Having three entities (student, teacher and score) begs for a relational database.

The point of DDD is not “do not use relational database”. It is “do not pollute your logic with DB details”. E.g. your code might model a teacher as a struct that has a list of scores. It is a pure Elixir model, and your application logic does not care if it uses left join or right join or stores the scores as JSONB or whatever else you think is optimal.
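Here is roughly what that separation can look like, sketched in Python rather than Elixir. The 1 to 5 score range, the class names and the two-method store interface are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Teacher:
    """Pure domain model: a teacher with a list of scores, no storage details."""
    name: str
    scores: list = field(default_factory=list)

    def rate(self, score: int) -> None:
        if not 1 <= score <= 5:  # assumed rating scale
            raise ValueError("score must be between 1 and 5")
        self.scores.append(score)

    def average(self) -> Optional[float]:
        return sum(self.scores) / len(self.scores) if self.scores else None


class InMemoryTeacherStore:
    """Stand-in storage; a database-backed class with the same two methods could replace it."""

    def __init__(self):
        self._teachers = {}

    def get(self, name):
        return self._teachers.setdefault(name, Teacher(name))

    def save(self, teacher):
        self._teachers[teacher.name] = teacher


def submit_rating(store, teacher_name, score):
    """Use case: depends only on get/save, not on how they persist."""
    teacher = store.get(teacher_name)
    teacher.rate(score)
    store.save(teacher)
    return teacher.average()
```

The use case function never mentions tables or joins, so swapping the store for one backed by a real database would not touch it.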

An added benefit of starting with the in-memory model is more interactivity. Maybe you could send a push notification to a teacher when his score drops below three stars?

Starting from the database makes you think on the wrong level. E.g. you will need some authorisation and authentication both for teachers and students. If you start from the database, you immediately think: “should I create another users table and have foreign keys in teachers and students tables? Maybe teachers and students should be in one table and have a role field?”.

Starting from the use case, you create a struct called User or Account, implement the feature and defer retrieving the users for later. Maybe you’ll end up with getting users from LDAP or private school students database, and you won’t have to persist them at all?

In general, I disagree with the author of “Doing without database…”. DBs are there. They solve problems almost everyone has. Let’s use them. I agree with DDD approach of not caring about DB as long as possible.

Developers sometimes say stuff like: “you should abstract the database! what if you need to change it?” That happens so rarely that I would dismiss it. Abstracting the database away has other benefits: your business logic gets simpler and easier to test in isolation. In case you do a big migration, you’ve decoupled the code that performs business logic from the code that loads stuff, and you can change only the latter without touching the “business layer”.

TL;DR DDD=good stuff; NO_DB movement=meh

LostKobrakai · April 29, 2019, 8:31am

Without c) this basically sounds like what commanded allows you to do. To an extent even c), but you’d need to use either a supported eventstore or write one.

gdub01 · April 29, 2019, 12:08pm

There are 2 no db ideas that some are using… which include using a git repo to persist data and using a hosted spreadsheet service.

Some static site generators will read markdown and other files before compiling into a static site. NetlifyCMS is an open source project that allows people to create markdown files using a CMS… and each save is actually a git commit to master. A hook automatically rebuilds the site after each commit. So it’s great for sites that don’t change super often or sites that aren’t massive. It can include large file storage and an identity system so authors don’t need access to the git repo.

The second one is using a hosted spreadsheet like Google Sheets or Airtable as the db. https://sheety.co <- allows you to turn your Google Sheet into an easy API.

sreyansjain · April 29, 2019, 12:40pm

That is insightful. Thanks.

domvas · April 29, 2019, 3:43pm

Maybe I’m missing the point, but nobody here is trying to NOT use a database.
The first line in Wikipedia ‘database’ page is:

Then, if your application manages data, you will have a database. And as it has always been the case, the real question is “which database should I use?” and the subsidiary questions I’ve read all along this post:

Do I want to rely on an external system?
Do I want to use a new language for data querying?
What about (write / query / both) performances?
Do I have the will / time / skills to build my own database?
Do I want to decouple my business logic from my storage?
What kind of data do I want to store?
Do I need to store this data?
Can I afford data loss?
What about legal? (GDPR vs append-only DB )
etc.

These are pure tech questions, imho and to summarize a bit fast, DDD adds another question:

Do I want to express my need in a data way or a more user friendly (that even non-tech could understand) way?

And ultimately, the mother of all questions is:

What are my use cases?

It’s strange that almost each time I read something about a DB (system or not), it tends to turn into a defense of that DB system against one (or all) others. All DB systems are good if they fit their use cases: you don’t use MongoDB if you care about your data, you don’t use an RDBMS if you want to model a social network, you don’t use a graph database to compute stats, and so on.

dimitarvp · April 29, 2019, 5:26pm

Like half the points you enumerated are political or, at least, are not decided by the programmers writing the code on the ground. They are decided by at least team leaders, more likely CTOs.

As for DB-vs-DB: there are unquestionable benefits of using one DB as opposed to another but that’s a huge topic on its own.

As for “should we even use a DB” – this discussion is not monopolized at all. But most people naturally come to the conclusion that the DB helps.