Hi, a newbie to web dev and Phoenix & Elixir here.
I need advice on how to conceptually understand database updates and deployment from dev to prod for a SQLite-based LiveView web app.
I’ve got a Python-based scheduled data pipeline that scrapes, extracts and transforms data from the Internet and, in the end, writes it all into a DuckDB file-based database. After that, it uses a built-in extension called sqlite_scanner to generate a SQLite db from the DuckDB file. Depending on the day, the final database is updated once or at most twice per day.
My dev environment uses this SQLite db as the backend for the app. The reason is that I want to save costs as much as I can, especially because this is just an initial effort and prototyping. Besides, I already have a relatively powerful VPS server and I want to reuse it. Also, I don’t expect my app to be used by more than 10,000 people anyway. My questions are:
1 - How do I deploy to prod on my VPS server with a file-based database like SQLite?
I have my eye on piku, but I don’t know how to deal with a file-based database that is about 15 GB and therefore too big for GitHub to handle. I would prefer not to use Docker.
2 - How do I handle updates to the database? If my Python data pipeline generates a new SQLite file every day, how can I push it to prod, into the existing and running app, without glitches? How do I swap the old database file for the newer one? I don’t mind if users experience a one-second glitch.
3 - Let’s say my data pipeline creates a new table in the existing database or a new field in an existing table. How do I deal with that on the Phoenix side? Do I somehow run automated migrations, or, as long as my UI components and logic are ready to receive the new fields and new tables, do I not need to do anything else?
1 - Generating users and auth. I want to reuse what comes standard with Phoenix LiveView.
2 - At some point later, I will probably build an internal bookmarking feature where logged-in users would be able to save certain pages (based on URL params) to their list of favourites.
I will probably have to somehow make the updated and newly generated SQLite DB respect the existing user tables, or maybe even use two SQLite dbs in one app: one for the app itself, which will be mostly tables and charts, and the other for user management.
For your user data, I would caution against using SQLite if you expect thousands of users. SQLite has no write concurrency (at all), and its transactions per second are measured in tens. Which is fine as long as you don’t need many transactions, but just keep in mind it’s not going to scale very far (for writes). If you have a powerful VPS (as you say), consider just running Postgres on the VPS to handle the user data. Then if you ever need to scale in the future you can just get a bigger server.
As for your questions:
If you mean for your user data, check out Litestream and LiteFS (by the same guy, at flyio). They handle backing up/persisting the database to object storage or another server, in slightly different ways. I can vouch for Litestream in particular, very low maintenance (I’ve never had to touch it). Great for personal projects with SQLite.
Well the easiest way might be to run the Python pipeline on the same server, so you already have the file. But if you want or need to use a different machine, you could throw the file into S3 or similar, or rsync/sftp it onto the server. If you were feeling clever you could add an endpoint to your Phoenix server (behind some sort of auth of course) and use that to upload the file over https (I quite like that solution actually).
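To make that concrete, here is the rough shape of such an endpoint (an untested sketch: the route, the DB_UPLOAD_TOKEN env var, and the destination path are all made up, and for a ~15 GB file you would need to check your reverse proxy’s and endpoint’s body-size/timeout limits; send the file as application/octet-stream so Plug.Parsers leaves the raw body alone):

```elixir
# router.ex
scope "/internal", MyAppWeb do
  post "/db-upload", DbUploadController, :create
end
```

```elixir
# lib/my_app_web/controllers/db_upload_controller.ex
defmodule MyAppWeb.DbUploadController do
  use MyAppWeb, :controller

  @dest "/data/incoming.db"

  def create(conn, _params) do
    with ["Bearer " <> token] <- get_req_header(conn, "authorization"),
         true <- Plug.Crypto.secure_compare(token, expected_token()) do
      io = File.open!(@dest, [:write, :binary])
      conn = stream_body(conn, io)
      File.close(io)
      send_resp(conn, 200, "ok")
    else
      _ -> send_resp(conn, 401, "unauthorized")
    end
  end

  # Read the raw request body in chunks and append each chunk to the file,
  # so the whole upload never has to fit in memory at once.
  defp stream_body(conn, io) do
    case Plug.Conn.read_body(conn, length: 8_000_000) do
      {:more, chunk, conn} ->
        IO.binwrite(io, chunk)
        stream_body(conn, io)

      {:ok, chunk, conn} ->
        IO.binwrite(io, chunk)
        conn
    end
  end

  # Hypothetical env var holding a shared secret between pipeline and app.
  defp expected_token, do: System.fetch_env!("DB_UPLOAD_TOKEN")
end
```

The pipeline could then push the new file with any HTTP client that can stream a request body (curl’s `--data-binary @file`, Python’s requests with a file object, etc.), and your swap-over logic takes it from there.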
As far as switching over, I’m not actually sure of the best approach there. There’s probably some way to close the existing Repo and re-open it (perhaps someone else can chime in?). You might also be able to engage in some shenanigans with dynamic repos.
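For the dynamic-repo shenanigans, roughly what I have in mind (a sketch, untested; assumes the ecto_sqlite3 adapter, and the file path and MyApp.Thing schema are placeholders):

```elixir
# Start a second instance of the same repo module, pointed at the new file.
# `name: nil` makes it a dynamic repo, addressable only by its pid.
{:ok, new_pid} = MyApp.Repo.start_link(name: nil, database: "/data/app_new.db")

# put_dynamic_repo/1 is per-process: each process that should read the new
# file points itself at the new instance before querying.
MyApp.Repo.put_dynamic_repo(new_pid)
MyApp.Repo.all(MyApp.Thing)

# Later, once nothing is querying the old instance any more, stop it
# (start_link returns the repo's supervisor pid).
Supervisor.stop(old_pid)
```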
Ecto migrations and schemas are separate. If the database is only written by your Python pipeline, as long as the correct schemas are in place in your Elixir app they will just work. If you change the database tables you have to change the Ecto schemas in your code and deploy the changes.
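For example, a schema that just describes a table the Python side created (a sketch; daily_stats and its columns are made-up stand-ins for whatever your pipeline writes):

```elixir
defmodule MyApp.DailyStat do
  use Ecto.Schema

  # The table is created and populated by the Python pipeline, not by Ecto
  # migrations, so this module only describes its shape for querying.
  @primary_key false
  schema "daily_stats" do
    field :day, :date
    field :metric, :string
    field :value, :float
  end
end
```

If the pipeline later adds a column, you add the matching `field` here and redeploy; there is no migration to run on the Elixir side.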
And since it seems like the SQLite databases would be immutable (created by DuckDB), a single readonly connection can be used for multiplexed statements and “unlimited” concurrency. A file watcher can be set up to switch the connection to new database files once they become available.
Might be worth knowing that the SQLite team recently shipped a new command-line tool, sqlite3_rsync, which could solve the problem of updating the file in place efficiently.
I’d also advise using separate Ecto repositories and database files for your user data and this apparently read-only data dump, as it would make your life easier when it comes to database backups and what not.
Thanks @rhcarvalho - sqlite3_rsync looks very interesting, especially given that both ORIGIN and REPLICA can be in a live/running state. This is really good.
I’d also advise using separate Ecto repositories and database files for your user data and this apparently read-only data dump
Totally agree. It will keep the generation of the actual app data disentangled from the user management. Fewer headaches. Any sample docs or guides on how to deal with two DBs in the same app, both accessed by Ecto? I’m not sure how it works. One Repo with two DBs, or do I need to create a second Repo for that? Thanks
If you can guarantee that the database file won’t change, you can open a single connection and execute multiple statements on it concurrently. This mostly means that you don’t need to implement custom pooling strategies and would get pretty good raw performance, limited mostly by SQLite internals.
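As a sketch of what that looks like with the exqlite bindings directly (untested, and worth checking against the current Exqlite.Sqlite3 docs; the path and queries are placeholders):

```elixir
defmodule MyApp.ReadOnlyDb do
  @moduledoc false
  alias Exqlite.Sqlite3

  # One connection, opened once and shared (the file never changes).
  def open(path), do: Sqlite3.open(path)

  # Run several read-only queries concurrently over the same connection.
  def query_many(conn, sqls) do
    sqls
    |> Task.async_stream(fn sql ->
      {:ok, stmt} = Sqlite3.prepare(conn, sql)
      rows = collect(conn, stmt, [])
      :ok = Sqlite3.release(conn, stmt)
      rows
    end)
    |> Enum.map(fn {:ok, rows} -> rows end)
  end

  # Step through a prepared statement until it reports :done.
  defp collect(conn, stmt, acc) do
    case Sqlite3.step(conn, stmt) do
      {:row, row} -> collect(conn, stmt, [row | acc])
      :done -> Enum.reverse(acc)
    end
  end
end
```

You would open the connection once at boot, hold on to it, and call `query_many/2` from wherever needs the data.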
2 - Any example of this file watcher, or would something standard like Litestream/LiteFS be enough?
I think it’d be a bit different from Litestream/LiteFS. You can poll or subscribe for updates on your local file system or object storage and switch the connection when the changes occur (e.g. by replacing a reference in ETS; the old connection would eventually get garbage collected once all statements complete and there are no more references to it).
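As a rough sketch of such a watcher, building on the dynamic-repo idea above (untested; the path and poll interval are placeholders, and draining/stopping the previous instance is left out):

```elixir
defmodule MyApp.DbSwitcher do
  @moduledoc false
  use GenServer

  @table :current_db
  @path "/data/app.db"
  @poll_ms :timer.seconds(30)

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Callers look up the pid of the repo instance to use right before querying.
  def current_repo, do: :ets.lookup_element(@table, :repo, 2)

  @impl true
  def init(_) do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    state = open_repo(%{mtime: nil})
    schedule_poll()
    {:ok, state}
  end

  @impl true
  def handle_info(:poll, state) do
    mtime = File.stat!(@path).mtime

    state =
      if mtime == state.mtime do
        state
      else
        # File changed: open a fresh repo instance and publish it. The old
        # instance should be drained and stopped once its in-flight queries
        # finish; that part is left out of this sketch.
        open_repo(state)
      end

    schedule_poll()
    {:noreply, state}
  end

  defp open_repo(state) do
    {:ok, pid} = MyApp.Repo.start_link(name: nil, database: @path)
    :ets.insert(@table, {:repo, pid})
    %{state | mtime: File.stat!(@path).mtime}
  end

  defp schedule_poll, do: Process.send_after(self(), :poll, @poll_ms)
end
```

Callers would grab the pid with `MyApp.DbSwitcher.current_repo/0` and pass it to `MyApp.Repo.put_dynamic_repo/1` before querying.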
Do you think duckdb could be used as a direct substitute for SQLite?
For simple lookups SQLite is faster in my experience. It’s faster in creating prepared statements, faster with params in prepared statements (DuckDB slows down once a prepared statement contains params), and probably safer too. DuckDB is faster for analytical queries.
Yes. I have only done small tests with it prior to the official release, but it feels solid just like SQLite itself.
This tool didn’t exist when LiteStream and LiteFS were created (I hugely admire Ben Johnson for those projects and more). And it drove changes to SQLite itself, like being able to query and modify DB page information using SQL: The SQLITE_DBPAGE Virtual Table
If those features existed earlier, they would have certainly influenced the designs and implementations of LiteStream, LiteFS and other tools in this space.
In summary, you duplicate config and your App.Repo module, and update application.ex to start both repositories on startup.
I recall some threads here in the forum about e.g. using a database per tenant on SQLite.
The introductory guides probably cover what needs to be done - the difference is you end up with 2 (or more) repositories in your app, and parts of your app explicitly reference one or the other.
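A minimal sketch of what that looks like (module names, paths and the children list are placeholders):

```elixir
# lib/my_app/data_repo.ex — a second repo for the read-only dump
defmodule MyApp.DataRepo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.SQLite3
end
```

```elixir
# config/config.exs
config :my_app, MyApp.Repo, database: "priv/users.db"
config :my_app, MyApp.DataRepo, database: "priv/data.db"

# Only the user repo needs migrations
config :my_app, ecto_repos: [MyApp.Repo]
```

```elixir
# lib/my_app/application.ex — start both repos
children = [
  MyApp.Repo,
  MyApp.DataRepo,
  # ...endpoint, PubSub, etc.
]
```

Schemas don’t belong to a repo, so you pick one at the call site: `MyApp.DataRepo.all(MyApp.DailyStat)` for the dump, `MyApp.Repo.get!(MyApp.User, id)` for user data.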
Thanks, that’s very valuable. I was hoping DuckDB would be the panacea and the all-in-one tool for both analytics/transformation and as a backend. Well, it’s good to hear that SQLite is faster as a backend, because it’s very well integrated with Elixir/Phoenix.
I did some experiments with duckdbex for reading Parquet files. It was good but messy due to the lack of Ecto support.
Anyway, good luck with your own duckdb for Elixir and the Ecto support.
I knew the figures in the faq were outdated, but I didn’t think they were that outdated! Funny, looks like they finally added a disclaimer to that answer just a couple months ago.
I tried this a while ago with Ecto and I seem to recall getting numbers a couple of orders of magnitude lower than what you’re getting there, but perhaps Ecto was actually the bottleneck? I assume you’ve done some optimizations of your own, and if so, nice work!
Still, 20,000 is extraordinary. I’m surprised your drive can even fsync that fast. Are you using an enterprise ssd, or are drives just that fast now?
For DuckDB you can also use the ADBC bindings written for the Explorer project. I believe you can also use it with Explorer if that’s what you’re going for.
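From memory, the wiring for the adbc package looks roughly like this (a sketch; double-check the driver name, the options, and whether the DuckDB driver is available in the version you install):

```elixir
# config/config.exs — tell adbc which driver(s) to fetch
config :adbc, :drivers, [:duckdb]

# In your supervision tree
children = [
  {Adbc.Database, driver: :duckdb, process_options: [name: MyApp.DuckDB]},
  {Adbc.Connection, database: MyApp.DuckDB, process_options: [name: MyApp.DuckConn]}
]

# Then query it directly, e.g. straight over Parquet files:
Adbc.Connection.query(MyApp.DuckConn, "SELECT * FROM 'data/*.parquet' LIMIT 10")
```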
But you should stick with SQLite unless you have pretty heavy analytical queries. Once you need DuckDB, you’ll know (because your queries will get very very slow).