We have a phoenix app with a (rather big) postgres database behind it.
Our scientists want to query that data in unpredictable ways while also having access to the business logic.
One example would be to run a prognostic crop disease model on historical data, or to spot, across a hundred fields, when someone did not do their weekly leaf counting.
My thought is to give them access to a Livebook which has access to the code. I’d use either a duplicated database (synced every few hours) for that, or ensure read-only access to the production database.
What are your thoughts on that?
Is that a stupid use case where a better technology exists?
Duplication versus read-only. Which one is easier / more secure?
Any links to how I can make the business code available to the Livebook? Forking the repo and adding it as a dependency (which might fail because that code was never meant to be used as a dependency)? Or adding Livebook to the business code?
The first is adding a custom Mix environment, something like science, controlled by this env var:
MIX_ENV=science
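For context, a minimal sketch of how that could be wired up (not from the thread; Mix accepts any value for MIX_ENV, so :science needs no special declaration, and the paths below are made-up examples):

```elixir
# Start an IEx session or the Phoenix server in the custom env:
#
#   MIX_ENV=science iex -S mix
#   MIX_ENV=science mix phx.server
#
# In mix.exs, anything can then be scoped to it via `only:` or by
# matching on the env, e.g. extra compile paths (hypothetical):
defp elixirc_paths(:test), do: ["lib", "test/support"]
defp elixirc_paths(:science), do: ["lib", "science/support"]
defp elixirc_paths(_), do: ["lib"]
```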
Second, I’d look into a data warehouse or an analytical database. Depending on your deployment, it would probably make sense to sync Postgres to a data warehouse like Redshift if you’re on AWS, or BigQuery if you’re on GCP. With this setup you can take the analytical load off Postgres (if they’re planning on doing data analysis). To access the business logic, you could add a config/science.exs or something similar that sets up the connection to the data warehouse just for the science environment.
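A rough sketch of what that file could contain (repo module, hostnames, and credentials below are placeholders, not from this thread; it assumes config/config.exs ends with the usual `import_config "#{config_env()}.exs"` so the file is picked up when MIX_ENV=science):

```elixir
# config/science.exs -- sketch with placeholder names
import Config

# Point the repo at the analytics copy instead of the production Postgres.
config :my_app, MyApp.Repo,
  hostname: System.get_env("SCIENCE_DB_HOST", "analytics.internal"),
  database: "my_app_analytics",
  username: "science_readonly",
  password: System.get_env("SCIENCE_DB_PASSWORD"),
  pool_size: 5
```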
Third, as for importing the business logic into Livebook, I can’t say that I’ve tried this, but you could try adding Livebook as a dependency to your main app. If you have something like:
{:livebook, "~> 0.6.3", only: [:dev, :science]}
then it won’t be added to your production release, and I think the business logic would be exposed to Livebook? This would route business logic queries over the science environment’s database connection.
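If that works, a Livebook cell could then call straight into the app’s contexts. A rough sketch of the kind of thing the scientists could run (the context function here is hypothetical, standing in for whatever business logic applies; Kino is Livebook’s companion library for rendering output):

```elixir
# Inside a Livebook code cell, with the app's modules available on the runtime.
# MyApp.Fields.fields_missing_leaf_counts/1 is a made-up function name.
fields = MyApp.Fields.fields_missing_leaf_counts(weeks_back: 12)

# Render the result as an interactive table in the notebook.
Kino.DataTable.new(fields)
```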
This is awesome.
For the database I was originally thinking of just having a duplicated database that synchronizes nightly. I’d have to make sure it runs the same code as production because of the migrations, and also copy it again after each release. Nonetheless I will have a look at Redshift and BigQuery.
Thank you @mpope !
Streaming Replication - PostgreSQL wiki: is this what you mean? I want to avoid most cloud provider dependencies, and a quick Google search returned mostly Azure and Google Cloud links.
Synchronization is done nightly because I don’t want to destroy people’s work or create synchronization issues if they actually manipulate the data on the staging environment.
@mpope I did finally use a custom environment. Thank you for that hint.