We have a phoenix app with a (rather big) postgres database behind it.
Our scientists want to query that data in unpredictable ways while also having access to the business logic.
One example would be to run a prognostic crop disease model on historical data, or to spot, across a hundred fields, when someone did not do their weekly leaf counting.
My thought is to give them access to a Livebook which has access to the code. I’d use either a duplicated database (synced every few hours) for that, or ensure read-only access to the production database.
What are your thoughts on that?
Is that a stupid use case where a better technology exists?
Duplication versus read-only. Which one is easier / more secure?
Any links to how I can make the business code available to the Livebook? Forking the repo and adding it as a dependency (which might fail because that code was never meant to be used as a dependency)? Or adding Livebook to the business code?
The first is adding a custom Mix environment, something like science, controlled by this env var:
MIX_ENV=science
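For context, a minimal sketch of how that could be wired up (not from the thread; Mix accepts any value for MIX_ENV, so :science needs no special declaration, and the paths below are made-up examples):

```elixir
# Start an IEx session or the Phoenix server in the custom env:
#
#   MIX_ENV=science iex -S mix
#   MIX_ENV=science mix phx.server
#
# In mix.exs, anything can then be scoped to it via `only:` or by
# matching on the env, e.g. extra compile paths (hypothetical):
defp elixirc_paths(:test), do: ["lib", "test/support"]
defp elixirc_paths(:science), do: ["lib", "science/support"]
defp elixirc_paths(_), do: ["lib"]
```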
Second, I’d look into a data warehouse or an analytical database. Depending on your deployment, it would probably make sense to sync Postgres to a data warehouse like Redshift if you’re on AWS, or BigQuery if you’re on GCP. With this setup you can take the analytical load off Postgres (if they’re planning on doing data analysis). To access the business logic, you could add a config/science.exs or something similar that sets up the connection to the data warehouse just for the science environment.
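A rough sketch of what that file could contain (repo module, hostnames, and credentials below are placeholders, not from this thread; it assumes config/config.exs ends with the usual `import_config "#{config_env()}.exs"` so the file is picked up when MIX_ENV=science):

```elixir
# config/science.exs -- sketch with placeholder names
import Config

# Point the repo at the analytics copy instead of the production Postgres.
config :my_app, MyApp.Repo,
  hostname: System.get_env("SCIENCE_DB_HOST", "analytics.internal"),
  database: "my_app_analytics",
  username: "science_readonly",
  password: System.get_env("SCIENCE_DB_PASSWORD"),
  pool_size: 5
```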
Third, as for importing the business logic into Livebook, I can’t say that I’ve tried this, but you could try adding Livebook as a dependency to your main app. If you have something like:
{:livebook, "~> 0.6.3", only: [:dev, :science]}
then it won’t be added to your production release, and I think the business logic would be exposed to Livebook? This would route business logic queries over the science environment’s database connection.
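If that works, a Livebook cell could then call straight into the app’s contexts. A rough sketch of the kind of thing the scientists could run (the context function here is hypothetical, standing in for whatever business logic applies; Kino is Livebook’s companion library for rendering output):

```elixir
# Inside a Livebook code cell, with the app's modules available on the runtime.
# MyApp.Fields.fields_missing_leaf_counts/1 is a made-up function name.
fields = MyApp.Fields.fields_missing_leaf_counts(weeks_back: 12)

# Render the result as an interactive table in the notebook.
Kino.DataTable.new(fields)
```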
This is awesome.
For the database I was originally thinking of just having a duplicated database that synchronizes nightly. I’d have to make sure it runs the same code as production because of the migrations, and also copy it again after each release. Nonetheless I will have a look at Redshift and BigQuery.
Thank you @mpope !
Streaming Replication - PostgreSQL wiki: is this what you mean? I want to avoid most cloud provider dependencies, and a quick Google search returned mostly Azure and Google Cloud links.
Synchronization is done nightly because I don’t want to destroy people’s work or create synchronization issues if they actually manipulate the data on the staging environment.
@mpope I did finally use a custom environment. Thank you for that hint.