IntegrateDB - Database sharing toolkit for Postgres

Hi,

I’d like to introduce IntegrateDB (docs: Intro — IntegrateDB v0.1.0), a database sharing toolkit that I’ve made in Elixir.

I’d love some feedback and any thoughts on it. The idea is inspired by projects like React, Tailwind and LiveView, in that it aims to boost developer productivity by deliberately coupling concerns that “received wisdom” holds should be kept separate.

Basically, integrating services directly through the database is usually seen as a smell, so we all build APIs and loosely coupled SOAs. But in many cases this is overkill: if you could integrate through the database, you could eliminate layers and go much faster. Especially if you’re making LiveView apps, where you don’t need an API and you have Ecto at your disposal …

The problem is that having multiple services using the same data is a recipe for disaster – unless you have some way of declaring and controlling data dependencies whilst enabling schema evolution. Which (as you may have guessed!) is exactly what IntegrateDB does :wink:

It’s a Phoenix-based web application that sits alongside a Postgres database. You interact with a JSON API to declare data dependencies and generate scoped db user credentials, which you then give to your applications so they can connect directly to the database — they don’t talk to IntegrateDB at all; it’s not a proxy or an extension.

It then adds a user-defined function to your database which you can call from your migrations to validate that they don’t break any of your declared data dependencies. It also sets up and consumes a logical replication feed (hence being Postgres-only for now), with the idea being to convert events that you register for into higher-level notifications for your apps.
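
To make the migration check concrete, here’s a minimal sketch of what calling that function from an Ecto migration could look like. The function name integratedb_validate_migration() is purely illustrative (check the IntegrateDB docs for the real one), as are the table and column names:

defmodule MyApp.Repo.Migrations.AddInvoiceNotes do
  use Ecto.Migration

  def up do
    alter table(:invoices) do
      add :notes, :text
    end

    # Hypothetical call: ask IntegrateDB's user-defined function to check that
    # this migration doesn't break any declared data dependencies.
    execute "SELECT integratedb_validate_migration()"
  end

  def down do
    alter table(:invoices) do
      remove :notes
    end
  end
end

Assuming the function raises on a violation, the failure would abort the migration’s transaction and the DDL change would be rolled back.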

All of which is designed to allow you to break the rules and integrate straight through the database without shooting yourself in the foot.

The code is at GitHub (integratedb/integrate: Core IntegrateDB source code repository), the docs are at IntegrateDB v0.1.0 — Documentation, and the website is https://integratedb.org

I’d love to get some feedback on the project.

Thanks,

James.

9 Likes

This seems really good. The only comment I have on reading is that when you are able to put more focus on security, it might be worth looking at what HashiCorp Vault does with temporary access credentials, so that the stakeholder bootstrapping process doesn’t have to rely on wholly manual setup configuration.

1 Like

@halostatue yup, that’s a great suggestion, thank you.

I have been thinking about some kind of more automatic bootstrapping, for example accepting an env var with a public key for the root user. Maybe a config file for stakeholders. It would be very interesting to think about pushing stakeholder creds to a secure store automatically. Could be Vault, or perhaps into a k8s secret?

Example - integrate a reporting application

Imagine that you have a primary web application with a Postgres database and that you want to integrate a reporting system. Assume you’ve installed IntegrateDB and bootstrapped a root user. You then start by creating a Stakeholder representing your secondary reporting application:

curl -X POST -H "..." -d '{"name": "reporter"}' /api/v1/stakeholders

This creates a Postgres database user (reporter) with access scoped to its own private DDL schema (reporter.*) and returns the database user credentials in the response data:

{
  "data": {
    "credentials": {
      "username": "reporter",
      "password": "... randomly generated ..."
    }
  }
}

Save the credentials (somewhere safe!) and provide them to your reporting application as the database user and password it should use to connect directly to the Postgres database.
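
For instance, if the reporting application happens to be another Elixir/Phoenix app, wiring the credentials in is just the usual Repo configuration. The app and module names below are illustrative, and the URL shape assumes standard Postgres conventions:

# config/runtime.exs of the (hypothetical) reporting application.
import Config

# DATABASE_URL is built from the credentials returned above, e.g.
#   postgres://reporter:<generated password>@your-db-host:5432/your_db
config :reporter, Reporter.Repo,
  url: System.fetch_env!("DATABASE_URL"),
  pool_size: 10

The reporter user then connects straight to Postgres with whatever access its claims grant it; IntegrateDB is not in the request path.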

Unless I misunderstood, you want to give the user and password to a client on the web to access the database directly?

Also, can the cURL request be done from a client on the internet?

Unless I misunderstood, you want to give the user and password to a client on the web to access the database directly?

Yup, I think it’s a slight misunderstanding in that the user and password are meant to be given to an application, not to a client. I.e. think of a Phoenix app which you would typically provide a DATABASE_URL to, not a React client that might typically talk to an API endpoint.

Thanks for flagging it up, as I should make that clear. You definitely shouldn’t expose the db user credentials publicly :slight_smile:

Also, can the cURL request be done from a client on the internet?

IntegrateDB is an application you would deploy on your infrastructure (server / cloud / cluster) fairly close to / alongside your database. So you can lock down access as you prefer – network firewall, RBAC, reverse proxy, etc.

The API is also envisaged as something more for an operator to talk to (a CLI is on the roadmap) or to use as part of a provisioning workflow. It would be unusual for your application to talk to it.

Hope that makes sense :slight_smile:

1 Like

It makes more sense now, but from a security point of view I am always suspicious when databases are exposed directly as an API.

You should strive to make it clear that the API is internal and should live on a private network, but even so I would require authentication for the internal API.

Yup, maybe a “security considerations” section in the docs.

I’m conscious that there’s more that could be done to make access to the IntegrateDB app more fine-grained – right now any user can do anything to any stakeholder, including claiming access to any arbitrary data. That’s definitely an area for improvement.

It’s also interesting to consider whether it needs to be a running application service at all. Perhaps it could be spun up and down, run locally, or even be reimagined as a CLI tool?!

Why not just a library that other applications can use programmatically?

It’s an often overlooked and under-appreciated need. I once had a project where I needed to connect two systems, and my options were to either build an API so one application could read from the other… or connect the databases so I could just query both at the same time.

Went with door #2 and the entire urgent project was done in 2 weeks. Not a single hiccup was experienced.

1 Like

Exactly :slight_smile:

There are some nuances. In particular, the replication consumer and notifications system as envisaged need to be a running service. However, it may well be possible to refactor so that these are different tools in the box – a library / local ops client and a notification service. The latter could run with just replication permission and wouldn’t expose the db in the same way.

Thanks for the suggestion, much to ruminate on!

1 Like

@brightball could I use this as a quote if I get round to posting some?!

1 Like

I’ll give you a better quote, but yes. Feel free.

Awesome, and by all means I’d love a better quote if you’d like to write one up (here, or I’m thruflo at gmail). Thanks!

Here you go. I was in the car when I responded earlier.

“It’s not right for every situation, but several years ago I worked on a project with two separate code bases with recurring subscription systems. Due to a dispute with one of them and some limitations in services, we had to quickly request for every single user to manually update their subscriptions from one platform to the other. In order to make it as easy as possible, the new system needed to be able to know everything about the subscription on the old system so that users would only need to update their payment information and click a button. Our choices were to build an entire API on the old system with a legacy code base that nobody understood or to let the new database connect to the legacy database directly to perform joins. We ended up going with the second option and had a working version in production under two weeks with no issues. Everyone migrated successfully and the crisis was averted. Sometimes SQL is the best API.”

4 Likes

@brightball awesome, thanks so much for taking the trouble to write this :slight_smile:

1 Like

This looks really nice! I also had a look at the code base but there’s one fundamental thing I don’t understand.

What I understand IntegrateDB does is create db accounts which can only access those fields they claim access to, and then verify that DB migrations don’t violate those claims. So far so good.

But in the code I see a whole lot of stuff going on to fully consume all logical replication messages being sent. Doesn’t that mean that every change to the DB data gets sent to IntegrateDB? Every insert, update and delete? What’s that for? My (likely incorrect) understanding is that IntegrateDB is a pretty passive program until migrations or new db users happen. What am I missing?

Either way, very cool, this looks like an excellent way to turn a bad practice good!

@eteeselink thanks for the feedback and for digging into the code :slight_smile:

You are right that at the moment it’s consuming every replication update and that this isn’t necessary. Let me try and explain!

What I’d conceived of initially was to combine the data claims with the ability to use the same machinery to configure notifications, as per usage.html#notifications. So the intention is to consume the logical replication feed and then “forward on” updates that a secondary app has registered for over web hook / socket / queue / etc. Now, I haven’t got round to implementing the event handling part of this yet, and when I do it’s on my todo list to restrict the publication feed so that it only streams updates for the registered tables.
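
To illustrate the intended flow (none of this is implemented yet, so the module, the shape of a decoded change and the per-stakeholder webhook below are all made up for the sketch): the consumer would match each replicated change against the registered specs and forward the hits, e.g. over a webhook.

# Sketch only -- illustrative names and shapes, not IntegrateDB's actual API.
defmodule Sketch.Notifications do
  # A decoded replication change might look like:
  #   %{schema: "public", table: "orders", op: :insert, row: %{"id" => 1}}
  # and a registered notification spec like:
  #   %{schema: "public", table: "orders", webhook: "https://reporter.internal/hooks/orders"}

  def handle_change(change, specs) do
    specs
    |> Enum.filter(&(&1.schema == change.schema and &1.table == change.table))
    |> Enum.each(&post_webhook(&1.webhook, change))
  end

  # Forward the change as JSON over HTTP (:inets and :ssl need to be started;
  # any HTTP client would do here).
  defp post_webhook(url, change) do
    body = Jason.encode!(%{op: change.op, table: change.table, row: change.row})

    :httpc.request(:post, {String.to_charlist(url), [], ~c"application/json", body}, [], [])
  end
end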

What I have implemented atm is using the logical replication machinery as a way to tell IntegrateDB to re-sync the data access claims after a migration. User specification data like fields: ["*"] is expanded to snapshot the actual columns that match the spec when saving data access claims. This needs to be re-done whenever the DDL changes, which means somehow triggering the process. Right now, I have an integrated_sync() function in the db that inserts a new record into the integratedb.sync table, and then the only thing the replication consumer actually does atm (as per replication.ex#L56-L66) is watch for these changes.
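
From the calling side, that mechanism boils down to something like the snippet below (a sketch based on the description above; MyApp.Repo is a stand-in and the exact function name/signature lives in the repo):

# After running DDL, e.g. at the end of a migration, call the sync function ...
Ecto.Adapters.SQL.query!(MyApp.Repo, "SELECT integrated_sync()")

# ... which inserts a row into integratedb.sync. The replication consumer sees
# that insert arrive over logical replication and re-expands the saved specs
# (e.g. fields: ["*"]) into an up-to-date set of claims.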

This could and should be locked down – the publication feed only needs the sync table at the moment. (In fact, I just pushed a temporary fix: “replication: limit publication to just the sync table for now”, integratedb/integrate@0b4a7a3 on GitHub.)

However, following some of the previous feedback in this thread, I’m thinking a more comprehensive solution is to separate out the “ops client” (which as you say can be passive and just needs to create db accounts which can only access those fields they claim access to, and then verify that DB migrations don’t violate those claims) from the notification service, which needs to be a running application to consume and forward on the events. This would mean that you wouldn’t need to enable replication at all unless you want to run the notification service.

The wrinkle is that I still need to expand spec data into claims somehow, which may mean moving some logic into the database. That might be a better solution anyway, as it could then be done transactionally as part of the migration.

Hope that makes sense and thanks again for pointing this out!

@thruflo I am really totally lost on what it actually does. “Simplify your architecture”, “Increase your development velocity”, “Without shooting yourself in the foot” all sound really great, but a pointer to an easy-to-follow scenario, maybe in the form of a blog post (ideally with a link to some code examples), would probably be really helpful to explain what you are achieving.

Hey @radiospiel, did you see the docs, e.g. Intro — IntegrateDB v0.1.0?

Hey James,

In fact I did. But I am still lost :slight_smile:

(I am probably not the target audience though)

/eno