Need options and tradeoffs for storing and accessing digraphs (directed graphs)

sudostack · May 31, 2020, 8:25pm

Scenario(s) (ordered from most needed to least needed)

I have user profiles that need to display an aggregate sum of the subscribers and subscriptions for a given user (edge count)
Want to be able to display some subset of subscribers/subscriptions for a user in some pre-defined set of orders (likely paginated): recently added
I may want to produce at some point or run a query for n-degrees (depth from root) of separations (e.g. subscribers of subscribers), but it’s not an immediate use case

What I’ve looked at/heard about/questions

For aggregations, I’m wondering that if I stick with PostgreSQL whether it’d make sense to use materialized views and triggers anticipating the possibility of highly concurrent updates to these counts
I’ve looked at @bcardarella’s question here from years ago: Ecto preload for tag has_many tags and see recursive queries for traversal
Should I consider a graph database, like Neo4J having had no prior experience with them? And would it be easy to transfer these relationships over at a later point?

krstfk · June 1, 2020, 1:32pm

Hi,

I would recommend against this solution. Given that you’ll have only one type of vertices and a single relation between them, and given the access pattern you describe, the added complexity of using (eg) neo4j seems uncalled for.

I love neo4j, but at this point, using an elixir to neo4j driver would imply learning cypher, and get familiar with the intricacies of neo4j as there are no mature interfaces right now.

That’s not to say that the neo4j driver is not usable, it is very much so, and Florin - the author of the driver - and Dom - one of its main contributor - are great and I expect that the driver will improve over time, and we may get higher level libraries soonish thanks to @domvas

I just don’t think it would be worth it in this context.

I would probably try to do it in Postgres only using the resources you have quoted from the Ecto related post. If you can, maybe check Chapter 2 of Designing Data Intensive Applications.

Another solution worth investigating would be erlang digraph . It’s a digraph solution backed by ets tables. Granted it has its pros and cons :

pros
- It’s part of the regular Erlang/Beam ecosystem, so it should be fairly familiar
- It’s backed by ETS tables, so it should be reasonably fast
- Tables must be private or protected, so the locking mechanism will be a given (which is also a con)
- You can use all the algorithms from digraph utils.
cons
- It’s ets so the database is stored in ram.
- Those ets tables are either private or protected, meaning you can only ever update from a single process ( but this is also a pro, since you will need to ensure that two vertices that you link do exist at any given time – I would test wether it can handle the expected load).
- If the process owning the table dies, your data disappear. I’m assuming here that you have a way to rebuild the data from the pg db, but even then you would need to be aware of the time it takes and how you rebuild it (eg, you could rebuild the whole graph upon startup, which would mean unavailable results or inconsistent results for a while or you could rebuild it as you go (is node x in the graph, if not construct the graph for node x) which would mean higher latency.)

Hope this helps, somewhat.

sudostack · June 2, 2020, 6:23am

Major major thanks, @krstfk!

I appreciate the awesome tradeoff assessment! Yeah, I already don’t do enough raw SQL anymore to say I’m good at that now – one of the pitfalls of ORM usage, so learning Cypher could be quite the rabbit-hole.

Ah, good ol’ DDIA. I need to revisit the book. Will pass through chapter 2 again as tonight’s reading.

I was definitely not aware of the digraph module in Erlang. Thank you for sharing that! It’s really fleshed out! It seems like a good tool to have if/when I encounter the scenario that I need to handle enormous load with concurrent updates (hot-spots).

LostKobrakai · June 2, 2020, 7:25am

There’s also the third party libgraph package on hex, if you need graphs, but have them not backed by ets.

sudostack · June 2, 2020, 4:16pm

Thanks for this! I had starred this repo from BitWalker a long time ago and didn’t even look at Elixir libs until you threw it back out there. I really hadn’t thought about modeling the problem at the application layer – mostly the shepherding between a durable persistence layer

domvas · June 2, 2020, 7:28pm

I agree with @krstfk on the complexity of using Neo4j. Even if it is simple to grasp, doing things nicely from modeling to maintenance is not so simple.

Regarding your use cases, the only one where Neo4j will really fit is not a immediate one, so be nice with yourself and start with postgres which has nice capabilies and can allow you to do graph with agensgrah.
If you encounter the limit of the system (especially if you want to do realtime stuff) then you can add Neo4j on the side for your use case. In fact, it’s common to have a relational database as the source of truth, and have some data copied into a graph database to address some issues / use cases.

Last but not least, translating a relational model to a graph model is really easier than the opposite