Technology/Design For Implementing Large Scale Joins In A Distributed System

big-data
design

#1

NOTE, I’m only asking this here because I find the forum is full of lots of smart people. I realize this question has nothing to do with Elixir.

I’ve been working on an real-time Elasticsearch backed reporting/worklist application for a while now and we have hit a wall when it comes to performing joins. Basically we are handling certain joins application side, which is slow, and not a sustainable approach. The Elasticsearch data itself is a near real-time copy of our main postgres server. Users create report criteria and get a report/worklist within seconds to work on. Every time a report is loaded ES is queried to get the results, which are returned within seconds.

After talking with other engineers, we decided to start looking at graph databases since the data does have a large amount of connections. So far I’ve worked with neo4j and dgraph, but both fall short in running aggregates and sorting large datasets in a timely manner.

I haven’t investigated many big data technologies, because the bulk of them are either batch or stream processing. I’m looking for a solution that handles large datasets, allows for timely queries against the entire dataset and handles joins.

My particular use case can be eventually consistent, workloads are read heavy, but I need to carry over current functionality which allows for sorting result sets in the millions.

A few different approaches I’ve thought about.

  • A custom sql schema to perform the work in a normal RDB using graphql, (most likely postgres)
  • Allow for longer running batch creation, and then keep reports up-to-date as data is changed using the same process the keeps ES in sync with our primary databse. IE, rather than rely on data being immediately available from the ES query, populate a report and continuously update it as new data comes from the primary database.
  • Some other backend technology that can handle these types of workloads effeciently.

Any members have suggestions for technology or design for a system like this?