Technology/Design For Implementing Large Scale Joins In A Distributed System

big-data
design

#1

NOTE, I’m only asking this here because I find the forum is full of lots of smart people. I realize this question has nothing to do with Elixir.

I’ve been working on an real-time Elasticsearch backed reporting/worklist application for a while now and we have hit a wall when it comes to performing joins. Basically we are handling certain joins application side, which is slow, and not a sustainable approach. The Elasticsearch data itself is a near real-time copy of our main postgres server. Users create report criteria and get a report/worklist within seconds to work on. Every time a report is loaded ES is queried to get the results, which are returned within seconds.

After talking with other engineers, we decided to start looking at graph databases since the data does have a large amount of connections. So far I’ve worked with neo4j and dgraph, but both fall short in running aggregates and sorting large datasets in a timely manner.

I haven’t investigated many big data technologies, because the bulk of them are either batch or stream processing. I’m looking for a solution that handles large datasets, allows for timely queries against the entire dataset and handles joins.

My particular use case can be eventually consistent, workloads are read heavy, but I need to carry over current functionality which allows for sorting result sets in the millions.

A few different approaches I’ve thought about.

  • A custom sql schema to perform the work in a normal RDB using graphql, (most likely postgres)
  • Allow for longer running batch creation, and then keep reports up-to-date as data is changed using the same process the keeps ES in sync with our primary databse. IE, rather than rely on data being immediately available from the ES query, populate a report and continuously update it as new data comes from the primary database.
  • Some other backend technology that can handle these types of workloads effeciently.

Any members have suggestions for technology or design for a system like this?


#2

ElasticSearch is really not the ideal if you need joins. But one think you should ask yourself is: do you really need joins? What’s the problem you are trying to solve with ElasticSearch? Maybe the problem you have is that you’re trying to use big data solutions as relational databases, which they aren’t.

For example, if you use elasticsearch and you need the data from another dataset with a given dataset, maybe you should put that data on the same dataset, since the document structure on elasticsearch allows you to do that and makes it really easy and fast to query that type of data.

Did you get what I mean?


#3

Do you mean denormalize the data more? If so, this has been discussed. The additional denormalization would substantially increase the size of the data and so many updates would be happening per document it would degrade performance. At this point we are still reviewing some other big data technologies but we haven’t come across any magical tech that would help yet.


#4

You might want to look at Presto [ http://prestodb.github.io/ ], it provides an SQL interface over other data-sources. A quick search suggests that it has an Elasticsearch connector [ https://github.com/harbby/presto-connectors/tree/master/presto-elasticsearch ].


#5

Yep.

Well, about the size, it’s usually not a problem for big data solutions. Not so sure about Elasticsearch though. But about the insert/update performance, well, you probably noticed that is a problem either way, Elasticsearch was not created with this in mind, that’s one of the reasons it’s not recommended to use it as your main database. So indexing data on Elasticsearch should always be done in an asynchronous way, not locking users to update it.

Well, I worked with Google’s BigQuery too, and yeah, it has the same “problem” about performance on inserts. I know BigTable is a more low-latency big data solution, but it’s very expensive it, or at least was.


#6

I am not one of that smart guys, but sometime ago a read a lot about big data and real-time analytics for a a project at work and I came across some interesting stuff that would have used if I was the one deciding the stack. Unfortunately at my work they went with mysql and Elastic search and we also had problems with the queries times.

So let me share what I found interesting to use…

My first option

Did you have though of throwing your data into Kafka and use Kafka KSQL to query it with SQL:

CREATE STREAM vip_actions AS 
SELECT userid, page, action 
FROM clickstream c 
LEFT JOIN users u ON c.userid = u.user_id 
WHERE u.level = 'Platinum';

My second option

Did you ever looked into MemSql?

Run both transactional and analytical workloads with an integrated, familiar, durable SQL database.

It’s able to perform queries across data in memory and disk and seems very well suited for real-time analytics on big data.

And seems it can work with Kafka to… more here.

Despite being a paid database, they have a free version that we can evaluate quickly with a docker image.

To have a feeling for how both Kafka and MemSQL play together, they have this docker image to ingest Tweets.

Unfortunately the free version cannot be clustered securely and also misses other features from the paid version, thus for production is always necessary to use the paid one.

Conclusion

While I don’t have direct experience with the type of system you are dealing with I hope that someone can take soemthing from what I shared.


#7

Never tried it but I came across clickhouse recently. Have you considered it? If so, why doesn’t it fit your requirements?


#8

I haven’t tried memsql. We’ve tried some graph databases and are currently working on a poc using snowflake. Memsql I don’t know much about. Clickhouse was researched and determined it wouldn’t fit our needs due to its method of aggregating data. Snowflake seems promising, standard SQL interface, but with an archtiecture to handle big data/analytical workloads.