How to store and scale big data

Hello Alchemist,

While this doesn’t have much to do with Elixir, I hope you will allow me to feast a bit on your knowledge :smile:
I’m currently working on a statistics engine, which processes game files to generate and persist a bunch of stats for each game.
For each stat entry I save a lot of metadata for that specific stat (unstructured data), which is why I thought Postgres could be a good fit, as I can rely on its jsonb type for the metadata.
This is what the structure looks like now

```
     Column      |       Type        | Collation | Nullable
-----------------+-------------------+-----------+----------
 id              | bigint            |           | not null
 value           | double precision  |           | not null
 stat_identifier | character varying |           | not null
 player_id       | character varying |           | not null
 user_id         | bigint            |           |
 team_id         | bigint            |           |
 ghost_id        | bigint            |           |
 stat_meta_id    | bigint            |           | not null
 meta            | jsonb             |           |
```

The meta field could contain something like this

{"side": "CT", "time": 24.7734375, "round": 1, "stage": "de_inferno", "weapon": "P2000", "overtime": false, "was_kill": false, "hit_group": "chest", "victim_id": "76561198856413397", "was_headshot": false, "victim_position": {"x": 197.15223693847656, "y": 1380.577392578125, "z": 104.56356811523438}, "was_self_damage": false, "was_team_damage": false, "attacker_position": {"x": 418.1552429199219, "y": 2116.254638671875, "z": 131.56341552734375}}

The goal here is to allow for some advanced queries across multiple matches, like how many times we get killed in overtime, on map x, in location y, by weapon z (an example of the kind of query I mean is sketched below).
This is perfectly possible with this structure. My problem is scalability. An average game gives me around 12,000 entries in that table, and 500 games take up 2GB of storage.
I’m likely to process 1,000+ games a day, growing exponentially.
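To make that concrete, this is roughly the kind of query I have in mind (a sketch only; the table name game_stats and the 'death' identifier are placeholders, not my actual schema):

```sql
-- A GIN index with jsonb_path_ops speeds up containment (@>) queries on meta.
CREATE INDEX IF NOT EXISTS game_stats_meta_idx
    ON game_stats USING GIN (meta jsonb_path_ops);

-- "How many times did we die in overtime, on de_inferno, to a P2000?"
-- Location filters would need expression comparisons on the jsonb, e.g.
-- (meta #>> '{victim_position,x}')::float BETWEEN 150 AND 250.
SELECT count(*)
FROM game_stats
WHERE stat_identifier = 'death'  -- placeholder identifier
  AND meta @> '{"overtime": true, "stage": "de_inferno", "weapon": "P2000"}';
```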

I don’t have that much experience with big data (which I guess this categorizes as?). What are my options here? Should I look elsewhere than Postgres?

I spent most of yesterday researching how to scale a solution like this.
The obvious approach of course is to limit the amount of stats put into the database, which is definitely an option, but a last resort.

Cassandra could look like a solution: while it does not support unstructured data, empty fields do not take up any space, so I could just add the meta fields as regular columns on the table.
However, I find it hard to model for Cassandra, as the philosophy is to build tables around your filters/queries (e.g. users_by_id), but the filters applied to my queries will vary and the combinations are many.
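For anyone curious, “modeling around the query” in Cassandra would mean something like this (CQL; a purely hypothetical table, one per query pattern, which is exactly what doesn’t fit my varying filters):

```sql
-- CQL sketch: a table pre-shaped for one specific question,
-- e.g. "deaths per map and weapon". Each additional filter
-- combination would need its own table or materialized view.
CREATE TABLE deaths_by_map_and_weapon (
    stage      text,
    weapon     text,
    match_id   bigint,
    event_time double,
    victim_id  text,
    PRIMARY KEY ((stage, weapon), match_id, event_time)
);
```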

I could really use some tips on which direction to take; any help is greatly appreciated.
Thank you

Given your description, I would stick with Postgres at this point.

What is your expected storage requirement? GBs, or into the TB range? Within the tens-of-GB range I would use Mnesia.

Once you go over the TB range, I would consider looking into a data lake like LeoFS (written in Erlang).

Thanks for your reply :slightly_smiling_face:.

It’s actually a system to replace a current engine that we have, which harvests almost zero metadata. We currently have around 40,000 games ready to be run through the new engine, which will take up about 160GB. The service is currently running in an early-access mode, so the volume could rise significantly when we open the doors. I think the 1TB mark will be hit fairly quickly.

I have yet to try it, but something like https://www.citusdata.com/blog/2015/05/05/compressing-jsonb-cstore-fdw/ could maybe prolong the Postgres option.
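From skimming that post, the idea would be roughly this (a sketch only, I haven’t tried it; the cold_stats table and its columns are made up):

```sql
-- cstore_fdw stores rows column-oriented and compressed (pglz),
-- which the linked post reports shrinking jsonb-heavy data a lot.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Hypothetical archive table for older matches.
CREATE FOREIGN TABLE cold_stats (
    id              bigint,
    value           double precision,
    stat_identifier text,
    player_id       text,
    meta            jsonb
)
SERVER cstore_server
OPTIONS (compression 'pglz');
```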

I will take a look at LeoFS :slight_smile:

Maybe something like this https://aws.amazon.com/big-data/datalakes-and-analytics/


https://www.timescale.com/ (a Postgres extension) might be an option.

Some possibly relevant features:
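Getting started would be roughly a one-liner (a sketch; it assumes your table is called game_stats and has a timestamp column such as recorded_at, which your schema above doesn’t show):

```sql
-- TimescaleDB splits the table into time-based chunks ("hypertable"),
-- which keeps inserts and time-bounded queries fast as data grows.
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Assumes a recorded_at timestamptz column exists on the table.
SELECT create_hypertable('game_stats', 'recorded_at');
```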


If you’re currently on AWS, this application sounds like a decent fit for Redshift. You’d want to expand out the keys in meta to columns (Redshift is specifically designed to work with wiiiiiiiiiiide rows). It has a Postgres-flavored interface, so you wouldn’t need to change much about how you’re querying the data.
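Roughly like this, as a sketch (the column set is just lifted from your meta example; the DISTKEY/SORTKEY choices are illustrative, not a recommendation):

```sql
-- Redshift sketch: flatten the jsonb keys into plain columns.
CREATE TABLE game_stats_wide (
    id              bigint,
    value           double precision,
    stat_identifier varchar(64),
    player_id       varchar(32),
    stage           varchar(32),
    weapon          varchar(32),
    side            varchar(8),
    round           smallint,
    overtime        boolean,
    was_kill        boolean,
    was_headshot    boolean,
    hit_group       varchar(16),
    victim_id       varchar(32),
    victim_x        double precision,
    victim_y        double precision,
    victim_z        double precision
)
DISTKEY (player_id)
SORTKEY (stage, weapon);
```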


So your initial conclusion is correct: Cassandra would be quite difficult to work with for your task.

Thanks for all the replies lads, some great suggestions!

@al2o3cr I think I’m gonna give Redshift a try with some ultra-wide columns, as we are on the Amazon stack already. :slight_smile: