Building a massive data structure correctly

TL;DR
Is there a better or more efficient way to build a massive data structure (like a snapshot), or is this a job for another language?

In depth
In a Redis server, we have objects segmented by parameters. The base object is an institute, which has a base JSON object with fields like address, phone number, code, id, etc. Each institute then has nested keys holding JSON data, such as institute:<code>:teachers, institute:<code>:executive and others.
There is a builder that creates this massive structure for many institutes from a list of institute codes: one by one, it generates the base structure and then merges in the nested objects.

A little example:

%{
    "institute-a" => %{
      address: "Street A 453",
      id: 12,
      status: "active",
      online: true,
      location: %{
        country: "United States",
        state: "CA",
        city: "Los Angeles"
      },
      teachers: %{
        "john-doe" => %{
          age: 35,
          asignatures: ["Art", "Music"],
          id: 999
        },
        ...
      },
      courses: %{
        "A1" => %{
          students: %{
            "carol-doe" => %{
              age: 15,
              grades: %{
                "music" => "A",
                "arts" => "B+"
              },
              ....
            }
          }
        }
      }
    },
    ....
  }
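As a point of reference, here is a minimal sketch (not the actual builder) of how one institute could be assembled from such keys, assuming the Redix client and Jason for JSON decoding; the key names and nested fields are just the illustrative ones from above:

    defmodule InstituteBuilder do
      @nested ~w(teachers executive courses)

      # Fetches the base object plus its nested keys in a single MGET,
      # then merges the nested maps into the base map.
      def build(conn, code) do
        keys =
          ["institute:#{code}" | Enum.map(@nested, fn field -> "institute:#{code}:#{field}" end)]

        {:ok, values} = Redix.command(conn, ["MGET" | keys])
        [base | nested] = Enum.map(values, &decode/1)

        @nested
        |> Enum.zip(nested)
        |> Enum.reduce(base, fn {field, data}, acc -> Map.put(acc, field, data) end)
      end

      defp decode(nil), do: %{}
      defp decode(json), do: Jason.decode!(json)
    end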

The time to build this structure used to be fine (5 secs), but more institutes have been added to the Redis server; we already have 500.
I have tried many approaches, but when the data is this massive, the request takes 15~20 secs to return the data or send it as a response in a channel event.
I know Elixir is not the friendliest fit for this kind of processing, and this is temporary, but I need ideas to make it more efficient while respecting the Elixir way of doing things.

How often does that data change?

It changes throughout the day; the peak of changes (creation, modification, deletion, etc.) is roughly 9am to 3pm, Monday to Friday. Concurrency is low between 3pm and 9am of the next day, and on weekends.

You can have a look at this; it might be of some help, or at least give you an idea of how to tackle your problem.


For the mid term I have proposed a refactor: a new project on a current Elixir version (the project is on Elixir 1.6.5) to build a more scalable, up-to-date system. Using Rustler to handle this kind of complex structure is one point in the proposal to the team lead. Yep, a goal for next year is to learn Rust.

Did you find out which part is actually taking most of the time?

If building the data takes 5 secs but the whole request takes 15-20… it means something else takes 10-15 secs :wink:

Things to check before going further (e.g. Rustler):

  • Use iolists, not strings, for Redis commands (if the library supports them)
  • Check for Redis library/server settings that make things slow - especially the Redis response time, not your app's response time (see the timing sketch after this list)
  • Do not rebuild the data again and again
  • Avoid copying large data across Erlang nodes - it may be better to refetch from the Phoenix channel process
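For the second point especially, a rough way to see which step actually dominates (the Redis fetch, the in-memory build, or the JSON encoding) is :timer.tc; fetch_from_redis/2 and build_structure/1 here are hypothetical names standing in for the existing steps:

    # Microsecond timings for each stage of one request.
    {fetch_us, raw}      = :timer.tc(fn -> fetch_from_redis(conn, codes) end)
    {build_us, snapshot} = :timer.tc(fn -> build_structure(raw) end)
    {encode_us, _json}   = :timer.tc(fn -> Jason.encode!(snapshot) end)

    IO.inspect({fetch_us, build_us, encode_us}, label: "fetch/build/encode (µs)")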
  1. The part that takes most of the time is building the N+1 structure. As I said, processing each institute one by one is where the time goes (the base data gets more nested attributes appended during the process). I use Enum.map over the N+1 institute codes; I was thinking of using a reduce here instead (see the sketch after this list). The problem is that when a client connects to the channel, the API sends this structure to show on a front page, or it is fetched from an endpoint, and 20 to 30 secs is a lot. If this continues, it could take even longer once we reach 1000 institutes.

  2. I will check whether the Redix library uses iolists.

  3. We have a Redis cluster as an Azure service. The point of this data is to be persistent, quick, and easy to process.

  4. We thought about storing just one institute object per Redis key (one stored JSON object). The build process would be simpler, but with the highly concurrent data transactions (creates, changes, deletes) we have, the fear of data conflicts (desynchronization) makes this a potential headache. This data is important (sensitive) for many processes and other clients that consume it.
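Regarding point 1, switching from Enum.map to Enum.reduce mostly changes how the result is accumulated rather than how long the per-institute work takes; a minimal sketch, with build_institute/1 as a placeholder for whatever currently builds one institute:

    # Hypothetical: codes is the list of institute codes.
    snapshot =
      Enum.reduce(codes, %{}, fn code, acc ->
        Map.put(acc, code, build_institute(code))
      end)

The per-institute Redis round trips are still the N+1 part, so batching or parallelising those (as discussed below) is more likely where time can be won.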

500 is too small a number to be noticing slowness, hm.

Check whether you have an N+1 query problem. You may need to batch the processing, but you should not have an N+1 query problem. Are you fetching the institutes and all their associated data at once from the database? Or do you need to iterate over the associated records per institute (e.g. because they're in Redis…)?

It is still not clear where things go wrong. Do you get it from Redis? Or build it, put it into Redis, and return it to the client? Without knowing exactly what the problem is, it's hard to fix.

I'm not sure, but it sounds like Flow may be something to look into as a short-term solution. It should help spread the map processing across all your cores.
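A minimal sketch of what that could look like, assuming the flow dependency is added and build_institute/1 stands in for the existing per-institute builder:

    # Hypothetical: spread the per-institute builds across cores with Flow.
    snapshot =
      codes
      |> Flow.from_enumerable(max_demand: 10)
      |> Flow.map(fn code -> {code, build_institute(code)} end)
      |> Enum.into(%{})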


500 is too small a number to be noticing slowness, hm.

Yes, but we want to prepare the API to hold >1000 objects.

Check whether you have an N+1 query problem. You may need to batch the processing, but you should not have an N+1 query problem. Are you fetching the institutes and all their associated data at once from the database? Or do you need to iterate over the associated records per institute (e.g. because they're in Redis…)?

All these records are in Redis.

It is still not clear where things go wrong. Do you get it from Redis? Or build it, put it into Redis, and return it to the client?

We get the data from Redis and assemble the N+1 institute objects with their associated data (the nested attributes also hold N+1 objects and keep growing). This is executed every time it is required (a response on a join event, a channel event, or an endpoint GET request). Clients sometimes need a specific data structure, with less data, without index keys, etc.

Without knowing exactly what the problem is, it's hard to fix.

The front-end devs say the data response is too slow (15 to 20 secs). They think it should take 5~10 secs on average, even if we had 20k objects in the massive structure.

Oh! This is interesting. I'll check whether it is compatible with our current Elixir version, I hope it is. Thanks!

And yeah, if you can cache the response for even a second it may help quite a bit. Cachex provides a nice built-in TTL feature that will automatically bust your cache after a set timeframe.
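A minimal sketch of that idea, assuming Cachex 3.x is started under the application's supervision tree; the cache name, key, TTL and build_snapshot/0 are placeholders:

    # Serve the cached snapshot when present; rebuild and cache it with a short TTL otherwise.
    def snapshot do
      case Cachex.get(:snapshot_cache, :snapshot) do
        {:ok, nil} ->
          data = build_snapshot()
          Cachex.put(:snapshot_cache, :snapshot, data, ttl: :timer.seconds(1))
          data

        {:ok, data} ->
          data
      end
    end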


Okay, then you have an N+1 query problem - this is a data-fetching problem, not necessarily an Elixir one.

You may make it faster by performing the N+1 queries in parallel (use Task, for instance) - but in that case you lose the benefit of Redis pipelining.
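A minimal sketch of that parallel approach, with build_institute/1 as a placeholder for the existing per-institute fetch and build:

    # Hypothetical: each task pays its own Redis round trips, so pipelining is lost.
    snapshot =
      codes
      |> Task.async_stream(fn code -> {code, build_institute(code)} end,
        max_concurrency: System.schedulers_online() * 2,
        timeout: 30_000
      )
      |> Enum.into(%{}, fn {:ok, pair} -> pair end)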

Anyway, you're not accessing the data in the “Redis way”. You may prefetch most values if you know the key pattern in advance - e.g. using SCAN or its family.

Or you can use a Lua script to do a bit of preprocessing on the Redis side.
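A minimal sketch of that, assuming Redix and the illustrative key layout from above; note that on a Redis Cluster all keys a script touches must hash to the same slot (e.g. via hash tags), so this is only an outline of the idea:

    # Collect an institute's base object and nested keys in one server-side call.
    script = """
    local code = ARGV[1]
    return {
      redis.call('GET', 'institute:' .. code),
      redis.call('GET', 'institute:' .. code .. ':teachers'),
      redis.call('GET', 'institute:' .. code .. ':executive')
    }
    """

    {:ok, [base, teachers, executive]} = Redix.command(conn, ["EVAL", script, "0", code])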

As you already have Redis, you may use Redis for caching as well - which is helpful if the transformation is expensive compared to the network bandwidth between the Erlang nodes and the Redis instance.


I don't know if caching the cached data is the right thing :rofl: I try to keep this process clean, but in situations like this I think we need to keep the data in memory if it is used concurrently and needs to be quick (the Redis data could serve as a backup). The problem is when this data is updated, changed or deleted; the writes would have to be applied to Redis and the cache in parallel (like syncing the data).
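A minimal sketch of that write-through idea, assuming Cachex for the in-memory copy and Redix for Redis; the function name, cache name and key layout are placeholders, and the two writes are sequential rather than transactional:

    # Apply a change to Redis and to the in-memory cache together,
    # so reads can be served from memory while Redis stays the backup.
    def update_institute(conn, code, institute) do
      json = Jason.encode!(institute)

      with {:ok, _} <- Redix.command(conn, ["SET", "institute:#{code}", json]),
           {:ok, _} <- Cachex.put(:institute_cache, code, institute) do
        :ok
      end
    end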

I know very little about this topic - but before doing any code changes or system alterations, check whether the increase in the size of the data being processed is causing swapping to disk anywhere… just a thought :slight_smile: