Database tables optimization

Hi. I have two huge (I know, huge depends on where you sit :wink: ) tables A and B, with over 60 million records each, and they grow by about a million records a day each. Each table has a foreign key to another table (let’s call it client_id), a longish string, and some metadata. Most operations on those tables just check whether a given text is already in a table, and if it isn’t, and other preconditions are satisfied, it is added to one of the tables. Why the text, and what is being done with it, is a rather unimportant detail, and I’m bound by an NDA, so I can’t tell more :confused:

My problem is that as the tables grow, searching them (where the key I need to search by is the text) takes more and more time. What I am sure of is that the number of records per client is finite; for some it may be thousands, for some millions, but it is finite.

I was thinking about splitting tables A and B into many smaller tables, `"#{client_id}_A"` and `"#{client_id}_B"`, but I’m not sure it’s really a good solution. Maybe there are better alternatives?

And if I split them, how should I handle those tables in code? The list of clients is known only at runtime, so should I write a macro that defines and compiles a module per `"#{client_id}_A"` table, each module using Ecto.Schema etc.? Or maybe a generic module whose functions take a client_id and then run raw SQL queries against the proper tables? The generic module seems easier to write, and easier for fellow programmers to read, but maybe the macro has some advantages I’m not aware of?
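
If you go the generic route, something like this is possible with Ecto without defining a module per table. This is only a rough sketch, assuming a hypothetical `MyApp.EntryA` schema and `MyApp.Repo`, and relying on Ecto’s ability to override a schema’s source with a `{table_name, Schema}` tuple:

```elixir
defmodule MyApp.ClientTables do
  import Ecto.Query

  alias MyApp.{Repo, EntryA}

  # Builds the per-client table name, e.g. "acme_A".
  defp table_a(client_id), do: "#{client_id}_A"

  # Checks whether the given text already exists in the client's A table.
  def exists_in_a?(client_id, text) do
    query =
      from e in {table_a(client_id), EntryA},
        where: e.text == ^text,
        select: e.id,
        limit: 1

    Repo.one(query) != nil
  end

  # Inserts the text into the client's A table, skipping duplicates
  # (assumes a unique index on the text column and PostgreSQL 9.5+
  # for ON CONFLICT DO NOTHING).
  def put_in_a(client_id, text) do
    %EntryA{}
    |> Ecto.put_meta(source: table_a(client_id))
    |> Ecto.Changeset.change(text: text)
    |> Repo.insert(on_conflict: :nothing)
  end
end
```

The macro route mostly buys you compile-time module names; the query logic would look much the same either way.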

I’m looking for opinions, because I really am not sure how to approach this correctly.

What database are you using? The solution may vary significantly depending on that.

PostgreSQL 9

Partitioning is probably worth looking into.

https://www.postgresql.org/docs/9.0/static/ddl-partitioning.html

What is your indexing like currently?

I’ll look into it, but at a glance it doesn’t look like an easy option; I’ll have to dig into it deeper :joy:

I have an index on client_id, and also on (client_id, string).
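
For reference, in an Ecto migration that pair of indexes might look something like this (table and column names are just placeholders for the real ones):

```elixir
defmodule MyApp.Repo.Migrations.AddClientTextIndexes do
  use Ecto.Migration

  def change do
    # Lookups by client alone.
    create index(:table_a, [:client_id])

    # Lookups by client and text; making this one unique would also let
    # the database enforce the "no duplicate text per client" rule.
    create index(:table_a, [:client_id, :text])
  end
end
```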

The easy option is to buy a bigger server.

Have you tried examining your queries to make sure they are using your indexes?
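
For example, you can run EXPLAIN ANALYZE straight from IEx through the repo; the query below is only a placeholder for whatever the real lookup is:

```elixir
# Look for "Index Scan" / "Index Only Scan" rather than "Seq Scan" in the output.
{:ok, %{rows: rows}} =
  Ecto.Adapters.SQL.query(
    MyApp.Repo,
    "EXPLAIN ANALYZE SELECT 1 FROM table_a WHERE client_id = $1 AND text = $2",
    ["some_client", "some text"]
  )

Enum.each(rows, fn [line] -> IO.puts(line) end)
```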

A bigger server is the least viable option. I’ll check things again, talk to the team, and look at partitioning.

I’m not sure whether it would fit, but how about creating an auxiliary inverted key-value store, where the key is the client id and the value is the list of table rows holding information related to that particular client?
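
If that direction were explored, a bare-bones version could be an ETS table keyed by client id. This is purely a sketch, and whether the per-client data volume fits in memory is a separate question:

```elixir
# :duplicate_bag lets one client key map to many stored texts.
:ets.new(:client_texts, [:duplicate_bag, :named_table, read_concurrency: true])

:ets.insert(:client_texts, {"client_42", "some text"})

# All rows cached for a client:
:ets.lookup(:client_texts, "client_42")

# Membership check for a specific (client, text) pair:
:ets.match_object(:client_texts, {"client_42", "some text"}) != []
```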

Well, it’s the right way to limit the amount of data searched for any single client. Partition on some modulus of the client id.

Then why are your search times growing out of control? You need to explain your queries and see what PG is doing with them.
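
On PostgreSQL 9.x that means inheritance-based partitioning (declarative partitioning only arrived in 10). Below is a very rough sketch as an Ecto migration, with table/column names and the partition count made up, and assuming an integer client_id. Note that on 9.x the queries (or an insert trigger, as described in the linked docs) also have to reference the same modulus expression for the planner to skip partitions:

```elixir
defmodule MyApp.Repo.Migrations.PartitionTableA do
  use Ecto.Migration

  def up do
    for i <- 0..3 do
      # Each child carries a CHECK constraint so constraint exclusion can
      # prune it when the query filters on client_id % 4.
      execute("""
      CREATE TABLE table_a_p#{i} (
        CHECK (client_id % 4 = #{i})
      ) INHERITS (table_a);
      """)

      execute("CREATE INDEX ON table_a_p#{i} (client_id, text);")
    end
  end

  def down do
    for i <- 0..3, do: execute("DROP TABLE table_a_p#{i};")
  end
end
```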

It seems the production database had no proper indexes while development did. I’m to blame :stuck_out_tongue: Nevertheless, the next step will be partitioning, if performance starts to degrade below some threshold. Thank you all for the input and links. :slight_smile:

It’s hard to say without more detail, but beyond the obvious, like proper indexing and so forth, maybe materialised views can help? I had a similar problem, although admittedly on a smaller scale, and a materialised view saved the day. The other thing I tried was, instead of running the same search against multiple columns in one query, using regular expressions on the incoming params in Phoenix to decide which single column to search.
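
In case it helps, a materialised view can be created from an Ecto migration with raw SQL. The view below is purely illustrative, and it assumes PostgreSQL 9.3+ (9.4+ for the concurrent refresh):

```elixir
defmodule MyApp.Repo.Migrations.CreateClientTextCounts do
  use Ecto.Migration

  def up do
    execute("""
    CREATE MATERIALIZED VIEW client_text_counts AS
    SELECT client_id, count(*) AS texts
    FROM table_a
    GROUP BY client_id;
    """)

    # A unique index is required for REFRESH ... CONCURRENTLY.
    execute("CREATE UNIQUE INDEX ON client_text_counts (client_id);")
  end

  def down do
    execute("DROP MATERIALIZED VIEW client_text_counts;")
  end
end

# Refresh it periodically, e.g. from a scheduled job:
# Ecto.Adapters.SQL.query!(MyApp.Repo, "REFRESH MATERIALIZED VIEW CONCURRENTLY client_text_counts", [])
```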

Actually the core of the problem is trivial: check whether a record (an id in one column, a text in the other) is in the table or not, and if it isn’t, put it into one of the tables. There are, I estimate, roughly hundreds (maybe thousands) of these checks per second. So far, with proper indices (:man_facepalming:), it’s working OK, but I think I need to add more logging and watch performance over a longer period of time to be sure.
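
If the check-then-insert turns out to be the hot path, it may be worth letting the database do both in one round trip (PostgreSQL 9.5+), assuming a unique index on (client_id, text) and leaving the "other preconditions" check where it is today. Schema and field names below are placeholders:

```elixir
# ON CONFLICT DO NOTHING: the row is inserted only if the (client_id, text)
# pair is not already there, without a separate SELECT first.
%MyApp.Entry{client_id: "client_42", text: "some text"}
|> MyApp.Repo.insert(on_conflict: :nothing, conflict_target: [:client_id, :text])
```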