Hi. I have two huge (I know, huge depends on where you sit) tables A and B, each with over 60 million records, and each growing by about a million records daily. Each table has a foreign key to another table, let's call it client_id, a longish string column, and some metadata. Most operations on these tables just check whether a given text has already been added to a table, and if it hasn't, and other preconditions are satisfied, it's added to one of the tables. What the text is and what's done with it is a rather unimportant detail, and I'm bound by an NDA, so I can't say more.
My problem is that as the tables grow, searching them takes more and more time, and the key I need to search by is the text. What I am sure of is that the number of records per client is finite: for some it may be thousands, for some millions, but it is finite.
I was thinking about splitting tables A and B into many smaller tables `"#{client_id}_A"` and `"#{client_id}_B"`, but I'm not sure it's really a good solution. Maybe there are better alternatives?
And if I split them, how should I handle the tables in code? The list of clients is known only at runtime, so should I write a macro that defines and compiles a module per `"#{client_id}_A"` table, each module using Ecto.Schema etc.? Or maybe a generic module whose functions take client_id and run raw SQL queries against the proper tables? The generic module seems easier to write, and easier for fellow programmers to read, but maybe the macro approach has advantages I'm not aware of?
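To make the question concrete, here's roughly what I mean by the generic module. This is just a sketch; the table suffix, the `text_value` column, and the `MyApp.Clients.known_ids/0` lookup are made-up names. Since a table name can't be a query parameter, I validate client_id against the known client list before interpolating it, to rule out SQL injection:

```elixir
defmodule MyApp.ClientTables do
  @moduledoc """
  Generic access to per-client tables ("#{client_id}_A", "#{client_id}_B").
  Sketch only: assumes a MyApp.Clients context that knows the valid ids.
  """

  alias MyApp.Repo

  # Table names cannot be bound as query parameters, so the client id is
  # checked against the known list before being interpolated into SQL.
  defp table_name(client_id, suffix) when suffix in ["A", "B"] do
    unless client_id in MyApp.Clients.known_ids() do
      raise ArgumentError, "unknown client: #{inspect(client_id)}"
    end

    "#{client_id}_#{suffix}"
  end

  @doc "Checks whether the given text already exists in the client's table."
  def exists?(client_id, suffix, text) do
    sql =
      "SELECT EXISTS (SELECT 1 FROM #{table_name(client_id, suffix)} " <>
        "WHERE text_value = $1)"

    %{rows: [[exists]]} = Ecto.Adapters.SQL.query!(Repo, sql, [text])
    exists
  end
end
```

The text itself still goes through a bind parameter (`$1`), so only the validated table name is interpolated.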
I’m looking for opinions, because I really am not sure how to approach this correctly.
Not sure whether it would fit, but how about creating an auxiliary inverted key-value store, where the key is the client id and the value is the list of table rows holding the information related to that client?
It seems the production database had no proper indexes while development did. I'm to blame. Nevertheless, the next step will be partitioning, if performance degrades below some threshold. Thank you all for the input and links.
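For reference, this is roughly what I have in mind for the partitioning step: PostgreSQL declarative list partitioning by client_id, created via raw SQL from an Ecto migration. All table and column names here are placeholders; note that the partition key has to be part of the primary key:

```elixir
defmodule MyApp.Repo.Migrations.PartitionTableA do
  use Ecto.Migration

  def up do
    # Parent table partitioned by client_id; the partition key must be
    # included in the primary key.
    execute """
    CREATE TABLE table_a_partitioned (
      id bigserial,
      client_id text NOT NULL,
      text_value text NOT NULL,
      PRIMARY KEY (client_id, id)
    ) PARTITION BY LIST (client_id)
    """

    # One partition per client; in practice these would be created at
    # runtime as clients appear ('client_42' is a placeholder value).
    execute """
    CREATE TABLE table_a_client_42
      PARTITION OF table_a_partitioned
      FOR VALUES IN ('client_42')
    """
  end

  def down do
    execute "DROP TABLE table_a_partitioned CASCADE"
  end
end
```

The nice part is that queries still target one logical table, and the planner prunes to the single partition when client_id is in the WHERE clause, which avoids the per-client-module question entirely.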
It's hard to say without more detail, but beyond the obvious, like proper indexing and so forth, maybe materialised views can help? I had a similar problem, admittedly on a smaller scale, and a materialised view saved the day. The other thing I did: instead of searching multiple columns in the same query, I used regular expressions on the incoming params in Phoenix to decide which column to search, rather than running the same search against every column.
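Something along these lines, created from a migration (the view, source table, and columns are invented for illustration, you'd shape it around whatever your hot query is):

```elixir
defmodule MyApp.Repo.Migrations.CreateClientTextSummary do
  use Ecto.Migration

  def up do
    # Hypothetical precomputed per-client summary of a hypothetical
    # table_a; the unique index is required for CONCURRENTLY refreshes.
    execute """
    CREATE MATERIALIZED VIEW client_text_summary AS
      SELECT client_id, count(*) AS record_count, max(inserted_at) AS last_seen
      FROM table_a
      GROUP BY client_id
    """

    execute """
    CREATE UNIQUE INDEX client_text_summary_client_id_index
      ON client_text_summary (client_id)
    """
  end

  def down do
    execute "DROP MATERIALIZED VIEW client_text_summary"
  end
end
```

It does need a periodic `REFRESH MATERIALIZED VIEW CONCURRENTLY client_text_summary` (from a cron job or a scheduled task), so it only fits queries that tolerate slightly stale data.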
Actually the core of the problem is trivial: check whether a record (an id in one column, a text in the other) is in the table, and if it isn't, put it in one of the tables. There are, I estimate, roughly hundreds (maybe thousands) of these checks per second. So far, with proper indices, it's working OK, but I think I need to add more logging and watch performance over a longer period of time to be sure.
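One thing I'm considering is collapsing the check-and-insert into a single statement with an upsert, so the database does the existence check atomically. A sketch, assuming a unique index on `(client_id, text_value)` and a hypothetical `MyApp.TableA` schema:

```elixir
defmodule MyApp.Dedup do
  @moduledoc """
  Sketch: insert-if-absent in one round trip via ON CONFLICT DO NOTHING.
  Assumes a unique index on (client_id, text_value) on the hypothetical
  table behind MyApp.TableA.
  """

  alias MyApp.{Repo, TableA}

  def put_if_absent(client_id, text) do
    %TableA{client_id: client_id, text_value: text}
    |> Repo.insert(
      on_conflict: :nothing,
      conflict_target: [:client_id, :text_value]
    )
    |> case do
      # When the row already existed, nothing is inserted and Ecto
      # leaves the returned struct's id as nil.
      {:ok, %TableA{id: nil}} -> :already_present
      {:ok, _record} -> :added
    end
  end
end
```

That also removes the race between the check and the insert that the two-step version has under hundreds of concurrent checks per second.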