Postgres schema design question

An app has to provide the ability to sort and filter a fairly large dataset by any column; the current setup has an index on each column of the data tables. This is obviously horrible for updates and inserts (updating 50+ indexes in some instances). Does anyone have any cool ideas or experience with this kind of setup?

2 Likes

Uh, hmm, first I’d try to keep the data as immutable as possible; second, probably remove some of those indexes if a lot of writes are happening; third, if the results are not per-user, then lots and lots of caching: I’d probably keep a continuously updated cache of each heavily used variation.
PostgreSQL has a nice feature I use for that called a MATERIALIZED VIEW; it can save a little work there, though not a ton. You have to re-materialize it manually, but many things can share it.
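
For reference, a minimal sketch of that pattern (table and column names here are hypothetical):

```sql
-- Hypothetical: precompute one commonly requested ordering.
CREATE MATERIALIZED VIEW top_events_by_score AS
SELECT id, score, created_at
FROM events
ORDER BY score DESC
LIMIT 300;

-- A unique index lets you refresh without blocking readers.
CREATE UNIQUE INDEX ON top_events_by_score (id);

-- Run manually or from a scheduled job.
REFRESH MATERIALIZED VIEW CONCURRENTLY top_events_by_score;
```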

3 Likes

The major portions of the tables are insert-heavy, not update-heavy.

Yeah, I would love to, even just to save on RAM, but all queries are ORDER BY with a LIMIT against a huge number of rows, so an index scan until you hit 300 matches vs. a full table scan of possibly a few hundred million rows is what drove the original design.
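
In other words, the queries look roughly like this (names hypothetical); without an index on the ORDER BY column, Postgres would have to scan and sort the whole table before it could apply the LIMIT:

```sql
-- With an index on some_column, the planner can walk the index in
-- order and stop after 300 matching rows instead of sorting the table.
SELECT *
FROM big_table
WHERE other_column = 'needle'
ORDER BY some_column DESC
LIMIT 300;
```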

No data layer abstraction; something like 20 services hitting a single DB directly :slight_smile:

For some joined things, yep, it will save time.

1 Like

Cool, so make it all immutable then.

Now what about reading vs writing?

Something I do often for things like that is keep a running tally: keep a side table and insert there too. Now when people query, you can limit them to, oh, 300 max or so, and for each ‘index’ you keep the ‘top’ 300 of that sort. This makes reading the sorted data for the limited top subset almost stupidly fast; it increases write time slightly due to another table write (plus a janitor process to go over it occasionally and pare it back down again), but it is easy to do. You do have to make a new table for each new ‘index’ you are doing, basically making custom indexes by hand, but then you can get rid of the 50 indexes on the main table. :wink:
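
A minimal sketch of that side-table idea, assuming a hypothetical events table (all names invented, PostgreSQL 11+ trigger syntax):

```sql
-- Side table holding just the top rows for one sort order.
CREATE TABLE events_top_by_score (
    id         bigint PRIMARY KEY,
    score      integer NOT NULL,
    created_at timestamptz NOT NULL
);

-- Mirror every insert on the main table into the side table.
CREATE FUNCTION mirror_to_top() RETURNS trigger AS $$
BEGIN
    INSERT INTO events_top_by_score (id, score, created_at)
    VALUES (NEW.id, NEW.score, NEW.created_at);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_mirror_top
AFTER INSERT ON events
FOR EACH ROW EXECUTE FUNCTION mirror_to_top();

-- The "janitor": run occasionally to pare the side table back down.
DELETE FROM events_top_by_score
WHERE id NOT IN (
    SELECT id
    FROM events_top_by_score
    ORDER BY score DESC
    LIMIT 300
);
```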

Uh, then build something in front that emulates PostgreSQL!

Or, worst case: lots of views (replace some tables with views), lots of functions, etc. You can optimize a PostgreSQL database internally pretty well, but I hope you are comfortable with the true horrors of SQL if you do that. It is possible, though.
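
As a sketch of the replace-a-table-with-a-view trick (hypothetical names; a real migration also needs care around permissions and writes):

```sql
-- Swap the physical table out from under its old name, so existing
-- services keep querying "events" unchanged.
ALTER TABLE events RENAME TO events_raw;

CREATE VIEW events AS
SELECT id, score, created_at
FROM events_raw
WHERE NOT deleted;  -- e.g. hide soft-deleted rows, reshape columns, etc.
```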

Better yet, burn down all the things that access the DB directly; that design is asking for trouble.


Materialized views can work for the recent top-N list too, but I prefer to update those on the fly, so you could get rid of the main table’s index hoard.

2 Likes

Thank you, cool idea!

2 Likes

Basically, “true” services need service autonomy, i.e. they don’t share their data with anyone else (which means either that one single service manages the data and the way it is cached, or that the data is “duplicated” over multiple services, but with varying service-specific representations).

Sounds like too much “reporting” on the OLTP system :slight_smile: (like you didn’t know that already)

Rather than simply “guessing” based on the indices, it may be worth logging the queries that are hitting the table (I know, I know, performance killer) in order to determine:

  • Which queries are actually common
  • Relative frequency of each query
  • Cost of common queries

That way you should get a better idea of which queries to go after first, and you can optimize those “side tables” for them, holding just enough data to be relevant/accurate. In effect you would be taking the first steps of refactoring towards segregating the data for these diverse “services”. A sketch of one way to gather those statistics follows.
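
A lighter-weight option than full statement logging is the pg_stat_statements extension, which aggregates exactly those numbers. A minimal sketch (the *_exec_time columns are the PostgreSQL 13+ names; older versions call them total_time / mean_time):

```sql
-- Requires shared_preload_libraries = 'pg_stat_statements'
-- in postgresql.conf, then:
CREATE EXTENSION pg_stat_statements;

-- Which normalized queries are common, how often, and at what cost.
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```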

2 Likes

This is a large commercial product and I am in my first week on a new job, so my ability to change it fundamentally is limited :slight_smile:

[quote=“peerreynders, post:6, topic:4239”]
Rather than simply “guessing” based on the indices, it may be worth logging the queries that are hitting the table (I know, I know, performance killer) in order to determine:

Which queries are actually common
Relative frequency of each query
Cost of common queries
[/quote]
That’s the problem: all queries are basically ad hoc; people are looking for a needle in a haystack.

1 Like

Now, I don’t know what is going on there, but it sounds to me like you may have been dropped into a situation with ignored/unacknowledged technological debt. There is often the expectation that advances in technology will counter the accumulated technological debt, but in general it rarely works out that way. So by “fundamentally limiting your ability to effect change” they might as well be asking you to address a million-dollar debt with a single quarter. Just like financial debt, technological debt has to be managed; otherwise it turns into technical bankruptcy, which can kill the business.

Often “ad hoc queries” is a euphemistic description of a situation where “everybody and their brother” is issuing manual queries (from Excel spreadsheets, Access reports, etc.) against a mission-critical OLTP RDBMS; where I come from, that is considered a cardinal sin.

The old-fashioned solution is to simply duplicate the database, sync it every night, and redirect all the reporting traffic (“ad hoc queries”) to the duplicate DB, where it doesn’t interfere with mission-critical operations.
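
A modern, pure-SQL way to keep such a duplicate in sync is logical replication (PostgreSQL 10+); it streams continuously rather than nightly, but serves the same purpose. All names below are placeholders:

```sql
-- On the production database: publish the tables of interest.
CREATE PUBLICATION reporting_pub FOR ALL TABLES;

-- On the reporting database (the same table definitions must already
-- exist there): subscribe to the publication.
CREATE SUBSCRIPTION reporting_sub
    CONNECTION 'host=prod-host dbname=prod_db user=replicator'
    PUBLICATION reporting_pub;
```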

The more modern approach is to go the BI (Business Intelligence) route and duplicate the business data into a schema optimized for analysis (often a star schema), where the various BI tools can do their thing, feeding dashboards and such. One beneficial side effect is often that historical data relevant only to reporting can be removed from the operations database, which may bring additional performance improvements.
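
For a sense of the star-schema shape, a toy sketch (all names invented):

```sql
-- Small, denormalized dimension tables...
CREATE TABLE dim_date (
    date_id integer PRIMARY KEY,
    day     date NOT NULL,
    month   integer NOT NULL,
    year    integer NOT NULL
);

CREATE TABLE dim_product (
    product_id integer PRIMARY KEY,
    name       text NOT NULL,
    category   text NOT NULL
);

-- ...surrounding one wide fact table of measurements.
CREATE TABLE fact_sales (
    date_id    integer NOT NULL REFERENCES dim_date (date_id),
    product_id integer NOT NULL REFERENCES dim_product (product_id),
    quantity   integer NOT NULL,
    amount     numeric(12,2) NOT NULL
);
```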

So far it looks like you are being asked for a “quick fix” where there is none.

2 Likes

This is not really BI, although it is similar in function. The sole purpose of this database is to collect data and let people dig through it quickly. The users are very few and very technical (most likely better than I am); the only purpose of the system is to save them time.

1 Like