Simple Index Checking with PostgreSQL

Matt Barr of mySidewalk takes up the Write Stuff challenge this time round, looking at a quick and simple way to see how well your indexing is performing on your PostgreSQL database.

At mySidewalk, we provide planners, engineers, elected offices, and their communities with a powerful suite of engagement, statistical, and geographic tools. Our mission requires us to manage and process large, complex, and constantly growing datasets, but we have a simple philosophy around performance: it’s not a feature, it’s the feature. Scale, reliability, and efficiency are mission critical, too. Using PostgreSQL as our primary database allows us to pursue these goals with its rich table and index usage statistics. I’m going to illustrate an easy way to pull and analyze statistics about your PostgreSQL database.

Assumptions: What you need in order to make use of this post

Gathering Your Statistics: The query and its output

Gathering and arranging these statistics is a fairly short query. Each of the statistics views involved in the query is well documented (specifically pg_stat_user_tables, pg_statio_user_tables, and pg_statio_user_indexes). Remember that pages, blocks, and buffers are often used interchangeably in the PostgreSQL community and documentation. Remember also that these statistics can be reset manually but usually are not; they often represent the life of the database (more on this later).
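Something like the following sketch will pull the relevant counts and ratios per table and per index; the column aliases are just illustrative, and NULLIF guards against division by zero for tables that have never been scanned or read:

```sql
-- A sketch of the kind of query involved: per-table scan counts and page hit
-- ratios from pg_stat_user_tables and pg_statio_user_tables. Column aliases
-- are illustrative.
-- (SELECT pg_stat_reset(); would reset these counters for the current database.)
SELECT
  t.relname                                            AS table_name,
  t.seq_scan                                           AS table_scans,
  COALESCE(t.idx_scan, 0)                              AS index_scans,
  COALESCE(t.idx_scan, 0)::numeric
    / NULLIF(COALESCE(t.idx_scan, 0) + t.seq_scan, 0)  AS index_use_ratio,
  io.heap_blks_hit::numeric
    / NULLIF(io.heap_blks_hit + io.heap_blks_read, 0)  AS table_page_hit_ratio,
  io.idx_blks_hit::numeric
    / NULLIF(io.idx_blks_hit + io.idx_blks_read, 0)    AS index_page_hit_ratio
FROM pg_stat_user_tables t
JOIN pg_statio_user_tables io USING (relid)
ORDER BY t.seq_scan + COALESCE(t.idx_scan, 0) DESC;

-- Per-index page hit ratios come from pg_statio_user_indexes.
SELECT
  relname                                              AS table_name,
  indexrelname                                         AS index_name,
  idx_blks_hit::numeric
    / NULLIF(idx_blks_hit + idx_blks_read, 0)          AS index_page_hit_ratio
FROM pg_statio_user_indexes
ORDER BY idx_blks_hit + idx_blks_read DESC;
```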

Column breakdown

Note that all ratios are out of 1.0 and higher is better. Other numbers are simple counts of either rows or pages.

Analyzing Your Ratios: Logarithmic rule of thumb

In analyzing these statistics, the ratios will be the most telling. A good index use ratio to target is highly dependent on your workload and the size of the table, but for speedy answers to queries on non-trivial tables you’ll want to be using indexes, pushing the ratio toward 1.0 by identifying queries that are not taking advantage of any index.
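As an illustrative (not prescriptive) way to surface candidates, you could filter the statistics for non-trivial tables whose index use ratio falls below some threshold; the row count and ratio cutoffs below are arbitrary placeholders:

```sql
-- Illustrative only: flag non-trivial tables with a low index use ratio.
-- The 10,000-row and 0.95 cutoffs are arbitrary placeholders; tune them to
-- your own workload.
SELECT relname,
       n_live_tup,
       seq_scan,
       COALESCE(idx_scan, 0)                            AS idx_scan,
       COALESCE(idx_scan, 0)::numeric
         / NULLIF(COALESCE(idx_scan, 0) + seq_scan, 0)  AS index_use_ratio
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
  AND COALESCE(idx_scan, 0)::numeric
        / NULLIF(COALESCE(idx_scan, 0) + seq_scan, 0) < 0.95
ORDER BY n_live_tup DESC;
```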

In contrast, common wisdom about page cache hit ratios is that, for most applications, .99 or higher is desirable. Further, it’s helpful to group these ratios logarithmically so that you can easily distinguish >.99, >.9, and everything else (if your application domain demands an even better hit ratio, you can simply extend this logic, e.g. separating out >.999 as well).
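Expressed as a sketch in SQL (the bucket labels are mine), that grouping looks like this:

```sql
-- Sketch: bucket the table page hit ratio logarithmically, as described above.
-- Tables that have never been read fall into the last bucket.
SELECT relname,
       heap_blks_hit::numeric
         / NULLIF(heap_blks_hit + heap_blks_read, 0)           AS table_page_hit_ratio,
       CASE
         WHEN heap_blks_hit::numeric
              / NULLIF(heap_blks_hit + heap_blks_read, 0) > 0.99 THEN '> .99'
         WHEN heap_blks_hit::numeric
              / NULLIF(heap_blks_hit + heap_blks_read, 0) > 0.9  THEN '> .9'
         ELSE '<= .9 (or never read)'
       END                                                     AS hit_ratio_bucket
FROM pg_statio_user_tables
ORDER BY heap_blks_hit + heap_blks_read DESC;
```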

A good spreadsheet tool with conditional formatting makes this analysis trivial. Let’s walk through some real-life production examples using Google Sheets.
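If you’re pulling the output into a spreadsheet by hand, one simple route from psql is \copy with CSV output; the query and file name below are just examples:

```sql
-- From psql: export a statistics query to CSV for a spreadsheet tool.
-- The query and file path are arbitrary examples.
\copy (SELECT relname, heap_blks_hit, heap_blks_read FROM pg_statio_user_tables) TO 'table_stats.csv' CSV HEADER
```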

Example 1: Fully optimized

`dashboard_public_characteristic`, an optimized table

The above represents a highly optimized table, with layout, indexes, and clustering (heap order in PostgreSQL) designed for the queries it services. Indexes are used ~10,000 times as often as table scans, and page hits are ~100 times as common as misses. The one remark you can make about this table is that one of the indexes is used as much as 20 times more frequently than the others. This insight has actually sparked an internal discussion about moving from the surrogate primary key to a natural key (which is represented by the first index anyway) and dropping the pkey index and column; this would only drive the other ratios even higher.

Example 2: Lower than optimal index hit ratio

`activity_contentview`, suboptimal index hit ratio

We keep track of individual views of interesting content in our application for fine-grained reporting and analytics. For obvious reasons, this table is written to almost constantly and contains a lot of data. From that point of view, a 97% hit ratio on the biggest, most used index is impressive; however, it is under our target hit ratio of .99 and thus a topic for an optimization conversation.

In this particular situation, we’ve denormalized certain aggregates on content views into another table just to get to the ratios we have. The likely next step is partitioning older data to shrink the table and its indexes, optimizing the hit ratios of the more frequently accessed newest data (and consequently accepting poorer ratios on the less frequently accessed partitions).
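As a rough, hypothetical sketch of what that could look like with declarative partitioning (PostgreSQL 10+); the column names here are invented for illustration, not the real table’s schema:

```sql
-- Hypothetical sketch of range-partitioning a content-view table by date.
-- Column names are invented for illustration and are not the real
-- activity_contentview schema.
CREATE TABLE activity_contentview_parted (
    content_id bigint      NOT NULL,
    user_id    bigint,
    viewed_at  timestamptz NOT NULL
) PARTITION BY RANGE (viewed_at);

-- Recent, frequently accessed data stays in a small, cache-friendly partition...
CREATE TABLE activity_contentview_2016 PARTITION OF activity_contentview_parted
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

-- ...while older, rarely read data can tolerate poorer hit ratios.
CREATE TABLE activity_contentview_2015 PARTITION OF activity_contentview_parted
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

-- Indexing only what the hot partition needs keeps that index small, too.
CREATE INDEX ON activity_contentview_2016 (content_id, viewed_at);
```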

Example 3: It hurts to watch

`election_streetsegment`, a grim situation

Historically, most applications and backend systems that serve ballot data use street segments (ranges of street addresses that receive a particular ballot combination), precincts (basically polling locations), and precinct splits (partial precincts that will receive the same ballot). In our initial attempts at delivering sample ballots to users, we used the same data model as the systems from which we were aggregating the data. A street segment table is large (both in row count and row width), poorly ordered, and accessed randomly, and these things show in the ratios presented.

In cases where all ratios are suboptimal, the problem most often comes down to one of data layout. We sidestepped the issue by designing a new data model based on the geographical boundaries of specific election contests (which leverages our geo-statistics knowledge and makes the data organized and compact). We maintain the street segment data in order to answer queries about polling locations, however, so it could use optimization. A quick analysis of the sheet reveals:

Dropping several of the indexes, then creating and clustering on a new index (optimizing the most frequent access pattern), has yielded promising results so far.
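The general shape of that change, with hypothetical index and column names, is roughly:

```sql
-- Rough sketch of the kind of change described; the specific index and column
-- names here are hypothetical.
DROP INDEX IF EXISTS election_streetsegment_rarely_used_idx;

-- Create an index that matches the most frequent access pattern...
CREATE INDEX election_streetsegment_lookup_idx
    ON election_streetsegment (street_name, from_number, to_number);

-- ...rewrite the heap in that order, and refresh planner statistics.
CLUSTER election_streetsegment USING election_streetsegment_lookup_idx;
ANALYZE election_streetsegment;
```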
`election_streetsegment` after optimizations

Suggestions and next steps

The method above allows you to be proactive about performance issues rather than putting out performance fires once they reach “unacceptable”. I hope you try this on your database and find the information and examples instructive. Additional ideas and resources to pursue can be found below.

Further reading

Matt Barr works at mySidewalk helping to create the technology that powers a toolset directed at making communities better. Civics, democracy, technology, and the great outdoors are his passions.

This article is licensed with CC-BY-NC-SA 4.0 by Compose.