Elephant-watching - Monitoring PostgreSQL

In Robert M. Wysocki's latest Write Stuff article, he looks at the wider aspects of monitoring PostgreSQL and the tools you can use to get a better view of your database's performance.

Proper monitoring is the key to keeping your system in good health, and it's especially important in the database field. Whether you have a busy system processing many transactions each second or you just want to keep tabs on a data warehouse handling only a few heavy queries, you'll want to graph most of your metrics in order to see how they're trending. There are many solutions for doing that, but let me tell you a secret: in this endeavor, Grafana will be your best friend.

The goal

Usually when talking about monitoring we really mean monitoring and alerting – the two are a bit different but complementary, and there's very little point in doing one without the other. We can monitor events and metrics, and when a notable event occurs we may decide to alert on it. But usually it's better to spot the trend and act beforehand.

Probably the best way to keep track of your metrics' trends is to visualize them with graphs. And that's where Grafana comes in. Unlike one of my favorite systems – Munin – Grafana allows the creation of dynamic graphs and lets you choose a time period in a very flexible fashion.

For busy OLTP systems it's important to have a live overview of what's happening, and the biggest pain with Munin was the need to wait (five minutes by default) for the graphs to be updated. Thanks to Grafana (and a few other parts of the pipeline discussed below), this is no longer an issue.

If you want to view your metrics over a long period of time, though, you may experience a slowdown, since most of the work of generating graphs is done in your browser. But have no fear: making Grafana generate a static PNG file is just a few clicks away!

One thing to remember though – your dashboard won't be complete if you only include database-related graphs in it. To have a proper monitoring setup for your precious RDBMS, you should also have a clear view of system-level metrics: CPU usage (including I/O waits), system load, memory usage and, of course, disk-related statistics like I/O activity (both IOPS and latency), swap activity and usage, and disk space utilization. You shouldn't forget about network activity either. Sometimes it's also useful to see hardware-level metrics like interrupts graphed over time, though these matter less if you're running your database in a virtualized environment – which happens more often than most DBAs would like.

Another question worth answering before you start to work on your monitoring dash is: what is most important to you in your setup? Or, to express the same idea in a slightly different way: what breaks most often for you? When you look at your dashboard, you want to know at first glance that something is wrong. That's why you need to think about the placement of your graphs, so that the most important ones are clearly visible.

The pipeline

Grafana is just one part of a modern graphing pipeline – it's responsible for visualizing metrics and putting the graphs together into dashboards, and it provides a nice web UI simplifying those tasks.

But to be able to create graphs, first you'll need data points. A full-stack graphing solution utilizing Grafana usually also consists of Graphite for transporting and storing metrics and CollectD for gathering them.

While Graphite is quite a complex piece of software comprising a few autonomous daemons, CollectD is very simple and fast. This allows almost continuous data point collection, with very small gaps of, say, 5 seconds between data points – a huge difference from the default Munin configuration mentioned above.

A scalable setup of those components might look like this:

[Image: graphite setup]

Each monitored host runs a CollectD process that connects to one of the carbon-relays (a part of the Graphite suite); each relay forwards the received data to a central server, where carbon (another part of the Graphite suite) uses whisper (yet another one) to write the collected metrics to files in an organized structure. It's also possible to run multiple instances of carbon and whisper to provide even better scalability, HA and backup.
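On the monitored-host side, the piece that ties CollectD into this pipeline is its write_graphite plugin. As a minimal sketch – the relay hostname below is a placeholder, though 2003 is carbon's usual line-receiver port – the configuration might look like this:

```
# collectd.conf fragment (a sketch; adjust the relay address to your setup)
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "relay">
    Host "carbon-relay.example.com"   # placeholder carbon-relay address
    Port "2003"                       # carbon's default plaintext port
    Protocol "tcp"
    Prefix "collectd."                # metrics will appear under collectd.*
    StoreRates true                   # convert counters to rates before sending
  </Node>
</Plugin>
```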

I won't go into details about setting up the full pipeline since it's outside the scope of this article. Suffice it to say that's a job for sysadmins. DBAs really shouldn't be concerned with anything other than configuring CollectD and creating graphs and dashboards, and the rest of this article will cover just that.

The practice

Some time ago I learned about a new "sport" - birdwatching. Well, I'd like to coin the notion of elephant-watching now. Actually, most of the kudos for this should go to Sebastian "tokkee" Harl, since it was he who used the phrase "Watch your elephants" in his presentation given at PgConf.EU 2012 in Prague.

So the goal is to have a comprehensive database statistics dashboard presenting the most important metrics in a clear and visible manner – one that communicates, in an obvious way, any problems that may happen. To achieve this we have to combine a few of CollectD's plugins. First, we're going to need to configure the basic ones like "cpu", "memory", "disk" or "df". This is really simple; in some cases - like "cpu" for example - all you need to do is LoadPlugin it. And again, this can and should be handled by your friendly sysadmin.
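For illustration, a minimal collectd.conf fragment enabling those basic plugins might look like this; the 5-second interval matches the collection frequency mentioned earlier, and the df option is just one example of per-plugin tuning:

```
# collectd.conf fragment (a sketch)
Interval 5            # collect data points every 5 seconds

LoadPlugin cpu        # no further configuration needed
LoadPlugin memory
LoadPlugin disk
LoadPlugin df

<Plugin df>
  ValuesPercentage true   # report disk space usage as percentages too
</Plugin>
```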

But there are also some plugins that we can bend to fit our specific, database-oriented needs. Sebastian, in his presentation, gives a great example with the "processes" plugin - we can use it to group processes and graph things like memory used by PostgreSQL's autovacuum workers, I/O utilized by the background writer, CPU time used by the statistics collector and so on. This can complement system-level metrics, like general CPU or I/O activity, in a very interesting way. A sketch of such a grouping is shown below.
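ProcessMatch pairs a group name with a regex matched against the command line; the patterns here are assumptions and will need adjusting to how process titles actually look on your system:

```
# collectd.conf fragment (a sketch; regexes are assumptions)
LoadPlugin processes

<Plugin processes>
  # each ProcessMatch creates a named group of processes to track
  ProcessMatch "pg_autovacuum" "postgres: autovacuum worker.*"
  ProcessMatch "pg_bgwriter"   "postgres: background writer.*"
  ProcessMatch "pg_stats"      "postgres: stats collector.*"
</Plugin>
```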

And then there's the postgresql CollectD plugin, with a nice set of default options including TPS, queries or I/O metrics, available per cluster as well as per table.
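Enabling it can be as simple as pointing it at your database and picking some of the queries predefined by the plugin; in this sketch the database name and the monitoring role are placeholders:

```
# collectd.conf fragment (a sketch; database name and user are placeholders)
LoadPlugin postgresql

<Plugin postgresql>
  <Database mydb>
    Host "localhost"
    User "collectd"          # a read-only monitoring role
    # some of the queries predefined by the plugin
    Query backends
    Query transactions
    Query queries
    Query disk_io
  </Database>
</Plugin>
```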

Using the predefined metrics can be fine, but with more active databases we usually want to also collect some custom ones from the totally awesome PostgreSQL statistics collector. This is, of course, possible with this plugin - thanks to the Query block you can define a custom Statement, fetch its results and store them as metrics. One important thing to remember though – use the proper type, for example a counter or a gauge; some metrics, like the number of scans or tuples fetched, are ever-increasing counters, while others, like the number of connections, are actual point-in-time values. You may want to take a look at types.db and grep it for pg_.
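As a sketch, here's what a custom Query block counting open connections might look like; pg_numbackends is one of the pg_ types you'll find in types.db, and since the number of connections is an actual value, it's a gauge:

```
# collectd.conf fragment (a sketch)
<Plugin postgresql>
  <Query connections>
    Statement "SELECT count(*) AS count FROM pg_stat_activity;"
    <Result>
      Type "pg_numbackends"   # a gauge type from types.db
      ValuesFrom "count"      # result-set column to read the value from
    </Result>
  </Query>

  <Database mydb>             # placeholder, as above
    Host "localhost"
    User "collectd"
    Query connections
  </Database>
</Plugin>
```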

When you're done, your dashboard might look somewhat like this:

[Image: pg dash]

As you can see, Grafana gives you a lot of freedom in visualizing your data; you can choose between different graph types (line, area, stacked, etc.), you can present a single statistic as a number, and you can also show a graph in its background.

Presenting single stats can be very useful in making problems obvious at a glance. In my example you can see a single stat showing the average cache hit ratio at the top of the image; it's green, but it'll turn red should the ratio drop below a defined threshold. The same can be done with other metrics like CPU usage, I/O wait, swap activity and so on.
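For the curious, the relevant fragment of such a panel's JSON might look roughly like this; the field names follow Grafana's singlestat panel, and the threshold values are just an example (below 90% the background turns red, above 95% green):

```
{
  "type": "singlestat",
  "title": "Cache hit ratio",
  "format": "percent",
  "valueName": "avg",
  "colorBackground": true,
  "thresholds": "90,95",
  "colors": ["rgba(245,54,54,0.9)", "rgba(237,129,40,0.9)", "rgba(50,172,45,0.9)"],
  "sparkline": { "show": true }
}
```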

Another of Grafana's great advantages over Munin is its powerful query and transformation language. It's very easy to combine metrics, correlate them, graph an average or a sum, and so on.

Let's take a look at how I created a CPU usage graph:

[Image: cpu graph]

As mentioned above, CollectD sends metrics to Graphite, and whisper then stores them in a hierarchical structure. To make use of a metric in Grafana, you first have to find it in this structure. But it wouldn't be reasonable to create a new dashboard for every monitored server, right? That's why Grafana lets you use variables.

If you take a look at the screenshot, you'll notice I defined 5 variables - each in a simple manner, as a part of the CollectD metric "path", so system_role, for example, is defined as collectd.* and environment as collectd.${system_role}.*. Selectors for each variable are available at the top of your dashboard, and this way your dash simply becomes reusable.
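The pattern simply follows the metric path, each variable filtering the next level down. Extending the two definitions above, the chain of five might look like this – the first two names come from the screenshot, the remaining three are my guesses at a typical hierarchy:

```
# Grafana template variable queries (a sketch; last three names are assumed)
system_role = collectd.*
environment = collectd.${system_role}.*
datacenter  = collectd.${system_role}.${environment}.*
cluster     = collectd.${system_role}.${environment}.${datacenter}.*
host        = collectd.${system_role}.${environment}.${datacenter}.${cluster}.*
```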

In this particular graph there are two data series defined: A - named (using the alias() function) user+system, and B - wait. You'll notice I'm using the summarize() function – this way the graph can be smoother than it would be if we presented all the data points. (Actually, I should have used avg in summarize() instead of sum, but I figured that out after capturing the screenshot.) You can also see that for series A I use sumSeriesWithWildcards() - this allows me to combine CPU time spent in userspace and in the kernel - the function sums over the 8th node (0-based), so {user,system} in this case. Apart from that, I'm averaging over all the system's CPUs (nodes 6 and 7) and, in the end, averaging over all the servers selected (you can select more than one).
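Putting that description together, the full target for series A might look roughly like this; the metric path is an assumption based on the template variables discussed above, so treat it as a sketch rather than the exact query from the screenshot:

```
alias(
  averageSeries(
    averageSeriesWithWildcards(
      sumSeriesWithWildcards(
        summarize(
          collectd.$system_role.$environment.$datacenter.$cluster.$host.*.*.{user,system},
          '30s', 'avg'),
        8),
      6, 7)),
  'user+system')
```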

This is quite a simple use case, but there are many more functions to choose from.

The conclusion

I'd like to end this article with a note that you should always remember: graphing and alerting are two elements of monitoring that should go together.

I hope this little tutorial will help you set up your Grafana dash for PostgreSQL, but please do remember that a good alerting solution should also be in place.

Also, there are many other interesting projects out there to help with PostgreSQL monitoring, to name only a few:

Robert M. Wysocki is a freelance PostgreSQL consultant, experienced GNU/Linux administrator and author of the travellingsysop.net technical blog.

This article is licensed with CC-BY-NC-SA 4.0 by Compose.