
Monitoring PostgreSQL Streaming Replication

Having migrated to PostgreSQL 9.2 and set up Streaming Replication successfully, the next obvious thing you want to know is, “Is it still working?” Fortunately, PostgreSQL comes out of the box with a few tools that can help.

pg_stat_replication

pg_stat_replication is really only useful on a master database server. It shows the slaves connected to the master, along with some rather opaque information about ‘where it’s up to’. For example:

template1=# select * from pg_stat_replication;
  pid  | usesysid |  usename   | application_name | client_addr | client_hostname | client_port |         backend_start        |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+------------+------------------+-------------+-----------------+-------------+------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 30213 |   xxxxxx | replicator | walreceiver      | 10.10.0.10  |                 |       52889 | 2013-01-24 17:06:06.11849+00 | streaming | 9/42008AD0    | 9/42008AD0     | 9/42008AD0     | 9/42007FB0      |             0 | async
(1 row)

To a noob like me the _location fields are a bit opaque (apart from being hex and probably related to WAL segments), but Alexander Fortin’s post to the PostgreSQL mailing list helps make this a bit clearer.

I’ve actually converted that statement to not require his custom hex_to_int function, like so:

-- How far each slave's replay is behind the master, in bytes. The locations
-- are of the form 'X/Y' in hex: X is the xlog file id and Y the byte offset
-- within it; on 9.2 each xlog file id covers 255 usable 16MB segments
-- (255 * 16^6 bytes), which is where the multiplier comes from.
SELECT
    client_addr,
    sent_offset - (
        replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 ) AS byte_lag
FROM (
    SELECT
        client_addr,
        -- Convert the hex halves of sent_location and replay_location to bigint
        ('x' || lpad(split_part(sent_location,   '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
        ('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
        ('x' || lpad(split_part(sent_location,   '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
        ('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
    FROM pg_stat_replication
) AS s;

which gives an output like this:

 client_addr | byte_lag 
-------------+----------
 10.10.0.10  |        0

meaning that slave 10.10.0.10 is 0 bytes behind the master. In my experience it stays pretty low most of the time, but how big is too big? As long as it stays below the size of 1-2 WAL segments (use “SHOW wal_segment_size” to work out what your config is) then things are fine. If it falls further behind than that, you might want to find out what’s going on.
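To put a rough number on that, the default WAL segment size is 16MB, so a check along these lines will flag a slave worth looking at. This is just a sketch of my rule of thumb; the 32MB threshold is my own assumption, not a PostgreSQL setting:

-- Flag any slave that is more than two default-sized (16MB) WAL segments
-- behind. The 32MB threshold is a rule of thumb, not anything official.
SELECT
    client_addr,
    byte_lag,
    byte_lag > 2 * 16 * 1024 * 1024 AS lagging_badly
FROM (
    SELECT
        client_addr,
        ('x' || lpad(split_part(sent_location,   '/', 2), 8, '0'))::bit(32)::bigint
          - ('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint
          + ( ('x' || lpad(split_part(sent_location,   '/', 1), 8, '0'))::bit(32)::bigint
            - ('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint
            ) * 255 * 16 ^ 6 AS byte_lag
    FROM pg_stat_replication
) AS s;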

SELECT pg_last_xact_replay_timestamp()

There are a few functions that only work on the slave too, most notably pg_last_xact_replay_timestamp(), which shows the timestamp of the last transaction applied on the slave. With a reasonably busy server, using something like:

SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INTEGER AS lag_seconds;

is good enough to see how far behind the slave is. However, if there isn’t much activity, this number will keep increasing even though the slave is completely caught up, because now() gets further and further away from the timestamp of the last committed transaction. Instead, per this post on the mailing list, compare the location information too:

SELECT
    CASE
        WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
        ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
    END AS log_delay;
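If there’s any chance of the same check being pointed at the master by mistake, it can also be worth guarding on pg_is_in_recovery(). This variation is my own sketch rather than anything from the mailing list:

-- Returns NULL on a master (not in recovery); otherwise the replay delay in
-- seconds, treating a fully caught-up slave as 0.
SELECT
    CASE
        WHEN NOT pg_is_in_recovery() THEN NULL
        WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
        ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
    END AS log_delay;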

Finally, for anyone using ServerDensity, I updated reinbach’s plugin to include a simple count check of connected slaves on a master, and the slave lag function above.
See it here: https://github.com/dansketcher/sd-postgresql
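The slave count half of that check boils down to something as simple as this, with the alert threshold being however many slaves your setup expects (one, in our case):

-- On the master: how many slaves are currently streaming from us?
SELECT count(*) AS connected_slaves FROM pg_stat_replication;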


PostgreSQL on Windows Server 2008 problems

Having been running PostgreSQL 8.4 in production for a number of years (if it ain’t broke…), the time came to do an upgrade (8.4 on CentOS 5 was getting a bit dated). We decided, after some discussion, to take up our hosting provider on a managed HA Windows cluster with shared storage and a redundant slave replica. This was mainly to reduce our own internal server management overhead, as they would be taking care of all parts of the servers’ maintenance.

This all sounded great. After some tests at moderate load, we migrated the databases one by one via a piped pg_dump/pg_restore, which was simplicity itself apart from the outage. But then we started getting reports of slowness, which was worrying. I updated the application to log slow queries (easier to do at the application level so that they could go into Splunk) and saw SOME instances (not all) of queries that previously took ~3 seconds taking anywhere from 9 to 200(!) seconds. As you can imagine, this was very disconcerting.

Of course, the server had been configured beforehand using the Windows ‘best practice’ settings, which are mostly the same as on Linux but with smaller shared_buffers and effective_cache_size values, per the PostgreSQL Wiki page.
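For anyone wanting to sanity-check the same settings on their own server, they are easy to pull out of pg_settings. This only shows the current values; what they should be depends on your RAM and the wiki’s Windows advice:

-- Current values, units and origin of the settings the Windows tuning
-- advice is concerned with.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size');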

The primary approach we started with was to use EXPLAIN ANALYSE extensively, to work out whether it was just a case of the new server not being ‘run in’ yet and not having enough stats. Some solid gains were made, but the problem was not ‘average’ query time, it was ‘worst case’ query time. Some queries would run fine one day but terribly the next, and of course once the PostgreSQL cache was primed everything was as speedy as ever. Many indexes were changed and updated, the DB was VACUUM ANALYSEd, all the usual stuff. In a display of desperation, we even ‘defragged’ the disk (which would have felt very retro if it were not so unpleasant).
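For the record, there was nothing exotic about this: it was just a matter of running the slow statements back through the planner and comparing its estimates with reality. The table and filter below are made up purely for illustration:

-- Compare the planner's row estimates against actual row counts and timings.
-- Table and column names are illustrative, not from our schema.
EXPLAIN ANALYSE
SELECT *
FROM orders
WHERE customer_id = 12345
ORDER BY created_at DESC
LIMIT 50;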

At no point was RAM, CPU or Disk IO even stretched. The servers were effectively over-specced, as we ramped them up to see if that would help, and it did, but not to the extent needed in those worst-case queries.

Another very annoying thing was that the PostgreSQL server was apparently not able to maintain connections properly. Despite having more than enough headroom in the max_connections setting, we consistently saw these kinds of errors reported by the client (the error class names are from the ruby-pg client):

PGError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

PGError: no connection to the server

PG::Error: could not receive data from server: Connection timed out

The last of these errors was particularly nasty: usually the statement had actually executed successfully, but the client had no idea and no record of it, so while each DB stayed internally consistent, the inter-DB communication processes would get out of sync and have to be manually reconciled.
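Connection exhaustion itself was easy to rule out; a quick comparison like the one below shows how far under the ceiling a server is sitting. This is a sketch rather than something from our monitoring at the time:

-- Current connection count versus the configured ceiling.
SELECT
    (SELECT count(*) FROM pg_stat_activity) AS current_connections,
    current_setting('max_connections')::int AS max_connections;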

So, how did we solve these problems?

We didn’t. We moved everything off the cluster to a new Linux PostgreSQL 9.2 master/slave streaming replication setup. Many many thanks to Bartek Ciszkowski for his excellent writeup, which laid a very solid foundation for our new setup. Since the new DB servers went into production, all of the work we’ve done with indexes has meant the sites are performing faster than ever, and we’ve not seen one of the errors above either (touch wood!).

Incidentally, we use (async) streaming replication alongside a separate PITR recovery setup, which uses old-style backup commands rather than pg_basebackup, so that we can rsync-update the base backup regularly without having to store a new copy every time. Also note that, because of how replication is implemented, you can’t run these backup commands on the slave; they have to be run on the master, which is a shame. Still, we rsync to the slave for the PITR backup, so at least it’s not doing double disk IO.
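For anyone unfamiliar with the ‘old-style’ approach, it amounts to wrapping an rsync of the data directory in the low-level backup functions, roughly as below. The label and the rsync step are illustrative, and on 9.2 these functions have to be called on the master:

-- Run on the master; the second argument requests an immediate checkpoint.
SELECT pg_start_backup('nightly_base_backup', true);

-- ... rsync the data directory to the backup destination here ...

-- Finish the backup; with archiving enabled, pg_stop_backup waits for the
-- WAL needed to make the copy consistent to be archived.
SELECT pg_stop_backup();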

Useful links:
A PostgreSQL 9.2 Streaming Replication Primer
New Replication and Recovery Features in PostgreSQL 9.1
And of course, the PostgreSQL docs