 doc/src/sgml/backup.sgml            | 1384
 doc/src/sgml/high-availability.sgml | 1389
 2 files changed, 1389 insertions(+), 1384 deletions(-)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index f3888c6e60b..76151c1f6c5 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1,4 +1,4 @@ -<!-- $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.140 2010/02/03 17:25:05 momjian Exp $ --> +<!-- $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.141 2010/02/09 16:50:25 heikki Exp $ --> <chapter id="backup"> <title>Backup and Restore</title> @@ -1492,1388 +1492,6 @@ archive_command = 'local_backup_script.sh' </sect2> </sect1> - <sect1 id="warm-standby"> - <title>Warm Standby Servers for High Availability</title> - - <indexterm zone="backup"> - <primary>warm standby</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>PITR standby</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>standby server</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>log shipping</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>witness server</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>STONITH</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>high availability</primary> - </indexterm> - - <para> - Continuous archiving can be used to create a <firstterm>high - availability</> (HA) cluster configuration with one or more - <firstterm>standby servers</> ready to take over operations if the - primary server fails. This capability is widely referred to as - <firstterm>warm standby</> or <firstterm>log shipping</>. - </para> - - <para> - The primary and standby server work together to provide this capability, - though the servers are only loosely coupled. The primary server operates - in continuous archiving mode, while each standby server operates in - continuous recovery mode, reading the WAL files from the primary. No - changes to the database tables are required to enable this capability, - so it offers low administration overhead compared to some other - replication approaches. 
This configuration also has relatively low - performance impact on the primary server. - </para> - - <para> - Directly moving WAL records from one database server to another - is typically described as log shipping. <productname>PostgreSQL</> - implements file-based log shipping, which means that WAL records are - transferred one file (WAL segment) at a time. WAL files (16MB) can be - shipped easily and cheaply over any distance, whether it be to an - adjacent system, another system at the same site, or another system on - the far side of the globe. The bandwidth required for this technique - varies according to the transaction rate of the primary server. - Record-based log shipping is also possible with custom-developed - procedures, as discussed in <xref linkend="warm-standby-record">. - </para> - - <para> - It should be noted that the log shipping is asynchronous, i.e., the WAL - records are shipped after transaction commit. As a result there is a - window for data loss should the primary server suffer a catastrophic - failure: transactions not yet shipped will be lost. The length of the - window of data loss can be limited by use of the - <varname>archive_timeout</varname> parameter, which can be set as low - as a few seconds if required. However such a low setting will - substantially increase the bandwidth required for file shipping. - If you need a window of less than a minute or so, it's probably better - to consider record-based log shipping. - </para> - - <para> - The standby server is not available for access, since it is continually - performing recovery processing. Recovery performance is sufficiently - good that the standby will typically be only moments away from full - availability once it has been activated. As a result, we refer to this - capability as a warm standby configuration that offers high - availability. 
Restoring a server from an archived base backup and - rollforward will take considerably longer, so that technique only - offers a solution for disaster recovery, not high availability. - </para> - - <sect2 id="warm-standby-planning"> - <title>Planning</title> - - <para> - It is usually wise to create the primary and standby servers - so that they are as similar as possible, at least from the - perspective of the database server. In particular, the path names - associated with tablespaces will be passed across unmodified, so both - primary and standby servers must have the same mount paths for - tablespaces if that feature is used. Keep in mind that if - <xref linkend="sql-createtablespace" endterm="sql-createtablespace-title"> - is executed on the primary, any new mount point needed for it must - be created on the primary and all standby servers before the command - is executed. Hardware need not be exactly the same, but experience shows - that maintaining two identical systems is easier than maintaining two - dissimilar ones over the lifetime of the application and system. - In any case the hardware architecture must be the same — shipping - from, say, a 32-bit to a 64-bit system will not work. - </para> - - <para> - In general, log shipping between servers running different major - <productname>PostgreSQL</> release - levels is not possible. It is the policy of the PostgreSQL Global - Development Group not to make changes to disk formats during minor release - upgrades, so it is likely that running different minor release levels - on primary and standby servers will work successfully. However, no - formal support for that is offered and you are advised to keep primary - and standby servers at the same release level as much as possible. - When updating to a new minor release, the safest policy is to update - the standby servers first — a new minor release is more likely - to be able to read WAL files from a previous minor release than vice - versa. 
- </para> - - <para> - There is no special mode required to enable a standby server. The - operations that occur on both primary and standby servers are - normal continuous archiving and recovery tasks. The only point of - contact between the two database servers is the archive of WAL files - that both share: primary writing to the archive, standby reading from - the archive. Care must be taken to ensure that WAL archives from separate - primary servers do not become mixed together or confused. The archive - need not be large if it is only required for standby operation. - </para> - - <para> - The magic that makes the two loosely coupled servers work together is - simply a <varname>restore_command</> used on the standby that, - when asked for the next WAL file, waits for it to become available from - the primary. The <varname>restore_command</> is specified in the - <filename>recovery.conf</> file on the standby server. Normal recovery - processing would request a file from the WAL archive, reporting failure - if the file was unavailable. For standby processing it is normal for - the next WAL file to be unavailable, so we must be patient and wait for - it to appear. For files ending in <literal>.backup</> or - <literal>.history</> there is no need to wait, and a non-zero return - code must be returned. A waiting <varname>restore_command</> can be - written as a custom script that loops after polling for the existence of - the next WAL file. There must also be some way to trigger failover, which - should interrupt the <varname>restore_command</>, break the loop and - return a file-not-found error to the standby server. This ends recovery - and the standby will then come up as a normal server. 
- </para> - - <para> - Pseudocode for a suitable <varname>restore_command</> is: -<programlisting> -triggered = false; -while (!NextWALFileReady() && !triggered) -{ - sleep(100000L); /* wait for ~0.1 sec */ - if (CheckForExternalTrigger()) - triggered = true; -} -if (!triggered) - CopyWALFileForRecovery(); -</programlisting> - </para> - - <para> - A working example of a waiting <varname>restore_command</> is provided - as a <filename>contrib</> module named <application>pg_standby</>. It - should be used as a reference on how to correctly implement the logic - described above. It can also be extended as needed to support specific - configurations and environments. - </para> - - <para> - <productname>PostgreSQL</productname> does not provide the system - software required to identify a failure on the primary and notify - the standby database server. Many such tools exist and are well - integrated with the operating system facilities required for - successful failover, such as IP address migration. - </para> - - <para> - The method for triggering failover is an important part of planning - and design. One potential option is the <varname>restore_command</> - command. It is executed once for each WAL file, but the process - running the <varname>restore_command</> is created and dies for - each file, so there is no daemon or server process, and we cannot - use signals or a signal handler. Therefore, the - <varname>restore_command</> is not suitable to trigger failover. - It is possible to use a simple timeout facility, especially if - used in conjunction with a known <varname>archive_timeout</> - setting on the primary. However, this is somewhat error prone - since a network problem or busy primary server might be sufficient - to initiate failover. A notification mechanism such as the explicit - creation of a trigger file is ideal, if this can be arranged. 
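The pseudocode above can be made concrete as a small shell function. This is only an illustrative sketch (the function name, paths, and trigger convention are invented here, not part of <productname>PostgreSQL</>); the <filename>contrib</> module <application>pg_standby</> is the reference implementation of this logic.

```shell
# Hypothetical waiting restore_command, following the pseudocode above.
# Arguments: archive directory, WAL file name, destination path, trigger file.
wait_restore() {
    archive="$1"; walfile="$2"; dest="$3"; trigger="$4"
    case "$walfile" in
        *.backup|*.history)
            # No waiting for these; return non-zero immediately if absent.
            cp "$archive/$walfile" "$dest" 2>/dev/null; return $? ;;
    esac
    while [ ! -f "$archive/$walfile" ]; do
        # The trigger file breaks the loop with a file-not-found result,
        # which ends recovery and brings the standby up as a normal server.
        [ -f "$trigger" ] && return 1
        sleep 0.1                      # poll roughly every 0.1 sec
    done
    cp "$archive/$walfile" "$dest"
}

# Demo: the segment is already archived, so the function returns at once.
mkdir -p /tmp/wal_archive
printf 'dummy' > /tmp/wal_archive/000000010000000000000001
wait_restore /tmp/wal_archive 000000010000000000000001 /tmp/restored_seg /tmp/no_trigger
echo "restore status: $?"
```

In <filename>recovery.conf</> such a script would be invoked with the <literal>%f</> and <literal>%p</> placeholders supplying the second and third arguments.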
- </para> - - <para> - The size of the WAL archive can be minimized by using the <literal>%r</> - option of the <varname>restore_command</>. This option specifies the - last archive file name that needs to be kept to allow the recovery to - restart correctly. This can be used to truncate the archive once - files are no longer required, assuming the archive is writable from the - standby server. - </para> - </sect2> - - <sect2 id="warm-standby-config"> - <title>Implementation</title> - - <para> - The short procedure for configuring a standby server is as follows. For - full details of each step, refer to previous sections as noted. - <orderedlist> - <listitem> - <para> - Set up primary and standby systems as nearly identical as - possible, including two identical copies of - <productname>PostgreSQL</> at the same release level. - </para> - </listitem> - <listitem> - <para> - Set up continuous archiving from the primary to a WAL archive - directory on the standby server. Ensure that - <xref linkend="guc-archive-mode">, - <xref linkend="guc-archive-command"> and - <xref linkend="guc-archive-timeout"> - are set appropriately on the primary - (see <xref linkend="backup-archiving-wal">). - </para> - </listitem> - <listitem> - <para> - Make a base backup of the primary server (see <xref - linkend="backup-base-backup">), and load this data onto the standby. - </para> - </listitem> - <listitem> - <para> - Begin recovery on the standby server from the local WAL - archive, using a <filename>recovery.conf</> that specifies a - <varname>restore_command</> that waits as described - previously (see <xref linkend="backup-pitr-recovery">). - </para> - </listitem> - </orderedlist> - </para> - - <para> - Recovery treats the WAL archive as read-only, so once a WAL file has - been copied to the standby system it can be copied to tape at the same - time as it is being read by the standby database server. 
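For step 4 of the procedure above, a <filename>recovery.conf</> that uses the <filename>contrib</> module <application>pg_standby</> might look like the following; the archive, trigger, and log paths are illustrative only.

```
# recovery.conf on the standby (illustrative paths)
# pg_standby sleeps between checks (-s), stops waiting when the trigger
# file (-t) appears, and uses %r to trim archive files that are no longer
# needed for a recovery restart.
restore_command = 'pg_standby -d -s 2 -t /tmp/pgsql.trigger /mnt/server/archive %f %p %r 2>> standby.log'
```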
- Thus, running a standby server for high availability can be performed at - the same time as files are stored for longer-term disaster recovery - purposes. - </para> - - <para> - For testing purposes, it is possible to run both primary and standby - servers on the same system. This does not provide any worthwhile - improvement in server robustness, nor would it be described as HA. - </para> - </sect2> - - <sect2 id="warm-standby-failover"> - <title>Failover</title> - - <para> - If the primary server fails then the standby server should begin - failover procedures. - </para> - - <para> - If the standby server fails then no failover need take place. If the - standby server can be restarted, even some time later, then the recovery - process can also be immediately restarted, taking advantage of - restartable recovery. If the standby server cannot be restarted, then a - full new standby server instance should be created. - </para> - - <para> - If the primary server fails and the standby server becomes the - new primary, and then the old primary restarts, you must have - a mechanism for informing the old primary that it is no longer the primary. This is - sometimes known as STONITH (Shoot The Other Node In The Head), which is - necessary to avoid situations where both systems think they are the - primary, which will lead to confusion and ultimately data loss. - </para> - - <para> - Many failover systems use just two systems, the primary and the standby, - connected by some kind of heartbeat mechanism to continually verify the - connectivity between the two and the viability of the primary. It is - also possible to use a third system (called a witness server) to prevent - some cases of inappropriate failover, but the additional complexity - might not be worthwhile unless it is set up with sufficient care and - rigorous testing. - </para> - - <para> - Once failover to the standby occurs, we have only a - single server in operation. This is known as a degenerate state.
- The former standby is now the primary, but the former primary is down - and might stay down. To return to normal operation we must - fully recreate a standby server, - either on the former primary system when it comes up, or on a third, - possibly new, system. Once complete the primary and standby can be - considered to have switched roles. Some people choose to use a third - server to provide backup for the new primary until the new standby - server is recreated, - though clearly this complicates the system configuration and - operational processes. - </para> - - <para> - So, switching from primary to standby server can be fast but requires - some time to re-prepare the failover cluster. Regular switching from - primary to standby is useful, since it allows regular downtime on - each system for maintenance. This also serves as a test of the - failover mechanism to ensure that it will really work when you need it. - Written administration procedures are advised. - </para> - </sect2> - - <sect2 id="warm-standby-record"> - <title>Record-based Log Shipping</title> - - <para> - <productname>PostgreSQL</productname> directly supports file-based - log shipping as described above. It is also possible to implement - record-based log shipping, though this requires custom development. - </para> - - <para> - An external program can call the <function>pg_xlogfile_name_offset()</> - function (see <xref linkend="functions-admin">) - to find out the file name and the exact byte offset within it of - the current end of WAL. It can then access the WAL file directly - and copy the data from the last known end of WAL through the current end - over to the standby servers. With this approach, the window for data - loss is the polling cycle time of the copying program, which can be very - small, and there is no wasted bandwidth from forcing partially-used - segment files to be archived. 
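The file-name and offset bookkeeping behind <function>pg_xlogfile_name_offset()</> can be sketched as follows. This is an illustrative calculation only, assuming the default 16MB WAL segment size; the timeline and WAL location values are made up, not output from a real server.

```shell
# Compute the WAL segment file name and byte offset for a WAL location,
# assuming 16MB (0x1000000) segments. Values are illustrative.
tli=1                                  # timeline ID
loc="0/12345678"                       # WAL location as "xlogid/position"
xlogid=$(( 0x0${loc%%/*} ))            # part before the slash
bytepos=$(( 0x0${loc##*/} ))           # byte position within that xlogid
seg=$(( bytepos / 16777216 ))          # which 16MB segment file
off=$(( bytepos % 16777216 ))          # byte offset inside that segment
printf '%08X%08X%08X\n' "$tli" "$xlogid" "$seg"   # segment file name
printf 'offset: %u\n' "$off"
```

A copying program polls this position, then copies the bytes between the last known end of WAL and the current end into partial copies of the segment files on the standby.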
Note that the standby servers' - <varname>restore_command</> scripts can only deal with whole WAL files, - so the incrementally copied data is not ordinarily made available to - the standby servers. It is of use only when the primary dies — - then the last partial WAL file is fed to the standby before allowing - it to come up. The correct implementation of this process requires - cooperation of the <varname>restore_command</> script with the data - copying program. - </para> - - <para> - Starting with <productname>PostgreSQL</> version 8.5, you can use - streaming replication (see <xref linkend="streaming-replication">) to - achieve the same with less effort. - </para> - </sect2> - - <sect2 id="streaming-replication"> - <title>Streaming Replication</title> - - <para> - <productname>PostgreSQL</> includes a simple streaming replication - mechanism, which lets the standby server stay more up-to-date than - file-based replication allows. The standby connects to the primary - and the primary starts streaming WAL records from where the standby - left off, and continues streaming them as they are generated, without - waiting for the WAL file to be filled. So with streaming replication, - <varname>archive_timeout</> does not need to be configured. - </para> - - <para> - Streaming replication relies on file-based continuous archiving for - making the base backup and for allowing a standby to catch up if it's - disconnected from the primary for long enough for the primary to - delete old WAL files still required by the standby. - </para> - - <sect3 id="streaming-replication-setup"> - <title>Setup</title> - <para> - The short procedure for configuring streaming replication is as follows. - For full details of each step, refer to other sections as noted. - <orderedlist> - <listitem> - <para> - Set up primary and standby systems as nearly identical as possible, - including two identical copies of <productname>PostgreSQL</> at the - same release level.
- </para> - </listitem> - <listitem> - <para> - Set up continuous archiving from the primary to a WAL archive located - in a directory on the standby server. Ensure that - <xref linkend="guc-archive-mode">, - <xref linkend="guc-archive-command"> and - <xref linkend="guc-archive-timeout"> - are set appropriately on the primary - (see <xref linkend="backup-archiving-wal">). - </para> - </listitem> - - <listitem> - <para> - Set up connections and authentication so that the standby server can - successfully connect to the pseudo <literal>replication</> database of - the primary server (see - <xref linkend="streaming-replication-authentication">). Ensure that - <xref linkend="guc-listen-addresses"> and <filename>pg_hba.conf</> are - configured appropriately on the primary. - </para> - <para> - On systems that support the keepalive socket option, setting - <xref linkend="guc-tcp-keepalives-idle">, - <xref linkend="guc-tcp-keepalives-interval"> and - <xref linkend="guc-tcp-keepalives-count"> helps you detect - replication problems (e.g., a network outage or a failure of - the standby server) as soon as possible. - </para> - </listitem> - <listitem> - <para> - Set the maximum number of concurrent connections from the standby servers - (see <xref linkend="guc-max-wal-senders"> for details). - </para> - </listitem> - <listitem> - <para> - Enable WAL archiving in the primary server because we need to make a base - backup of it later (see <xref linkend="guc-archive-mode"> and - <xref linkend="guc-archive-command"> for details). - </para> - </listitem> - <listitem> - <para> - Start the <productname>PostgreSQL</> server on the primary. - </para> - </listitem> - <listitem> - <para> - Make a base backup of the primary server (see - <xref linkend="backup-base-backup">), and load this data onto the - standby.
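The keepalive parameters mentioned in step 3 above might be set like this in <filename>postgresql.conf</> on the primary; the specific values are only illustrative, and sensible settings depend on the network.

```
# postgresql.conf on the primary (illustrative values)
tcp_keepalives_idle = 60        # seconds of inactivity before the first probe
tcp_keepalives_interval = 10    # seconds between probes
tcp_keepalives_count = 5        # lost probes before the connection is dropped
```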
Note that all files present in <filename>pg_xlog</> - and <filename>pg_xlog/archive_status</> on the <emphasis>standby</> - server should be removed because they might be obsolete. - </para> - </listitem> - <listitem> - <para> - Set up WAL archiving, connections and authentication like the primary - server, because the standby server might work as a primary server after - failover. Ensure that your settings are consistent with the - <emphasis>future</> environment after the primary and the standby - server are interchanged by failover. If you're setting up the standby - server, e.g., for reporting purposes, with no plans to fail over to it, - configure the standby accordingly. - </para> - </listitem> - <listitem> - <para> - Create a recovery command file <filename>recovery.conf</> in the data - directory on the standby server. - </para> - - <variablelist id="replication-config-settings" xreflabel="Replication Settings"> - <varlistentry id="standby-mode" xreflabel="standby_mode"> - <term><varname>standby_mode</varname> (<type>boolean</type>)</term> - <listitem> - <para> - Specifies whether to start the <productname>PostgreSQL</> server as - a standby. If this parameter is <literal>on</>, streaming - replication is enabled and the standby server will try to connect - to the primary to receive and apply WAL records continuously. The - default is <literal>off</>, which allows only archive recovery - without replication. So, streaming replication requires this - parameter to be explicitly set to <literal>on</>. - </para> - </listitem> - </varlistentry> - <varlistentry id="primary-conninfo" xreflabel="primary_conninfo"> - <term><varname>primary_conninfo</varname> (<type>string</type>)</term> - <listitem> - <para> - Specifies a connection string that the standby server uses - to connect to the primary. This string is in the same format as - described in <xref linkend="libpq-connect">.
If any option is - unspecified in this string, then the corresponding environment - variable (see <xref linkend="libpq-envars">) is checked. If the - environment variable is not set either, then the indicated built-in - defaults are used. - </para> - <para> - The built-in replication requires that the host name (or host address) - and the port number on which the primary server listens be - specified in this string. Also ensure that a role with - the <literal>SUPERUSER</> and <literal>LOGIN</> privileges exists on the - primary (see - <xref linkend="streaming-replication-authentication">). Note that - the password needs to be set if the primary demands password - authentication. - </para> - </listitem> - </varlistentry> - <varlistentry id="trigger-file" xreflabel="trigger_file"> - <term><varname>trigger_file</varname> (<type>string</type>)</term> - <listitem> - <para> - Specifies a trigger file whose presence activates the standby. - If no trigger file is specified, the standby never exits - recovery. - </para> - </listitem> - </varlistentry> - </variablelist> - </listitem> - <listitem> - <para> - Start the <productname>PostgreSQL</> server on the standby. The standby - server will go into recovery mode and proceed to receive WAL records - from the primary and apply them continuously. - </para> - </listitem> - </orderedlist> - </para> - </sect3> - <sect3 id="streaming-replication-authentication"> - <title>Authentication</title> - <para> - It is very important that the access privileges for replication be set - properly so that only trusted users can read the WAL stream, because it is - easy to extract sensitive information from it. - </para> - <para> - Only a superuser is allowed to connect to the primary as the replication - standby. So a role with the <literal>SUPERUSER</> and <literal>LOGIN</> - privileges needs to be created in the primary.
- </para> - <para> - Client authentication for replication is controlled by the - <filename>pg_hba.conf</> record specifying <literal>replication</> in the - <replaceable>database</> field. For example, if the standby is running on - host IP <literal>192.168.1.100</> and the superuser's name for replication - is <literal>foo</>, the administrator can add the following line to the - <filename>pg_hba.conf</> file on the primary. - -<programlisting> -# Allow the user "foo" from host 192.168.1.100 to connect to the primary -# as a replication standby if the user's password is correctly supplied. -# -# TYPE DATABASE USER CIDR-ADDRESS METHOD -host replication foo 192.168.1.100/32 md5 -</programlisting> - </para> - <para> - The host name and port number of the primary, user name to connect as, - and password are specified in the <filename>recovery.conf</> file or - the corresponding environment variable on the standby. - For example, if the primary is running on host IP <literal>192.168.1.50</>, - port <literal>5432</literal>, the superuser's name for replication is - <literal>foo</>, and the password is <literal>foopass</>, the administrator - can add the following line to the <filename>recovery.conf</> file on the - standby. - -<programlisting> -# The standby connects to the primary that is running on host 192.168.1.50 -# and port 5432 as the user "foo" whose password is "foopass". 
-primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' -</programlisting> - </para> - </sect3> - </sect2> - - <sect2 id="backup-incremental-updated"> - <title>Incrementally Updated Backups</title> - - <indexterm zone="backup"> - <primary>incrementally updated backups</primary> - </indexterm> - - <indexterm zone="backup"> - <primary>change accumulation</primary> - </indexterm> - - <para> - In a warm standby configuration, it is possible to offload the expense of - taking periodic base backups from the primary server; instead base backups - can be made by backing - up a standby server's files. This concept is generally known as - incrementally updated backups, log change accumulation, or more simply, - change accumulation. - </para> - - <para> - If we take a file system backup of the standby server's data - directory while it is processing - logs shipped from the primary, we will be able to reload that backup and - restart the standby's recovery process from the last restart point. - We no longer need to keep WAL files from before the standby's restart point. - If we need to recover, it will be faster to recover from the incrementally - updated backup than from the original base backup. - </para> - - <para> - Since the standby server is not <quote>live</>, it is not possible to - use <function>pg_start_backup()</> and <function>pg_stop_backup()</> - to manage the backup process; it will be up to you to determine how - far back you need to keep WAL segment files to have a recoverable - backup. You can do this by running <application>pg_controldata</> - on the standby server to inspect the control file and determine the - current checkpoint WAL location, or by using the - <varname>log_checkpoints</> option to print values to the standby's - server log. 
- </para> - </sect2> - </sect1> - - <sect1 id="hot-standby"> - <title>Hot Standby</title> - - <indexterm zone="backup"> - <primary>Hot Standby</primary> - </indexterm> - - <para> - Hot Standby is the term used to describe the ability to connect to - the server and run queries while the server is in archive recovery. This - is useful for both log shipping replication and for restoring a backup - to an exact state with great precision. - The term Hot Standby also refers to the ability of the server to move - from recovery through to normal running while users continue running - queries and/or continue their connections. - </para> - - <para> - Running queries in recovery is in many ways the same as normal running - though there are a large number of usage and administrative points - to note. - </para> - - <sect2 id="hot-standby-users"> - <title>User's Overview</title> - - <para> - Users can connect to the database while the server is in recovery - and perform read-only queries. Read-only access to catalogs and views - will also occur as normal. - </para> - - <para> - The data on the standby takes some time to arrive from the primary server - so there will be a measurable delay between primary and standby. Running the - same query nearly simultaneously on both primary and standby might therefore - return differing results. We say that data on the standby is eventually - consistent with the primary. - Queries executed on the standby will be correct with regard to the transactions - that had been recovered at the start of the query, or start of first statement, - in the case of serializable transactions. In comparison with the primary, - the standby returns query results that could have been obtained on the primary - at some exact moment in the past. 
- </para> - - <para> - When a transaction is started in recovery, the parameter - <varname>transaction_read_only</> will be forced to be true, regardless of the - <varname>default_transaction_read_only</> setting in <filename>postgresql.conf</>. - It can't be manually set to false either. As a result, all transactions - started during recovery will be limited to read-only actions only. In all - other ways, connected sessions will appear identical to sessions - initiated during normal processing mode. There are no special commands - required to initiate a connection at this time, so all interfaces - work normally without change. After recovery finishes, the session - will allow normal read-write transactions at the start of the next - transaction, if these are requested. - </para> - - <para> - Read-only here means "no writes to the permanent database tables". - There are no problems with queries that make use of transient sort and - work files. - </para> - - <para> - The following actions are allowed - - <itemizedlist> - <listitem> - <para> - Query access - SELECT, COPY TO including views and SELECT RULEs - </para> - </listitem> - <listitem> - <para> - Cursor commands - DECLARE, FETCH, CLOSE, - </para> - </listitem> - <listitem> - <para> - Parameters - SHOW, SET, RESET - </para> - </listitem> - <listitem> - <para> - Transaction management commands - <itemizedlist> - <listitem> - <para> - BEGIN, END, ABORT, START TRANSACTION - </para> - </listitem> - <listitem> - <para> - SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT - </para> - </listitem> - <listitem> - <para> - EXCEPTION blocks and other internal subtransactions - </para> - </listitem> - </itemizedlist> - </para> - </listitem> - <listitem> - <para> - LOCK TABLE, though only when explicitly in one of these modes: - ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE. 
- </para> - </listitem> - <listitem> - <para> - Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD - </para> - </listitem> - <listitem> - <para> - Plugins and extensions - LOAD - </para> - </listitem> - </itemizedlist> - </para> - - <para> - These actions produce error messages - - <itemizedlist> - <listitem> - <para> - Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE. - Note that there are no allowed actions that result in a trigger - being executed during recovery. - </para> - </listitem> - <listitem> - <para> - Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT. - This currently also applies to temporary tables, because their - definition causes writes to catalog tables. - </para> - </listitem> - <listitem> - <para> - SELECT ... FOR SHARE | UPDATE which cause row locks to be written - </para> - </listitem> - <listitem> - <para> - RULEs on SELECT statements that generate DML commands. - </para> - </listitem> - <listitem> - <para> - LOCK TABLE, in short default form, since it requests ACCESS EXCLUSIVE MODE. - LOCK TABLE that explicitly requests a mode higher than ROW EXCLUSIVE MODE. - </para> - </listitem> - <listitem> - <para> - Transaction management commands that explicitly set non-read-only state - <itemizedlist> - <listitem> - <para> - BEGIN READ WRITE, - START TRANSACTION READ WRITE - </para> - </listitem> - <listitem> - <para> - SET TRANSACTION READ WRITE, - SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE - </para> - </listitem> - <listitem> - <para> - SET transaction_read_only = off - </para> - </listitem> - </itemizedlist> - </para> - </listitem> - <listitem> - <para> - Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED, - ROLLBACK PREPARED because even read-only transactions need to write - WAL in the prepare phase (the first phase of two-phase commit).
- </para> - </listitem> - <listitem> - <para> - sequence update - nextval() - </para> - </listitem> - <listitem> - <para> - LISTEN, UNLISTEN, NOTIFY since they currently write to system tables - </para> - </listitem> - </itemizedlist> - </para> - - <para> - Note that current behaviour of read only transactions when not in - recovery is to allow the last two actions, so there are small and - subtle differences in behaviour between read-only transactions - run on standby and during normal running. - It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and - temporary tables may be lifted in a future release, if their internal - implementation is altered to make this possible. - </para> - - <para> - If failover or switchover occurs the database will switch to normal - processing mode. Sessions will remain connected while the server - changes mode. Current transactions will continue, though will remain - read-only. After recovery is complete, it will be possible to initiate - read-write transactions. - </para> - - <para> - Users will be able to tell whether their session is read-only by - issuing SHOW transaction_read_only. In addition a set of - functions <xref linkend="functions-recovery-info-table"> allow users to - access information about Hot Standby. These allow you to write - functions that are aware of the current state of the database. These - can be used to monitor the progress of recovery, or to allow you to - write complex programs that restore the database to particular states. - </para> - - <para> - In recovery, transactions will not be permitted to take any table lock - higher than RowExclusiveLock. In addition, transactions may never assign - a TransactionId and may never write WAL. - Any <command>LOCK TABLE</> command that runs on the standby and requests - a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected. 
-  </para>
-
-  <para>
-   In general, queries will not experience lock conflicts with the database
-   changes made by recovery. This is because recovery follows normal
-   concurrency control mechanisms, known as <acronym>MVCC</>. There are
-   some types of change that will cause conflicts, covered in the following
-   section.
-  </para>
-  </sect2>
-
-  <sect2 id="hot-standby-conflict">
-   <title>Handling Query Conflicts</title>
-
-  <para>
-   The primary and standby nodes are in many ways loosely connected. Actions
-   on the primary will have an effect on the standby. As a result, there is
-   potential for negative interactions or conflicts between them. The easiest
-   conflict to understand is performance: if a huge data load is taking place
-   on the primary then this will generate a similar stream of WAL records on the
-   standby, so standby queries may contend for system resources, such as I/O.
-  </para>
-
-  <para>
-   There are also additional types of conflict that can occur with Hot Standby.
-   These conflicts are <emphasis>hard conflicts</> in the sense that we may
-   need to cancel queries and in some cases disconnect sessions to resolve them.
-   The user is provided with a number of optional ways to handle these
-   conflicts, though we must first understand the possible reasons behind a conflict.
-
-   <itemizedlist>
-    <listitem>
-     <para>
-      Access Exclusive Locks from the primary node, including both explicit
-      LOCK commands and various kinds of DDL action
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Dropping tablespaces on the primary while standby queries are using
-      those tablespaces for temporary work files (work_mem overflow)
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Dropping databases on the primary while users are connected to that
-      database on the standby.
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Waiting to acquire buffer cleanup locks
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Early cleanup of data still visible to the current query's snapshot
-     </para>
-    </listitem>
-   </itemizedlist>
-  </para>
-
-  <para>
-   Some WAL redo actions will be for DDL actions. These DDL actions are
-   repeats of actions that have already committed on the primary node, so
-   they must not fail on the standby node. These DDL locks take priority
-   and will automatically <emphasis>cancel</> any read-only transactions that get in
-   their way, after a grace period. This is similar to the possibility of
-   being canceled by the deadlock detector, but in this case the standby
-   process always wins, since the replayed actions must not fail. This
-   also ensures that replication doesn't fall behind while we wait for a
-   query to complete. Again, we assume that the standby is there
-   primarily for high availability purposes.
-  </para>
-
-  <para>
-   An example of the above would be an administrator on the primary server
-   running <command>DROP TABLE</> on a table that is currently being queried
-   on the standby server.
-   Clearly the query cannot continue if we let the <command>DROP TABLE</>
-   proceed. If this situation occurred on the primary, the <command>DROP TABLE</>
-   would wait until the query had finished. But when the query is on the standby
-   and the <command>DROP TABLE</> is on the primary, the primary doesn't have
-   information about which queries are running on the standby, and so the
-   <command>DROP TABLE</> does not wait. The WAL change records come through to the
-   standby while the standby query is still running, causing a conflict.
-  </para>
-
-  <para>
-   The most common reason for conflict between standby queries and WAL redo is
-   "early cleanup". Normally, <productname>PostgreSQL</> allows cleanup of old
-   row versions when there are no users who may need to see them, to ensure correct
-   visibility of data (the heart of MVCC).
If there is a standby query that has
-   been running for longer than any query on the primary, then it is possible
-   for old row versions to be removed by either a vacuum or HOT. This will
-   then generate WAL records that, if applied, would remove data on the
-   standby that might <emphasis>potentially</> be required by the standby query.
-   In more technical language, the primary's xmin horizon is later than
-   the standby's xmin horizon, allowing dead rows to be removed.
-  </para>
-
-  <para>
-   Experienced users should note that both row version cleanup and row version
-   freezing will potentially conflict with recovery queries. Running a
-   manual <command>VACUUM FREEZE</> is likely to cause conflicts even on tables
-   with no updated or deleted rows.
-  </para>
-
-  <para>
-   We have a number of choices for resolving query conflicts. The default
-   is that we wait and hope the query completes. The server will wait
-   automatically until the lag between primary and standby is at most
-   <varname>max_standby_delay</> seconds. Once that grace period expires,
-   we take one of the following actions:
-
-   <itemizedlist>
-    <listitem>
-     <para>
-      If the conflict is caused by a lock, we cancel the conflicting standby
-      transaction immediately. If the transaction is idle in transaction,
-      then currently we abort the session instead, though this may change
-      in the future.
-     </para>
-    </listitem>
-
-    <listitem>
-     <para>
-      If the conflict is caused by cleanup records, we tell the standby query
-      that a conflict has occurred and that it must cancel itself to avoid the
-      risk that it silently fails to read relevant data because
-      that data has been removed. (This is regrettably very similar to the
-      much feared and iconic error message "snapshot too old".) Some cleanup
-      records only cause conflicts with older queries, while some types of
-      cleanup record affect all queries.
-     </para>
-
-     <para>
-      If cancellation does occur, the query and/or transaction can always
-      be re-executed.
The error is dynamic and will not necessarily occur
-   the same way if the query is executed again.
-     </para>
-    </listitem>
-   </itemizedlist>
-  </para>
-
-  <para>
-   <varname>max_standby_delay</> is set in <filename>postgresql.conf</>.
-   The parameter applies to the server as a whole, so if the delay is all used
-   up by a single query then there may be little or no waiting for queries that
-   follow immediately, though they will have benefited equally from the initial
-   waiting period. The server may take time to catch up again before the grace
-   period is available again, and if there is a heavy and constant stream
-   of conflicts it may seldom catch up fully.
-  </para>
-
-  <para>
-   Users should be clear that tables that are regularly and heavily updated on the
-   primary server will quickly cause cancellation of longer running queries on
-   the standby. In those cases <varname>max_standby_delay</> can be
-   considered somewhat, but not exactly, the same as setting
-   <varname>statement_timeout</>.
-  </para>
-
-  <para>
-   Other remedial actions exist if the number of cancellations is unacceptable.
-   The first option is to connect to the primary server and keep a query active
-   for as long as we need to run queries on the standby. This guarantees that
-   a WAL cleanup record is never generated and we never get the query
-   conflicts described above. This could be done using contrib/dblink
-   and pg_sleep(), or via other mechanisms. If you do this, you should note
-   that it will delay cleanup of dead rows by vacuum or HOT, which many
-   people may find undesirable. However, remember that the
-   primary and standby nodes are linked via the WAL, so this situation is no
-   different from the case where we ran the query on the primary node itself,
-   except that we have the benefit of off-loading the execution onto the standby.
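As an illustrative sketch only (connection string, connection name and duration are hypothetical), the contrib/dblink approach mentioned above might look like this, run from any client with dblink installed:

```sql
-- Hold a query open on the primary for an hour; the active query's
-- snapshot keeps the primary's xmin horizon back, so vacuum/HOT cannot
-- remove row versions the standby query might still need.
SELECT dblink_connect('pin_xmin', 'host=primary dbname=mydb');
SELECT * FROM dblink('pin_xmin', 'SELECT pg_sleep(3600)') AS t(x text);
SELECT dblink_disconnect('pin_xmin');
```

The cost of this approach, as noted above, is that dead-row cleanup on the primary is delayed for the full hour.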
- </para> - - <para> - It is also possible to set <varname>vacuum_defer_cleanup_age</> on the primary - to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow - more time for queries to execute before they are cancelled on the standby, - without the need for setting a high <varname>max_standby_delay</>. - </para> - - <para> - Three-way deadlocks are possible between AccessExclusiveLocks arriving from - the primary, cleanup WAL records that require buffer cleanup locks and - user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks - are resolved by time-out when we exceed <varname>max_standby_delay</>. - </para> - - <para> - Dropping tablespaces or databases is discussed in the administrator's - section since they are not typical user situations. - </para> - </sect2> - - <sect2 id="hot-standby-admin"> - <title>Administrator's Overview</title> - - <para> - If there is a <filename>recovery.conf</> file present the server will start - in Hot Standby mode by default, though <varname>recovery_connections</> can - be disabled via <filename>postgresql.conf</>, if required. The server may take - some time to enable recovery connections since the server must first complete - sufficient recovery to provide a consistent state against which queries - can run before enabling read only connections. Look for these messages - in the server logs - -<programlisting> -LOG: initializing recovery connections - -... then some time later ... - -LOG: consistent recovery state reached -LOG: database system is ready to accept read only connections -</programlisting> - - Consistency information is recorded once per checkpoint on the primary, as long - as <varname>recovery_connections</> is enabled (on the primary). If this parameter - is disabled, it will not be possible to enable recovery connections on the standby. 
-   The consistent state can also be delayed in the presence of both of these
-   conditions:
-
-   <itemizedlist>
-    <listitem>
-     <para>
-      a write transaction has more than 64 subtransactions
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      very long-lived write transactions
-     </para>
-    </listitem>
-   </itemizedlist>
-
-   If you are running file-based log shipping ("warm standby"), you may need
-   to wait until the next WAL file arrives, which could be as long as the
-   <varname>archive_timeout</> setting on the primary.
-  </para>
-
-  <para>
-   The setting of some parameters on the standby will need reconfiguration
-   if they have been changed on the primary. The value on the standby must
-   be equal to or greater than the value on the primary. If these parameters
-   are not set high enough then the standby will not be able to track work
-   correctly from recovering transactions. If these values are set too low,
-   the server will halt. Higher values can then be supplied and the server
-   restarted to begin recovery again.
-
-   <itemizedlist>
-    <listitem>
-     <para>
-      <varname>max_connections</>
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      <varname>max_prepared_transactions</>
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      <varname>max_locks_per_transaction</>
-     </para>
-    </listitem>
-   </itemizedlist>
-  </para>
-
-  <para>
-   It is important that the administrator consider the appropriate setting
-   of <varname>max_standby_delay</>, set in <filename>postgresql.conf</>.
-   There is no optimal setting, so it should be chosen according to business
-   priorities. For example, if the server is primarily tasked as a High
-   Availability server, then you may wish to lower
-   <varname>max_standby_delay</> or even set it to zero, though that is a
-   very aggressive setting. If the standby server is tasked as an additional
-   server for decision support queries then it may be acceptable to set this
-   to a value of many hours (in seconds).
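Pulling the settings discussed above together, a standby-side configuration sketch might look like this (all values are purely illustrative, not recommendations):

```
# postgresql.conf on the standby (values illustrative)
recovery_connections = on          # must also be enabled on the primary
max_standby_delay = 30             # grace period in seconds; 0 favours availability,
                                   # a large value favours long reporting queries

# These must be set equal to or greater than the primary's values:
max_connections = 100
max_prepared_transactions = 0
max_locks_per_transaction = 64
```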
-  </para>
-
-  <para>
-   Transaction status "hint bits" written on the primary are not WAL-logged,
-   so the standby will likely re-write the hints itself.
-   Thus the main database blocks will produce write I/Os even though
-   all users are read-only; no changes have occurred to the data values
-   themselves. Users can still write large sort temp files and
-   re-generate relcache info files, so no part of the database
-   is truly read-only during hot standby mode. There is no restriction
-   on the use of set-returning functions, or other users of tuplestore/tuplesort
-   code. Note also that writes to remote databases will still be possible,
-   even though the transaction is read-only locally.
-  </para>
-
-  <para>
-   The following types of administrator command are not accepted
-   during recovery mode:
-
-   <itemizedlist>
-    <listitem>
-     <para>
-      Data Definition Language (DDL) - e.g. CREATE INDEX
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Privilege and Ownership - GRANT, REVOKE, REASSIGN
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
-     </para>
-    </listitem>
-   </itemizedlist>
-  </para>
-
-  <para>
-   Note again that some of these commands are actually allowed during
-   "read only" mode transactions on the primary.
-  </para>
-
-  <para>
-   As a result, you cannot create additional indexes that exist solely
-   on the standby, nor can you create statistics that exist solely on the
-   standby. If these administrator commands are needed, they should be executed
-   on the primary so that the changes will propagate through to the
-   standby.
-  </para>
-
-  <para>
-   <function>pg_cancel_backend()</> will work on user backends, but not the
-   Startup process, which performs recovery. pg_stat_activity does not
-   show an entry for the Startup process, nor do recovering transactions
-   show as active. As a result, pg_prepared_xacts is always empty during
-   recovery.
If you wish to resolve in-doubt prepared transactions,
-   then look at pg_prepared_xacts on the primary and issue commands to
-   resolve those transactions there.
-  </para>
-
-  <para>
-   pg_locks will show locks held by backends, as normal. pg_locks also shows
-   a virtual transaction managed by the Startup process that owns all
-   AccessExclusiveLocks held by transactions being replayed by recovery.
-   Note that the Startup process does not acquire locks to
-   make database changes, and thus locks other than AccessExclusiveLocks
-   do not show in pg_locks for the Startup process; they are just presumed
-   to exist.
-  </para>
-
-  <para>
-   <productname>check_pgsql</> will work, but it is very simple.
-   <productname>check_postgres</> will also work, though some actions
-   could give different or confusing results.
-   For example, last vacuum time will not be maintained, since no
-   vacuum occurs on the standby (though vacuums running on the primary do
-   send their changes to the standby).
-  </para>
-
-  <para>
-   WAL file control commands will not work during recovery,
-   e.g. <function>pg_start_backup</>, <function>pg_switch_xlog</> etc.
-  </para>
-
-  <para>
-   Dynamically loadable modules work, including pg_stat_statements.
-  </para>
-
-  <para>
-   Advisory locks work normally in recovery, including deadlock detection.
-   Note that advisory locks are never WAL-logged, so it is not possible for
-   an advisory lock on either the primary or the standby to conflict with WAL
-   replay. Nor is it possible to acquire an advisory lock on the primary
-   and have it initiate a similar advisory lock on the standby. Advisory
-   locks relate only to the single server on which they are acquired.
-  </para>
-
-  <para>
-   Trigger-based replication systems such as <productname>Slony</>,
-   <productname>Londiste</> and <productname>Bucardo</> won't run on the
-   standby at all, though they will run happily on the primary server as
-   long as the changes are not sent to standby servers to be applied.
-   WAL replay is not trigger-based, so you cannot relay from the
-   standby to any system that requires additional database writes or
-   relies on the use of triggers.
-  </para>
-
-  <para>
-   New OIDs cannot be assigned, though some <acronym>UUID</> generators may still
-   work as long as they do not rely on writing new status to the database.
-  </para>
-
-  <para>
-   Currently, temporary table creation is not allowed during read-only
-   transactions, so in some cases existing scripts will not run correctly.
-   It is possible that we may relax that restriction in a later release. This is
-   both a SQL Standard compliance issue and a technical issue.
-  </para>
-
-  <para>
-   <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
-   Some standby users may be actively using the tablespace via their
-   <varname>temp_tablespaces</> parameter. If there are temp files in the
-   tablespace, we currently cancel all active queries to ensure that temp
-   files are removed, so that we can remove the tablespace and continue with
-   WAL replay.
-  </para>
-
-  <para>
-   Running <command>DROP DATABASE</>, <command>ALTER DATABASE ... SET TABLESPACE</>,
-   or <command>ALTER DATABASE ... RENAME</> on the primary will generate a log message
-   that will cause all users connected to that database on the standby to be
-   forcibly disconnected. This action occurs immediately, whatever the setting of
-   <varname>max_standby_delay</>.
-  </para>
-
-  <para>
-   In normal running, if you issue <command>DROP USER</> or <command>DROP ROLE</>
-   for a role with login capability while that user is still connected, then
-   nothing happens to the connected user; they remain connected. The user cannot
-   reconnect, however. This behaviour applies in recovery also, so a
-   <command>DROP USER</> on the primary does not disconnect that user on the standby.
-  </para>
-
-  <para>
-   The stats collector is active during recovery. All scans, reads, blocks,
-   index usage, etc. will all be recorded normally on the standby.
Replayed
-   actions will not duplicate their effects on the primary, so replaying an
-   insert will not increment the Inserts column of pg_stat_user_tables.
-   The stats file is deleted at the start of recovery, so stats from primary
-   and standby will differ; this is considered a feature, not a bug.
-  </para>
-
-  <para>
-   Autovacuum is not active during recovery, though it will start normally
-   at the end of recovery.
-  </para>
-
-  <para>
-   The background writer is active during recovery and will perform
-   restartpoints (similar to checkpoints on the primary) and normal block
-   cleaning activities. The <command>CHECKPOINT</> command is accepted during recovery,
-   though it performs a restartpoint rather than a new checkpoint.
-  </para>
-  </sect2>
-
-  <sect2 id="hot-standby-parameters">
-   <title>Hot Standby Parameter Reference</title>
-
-  <para>
-   Various parameters have been mentioned above in the <xref linkend="hot-standby-admin">
-   and <xref linkend="hot-standby-conflict"> sections.
-  </para>
-
-  <para>
-   On the primary, the parameters <varname>recovery_connections</> and
-   <varname>vacuum_defer_cleanup_age</> can be used to enable and control the
-   primary server to assist the successful configuration of Hot Standby servers.
-   <varname>max_standby_delay</> has no effect if set on the primary.
-  </para>
-
-  <para>
-   On the standby, the parameters <varname>recovery_connections</> and
-   <varname>max_standby_delay</> can be used to enable and control Hot Standby.
-   <varname>vacuum_defer_cleanup_age</> has no effect during recovery.
-  </para>
-  </sect2>
-
-  <sect2 id="hot-standby-caveats">
-   <title>Caveats</title>
-
-  <para>
-   At this writing, there are several limitations of Hot Standby.
-   These can and probably will be fixed in future releases:
-
-   <itemizedlist>
-    <listitem>
-     <para>
-      Operations on hash indexes are not presently WAL-logged, so
-      replay will not update these indexes.
Hash indexes will not be
-      used for query plans during recovery.
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Full knowledge of running transactions is required before snapshots
-      may be taken. Transactions that use large numbers of subtransactions
-      (currently greater than 64) will delay the start of read-only
-      connections until the completion of the longest running write transaction.
-      If this situation occurs, explanatory messages will be sent to the server log.
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      Valid starting points for recovery connections are generated at each
-      checkpoint on the master. If the standby is shut down while the master
-      is in a shutdown state, it may not be possible to re-enter Hot Standby
-      until the primary is started up, so that it generates further starting
-      points in the WAL logs. This is not considered a serious issue
-      because the standby is usually switched into the primary role while
-      the first node is taken down.
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      At the end of recovery, AccessExclusiveLocks held by prepared transactions
-      will require twice the normal number of lock table entries. If you plan
-      on running either a large number of concurrent prepared transactions
-      that normally take AccessExclusiveLocks, or one
-      large transaction that takes many AccessExclusiveLocks, then you are
-      advised to select a larger value of <varname>max_locks_per_transaction</>,
-      in rare extremes up to, but never more than, twice the value of the
-      parameter setting on the primary server. You need not consider this at all if
-      your setting of <varname>max_prepared_transactions</> is <literal>0</>.
- </para> - </listitem> - </itemizedlist> - - </para> - </sect2> - - </sect1> - <sect1 id="migration"> <title>Migration Between Releases</title> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 0e8351ed634..ad8f4b01135 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -1,4 +1,4 @@ -<!-- $PostgreSQL: pgsql/doc/src/sgml/high-availability.sgml,v 1.39 2010/02/08 13:41:06 momjian Exp $ --> +<!-- $PostgreSQL: pgsql/doc/src/sgml/high-availability.sgml,v 1.40 2010/02/09 16:50:25 heikki Exp $ --> <chapter id="high-availability"> <title>High Availability, Load Balancing, and Replication</title> @@ -79,6 +79,9 @@ also available. </para> + <sect1 id="different-replication-solutions"> + <title>Comparison of different solutions</title> + <variablelist> <varlistentry> @@ -450,4 +453,1388 @@ protocol to make nodes agree on a serializable transactional order. </variablelist> + </sect1> + + <sect1 id="warm-standby"> + <title>File-based Log Shipping</title> + + <indexterm zone="high-availability"> + <primary>warm standby</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>PITR standby</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>standby server</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>log shipping</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>witness server</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>STONITH</primary> + </indexterm> + + <para> + Continuous archiving can be used to create a <firstterm>high + availability</> (HA) cluster configuration with one or more + <firstterm>standby servers</> ready to take over operations if the + primary server fails. This capability is widely referred to as + <firstterm>warm standby</> or <firstterm>log shipping</>. 
+  </para>
+
+  <para>
+   The primary and standby servers work together to provide this capability,
+   though the servers are only loosely coupled. The primary server operates
+   in continuous archiving mode, while each standby server operates in
+   continuous recovery mode, reading the WAL files from the primary. No
+   changes to the database tables are required to enable this capability,
+   so it offers low administration overhead compared to some other
+   replication approaches. This configuration also has relatively low
+   performance impact on the primary server.
+  </para>
+
+  <para>
+   Directly moving WAL records from one database server to another
+   is typically described as log shipping. <productname>PostgreSQL</>
+   implements file-based log shipping, which means that WAL records are
+   transferred one file (WAL segment) at a time. WAL files (16MB) can be
+   shipped easily and cheaply over any distance, whether it be to an
+   adjacent system, another system at the same site, or another system on
+   the far side of the globe. The bandwidth required for this technique
+   varies according to the transaction rate of the primary server.
+   Record-based log shipping is also possible with custom-developed
+   procedures, as discussed in <xref linkend="warm-standby-record">.
+  </para>
+
+  <para>
+   Note that log shipping is asynchronous, i.e., the WAL
+   records are shipped after transaction commit. As a result, there is a
+   window for data loss should the primary server suffer a catastrophic
+   failure: transactions not yet shipped will be lost. The length of the
+   window of data loss can be limited by use of the
+   <varname>archive_timeout</varname> parameter, which can be set as low
+   as a few seconds if required. However, such a low setting will
+   substantially increase the bandwidth required for file shipping.
+   If you need a window of less than a minute or so, it's probably better
+   to consider record-based log shipping.
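As a sketch, the primary-side settings for file-based log shipping might look like the following (the archive directory path is illustrative):

```
# postgresql.conf on the primary (paths and values illustrative)
archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'
archive_timeout = 60    # bound the data-loss window to roughly a minute,
                        # at the cost of shipping a 16MB segment per minute
```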
+ </para> + + <para> + The standby server is not available for access, since it is continually + performing recovery processing. Recovery performance is sufficiently + good that the standby will typically be only moments away from full + availability once it has been activated. As a result, we refer to this + capability as a warm standby configuration that offers high + availability. Restoring a server from an archived base backup and + rollforward will take considerably longer, so that technique only + offers a solution for disaster recovery, not high availability. + </para> + + <sect2 id="warm-standby-planning"> + <title>Planning</title> + + <para> + It is usually wise to create the primary and standby servers + so that they are as similar as possible, at least from the + perspective of the database server. In particular, the path names + associated with tablespaces will be passed across unmodified, so both + primary and standby servers must have the same mount paths for + tablespaces if that feature is used. Keep in mind that if + <xref linkend="sql-createtablespace" endterm="sql-createtablespace-title"> + is executed on the primary, any new mount point needed for it must + be created on the primary and all standby servers before the command + is executed. Hardware need not be exactly the same, but experience shows + that maintaining two identical systems is easier than maintaining two + dissimilar ones over the lifetime of the application and system. + In any case the hardware architecture must be the same — shipping + from, say, a 32-bit to a 64-bit system will not work. + </para> + + <para> + In general, log shipping between servers running different major + <productname>PostgreSQL</> release + levels is not possible. 
It is the policy of the PostgreSQL Global + Development Group not to make changes to disk formats during minor release + upgrades, so it is likely that running different minor release levels + on primary and standby servers will work successfully. However, no + formal support for that is offered and you are advised to keep primary + and standby servers at the same release level as much as possible. + When updating to a new minor release, the safest policy is to update + the standby servers first — a new minor release is more likely + to be able to read WAL files from a previous minor release than vice + versa. + </para> + + <para> + There is no special mode required to enable a standby server. The + operations that occur on both primary and standby servers are + normal continuous archiving and recovery tasks. The only point of + contact between the two database servers is the archive of WAL files + that both share: primary writing to the archive, standby reading from + the archive. Care must be taken to ensure that WAL archives from separate + primary servers do not become mixed together or confused. The archive + need not be large if it is only required for standby operation. + </para> + + <para> + The magic that makes the two loosely coupled servers work together is + simply a <varname>restore_command</> used on the standby that, + when asked for the next WAL file, waits for it to become available from + the primary. The <varname>restore_command</> is specified in the + <filename>recovery.conf</> file on the standby server. Normal recovery + processing would request a file from the WAL archive, reporting failure + if the file was unavailable. For standby processing it is normal for + the next WAL file to be unavailable, so we must be patient and wait for + it to appear. For files ending in <literal>.backup</> or + <literal>.history</> there is no need to wait, and a non-zero return + code must be returned. 
A waiting <varname>restore_command</> can be + written as a custom script that loops after polling for the existence of + the next WAL file. There must also be some way to trigger failover, which + should interrupt the <varname>restore_command</>, break the loop and + return a file-not-found error to the standby server. This ends recovery + and the standby will then come up as a normal server. + </para> + + <para> + Pseudocode for a suitable <varname>restore_command</> is: +<programlisting> +triggered = false; +while (!NextWALFileReady() && !triggered) +{ + sleep(100000L); /* wait for ~0.1 sec */ + if (CheckForExternalTrigger()) + triggered = true; +} +if (!triggered) + CopyWALFileForRecovery(); +</programlisting> + </para> + + <para> + A working example of a waiting <varname>restore_command</> is provided + as a <filename>contrib</> module named <application>pg_standby</>. It + should be used as a reference on how to correctly implement the logic + described above. It can also be extended as needed to support specific + configurations and environments. + </para> + + <para> + <productname>PostgreSQL</productname> does not provide the system + software required to identify a failure on the primary and notify + the standby database server. Many such tools exist and are well + integrated with the operating system facilities required for + successful failover, such as IP address migration. + </para> + + <para> + The method for triggering failover is an important part of planning + and design. One potential option is the <varname>restore_command</> + command. It is executed once for each WAL file, but the process + running the <varname>restore_command</> is created and dies for + each file, so there is no daemon or server process, and we cannot + use signals or a signal handler. Therefore, the + <varname>restore_command</> is not suitable to trigger failover. 
+ It is possible to use a simple timeout facility, especially if + used in conjunction with a known <varname>archive_timeout</> + setting on the primary. However, this is somewhat error prone + since a network problem or busy primary server might be sufficient + to initiate failover. A notification mechanism such as the explicit + creation of a trigger file is ideal, if this can be arranged. + </para> + + <para> + The size of the WAL archive can be minimized by using the <literal>%r</> + option of the <varname>restore_command</>. This option specifies the + last archive file name that needs to be kept to allow the recovery to + restart correctly. This can be used to truncate the archive once + files are no longer required, assuming the archive is writable from the + standby server. + </para> + </sect2> + + <sect2 id="warm-standby-config"> + <title>Implementation</title> + + <para> + The short procedure for configuring a standby server is as follows. For + full details of each step, refer to previous sections as noted. + <orderedlist> + <listitem> + <para> + Set up primary and standby systems as nearly identical as + possible, including two identical copies of + <productname>PostgreSQL</> at the same release level. + </para> + </listitem> + <listitem> + <para> + Set up continuous archiving from the primary to a WAL archive + directory on the standby server. Ensure that + <xref linkend="guc-archive-mode">, + <xref linkend="guc-archive-command"> and + <xref linkend="guc-archive-timeout"> + are set appropriately on the primary + (see <xref linkend="backup-archiving-wal">). + </para> + </listitem> + <listitem> + <para> + Make a base backup of the primary server (see <xref + linkend="backup-base-backup">), and load this data onto the standby. 
+ </para> + </listitem> + <listitem> + <para> + Begin recovery on the standby server from the local WAL + archive, using a <filename>recovery.conf</> that specifies a + <varname>restore_command</> that waits as described + previously (see <xref linkend="backup-pitr-recovery">). + </para> + </listitem> + </orderedlist> + </para> + + <para> + Recovery treats the WAL archive as read-only, so once a WAL file has + been copied to the standby system it can be copied to tape at the same + time as it is being read by the standby database server. + Thus, running a standby server for high availability can be performed at + the same time as files are stored for longer term disaster recovery + purposes. + </para> + + <para> + For testing purposes, it is possible to run both primary and standby + servers on the same system. This does not provide any worthwhile + improvement in server robustness, nor would it be described as HA. + </para> + </sect2> + + <sect2 id="warm-standby-record"> + <title>Record-based Log Shipping</title> + + <para> + <productname>PostgreSQL</productname> directly supports file-based + log shipping as described above. It is also possible to implement + record-based log shipping, though this requires custom development. + </para> + + <para> + An external program can call the <function>pg_xlogfile_name_offset()</> + function (see <xref linkend="functions-admin">) + to find out the file name and the exact byte offset within it of + the current end of WAL. It can then access the WAL file directly + and copy the data from the last known end of WAL through the current end + over to the standby servers. With this approach, the window for data + loss is the polling cycle time of the copying program, which can be very + small, and there is no wasted bandwidth from forcing partially-used + segment files to be archived. 
Note that the standby servers'
+    <varname>restore_command</> scripts can only deal with whole WAL files,
+    so the incrementally copied data is not ordinarily made available to
+    the standby servers. It is of use only when the primary dies —
+    then the last partial WAL file is fed to the standby before allowing
+    it to come up. The correct implementation of this process requires
+    cooperation of the <varname>restore_command</> script with the data
+    copying program.
+   </para>
+
+   <para>
+    Starting with <productname>PostgreSQL</> version 8.5, you can use
+    streaming replication (see <xref linkend="streaming-replication">) to
+    achieve the same benefits with less effort.
+   </para>
+  </sect2>
+ </sect1>
+
+ <sect1 id="streaming-replication">
+  <title>Streaming Replication</title>
+
+  <indexterm zone="high-availability">
+   <primary>Streaming Replication</primary>
+  </indexterm>
+
+  <para>
+   <productname>PostgreSQL</> includes a simple streaming replication
+   mechanism, which lets the standby server stay more up-to-date than
+   file-based replication allows. The standby connects to the primary
+   and the primary starts streaming WAL records from where the standby
+   left off, and continues streaming them as they are generated, without
+   waiting for the WAL file to be filled. So with streaming replication,
+   <varname>archive_timeout</> does not need to be configured.
+  </para>
+
+  <para>
+   Streaming replication relies on file-based continuous archiving for
+   making the base backup and for allowing a standby to catch up if it's
+   disconnected from the primary for long enough for the primary to
+   delete old WAL files still required by the standby.
+  </para>
+
+  <sect2 id="streaming-replication-setup">
+   <title>Setup</title>
+   <para>
+    The short procedure for configuring streaming replication is as follows.
+    For full details of each step, refer to other sections as noted.
+    <orderedlist>
+     <listitem>
+      <para>
+       Set up primary and standby systems as near identically as possible,
+       including two identical copies of <productname>PostgreSQL</> at the
+       same release level.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Set up continuous archiving from the primary to a WAL archive located
+       in a directory on the standby server. Ensure that
+       <xref linkend="guc-archive-mode">,
+       <xref linkend="guc-archive-command"> and
+       <xref linkend="guc-archive-timeout">
+       are set appropriately on the primary
+       (see <xref linkend="backup-archiving-wal">).
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       Set up connections and authentication so that the standby server can
+       successfully connect to the pseudo <literal>replication</> database of
+       the primary server (see
+       <xref linkend="streaming-replication-authentication">). Ensure that
+       <xref linkend="guc-listen-addresses"> and <filename>pg_hba.conf</> are
+       configured appropriately on the primary.
+      </para>
+      <para>
+       On systems that support the keepalive socket option, setting
+       <xref linkend="guc-tcp-keepalives-idle">,
+       <xref linkend="guc-tcp-keepalives-interval"> and
+       <xref linkend="guc-tcp-keepalives-count"> helps you detect problems
+       with replication (e.g., a network outage or a failure of the
+       standby server) as soon as possible.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Set the maximum number of concurrent connections from the standby servers
+       (see <xref linkend="guc-max-wal-senders"> for details).
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Enable WAL archiving in the primary server because we need to make a base
+       backup of it later (see <xref linkend="guc-archive-mode"> and
+       <xref linkend="guc-archive-command"> for details).
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Start the <productname>PostgreSQL</> server on the primary.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Make a base backup of the primary server (see
+       <xref linkend="backup-base-backup">), and load this data onto the
+       standby. Note that all files present in <filename>pg_xlog</>
+       and <filename>pg_xlog/archive_status</> on the <emphasis>standby</>
+       server should be removed because they might be obsolete.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Set up WAL archiving, connections and authentication like the primary
+       server, because the standby server might work as a primary server after
+       failover. Ensure that your settings are consistent with the
+       <emphasis>future</> environment after the primary and the standby
+       server are interchanged by failover. If you're setting up the standby
+       server, e.g., for reporting purposes, with no plans to fail over to it,
+       configure the standby accordingly.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Create a recovery command file <filename>recovery.conf</> in the data
+       directory on the standby server.
+      </para>
+
+      <variablelist id="replication-config-settings" xreflabel="Replication Settings">
+       <varlistentry id="standby-mode" xreflabel="standby_mode">
+        <term><varname>standby_mode</varname> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether to start the <productname>PostgreSQL</> server as
+          a standby. If this parameter is <literal>on</>, streaming
+          replication is enabled and the standby server will try to connect
+          to the primary to receive and apply WAL records continuously. The
+          default is <literal>off</>, which allows only archive recovery
+          without replication. So, streaming replication requires this
+          parameter to be explicitly set to <literal>on</>.
+         </para>
+        </listitem>
+       </varlistentry>
+       <varlistentry id="primary-conninfo" xreflabel="primary_conninfo">
+        <term><varname>primary_conninfo</varname> (<type>string</type>)</term>
+        <listitem>
+         <para>
+          Specifies a connection string to be used by the standby server
+          to connect to the primary. This string is in the same format as
+          described in <xref linkend="libpq-connect">. If any option is
+          unspecified in this string, then the corresponding environment
+          variable (see <xref linkend="libpq-envars">) is checked. If the
+          environment variable is not set either, then the indicated built-in
+          defaults are used.
+         </para>
+         <para>
+          The built-in replication requires that the host name (or host
+          address) and port number on which the primary server listens be
+          specified in this string. Also ensure that a role with the
+          <literal>SUPERUSER</> and <literal>LOGIN</> privileges exists on
+          the primary (see
+          <xref linkend="streaming-replication-authentication">). Note that
+          the password needs to be set if the primary demands password
+          authentication.
+         </para>
+        </listitem>
+       </varlistentry>
+       <varlistentry id="trigger-file" xreflabel="trigger_file">
+        <term><varname>trigger_file</varname> (<type>string</type>)</term>
+        <listitem>
+         <para>
+          Specifies a trigger file whose presence activates the standby.
+          If no trigger file is specified, the standby never exits
+          recovery.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </listitem>
+     <listitem>
+      <para>
+       Start the <productname>PostgreSQL</> server on the standby. The standby
+       server will go into recovery mode and proceed to receive WAL records
+       from the primary and apply them continuously.
+      </para>
+     </listitem>
+    </orderedlist>
+   </para>
+  </sect2>
+  <sect2 id="streaming-replication-authentication">
+   <title>Authentication</title>
+   <para>
+    It is very important that the access privileges for replication be set
+    properly so that only trusted users can read the WAL stream, because it
+    is easy to extract sensitive information from it.
+   </para>
+   <para>
+    Only a superuser is allowed to connect to the primary as the replication
+    standby. So a role with the <literal>SUPERUSER</> and <literal>LOGIN</>
+    privileges needs to be created on the primary.
+   </para>
+   <para>
+    Client authentication for replication is controlled by the
+    <filename>pg_hba.conf</> record specifying <literal>replication</> in the
+    <replaceable>database</> field. For example, if the standby is running on
+    host IP <literal>192.168.1.100</> and the superuser's name for replication
+    is <literal>foo</>, the administrator can add the following line to the
+    <filename>pg_hba.conf</> file on the primary.
+
+<programlisting>
+# Allow the user "foo" from host 192.168.1.100 to connect to the primary
+# as a replication standby if the user's password is correctly supplied.
+#
+# TYPE  DATABASE     USER   CIDR-ADDRESS       METHOD
+host    replication  foo    192.168.1.100/32   md5
+</programlisting>
+   </para>
+   <para>
+    The host name and port number of the primary, user name to connect as,
+    and password are specified in the <filename>recovery.conf</> file or
+    the corresponding environment variable on the standby.
+    For example, if the primary is running on host IP <literal>192.168.1.50</>,
+    port <literal>5432</literal>, the superuser's name for replication is
+    <literal>foo</>, and the password is <literal>foopass</>, the administrator
+    can add the following line to the <filename>recovery.conf</> file on the
+    standby.
+
+<programlisting>
+# The standby connects to the primary that is running on host 192.168.1.50
+# and port 5432 as the user "foo" whose password is "foopass".
+primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
+</programlisting>
+   </para>
+  </sect2>
+ </sect1>
+
+ <sect1 id="warm-standby-failover">
+  <title>Failover</title>
+
+  <para>
+   If the primary server fails then the standby server should begin
+   failover procedures.
+  </para>
+
+  <para>
+   If the standby server fails then no failover need take place. If the
+   standby server can be restarted, even some time later, then the recovery
+   process can also be immediately restarted, taking advantage of
+   restartable recovery. If the standby server cannot be restarted, then a
+   full new standby server instance should be created.
+  </para>
+
+  <para>
+   If the primary server fails and the standby server becomes the
+   new primary, and then the old primary restarts, you must have
+   a mechanism for informing the old primary that it is no longer the
+   primary. This is sometimes known as STONITH (Shoot The Other Node In
+   The Head), which is necessary to avoid situations where both systems
+   think they are the primary, a state that will lead to confusion and
+   ultimately data loss.
+  </para>
+
+  <para>
+   Many failover systems use just two systems, the primary and the standby,
+   connected by some kind of heartbeat mechanism to continually verify the
+   connectivity between the two and the viability of the primary. It is
+   also possible to use a third system (called a witness server) to prevent
+   some cases of inappropriate failover, but the additional complexity
+   might not be worthwhile unless it is set up with sufficient care and
+   rigorous testing.
+  </para>
+
+  <para>
+   Once failover to the standby occurs, we have only a
+   single server in operation. This is known as a degenerate state.
+   The former standby is now the primary, but the former primary is down
+   and might stay down. To return to normal operation we must
+   fully recreate a standby server,
+   either on the former primary system when it comes up, or on a third,
+   possibly new, system.
Once complete the primary and standby can be + considered to have switched roles. Some people choose to use a third + server to provide backup for the new primary until the new standby + server is recreated, + though clearly this complicates the system configuration and + operational processes. + </para> + + <para> + So, switching from primary to standby server can be fast but requires + some time to re-prepare the failover cluster. Regular switching from + primary to standby is useful, since it allows regular downtime on + each system for maintenance. This also serves as a test of the + failover mechanism to ensure that it will really work when you need it. + Written administration procedures are advised. + </para> + </sect1> + + <sect1 id="hot-standby"> + <title>Hot Standby</title> + + <indexterm zone="high-availability"> + <primary>Hot Standby</primary> + </indexterm> + + <para> + Hot Standby is the term used to describe the ability to connect to + the server and run queries while the server is in archive recovery. This + is useful for both log shipping replication and for restoring a backup + to an exact state with great precision. + The term Hot Standby also refers to the ability of the server to move + from recovery through to normal running while users continue running + queries and/or continue their connections. + </para> + + <para> + Running queries in recovery is in many ways the same as normal running + though there are a large number of usage and administrative points + to note. + </para> + + <sect2 id="hot-standby-users"> + <title>User's Overview</title> + + <para> + Users can connect to the database while the server is in recovery + and perform read-only queries. Read-only access to catalogs and views + will also occur as normal. + </para> + + <para> + The data on the standby takes some time to arrive from the primary server + so there will be a measurable delay between primary and standby. 
Running the + same query nearly simultaneously on both primary and standby might therefore + return differing results. We say that data on the standby is eventually + consistent with the primary. + Queries executed on the standby will be correct with regard to the transactions + that had been recovered at the start of the query, or start of first statement, + in the case of serializable transactions. In comparison with the primary, + the standby returns query results that could have been obtained on the primary + at some exact moment in the past. + </para> + + <para> + When a transaction is started in recovery, the parameter + <varname>transaction_read_only</> will be forced to be true, regardless of the + <varname>default_transaction_read_only</> setting in <filename>postgresql.conf</>. + It can't be manually set to false either. As a result, all transactions + started during recovery will be limited to read-only actions only. In all + other ways, connected sessions will appear identical to sessions + initiated during normal processing mode. There are no special commands + required to initiate a connection at this time, so all interfaces + work normally without change. After recovery finishes, the session + will allow normal read-write transactions at the start of the next + transaction, if these are requested. + </para> + + <para> + Read-only here means "no writes to the permanent database tables". + There are no problems with queries that make use of transient sort and + work files. 
+   </para>
+
+   <para>
+    The following actions are allowed:
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       Query access - SELECT, COPY TO including views and SELECT RULEs
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Cursor commands - DECLARE, FETCH, CLOSE
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Parameters - SHOW, SET, RESET
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Transaction management commands
+       <itemizedlist>
+        <listitem>
+         <para>
+          BEGIN, END, ABORT, START TRANSACTION
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          EXCEPTION blocks and other internal subtransactions
+         </para>
+        </listitem>
+       </itemizedlist>
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       LOCK TABLE, though only when explicitly in one of these modes:
+       ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Plugins and extensions - LOAD
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
+
+   <para>
+    These actions produce error messages:
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE.
+       Note that there are no allowed actions that result in a trigger
+       being executed during recovery.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT.
+       This currently also applies to temporary tables, because their
+       definition causes writes to catalog tables.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       SELECT ... FOR SHARE | UPDATE, which cause row locks to be written
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       RULEs on SELECT statements that generate DML commands.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       LOCK TABLE, in short default form, since it requests ACCESS EXCLUSIVE MODE.
+ LOCK TABLE that explicitly requests a mode higher than ROW EXCLUSIVE MODE. + </para> + </listitem> + <listitem> + <para> + Transaction management commands that explicitly set non-read only state + <itemizedlist> + <listitem> + <para> + BEGIN READ WRITE, + START TRANSACTION READ WRITE + </para> + </listitem> + <listitem> + <para> + SET TRANSACTION READ WRITE, + SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE + </para> + </listitem> + <listitem> + <para> + SET transaction_read_only = off + </para> + </listitem> + </itemizedlist> + </para> + </listitem> + <listitem> + <para> + Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED, + ROLLBACK PREPARED because even read-only transactions need to write + WAL in the prepare phase (the first phase of two phase commit). + </para> + </listitem> + <listitem> + <para> + sequence update - nextval() + </para> + </listitem> + <listitem> + <para> + LISTEN, UNLISTEN, NOTIFY since they currently write to system tables + </para> + </listitem> + </itemizedlist> + </para> + + <para> + Note that current behaviour of read only transactions when not in + recovery is to allow the last two actions, so there are small and + subtle differences in behaviour between read-only transactions + run on standby and during normal running. + It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and + temporary tables may be lifted in a future release, if their internal + implementation is altered to make this possible. + </para> + + <para> + If failover or switchover occurs the database will switch to normal + processing mode. Sessions will remain connected while the server + changes mode. Current transactions will continue, though will remain + read-only. After recovery is complete, it will be possible to initiate + read-write transactions. + </para> + + <para> + Users will be able to tell whether their session is read-only by + issuing SHOW transaction_read_only. 
In addition, a set of
+    functions (<xref linkend="functions-recovery-info-table">) allows users to
+    access information about Hot Standby. These allow you to write
+    functions that are aware of the current state of the database. These
+    can be used to monitor the progress of recovery, or to allow you to
+    write complex programs that restore the database to particular states.
+   </para>
+
+   <para>
+    In recovery, transactions will not be permitted to take any table lock
+    higher than RowExclusiveLock. In addition, transactions may never assign
+    a TransactionId and may never write WAL.
+    Any <command>LOCK TABLE</> command that runs on the standby and requests
+    a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected.
+   </para>
+
+   <para>
+    In general queries will not experience lock conflicts with the database
+    changes made by recovery. This is because recovery follows normal
+    concurrency control mechanisms, known as <acronym>MVCC</>. There are
+    some types of change that will cause conflicts, covered in the following
+    section.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-conflict">
+   <title>Handling query conflicts</title>
+
+   <para>
+    The primary and standby nodes are in many ways loosely connected. Actions
+    on the primary will have an effect on the standby. As a result, there is
+    potential for negative interactions or conflicts between them. The easiest
+    conflict to understand is performance: if a huge data load is taking place
+    on the primary then this will generate a similar stream of WAL records on the
+    standby, so standby queries may contend for system resources, such as I/O.
+   </para>
+
+   <para>
+    There are also additional types of conflict that can occur with Hot Standby.
+    These conflicts are <emphasis>hard conflicts</> in the sense that we may
+    need to cancel queries and in some cases disconnect sessions to resolve them.
+    The user is provided with a number of optional ways to handle these
+    conflicts, though we must first understand the possible reasons behind a conflict.
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       Access Exclusive Locks from primary node, including both explicit
+       LOCK commands and various kinds of DDL action
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Dropping tablespaces on the primary while standby queries are using
+       those tablespaces for temporary work files (work_mem overflow)
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Dropping databases on the primary while users are connected to that
+       database on the standby.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Waiting to acquire buffer cleanup locks
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Early cleanup of data still visible to the current query's snapshot
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
+
+   <para>
+    Some WAL redo actions will be for DDL actions. These DDL actions are
+    repeating actions that have already committed on the primary node, so
+    they must not fail on the standby node. These DDL locks take priority
+    and will automatically *cancel* any read-only transactions that get in
+    their way, after a grace period. This is similar to the possibility of
+    being canceled by the deadlock detector, but in this case the standby
+    process always wins, since the replayed actions must not fail. This
+    also ensures that replication doesn't fall behind while we wait for a
+    query to complete. Again, we assume that the standby is there for high
+    availability purposes primarily.
+   </para>
+
+   <para>
+    An example of the above would be an administrator on the primary server
+    running <command>DROP TABLE</> on a table that is currently being queried
+    on the standby server.
+    Clearly the query cannot continue if we let the <command>DROP TABLE</>
+    proceed. If this situation occurred on the primary, the <command>DROP TABLE</>
+    would wait until the query has finished.
When the query is on the standby + and the <command>DROP TABLE</> is on the primary, the primary doesn't have + information about which queries are running on the standby and so the query + does not wait on the primary. The WAL change records come through to the + standby while the standby query is still running, causing a conflict. + </para> + + <para> + The most common reason for conflict between standby queries and WAL redo is + "early cleanup". Normally, <productname>PostgreSQL</> allows cleanup of old + row versions when there are no users who may need to see them to ensure correct + visibility of data (the heart of MVCC). If there is a standby query that has + been running for longer than any query on the primary then it is possible + for old row versions to be removed by either a vacuum or HOT. This will + then generate WAL records that, if applied, would remove data on the + standby that might *potentially* be required by the standby query. + In more technical language, the primary's xmin horizon is later than + the standby's xmin horizon, allowing dead rows to be removed. + </para> + + <para> + Experienced users should note that both row version cleanup and row version + freezing will potentially conflict with recovery queries. Running a + manual <command>VACUUM FREEZE</> is likely to cause conflicts even on tables + with no updated or deleted rows. + </para> + + <para> + We have a number of choices for resolving query conflicts. The default + is that we wait and hope the query completes. The server will wait + automatically until the lag between primary and standby is at most + <varname>max_standby_delay</> seconds. Once that grace period expires, + we take one of the following actions: + + <itemizedlist> + <listitem> + <para> + If the conflict is caused by a lock, we cancel the conflicting standby + transaction immediately. If the transaction is idle-in-transaction + then currently we abort the session instead, though this may change + in the future. 
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       If the conflict is caused by cleanup records we tell the standby query
+       that a conflict has occurred and that it must cancel itself to avoid the
+       risk that it silently fails to read relevant data because
+       that data has been removed. (This is regrettably very similar to the
+       much feared and iconic error message "snapshot too old"). Some cleanup
+       records only cause conflict with older queries, though some types of
+       cleanup record affect all queries.
+      </para>
+
+      <para>
+       If cancellation does occur, the query and/or transaction can always
+       be re-executed. The error is dynamic and will not necessarily occur
+       the same way if the query is executed again.
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
+
+   <para>
+    <varname>max_standby_delay</> is set in <filename>postgresql.conf</>.
+    The parameter applies to the server as a whole, so if the delay is all used
+    up by a single query then there may be little or no waiting for queries that
+    follow immediately, though they will have benefited equally from the initial
+    waiting period. The server may take time to catch up again before the grace
+    period is available again, though if there is a heavy and constant stream
+    of conflicts it may seldom catch up fully.
+   </para>
+
+   <para>
+    Users should be clear that tables that are regularly and heavily updated
+    on the primary server will quickly cause cancellation of longer running
+    queries on the standby. In those cases <varname>max_standby_delay</> can be
+    considered somewhat but not exactly the same as setting
+    <varname>statement_timeout</>.
+   </para>
+
+   <para>
+    Other remedial actions exist if the number of cancellations is unacceptable.
+    The first option is to connect to the primary server and keep a query active
+    for as long as we need to run queries on the standby. This guarantees that
+    a WAL cleanup record is never generated and we don't ever get query
+    conflicts as described above.
This could be done using contrib/dblink + and pg_sleep(), or via other mechanisms. If you do this, you should note + that this will delay cleanup of dead rows by vacuum or HOT and many + people may find this undesirable. However, we should remember that + primary and standby nodes are linked via the WAL, so this situation is no + different to the case where we ran the query on the primary node itself + except we have the benefit of off-loading the execution onto the standby. + </para> + + <para> + It is also possible to set <varname>vacuum_defer_cleanup_age</> on the primary + to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow + more time for queries to execute before they are cancelled on the standby, + without the need for setting a high <varname>max_standby_delay</>. + </para> + + <para> + Three-way deadlocks are possible between AccessExclusiveLocks arriving from + the primary, cleanup WAL records that require buffer cleanup locks and + user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks + are resolved by time-out when we exceed <varname>max_standby_delay</>. + </para> + + <para> + Dropping tablespaces or databases is discussed in the administrator's + section since they are not typical user situations. + </para> + </sect2> + + <sect2 id="hot-standby-admin"> + <title>Administrator's Overview</title> + + <para> + If there is a <filename>recovery.conf</> file present the server will start + in Hot Standby mode by default, though <varname>recovery_connections</> can + be disabled via <filename>postgresql.conf</>, if required. The server may take + some time to enable recovery connections since the server must first complete + sufficient recovery to provide a consistent state against which queries + can run before enabling read only connections. Look for these messages + in the server logs + +<programlisting> +LOG: initializing recovery connections + +... then some time later ... 
+
+LOG: consistent recovery state reached
+LOG: database system is ready to accept read only connections
+</programlisting>
+
+   Consistency information is recorded once per checkpoint on the primary, as long
+   as <varname>recovery_connections</> is enabled (on the primary). If this parameter
+   is disabled, it will not be possible to enable recovery connections on the standby.
+   The consistent state can also be delayed in the presence of both of these conditions:
+
+   <itemizedlist>
+    <listitem>
+     <para>
+      a write transaction has more than 64 subtransactions
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      very long-lived write transactions
+     </para>
+    </listitem>
+   </itemizedlist>
+
+   If you are running file-based log shipping ("warm standby"), you may need
+   to wait until the next WAL file arrives, which could be as long as the
+   <varname>archive_timeout</> setting on the primary.
+  </para>
+
+  <para>
+   The setting of some parameters on the standby will need reconfiguration
+   if they have been changed on the primary. The value on the standby must
+   be equal to or greater than the value on the primary. If these parameters
+   are not set high enough then the standby will not be able to track work
+   correctly from recovering transactions. If these values are set too low,
+   the server will halt. Higher values can then be supplied and the server
+   restarted to begin recovery again.
+
+   <itemizedlist>
+    <listitem>
+     <para>
+      <varname>max_connections</>
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <varname>max_prepared_transactions</>
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      <varname>max_locks_per_transaction</>
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <para>
+   It is important that the administrator consider the appropriate setting
+   of <varname>max_standby_delay</>, set in <filename>postgresql.conf</>.
+   There is no optimal setting; it should be set according to business
+   priorities.
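In <filename>postgresql.conf</> this is a single line; the value shown below is purely hypothetical:

```
# postgresql.conf on the standby -- the value is a hypothetical starting point
max_standby_delay = 30        # grace period, in seconds, before conflicting queries are canceled
```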
+    For example, if the server is primarily tasked as a High
+    Availability server, then you may wish to lower
+    <varname>max_standby_delay</> or even set it to zero, though that is a
+    very aggressive setting. If the standby server is tasked as an additional
+    server for decision support queries then it may be acceptable to set this
+    to a value of many hours (in seconds).
+   </para>
+
+   <para>
+    Transaction status "hint bits" written on the primary are not WAL-logged,
+    so the standby will likely re-write the hints itself.
+    Thus the main database blocks will produce write I/Os even though
+    all users are read-only; no changes have occurred to the data values
+    themselves. Users will also be able to write large sort temp files and
+    re-generate relcache info files, so no part of the database
+    is truly read-only during hot standby mode. There is no restriction
+    on the use of set returning functions, or other users of tuplestore/tuplesort
+    code. Note also that writes to remote databases will still be possible,
+    even though the transaction is read-only locally.
+   </para>
+
+   <para>
+    The following types of administrator command are not accepted
+    during recovery mode:
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       Data Definition Language (DDL) - e.g. CREATE INDEX
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Privilege and Ownership - GRANT, REVOKE, REASSIGN
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
+
+   <para>
+    Note again that some of these commands are actually allowed during
+    "read only" mode transactions on the primary.
+   </para>
+
+   <para>
+    As a result, you cannot create additional indexes that exist solely
+    on the standby, nor can you gather statistics that exist solely on the
+    standby. If these administrator commands are needed, they should be
+    executed on the primary so that the changes will propagate through to
+    the standby.
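+    For instance (a hypothetical session sketch; the exact error text may
+    differ between versions), attempting DDL in a standby session is rejected:

```
postgres=# CREATE INDEX test_idx ON test (col);
ERROR:  cannot execute CREATE INDEX in a read-only transaction
```

+    Running the same command on the primary builds the index there, and the
+    resulting WAL is then replayed on the standby.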
+   </para>
+
+   <para>
+    <function>pg_cancel_backend()</> will work on user backends, but not the
+    Startup process, which performs recovery. pg_stat_activity does not
+    show an entry for the Startup process, nor do recovering transactions
+    show as active. As a result, pg_prepared_xacts is always empty during
+    recovery. If you wish to resolve in-doubt prepared transactions,
+    then look at pg_prepared_xacts on the primary and issue commands to
+    resolve those transactions there.
+   </para>
+
+   <para>
+    pg_locks will show locks held by backends, as normal. pg_locks also shows
+    a virtual transaction managed by the Startup process that owns all
+    AccessExclusiveLocks held by transactions being replayed by recovery.
+    Note that the Startup process does not acquire locks to
+    make database changes, and thus locks other than AccessExclusiveLocks
+    do not show in pg_locks for the Startup process; they are just presumed
+    to exist.
+   </para>
+
+   <para>
+    <productname>check_pgsql</> will work, but it is very simple.
+    <productname>check_postgres</> will also work, though some actions
+    could give different or confusing results. For example, the last vacuum
+    time will not be maintained, since no
+    vacuum occurs on the standby (though vacuums running on the primary do
+    send their changes to the standby).
+   </para>
+
+   <para>
+    WAL file control commands will not work during recovery,
+    e.g. <function>pg_start_backup</>, <function>pg_switch_xlog</>, etc.
+   </para>
+
+   <para>
+    Dynamically loadable modules work, including pg_stat_statements.
+   </para>
+
+   <para>
+    Advisory locks work normally in recovery, including deadlock detection.
+    Note that advisory locks are never WAL logged, so it is not possible for
+    an advisory lock on either the primary or the standby to conflict with WAL
+    replay. Nor is it possible to acquire an advisory lock on the primary
+    and have it initiate a similar advisory lock on the standby.
+    Advisory locks relate only to the single server on which they are
+    acquired.
+   </para>
+
+   <para>
+    Trigger-based replication systems such as <productname>Slony</>,
+    <productname>Londiste</> and <productname>Bucardo</> won't run on the
+    standby at all, though they will run happily on the primary server as
+    long as the changes are not sent to standby servers to be applied.
+    WAL replay is not trigger-based, so you cannot relay from the
+    standby to any system that requires additional database writes or
+    relies on the use of triggers.
+   </para>
+
+   <para>
+    New OIDs cannot be assigned, though some <acronym>UUID</> generators may still
+    work as long as they do not rely on writing new status to the database.
+   </para>
+
+   <para>
+    Currently, temp table creation is not allowed during read only
+    transactions, so in some cases existing scripts will not run correctly.
+    It is possible that this restriction will be relaxed in a later release. This is
+    both a SQL Standard compliance issue and a technical issue.
+   </para>
+
+   <para>
+    <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
+    Some standby users may be actively using the tablespace via their
+    <varname>temp_tablespaces</> parameter. If there are temp files in the
+    tablespace, we currently cancel all active queries to ensure that temp
+    files are removed, so that we can remove the tablespace and continue with
+    WAL replay.
+   </para>
+
+   <para>
+    Running <command>DROP DATABASE</>, <command>ALTER DATABASE ... SET TABLESPACE</>,
+    or <command>ALTER DATABASE ... RENAME</> on the primary will generate a log message
+    that will cause all users connected to that database on the standby to be
+    forcibly disconnected. This action occurs immediately, whatever the setting of
+    <varname>max_standby_delay</>.
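+    As a hypothetical illustration (the database name <literal>dropme</> is
+    made up; the catalog view and columns are standard), an administrator can
+    check the standby for sessions that will be forcibly disconnected before
+    issuing the command on the primary:

```
-- on the standby: which sessions are connected to the doomed database?
SELECT datname, usename FROM pg_stat_activity WHERE datname = 'dropme';

-- on the primary: replaying this on the standby disconnects those sessions
DROP DATABASE dropme;
```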
+   </para>
+
+   <para>
+    In normal running, if you issue <command>DROP USER</> or <command>DROP ROLE</>
+    for a role with login capability while that user is still connected, then
+    nothing happens to the connected user - they remain connected. However,
+    the user cannot reconnect. This behaviour applies in recovery also, so a
+    <command>DROP USER</> on the primary does not disconnect that user on the standby.
+   </para>
+
+   <para>
+    The stats collector is active during recovery. All scans, reads, blocks,
+    index usage, etc., will be recorded normally on the standby. Replayed
+    actions will not duplicate their effects on the primary, so replaying an
+    insert will not increment the Inserts column of pg_stat_user_tables.
+    The stats file is deleted at the start of recovery, so stats from the primary
+    and standby will differ; this is considered a feature, not a bug.
+   </para>
+
+   <para>
+    Autovacuum is not active during recovery, though it will start normally
+    at the end of recovery.
+   </para>
+
+   <para>
+    The background writer is active during recovery and will perform
+    restartpoints (similar to checkpoints on the primary) and normal block
+    cleaning activities. The <command>CHECKPOINT</> command is accepted during recovery,
+    though it performs a restartpoint rather than a new checkpoint.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-parameters">
+   <title>Hot Standby Parameter Reference</title>
+
+   <para>
+    Various parameters have been mentioned above in the <xref linkend="hot-standby-admin">
+    and <xref linkend="hot-standby-conflict"> sections.
+   </para>
+
+   <para>
+    On the primary, the parameters <varname>recovery_connections</> and
+    <varname>vacuum_defer_cleanup_age</> can be used to make the primary
+    server assist in the successful configuration of Hot Standby servers.
+    <varname>max_standby_delay</> has no effect if set on the primary.
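+    As a minimal sketch (the values are illustrative only, not
+    recommendations), the primary's <filename>postgresql.conf</> might
+    contain:

```
recovery_connections = on         # record consistency info so standbys can run queries
vacuum_defer_cleanup_age = 10000  # defer cleanup of dead rows by ~10000 transactions
```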
+   </para>
+
+   <para>
+    On the standby, the parameters <varname>recovery_connections</> and
+    <varname>max_standby_delay</> can be used to enable and control Hot Standby.
+    <varname>vacuum_defer_cleanup_age</> has no effect during recovery.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-caveats">
+   <title>Caveats</title>
+
+   <para>
+    At this writing, there are several limitations of Hot Standby.
+    These can and probably will be fixed in future releases:
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       Operations on hash indexes are not presently WAL-logged, so
+       replay will not update these indexes. Hash indexes will not be
+       used for query plans during recovery.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Full knowledge of running transactions is required before snapshots
+       may be taken. Transactions that use large numbers of subtransactions
+       (currently greater than 64) will delay the start of read only
+       connections until the completion of the longest running write transaction.
+       If this situation occurs, explanatory messages will be sent to the server log.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Valid starting points for recovery connections are generated at each
+       checkpoint on the master. If the standby is shut down while the master
+       is in a shutdown state, it may not be possible to re-enter Hot Standby
+       until the primary is started up, so that it generates further starting
+       points in the WAL logs. This is not considered a serious issue,
+       because the standby is usually switched into the primary role while
+       the first node is taken down.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       At the end of recovery, AccessExclusiveLocks held by prepared transactions
+       will require twice the normal number of lock table entries.
+       If you plan on running either a large number of concurrent prepared
+       transactions that normally take AccessExclusiveLocks, or one
+       large transaction that takes many AccessExclusiveLocks, then you are
+       advised to select a larger value of <varname>max_locks_per_transaction</>,
+       up to, but never more than, twice the value of the parameter setting on
+       the primary server. You need not consider this at all if
+       your setting of <varname>max_prepared_transactions</> is <literal>0</>.
+      </para>
+     </listitem>
+    </itemizedlist>
+
+   </para>
+  </sect2>
+
+ </sect1>
+
+ <sect1 id="backup-incremental-updated">
+  <title>Incrementally Updated Backups</title>
+
+  <indexterm zone="high-availability">
+   <primary>incrementally updated backups</primary>
+  </indexterm>
+
+  <indexterm zone="high-availability">
+   <primary>change accumulation</primary>
+  </indexterm>
+
+  <para>
+   In a warm standby configuration, it is possible to offload the expense of
+   taking periodic base backups from the primary server; instead, base backups
+   can be made by backing
+   up a standby server's files. This concept is generally known as
+   incrementally updated backups, log change accumulation, or more simply,
+   change accumulation.
+  </para>
+
+  <para>
+   If we take a file system backup of the standby server's data
+   directory while it is processing
+   logs shipped from the primary, we will be able to reload that backup and
+   restart the standby's recovery process from the last restart point.
+   We no longer need to keep WAL files from before the standby's restart point.
+   If we need to recover, it will be faster to recover from the incrementally
+   updated backup than from the original base backup.
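+   A minimal sketch of taking such a backup, assuming the standby's data
+   directory is <filename>/var/lib/pgsql/standby</> (the path, the backup
+   destination, and the choice to stop the standby first are all
+   illustrative):

```
# stop the standby so the copied files are self-consistent
pg_ctl -D /var/lib/pgsql/standby stop

# archive the data directory; a restored copy resumes recovery from the last restart point
tar czf /backups/standby-$(date +%Y%m%d).tar.gz -C /var/lib/pgsql standby

# restart the standby; it continues replaying WAL from where it left off
pg_ctl -D /var/lib/pgsql/standby start
```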
+  </para>
+
+  <para>
+   Since the standby server is not <quote>live</>, it is not possible to
+   use <function>pg_start_backup()</> and <function>pg_stop_backup()</>
+   to manage the backup process; it will be up to you to determine how
+   far back you need to keep WAL segment files to have a recoverable
+   backup. You can do this by running <application>pg_controldata</>
+   on the standby server to inspect the control file and determine the
+   current checkpoint WAL location, or by using the
+   <varname>log_checkpoints</> option to print values to the standby's
+   server log.
+  </para>
+ </sect1>
+</chapter>