path: root/src
* Minor header cleanup for the new iovec code. (Thomas Munro, 2021-01-14)

  Remove redundant function declaration and improve header comment in
  pg_iovec.h. Move the new declaration in fd.h next to a group of more
  similar functions.
* Ensure that a standby is able to follow a primary on a newer timeline. (Fujii Masao, 2021-01-14)

  Commit 709d003fbd refactored WAL-reading code, but accidentally caused
  WalSndSegmentOpen() to fail to follow a timeline switch while reading
  from a historic timeline. This issue caused a standby to fail to follow
  a primary on a newer timeline when WAL archiving is enabled.

  If there is a timeline switch within the segment, WalSndSegmentOpen()
  should read from the WAL segment belonging to the new timeline. But
  previously, since it failed to follow a timeline switch, it tried to
  read the WAL segment with the old timeline. When WAL archiving is
  enabled, that WAL segment with the old timeline doesn't exist, because
  it has been renamed to .partial. This led the primary to try to read a
  non-existent WAL segment, which caused replication to fail with the
  error "ERROR: requested WAL segment ... has already been removed".

  This commit fixes WalSndSegmentOpen() so that it's able to follow a
  timeline switch, to ensure that a standby is able to follow a primary
  on a newer timeline even when WAL archiving is enabled. It also adds a
  regression test to check whether a standby is able to follow a primary
  on a newer timeline when WAL archiving is enabled.

  Back-patch to v13, where the bug was introduced.

  Reported-by: Kyotaro Horiguchi
  Author: Kyotaro Horiguchi, tweaked by Fujii Masao
  Reviewed-by: Alvaro Herrera, Fujii Masao
  Discussion: https://postgr.es/m/20201209.174314.282492377848029776.horikyota.ntt@gmail.com
* Rework refactoring of hex and encoding routines (Michael Paquier, 2021-01-14)

  This commit addresses some issues with c3826f83, which moved the hex
  decoding routine to src/common/:

  - The decoding function lacked overflow checks, so when used for
    security-related features it was an open door to out-of-bound writes
    that, absent careful use, could remain undetected. Like the base64
    routines already in src/common/ used by SCRAM, this routine is
    reworked to check for overflows by having the size of the destination
    buffer passed as an argument, with overflows checked before doing any
    writes.

  - The encoding routine was missing. This is moved to src/common/ and it
    gains the same overflow checks as the decoding part.

  On failure, the hex routines of src/common/ issue an error, per the
  discussion about making them usable by frontend tools but not by shared
  libraries. Note that this is why ECPG is left out of this commit; it
  still includes duplicated logic for hex encoding and decoding.

  While on it, this commit uses better variable names for the source and
  destination buffers in the existing escape and base64 routines in
  encode.c, and makes them more robust about overflow detection. The
  previous core code issued a FATAL after doing out-of-bound writes if
  going through the SQL functions, which would be enough to detect
  problems when working on changes that impacted this area of the code.
  Instead, an error is now issued before doing an out-of-bound write. The
  hex routines were being directly called for bytea conversions and
  backup manifests without such sanity checks. The current calls happen
  to not have any problems, but careless uses of such APIs could easily
  lead to CVE-class bugs.

  Author: Bruce Momjian, Michael Paquier
  Reviewed-by: Sehrope Sarkuni
  Discussion: https://postgr.es/m/20201231003557.GB22199@momjian.us
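  For illustration, a minimal sketch of a destination-size-checked hex
  decoder in the spirit described above; the function name and signature
  are invented here and are not necessarily those of src/common/:

      #include <stddef.h>
      #include <stdint.h>
      #include <sys/types.h>

      static int
      hex_nibble(char c)
      {
          if (c >= '0' && c <= '9') return c - '0';
          if (c >= 'a' && c <= 'f') return c - 'a' + 10;
          if (c >= 'A' && c <= 'F') return c - 'A' + 10;
          return -1;              /* invalid input character */
      }

      /* Returns bytes written, or -1 on bad input or would-be overflow. */
      static ssize_t
      hex_decode_checked(const char *src, size_t srclen,
                         uint8_t *dst, size_t dstlen)
      {
          if (srclen % 2 != 0)
              return -1;          /* odd-length input */
          if (srclen / 2 > dstlen)
              return -1;          /* checked before any write occurs */
          for (size_t i = 0; i < srclen; i += 2)
          {
              int     hi = hex_nibble(src[i]);
              int     lo = hex_nibble(src[i + 1]);

              if (hi < 0 || lo < 0)
                  return -1;
              dst[i / 2] = (uint8_t) ((hi << 4) | lo);
          }
          return (ssize_t) (srclen / 2);
      }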
* Move our p{read,write}v replacements into their own files. (Thomas Munro, 2021-01-14)

  macOS's ranlib issued a warning about an empty pread.o file with the
  previous arrangement, on systems new enough to require no replacement
  functions. Let's go back to using configure's AC_REPLACE_FUNCS system
  to build and include each .o in the library only if it's needed, which
  requires moving the *v() functions to their own files. Also move the
  _with_retry() wrapper to a more permanent home.

  Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/1283127.1610554395%40sss.pgh.pa.us
* Mark inet_server_addr() and inet_server_port() as parallel-restricted. (Tom Lane, 2021-01-13)

  These need to be parallel-restricted because they access the MyProcPort
  data structure, which doesn't get copied to parallel workers. The very
  similar functions inet_client_addr() and inet_client_port() are already
  marked so, but somebody missed these.

  Although this is a pre-existing bug, we can't readily fix it in the
  back branches, since we can't force initdb. Given the small usage of
  these two functions, and the even smaller likelihood that they'd get
  pushed to a parallel worker anyway, it doesn't seem worth the trouble
  to suggest that DBAs should fix it manually.

  Masahiko Sawada
  Discussion: https://postgr.es/m/CAD21AoAT4aHP0Uxq91qpD7NL009tnUYQe-b14R3MnSVOjtE71g@mail.gmail.com
* Run reformat-dat-files to declutter the catalog data files. (Tom Lane, 2021-01-13)

  Things had gotten pretty messy here, apparently mostly but not entirely
  the fault of the multirange patch. No functional changes.
* Doc, more or less: uncomment tutorial example that was fixed long ago. (Tom Lane, 2021-01-13)

  Reverts a portion of commit 344190b7e. Apparently, back in the
  twentieth century we had some issues with multi-statement SQL
  functions, but they've worked fine for a long time.

  Daniel Westermann
  Discussion: https://postgr.es/m/GVAP278MB04242DCBF5E31F528D53FA18D2A90@GVAP278MB0424.CHEP278.PROD.OUTLOOK.COM
* Disallow a digit as the first character of a variable name in pgbench. (Tom Lane, 2021-01-13)

  The point of this restriction is to avoid trying to substitute
  variables into timestamp literal values, which may contain strings like
  '12:34'.

  There is a good deal more that should be done to reduce pgbench's
  tendency to substitute where it shouldn't. But this is sufficient to
  solve the case complained of by Jaime Soler, and it's simple enough to
  back-patch.

  Back-patch to v11; before commit 9d36a3866, pgbench had a slightly
  different definition of what a variable name is, and anyway it seems
  unwise to change long-stable branches for this.

  Fabien Coelho
  Discussion: https://postgr.es/m/alpine.DEB.2.22.394.2006291740420.805678@pseudo
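  As a hedged sketch of the rule (not pgbench's actual parser, which has
  its own notion of valid identifier characters), the check amounts to
  rejecting a leading digit, so that '12:34' inside a timestamp literal
  is never taken as a variable reference:

      #include <ctype.h>
      #include <stdbool.h>

      static bool
      valid_variable_name(const char *name)
      {
          if (name[0] == '\0' || isdigit((unsigned char) name[0]))
              return false;       /* empty, or starts with a digit */
          for (const char *p = name; *p; p++)
          {
              if (!isalnum((unsigned char) *p) && *p != '_')
                  return false;
          }
          return true;
      }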
* Enhance nbtree index tuple deletion. (Peter Geoghegan, 2021-01-13)

  Teach nbtree and heapam to cooperate in order to eagerly remove
  duplicate tuples representing dead MVCC versions. This is "bottom-up
  deletion". Each bottom-up deletion pass is triggered lazily in response
  to a flood of versions on an nbtree leaf page. This usually involves a
  "logically unchanged index" hint (these are produced by the executor
  mechanism added by commit 9dc718bd).

  The immediate goal of bottom-up index deletion is to avoid
  "unnecessary" page splits caused entirely by version duplicates. It
  naturally has an even more useful effect, though: it acts as a backstop
  against accumulating an excessive number of index tuple versions for
  any given _logical row_. Bottom-up index deletion complements what we
  might now call "top-down index deletion": index vacuuming performed by
  VACUUM. Bottom-up index deletion responds to the immediate local needs
  of queries, while leaving it up to autovacuum to perform infrequent
  clean sweeps of the index. The overall effect is to avoid certain
  pathological performance issues related to "version churn" from
  UPDATEs.

  The previous tableam interface used by index AMs to perform tuple
  deletion (the table_compute_xid_horizon_for_tuples() function) has been
  replaced with a new interface that supports certain new requirements.
  Many (perhaps all) of the capabilities added to nbtree by this commit
  could also be extended to other index AMs. That is left as work for a
  later commit.

  Extend deletion of LP_DEAD-marked index tuples in nbtree by adding
  logic to consider extra index tuples (that are not LP_DEAD-marked) for
  deletion in passing. This increases the number of index tuples deleted
  significantly in many cases. The LP_DEAD deletion process (which is now
  called "simple deletion" to clearly distinguish it from bottom-up
  deletion) won't usually need to visit any extra table blocks to check
  these extra tuples. We have to visit the same table blocks anyway to
  generate a latestRemovedXid value (at least in the common case where
  the index deletion operation's WAL record needs such a value).

  Testing has shown that the "extra tuples" simple deletion enhancement
  increases the number of index tuples deleted with almost any workload
  that has LP_DEAD bits set in leaf pages. That is, it almost never fails
  to delete at least a few extra index tuples. It helps most of all in
  cases that happen to naturally have a lot of delete-safe tuples. It's
  not uncommon for an individual deletion operation to end up deleting an
  order of magnitude more index tuples compared to the old naive approach
  (e.g., custom instrumentation of the patch shows that this happens
  fairly often when the regression tests are run).

  Add a further enhancement that augments simple deletion and bottom-up
  deletion in indexes that make use of deduplication: teach nbtree's
  _bt_delitems_delete() function to support granular TID deletion in
  posting list tuples. It is now possible to delete individual TIDs from
  posting list tuples, provided the TIDs have a tableam block number of a
  table block that gets visited as part of the deletion process (visiting
  the table block can be triggered directly or indirectly). Setting the
  LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
  but that matters much less now that deletion only needs to start out
  with the right _general_ idea about which index tuples are deletable.

  Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

  No bump in BTREE_VERSION, since there are no changes to the on-disk
  representation of nbtree indexes. Indexes built on PostgreSQL 12 or
  PostgreSQL 13 will automatically benefit from bottom-up index deletion
  (i.e. no reindexing required) following a pg_upgrade. The enhancement
  to simple deletion is available with all B-Tree indexes following a
  pg_upgrade, no matter what PostgreSQL version the user upgrades from.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
  Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
  Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
* Pass down "logically unchanged index" hint.Peter Geoghegan2021-01-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an executor aminsert() hint mechanism that informs index AMs that the incoming index tuple (the tuple that accompanies the hint) is not being inserted by execution of an SQL statement that logically modifies any of the index's key columns. The hint is received by indexes when an UPDATE takes place that does not apply an optimization like heapam's HOT (though only for indexes where all key columns are logically unchanged). Any index tuple that receives the hint on insert is expected to be a duplicate of at least one existing older version that is needed for the same logical row. Related versions will typically be stored on the same index page, at least within index AMs that apply the hint. Recognizing the difference between MVCC version churn duplicates and true logical row duplicates at the index AM level can help with cleanup of garbage index tuples. Cleanup can intelligently target tuples that are likely to be garbage, without wasting too many cycles on less promising tuples/pages (index pages with little or no version churn). This is infrastructure for an upcoming commit that will teach nbtree to perform bottom-up index deletion. No index AM actually applies the hint just yet. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFASTdzipdkLTNQQ@mail.gmail.com
* Log long wait time on recovery conflict when it's resolved. (Fujii Masao, 2021-01-13)

  This is a follow-up of the work done in commit 0650ff2303. This commit
  extends log_recovery_conflict_waits so that a log message is produced
  also when recovery conflict has already been resolved after
  deadlock_timeout passes, i.e., when the startup process finishes
  waiting for recovery conflict after deadlock_timeout. This is useful in
  investigating how long recovery conflicts prevented the recovery from
  applying WAL.

  Author: Fujii Masao
  Reviewed-by: Kyotaro Horiguchi, Bertrand Drouvot
  Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
* Don't use elog() in src/port/pwrite.c. (Thomas Munro, 2021-01-13)

  Nothing broke because of this oversight yet, but it would fail to link
  if we tried to use pg_pwrite() in frontend code on a system that lacks
  pwrite(). Use an assertion instead. Also pgindent while here.

  Discussion: https://postgr.es/m/CA%2BhUKGL57RvoQsS35TVPnQoPYqbtBixsdRhynB8NpcUKpHTTtg%40mail.gmail.com
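  A minimal sketch of the idea, with invented names rather than the
  actual src/port/pwrite.c code: files under src/port also link into
  frontend programs, where the backend's elog() is unavailable, so sanity
  checks there must use assertions instead:

      #include <assert.h>
      #include <sys/types.h>
      #include <unistd.h>

      /* Illustrative pwrite() emulation via lseek() + write(). */
      static ssize_t
      my_pwrite(int fd, const void *buf, size_t size, off_t offset)
      {
          off_t       rc = lseek(fd, offset, SEEK_SET);

          /* Don't elog() here: this code may be linked into frontends. */
          assert(rc == offset);
          if (rc != offset)
              return -1;
          return write(fd, buf, size);
      }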
* Fix memory leak in SnapBuildSerialize. (Amit Kapila, 2021-01-13)

  The memory for the snapshot was leaked while serializing it to disk
  during logical decoding. This memory would be freed only once the
  walsender stopped streaming the changes. This can lead to a huge memory
  increase when the master logs Standby Snapshot too frequently, say when
  the user is trying to create many replication slots.

  Reported-by: funnyxj.fxj@alibaba-inc.com
  Diagnosed-by: funnyxj.fxj@alibaba-inc.com
  Author: Amit Kapila
  Backpatch-through: 9.5
  Discussion: https://postgr.es/m/033ab54c-6393-42ee-8ec9-2b399b5d8cde.funnyxj.fxj@alibaba-inc.com
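  The shape of the leak, as a hedged sketch with stand-in names (the
  actual code works with palloc'd memory inside the walsender): scratch
  memory used only to build the on-disk form must be freed once written,
  rather than lingering for the life of the process:

      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      static int
      serialize_blob(int fd, const void *snap, size_t size)
      {
          char       *buf = malloc(size);     /* scratch on-disk copy */

          if (buf == NULL)
              return -1;
          memcpy(buf, snap, size);
          if (write(fd, buf, size) != (ssize_t) size)
          {
              free(buf);
              return -1;
          }
          free(buf);              /* the fix: release the scratch copy */
          return 0;
      }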
* Optimize DropRelFileNodesAllBuffers() for recovery. (Amit Kapila, 2021-01-13)

  Similar to commit d6ad34f341, this patch optimizes
  DropRelFileNodesAllBuffers() by avoiding the complete buffer pool scan;
  instead, it finds the buffers to be invalidated by doing lookups in the
  BufMapping table. This optimization helps operations where the relation
  files need to be removed, like Truncate, Drop, Abort of Create Table,
  etc.

  Author: Kirk Jamison
  Reviewed-by: Kyotaro Horiguchi, Takayuki Tsunakawa, and Amit Kapila
  Tested-by: Haiying Tang
  Discussion: https://postgr.es/m/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
* Fix routine name in comment of catcache.c (Michael Paquier, 2021-01-13)

  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/CALj2ACUDXLAkf_XxQO9tAUtnTNGi3Lmd8fANd+vBJbcHn1HoWA@mail.gmail.com
* Invent struct ReindexIndexInfo (Alvaro Herrera, 2021-01-12)

  This struct is used by ReindexRelationConcurrently to keep track of the
  relations to process. This saves having to obtain some data repeatedly,
  and has future uses as well.

  Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
  Reviewed-by: Hamid Akhtar <hamid.akhtar@gmail.com>
  Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
  Discussion: https://postgr.es/m/20201130195439.GA24598@alvherre.pgsql
* pg_dump: label INDEX ATTACH ArchiveEntries with an owner. (Tom Lane, 2021-01-12)

  Although a partitioned index's attachment to its parent doesn't have
  separate ownership, the ArchiveEntry for it needs to be marked with an
  owner anyway, to ensure that the ALTER command is run by the
  appropriate role when restoring with --use-set-session-authorization.
  Without this, the ALTER will be run by the role that started the
  restore session, which will usually work, but it's formally the wrong
  thing.

  Back-patch to v11 where this type of ArchiveEntry was added. In HEAD,
  add equivalent commentary to the just-added TABLE ATTACH case, which
  I'd made do the right thing already.

  Discussion: https://postgr.es/m/1094034.1610418498@sss.pgh.pa.us
* Fix thinko in comment (Alvaro Herrera, 2021-01-12)

  This comment has been wrong since its introduction in commit
  2c03216d8311.

  Author: Masahiko Sawada <sawada.mshk@gmail.com>
  Discussion: https://postgr.es/m/CAD21AoAzz6qipFJBbGEaHmyWxvvNDp8httbwLR9tUQWaTjUs2Q@mail.gmail.com
* Fix relation descriptor leak. (Amit Kapila, 2021-01-12)

  We missed closing the relation descriptor while sending changes via the
  root of partitioned relations during logical replication.

  Author: Amit Langote and Mark Zhao
  Reviewed-by: Amit Kapila and Ashutosh Bapat
  Backpatch-through: 13, where it was introduced
  Discussion: https://postgr.es/m/tencent_41FEA657C206F19AB4F406BE9252A0F69C06@qq.com
  Discussion: https://postgr.es/m/tencent_6E296D2F7D70AFC90D83353B69187C3AA507@qq.com
* Optimize DropRelFileNodeBuffers() for recovery. (Amit Kapila, 2021-01-12)

  The recovery path of DropRelFileNodeBuffers() is optimized so that
  scanning of the whole buffer pool can be avoided when the number of
  blocks to be truncated in a relation is below a certain threshold. For
  such cases, we find the buffers by doing lookups in the BufMapping
  table. This improves the performance by more than 100 times in many
  cases when several small tables (tested with 1000 relations) are
  truncated and where the server is configured with a large value of
  shared buffers (greater than or equal to 100GB).

  This optimization helps cases (a) when vacuum or autovacuum truncated
  off any of the empty pages at the end of a relation, or (b) when the
  relation is truncated in the same transaction in which it was created.

  This commit introduces a new API smgrnblocks_cached which returns a
  cached value for the number of blocks in a relation fork. This helps us
  to determine the exact size of the relation, which is required to apply
  this optimization. The exact size is required to ensure that we don't
  leave any buffer for the relation being dropped, as otherwise the
  background writer or checkpointer can lead to a PANIC error while
  flushing buffers corresponding to files that don't exist.

  Author: Kirk Jamison, based on ideas by Amit Kapila
  Reviewed-by: Kyotaro Horiguchi, Takayuki Tsunakawa, and Amit Kapila
  Tested-by: Haiying Tang
  Discussion: https://postgr.es/m/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
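  A hedged sketch of the decision logic (the threshold constant, types,
  and names are invented here; the real code probes the shared buffer
  mapping hash by BufferTag):

      #include <stdint.h>

      typedef uint32_t BlockNumber;
      #define InvalidBlockNumber  ((BlockNumber) 0xFFFFFFFF)
      #define BUF_DROP_THRESHOLD  64      /* illustrative, not PG's value */

      static void
      drop_rel_buffers(BlockNumber cached_nblocks, BlockNumber first_del)
      {
          if (cached_nblocks != InvalidBlockNumber &&
              cached_nblocks - first_del <= BUF_DROP_THRESHOLD)
          {
              /*
               * Size known exactly and small: probe the buffer mapping
               * hash once per to-be-dropped block, invalidating any hits.
               */
              for (BlockNumber blk = first_del; blk < cached_nblocks; blk++)
              {
                  /* look up (relfilenode, fork, blk); invalidate if found */
              }
          }
          else
          {
              /*
               * Size unknown or large: scan every shared buffer, as the
               * unoptimized path always did.
               */
          }
      }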
* Dump ALTER TABLE ... ATTACH PARTITION as a separate ArchiveEntry. (Tom Lane, 2021-01-11)

  Previously, we emitted the ATTACH PARTITION command as part of the
  child table's ArchiveEntry. This was a poor choice since it complicates
  restoring the partition as a standalone table; you have to ignore the
  error from the ATTACH, which isn't even an option when restoring
  direct-to-database with pg_restore. (pg_restore will issue the whole
  ArchiveEntry as one PQexec, so that any error rolls back the table
  creation as well.)

  Hence, separate it out as its own ArchiveEntry, as indeed we already
  did for index ATTACH PARTITION commands.

  Justin Pryzby
  Discussion: https://postgr.es/m/20201023052940.GE9241@telsasoft.com
* Make pg_dump's table of object-type priorities more maintainable. (Tom Lane, 2021-01-11)

  Wedging a new object type into this table has historically required
  manually renumbering a lot of existing entries. (Although it appears
  that some people got lazy and re-used the priority level of an existing
  object type, even if it wasn't particularly related.) We can let the
  compiler do the counting by inventing an enum type that lists the
  desired priority levels in order. Now, if you want to add or remove a
  priority level, that's a one-liner.

  This patch is not purely cosmetic, because I split apart the priorities
  of DO_COLLATION and DO_TRANSFORM, as well as those of DO_ACCESS_METHOD
  and DO_OPERATOR, which look to me to have been merged out of expediency
  rather than because it was a good idea. Shell types continue to be
  sorted interchangeably with full types, and opclasses interchangeably
  with opfamilies.
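  A hedged sketch of the technique (the enumerator names are invented,
  not pg_dump's actual list): let the compiler assign consecutive
  priority values, so adding or removing a level is a one-line edit:

      typedef enum dumpObjPriority
      {
          PRIO_NAMESPACE = 1,     /* sort earliest */
          PRIO_COLLATION,
          PRIO_TRANSFORM,         /* now distinct from collations */
          PRIO_TYPE,              /* shell types sort with full types */
          PRIO_ACCESS_METHOD,
          PRIO_OPERATOR,          /* now distinct from access methods */
          PRIO_TABLE,
          PRIO_INDEX,
          PRIO_DEFAULT_ACL        /* sort last */
      } dumpObjPriority;

      /* The per-object-type table then references the enum: */
      static const int objTypePriority[] = {
          PRIO_NAMESPACE,         /* e.g. DO_NAMESPACE */
          PRIO_TYPE,              /* e.g. DO_TYPE */
          /* ... one entry per object type ... */
      };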
* Fix function prototypes in dependency.h. (Thomas Munro, 2021-01-12)

  Commit 257836a7 accidentally deleted a couple of
  redundant-but-conventional "extern" keywords on function prototypes.
  Put them back.

  Reported-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
* Rethink SQLSTATE code for ERRCODE_IDLE_SESSION_TIMEOUT. (Tom Lane, 2021-01-11)

  Move it to class 57 (Operator Intervention), which seems like a better
  choice given that from the client's standpoint it behaves a heck of a
  lot like, e.g., ERRCODE_ADMIN_SHUTDOWN.

  In a green field I'd put ERRCODE_IDLE_IN_TRANSACTION_SESSION_TIMEOUT
  here as well. But that's been around for a few years, so it's probably
  too late to change its SQLSTATE code.

  Discussion: https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
* Try next host after a "cannot connect now" failure. (Tom Lane, 2021-01-11)

  If a server returns ERRCODE_CANNOT_CONNECT_NOW, try the next host, if
  multiple host names have been provided. This allows dealing gracefully
  with standby servers that might not be in hot standby mode yet.

  In the wake of the preceding commit, it might be plausible to retry
  many more error cases than we do now, but I (tgl) am hesitant to move
  too aggressively on that --- it's not clear it'd be desirable for cases
  such as bad-password, for example. But this case seems safe enough.

  Hubert Zhang, reviewed by Takayuki Tsunakawa
  Discussion: https://postgr.es/m/BN6PR05MB3492948E4FD76C156E747E8BC9160@BN6PR05MB3492.namprd05.prod.outlook.com
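  In sketch form (simplified; not libpq's actual control flow), the test
  amounts to recognizing SQLSTATE 57P03, the code behind
  ERRCODE_CANNOT_CONNECT_NOW, and advancing to the next candidate host:

      #include <stdbool.h>
      #include <string.h>

      static bool
      should_try_next_host(const char *sqlstate)
      {
          /* 57P03 = "cannot connect now", e.g. a standby still starting */
          return sqlstate != NULL && strcmp(sqlstate, "57P03") == 0;
      }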
* Uniformly identify the target host in libpq connection failure reports. (Tom Lane, 2021-01-11)

  Prefix "could not connect to host-or-socket-path:" to all connection
  failure cases that occur after the socket() call, and remove the ad-hoc
  server identity data that was appended to a few of these messages. This
  should produce much more intelligible error reports in
  multiple-target-host situations, especially for error cases that are
  off the beaten track to any degree (because none of those provided any
  server identity info).

  As an example of the change, formerly a connection attempt with a bad
  port number such as "psql -p 12345 -h localhost,/tmp" might produce

      psql: error: could not connect to server: Connection refused
              Is the server running on host "localhost" (::1) and accepting
              TCP/IP connections on port 12345?
      could not connect to server: Connection refused
              Is the server running on host "localhost" (127.0.0.1) and accepting
              TCP/IP connections on port 12345?
      could not connect to server: No such file or directory
              Is the server running locally and accepting
              connections on Unix domain socket "/tmp/.s.PGSQL.12345"?

  Now it looks like

      psql: error: could not connect to host "localhost" (::1), port 12345: Connection refused
              Is the server running on that host and accepting TCP/IP connections?
      could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
              Is the server running on that host and accepting TCP/IP connections?
      could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory
              Is the server running locally and accepting connections on that socket?

  This requires adjusting a couple of regression tests to allow for
  variation in the contents of a connection failure message.

  Discussion: https://postgr.es/m/BN6PR05MB3492948E4FD76C156E747E8BC9160@BN6PR05MB3492.namprd05.prod.outlook.com
* Allow pg_regress.c wrappers to postprocess test result files. (Tom Lane, 2021-01-11)

  Add an optional callback to regression_main() that, if provided, is
  invoked on each test output file before we try to compare it to the
  expected-result file. The main and isolation test programs don't need
  this (yet).

  In pg_regress_ecpg, add a filter that eliminates target-host details
  from "could not connect" error reports. This filter doesn't do anything
  as of this commit, but it will be needed by the next one.

  In the long run we might want to provide some more general, perhaps
  pattern-based, filtering mechanism for test output. For now, this will
  solve the immediate problem.

  Discussion: https://postgr.es/m/BN6PR05MB3492948E4FD76C156E747E8BC9160@BN6PR05MB3492.namprd05.prod.outlook.com
* In libpq, always append new error messages to conn->errorMessage. (Tom Lane, 2021-01-11)

  Previously, we had an undisciplined mish-mash of printfPQExpBuffer and
  appendPQExpBuffer calls to report errors within libpq. This commit
  establishes a uniform rule that appendPQExpBuffer[Str] should be used.
  conn->errorMessage is reset only at the start of an application
  request, and then accumulates messages till we're done. We can remove
  no less than three different ad-hoc mechanisms that were used to get
  the effect of concatenation of error messages within a sequence of
  operations.

  Although this makes things quite a bit cleaner conceptually, the main
  reason to do it is to make the world safer for the multiple-target-host
  feature that was added awhile back. Previously, there were many cases
  in which an error occurring during an individual host connection
  attempt would wipe out the record of what had happened during previous
  attempts. (The reporting is still inadequate, in that it can be hard to
  tell which host got the failure, but that seems like a matter for a
  separate commit.)

  Currently, lo_import and lo_export contain exceptions to the "never use
  printfPQExpBuffer" rule. If we changed them, we'd risk reporting an
  incidental lo_close failure before the actual read or write failure,
  which would be confusing, not least because lo_close happened after the
  main failure. We could improve this by inventing an internal version of
  lo_close that doesn't reset the errorMessage; but we'd also need a
  version of PQfn() that does that, and it didn't quite seem worth the
  trouble for now.

  Discussion: https://postgr.es/m/BN6PR05MB3492948E4FD76C156E747E8BC9160@BN6PR05MB3492.namprd05.prod.outlook.com
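  A hedged sketch of the convention using libpq's PQExpBuffer primitives
  (the helper names are invented; this is not the committed code):

      #include "pqexpbuffer.h"    /* libpq's string-buffer facility */

      /* Reset only here, once per application-level request. */
      static void
      begin_request(PQExpBuffer errbuf)
      {
          resetPQExpBuffer(errbuf);
      }

      /* Everywhere else: append, so earlier hosts' errors survive. */
      static void
      note_failure(PQExpBuffer errbuf, const char *host, const char *why)
      {
          appendPQExpBuffer(errbuf, "could not connect to host \"%s\": %s\n",
                            host, why);
      }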
* Use vectored I/O to fill new WAL segments. (Thomas Munro, 2021-01-11)

  Instead of making many block-sized write() calls to fill a new WAL file
  with zeroes, make a smaller number of pwritev() calls (or various
  emulations). The actual number depends on the OS's IOV_MAX, which
  PG_IOV_MAX currently caps at 32. That means we'll write 256kB per call
  on typical systems. We may want to tune the number later with more
  experience.

  Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
  Reviewed-by: Andres Freund <andres@anarazel.de>
  Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
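  A hedged sketch of the zero-fill pattern (constants and names are
  illustrative; short writes are left to a retry wrapper such as the one
  in the following commit, and the segment size is assumed to be a
  multiple of the block size):

      #include <string.h>
      #include <sys/uio.h>
      #include <unistd.h>

      #define MY_BLCKSZ   8192
      #define MY_IOV_MAX  32      /* cf. PG_IOV_MAX */

      static int
      zero_fill_file(int fd, size_t segsize)
      {
          static const char zbuf[MY_BLCKSZ];  /* one shared zeroed block */
          struct iovec iov[MY_IOV_MAX];
          size_t      done = 0;

          /* Point every iovec at the same zeroed block. */
          for (int i = 0; i < MY_IOV_MAX; i++)
          {
              iov[i].iov_base = (void *) zbuf;
              iov[i].iov_len = MY_BLCKSZ;
          }
          /* 32 iovecs x 8kB = 256kB written per system call. */
          while (done < segsize)
          {
              size_t  blocks_left = (segsize - done) / MY_BLCKSZ;
              int     cnt = blocks_left < MY_IOV_MAX ? (int) blocks_left
                                                     : MY_IOV_MAX;
              ssize_t rc = pwritev(fd, iov, cnt, (off_t) done);

              if (rc <= 0)
                  return -1;      /* error or no progress */
              done += (size_t) rc;
          }
          return 0;
      }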
* Provide pg_preadv() and pg_pwritev(). (Thomas Munro, 2021-01-11)

  Provide synchronous vectored file I/O routines. These map to preadv()
  and pwritev(), with fallback implementations for systems that don't
  have them. Also provide a wrapper pg_pwritev_with_retry() that
  automatically retries on short writes.

  Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
  Reviewed-by: Andres Freund <andres@anarazel.de>
  Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
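  A hedged sketch of a short-write retry loop in the spirit of
  pg_pwritev_with_retry() (simplified, with an invented name; the
  caller's iovec array is modified as it is consumed):

      #include <sys/uio.h>
      #include <unistd.h>

      static ssize_t
      pwritev_all(int fd, struct iovec *iov, int iovcnt, off_t offset)
      {
          ssize_t total = 0;

          while (iovcnt > 0)
          {
              ssize_t rc = pwritev(fd, iov, iovcnt, offset);

              if (rc < 0)
                  return -1;
              if (rc == 0)
                  break;          /* shouldn't happen; avoid spinning */
              total += rc;
              offset += rc;
              /* Drop iovecs that were written completely... */
              while (iovcnt > 0 && (size_t) rc >= iov->iov_len)
              {
                  rc -= (ssize_t) iov->iov_len;
                  iov++;
                  iovcnt--;
              }
              /* ...and trim a partially-written one before retrying. */
              if (iovcnt > 0 && rc > 0)
              {
                  iov->iov_base = (char *) iov->iov_base + rc;
                  iov->iov_len -= (size_t) rc;
              }
          }
          return total;
      }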
* Fix plpgsql tests for debug_invalidate_system_caches_always. (Tom Lane, 2021-01-08)

  Commit c9d529848 resulted in having a couple more places where the
  error context stack for a failure varies depending on
  debug_invalidate_system_caches_always (nee CLOBBER_CACHE_ALWAYS). This
  is not very surprising, since we have to re-parse cached plans if the
  plan cache is clobbered. Stabilize the expected test output by hiding
  the context stack in these places, as we've done elsewhere in this test
  script.

  (Another idea worth considering, now that we have
  debug_invalidate_system_caches_always, is to force it to zero for these
  test cases. That seems like it'd risk reducing the coverage of
  cache-clobber testing, which might or might not be worth it to be able
  to verify that we get the expected error output in normal cases. For
  the moment I just stuck with the existing technique.)

  In passing, update comments that referred to CLOBBER_CACHE_ALWAYS.

  Per buildfarm member hyrax.
* Fix ancient bug in parsing of BRE-mode regular expressions. (Tom Lane, 2021-01-08)

  brenext(), when parsing a '*' quantifier, forgot to return any "value"
  for the token; per the equivalent case in next(), it should return
  value 1 to indicate that greedy rather than non-greedy behavior is
  wanted. The result is that the compiled regexp could behave like 'x*?'
  rather than the intended 'x*', if we were unlucky enough to have a zero
  in v->nextvalue at this point. That seems to happen with some
  reliability if we have '.*' at the beginning of a BRE-mode regexp,
  although that depends on the initial contents of a stack-allocated
  struct, so it's not guaranteed to fail.

  Found by Alexander Lakhin using valgrind testing. This bug seems to be
  aboriginal in Spencer's code, so back-patch all the way.

  Discussion: https://postgr.es/m/16814-6c5e3edd2bdf0d50@postgresql.org
* Fix and simplify some code related to cryptohashes (Michael Paquier, 2021-01-08)

  This commit addresses two issues:

  - In pgcrypto, MD5 computation called pg_cryptohash_{init,update,final}
    without checking for the result status.

  - Simplify pg_checksum_raw_context to use only one variable for all the
    SHA2 options available in checksum manifests.

  Reported-by: Heikki Linnakangas
  Discussion: https://postgr.es/m/f62f26bb-47a5-8411-46e5-4350823e06a5@iki.fi
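  A hedged sketch of the corrected calling pattern (the pg_cryptohash_*
  signatures are abridged as understood around this commit, and the
  wrapper name is invented; the checked returns are the point):

      /* Assumes PostgreSQL's src/common/cryptohash.h, circa this commit. */
      #include "common/cryptohash.h"

      static int
      md5_checked(const uint8 *data, size_t len, uint8 *out)
      {
          pg_cryptohash_ctx *ctx = pg_cryptohash_create(PG_MD5);

          if (ctx == NULL)
              return -1;
          if (pg_cryptohash_init(ctx) < 0 ||
              pg_cryptohash_update(ctx, data, len) < 0 ||
              pg_cryptohash_final(ctx, out) < 0)
          {
              pg_cryptohash_free(ctx);
              return -1;          /* previously these returns were ignored */
          }
          pg_cryptohash_free(ctx);
          return 0;
      }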
* Adjust createdb TAP tests to work on recent OpenBSD. (Tom Lane, 2021-01-07)

  We found last February that the error-case tests added by commit
  008cf0409 failed on OpenBSD, because that platform doesn't really check
  locale names. At the time it seemed that that was only an issue for
  LC_CTYPE, but testing on a more recent version of OpenBSD shows that
  it's now equally lax about LC_COLLATE.

  Rather than dropping the LC_COLLATE test too, put back LC_CTYPE
  (reverting c4b0edb07), and adjust these tests to accept the different
  error message that we get if setlocale() doesn't reject a bogus locale
  name. The point of these tests is not really what the backend does with
  the locale name, but to show that createdb quotes funny locale names
  safely; so we're not losing test reliability this way.

  Back-patch as appropriate.

  Discussion: https://postgr.es/m/231373.1610058324@sss.pgh.pa.us
* Further second thoughts about idle_session_timeout patch. (Tom Lane, 2021-01-07)

  On reflection, the order of operations in PostgresMain() is wrong.
  These timeouts ought to be shut down before, not after, we do the
  post-command-read CHECK_FOR_INTERRUPTS, to guarantee that any timeout
  error will be detected there rather than at some ill-defined later
  point (possibly after having wasted a lot of work).

  This is really an error in the original idle_in_transaction_timeout
  patch, so back-patch to 9.6 where that was introduced.
* Add GUC to log long wait times on recovery conflicts. (Fujii Masao, 2021-01-08)

  This commit adds GUC log_recovery_conflict_waits that controls whether
  a log message is produced when the startup process is waiting longer
  than deadlock_timeout for recovery conflicts. This is useful in
  determining if recovery conflicts prevent the recovery from applying
  WAL.

  Note that currently a log message is produced only when recovery
  conflict has not been resolved yet even after deadlock_timeout passes,
  i.e., only when the startup process is still waiting for recovery
  conflict even after deadlock_timeout.

  Author: Bertrand Drouvot, Masahiko Sawada
  Reviewed-by: Alvaro Herrera, Kyotaro Horiguchi, Fujii Masao
  Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
* Fix bogus link in test comments. (Tom Lane, 2021-01-06)

  I apparently copied-and-pasted the wrong link in commit ca8217c10.
  Point it where it was meant to go.
* Improve commentary in timeout.c. (Tom Lane, 2021-01-06)

  On re-reading I realized that I'd missed one race condition in the new
  timeout code. It's safe, but add a comment explaining it.

  Discussion: https://postgr.es/m/CA+hUKG+o6pbuHBJSGnud=TadsuXySWA7CCcPgCt2QE9F6_4iHQ@mail.gmail.com
* Fix allocation logic of cryptohash context data with OpenSSL (Michael Paquier, 2021-01-07)

  The allocation of the cryptohash context data when building with
  OpenSSL was happening in the memory context of the caller of
  pg_cryptohash_create(), which could lead to issues with resowner
  cleanup if cascading resources are cleaned up on an error. Like other
  facilities using resowners, move the base allocation to
  TopMemoryContext to ensure a correct cleanup on failure.

  The resulting code gets simpler with this commit, as the context data
  is now held by a unique opaque pointer, so there is only a single
  allocation done in TopMemoryContext.

  After discussion, also change the cryptohash subroutines to return an
  error if the caller provides NULL for the context data, to ease error
  detection on OOM.

  Author: Heikki Linnakangas
  Discussion: https://postgr.es/m/X9xbuEoiU3dlImfa@paquier.xyz
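  In sketch form (backend-only code, simplified, with a stand-in struct
  name): the allocation tracked by the resource owner now lives in
  TopMemoryContext, so an error-time cleanup of the caller's context
  can't free it out from under the resowner machinery:

      /* Assumes backend headers; illustration, not the committed code. */
      #include "postgres.h"
      #include "utils/memutils.h"

      typedef struct my_hash_ctx
      {
          void   *ssl_state;      /* e.g. an OpenSSL EVP_MD_CTX pointer */
      } my_hash_ctx;

      static my_hash_ctx *
      create_hash_ctx(void)
      {
          /* Explicitly TopMemoryContext, not CurrentMemoryContext. */
          my_hash_ctx *ctx = (my_hash_ctx *)
              MemoryContextAllocZero(TopMemoryContext, sizeof(my_hash_ctx));

          /* ... register with the current resource owner here ... */
          return ctx;
      }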
* Add idle_session_timeout. (Tom Lane, 2021-01-06)

  This GUC variable works much like idle_in_transaction_session_timeout,
  in that it kills sessions that have waited too long for a new client
  query. But it applies when we're not in a transaction, rather than when
  we are.

  Li Japin, reviewed by David Johnston and Hayato Kuroda, some fixes by
  me
  Discussion: https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
* Improve timeout.c's handling of repeated timeout set/cancel. (Tom Lane, 2021-01-06)

  A very common usage pattern is that we set a timeout that we don't
  expect to reach, cancel it after a little bit, and later repeat. With
  the original implementation of timeout.c, this results in one
  setitimer() call per timeout set or cancel. We can do a lot better by
  being lazy about changing the timeout interrupt request, namely:

  (1) never cancel the outstanding interrupt, even when we have no active
  timeout events;

  (2) if we need to set an interrupt, but there already is one pending at
  or before the required time, leave it alone. When the interrupt
  happens, the signal handler will reschedule it at whatever time is then
  needed.

  For example, with a one-second setting for statement_timeout, this
  method results in having to interact with the kernel only a little more
  than once a second, no matter how many statements we execute in
  between. The mainline code might never call setitimer() at all after
  the first time, while each time the signal handler fires, it sees that
  the then-pending request is most of a second away, and that's when it
  sets the next interrupt request for. Each mainline timeout-set request
  after that will observe that the time it wants is past the pending
  interrupt request time, and do nothing.

  This also works pretty well for cases where a few different timeout
  lengths are in use, as long as none of them are very short. But that
  describes our usage well.

  Idea and original patch by Thomas Munro; I fixed a race condition and
  improved the comments.

  Discussion: https://postgr.es/m/CA+hUKG+o6pbuHBJSGnud=TadsuXySWA7CCcPgCt2QE9F6_4iHQ@mail.gmail.com
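  A hedged sketch of rule (2) above (simplified; timeout.c's real
  bookkeeping is richer, and this uses the BSD timercmp()/timersub()
  macros): re-arm the kernel timer only when the wanted deadline is
  earlier than the interrupt already scheduled:

      #include <stdbool.h>
      #include <sys/time.h>

      static struct timeval signal_due_at;    /* when SIGALRM will fire */
      static bool           signal_pending = false;

      static int
      schedule_alarm(struct timeval now, struct timeval wanted)
      {
          struct itimerval it = {0};

          /*
           * An interrupt at or before the wanted time is already on the
           * way; its handler will reschedule whatever is needed then.
           */
          if (signal_pending && !timercmp(&wanted, &signal_due_at, <))
              return 0;
          timersub(&wanted, &now, &it.it_value);
          signal_due_at = wanted;
          signal_pending = true;
          return setitimer(ITIMER_REAL, &it, NULL);
      }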
* Report progress of COPY commands (Tomas Vondra, 2021-01-06)

  This commit introduces a view pg_stat_progress_copy, reporting progress
  of COPY commands. This allows rough estimates of how far a running COPY
  has progressed, with the caveat that the total number of bytes may not
  be available in some cases (e.g. when the input comes from the client).

  Author: Josef Šimánek
  Reviewed-by: Fujii Masao, Bharath Rupireddy, Vignesh C, Matthias van de
  Meent
  Discussion: https://postgr.es/m/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com
  Discussion: https://postgr.es/m/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg+FEKet4XaTGSW=LitKQ@mail.gmail.com
* Add a test module for the regular expression package. (Tom Lane, 2021-01-06)

  This module provides a function test_regex() that is functionally
  rather like regexp_matches(), but with additional debugging-oriented
  options and additional output. The debug options are somewhat obscure;
  they are chosen to match the API of the test harness that Henry Spencer
  wrote way-back-when for use in Tcl. With this, we can import all the
  test cases that Spencer wrote originally, even for regex functionality
  that we don't currently expose in Postgres. This seems necessary
  because we can no longer rely on Tcl to act as upstream and verify any
  fixes or improvements that we make.

  In addition to Spencer's tests, I added a few for lookbehind
  constraints (which we added in 2015, and Tcl still hasn't absorbed)
  that are modeled on his tests for lookahead constraints. After looking
  at code coverage reports, I also threw in a couple of tests to more
  fully exercise our "high colormap" logic.

  According to my testing, this brings the check-world coverage for
  src/backend/regex/ from 71.1% to 86.7% of lines.
  (coverage.postgresql.org shows a slightly different number, which I
  think is because it measures a non-assert build.)

  Discussion: https://postgr.es/m/2873268.1609732164@sss.pgh.pa.us
* Replace CLOBBER_CACHE_ALWAYS with run-time GUC (Peter Eisentraut, 2021-01-06)

  Forced cache invalidation (CLOBBER_CACHE_ALWAYS) has been impractical
  to use for testing in PostgreSQL because it's so slow and because it's
  toggled on/off only at build time. It is helpful when hunting bugs in
  any code that uses the syscache/relcache, because it causes cache
  invalidations to be injected whenever it would be possible for an
  invalidation to occur, whether or not one was really pending.

  Address this by providing run-time control over cache clobber behaviour
  using the new debug_invalidate_system_caches_always GUC. Support is not
  compiled in at all unless assertions are enabled or
  CLOBBER_CACHE_ENABLED is explicitly defined at compile time. It
  defaults to 0 if compiled in, so it has negligible effect on assert
  build performance by default.

  When support is compiled in, test code can now set
  debug_invalidate_system_caches_always=1 locally to a backend to test
  specific queries, functions, extensions, etc. Or tests can toggle it
  globally for a specific test case while retaining normal performance
  during test setup and teardown.

  For backwards compatibility with existing test harnesses and scripts,
  debug_invalidate_system_caches_always defaults to 1 if
  CLOBBER_CACHE_ALWAYS is defined, and to 3 if CLOBBER_CACHE_RECURSIVE is
  defined.

  CLOBBER_CACHE_ENABLED is now visible in pg_config_manual.h, as is the
  related RECOVER_RELATION_BUILD_MEMORY setting for the relcache.

  Author: Craig Ringer <craig.ringer@2ndquadrant.com>
  Discussion: https://www.postgresql.org/message-id/flat/CAMsr+YF=+ctXBZj3ywmvKNUjWpxmuTuUKuv-rgbHGX5i5pLstQ@mail.gmail.com
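  A hedged sketch of the injection idea (placement and helper name are
  simplified; the real hook lives in the cache-invalidation code and also
  supports the recursive level):

      /* Assumes backend headers; illustration, not the committed code. */
      #ifdef CLOBBER_CACHE_ENABLED
      extern int  debug_invalidate_system_caches_always;
      extern void InvalidateSystemCaches(void);

      static void
      maybe_clobber_caches(void)
      {
          /* 0 = off; nonzero = act as if an invalidation just arrived */
          if (debug_invalidate_system_caches_always > 0)
              InvalidateSystemCaches();
      }
      #endif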
* Detect the deadlocks between backends and the startup process. (Fujii Masao, 2021-01-06)

  The deadlocks that the recovery conflict on lock is involved in can
  happen between hot-standby backends and the startup process. If a
  backend takes an access exclusive lock on the table and that finally
  triggers the deadlock, that deadlock can be detected as expected. On
  the other hand, previously, if the startup process took an access
  exclusive lock and that finally triggered the deadlock, that deadlock
  could not be detected and could remain even after deadlock_timeout
  passed. This is a bug.

  The cause of this bug was that the code for handling the recovery
  conflict on lock didn't take care of the deadlock case at all. It
  assumed that deadlocks involving the startup process and backends were
  able to be detected by the deadlock detector invoked within backends.
  But this assumption was incorrect. The startup process also should have
  invoked the deadlock detector if necessary.

  To fix this bug, this commit makes the startup process invoke the
  deadlock detector if deadlock_timeout is reached while handling the
  recovery conflict on lock. Specifically, in that case, the startup
  process requests all the backends holding the conflicting locks to
  check themselves for deadlocks.

  Back-patch to v9.6. v9.5 also has this bug, but per discussion we
  decided not to back-patch the fix to v9.5, because v9.5 doesn't have
  some infrastructure code (e.g., 37c54863cf) that this bug fix patch
  depends on. We could apply that code for the back-patch, but since the
  next minor version release is the final one for v9.5, it's risky to do
  that. If we unexpectedly introduced a new bug to v9.5 by the
  back-patch, there would be no chance to fix it. We determined that the
  back-patch to v9.5 would give more risk than gain.

  Author: Fujii Masao
  Reviewed-by: Bertrand Drouvot, Masahiko Sawada, Kyotaro Horiguchi
  Discussion: https://postgr.es/m/4041d6b6-cf24-a120-36fa-1294220f8243@oss.nttdata.com
* Fix typos in decode.c and logical.c. (Amit Kapila, 2021-01-06)

  Per report by Ajin Cherian in email:
  https://postgr.es/m/CAFPTHDYnRKDvzgDxoMn_CKqXA-D0MtrbyJvfvjBsO4G=UHDXkg@mail.gmail.com
* Promote --data-checksums to the common set of options in initdb --help (Michael Paquier, 2021-01-06)

  This was previously part of the section dedicated to less common
  options, but it is an option commonly used these days.

  Author: Michael Banck
  Reviewed-by: Stephen Frost, Michael Paquier
  Discussion: https://postgr.es/m/d7938aca4d4ea8e8c72c33bd75efe9f8218fe390.camel@credativ.de
* Revert unstable test cases from commit 7d80441d2. (Tom Lane, 2021-01-05)

  I momentarily forgot that the "owner" column wouldn't be stable in the
  buildfarm. Oh well, these tests weren't very valuable anyway.

  Discussion: https://postgr.es/m/20201130165436.GX24052@telsasoft.com
* Allow psql's \dt and \di to show TOAST tables and their indexes. (Tom Lane, 2021-01-05)

  Formerly, TOAST objects were unconditionally suppressed, but since \d
  is able to print them it's not very clear why these variants should
  not. Instead, use the same rules as for system catalogs: they can be
  seen if you write the 'S' modifier or a table name pattern. (In
  practice, since hardly anybody would keep pg_toast in their
  search_path, it's really down to whether you use a pattern that can
  match pg_toast.*.)

  No docs change seems necessary because the docs already say that this
  happens for "system objects"; we're just classifying TOAST tables as
  being that.

  Justin Pryzby, reviewed by Laurenz Albe
  Discussion: https://postgr.es/m/20201130165436.GX24052@telsasoft.com
* Introduce a new GUC_REPORT setting "in_hot_standby". (Tom Lane, 2021-01-05)

  Aside from being queriable via SHOW, this value is sent to the client
  immediately at session startup, and again later on if the server gets
  promoted to primary during the session. The immediate report will be
  used in an upcoming patch to avoid an extra round trip when trying to
  connect to a primary server.

  Haribabu Kommi, Greg Nancarrow, Tom Lane; reviewed at various times by
  Laurenz Albe, Takayuki Tsunakawa, Peter Smith.
  Discussion: https://postgr.es/m/CAF3+xM+8-ztOkaV9gHiJ3wfgENTq97QcjXQt+rbFQ6F7oNzt9A@mail.gmail.com