aboutsummaryrefslogtreecommitdiff
path: root/src
Commit message (Collapse)AuthorAge
...
* Rewrite the GiST insertion logic so that we don't need the post-recoveryHeikki Linnakangas2010-12-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cleanup stage to finish incomplete inserts or splits anymore. There was two reasons for the cleanup step: 1. When a new tuple was inserted to a leaf page, the downlink in the parent needed to be updated to contain (ie. to be consistent with) the new key. Updating the parent in turn might require recursively updating the parent of the parent. We now handle that by updating the parent while traversing down the tree, so that when we insert the leaf tuple, all the parents are already consistent with the new key, and the tree is consistent at every step. 2. When a page is split, we need to insert the downlink for the new right page(s), and update the downlink for the original page to not include keys that moved to the right page(s). We now handle that by setting a new flag, F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is set, scans always follow the rightlink, regardless of the NSN mechanism used to detect concurrent page splits. That way the tree is consistent right after split, even though the downlink is still missing. This is very similar to the way B-tree splits are handled. When the downlink is inserted in the parent, the flag is cleared. To keep the insertion algorithm simple, when an insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it finishes the split before doing anything else. These changes allow removing the whole "invalid tuple" mechanism, but I retained the scan code to still follow invalid tuples correctly. While we don't create any such tuples anymore, we want to handle them gracefully in case you pg_upgrade a GiST index that has them. If we encounter any on an insert, though, we just throw an error saying that you need to REINDEX. The issue that got me into doing this is that if you did a checkpoint while an insert or split was in progress, and the checkpoint finishes quickly so that there is no WAL record related to the insert between RedoRecPtr and the checkpoint record, recovery from that checkpoint would not know to finish the incomplete insert. IOW, we have the same issue we solved with the rm_safe_restartpoint mechanism during normal operation too. It's highly unlikely to happen in practice, and this fix is far too large to backpatch, so we're just going to live with in previous versions, but this refactoring fixes it going forward. With this patch, you don't get the annoying 'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices anymore if you crash at an unfortunate moment.
* Add PQlibVersion() function to libpqMagnus Hagander2010-12-22
| | | | | | | | | This function is like the PQserverVersion() function except it returns the version of libpq, making it possible for a client program or driver to determine which version of libpq is in use at runtime, and not just at link time. Suggested by Harald Armin Massa and several others.
* Use memcmp() rather than strncmp() when shorter string length is known.Robert Haas2010-12-21
| | | | | | | It appears that this will be faster for all but the shortest strings; at least one some platforms, memcmp() can use word-at-a-time comparisons. Noah Misch, somewhat pared down.
* Fix typos.Robert Haas2010-12-21
| | | | Andreas Karlsson
* Work around unfortunate getppid() behavior on BSD-ish systems.Robert Haas2010-12-21
| | | | | | | | | | | | | | | | | On MacOS X, and apparently also on other BSD-derived systems, attaching a debugger causes getppid() to return the pid of the debugging process rather than the actual parent PID. As a result, debugging the autovacuum launcher, startup process, or WAL sender on such systems causes it to exit, because the previous coding of PostmasterIsAlive() detects postmaster death by testing whether getppid() == PostmasterPid. Work around that behavior by checking the return value of getppid() more carefully. If it's PostmasterPid, the postmaster must be alive; if it's 1, assume the postmaster is dead. If it's any other value, assume we've been debugged and fall through to the less-reliable kill() test. Review by Tom Lane.
* Allow transactions that don't write WAL to commit asynchronously.Robert Haas2010-12-20
| | | | | | | | This case can arise if a transaction has written data, but only to temporary tables. Loss of the commit record in case of a crash won't matter, because the temporary tables will be lost anyway. Reviewed by Heikki Linnakangas and Simon Riggs.
* Remove thread dumping constant that requires newer Platform SDKMagnus Hagander2010-12-19
| | | | | | | Since we're not multithreaded it only provides marginally useful information, and it does require a newer version of the Platform SDK than we target. We may want to reconsider this in the future along with a fix for MinGW.
* Fix up handling of simple-form CASE with constant test expression.Tom Lane2010-12-19
| | | | | | | | | | | | | | | | | | | | | | | | | eval_const_expressions() can replace CaseTestExprs with constants when the surrounding CASE's test expression is a constant. This confuses ruleutils.c's heuristic for deparsing simple-form CASEs, leading to Assert failures or "unexpected CASE WHEN clause" errors. I had put in a hack solution for that years ago (see commit 514ce7a331c5bea8e55b106d624e55732a002295 of 2006-10-01), but bug #5794 from Peter Speck shows that that solution failed to cover all cases. Fortunately, there's a much better way, which came to me upon reflecting that Peter's "CASE TRUE WHEN" seemed pretty redundant: we can "simplify" the simple-form CASE to the general form of CASE, by simply omitting the constant test expression from the rebuilt CASE construct. This is intuitively valid because there is no need for the executor to evaluate the test expression at runtime; it will never be referenced, because any CaseTestExprs that would have referenced it are now replaced by constants. This won't save a whole lot of cycles, since evaluating a Const is pretty cheap, but a cycle saved is a cycle earned. In any case it beats kluging ruleutils.c still further. So this patch improves const-simplification and reverts the previous change in ruleutils.c. Back-patch to all supported branches. The bug exists in 8.1 too, but it's out of warranty.
* Fix erroneous parsing of tsquery input "... & !(subexpression) | ..."Tom Lane2010-12-19
| | | | | | | | | | | After parsing a parenthesized subexpression, we must pop all pending ANDs and NOTs off the stack, just like the case for a simple operand. Per bug #5793. Also fix clones of this routine in contrib/intarray and contrib/ltree, where input of types query_int and ltxtquery had the same problem. Back-patch to all supported versions.
* Support for collecting crash dumps on WindowsMagnus Hagander2010-12-19
| | | | | | | | | | Add support for collecting "minidump" style crash dumps on Windows, by setting up an exception handling filter. Crash dumps will be generated in PGDATA/crashdumps if the directory is created (the existance of the directory is used as on/off switch for the generation of the dumps). Craig Ringer and Magnus Hagander
* Properly print the IP number and "localhost" for failed localhostBruce Momjian2010-12-18
| | | | connections when the server is down, on Win32.
* Make GUC variables for syslog and SSL always visibleMagnus Hagander2010-12-18
| | | | | Make the variables visible (but not used) even when support is not compiled in.
* set_ps_display when calling functions via fastpathAlvaro Herrera2010-12-17
| | | | This improves tag output by log_line_prefix
* Remove unnecessary definition for autovacuum in SignalSomeChildren.Alvaro Herrera2010-12-17
|
* Try to save a kernel call in ResolveRecoveryConflictWithVirtualXIDs.Robert Haas2010-12-17
| | | | If there's no work to be done, just exit quickly, before initialization.
* Reset 'ps' display just once when resolving VXID conflicts.Robert Haas2010-12-17
| | | | | | | | | | | This prevents the word "waiting" from briefly disappearing from the ps status line when ResolveRecoveryConflictWithVirtualXIDs begins a new iteration of the outer loop. Along the way, remove some useless pgstat_report_waiting() calls; the startup process doesn't appear in pg_stat_activity. Fujii Masao
* Improve comments around startup_hacks() code.Tom Lane2010-12-16
| | | | | | | These comments were not updated when we added the EXEC_BACKEND mechanism for Windows, even though it rendered them inaccurate. Also unify two unnecessarily-separate #ifdef __alpha code blocks.
* Remove optreset from src/port/ implementations of getopt and getopt_long.Tom Lane2010-12-16
| | | | | | | | | | We don't actually need optreset, because we can easily fix the code to ensure that it's cleanly restartable after having completed a scan over the argv array; which is the only case we need to restart in. Getting rid of it avoids a class of interactions with the system libraries and allows reversion of my change of yesterday in postmaster.c and postgres.c. Back-patch to 8.4. Before that the getopt code was a bit different anyway.
* Avoid clobbering errno, per comment from Tom.Alvaro Herrera2010-12-16
|
* Fix inconsequential FILE pointer leakageAlvaro Herrera2010-12-16
|
* Add some minor missing error checksAlvaro Herrera2010-12-16
|
* Simplify SignalSomeChildren(BACKEND_TYPE_ALL) to SignalChildren()Alvaro Herrera2010-12-16
|
* Fix crash caused by NULL lookup when reporting IP address of failedBruce Momjian2010-12-16
| | | | | | libpq connection, per report from Magnus. This happens only on GIT master and only on Win32 because that is the platform where "" maps to an IP address (localhost).
* Fix up getopt() reset management so it works on recent mingw.Tom Lane2010-12-15
| | | | | | | | | The mingw people don't appear to care about compatibility with non-GNU versions of getopt, so force use of our own copy of getopt on Windows. Also, ensure that we make use of optreset when using our own copy. Per report from Andrew Dunstan. Back-patch to all versions supported on Windows.
* Some copy editing of pg_read_binary_file() patch.Robert Haas2010-12-15
|
* Add pg_read_binary_file() and whole-file-at-once versions of pg_read_file().Itagaki Takahiro2010-12-16
| | | | | | | One of the usages of the binary version is to read files in a different encoding from the server encoding. Dimitri Fontaine and Itagaki Takahiro.
* Instrument checkpoint sync calls.Robert Haas2010-12-14
| | | | Greg Smith, reviewed by Jeff Janes
* Improved tab completion for views with triggers.Robert Haas2010-12-13
| | | | | | | | | | | Allow INSERT INTO, UPDATE, and DELETE FROM to be completed with either the name of a table (as before) or the name of a view with an appropriate INSTEAD OF rule. Along the way, allow CREATE TRIGGER to be completed with INSTEAD OF, as well as BEFORE and AFTER. David Fetter, reviewed by Itagaki Takahiro
* Allow plugins to suppress inlining and hook function entry/exit/abort.Robert Haas2010-12-13
| | | | | | | This is intended as infrastructure to allow an eventual SE-Linux plugin to support trusted procedures. KaiGai Kohei
* Update time zone data files to tzdata release 2010o: DST law changes inTom Lane2010-12-13
| | | | Fiji and Samoa. Historical corrections for Hong Kong.
* Generalize concept of temporary relations to "relation persistence".Robert Haas2010-12-13
| | | | | | | | | | | | | | | This commit replaces pg_class.relistemp with pg_class.relpersistence; and also modifies the RangeVar node type to carry relpersistence rather than istemp. It also removes removes rd_istemp from RelationData and instead performs the correct computation based on relpersistence. For clarity, we add three new macros: RelationNeedsWAL(), RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we can clarify the purpose of each check that previous depended on rd_istemp. This is intended as infrastructure for the upcoming unlogged tables patch, as well as for future possible work on global temporary tables.
* Reset all database-level stats in pgstat_recv_resetcounter().Tom Lane2010-12-12
| | | | | | | | | We were failing to zero out some pg_stat_database counters that have been added since the initial pgstats coding. This is a bug, but not back-patching the fix since changing this behavior in a minor release seems a cure worse than the disease. Report and patch by Tomas Vondra.
* Make S_IRGRP etc available in mingw builds as well as MSVC.Tom Lane2010-12-12
| | | | | | (Hm, I wonder whether BCC defines them either...) Also label dangling endifs a bit better in this area.
* Provide a complete set of file-permission-bit macros in win32.h.Tom Lane2010-12-11
| | | | | | My previous patch exposed the fact that we didn't have these. Those hard-wired octal constants were actually wrong on Windows, not just inconsistent.
* Allow bidirectional copy messages in streaming replication mode.Robert Haas2010-12-11
| | | | Fujii Masao. Review by Alvaro Herrera, Tom Lane, and myself.
* Add required new port files to MSVC builds.Magnus Hagander2010-12-11
|
* Move a couple of initdb's subroutines into src/port/.Tom Lane2010-12-10
| | | | | | | | | | mkdir_p and check_data_dir will be useful in CREATE TABLESPACE, since we have agreed that that command should handle subdirectory creation just like initdb creates the PGDATA directory. Push them into src/port/ so that they are available to both initdb and the backend. Rename to pg_mkdir_p and pg_check_dir, just to be on the safe side. Add FreeBSD's copyright notice to pgmkdirp.c, since that's where the code came from originally (this really should have been in initdb.c). Very marginal code/comment cleanup.
* Use symbolic names not octal constants for file permission flags.Tom Lane2010-12-10
| | | | | | | | Purely cosmetic patch to make our coding standards more consistent --- we were doing symbolic some places and octal other places. This patch fixes all C-coded uses of mkdir, chmod, and umask. There might be some other calls I missed. Inconsistency noted while researching tablespace directory permissions issue.
* Fix efficiency problems in tuplestore_trim().Tom Lane2010-12-10
| | | | | | | | | | | | | | | | | | | | | | The original coding in tuplestore_trim() was only meant to work efficiently in cases where each trim call deleted most of the tuples in the store. Which, in fact, was the pattern of the original usage with a Material node supporting mark/restore operations underneath a MergeJoin. However, WindowAgg now uses tuplestores and it has considerably less friendly trimming behavior. In particular it can attempt to trim one tuple at a time off a large tuplestore. tuplestore_trim() had O(N^2) runtime in this situation because of repeatedly shifting its tuple pointer array. Fix by avoiding shifting the array until a reasonably large number of tuples have been deleted. This can waste some pointer space, but we do still reclaim the tuples themselves, so the percentage wastage should be pretty small. Per Jie Li's report of slow percent_rank() evaluation. cume_dist() and ntile() would certainly be affected as well, along with any other window function that has a moving frame start and requires reading substantially ahead of the current row. Back-patch to 8.4, where window functions were introduced. There's no need to tweak it before that.
* Eliminate O(N^2) behavior in parallel restore with many blobs.Tom Lane2010-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | With hundreds of thousands of TOC entries, the repeated searches in reduce_dependencies() become the dominant cost. Get rid of that searching by constructing reverse-dependency lists, which we can do in O(N) time during the fix_dependencies() preprocessing. I chose to store the reverse dependencies as DumpId arrays for consistency with the forward-dependency representation, and keep the previously-transient tocsByDumpId[] array around to locate actual TOC entry structs quickly from dump IDs. While this fixes the slow case reported by Vlad Arkhipov, there is still a potential for O(N^2) behavior with sufficiently many tables: fix_dependencies itself, as well as mark_create_done and inhibit_data_for_failed_table, are doing repeated searches to deal with table-to-table-data dependencies. Possibly this work could be extended to deal with that, although the latter two functions are also used in non-parallel restore where we currently don't run fix_dependencies. Another TODO is that we fail to parallelize restore of multiple blobs at all. This appears to require changes in the archive format to fix. Back-patch to 9.0 where the problem was reported. 8.4 has potential issues as well; but since it doesn't create a separate TOC entry for each blob, it's at much less risk of having enough TOC entries to cause real problems.
* Self review of previous patch. Fix assumption that xmax >= xmin.Simon Riggs2010-12-09
|
* Reduce spurious Hot Standby conflicts from never-visible records.Simon Riggs2010-12-09
| | | | | | | | | | | | Hot Standby conflicts only with tuples that were visible at some point. So ignore tuples from aborted transactions or for tuples updated/deleted during the inserting transaction when generating the conflict transaction ids. Following detailed analysis and test case by Noah Misch. Original report covered btree delete records, correctly observed by Heikki Linnakangas that this applies to other cases also. Fix covers all sources of cleanup records via common code.
* Force default wal_sync_method to be fdatasync on Linux.Tom Lane2010-12-08
| | | | | | | | | | | | | | | | | | | | | | | Recent versions of the Linux system header files cause xlogdefs.h to believe that open_datasync should be the default sync method, whereas formerly fdatasync was the default on Linux. open_datasync is a bad choice, first because it doesn't actually outperform fdatasync (in fact the reverse), and second because we try to use O_DIRECT with it, causing failures on certain filesystems (e.g., ext4 with data=journal option). This part of the patch is largely per a proposal from Marti Raudsepp. More extensive changes are likely to follow in HEAD, but this is as much change as we want to back-patch. Also clean up confusing code and incorrect documentation surrounding the fsync_writethrough option. Those changes shouldn't result in any actual behavioral change, but I chose to back-patch them anyway to keep the branches looking similar in this area. In 9.0 and HEAD, also do some copy-editing on the WAL Reliability documentation section. Back-patch to all supported branches, since any of them might get used on modern Linux versions.
* Optimize commit_siblings in two ways to improve group commit.Simon Riggs2010-12-08
| | | | | | | | First, avoid scanning the whole ProcArray once we know there are at least commit_siblings active; second, skip the check altogether if commit_siblings = 0. Greg Smith
* Fix bugs in the hot standby known-assigned-xids tracking logic. If there'sHeikki Linnakangas2010-12-07
| | | | | | | | | | | | | | | | | | | | | | | an old transaction running in the master, and a lot of transactions have started and finished since, and a WAL-record is written in the gap between the creating the running-xacts snapshot and WAL-logging it, recovery will fail with "too many KnownAssignedXids" error. This bug was reported by Joachim Wieland on Nov 19th. In the same scenario, when fewer transactions have started so that all the xids fit in KnownAssignedXids despite the first bug, a more serious bug arises. We incorrectly initialize the clog code with the oldest still running transaction, and when we see the WAL record belonging to a transaction with an XID larger than one that committed already before the checkpoint we're recovering from, we zero the clog page containing the already committed transaction, leading to data loss. In hindsight, trying to track xids in the known-assigned-xids array before seeing the running-xacts record was too complicated. To fix that, hold XidGenLock while the running-xacts snapshot is taken and WAL-logged. That ensures that no transaction can begin or end in that gap, so that in recvoery we know that the snapshot contains all transactions running at that point in WAL.
* Add a stack overflow check to copyObject().Tom Lane2010-12-06
| | | | | | | | | | | | | | | There are some code paths, such as SPI_execute(), where we invoke copyObject() on raw parse trees before doing parse analysis on them. Since the bison grammar is capable of building heavily nested parsetrees while itself using only minimal stack depth, this means that copyObject() can be the front-line function that hits stack overflow before anything else does. Accordingly, it had better have a check_stack_depth() call. I did a bit of performance testing and found that this slows down copyObject() by only a few percent, so the hit ought to be negligible in the context of complete processing of a query. Per off-list report from Toshihide Katayama. Back-patch to all supported branches.
* Allow the low level COPY routines to read arbitrary numbers of fields.Andrew Dunstan2010-12-06
| | | | | | | This doesn't involve any user-visible change in behavior, but will be useful when the COPY routines are exposed to allow their use by Foreign Data Wrapper routines, which will be able to use these routines to read irregular CSV files, for example.
* Fix two typos, by Fujii Masao.Heikki Linnakangas2010-12-06
|
* Put only single space after "Sort Method:", for consistencyPeter Eisentraut2010-12-06
|
* Reduce memory consumption inside inheritance_planner().Tom Lane2010-12-05
| | | | | | | | | | | | | | | Avoid eating quite so much memory for large inheritance trees, by reclaiming the space used by temporary copies of the original parsetree and range table, as well as the workspace needed during planning. The cost is needing to copy the finished plan trees out of the child memory context. Although this looks like it ought to slow things down, my testing shows it actually is faster, apparently because fewer interactions with malloc() are needed and/or we can do the work within a more readily cacheable amount of memory. That result might be platform-dependent, but I'll take it. Per a gripe from John Papandriopoulos, in which it was pointed out that the memory consumption actually grew as O(N^2) for sufficiently many child tables, since we were creating N copies of the N-element range table.