aboutsummaryrefslogtreecommitdiff
path: root/src/backend/commands/copyfromparse.c
Commit message (Collapse)AuthorAge
* Add heuristic incoming-message-size limits in the server.Tom Lane2021-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had a report of confusing server behavior caused by a client bug that sent junk to the server: the server thought the junk was a very long message length and waited patiently for data that would never come. We can reduce the risk of that by being less trusting about message lengths. For a long time, libpq has had a heuristic rule that it wouldn't believe large message size words, except for a small number of message types that are expected to be (potentially) long. This provides some defense against loss of message-boundary sync and other corrupted-data cases. The server does something similar, except that up to now it only limited the lengths of messages received during the connection authentication phase. Let's do the same as in libpq and put restrictions on the allowed length of all messages, while distinguishing between message types that are expected to be long and those that aren't. I used a limit of 10000 bytes for non-long messages. (libpq's corresponding limit is 30000 bytes, but given the asymmetry of the FE/BE protocol, there's no good reason why the numbers should be the same.) Experimentation suggests that this is at least a factor of 10, maybe a factor of 100, more than we really need; but plenty of daylight seems desirable to avoid false positives. In any case we can adjust the limit based on beta-test results. For long messages, set a limit of MaxAllocSize - 1, which is the most that we can absorb into the StringInfo buffer that the message is collected in. This just serves to make sure that a bogus message size is reported as such, rather than as a confusing gripe about not being able to enlarge a string buffer. While at it, make sure that non-mainline code paths (such as COPY FROM STDIN) are as paranoid as SocketBackend is, and validate the message type code before believing the message length. This provides an additional guard against getting stuck on corrupted input. Discussion: https://postgr.es/m/2003757.1619373089@sss.pgh.pa.us
* Do COPY FROM encoding conversion/verification in larger chunks.Heikki Linnakangas2021-04-01
| | | | | | | | | | | | | | | | | | | | | This gives a small performance gain, by reducing the number of calls to the conversion/verification function, and letting it work with larger inputs. Also, reorganizing the input pipeline makes it easier to parallelize the input parsing: after the input has been converted to the database encoding, the next stage of finding the newlines can be done in parallel, because there cannot be any newline chars "embedded" in multi-byte characters in the encodings that we support as server encodings. This changes behavior in one corner case: if client and server encodings are the same single-byte encoding (e.g. latin1), previously the input would not be checked for zero bytes ('\0'). Any fields containing zero bytes would be truncated at the zero. But if encoding conversion was needed, the conversion routine would throw an error on the zero. After this commit, the input is always checked for zeros. Reviewed-by: John Naylor Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
* Remove server and libpq support for old FE/BE protocol version 2.Heikki Linnakangas2021-03-04
| | | | | | | | | | | | | | | | | Protocol version 3 was introduced in PostgreSQL 7.4. There shouldn't be many clients or servers left out there without version 3 support. But as a courtesy, I kept just enough of the old protocol support that we can still send the "unsupported protocol version" error in v2 format, so that old clients can display the message properly. Likewise, libpq still understands v2 ErrorResponse messages when establishing a connection. The impetus to do this now is that I'm working on a patch to COPY FROM, to always prefetch some data. We cannot do that safely with the old protocol, because it requires parsing the input one byte at a time to detect the end-of-copy marker. Reviewed-by: Tom Lane, Alvaro Herrera, John Naylor Discussion: https://www.postgresql.org/message-id/9ec25819-0a8a-d51a-17dc-4150bb3cca3b%40iki.fi
* Fix backslash-escaping multibyte chars in COPY FROM.Heikki Linnakangas2021-02-05
| | | | | | | | | | | | | | | | | | | | | | | If a multi-byte character is escaped with a backslash in TEXT mode input, and the encoding is one of the client-only encodings where the bytes after the first one can have an ASCII byte "embedded" in the char, we didn't skip the character correctly. After a backslash, we only skipped the first byte of the next character, so if it was a multi-byte character, we would try to process its second byte as if it was a separate character. If it was one of the characters with special meaning, like '\n', '\r', or another '\\', that would cause trouble. One such exmple is the byte sequence '\x5ca45c2e666f6f' in Big5 encoding. That's supposed to be [backslash][two-byte character][.][f][o][o], but because the second byte of the two-byte character is 0x5c, we incorrectly treat it as another backslash. And because the next character is a dot, we parse it as end-of-copy marker, and throw an "end-of-copy marker corrupt" error. Backpatch to all supported versions. Reviewed-by: John Naylor, Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
* Fix small error in COPY FROM progress reporting.Heikki Linnakangas2021-02-04
| | | | | | | | The # of bytes processed was accumulated slightly incorrectly. After loading more data to the input buffer, we added the number of bytes in the buffer to the sum. But in case of multi-byte characters or escapes, there can be a few unprocessed bytes left over from previous load in the buffer. Those bytes got counted twice.
* Report progress of COPY commandsTomas Vondra2021-01-06
| | | | | | | | | | | | This commit introduces a view pg_stat_progress_copy, reporting progress of COPY commands. This allows rough estimates how far a running COPY progressed, with the caveat that the total number of bytes may not be available in some cases (e.g. when the input comes from the client). Author: Josef Šimánek Reviewed-by: Fujii Masao, Bharath Rupireddy, Vignesh C, Matthias van de Meent Discussion: https://postgr.es/m/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com Discussion: https://postgr.es/m/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg+FEKet4XaTGSW=LitKQ@mail.gmail.com
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Remove leftover comments, left behind by removal of WITH OIDS.Heikki Linnakangas2020-11-30
| | | | | Author: Amit Langote Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGaRoF3XrhPW-Y7P%2BG7bKo84Z_h%3DkQHvMh-80%3Dav3wmOw%40mail.gmail.com
* Put "inline" marker on declarations of inline functions.Tom Lane2020-11-24
| | | | | | I'm having a hard time telling whether the letter of the C standard requires this, but we do have a couple of buildfarm members that throw warnings when this is not done. Oversight in c532d15dd.
* Split copy.c into four files.Heikki Linnakangas2020-11-23
Copy.c has grown really large. Split it into more manageable parts: - copy.c now contains only a few functions that are common to COPY FROM and COPY TO. - copyto.c contains code for COPY TO. - copyfrom.c contains code for initializing COPY FROM, and inserting the tuples to the correct table. - copyfromparse.c contains code for reading from the client/file/program, and parsing the input text/CSV/binary format into tuples. All of these parts are fairly complicated, and fairly independent of each other. There is a patch being discussed to implement parallel COPY FROM, which will add a lot of new code to the COPY FROM path, and another patch which would allow INSERTs to use the same multi-insert machinery as COPY FROM, both of which will require refactoring that code. With those two patches, there's going to be a lot of code churn in copy.c anyway, so now seems like a good time to do this refactoring. The CopyStateData struct is also split. All the formatting options, like FORMAT, QUOTE, ESCAPE, are put in a new CopyFormatOption struct, which is used by both COPY FROM and TO. Other state data are kept in separate CopyFromStateData and CopyToStateData structs. Reviewed-by: Soumyadeep Chakraborty, Erik Rijkers, Vignesh C, Andres Freund Discussion: https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi