From f69319f2f1fb16eda4b535bcccec90dff3a6795e Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Tue, 19 Mar 2024 15:24:41 -0700
Subject: Support C.UTF-8 locale in the new builtin collation provider.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The builtin C.UTF-8 locale has similar semantics to the libc locale of
the same name. That is, code point sort order (fast, memcmp-based)
combined with Unicode semantics for character operations such as
pattern matching, regular expressions, and
LOWER()/INITCAP()/UPPER(). The character semantics are based on
Unicode simple case mappings.

The builtin provider's C.UTF-8 offers several important advantages
over libc:

 * faster sorting -- benefits from additional optimizations such as
   abbreviated keys and varstrfastcmp_c
 * faster case conversion, e.g. LOWER(), at least compared with some
   libc implementations
 * available on all platforms with identical semantics, and the
   semantics are stable, testable, and documentable within a given
   Postgres major version

Being based on memcmp, the builtin C.UTF-8 locale does not offer
natural language sort order. But it is an improvement for most use
cases that might otherwise use libc's "C.UTF-8" locale, as well as
many use cases that use libc's "C" locale.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
---
 doc/src/sgml/charset.sgml              | 27 ++++++++++++++++++++++++++-
 doc/src/sgml/ref/create_collation.sgml |  2 +-
 doc/src/sgml/ref/create_database.sgml  | 13 ++++++++-----
 doc/src/sgml/ref/initdb.sgml           |  5 +++--
 4 files changed, 38 insertions(+), 9 deletions(-)

(limited to 'doc/src')

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 7114eb7b522..55bbb20dacc 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,13 +377,21 @@ initdb --locale-provider=icu --icu-locale=en
 The builtin provider uses built-in operations. Only
- the C locale is supported for this provider.
+ the C and C.UTF-8 locales are
+ supported for this provider.
 The C locale behavior is identical to the C locale in the libc provider. When using this locale, the behavior may depend on the database encoding.
+
+
+ The C.UTF-8 locale is available only for when the
+ database encoding is UTF-8, and the behavior is
+ based on Unicode. The collation uses the code point values only. The
+ regular expression character classes are based on the "POSIX
+ Compatible" semantics, and the case mapping is the "simple" variant.
+
@@ -878,6 +886,23 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
+
+ pg_c_utf8
+
+
+ This collation sorts by Unicode code point values rather than natural
+ language order. For the functions lower,
+ initcap, and upper, it uses
+ Unicode simple case mapping. For pattern matching (including regular
+ expressions), it uses the POSIX Compatible variant of Unicode Compatibility
+ Properties. Behavior is efficient and stable within a
+ Postgres major version. This collation is
+ only available for encoding UTF8.
+
+
+
 C (equivalent to POSIX)
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 98cd7d56be9..85f18cbbe5d 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -99,7 +99,7 @@ CREATE COLLATION [ IF NOT EXISTS ] name FROM
 If provider is builtin, then locale must be specified and set to
- C.
+ either C or C.UTF-8.
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 233ff1755dd..7653cb902ee 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -166,8 +166,9 @@ CREATE DATABASE name
 If is
- builtin, then locale
- must be specified and set to C.
+ builtin, then locale or
+ builtin_locale must be specified and set to
+ either C or C.UTF-8.
@@ -228,9 +229,11 @@ CREATE DATABASE name
 linkend="create-database-locale-provider">locale provider must be builtin. The default is the setting of if specified; otherwise the same
- setting as the template database. Currently, the only available
- locale for the builtin provider is
- C.
+ setting as the template database.
+
+
+ The locales available for the builtin provider are
+ C and C.UTF-8.
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 4760570f6ab..377c3cb20aa 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -288,8 +288,9 @@ PostgreSQL documentation
 If is builtin,
- must be specified and set to
- C.
+ or must be
+ specified and set to C or
+ C.UTF-8.
--
cgit v1.2.3
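
The commit message describes the builtin C.UTF-8 sort order as both "code point" and "memcmp-based". The two coincide because byte-wise comparison of UTF-8 text yields Unicode code point order. The following is a minimal Python sketch of that property only, not of the PostgreSQL implementation; the helper names `memcmp_key` and `code_point_key` and the sample strings are assumptions made up for illustration.

```python
# Illustrative sketch: byte-wise comparison of UTF-8 (what memcmp() does)
# orders strings the same way as comparing their Unicode code points.
# Helper names and sample data are hypothetical, not taken from the patch.

def memcmp_key(s: str) -> bytes:
    """Sort key that imitates memcmp() on the UTF-8 encoding of s."""
    return s.encode("utf-8")


def code_point_key(s: str) -> list:
    """Sort key that compares strings code point by code point."""
    return [ord(ch) for ch in s]


# Mix of ASCII, Latin-1, Greek, fullwidth, and a non-BMP character.
words = ["zebra", "Zoo", "éclair", "apple", "Ωmega", "Ａfull", "𝄞clef", "ab", "a"]

by_bytes = sorted(words, key=memcmp_key)
by_code_points = sorted(words, key=code_point_key)

# Both orderings agree for any valid UTF-8 input.
assert by_bytes == by_code_points

print(by_bytes)
```

In the resulting order, uppercase ASCII sorts before lowercase ASCII and accented letters sort after all ASCII letters, which illustrates the commit message's caveat that this locale does not offer natural language sort order.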