The macOS LC_COLLATE
hunt
Or, why does sort(1)
order differently on macOS and Linux?
Zhiming Wang
2020-06-03
Today I noticed something interesting while working with a sorted list of package names: sort(1)
orders them differently on macOS and Linux (Ubuntu 20.04). A very simple example, with locale set explicitly:
(macOS) $ LC_ALL=en_US.UTF-8 sort <<<$'python-dev\npython3-dev'
python-dev
python3-dev
(Linux) $ LC_ALL=en_US.UTF-8 sort <<<$'python-dev\npython3-dev'
python3-dev
python-dev
What the hell? Same locale, different order (or technically, collation). This is not even a difference between GNU and BSD userland; coreutils sort
on macOS produces the same output as /usr/bin/sort
. (Of course, when LC_ALL=C
is used, the results are the same, matching the macOS result above, since “-
” as 0x2D
on the ASCII table comes before “3
” as 0x33
.) Therefore, the locale itself becomes the prime suspect.
macOS
LC_COLLATE
for any locale on macOS is very easy to find: just look under /usr/share/locale/<locale>
. Somewhat surprisingly, /usr/share/locale/en_US.UTF-8/LC_COLLATE
is a symlink to ../la_LN.US-ASCII/LC_COLLATE
. The US-ASCII
part is a giveaway for lack of sophistication, while the unfamiliar language code la
and unfamiliar country code LN
gave me pause. Turns out la
is code for Latin and LN
isn’t really code for anything (I guess they invented it for the Latin script influence sphere)? In fact, if we look a little bit closer, most locales’ LC_COLLATE
are symlinked to la_LN
dot something (mostly dot US-ASCII
), which isn’t very remarkable once we realize it stands for Latin:realpath
in the following command is part of GNU coreutils. In fact I’ll be liberally using coreutils commands in this article. You can brew install coreutils
(make sure you read the caveats).
$ realpath /usr/share/locale/*/LC_COLLATE | sort | uniq -c | sort -nr
122 /usr/share/locale/la_LN.US-ASCII/LC_COLLATE
21 /usr/share/locale/la_LN.ISO8859-1/LC_COLLATE
20 /usr/share/locale/la_LN.ISO8859-15/LC_COLLATE
5 /usr/share/locale/la_LN.ISO8859-2/LC_COLLATE
3 /usr/share/locale/de_DE.ISO8859-15/LC_COLLATE
3 /usr/share/locale/de_DE.ISO8859-1/LC_COLLATE
2 /usr/share/locale/is_IS.ISO8859-1/LC_COLLATE
2 /usr/share/locale/cs_CZ.ISO8859-2/LC_COLLATE
1 /usr/share/locale/uk_UA.KOI8-U/LC_COLLATE
1 /usr/share/locale/uk_UA.ISO8859-5/LC_COLLATE
1 /usr/share/locale/sv_SE.ISO8859-15/LC_COLLATE
1 /usr/share/locale/sv_SE.ISO8859-1/LC_COLLATE
1 /usr/share/locale/sr_YU.ISO8859-5/LC_COLLATE
1 /usr/share/locale/sl_SI.ISO8859-2/LC_COLLATE
1 /usr/share/locale/ru_RU.KOI8-R/LC_COLLATE
1 /usr/share/locale/ru_RU.ISO8859-5/LC_COLLATE
1 /usr/share/locale/ru_RU.CP866/LC_COLLATE
1 /usr/share/locale/ru_RU.CP1251/LC_COLLATE
1 /usr/share/locale/pl_PL.ISO8859-2/LC_COLLATE
1 /usr/share/locale/lt_LT.ISO8859-4/LC_COLLATE
1 /usr/share/locale/lt_LT.ISO8859-13/LC_COLLATE
1 /usr/share/locale/la_LN.ISO8859-4/LC_COLLATE
1 /usr/share/locale/kk_KZ.PT154/LC_COLLATE
1 /usr/share/locale/is_IS.ISO8859-15/LC_COLLATE
1 /usr/share/locale/hy_AM.ARMSCII-8/LC_COLLATE
1 /usr/share/locale/hi_IN.ISCII-DEV/LC_COLLATE
1 /usr/share/locale/et_EE.ISO8859-15/LC_COLLATE
1 /usr/share/locale/es_ES.ISO8859-15/LC_COLLATE
1 /usr/share/locale/es_ES.ISO8859-1/LC_COLLATE
1 /usr/share/locale/el_GR.ISO8859-7/LC_COLLATE
1 /usr/share/locale/de_DE-A.ISO8859-1/LC_COLLATE
1 /usr/share/locale/ca_ES.ISO8859-15/LC_COLLATE
1 /usr/share/locale/ca_ES.ISO8859-1/LC_COLLATE
1 /usr/share/locale/bg_BG.CP1251/LC_COLLATE
1 /usr/share/locale/be_BY.ISO8859-5/LC_COLLATE
1 /usr/share/locale/be_BY.CP1251/LC_COLLATE
1 /usr/share/locale/be_BY.CP1131/LC_COLLATE
Oddly enough though (until we realize it’s just lack of sophistication), many of the outliers are in fact Latin script-based languages, while markedly non-Latin ones are lumped together under the Latin arm:
$ realpath /usr/share/locale/{zh_CN,ja_JP,ko_KR}.UTF-8/LC_COLLATE
/usr/share/locale/la_LN.US-ASCII/LC_COLLATE
/usr/share/locale/la_LN.US-ASCII/LC_COLLATE
/usr/share/locale/la_LN.US-ASCII/LC_COLLATE
Of course, these locale files are compiled binaries, so it’s hard to gleen the collation rules from them (with my untrained eyes). We still need to find the source code.
Looking for OS X / macOS source code is always kind of a pain. Fortunately, searching for la_LN.US-ASCII site:opensource.apple.com
led me to the adv_cmds package, or more precisely, an old version of it. This package contains source code for locale-related commands (among other things) colldef
, locale
, localedef
, and mklocale
, and until v118 (from Mac OS X 10.5 era) it contained a usr-share-locale.tproj
directory with locale definitions in source form.You can download a tarball from here. They sure don’t make it easy to find the link. The collation definitions are in usr-share-locale.tproj/colldef
, and looking at the list usr-share-locale.tproj/colldef/*.src
we immediately notice the overlap with the resolved list above. In fact, it’s a perfect match save for de_DE-A.ISO8859-1
in the list above which wasn’t present in the OS X 10.5 era source package. And here’s the entirety of the la_LN.US-ASCII
ruleset (link):
# ASCII
#
# $FreeBSD: src/share/colldef/la_LN.US-ASCII.src,v 1.2 1999/08/28 00:59:47 peter Exp $
#
order \
\x00;...;\xff
I’m no expert on locale definitions (in fact this doesn’t seem to follow the standard, and looks more like colldef
-specific langauge – see man 1 colldef
), but the meaning is crystal clear: just compare the byte values one by one, semantics be damned. Same as the POSIX locale (aka C locale). That explains why LC_COLLATE=en_US.UTF-8
sorts the same as LC_COLLATE=C
.
Also, the README
(link) for context:
$FreeBSD: src/share/colldef/README,v 1.2 2002/04/08 09:28:22 ache Exp $
WARNING: For the compatibility sake try to keep collating table backward
compatible with ASCII, i.e. add other symbols to the existent ASCII order.
The content and timestamps place these source files perfectly in the FreeBSD 5.0.0 tree. It just so happens to be known that OS X’s BSD layer was synchronized with FreeBSD 5 back in 10.3 Panther, so the story as told by the source files checks out.
However, do recall usr-share-locale.tproj
has been long gone from the adv_cmds
package. Have the rules changed? One simple test:
$ colldef -o /dev/stdout usr-share-locale.tproj/colldef/la_LN.US-ASCII.src | sha256sum
9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78 -
$ sha256sum </usr/share/locale/en_US.UTF-8/LC_COLLATE
9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78 -
Nope, one and the same. The mystery has thus been solved: we owe our most unsophiscated collation rules on macOS to twenty-year-old FreeBSD (which itself has moved on). Well, at least this should be fast.
Linux
On GNU/Linux, locale programs and data are part of glibc. glibc’s localedef
(link) prefers to write all generated locales to a single archive $complocaledir/locale-archive
, where $complocaledir
is /usr/lib/locale
by default, so one usually can’t find a standalone LC_COLLATE
file for a given locale. In fact, on my Ubuntu 20.04 systems the only non-locale-archive
oddball is C.UTF-8
.
Debian does ship the locale definitions in source form, though, in /usr/share/i18n/locales
, since locales are mostly generated from source via the locale-gen(8)
wrapper (which is just a very short shell script). Looking into the LC_COLLATE
section of /usr/share/i18n/locales/en_US
, we can see it copies iso14651_t1
, which in turn copies iso14651_t1_common
, a 85612-line monstrosity solely for defining collation rules per ISO 14651 (entitled Information technology — International string ordering and comparison — Method for comparing character strings and description of the common template tailorable ordering).
So there you have it, python3-dev
is sorted before python-dev
due to ISO 14651.