ロケールに関すること

このページではロケールに関する問題や議論について示します。 以下の説明を通じて、システムに対しさまざまなロケールを設定する際に発生する問題の概略を説明します。 問題が発生する実際のロケールは、たいていは (しかしすべてではないですが) 各節のいずれかに分類されます。 その深刻さの度合いは、以下の基準で示します。

特定のパッケージに対しての回避策があるとすれば、そのパッケージのホームページなどに示されているはずです。

必要なエンコーディングがプログラムでは正常に利用できない

深刻度: 致命的

プログラムの中には、データ入出力にあたってキャラクターエンコーディングに特定のものを求めていて、エンコーディングを自由に選択できないものがあります。 This is the case for the -X option in Enscript-1.6.6, the -input-charset option in unpatched Cdrtools-3.02a09, and the character sets offered for display in the menu of Links-2.29. If the required encoding is not in the list, the program usually becomes completely unusable. For non-interactive programs, it may be possible to work around this by converting the document to a supported input character set before submitting to the program.

A solution to this type of problem is to implement the necessary support for the missing encoding as a patch to the original program or to find a replacement.

The Program Assumes the Locale-Based Encoding of External Documents

Severity: High for non-text documents, low for text documents

Some programs, nano-7.2 or JOE-4.6 for example, assume that documents are always in the encoding implied by the current locale. While this assumption may be valid for the user-created documents, it is not safe for external ones. When this assumption fails, non-ASCII characters are displayed incorrectly, and the document may become unreadable.

If the external document is entirely text based, it can be converted to the current locale encoding using the iconv program.

For documents that are not text-based, this is not possible. In fact, the assumption made in the program may be completely invalid for documents where the Microsoft Windows operating system has set de facto standards. An example of this problem is ID3v1 tags in MP3 files. For these cases, the only solution is to find a replacement program that doesn't have the issue (e.g., one that will allow you to specify the assumed document encoding).

Among BLFS packages, this problem applies to nano-7.2, JOE-4.6, and all media players except Audacious-4.3.1.

Another problem in this category is when someone cannot read the documents you've sent them because their operating system is set up to handle character encodings differently. This can happen often when the other person is using Microsoft Windows, which only provides one character encoding for a given country. For example, this causes problems with UTF-8 encoded TeX documents created in Linux. On Windows, most applications will assume that these documents have been created using the default Windows 8-bit encoding.

In extreme cases, Windows encoding compatibility issues may be solved only by running Windows programs under Wine.

The Program Uses or Creates Filenames in the Wrong Encoding

Severity: Critical

The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category. This information is well-hidden on the page which specifies the behavior of Tar and Cpio programs. Some programs get it wrong by default (or simply don't have enough information to get it right). The result is that they create filenames which are not subsequently shown correctly by ls, or they refuse to accept filenames that ls shows properly. For the GLib-2.78.3 library, the problem can be corrected by setting the G_FILENAME_ENCODING environment variable to the special "@locale" value. Glib2 based programs that don't respect that environment variable are buggy.

The Zip-3.0 and UnZip-6.0 have this problem because they hard-code the expected filename encoding. UnZip contains a hard-coded conversion table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and uses this table when extracting archives created under DOS or Microsoft Windows. However, this assumption only works for those in the US and not for anyone using a UTF-8 locale. Non-ASCII characters will be mangled in the extracted filenames.

The general rule for avoiding this class of problems is to avoid installing broken programs. If this is impossible, the convmv command-line tool can be used to fix filenames created by these broken programs, or intentionally mangle the existing filenames to meet the broken expectations of such programs.

In other cases, a similar problem is caused by importing filenames from a system using a different locale with a tool that is not locale-aware (e.g., OpenSSH-9.5p1). In order to avoid mangling non-ASCII characters when transferring files to a system with a different locale, any of the following methods can be used:

  • Transfer anyway, fix the damage with convmv.

  • On the sending side, create a tar archive with the --format=posix switch passed to tar (this will be the default in a future version of tar).

  • Mail the files as attachments. Mail clients specify the encoding of attached filenames.

  • Write the files to a removable disk formatted with a FAT or FAT32 filesystem.

  • Transfer the files using Samba.

  • Transfer the files via FTP using RFC2640-aware server (this currently means only wu-ftpd, which has bad security history) and client (e.g., lftp).

The last four methods work because the filenames are automatically converted from the sender's locale to UNICODE and stored or sent in this form. They are then transparently converted from UNICODE to the recipient's locale encoding.

The Program Breaks Multibyte Characters or Doesn't Count Character Cells Correctly

Severity: High or critical

Many programs were written in an older era where multibyte locales were not common. Such programs assume that C "char" data type, which is one byte, can be used to store single characters. Further, they assume that any sequence of characters is a valid string and that every character occupies a single character cell. Such assumptions completely break in UTF-8 locales. The visible manifestation is that the program truncates strings prematurely (i.e., at 80 bytes instead of 80 characters). Terminal-based programs don't place the cursor correctly on the screen, don't react to the "Backspace" key by erasing one character, and leave junk characters around when updating the screen, usually turning the screen into a complete mess.

Fixing this kind of problems is a tedious task from a programmer's point of view, like all other cases of retrofitting new concepts into the old flawed design. In this case, one has to redesign all data structures in order to accommodate to the fact that a complete character may span a variable number of "char"s (or switch to wchar_t and convert as needed). Also, for every call to the "strlen" and similar functions, find out whether a number of bytes, a number of characters, or the width of the string was really meant. Sometimes it is faster to write a program with the same functionality from scratch.

Among BLFS packages, this problem applies to xine-ui-0.99.14 and all the shells.

The Package Installs Manual Pages in Incorrect or Non-Displayable Encoding

Severity: Low

LFS expects that manual pages are in the language-specific (usually 8-bit) encoding, as specified on the LFS Man DB page. However, some packages install translated manual pages in UTF-8 encoding (e.g., Shadow, already dealt with), or manual pages in languages not in the table. Not all BLFS packages have been audited for conformance with the requirements put in LFS (the large majority have been checked, and fixes placed in the book for packages known to install non-conforming manual pages). If you find a manual page installed by any of BLFS packages that is obviously in the wrong encoding, please remove or convert it as needed, and report this to BLFS team as a bug.

You can easily check your system for any non-conforming manual pages by copying the following short shell script to some accessible location,

#!/bin/sh
# Begin checkman.sh
# Usage: find /usr/share/man -type f | xargs checkman.sh
for a in "$@"
do
    # echo "Checking $a..."
    # Pure-ASCII manual page (possibly except comments) is OK
    grep -v '.\\"' "$a" | iconv -f US-ASCII -t US-ASCII >/dev/null 2>&1 \
        && continue
    # Non-UTF-8 manual page is OK
    iconv -f UTF-8 -t UTF-8 "$a" >/dev/null 2>&1 || continue
    # Found a UTF-8 manual page, bad.
    echo "UTF-8 manual page: $a" >&2
done
# End checkman.sh

and then issuing the following command (modify the command below if the checkman.sh script is not in your PATH environment variable):

find /usr/share/man -type f | xargs checkman.sh

Note that if you have manual pages installed in any location other than /usr/share/man (e.g., /usr/local/share/man), you must modify the above command to include this additional location.