slang-users mailing list

Re: unicode (was Re: Minor error message change)

Hi, John!

As I understand it, the locale describes preferences of the user.  So, if
I set LANG=ru_RU, it means that I want to see messages in Russian.
In this sense, the locale describes capabilities of the user.

Now, the encoding part of the locale is a different thing.  If I set
LANG=ru_RU.koi8-r, it means that my terminal has a koi8-r font.  If the
application uses ISO-8859-5 for output, I won't be able to read it on my
terminal, even if I know Russian.  It's not a user preference because it's
implied that all users should see the output without having to recode it
in their brains.  So the encoding describes capabilities of the terminal.
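
To make the two parts concrete, here is a minimal sketch in Python that
splits a locale name into its user-facing part (language, territory) and
its terminal-facing part (codeset).  The function name split_locale is
made up for illustration, and it assumes the usual
language_TERRITORY.codeset@modifier layout without validating anything.

```python
def split_locale(value):
    """Split a locale name such as 'ru_RU.koi8-r' into its parts.

    Hypothetical helper: it only splits the string, it does not check
    that the language or the codeset actually exists on the system.
    """
    # An optional '@modifier' suffix comes last, e.g. 'de_DE.UTF-8@euro'.
    name, _, modifier = value.partition("@")
    # The codeset, if present, follows the first '.'.
    lang_territory, _, codeset = name.partition(".")
    # The territory, if present, follows the first '_'.
    language, _, territory = lang_territory.partition("_")
    return {
        "language": language or None,    # user preference
        "territory": territory or None,  # user preference
        "codeset": codeset or None,      # terminal capability
        "modifier": modifier or None,
    }
```

With LANG=ru_RU the codeset comes back as None, which is exactly the case
where something else has to decide what the terminal can display.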

There is another variable describing capabilities of the terminal, namely
TERM.  That's where the encoding should have been added in the ideal
world.  But I think nobody wanted to break old programs and nobody
considered a separate variable like TERM_ENCODING.  This design flaw was
further perpetuated by the decision to represent UTF-8 support in the
locale as well.

However, it's very important to realize that neither the language/country
nor the encoding part of the locale represents the language or the
encoding of text that doesn't come from the user (e.g. e-mail sent by
others).  The current locale may be used as the default, but some
standards, including HTML, define specifically that the default is
ISO-8859-1.  Also, those standards provide format-specific ways to define
the language and the encoding of the document.

Applications working with text data should have means to determine its
encoding based on standards, user actions and user preferences (and maybe
some guesswork, like the domain name of the sender).  Anyway, it's completely
the responsibility of the application.
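
One way an application could do this is to try several sources in order
and take the first usable one.  The sketch below is only an illustration:
the function name, the parameter names and the priority order (explicit
user choice, then the document's own declaration, then the default of the
format, then the locale) are my assumptions, not something a standard
prescribes.

```python
import codecs


def pick_text_encoding(user_override=None, declared=None,
                       format_default=None, locale_codeset=None):
    """Return the first candidate encoding that Python knows about.

    Hypothetical helper.  The priority order is an assumption:
      1. user_override   - an explicit user action
      2. declared        - what the document says about itself
      3. format_default  - the default fixed by the format's standard
      4. locale_codeset  - the codeset of the current locale
    """
    for candidate in (user_override, declared, format_default, locale_codeset):
        if not candidate:
            continue
        try:
            codecs.lookup(candidate)  # raises LookupError if unknown
        except LookupError:
            continue  # unusable name, fall through to the next source
        return candidate
    return None  # nothing usable; the caller has to guess
```

For example, a message declared as koi8-r is treated as koi8-r even when
the locale says UTF-8, because the locale is only the last resort here.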

So, the right approach for S-Lang would be to assume terminal capabilities
from the locale.  The application should tell S-Lang about the encoding
it's using for output via S-Lang and about the encoding it expects to get
from S-Lang.

If the application fails to tell S-Lang what encodings it uses, a safe
default should be used.  I don't really know what would be safe.  Perhaps
1-byte encodings should be passed as is, but if the terminal uses UTF-8,
the application should be assumed to be using ISO-8859-1 to avoid a royal
mess on the screen.
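
The fallback described above can be sketched in a few lines of Python.
This is not S-Lang code and not a proposal for its API; to_terminal and
its parameters are hypothetical names, and the choice to replace
characters that cannot be converted is my own assumption.

```python
def to_terminal(data, terminal_encoding, app_encoding=None):
    """Convert application output bytes to what the terminal expects.

    Hypothetical sketch of the fallback policy:
      - the application declared its encoding: recode to the terminal's;
      - nothing declared, 1-byte terminal: pass the bytes through as is;
      - nothing declared, UTF-8 terminal: assume ISO-8859-1 and recode.
    """
    is_utf8 = terminal_encoding.lower().replace("_", "-") in ("utf-8", "utf8")
    if app_encoding is None:
        if not is_utf8:
            return data  # 1-byte terminal: leave the bytes untouched
        app_encoding = "iso-8859-1"  # assumed default for a UTF-8 terminal
    # Assumption: undecodable or unencodable characters are replaced
    # rather than raising an error.
    text = data.decode(app_encoding, errors="replace")
    return text.encode(terminal_encoding, errors="replace")
```

ISO-8859-1 is a convenient assumption because every byte value decodes to
some character, so the conversion to UTF-8 never fails, even if the result
is not what the application meant.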

Pavel Roskin
