jed-users mailing list

slang: UTF-8 and strlen


Hi,

   As many of you know, the next slang release will provide full
support for Unicode using the UTF-8 encoding.  So far, this support
has been completed for the SLsmg/SLtt slang interfaces.  By full
support I mean support for 32-bit Unicode characters, accounting for
combining characters (up to 4), double-width characters, illegal
UTF-8 encoded strings, etc.

   At the same time, I am adding support for UTF-8 to jed, which will
serve to test the library.  (See http://www.jedsoft.org/images/jedutf8.png
for an image.)  In doing so, I came across the following "issue":
what should the interpreter's strlen function return?
Currently, it knows nothing about the encoding and returns the number
of bytes making up the string.  However, it could be modified to
return one of the following:

   1.  The number of bytes in the string.
   2.  The number of characters in the string, including combining
       characters.
   3.  The number of characters in the string, not counting the
       combining characters.

Keep in mind that in the UTF-8 encoding, a character is represented by
1 to 6 bytes.  Hence, one needs to be careful when using the term
"character".  A so-called combining character can be thought of as an
"overstrike" character.  For example, the Spanish "enye" character
may be represented as 2 characters: an 'n' followed by a combining
tilde (U+0303), which is drawn over the 'n'.
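
To make the "enye" example concrete, here is a small sketch in Python
(not S-Lang; Python is used only because its standard library exposes
the Unicode tables).  It compares the precomposed form U+00F1 with the
'n' + U+0303 form.  The variable names are illustrative, and using
unicodedata.combining() to decide what counts as a combining character
is an assumption, not necessarily the rule slang applies.

```python
import unicodedata

# Two ways to write the Spanish "enye".
precomposed = "\u00f1"   # single code point: LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # 'n' followed by COMBINING TILDE (U+0303)

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    byte_count = len(text.encode("utf-8"))
    print(label, code_points, byte_count, "bytes")

# A nonzero canonical combining class marks U+0303 as a combining character.
is_combining = unicodedata.combining("\u0303") != 0

# NFC normalization folds the two-character form into the precomposed one.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Both forms display as one glyph, yet they differ in byte count and in
character count, which is why the meaning of "length" needs to be
pinned down.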

When looking at the way jed's .sl files use strlen, I noticed that
most of the code using strlen would not have to be changed, assuming
strlen behaved according to the semantics of #3.  Hence, I propose the
following for slang v2:

  strlen: returns the number of characters (not bytes!) in a string.
          Any combining characters will not be included in the sum.

In addition, I propose two new functions:

  strbytelen: Returns the number of bytes in a string.

  strcharlen: Returns the number of characters in a string, counting
              the combining characters.
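
As a rough model of the proposed semantics, the three functions could
be sketched as follows.  This is Python, not the slang implementation:
the function names are borrowed from the proposal, and treating any
character with a nonzero canonical combining class as "combining" is
an assumption made for the sketch.

```python
import unicodedata


def strbytelen(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def strcharlen(s: str) -> int:
    """Number of characters (code points), combining characters included."""
    return len(s)


def strlen(s: str) -> int:
    """Number of characters, not counting combining characters."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "manana" with the enye written as 'n' + U+0303 (combining tilde).
word = "man\u0303ana"
lengths = (strlen(word), strcharlen(word), strbytelen(word))
```

For a pure ASCII string all three functions return the same value, so
code that only handles ASCII would see no difference between them.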

Comments about this proposal?
Thanks,
--John
