jed-users mailing list


Re: Jed and utf-8... a pre-pre-pre-plea :-)



On Tue, 17 Jun 2003, Romano Giannetti wrote:

> Hi,
> 
>     I was playing with some text editor lately to try to learn to switch to
>     the brand new Unicode world... and I have a little observation to do. I
>     do not know how the slang-2 and jed-unicode is doing, but as a general
>     comment: please do not link it too strictly with LANG settings. I mean,
>     I can have a *.*@utf8 LANG setting, and sometime I will need to edit
>     iso-8859-* files (for example, LaTeX files) and the other way around. I
>     mean, let the encoding be a per-buffer thing... if it's possible at all. 
> 
>     Thanks,
>                Romano    
> 

I have deleted the message to which I meant to reply, but it was a reply
(perhaps indirectly) to the above.

The issue discussed was that an invalid UTF-8 byte sequence would be kept
as-is, with each byte used at its "real" value ( <A1> ), and that this
could cause problems if the character encoding (terminology?) was later
changed, for example to UTF-16.

This may be how it will be done, but if I understood that correctly then I
would say it is the wrong way to do it.

Instead, an invalid sequence of bytes should be converted into some kind
of "out-of-band" unicode character that would preserve the original byte
value but within a different "non-character" value.  (That doesn't come
out too well, but let me explain.)

Unicode includes ranges of values called Private Use Area (PUA), and a
portion of those values could be used by jed to record data that was not
"correct" within the current encoding. 

So, if the original file included a sequence of bytes that did not
correctly represent characters, then those bytes should be converted into
values in the Private Use Area.  A simple algorithm for this would be
(lowest-possible-PUA-value + byte value), one PUA character per byte.  Each
such value is then handled as a regular Unicode character.
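
To make the arithmetic concrete, here is a minimal sketch in Python (not
jed or S-Lang code, just an illustration).  It assumes the base is U+E000,
the first code point of the Private Use Area in the Basic Multilingual
Plane; the names PUA_BASE, "pua-escape" and decode_with_pua are made up for
this example.

```python
import codecs

# Assumption: use the lowest BMP Private Use Area code point as the base.
PUA_BASE = 0xE000


def _bytes_to_pua(err):
    """Error handler: map each undecodable byte to PUA_BASE + byte value."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # One PUA character per offending byte, then resume after the bad span.
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


# "pua-escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("pua-escape", _bytes_to_pua)


def decode_with_pua(data: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as Private Use Area characters."""
    return data.decode("utf-8", errors="pua-escape")
```

With this, the stray byte <A1> between two ASCII letters becomes the single
character U+E0A1, while valid UTF-8 around it decodes normally.  Python's
built-in "surrogateescape" error handler follows a similar idea, but maps
the bytes to lone surrogates instead of the PUA.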

During internal manipulation or conversions, these characters would be
left as-is, in the sense that the correct encoding of the character is
always maintained in memory, thus preserving both the original byte value
and the fact that this was not proper character data. During output from
jed (i.e. saving into a file), the value could be stored back as the
original single byte value.  In this manner, jed could handle any kind of
data, including things like binary files that contain Unicode strings
plus arbitrary values that are not text. 
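
The save step could be sketched like this, again in Python and again only
as an illustration under the same assumption (base U+E000, with
PUA_BASE + 0x00 through PUA_BASE + 0xFF reserved for escaped bytes).  The
name encode_with_pua is hypothetical.

```python
# Assumption: same base as used when the bytes were escaped on load.
PUA_BASE = 0xE000


def encode_with_pua(text: str) -> bytes:
    """Encode to UTF-8, writing escaped PUA characters back as raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            # This character stands for one original, undecodable byte.
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

A buffer holding "a", U+E0A1, "b" is written out as the three bytes
61 A1 62, so the file comes back byte-for-byte.  The catch is that a file
which legitimately contained U+E0A1 would not round-trip this way.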

(Jed could of course also have an option to store the PUA values as
themselves, which jed would recognize later, and in this manner other
unicode manipulating software would "see" a valid unicode file, and jed
could manipulate unicode files from other software that also used the PUA
for private purposes.) 

$0.02



