jed-users mailing list


RE: UTF-8 and Regular Expressions


> -----Original Message-----
> From: G. Milde
> Sent: Wednesday, 18 April 2007 9:20
> Subject: Re: UTF-8 and Regular Expressions
> 

Hi,

> > In testing, one problem has come up: When used in UTF-8 mode, PCRE
> > cannot tolerate malformed text.  This can be a problem when jed is
> > running in UTF-8 mode, but one is editing text in some other encoding,
> > e.g., ISO-Latin-1.
> 
> However, when editing text in Jed-U, "the right thing" would be to
> convert it transparently to UTF-8 in a find_file_hook and convert it back
> when saving (analogous to compress.sl).

I not only agree, I feel even more strongly about it: I think that
internally JED should always work in UTF-8 (dropping support for 8-bit
characters) and convert from/to the local encoding (using locale()
information or some sort of -*- encoding: -*- marker in the file, if
present) as needed.

Clearly, to avoid big regressions this should be done in the reverse order:
first we need robust support for on-the-fly encoding conversion; only after
that can we drop support for the 8-bit internal encoding.
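
To make the idea concrete, here is a minimal sketch in Python (not SLang,
and not the attached iconv module). The marker syntax, the regular
expression and the names detect_encoding, load and save are assumptions
made up for this illustration; nothing like them exists in JED today.

```python
import locale
import re

# Hypothetical marker syntax, e.g. "-*- encoding: iso-8859-1 -*-".
_MARKER = re.compile(rb"-\*-.*?\bencoding:\s*([-\w.]+).*?-\*-")


def detect_encoding(raw, default=None):
    """Use a marker in the first two lines, else the locale's encoding."""
    for line in raw.splitlines()[:2]:
        m = _MARKER.search(line)
        if m:
            return m.group(1).decode("ascii")
    return default or locale.getpreferredencoding(False)


def load(raw, default=None):
    """Decode file bytes to text; remember the encoding for saving."""
    enc = detect_encoding(raw, default)
    return raw.decode(enc), enc


def save(text, enc):
    """Re-encode the text in the file's original encoding."""
    return text.encode(enc)


# A Latin-1 file: the byte 0xF6 (ö) is malformed when read as UTF-8.
raw = "# -*- encoding: iso-8859-1 -*-\nf\xf6r\n".encode("latin-1")
text, enc = load(raw)
internal = text.encode("utf-8")   # what a UTF-8-only editor would hold
assert save(text, enc) == raw     # the round trip is lossless
```

The point of the sketch is only that the editor never sees the Latin-1
bytes: conversion happens once on load and once on save.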

> Conversion could be done by `iconv`, `recode` or (from|to latin-1) a
> poor man's converter in SLang. Jörg did post (part of) such a solution
> some time ago to the list.
> 

Well, I don't know if Jörg wrote something for this, but I did: I have an
iconv module for SLang. This is the message to this list announcing
iconv_module:
http://www.ruptured-duck.com/jed-users/msg00721.html (please don't use the
attachment to that message: it has changed a lot since then; use the file
attached to this mail instead).

Here is a mail Marko Mahnic wrote some time ago; I think it gives a good
description of what is needed to support multiple charsets:
http://ruptured-duck.com/jed-users/msg00515.html

Another interesting thread about utf-8 and charsets:
http://www.ruptured-duck.com/jed-users-2003/msg00373.html

In one of the threads linked above, John said he prefers to add a native
interface for charset conversions to SLang, instead of a module. That way
we can write a "poor man's" version for systems without iconv. For modern
Linux systems (anything with glibc) this is not a problem, as iconv is
integrated in glibc, and for Windows, well, my JED installer already ships
iconv.dll :-)
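
As a rough idea of what such a "poor man's" converter involves for the
Latin-1 case, here is a sketch in Python (chosen for illustration only; it
is not SLang and not taken from any posted solution, and the function names
are made up). It relies on the fact that Latin-1 bytes map directly to the
code points U+0000..U+00FF.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert Latin-1 bytes to UTF-8 without iconv."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                    # ASCII is unchanged
        else:
            out.append(0xC0 | (b >> 6))      # lead byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))    # continuation byte
    return bytes(out)


def utf8_to_latin1(data: bytes) -> bytes:
    """Convert UTF-8 back to Latin-1; reject anything above U+00FF."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif (b in (0xC2, 0xC3) and i + 1 < len(data)
              and 0x80 <= data[i + 1] <= 0xBF):
            out.append(((b & 0x03) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            raise ValueError("not representable in Latin-1 at byte %d" % i)
    return bytes(out)
```

Any other 8-bit charset would need a lookup table instead of the bit
shifting, which is where a real iconv becomes attractive.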

> > My inclination is that if the lack of UTF-8 support by the current
> > regular expression engine is not much of a problem, then I think that
> > by default, regular expressions will be compiled using byte-semantics,
> > independent of whether or not jed is running in UTF-8 mode.
> 
> I do have the impression that it would be quite surprising if, in
> UTF-8 mode, re_search_forward with the pattern "f..r" matched "för"
> (with ö == U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) and the pattern
> "f[oö]r" were invalid (or did not match "för").
> 

Yes, I think it is a bit strange...
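
The effect is easy to reproduce outside jed. The sketch below uses Python's
re module, which is neither the SLang engine nor PCRE, so it only
illustrates byte semantics versus character semantics in general, on the
"för" example quoted above.

```python
import re

text = "för"
raw = text.encode("utf-8")        # 4 bytes: f, 0xC3, 0xB6, r

# Byte semantics: "." matches one byte, so "f..r" matches "för".
byte_dots = re.fullmatch(rb"f..r", raw)

# Character semantics: "för" has only 3 characters, so "f..r" fails.
char_dots = re.fullmatch(r"f..r", text)

# Byte semantics: the class [oö] holds the bytes o, 0xC3 and 0xB6 and
# matches exactly one of them, so the two-byte ö can never match.
byte_class = re.search("f[oö]r".encode("utf-8"), raw)

# Character semantics: [oö] matches the single character ö.
char_class = re.search(r"f[oö]r", text)

print(bool(byte_dots), bool(char_dots), bool(byte_class), bool(char_class))
```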

Anyway, I use regular expressions a lot, and I don't remember ever needing
UTF-8 support or having a problem because it was missing.

But I'm probably not a good test case: in Italian we have very few
characters outside the ASCII (7-bit) set.

Thanks,
						Dino

Attachment: jed-charset.zip
Description: Binary data

