Mailing List Archive

Re: tlug: unicode



--------------------------------------------------------
tlug note from jwb@example.com (Jim Breen)
--------------------------------------------------------
On May 27,  5:52pm, "Stephen J. Turnbull" wrote:
} Subject: Re: tlug: unicode
>> 
>>     Jim> I hope it never has to - It would be a disaster of the first
>> 
>> Too late; this is what Mule does already.  I think it's unlikely to
>> change, since it's an efficient way to handle multilingual input and
>> editing.

I regard Mule as a pre-Unicode system, and its ISO-2022 style internal
coding an interim technique. Frankly I don't think it will persist, if
Unicode gets general acceptance.

>>   But I gave a practical, if relatively contrived and
>> trivial, example of when the language tag has real semantic meaning.

True, but a bit of an isolated case. If I was searching a document
containing a mix of French & English, I'd have little chance. In a few
cases I could detect words as distinct within the language, but where they
were common (and perhaps faux amis such as `manifestation') I'd be stuck.

The mixed Chinese/Japanese example is a pretty unique one: in the world
of computerized text processing, it is probably the only case where you
could find languages using essentially the same characters in different
encodings.
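
A small sketch of this point, assuming Python 3 and its standard codecs
(neither is part of the original thread; the sample character and the
variable names are illustrative only). It encodes one character shared by
Chinese and Japanese under GB2312 and EUC-JP, then decodes both to Unicode:

```python
# One character used in both Chinese and Japanese text (illustrative choice).
ch = "中"

gb_bytes = ch.encode("gb2312")   # Chinese national encoding (EUC-CN form)
jp_bytes = ch.encode("euc_jp")   # Japanese EUC encoding of JIS X 0208

# The legacy byte sequences differ between the two encodings...
print(gb_bytes.hex(), jp_bytes.hex())

# ...but both decode to the same single Unicode code point.
same_char = gb_bytes.decode("gb2312") == jp_bytes.decode("euc_jp")
print(same_char, f"U+{ord(ch):04X}")
```

In the legacy encodings the bytes alone hint at the language; after
conversion to Unicode that hint is gone.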

>> Also (despite the unification philosophy) the identical character can
>> have different meaning in the different languages.  That would mean
>> that a content-indexing program would want to carry language along
>> with characters.  You can argue that it's not important, that you can
>> handle it otherwise.  I'd like to give the programmers the flexibility 
>> to implement it with wider characters in a standard way.

Well, I wouldn't. I like to separate things. It's only a short step to say
that `y' should have different codings for English and French, because it
can be a consonant for one, but only a vowel (of sorts) for the other.
Remember we are not doing this for the programmers.
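
The two positions can be contrasted in a rough Python sketch. Everything
here is hypothetical: the language ids, the 24-bit shift, and the helper
names are invented for illustration and do not describe Mule or any
standard. Design A folds a language id into a wider character value;
design B keeps characters plain and attaches the language to a run of text:

```python
# Design A (hypothetical): language id packed into the bits above the
# character code. The ids and the bit layout are made up for this sketch.
LANG_IDS = {"en": 1, "fr": 2}

def pack(ch, lang):
    """Return a wide integer carrying both language id and code point."""
    return (LANG_IDS[lang] << 24) | ord(ch)

def unpack(code):
    """Split a packed value back into (character, language id)."""
    return chr(code & 0xFFFFFF), code >> 24

# Design B: plain characters, language kept separately per run of text.
runs = [("en", "yes"), ("fr", "yeux")]

# Under design A the same letter no longer compares equal across languages.
print(pack("y", "en") == pack("y", "fr"))   # False

# Under design B the letter itself is unchanged; only the run tag differs.
print(runs[0][1][0] == runs[1][1][0])       # True
```

Design A makes the `y' example above literal: a search for the letter has
to strip the tag first, which is the cost being argued over.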

I think we'll have to agree to disagree on that.

>> From the user perspective, what will happen, I think, is what you
>> would want: Mule will convert to Unicode before writing a file.  The
>> 4-byte representation will rarely be seen outside of RAM owned by Mule
>> and similar tools.  I just think it's good to standardize an internal
>> code for things like Mule; we have a good framework for doing it.

But without round-trip capability. Going from a Unicode text into
Mule_internal_format will be fun, unless you hack in language markers to
tell Mule to map into quasi-GB internally.
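
A minimal sketch of the round-trip problem, again assuming Python 3 and
its standard codecs as a stand-in (this is not Mule's internal format, and
the `charset_for` table and the tagged tuple are hypothetical). Two byte
streams from different national encodings collapse to the same Unicode
string, so the original encoding can only be recovered from a marker kept
alongside the text:

```python
# The same two characters as they might arrive from a Chinese and a
# Japanese source file (sample text is illustrative).
gb_original = "中文".encode("gb2312")
jp_original = "中文".encode("euc_jp")

# After conversion to Unicode the two are indistinguishable.
u_from_gb = gb_original.decode("gb2312")
u_from_jp = jp_original.decode("euc_jp")
print(u_from_gb == u_from_jp)

# Hypothetical out-of-band language marker carried with the text.
charset_for = {"zh": "gb2312", "ja": "euc_jp"}
tagged = ("zh", u_from_gb)

# With the marker, the original bytes can be restored exactly.
restored = tagged[1].encode(charset_for[tagged[0]])
print(restored == gb_original)

# Without it, choosing the wrong target yields different bytes.
wrong_guess = u_from_gb.encode("euc_jp")
print(wrong_guess == gb_original)
```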

Jim

-- 
Jim Breen          [ジム・ブリーン@モナシュ大学]
Department of Digital Systems.                  Monash University, 
Clayton VIC 3168 Australia (p) +61 3 9905 3298 (f) +61 3 9905 3574  
j.breen@example.com   [http://www.dgs.monash.edu.au/~jwb/]