Mailing List Archive


Re: tlug: unicode



--------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
--------------------------------------------------------
>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> On May 27, 3:22pm, "Stephen J. Turnbull" wrote:
    Jim> } Subject: Re: tlug: unicode
    >>> For example, suppose I'm grepping for all the Japanese words
    >>> in a Chinese-language nihongo textbook.  Given a 31-bit code
    >>> space, a UCS-4 grep can too.

    Jim> I hope it never has to - It would be a disaster of the first

Too late; this is what Mule does already.  I think it's unlikely to
change, since it's an efficient way to handle multilingual input and
editing.

    Jim> order if Chinese and Japanese ended up as distinct sets.

They never will, not in the Basic Multilingual Plane.  We can have our
cake and eat it, too, at the cost of very fat characters for internal
processing.

[snip]
    Jim> I expect that eventually national font styles will be handled
    Jim> [by wrapping them in tags like for italics in TeX].

In fact this seems to be exactly the direction TeX (well, Omega
and CJK) is going.

    Jim> This is really a presentation markup. It doesn't thrill me
    Jim> [for the grep example], but I prefer it to the alternative.

Agreed.  If it's really a matter of style, it's much better to have a
markup tag.  But I gave a practical, if relatively contrived and
trivial, example of when the language tag has real semantic meaning.
Also (despite the unification philosophy) the same character can
have different meanings in different languages.  That would mean
that a content-indexing program would want to carry language along
with characters.  You can argue that it's not important, that you can
handle it otherwise.  I'd like to give the programmers the flexibility 
to implement it with wider characters in a standard way.

From the user's perspective, what will happen, I think, is what you
would want: Mule will convert to Unicode before writing a file.  The
4-byte representation will rarely be seen outside of RAM owned by Mule
and similar tools.  I just think it's good to standardize an internal
code for things like Mule; we have a good framework for doing it.
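
To make the idea concrete, here is a minimal sketch of a wide internal
code that carries a language tag with each character and drops it again
on output.  Everything in it is an assumption made for illustration: the
tag values, the 16-bit split, and the function names are hypothetical,
and this is not Mule's actual internal encoding.

```python
# Hypothetical language tags; NOT Mule's actual internal encoding.
LANG_TAGS = {"ja": 1, "zh": 2}
TAG_SHIFT = 16       # assumption: low 16 bits hold a BMP code point
CODE_MASK = 0xFFFF   # assumption: tag lives in the bits above that


def tag_text(text, lang):
    """Pack each BMP character and a language tag into one wide integer."""
    tag = LANG_TAGS[lang]
    wide = []
    for ch in text:
        cp = ord(ch)
        if cp > CODE_MASK:
            raise ValueError("this sketch only handles BMP characters")
        wide.append((tag << TAG_SHIFT) | cp)
    return wide


def strip_tags(wide):
    """Drop the tags, as a tool might before writing plain Unicode."""
    return "".join(chr(w & CODE_MASK) for w in wide)


def runs_in_language(wide, lang):
    """Return the maximal runs of characters carrying the given tag."""
    tag = LANG_TAGS[lang]
    runs, current = [], []
    for w in wide:
        if w >> TAG_SHIFT == tag:
            current.append(chr(w & CODE_MASK))
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


# A Chinese sentence fragment with one Japanese word embedded in it.
buffer = tag_text("学习", "zh") + tag_text("学校", "ja") + tag_text("的", "zh")
japanese_words = runs_in_language(buffer, "ja")
plain = strip_tags(buffer)
```

The first character of both words is the same unified code point, so a
search over `plain` cannot tell them apart, while a search over `buffer`
can.  Every value still fits comfortably in a 31-bit code space.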

    >>> [At Shift-JIS] I think we've just reached
    >>> Jim Breen's limit of tolerance.  No JIS X 0212.  :-)

    Jim> Wait for it! There's an extension planned for JIS X 0208

Got me!

Steve

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091;  Fax: 55-3849              turnbull@example.com
-----------------------------------------------------------------
a word from the sponsor will appear below
-----------------------------------------------------------------
The TLUG mailing list is proudly sponsored by TWICS - Japan's First
Public-Access Internet System.  Now offering 20,000 yen/year flat
rate Internet access with no time charges.  Full line of corporate
Internet and intranet products are available.   info@example.com
Tel: 03-3351-5977   Fax: 03-3353-6096

