Re: tlug: A couple of questions about Unicode

To: tlug@example.com
Subject: Re: tlug: A couple of questions about Unicode
From: "Stephen J. Turnbull" <turnbull@example.com>
Date: Mon, 19 Jan 1998 13:31:48 +0900 (JST)

>>>>> "David" == J David Beutel <jdb@example.com> writes:

    David> On Mon, 12 Jan 1998, Gaspar Sinai wrote:
>   o there is a codespace where two 16-bit characters are used to map a 
>     portion of the UCS4 space into UCS2.
>   o if you want to process  some Indian or Arabic scripts you need to
>     combine two 16-bit unicode character to form a single glyph.

Don't forget that various scripts, even Latin, have accents and so
on that can be used to compose characters (the voicing marks used in
half-width katakana are certainly available, and I imagine it's
possible to write French using "ASCII" plus accents, though no one
would want to), and that there are all those Hangul jamo.  These are
optional from the point of view of composing or transliterating
documents for storage or transmission as Unicode without semantic
loss, but essential for round-tripping.
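
As a rough illustration of the three composition cases above (a Latin
accent, Hangul jamo, and the half-width katakana voicing mark), here is
a minimal Python sketch.  It leans on the normalization forms in
Python's standard unicodedata module, which is an assumption of this
sketch rather than anything discussed in the thread; the compose()
helper name is hypothetical.

```python
import unicodedata

def compose(text, form="NFC"):
    """Hypothetical helper: normalize text to a composed form."""
    return unicodedata.normalize(form, text)

# Latin: base letter followed by a combining grave accent.
a_grave = compose("a\u0300")

# Hangul: leading consonant, vowel, and trailing consonant jamo.
han = compose("\u1112\u1161\u11ab")

# Half-width katakana KA plus the half-width voicing mark.  These are
# compatibility characters, so NFKC (not NFC) is needed to merge them.
ga = compose("\uff76\uff9e", form="NFKC")

for label, s in [("a + grave", a_grave), ("jamo", han), ("katakana", ga)]:
    print(label, [f"U+{ord(c):04X}" for c in s])
```

Note that plain NFC leaves the half-width katakana pair untouched,
which is the kind of distinction that matters for round-tripping.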

    David> I still have difficulty believing that such a nasty hack
    David> was done.  Can we pretend that it wasn't?

_You_ can _do_ anything you want.  You just can't say you're
supporting Unicode.  You can get rid of a lot of the functionality you
wish would go away by refusing to support UCS-2 level 2 or 3.

You can call it a "nasty hack" if that makes you feel better.  But I
find it very difficult to imagine that at this stage of linguistic
science we know enough to create a single Unicode that handles all
natural languages without some nasty hacks.  And remember, Unicode is
not just for display, but also for input and transmission, which have
problems of their own.

For example, suppose you implement a French input method for US
keyboards with the internal code being Unicode.  Then it makes sense
for an <accented a> to be represented in the input stream as <'a>.  It 
may make sense for that accent to be a separate code point so that
both <accented a> and "'a" can be typed in two keystrokes.  I don't
know, I'm not a specialist.  I don't want to restrict what specialists 
can do, however, especially in these complicated scripts like Hangul
and Devanagari.  Watching Devanagari being input reminds me of ants at
a picnic: the characters go back and forth in the display, yet there
is a definite sense of forward progress.
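
To make the input-method example concrete, here is a minimal sketch of
a dead-key composer, assuming a made-up key table (DEAD_KEYS) and a
made-up function name (compose_keystrokes).  It is not how any real
input method works; it only shows an accent keystroke followed by a
letter being turned into one accented character.

```python
import unicodedata

# Hypothetical dead-key table: ASCII key -> combining mark it stands for.
DEAD_KEYS = {"'": "\u0301", "`": "\u0300", "^": "\u0302"}

def compose_keystrokes(keys):
    """Turn a keystroke string such as "'a" into accented text."""
    out = []
    pending = None  # a dead key waiting for its base letter
    for key in keys:
        if pending is not None:
            # Unicode puts the combining mark after the base letter.
            candidate = unicodedata.normalize("NFC", key + DEAD_KEYS[pending])
            if len(candidate) == 1:
                out.append(candidate)      # a precomposed character exists
            else:
                out.append(pending + key)  # no such letter: keep both keys
            pending = None
        elif key in DEAD_KEYS:
            pending = key
        else:
            out.append(key)
    if pending is not None:
        out.append(pending)  # trailing dead key stays literal
    return "".join(out)

print(compose_keystrokes("d'ej`a vu"))
```

Because this sketch reuses the ASCII apostrophe as the accent key, it
cannot produce a literal "'a", and "c'est" comes out wrong.  That is
the ambiguity a separate code point for the accent would avoid.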

Unicode V.1 and V.2 do very well at handling almost everything that
is useful to a systems programmer in a two-byte widechar format, and
add some escape clauses that allow Unicode to be useful in 99% of the
remaining linguistic situations, at the expense of admitting
multi-byte representations.
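
One of those escape clauses is the codespace Gaspar mentions, where two
16-bit values stand for one character outside the 16-bit range.  The
arithmetic can be sketched as follows; the function names are my own,
and the check against Python's utf-16-be codec is only a sanity test.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) 16-bit pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-mapped range")
    v = cp - 0x10000                  # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Recombine a (high, low) pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Sanity check against the standard library's UTF-16 encoder.
assert chr(0x1F600).encode("utf-16-be") == bytes(
    [high >> 8, high & 0xFF, low >> 8, low & 0xFF]
)
```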

> o Unicode is not consistent to the rules it set to itself. You would 
>   expect that the wide ASCII characters would have the ASCII values just
>   like wide Cyrillic or Greek but this is not the case. For some strange 
>   reason they kept the wide ASCII.

As David says, I believe this is due to the "source separation rule".
Source separation does not apply to Cyrillic and Greek because there's
only one copy of each in JIS X 0208.
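
For what it's worth, the wide ASCII forms sit at a fixed offset from
their ASCII counterparts as separate code points, which is what lets
the two copies stay distinct.  A small sketch, where to_fullwidth() and
FULLWIDTH_OFFSET are names I made up for illustration:

```python
import unicodedata

# U+FF01..U+FF5E mirror the printable ASCII range U+0021..U+007E.
FULLWIDTH_OFFSET = 0xFEE0

def to_fullwidth(text):
    """Map printable ASCII (except space) to the wide forms."""
    return "".join(
        chr(ord(c) + FULLWIDTH_OFFSET) if "!" <= c <= "~" else c
        for c in text
    )

wide_a = to_fullwidth("A")
print(f"U+{ord(wide_a):04X}", unicodedata.name(wide_a))

# The wide form is a different code point from ASCII "A"; only a
# compatibility normalization folds it back, losing the distinction.
assert wide_a != "A"
assert unicodedata.normalize("NFKC", wide_a) == "A"
```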

    David> I agree that wide ASCII does not deserve its own
    David> encoding--it should be a font thing.  I don't know about
    David> wide Cyrillic nor Greek, but my understanding about the
    David> wide ASCII is that Unicode includes distinct encodings for
    David> all chars of all national character sets, including JIS X
    David> 0208, which has both normal and wide ASCII.  No chars
    David> within a single national char set were unified, so that
    David> encoding translations of any such set to Unicode and back
    David> will be identical.  I think it's a good rule.  Of course,
    David> Shift-JIS is not a national char set.

Be careful.  Shift-JIS is an encoding which contains code points for
several national character sets.  As far as I know, round-tripping to
Unicode and back to SJIS is possible.
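
A quick way to try this on a small sample is sketched below, assuming
Python's built-in shift_jis codec as the conversion table.  Other
tables may differ, and three characters prove nothing about the whole
encoding; the sample bytes are my own choice.

```python
# ASCII "A", the JIS X 0208 wide "A" (0x82 0x60), and the JIS X 0201
# half-width katakana A (0xB1), all in one Shift-JIS byte string.
raw = b"A\x82\x60\xb1"

text = raw.decode("shift_jis")
print([f"U+{ord(c):04X}" for c in text])

# Each source character lands on its own code point, so encoding
# again reproduces the original bytes.
round_tripped = text.encode("shift_jis")
assert round_tripped == raw
```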
