Re: [tlug] OT-Japanese in PHP

Date: Tue, 24 May 2005 14:48:07 +0900
From: "Stephen J. Turnbull" <stephen@example.com>
Subject: Re: [tlug] OT-Japanese in PHP
References: <200505220201.j4M21ZnW002503@example.com><EX-MAIL-SHI-01VDBcs00000108@example.com><87wtpqsd0b.fsf@example.com><EX-MAIL-SHI-01OXNHU00000115@example.com>
Organization: The XEmacs Project
User-agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.5 (cilantro, linux)

>>>>> "Yoshihiro" == Yoshihiro Sato <y_satou@example.com> writes:

    Yoshihiro> We need to clarify end user's environment for designing
    Yoshihiro> of Japanese handling.  I considered that that is web
    Yoshihiro> browser, and not specified its OS / versions (because
    Yoshihiro> this is PHP's thread.) That's the reason why I
    Yoshihiro> recommended to reject characters which are not in
    Yoshihiro> JISX0208.

If the end user has a browser that can enter the character, she
probably has a browser that can display it.

Anyway, few servers hesitate to enforce browser upgrades in order to
create funkier displays.  "Best viewed with next year's Internet
Exploder; CANNOT be viewed with last year's anything!" pages are all
over the place, yet they can't handle users' names?

    Yoshihiro> And, accepting JISX0213 characters will be a problem on
    Yoshihiro> backend, if backend is not designed specifically to
    Yoshihiro> handle JISX0213.

Sure.  So fix the backend.  It may take time, but it's (usually) a
much easier problem conceptually than dealing with the end user UI
because the server owner usually owns the backend, too.

    Yoshihiro> Here is simple example: JISX0213 is including circled
    Yoshihiro> number #1 to #50 (and unicode does not defined circled
    Yoshihiro> number #21 to #50 characters, as far as I know.)

This is a perfect example.  Nobody needs those characters.  Sure, if
they're available, they'll be used, just like the Zapf dingbats.  But
the most important effect of standardizing those characters is to
ensure that Japanese standards will not be unified into Unicode for
years.

    Yoshihiro> You can find summary of characters, which are defined
    Yoshihiro> in JISX0213 but not in Unicode:

  http://www.m17n.org/m17n2000_all_but_registration/proceedings/kawabata/non-unicode.gif

Wow.  The shogi koma are useful in daily life for many Japanese and in
line with the principles of Unicode (although arguably there should be
more than a dozen of them, to represent all the pieces, like the chess
series U+2654--U+265F).

All the rest ... what's the rush?  It would be better to standardize a
block of characters that could be loaded into private space in
Unicode, and accessed relative to that space.
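To illustrate what "accessed relative to that space" could mean, here
is a rough Python sketch.  Only the bounds of the BMP Private Use Area
(U+E000 to U+F8FF) are real; the function names, the base address, and
the block size are hypothetical, chosen just for the example.

```python
# Sketch only: a standardized block of `size` characters is loaded at a
# locally chosen `base` inside the Private Use Area, and every character
# is named by its index within the block, never by an absolute code point.
PUA_FIRST = 0xE000  # first code point of the BMP Private Use Area
PUA_LAST = 0xF8FF   # last code point of the BMP Private Use Area


def block_char(base: int, index: int, size: int) -> str:
    """Return the character at `index` of a block loaded at `base`."""
    if not (PUA_FIRST <= base and base + size - 1 <= PUA_LAST):
        raise ValueError("block does not fit in the private use area")
    if not 0 <= index < size:
        raise IndexError("index outside the block")
    return chr(base + index)


def block_index(base: int, ch: str, size: int) -> int:
    """Inverse mapping: recover the block-relative index of `ch`."""
    offset = ord(ch) - base
    if not 0 <= offset < size:
        raise ValueError("character is not in this block")
    return offset
```

Two sites that load the same block at different bases can still
exchange data, as long as they exchange (block, index) pairs rather
than raw private-use code points.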

    Yoshihiro> I agree that if the target of the system is M17N and
    Yoshihiro> not L10N, unicode is the best solution.

But Unicode is no worse for L10N.  Why support both Unicode and a
national standard?  I don't know about PHP, but the other P-languages
commonly used to implement web applications (Perl, Python, and Puby---
the last P is Greek) all have reasonable suites of codecs and
well-defined ways to create new ones.  So storing internally in
Unicode and (trivially) converting on the fly as necessary just is no
big deal.
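
As a concrete illustration, here is a minimal Python sketch of "store
in Unicode, convert at the edges".  I use Python only because it is one
of the languages named above; I am not claiming anything about PHP.
The function names and the charset parameter are my own assumptions;
the codec names are ones Python ships with.

```python
def from_client(raw: bytes, charset: str) -> str:
    """Decode bytes received from a client into a Unicode string."""
    return raw.decode(charset)


def to_client(text: str, charset: str) -> bytes:
    """Encode stored Unicode text into whatever charset a client wants."""
    return text.encode(charset)


# "nihongo" written with escapes so the source file stays ASCII.
SAMPLE = "\u65e5\u672c\u8a9e"

# A client submits EUC-JP; the application keeps only the Unicode form.
stored = from_client(SAMPLE.encode("euc_jp"), "euc_jp")

# The same stored value can be served in several encodings on demand.
as_sjis = to_client(stored, "shift_jis")
as_utf8 = to_client(stored, "utf-8")
```

The application logic between those two calls never needs to know
which national encoding a given client happens to use.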

    Yoshihiro> I considered that this thread was/is PHP. and
    Yoshihiro> considered user clients are various OS/versions web
    Yoshihiro> browser - I don't think PHP connect to XEmacs in batch
    Yoshihiro> mode to handle entered strings with loading correct
    Yoshihiro> mapping table per each request.

Of course the server doesn't connect to Emacs; Emacs LISP is a
terrible language to write servers in (although people do it, and one
of the more popular window managers, Sawfish, is written in a LISP
that inherits a lot from Emacs LISP).  The point is that this kind of
programming is ultimately table-driven, anyway.  If we all use
Unicode, we can (a) share the table drivers, and (b) share the
tables.
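
A toy Python sketch of what I mean by "table-driven": one generic
driver, with the character set supplied purely as data.  The function
name is hypothetical, the driver deliberately ignores EUC-JP's
single-shift sequences, and the three-entry table is borrowed from
Python's own EUC-JP codec only to keep the example short.

```python
def decode_two_byte(data: bytes, table: dict, unknown: str = "\ufffd") -> str:
    """Generic driver: ASCII passes through, a high byte starts a 2-byte code.

    `table` maps 2-byte sequences to Unicode strings.  Codes missing from
    the table (including a truncated final byte) decode to `unknown`.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            out.append(chr(data[i]))
            i += 1
        else:
            out.append(table.get(data[i:i + 2], unknown))
            i += 2
    return "".join(out)


# Demo table for three kanji ("nihongo"), built from Python's codec.
SAMPLE = "\u65e5\u672c\u8a9e"
EUC_TABLE = {ch.encode("euc_jp"): ch for ch in SAMPLE}
```

Supporting another two-byte set with the same framing would mean
swapping in a different table; the driver stays untouched, which is
the part everybody could share.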

This will take time for retrofitting old applications, of course.  My
point is that the implementation in XEmacs only took about a man-day
(a very very smart man-day, I'll admit).

    Yoshihiro> Still Unicode does not cover bunch of characters, which
    Yoshihiro> are used in people's name, location name, etc. Here is
    Yoshihiro> the example:
    Yoshihiro> http://homepage2.nifty.com/Gat_Tin/kanji/itaiji.htm

    Yoshihiro> That's the reason why there's project / activities to
    Yoshihiro> support more characters. For example, Mojikyo
    Yoshihiro> http://www.mojikyo.org/

Mojikyo is a fun hobby, but it has little to do with fixing these
problems.  Nobody outside of the Mojikyo club is ever going to use
99% of those characters.

    Yoshihiro> Agree :) But we still do not have standard way to
    Yoshihiro> handle Japanese characters (or say, characters which
    Yoshihiro> are used in Japanese) - especially if characters are
    Yoshihiro> not in JISX0208.

Yep, and the resistance to Unicode and the success-avoidance
activities at JIS that result in nonsense like the non-unicode.gif
table are why.  It's moji-hara, sorta like seku-hara ;-), if you ask
me.

But "itaiji" and "gaiji" are really a different issue, don't you
think?  It's akin to the Western notion of a "signature", which you
could think of as creating a personal font for one's name.  I agree
that it's very important to deal with them in Japan, and probably
throughout the Han-using cultures.  But it should be solved in a way
that represents the human individuality of names, not by saying that
"my ichi is a different character from your ichi".

    Yoshihiro> Here is interesting examples, how Kashiwa city
    Yoshihiro> governments are/were handling people's name in census
    Yoshihiro> registration:
    Yoshihiro> http://www.horagai.com/www/moji/int/kasiwa.htm and how
    Yoshihiro> "Japan Basic Resident Register Network" is handle
    Yoshihiro> characters: http://www.horagai.com/www/moji/juki.htm

Thank you for the references; I will look at them closely.  The
question is, why doesn't JIS put its effort into standardizing this
kind of thing, which is essentially an attempt to create a standard
solution to the "itaiji/gaiji problem", instead of deliberately
perpetuating divergent character set standards that are at best a tiny
improvement over Unicode?

In practice, the gaiji problem is never going to go away.  The
non-unicode.gif table is full of recently invented scientific
notation.  There will be more.  We need a way to represent those
characters _as they are invented_, far more than we need "maru-50", or
even "Takashimaya-no-taka".

    Yoshihiro> It's depend on target of the system.  If the service is
    Yoshihiro> provided to end user via web/http, and basically not
    Yoshihiro> restricted OS and/or environments, the safest way at
    Yoshihiro> this point (I don't mean in future) is, to avoid to be
    Yoshihiro> inserted Japanese characters which are not in JISX0208.

I don't understand this.  The worst that can happen is a couple of
geta marks on the display.  The data on the server won't be corrupted.
And users will quickly learn that the geta marks mean that their
client is broken, and complain, and get them fixed.
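
A minimal Python sketch of that worst case, assuming the server stores
Unicode and that Python's shift_jis codec (JIS X 0201 plus JIS X 0208)
stands in for a JIS X 0208-only client.  The function name is my own
invention; U+3013 is the geta mark.

```python
GETA = "\u3013"  # GETA MARK, itself a JIS X 0208 character


def encode_with_geta(text: str, encoding: str = "shift_jis") -> bytes:
    """Encode for a legacy client, substituting a geta mark per bad character."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)  # probe: can the client's charset carry it?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(GETA)
    return "".join(out).encode(encoding)


# The "Takashimaya-no-taka" (U+9AD9) followed by U+6A4B "hashi".
stored_name = "\u9ad9\u6a4b"
wire_bytes = encode_with_geta(stored_name)
```

Only the copy sent over the wire is degraded; stored_name still holds
the original character, so a fixed client gets the right glyph later.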

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

