Re: [tlug] OT-Japanese in PHP

Date: Mon, 23 May 2005 17:32:22 +0900
From: Yoshihiro Sato <y_satou@example.com>
Subject: Re: [tlug] OT-Japanese in PHP
References: <200505220201.j4M21ZnW002503@example.com><EX-MAIL-SHI-01VDBcs00000108@example.com><87wtpqsd0b.fsf@example.com>
Organization: Amazon.co.jp
User-agent: Wanderlust/2.12.0 (Your Wildest Dreams) SEMI/1.14.6 (Maruoka)FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.3(i386-redhat-linux-gnu) MULE/5.0 (SAKAKI)

On Mon, 23 May 2005 14:54:44 +0900, "Stephen J. Turnbull" <stephen@example.com> said:
 > 
>>>>>> "Yoshihiro" == Yoshihiro Sato <y_satou@example.com> writes:
Yoshihiro> On server's side, especially if it's web application, I
Yoshihiro> recommend to handle data like this: reject all
Yoshihiro> characters which are not in JISX0208, and reject all
Yoshihiro> half-width katakana.
 > 
 > That's not acceptable if you're a client-oriented operation.  And it's
 > quite unnecessary.  There are no disagreements about full-width <->
 > half-width mappings, and there's no good reason to reject anything in
 > JIS X 0212 (or JIS X 0213).
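
As an illustration of the full-width <-> half-width mapping mentioned in the
quote above, here is a minimal Python sketch (Python rather than PHP is my
assumption, purely for illustration). It uses Unicode NFKC normalization from
the standard library; the helper name fold_width is hypothetical. Note that
NFKC folds other compatibility characters as well (circled digits become plain
digits, for example), so a narrower mapping table may be preferable in practice.

```python
import unicodedata


def fold_width(text):
    """Fold half-width katakana and full-width ASCII to their usual forms.

    NFKC also rewrites other compatibility characters, so treat this as
    a sketch rather than a drop-in normalizer.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Half-width katakana with separate voiced marks become composed
    # full-width katakana; full-width Latin letters become ASCII.
    print(fold_width("ｶﾞｲﾄﾞ"))
    print(fold_width("ＡＢＣ"))
```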

We need to be clear about the end user's environment when designing Japanese
handling. I assumed that the client is a web browser, with no particular OS or
version specified (because this is a PHP thread). That is why I recommended
rejecting characters which are not in JISX0208.
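
The check described above could be sketched roughly as follows. This is Python
rather than PHP, only as an illustration, and it assumes that Python's
"iso2022_jp" codec is an acceptable stand-in for "ASCII (and JIS X 0201 Roman)
plus JISX0208": that codec has no half-width katakana, JIS X 0212, or JISX0213.
The function names are hypothetical.

```python
def find_rejected(text):
    """Return the characters of `text` that fall outside the accepted set.

    Membership is tested by trying to encode each character with the
    "iso2022_jp" codec (an assumption of this sketch, not a standard rule).
    """
    rejected = []
    for ch in text:
        try:
            ch.encode("iso2022_jp")
        except UnicodeEncodeError:
            rejected.append(ch)
    return rejected


def is_acceptable(text):
    """True if every character of `text` passes the check above."""
    return not find_rejected(text)


if __name__ == "__main__":
    samples = [
        "東京都千代田区",  # ordinary JISX0208 kanji
        "ｻﾄｳ",             # half-width katakana
        "\u2460",          # circled digit one
    ]
    for sample in samples:
        print(repr(sample), is_acceptable(sample), find_rejected(sample))
```

A server would run such a check on every submitted field and return an error
to the user instead of storing the value.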

Also, accepting JISX0213 characters will be a problem on the backend if the
backend is not designed specifically to handle JISX0213. Here is a simple
example: JISX0213 includes circled numbers #1 to #50 (and, as far as I know,
Unicode does not define characters for circled numbers #21 to #50).
You can find a summary of the characters which are defined in JISX0213 but not
in Unicode here:
  http://www.m17n.org/m17n2000_all_but_registration/proceedings/kawabata/non-unicode.gif
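
To make the backend concern concrete, here is a small Python sketch (again an
illustration only, not PHP). It assumes a backend that stores text in plain
Shift_JIS, and shows that circled digit one, which is in JISX0213 but not in
JISX0208, cannot be encoded there, while the JISX0213-aware codecs accept it.
The helper name can_store is hypothetical.

```python
CIRCLED_ONE = "\u2460"  # CIRCLED DIGIT ONE: in JISX0213, not in JISX0208


def can_store(text, codec):
    """Return True if `text` can be encoded with the named Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # "shift_jis" is the JISX0208-based codec; the other two are the
    # JISX0213 variants shipped with Python.
    for codec in ("shift_jis", "shift_jisx0213", "euc_jisx0213"):
        print(codec, can_store(CIRCLED_ONE, codec))
```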

Yoshihiro> 3. Unicode CJK characters are unified.
 > 
 > This is not a problem unless you're doing multilingual work (multiple
 > languages in the same document).  Mere I18N/L10N is complex, of
 > course, but Han unification is not the problem there.

I agree that if the target of the system is M17N rather than L10N, Unicode is
the best solution.

Yoshihiro> This issue is typically happened when entering people's
Yoshihiro> name and/or location name.
 > 
 > But this isn't a problem of Unicode, which fully handles the entire
 > JIS X 0208 and JIS X 0212 character sets.  The problem is that the
 > Japanese standards bodies have spent at least 100 years prescribing
 > rather than describing the language, and so a welter of non-conforming
 > industry standards have grown up.

I agree, this is not a Unicode problem. Whichever JISX???? standard we use,
there are characters which are accepted for use in people's names but cannot
be stored as data.

Yoshihiro> Even if end user has method to input correct character
Yoshihiro> on their UI in legacy character set, but there's a case
Yoshihiro> it's mapped to different character on server's side.
 > 
 > So fix the server!  It's not like correcting the mapping tables is
 > hard.  Eg, in XEmacs 21.5 you just do a wget of the Unicode Consortium
 > or other registry's tables into etc/unicode, and type M-x
 > load-unicode-tables RET.  XEmacs has _other_ _serious_ problems in
 > Unicode handling, but the mapping tables have been available since
 > 2002 or so, and they only took that long because there wasn't really a
 > use for them before updating the Windows port to use Windows NT
 > Unicode APIs.

I assumed that this thread is about PHP, and that the user clients are web
browsers on various OSes and versions. I don't think PHP would connect to
XEmacs in batch mode to handle the entered strings, loading the correct
mapping table for each request.
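
A well-known instance of the "mapped to a different character on the server's
side" problem can be sketched in Python (for illustration only). The same two
bytes from a Shift_JIS client decode to different Unicode characters depending
on whether the JIS-based table or Microsoft's CP932 table is used, so the
server has to know which table the client really meant.

```python
RAW = b"\x81\x60"  # one double-byte character as sent by a Shift_JIS client

as_shift_jis = RAW.decode("shift_jis")  # JIS-based mapping table
as_cp932 = RAW.decode("cp932")          # Microsoft's mapping table

if __name__ == "__main__":
    print(f"shift_jis -> U+{ord(as_shift_jis):04X}")  # WAVE DASH
    print(f"cp932     -> U+{ord(as_cp932):04X}")      # FULLWIDTH TILDE
    print("same character:", as_shift_jis == as_cp932)
```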

Yoshihiro> But actual problem is, most of the case end user does
Yoshihiro> not have proper way to input such special characters.
 > 
 > I simply don't believe that, except maybe for keitai platforms.  Both
 > Windows and the Mac provide palette-based input methods, and such are
 > available for any free software OS.  Sure, you have to find the
 > character the first time, but after that you record it in your
 > dictionary.  This is a user education problem, not a Unicode issue.

Unicode still does not cover a number of characters which are used in people's
names, place names, etc. Here are some examples:
  http://homepage2.nifty.com/Gat_Tin/kanji/itaiji.htm

That is why there are projects and activities to support more characters.
For example, Mojikyo:
  http://www.mojikyo.org/

Yoshihiro> And users input "simplified character" or "similar
Yoshihiro> character" as compromised solution when they meet
Yoshihiro> restriction.
 > 
 > Shameful.  The first thing that should be done with technology is to
 > allow people to write their own names and addresses correctly!

Agreed :) But we still do not have a standard way to handle Japanese characters
(or rather, characters which are used in Japanese), especially if the
characters are not in JISX0208.

Here are some interesting examples:
how the Kashiwa city government handles (or handled) people's names in census
registration:
  http://www.horagai.com/www/moji/int/kasiwa.htm
and how the "Japan Basic Resident Register Network" handles characters:
  http://www.horagai.com/www/moji/juki.htm

Unfortunately both pages are written in Japanese, so here is a summary:
* Unicode cannot cover all the characters which are used in people's names,
  so the Kashiwa city government handles people's names by creating external
  characters and sharing them internally over the LAN with XKP
  (http://www.est.co.jp/xkp/xkp/index.html).
* Because the "Japan Basic Resident Register Network" consolidates the census
  registrations of all cities (and other provinces), and each province defines
  its own external characters, this network handles characters using its own
  charset (戸籍統一文字コード, "unified character code for family registers",
  which is based on Unicode but maps many additional characters outside the
  CJK area) and its own font (統一文字フォント, "unified character font",
  a.k.a. 住基ネット明朝, "Juki Net Mincho").

 > I'll grant that in practice, fixing an existing installation can be
 > difficult, because you may have to rebuild from the ground up with new
 > server software, add-on modules, and the like.  But new installations
 > should take advantage of Unicode technology which allows a unified
 > treatment of all these problems, and software (including font)
 > sharing.  And this should be a criterion (not necessarily overriding,
 > of course) for any upgrade.

It depends on the target of the system.
If the service is provided to end users via web/HTTP, and the OS and/or
environment is basically not restricted, the safest way at this point (I don't
mean in the future) is to prevent Japanese characters which are not in JISX0208
from being entered.
If the system allows us to restrict the client environment (as with the
"Japan Basic Resident Register Network"), there are several ways to handle
these characters.

Yoshihiro Satou
y_satou@example.com
