Re: [tlug] OT-Japanese in PHP

Date: Wed, 25 May 2005 15:59:34 +0900
From: Yoshihiro Sato <y_satou@example.com>
Subject: Re: [tlug] OT-Japanese in PHP
References: <87u0ktqinc.fsf@example.com>
Organization: Amazon.co.jp
User-agent: Wanderlust/2.12.0 (Your Wildest Dreams) SEMI/1.14.6 (Maruoka)FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.3(i386-redhat-linux-gnu) MULE/5.0 (SAKAKI)


On Tue, 24 May 2005 14:48:07 +0900, "Stephen J. Turnbull" <stephen@example.com> said:
 > 
Yoshihiro> We need to clarify end user's environment for designing
Yoshihiro> of Japanese handling.  I considered that that is web
Yoshihiro> browser, and not specified its OS / versions (because
Yoshihiro> this is PHP's thread.) That's the reason why I
Yoshihiro> recommended to reject characters which are not in
Yoshihiro> JISX0208.
 > 
 > If the end user has a browser that can enter the character, she
 > probably has a browser that can display it.

I think we need to add a condition: "with a specific user interface".

 > Anyway, few servers hesitate to enforce browser upgrades in order to
 > create funkier displays.  "Best viewed with next year's Internet
 > Exploder; CANNOT be viewed with last year's anything!" pages are all
 > over the place, yet they can't handle users' names?

If we can limit who the end users are, yes, we can ask them to upgrade or
change their software.

I'm assuming that the service is provided to various users in various
environments (again, because this was/is a PHP thread).
For example: someone enters data in CP932, and another user looks up that
information over HTTP with a web browser on Mac OS using the Macintosh
variant of Shift-JIS, or on a cellphone from any of several carriers. Or
the information is distributed as plain-text email in iso-2022-jp.
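
As a rough illustration of that scenario (in Python rather than PHP, and
using Python's own codec names, which are an assumption of this sketch and
not something from the thread), the same string can be acceptable on one
delivery path and unencodable on another:

```python
# Sketch only: one string checked against several delivery encodings.
# "cp932", "shift_jis" and "iso2022_jp" are Python's standard codec names.

text = "\u2460\u756a"  # CIRCLED DIGIT ONE followed by the kanji "ban"


def can_encode(s, codec):
    """Return True if every character of s exists in the target codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Which delivery paths can carry the text unchanged?
results = {c: can_encode(text, c) for c in ("cp932", "shift_jis", "iso2022_jp")}

for codec, ok in results.items():
    print(codec, "ok" if ok else "cannot represent this text")
```

The circled digit is entered happily on a CP932 system, but Python's strict
Shift_JIS and ISO-2022-JP codecs (both limited to JIS X 0208 for kanji and
symbols) refuse it, which is the mismatch described above.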

But it seems that you're assuming the service can be restricted to run in
a specific environment (i.e. a specified OS, a specified UI, etc.)

I think this is where our discussion diverges.


Yoshihiro> And, accepting JISX0213 characters will be a problem on
Yoshihiro> backend, if backend is not designed specifically to
Yoshihiro> handle JISX0213.
 > 
 > Sure.  So fix the backend.  It may take time, but it's (usually) a
 > much easier problem conceptually than dealing with the end user UI
 > because the server owner usually owns the backend, too.

Yeah, it can be done if we pin down the end user's UI.


Yoshihiro> I agree that if the target of the system is M18N and
Yoshihiro> not L10N, unicode is the best solution.
 > 
 > But Unicode is no worse for L10N.  Why support both Unicode and a
 > national standard?  I don't know about PHP, but the other P-languages
 > commonly used to implement web applications (Perl, Python, and Puby---
 > the last P is Greek) all have reasonable suites of codecs and
 > well-defined ways to create new ones.  So storing internally in
 > Unicode and (trivially) converting on the fly as necessary just is no
 > big deal.

We still have a problem in the process of transcoding to Unicode. For example:

* If the received data is 0x8740, is it CIRCLED DIGIT ONE (U+2460)
  (= Windows-31J) or PARENTHESIZED IDEOGRAPH SUN (U+3230) (= Mac)? Which
  character was input on the user's side?

Unfortunately, there's no way to detect this correctly if we provide the
service on the Internet: it depends on the user's OS, browser, and font.
Maybe we can check the User-Agent for the OS and browser, but there is no
way to detect which font the user is using.
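
The 0x8740 case above can be sketched as follows. This is Python, not PHP,
and it is only an illustration: the Windows-31J reading comes from Python's
"cp932" codec, while the Mac reading is hard-coded from the example above
because Python ships no Mac Japanese codec (so treat that entry as an
assumption).

```python
# Sketch only: the same two bytes read under different vendor tables.

raw = bytes([0x87, 0x40])

candidates = {
    # Decoded with Python's cp932 (Windows-31J) codec.
    "Windows-31J (cp932)": raw.decode("cp932"),
    # Hard-coded from the example in this message; not verified by a codec.
    "Mac Japanese (assumed)": "\u3230",
}


def decode_strict_shift_jis(data):
    """Decode with Python's strict shift_jis codec; None if unassigned."""
    try:
        return data.decode("shift_jis")
    except UnicodeDecodeError:
        return None


for source, ch in candidates.items():
    print(f"{source}: U+{ord(ch):04X}")
print("strict Shift_JIS:", decode_strict_shift_jis(raw))
```

Nothing in the bytes themselves says which table applies; the strict codec
simply treats the code point as unassigned, and the server has to guess
between the vendor readings.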


 > But "itaiji" and "gaiji" are really a different issue, don't you
 > think?  It's akin to the Western notion of a "signature", which you
 > could think of as creating a personal font for one's name.  I agree
 > that it's very important to deal with them in Japan, and probably
 > throughout the Han-using cultures.  But it should be solved in a way
 > that represents the human individuality of names, not by saying that
 > "my ichi is a different character from your ichi".

Actually it's important enough for Japanese government work: that's the
reason why they have quite a volume of "gaiji" tables, and use them in
their daily operations.


 > Thank you for the references; I will look at them closely.  The
 > question is, why doesn't JIS put its effort into standardizing this
 > kind of thing, which is essentially an attempt to create a standard
 > solution to the "itaiji/gaiji problem", instead of deliberately
 > perpetuating divergent character set standards that are at best a tiny
 > improvement over Unicode?
 > 
 > In practice, the gaiji problem is never going to go away.  The
 > non-unicode.gif table is full of recently invented scientific
 > notation.  There will be more.  We need a way to represent those
 > characters _as they are invented_, far more than we need "maru-50", or
 > even "Takashimaya-no-taka".

I'm sorry, but I have no idea on this point. As far as I know, I've heard that
the citizens' office should accept a character if it's in a dictionary.
(It's the Ministry of Justice's announcement, IIRC.)

FYI, I suppose you already know this article, but it is very interesting
- written by Katsuhiro Ogata for the Impress Watch webzine,
  http://internet.watch.impress.co.jp/www/column/ogata/index.htm
it reports the discussion around finalizing 2000JIS and JISX0213. It seems
there was not only technical discussion but also political maneuvering...


Yoshihiro> It depends on target of the system.  If the service is
Yoshihiro> provided to end user via web/http, and basically not
Yoshihiro> restricted OS and/or environments, the safest way at
Yoshihiro> this point (I don't mean in future) is, to avoid to be
Yoshihiro> inserted Japanese characters which are not in JISX0208.
 > 
 > I don't understand this.  The worst that can happen is a couple of
 > geta marks on the display.  The data on the server won't be corrupted.
 > And users will quickly learn that the geta marks mean that their
 > client is broken, and complain, and get them fixed.

Typically this kind of approach is taken:
Respond to the user by displaying geta marks, with an annotation: "The
character(s) displayed as a geta mark could not be handled by our system.
Please use a simplified character, or hiragana or katakana if it is a kanji.
If it is hankaku katakana, please use zenkaku katakana. Some
machine-dependent characters, such as circled numbers and Roman numerals,
are also rejected. Please use normal Arabic digits instead."
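
A minimal sketch of that approach, in Python rather than PHP. The function
names are hypothetical, and using Python's strict "shift_jis" codec as the
JIS X 0208 membership test is an assumption of this sketch, not how any
particular site does it:

```python
# Sketch only: replace characters outside ASCII / JIS X 0208 with a geta mark.

GETA = "\u3013"  # GETA MARK, itself a JIS X 0208 character


def in_ascii_or_jisx0208(ch):
    """True for ASCII, or for characters that strict Shift_JIS encodes as
    two bytes (i.e. JIS X 0208). Hankaku katakana encode as a single byte
    and are therefore rejected, as in the annotation above."""
    if ord(ch) < 0x80:
        return True
    try:
        return len(ch.encode("shift_jis")) == 2
    except UnicodeEncodeError:
        return False


def mark_unsupported(text):
    """Return text with every unsupported character replaced by GETA."""
    return "".join(ch if in_ascii_or_jisx0208(ch) else GETA for ch in text)


# "hashigo-daka" + island + shop, then circled one, then hankaku "a"
print(mark_unsupported("\u9ad9\u5cf6\u5c4b\u2460\uff71"))
```

The result would be shown back to the user together with the annotation,
so that they can re-enter the marked characters.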

--
Yoshihiro Satou
y_satou@example.com
