Re: [tlug] OT-Japanese in PHP



On the server side, especially for a web application, I recommend
handling data like this: reject all characters that are not in JIS X 0208,
and reject all half-width katakana.
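
The rule above could be sketched in Python roughly as follows. This is only an illustration, not a tested production filter: the function names are made up for this example, it assumes the input has already been decoded to Unicode, and it assumes plain ASCII should be accepted alongside JIS X 0208. It relies on the fact that in EUC-JP a JIS X 0208 character is exactly two bytes in the range 0xA1-0xFE, while half-width katakana and JIS X 0212 use the 0x8E and 0x8F prefixes.

```python
def is_jisx0208_or_ascii(ch):
    """Return True if ch is ASCII or a JIS X 0208 character (hypothetical helper)."""
    if ord(ch) < 0x80:
        return True  # assumption: plain ASCII is allowed alongside JIS X 0208
    try:
        encoded = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return False  # not in the codec's repertoire at all
    # JIS X 0208 in EUC-JP: exactly two bytes, each in 0xA1-0xFE.
    # Half-width katakana start with 0x8E; JIS X 0212 starts with 0x8F.
    return len(encoded) == 2 and all(0xA1 <= b <= 0xFE for b in encoded)


def rejected_characters(text):
    """List the characters in text that the rule above would reject."""
    return [ch for ch in text if not is_jisx0208_or_ascii(ch)]


if __name__ == "__main__":
    for sample in ["日本語 ABC", "ｶﾀｶﾅ", "①"]:
        print(repr(sample), "->", rejected_characters(sample))
```

A form handler would refuse the submission (or ask the user to retype) whenever rejected_characters() returns a non-empty list.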


The difficulties of handling Japanese are:

1. Shift_JIS is not the same as CP932 / Windows-31J / Shift_JIS on Macintosh.
2. There are many mapping tables between Unicode and the legacy Japanese charsets.
3. Unicode CJK characters are unified.


1. Shift_JIS is not the same as CP932.

Shift_JIS was originally not a charset; it is a rule for how to "shift" JIS X 0208.
Microsoft's Windows-31J (aka CP932), however, has some additional characters in
its extension area: circled numbers, Roman numerals, symbols, etc.

On the other hand, Macintosh (Mac OS) assigns different characters to
the same byte values. For example, the Windows (CP932) circled number one is displayed as
(日) (as a single double-width character) on Mac.
You can find a list of the Shift_JIS characters that do not match between Windows and Macintosh here:
  http://www.notoinsatu.co.jp/font/omake/S-JIS_check.pdf

The problem is that a Windows PC can enter these characters into a web browser form.
On the web server side, it is really difficult to detect which character was
entered on the user's side. You may need to check the OS and browser version
carefully, but even that does not guarantee a correct result every time.
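
As a small illustration of the Shift_JIS / CP932 gap, the following Python sketch uses the standard "shift_jis" and "cp932" codecs as stand-ins for the two definitions (the Macintosh variant is not covered here). The helper name can_encode is made up for this example.

```python
CIRCLED_ONE = "\u2460"  # CIRCLED DIGIT ONE, a CP932 extension character


def can_encode(text, codec):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# CP932 assigns the circled digit to a two-byte code in its extension area.
cp932_bytes = CIRCLED_ONE.encode("cp932")

# The same bytes are not valid in strict Shift_JIS (JIS X 0208 only).
try:
    cp932_bytes.decode("shift_jis")
    decodes_as_shift_jis = True
except UnicodeDecodeError:
    decodes_as_shift_jis = False

if __name__ == "__main__":
    print("cp932 bytes:", cp932_bytes.hex())
    print("encodable as shift_jis:", can_encode(CIRCLED_ONE, "shift_jis"))
    print("cp932 bytes decode as shift_jis:", decodes_as_shift_jis)
```

The bytes alone do not tell the server which platform produced them, which is why rejecting the extension area outright is the safer policy.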


2. There are many mapping tables between Unicode and the legacy Japanese charsets.

For example, even within the Microsoft world, you can find differences
between Shift_JIS -> Unicode and CP932 -> Unicode:
  http://www.asahi-net.or.jp/~ez3k-msym/charsets/jis2ucs.htm
The mapping table differs from one processing engine (typically a
library) to another. There are several libraries (iconv, etc.) for converting
legacy charset <--> Unicode, and they typically differ from each other.


3. Unicode CJK characters are unified.

This issue typically occurs when entering people's names and/or place
names.
When Unicode was designed, some characters that look "similar" were unified
into one character (in the "CJK Unified Ideographs" area), and the
additional ones were put into the "CJK Compatibility Ideographs" area.
This also causes a mapping issue: mapping tables simply convert
characters into "CJK Unified Ideographs" characters and do not use the "CJK
Compatibility Ideographs" characters.
Even if the end user has a way to input the correct character in their UI in a
legacy character set, there are cases where it is mapped to a different
character on the server side.
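
The loss of a compatibility ideograph can be illustrated with Python's unicodedata module. Note that this sketch uses Unicode normalization, not a legacy-charset mapping table, as a stand-in: the mechanism is different, but the effect (a compatibility ideograph silently replaced by its unified counterpart) is the same kind of problem. The function name fold_compat is made up for this example.

```python
import unicodedata


def fold_compat(text):
    """Apply NFC, which replaces CJK compatibility ideographs
    with their unified counterparts (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


compat = "\ufa10"             # a variant of "tsuka" used in some names
unified = fold_compat(compat)

if __name__ == "__main__":
    print(unicodedata.name(compat), "-> U+%04X" % ord(unified))
    print("distinction kept:", compat == unified)
```

Once the string has passed through such a step, the server can no longer tell which form the user originally typed.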

But the actual problem is that in most cases the end user does not have a
proper way to input such special characters. So users enter a "simplified
character" or a "similar character" as a compromise when they hit this
restriction.

--
Yoshihiro Satou
y_satou@example.com


On Sun, 22 May 2005 12:01:35 +1000 (EST), Jim Breen <Jim.Breen@example.com> said:
 > 
 > Evan Monroig <evan.ubuntu@example.com> wrote:
>>> The
>>> generally accepted idea is that since Shift_JIS was created by
>>> Japanese people for Japanese people, then it handles the Japanese
>>> language better than UTF-8, which is not true (^_^)
 > 
 > I'll say it's not true. Shift_JIS was created by Microsoft, as Ken Lunde
 > wrote in his UJIP book 12 years ago.
 > 
 > Jim
 > 
 > -- 
 > Jim Breen                                http://www.csse.monash.edu.au/~jwb/
 > Computer Science & Software Engineering,                Tel: +61 3 9905 9554
 > Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
 > (Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学
