Mailing List Archive
Re: [tlug] OT-Japanese in PHP
- Date: Mon, 23 May 2005 17:32:22 +0900
- From: Yoshihiro Sato <y_satou@example.com>
- Subject: Re: [tlug] OT-Japanese in PHP
- References: <200505220201.j4M21ZnW002503@example.com><EX-MAIL-SHI-01VDBcs00000108@example.com><87wtpqsd0b.fsf@example.com>
- Organization: Amazon.co.jp
- User-agent: Wanderlust/2.12.0 (Your Wildest Dreams) SEMI/1.14.6 (Maruoka)FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.3(i386-redhat-linux-gnu) MULE/5.0 (SAKAKI)

On Mon, 23 May 2005 14:54:44 +0900,
"Stephen J. Turnbull" <stephen@example.com> said:

> >>>>>> "Yoshihiro" == Yoshihiro Sato <y_satou@example.com> writes:
>
>     Yoshihiro> On server's side, especially if it's web application, I
>     Yoshihiro> recommend to handle data like this: reject all
>     Yoshihiro> characters which are not in JISX0208, and reject all
>     Yoshihiro> half-width katakana.
>
> That's not acceptable if you're a client-oriented operation.  And it's
> quite unnecessary.  There are no disagreements about full-width <->
> half-width mappings, and there's no good reason to reject anything in
> JIS X 0212 (or JIS X 0213).

We need to be clear about the end user's environment when designing
Japanese handling. I assumed it is a web browser, with no particular
OS or version specified (because this is a PHP thread). That's why I
recommended rejecting characters which are not in JISX0208.

Also, accepting JISX0213 characters will be a problem on the backend
if the backend is not designed specifically to handle JISX0213.
Here is a simple example: JISX0213 includes circled numbers #1 to #50
(and Unicode does not define the circled numbers #21 to #50, as far
as I know). You can find a summary of the characters which are
defined in JISX0213 but not in Unicode here:

http://www.m17n.org/m17n2000_all_but_registration/proceedings/kawabata/non-unicode.gif

>     Yoshihiro> 3. Unicode CJK characters are unified.
>
> This is not a problem unless you're doing multilingual work (multiple
> languages in the same document).  Mere I18N/L10N is complex, of
> course, but Han unification is not the problem there.

I agree that if the target of the system is M17N and not L10N,
Unicode is the best solution.

>     Yoshihiro> This issue is typically happened when entering people's
>     Yoshihiro> name and/or location name.
>
> But this isn't a problem of Unicode, which fully handles the entire
> JIS X 0208 and JIS X 0212 character sets.
> The problem is that the Japanese standards bodies have spent at least
> 100 years prescribing rather than describing the language, and so a
> welter of non-conforming industry standards have grown up.

I agree, this is not a Unicode problem. Whichever JISX???? we use,
there are characters that are permitted in people's names but cannot
be stored as data.

>     Yoshihiro> Even if end user has method to input correct character
>     Yoshihiro> on their UI in legacy character set, but there's a case
>     Yoshihiro> it's mapped to different character on server's side.
>
> So fix the server!  It's not like correcting the mapping tables is
> hard.  Eg, in XEmacs 21.5 you just do a wget of the Unicode Consortium
> or other registry's tables into etc/unicode, and type M-x
> load-unicode-tables RET.  XEmacs has _other_ _serious_ problems in
> Unicode handling, but the mapping tables have been available since
> 2002 or so, and they only took that long because there wasn't really a
> use for them before updating the Windows port to use Windows NT
> Unicode APIs.

I assumed this thread is about PHP, and that the clients are web
browsers on various OSes and versions. I don't think PHP would
connect to XEmacs in batch mode and load the correct mapping table on
each request to handle the entered strings.

>     Yoshihiro> But actual problem is, most of the case end user does
>     Yoshihiro> not have proper way to input such special characters.
>
> I simply don't believe that, except maybe for keitai platforms.  Both
> Windows and the Mac provide palette-based input methods, and such are
> available for any free software OS.  Sure, you have to find the
> character the first time, but after that you record it in your
> dictionary.  This is a user education problem, not a Unicode issue.

Still, Unicode does not cover a bunch of characters which are used in
people's names, location names, etc.
Here is an example:

http://homepage2.nifty.com/Gat_Tin/kanji/itaiji.htm

That's the reason why there are projects / activities to support more
characters. For example, Mojikyo:

http://www.mojikyo.org/

>     Yoshihiro> And users input "simplified character" or "similar
>     Yoshihiro> character" as compromised solution when they meet
>     Yoshihiro> restriction.
>
> Shameful.  The first thing that should be done with technology is to
> allow people to write their own names and addresses correctly!

Agree :)

But we still do not have a standard way to handle Japanese characters
(or say, characters which are used in Japanese), especially if the
characters are not in JISX0208.

Here are some interesting examples of how the Kashiwa city government
handles (or handled) people's names in census registration:

http://www.horagai.com/www/moji/int/kasiwa.htm

and how the "Japan Basic Resident Register Network" handles
characters:

http://www.horagai.com/www/moji/juki.htm

Unfortunately both pages are written in Japanese, so here is a
summary:

* Unicode cannot cover all the characters which are used in people's
  names, so the Kashiwa city government handles people's names by
  creating external characters, and shares them internally over the
  LAN using XKP (http://www.est.co.jp/xkp/xkp/index.html).

* Because the "Japan Basic Resident Register Network" consolidates
  the census registrations of all cities (and other provinces), and
  each province defines its own external characters, this network
  handles characters using its own charset (戸籍統一文字コード ...
  which is based on Unicode, but with many additional characters
  mapped to the non-CJK area), and its own font (統一文字フォント,
  aka 住基ネット明朝).

> I'll grant that in practice, fixing an existing installation can be
> difficult, because you may have to rebuild from the ground up with new
> server software, add-on modules, and the like.
> But new installations should take advantage of Unicode technology
> which allows a unified treatment of all these problems, and software
> (including font) sharing.  And this should be a criterion (not
> necessarily overriding, of course) for any upgrade.

It depends on the target of the system.

If the service is provided to end users via web/http, with basically
no restriction on OS and/or environment, the safest way at this point
(I don't mean in the future) is to prevent Japanese characters which
are not in JISX0208 from being entered.

If the system allows us to restrict the client environment (just like
the "Japan Basic Resident Register Network"), there are several ways
to handle these characters.

Yoshihiro Satou
y_satou@example.com
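
The server-side filter recommended in this message (reject anything
outside JISX0208, and reject half-width katakana) could be sketched
roughly as follows. This is only an illustration under stated
assumptions, written in Python rather than PHP: the function name
is_jisx0208_safe is hypothetical, and Python's strict "iso2022_jp"
codec is used as an approximation of "JISX0208 only" (it also admits
ASCII and JIS X 0201 Roman).

```python
def is_jisx0208_safe(text: str) -> bool:
    """Return True if text passes the filter sketched in this message.

    Hypothetical helper for illustration; not part of any PHP or
    Python library.
    """
    # Reject half-width katakana and the related half-width punctuation
    # (U+FF61..U+FF9F) explicitly, independent of the codec's behaviour.
    if any("\uff61" <= ch <= "\uff9f" for ch in text):
        return False
    try:
        # Assumption: the strict "iso2022_jp" codec stands in for
        # "JISX0208 only". It covers ASCII, JIS X 0201 Roman and
        # JIS X 0208, and raises on JIS X 0212 / JIS X 0213 characters.
        text.encode("iso2022_jp")
    except UnicodeEncodeError:
        return False
    return True
```

A caller would run this on each submitted form field and refuse the
request (or ask the user to re-enter the value) when it returns False.
Whether to reject or to convert half-width katakana instead is exactly
the design question debated above; this sketch only shows the
"reject" side.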
- Follow-Ups:
- Re: [tlug] OT-Japanese in PHP
- From: Stephen J. Turnbull
- References:
- Re: [tlug-digest] Re: [tlug] OT-Japanese in PHP
- From: Jim Breen
- Re: [tlug] OT-Japanese in PHP
- From: Yoshihiro Sato
- Re: [tlug] OT-Japanese in PHP
- From: Stephen J. Turnbull