Re: tlug: Re: Japanese input

To: tlug@example.com
Subject: Re: tlug: Re: Japanese input
From: "Stephen J. Turnbull" <turnbull@example.com>
Date: Thu, 11 Jun 1998 12:48:00 +0900 (JST)
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <199806102220.HAA32470@example.com>
References: <199806101220.MAA00828@example.com> <199806102220.HAA32470@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug@example.com

Three asides:

One: Gaspar says "this thread is about input methods".  Not for me;
it's about "text processing."  Yes, if it were about input methods,
what you're talking about is doable, and I'd join you.  But I don't
think it can be about input methods only.  That's why I'm working at
the breadboard prototype level in XEmacs.  But given Gaspar's premise,
I agree with nearly all of what he's written in this thread.

Two: I just finished skimming the X/Open Technical Study "Universal
Multiple-Octet Coded Character Set Coexistence and Migration".  It
contains the sentence "Microsoft can make some of these design
decisions because application portability is not a high priority in
NT."  My minimal knowledge suggests that Yudit is better than NT in
this respect (e.g., UTF-8 is used externally), but not sufficiently
so.  Only Gaspar is potentially competent to judge, though, at this
point.

And this "Technical Study" is really just a < 50-page pamphlet of
semi-random thoughts.

Hope you have got debuggers loaded,
hope you are quite prepared to crash.
Ambiguous protocol codings
Not e'en RSA could invert the hash.

Don't go code tonight, you'll never get it right.
There's a bad bug in the spec.

(Copyright 1998 Yaseppochi-gumi.  Gomen, ne, John Fogerty.  What'm I
saying, gomen, ne, Steve, your grey hairs are showing!)

Three:

>>>>> "Cliff" == Cliff Miller <cliff@example.com> writes:

    Cliff> For all the complexity, there are real advantages in
    Cliff> kanji. Japan has the lowest illiteracy rate in the
    Cliff> world. Of course, the educational system has a lot to do
    Cliff> with it, but not everything.

The statistical system and culture have a lot to do with it, too.
Japanese official unemployment rates are estimated by most people who
study the issue to be about 1/3 of what they would be if counted by
the US Bureau of Labor Statistics (BLS) standard.  Culturally, most
U.S. economists consider the BLS standard to be acceptable only
because the amount of underemployment in the U.S. is small.  That is
due to the extreme flexibility of the U.S. employment system.
Underemployment in Japan is rampant (one Japanese iconoclast I know
says that you can measure underemployment in Japan by counting the
number of males in pachinko parlors at 1pm on a weekday, but "it would
be an understatement 'cause some people play the horses or mahjongg
instead").

The World Health Organization has never been allowed to test a random
sample of Japanese for literacy.  In fact, nobody but Monbusho has
ever been allowed to do so.  This is like allowing the fox to keep
chicken mortality statistics.

This is not to say that Japan's illiteracy rate isn't the lowest in
the world.  But I don't know, and neither does anyone else, least of
all Monbusho.

Sigh.

>>>>> "Matt" == Matthew J Francis <asbel@example.com> writes:

    >> Brother, you are in for some unpleasant surprises :-) Check out
    >> locale (5), o-negai-shimasu.  No silver bullet here.

    Matt> Hmm, but that seems to relate mostly to the 'traditional'

Yes.  Portability and interoperability must be carefully considered.

    Matt> charset support. Yudit is Unicode all the way through, so
    Matt> character mapping and input can be uniquely specified
    Matt> without "knowing" anything about locales.  If you meant
    Matt> something else I'm not seeing, please to enlighten...

You can use Unicode/UCS-[24] internally if you want.  This simplifies
a lot of things.  However, a monolingual Chinese will find a Japanese
input method useless.  So input cannot be `uniquely specified without
"knowing" anything about locales.'

It is arguable (I don't agree, but many pros do) that _every_
multilingual text should specify locale internally.  I.e., a text
document stored in UTF-8 does not contain Japanese, Chinese, German,
and Russian; it contains a UTF-8 string.  Such experts will find use
of Yudit widgets unacceptable in principle.
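
A minimal sketch of what that argument is getting at, in Python.  The
TaggedRun type and its "ja"/"zh" tags are hypothetical, invented for
illustration; nothing here is taken from Yudit.  It shows only that
flattening tagged text to UTF-8 keeps the code points and drops the
language information:

```python
from dataclasses import dataclass

# U+76F4 is a unified Han character used in both Japanese and Chinese.
HAN = "\u76f4"


@dataclass
class TaggedRun:
    """Hypothetical container: a run of text plus a language tag."""
    lang: str  # e.g. "ja" or "zh"; the tag vocabulary is an assumption
    text: str


def utf8_bytes(runs):
    """Flatten tagged runs to UTF-8, which discards the tags."""
    return "".join(run.text for run in runs).encode("utf-8")


japanese = [TaggedRun("ja", HAN)]
chinese = [TaggedRun("zh", HAN)]

# The encoded bytes are identical; only the tagged form differs.
same_bytes = utf8_bytes(japanese) == utf8_bytes(chinese)
same_runs = japanese == chinese
```

Here same_bytes is True and same_runs is False: once the text is a
bare UTF-8 string, nothing in it says which language was meant.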

Also, some such information _absolutely must_ be included, for
bidirectional languages (Semitic, mostly, but also vertical Japanese,
most probably).  This is emphatically not a "locale", but the handling
will have some similar elements.
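
For the bidirectional case, a rough Python sketch using the standard
unicodedata module.  Each character carries a direction class, but
the base direction of a whole paragraph is not stored anywhere in the
text itself.  base_direction below is a simplified
first-strong-character guess, an assumption for illustration and not
a full bidi implementation:

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def base_direction(text):
    """Guess paragraph direction from the first strong character.

    Simplified heuristic: falls back to left-to-right when the text
    has no strongly directional character at all.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"


# Latin "a", Hebrew alef (U+05D0), Arabic alef (U+0627), digit "1"
classes = bidi_classes("a\u05d0\u06271")
```

A guess like this can be wrong (a Hebrew paragraph that happens to
open with a Latin product name, say), which is why the direction has
to be recorded somewhere outside the characters themselves.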

Gaspar, doesn't Yudit (like everything else) punt on this?

    Matt> [Input servers]

    >> Nope.  Symptomatic of the fundamental fact that "tastes
    >> differ."  In any case, that was an example to demonstrate
    >> feasibility.  I think it would be insane to try to overload a
    >> Japanese server with algorithms for Devanagari or Arabic.
    >> Multiple servers.

    Matt> Silly and unnecessary to want to do it all at once; So,
    Matt> throw in dynamic loading of conversion sets, and I don't see
    Matt> how it could be slower or more hungry than the existing
    Matt> one-locale servers.

And vice versa.

We have a tool for doing what you're talking about already, although
it's much more general: inetd.  (Woof!  Betcha never thought of that!)
It might be feasible and even efficient to make a single server with
multiple conversion algorithms, but there is no need to do so.
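
To sketch the inetd idea in Python: inetd starts a program on demand
and connects the client socket to its standard input and output, so
each conversion server can be a plain line filter.  Everything named
below (service names, paths, the toy romaji table) is made up for
illustration and does not describe any existing server:

```python
import io

# Hypothetical /etc/inetd.conf entries, one small server per language
# (service names and paths are assumptions):
#   kanaconv  stream tcp nowait nobody /usr/local/libexec/kanaconv  kanaconv
#   hanziconv stream tcp nowait nobody /usr/local/libexec/hanziconv hanziconv

# Toy romaji-to-hiragana table; a real server would use a dictionary.
TABLE = {
    "ka": "\u304b",
    "na": "\u306a",
    "ni": "\u306b",
    "ho": "\u307b",
    "n": "\u3093",
}


def convert(word):
    """Greedy longest-match conversion; unknown input passes through."""
    out, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


def serve(infile, outfile):
    """Answer one request per line read from infile."""
    for line in infile:
        outfile.write(convert(line.strip()) + "\n")
        outfile.flush()


# Under inetd this would be serve(sys.stdin, sys.stdout); here the
# streams are in-memory stand-ins for the client connection.
demo_out = io.StringIO()
serve(io.StringIO("kana\nnihon\n"), demo_out)
```

A server for another language would be a separate program with its
own table and its own inetd.conf line, loaded only when a client
asks for it.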

Also, on a single-user workstation running the conversion server
locally, the vast majority of users will run only one.  In a multiple
workstation environment, it is sensible to have a server host running
all the conversion servers over the network.  Only a small number of
multilingual experts will need to run more than one conversion server, 
and for them, getting the best proprietary servers is probably far more 
important than avoiding purchase of 4MB of RAM per server.

jon@example.com, what do you say?

    Matt> Colour me unconvinced, but for fairness I will go and have a
    Matt> *really* good look at all the code (probably this weekend)
    Matt> before putting my head more firmly on the chopping block.

_All_ the code is impossible.  Wnn6 is proprietary, ATOK is
proprietary, ....  Even limiting yourself to the open source, my hat's 
off to you as a speed reader.

    Matt> Code does of course have an immense memetic (informational)
    Matt> reuse value as well as genetic (implementational). "Code
    Matt> reuse" can be effected without necessarily actually using
    Matt> code.

Well, OK, I can see that.  Study the code as an example of what not to 
do.  :-)

    Matt> I know the limits of my knowledge - although sometimes to
    Matt> start coding is a very good way to find that out. I am
    Matt> actively researching, although I can't afford to buy much
    Matt> treeware at the moment; pointers to any relevant online
    Matt> documentation would of course be greatly appreciated...

ISO and Unicode Consortium standards are expensive and not published
online, unfortunately.  I've never thought about looking for the JIS
versions; I bet they're expensive too (besides being in Japanese).
You're welcome to come to Tsukuba and study my copies any time, but I
have gotten very stiff-necked about copyright since understanding the
GPL.  I've thought about trying to find a way to serve the document to
one user at a time, but the terminal would need to be under my
control....

    Matt> Yudit already *has* this. Even Gaspar's code there as it is
    Matt> now has both raw XLib, Qt, and Motif versions of the (entry
    Matt> and edit) widgets; because it's quite cleanly written, it
    Matt> should stand porting to other toolkits with little fuss.

You're missing the point.  Entry and edit widgets?  Great.  How about
buttons, labels, displays, panners, menus, titlebars, dialogs, ...?

And many important applications (vi, emacs to name two) don't use
widgets (at that level, anyway) at all.

    Matt> And who's to say if I find this fun, others can't? =^^=

Porting the dialog widgets to use Gaspar's entry and edit widgets is
straightforward but probably tedious.  Winkling out _all_ the places
where text is manipulated is why Cobol programmers are making
$500,000/year to do Year 2000 maintenance.  (s/text/dates/ of course.)

Why aren't you learning Cobol?  :-)  When porting means implementation
it's fun; when it means maintenance it's drudgery.

    Matt> Not a silver bullet, but at least a loaded gun to hand to
    Matt> the developers.  90% or more of text display in typical
    Matt> programs is done with standard widgets; Entry, Edit, Menu,

Absolutely true.

    Matt> Label. Replace them with ones that understand
    Matt> internationalised input and display properly, and you're 90%
    Matt> of the way there.

Oh, brother.  Your arithmetic is right, but your model is wrong.

Fred Brooks.  _The Mythical Man-Month_.  Get the recent updated
edition, it has the famous "no silver bullet" essay in it.  Read it.
Then we can talk.  :-)

Ed Yourdon's (_Decline and Fall of the American Programmer_ et seq)
stuff bears peripherally on this.  Quality control, quality control,
quality control.

