
Re: tlug: Two Qs re translation project



Frank Bennett wrote:

> Looks like it's time for Frank to go back to school.
>
> Duh, can UTF-8 be interpreted correctly by browsers in common
> circulation, and if so (or if it's on a rising wave) what is the best
> reference text on it?

All the 4.x series of popular browsers can do it (even the 3.x version of
Windows Nav could do it with a hack). Early revs of the Japanese Netscape
versions, especially the Windows ones, had their Unicode font set to English
Arial by default, so newbie JP users had to set the font manually to a
Japanese font. (The Unix versions get it right because they use a
pseudo-font for Unicode.) I use UTF-8 for the Japanese on my personal web
pages, if you want to test your browser support.

It's definitely on the rising wave and the future. Most new protocols such
as XML, etc., make UTF-8 support _mandatory_ (EUC-JP in XML is an "option").
So if you migrate to XML or XHTML (now a W3C Recommendation) in the future,
you can count on every app supporting UTF-8, even English apps.

The best docs are The Unicode Standard... but you can find a lot of free
documentation on the web about it, because it's the encoding of choice for
BeOS and many other things these days.

If you go to UTF-8, you get the added benefit that you'll also be able to
correctly search Latin-1 text, which most commercial web pages use (even
all-English pages often use "degree" marks and accented vowels, as in
résumé, Pokémon, saké).

Not to mention an easy upgrade path to allow it to do Chinese and Korean
indexing as well.
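
To make the "one encoding for everything" point concrete, here is a small
Python sketch. The sample strings and the helper name fits() are made up for
illustration. It checks which samples can be encoded as UTF-8 and which can be
encoded as EUC-JP.

```python
# Sketch: UTF-8 can carry Japanese, Latin-1 accents, Chinese, and Korean
# in one byte stream; a Japanese-only legacy encoding cannot.
samples = {
    "japanese": "日本語",
    "latin1": "résumé 25°",
    "chinese": "汉语",
    "korean": "한국어",
}


def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text is encodable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Every sample fits in UTF-8 and survives a round trip unchanged.
utf8_ok = {name: fits(text, "utf-8") for name, text in samples.items()}
round_trip_ok = all(
    text.encode("utf-8").decode("utf-8") == text for text in samples.values()
)

# EUC-JP covers the Japanese sample but has no Hangul at all.
eucjp_ok = {name: fits(text, "euc_jp") for name, text in samples.items()}
```

An indexer that normalizes everything to UTF-8 before tokenizing only has to
deal with one byte format, whatever the language of the page.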

> Also ... if we move to a new encoding, we'll need a conversion tool.
> Is there a Unix filter that can munge one of the common Jse
> encodings into UTF-8?

glibc 2.1's "iconv" can do it, so can Plan 9's "tcs" (a Unix port is
available) and Java's "native2ascii" (in a roundabout manner, though).
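
With iconv the filter is typically invoked as "iconv -f EUC-JP -t UTF-8".
The same idea can be sketched in a few lines of Python. The function name
convert_stream() is invented for this example, and the script assumes clean
input: a malformed byte sequence raises UnicodeDecodeError instead of being
repaired.

```python
import io


def convert_stream(src, dst, source_encoding: str = "euc_jp") -> None:
    """Read legacy-encoded bytes from src and write them to dst as UTF-8.

    src and dst are binary file objects. To use this as a Unix filter,
    pass sys.stdin.buffer and sys.stdout.buffer.
    """
    text = src.read().decode(source_encoding)  # strict: fails on bad input
    dst.write(text.encode("utf-8"))


# Demonstration with in-memory streams instead of real files.
euc_input = io.BytesIO("日本語のテキスト".encode("euc_jp"))
utf8_output = io.BytesIO()
convert_stream(euc_input, utf8_output)

sjis_input = io.BytesIO("日本語のテキスト".encode("shift_jis"))
sjis_output = io.BytesIO()
convert_stream(sjis_input, sjis_output, "shift_jis")
```

Reading the whole input at once keeps the sketch short and avoids splitting a
multi-byte character across chunk boundaries; a large archive would want an
incremental decoder instead.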

But if you allow me to toot my own horn, "ucconv", a sample app with the
"fugu" library, works great on web data (better than iconv) and has the
following features for real Japanese WWW text that iconv doesn't have:
an intelligent error-recovery scheme for Japanese that's broken (edited with
broken or ASCII-only editors, which is very common in the real world) and
graceful fallback into HTML/SGML encodings. (You can both translate character
references like &copy; and dec/hex NCRs like &#x4E00; into straight UTF-8 and
convert back into ASCII-editor-safe NCRs.) It also handles the "Windows"
JISx208 extensions (the NEC and IBM extensions), correctly resolves the
NEC/IBM extension ambiguity in the MS-Windows extended JIS set, and handles
the extended Mac Japanese set. It compiles on Windows (NT with non-free MS
tools or 95/98/NT with Cygwin) and BeOS too, so if you have content guys who
are not fortunate enough to be running an OSS-based OS, they can still use
ucconv natively on their systems.
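
For readers who have not met NCRs, here is a rough Python sketch of the
round trip described above. This is not ucconv's code or algorithm. The
function names are invented for the example, and the "lenient" decoder is
only the simplest possible stand-in for real error recovery: it replaces bad
bytes with U+FFFD rather than trying to repair them.

```python
import html


def refs_to_utf8(text: str) -> bytes:
    """Expand named references and dec/hex NCRs, then encode as UTF-8.

    Caveat: html.unescape also expands &lt; and &amp;, so this is only
    safe on text content, not on raw markup.
    """
    return html.unescape(text).encode("utf-8")


def to_ascii_safe(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def decode_leniently(data: bytes, encoding: str = "euc_jp") -> str:
    """Decode, substituting U+FFFD for bytes that are not valid."""
    return data.decode(encoding, errors="replace")


expanded = refs_to_utf8("&copy; &#x4E00; &#19968;")
ascii_safe = to_ascii_safe("© 一")
recovered = decode_leniently(b"abc\xff")
```

Here expanded holds the UTF-8 bytes for "© 一 一" (hex 4E00 and decimal 19968
are the same code point), and ascii_safe is "&#169; &#19968;", which any
ASCII-only editor can handle without damage.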

It's technically "alpha" software, but the "alpha" means I haven't put all
the features in the API that I want in it yet-- the CJK and UTF/Unicode
converters have been VERY well tested with lotsa real-world Chinese/Japanese
(and some Korean) data and are complete, as is the ucconv sample app. GPL
License, so no warranty/guarantee though. I'll throw in e-mail support,
though. :)

<URL:ftp://ftp.turbolinux.co.jp/pub/fugu/>

If you're working with well-formed EUC-JP and don't need the extra HTML/SGML
translation/conversion or other filter/features and the content
generation/handling system is Linux, you should use iconv that comes with
glibc 2.1. No need for a new tool when the one that's on your OS does the
job.


--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
