Mailing List Archive



[tlug] Re: Updating iconv tables



A few days ago I wrote:

2008/6/11 Jim Breen <jimbreen@example.com>:
> I have struck a problem with missing mappings in
> iconv in several Linux distros. The problem has
> arisen initially with ㈱ (i.e. (株)), but is sure to crop
> up with others.
[...]
> I'll send a copy of this to bug-gnu-libiconv@example.com
> but is that enough?

I received the following reply, which also went to
bug-gnu-libiconv@example.com so I think I can relay it.
=========================================================
from	Bruno Haible <bruno@example.com>
to	Jim Breen <jimbreen@example.com>,bug-gnu-libiconv@example.com
date	12 June 2008 10:42
subject	Re: [bug-gnu-libiconv] Updating iconv tables
	
Hi,

I'm not sure I understand it all right.

> When people have
> gone to convert the EDICT file to UTF8 for other
> systems, the iconv utility simply dies on that character

In summary, you are saying that you have a particular character in EUC-JP,
that the iconv conversion from EUC-JP to UTF-8 does not grok?

Then the character is not EUC-JP.

I'm not sure which character you are talking about, because your mail
had an encoding specification of ISO-2022-JP, which usually means
ISO-2022-JP-2, but that particular character was invalid in ISO-2022-JP-2
(it was encoded as "ESC $ B - j"), the other character in that line was
U+682A, and you were talking about U+3231.

> The problem, I conclude, is with the compiled-in tables
> in iconv in the Linux distros. It seems Sun has gone to
> the trouble of keeping theirs up-to-date, but the standard
> distros haven't.

You have a misconception of what EUC-JP is. EUC-JP is a character encoding
scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These
are standards issued by Japanese authorities, and carved in stone. Anyone
who thinks that EUC-JP tables have to be "kept up-to-date", is asking for
deviation from standards, and is asking for interoperability problems!

The interoperability problem that you encountered is *precisely* due to
your vendor having added "extensions" to their EUC-JP fonts, and you
expect that everyone else has the same extensions in their fonts and tables!
Take a look at
  http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
to see how many variants of EUC-JP already exist!

Bruno
=============================================================

Needless to say, I couldn't let that pass. "Carved in stone" indeed!
Vendor extensions!

Anyway, my response:

================================================================
Hi Bruno,

Great to hear from you.

2008/6/12 Bruno Haible <bruno@example.com>:

> I'm not sure I understand it all right.
>
>> When people have
>> gone to convert the EDICT file to UTF8 for other
>> systems, the iconv utility simply dies on that character
>
> In summary, you are saying that you have a particular character in EUC-JP,
> that the iconv conversion from EUC-JP to UTF-8 does not grok?
>
> Then the character is not EUC-JP.

Wrong. I'll explain more below.

> I'm not sure which character you are talking about, because your mail
> had an encoding specification of ISO-2022-JP, which usually means
> ISO-2022-JP-2, but that particular character was invalid in ISO-2022-JP-2
> (it was encoded as "ESC $ B - j"), the other character in that line was
> U+682A, and you were talking about U+3231.

This is a bit of a side issue. My email was indeed in ISO-2022-JP, since
I have gmail set to use the default for the language, and my email
contained Japanese. The code-point in question converts and displays
correctly in compliant mailers. Nothing illegal about it.

>> The problem, I conclude, is with the compiled-in tables
>> in iconv in the Linux distros. It seems Sun has gone to
>> the trouble of keeping theirs up-to-date, but the standard
>> distros haven't.
>
> You have a misconception of what EUC-JP is. EUC-JP is a character encoding
> scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These
> are standards issued by Japanese authorities, and carved in stone. Anyone
> who thinks that EUC-JP tables have to be "kept up-to-date", is asking for
> deviation from standards, and is asking for interoperability problems!

You are out-of-date there. EUC-JP also includes JIS X 0213, which was released
in 2000 and updated in 2004. The codepoint I raised arrived in JIS X 0213. You
can think of JIS X 0213 as an enhancement/replacement for JIS X 0208. It added
a heap of additional characters, *all* of which have been included in Unicode,
and all of which have EUC codings, since EUC-JP is simply a transformation
of the ku-ten codes in the Japanese standards. Of course EUC-JP tables need to
be kept up-to-date.

See: http://en.wikipedia.org/wiki/JIS_X_0213 for an overview.

> The interoperability problem that you encountered is *precisely* due to
> your vendor having added "extensions" to their EUC-JP fonts, and you
> expect that everyone else has the same extensions in their fonts and tables!
> Take a look at
>   http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
> to see how many variants of EUC-JP already exist!

Sadly your WWW page omits any mention of JIS X 0213. In other words it is
lacking all the characters added to the standard Japanese codings in the last
decade. Sun has simply kept up with the developments in Japanese
coding. These are *not* vendor extensions.

In case you think I am talking through my hat, I must point out that I am
one of only a handful of non-Japanese people who have participated in the
development of the Japanese standards. You will find my name among the
respondents at the back of JIS X 0208-1997, along with people like Ken Lunde
and Martin Duerst. (I assume you have a copy.) Ask Ken if he has heard of me.

I am happy to work with you in getting the full set of current Japanese
codes into iconv. As it stands at the moment, the GNU issue does not
adequately handle all the standard Japanese codes.

Best wishes
===============================================================
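
For anyone who wants to check the arithmetic behind "EUC-JP is simply a
transformation of the ku-ten codes", here is a minimal Python sketch. It
takes the two bytes Bruno quoted after "ESC $ B" ("-" and "j"), turns them
into a ku-ten pair, and then into the corresponding EUC-JP byte pair. The
helper names are my own invention for illustration and are not part of
iconv or any library; the only assumption about Python is that its standard
"euc_jp" and "euc_jisx0213" codecs are available.

```python
from typing import Optional, Tuple


def iso2022_to_kuten(two_bytes: bytes) -> Tuple[int, int]:
    """Convert the two 7-bit bytes that follow ESC $ B into (ku, ten)."""
    # Each byte is the row/cell number offset by 0x20.
    return two_bytes[0] - 0x20, two_bytes[1] - 0x20


def kuten_to_euc(ku: int, ten: int) -> bytes:
    """Convert a primary-plane (ku, ten) pair into two EUC-JP bytes."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    # EUC-JP stores the same row/cell numbers offset by 0xA0.
    return bytes([0xA0 + ku, 0xA0 + ten])


def try_decode(data: bytes, codec: str) -> Optional[str]:
    """Return the decoded text, or None if the codec has no mapping."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    ku, ten = iso2022_to_kuten(b"-j")
    euc = kuten_to_euc(ku, ten)
    print("ku-ten: %d-%d  EUC-JP bytes: %s" % (ku, ten, euc.hex()))

    # Compare a JIS X 0208/0212-only table with a JIS X 0213-aware one.
    for codec in ("euc_jp", "euc_jisx0213"):
        text = try_decode(euc, codec)
        if text is None:
            print("%s: no mapping" % codec)
        else:
            print("%s: U+%04X" % (codec, ord(text)))
```

The bytes work out to ku-ten 13-74, i.e. EUC-JP 0xAD 0xEA. Row 13 is empty
in JIS X 0208, so a table that stops there has nothing to map; a
JIS X 0213-aware table should give U+3231. What any particular iconv build
does with the same two bytes will depend on the tables it was compiled with.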

I'll keep the TLUG list informed of developments (if any). I think it
would help a lot if one or two more people, e.g. Linux users in Japan,
chipped in with emails to bug-gnu-libiconv@example.com on this matter.

Cheers

Jim

-- 
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/

