
Re: [tlug] Generating Furigana in documents
On 2013-03-30 10:01 +0100 (Sat), Christian Horn wrote:
> Hm.. it seems to expect 'JIS x0208' Kanji characters that I am
> unable to produce.
Actually, JIS X 0208 is a character set; you also need to worry about
the encoding of that character set. (A character set is merely a list
of characters, so 僕 might for example be number 1728. How that number
1728 is encoded in a stream of bytes can vary. Note that Unicode is a
character set and UTF-8, UTF-16, et al. are encodings of Unicode.)
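To make the character set versus encoding distinction concrete, here is a minimal sketch that dumps the bytes of the same character under several encodings. It assumes iconv and od are available and that the script itself is saved as UTF-8; the helper name bytes_in is made up for this illustration, and the encoding names are the spellings glibc iconv accepts.

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): print, as hex,
# the bytes a UTF-8 string occupies once converted to another encoding.
# Usage: bytes_in ENCODING STRING
bytes_in() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# One character, several different byte sequences. Other iconv builds
# may spell some of these encoding names differently.
for enc in UTF-8 UTF-16BE SHIFT_JIS EUC-JP ISO-2022-JP; do
    printf '%-12s ' "$enc"
    bytes_in "$enc" '僕'
done
# UTF-8 yields e58395; UTF-16BE yields 50d5, i.e. the code point U+50D5.
```

The Shift-JIS, EUC-JP and ISO-2022-JP lines all encode the same JIS X 0208 position for 僕, yet none of the byte sequences match each other, which is why a tool has to be told (or has to guess) which encoding it is reading.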
> ./kakasi -i jis -H;
'jis' is the term they use for ISO-2022-JP encoding. I don't really
recommend using it; it's not that common outside of e-mail messages and
it's not very compact. Ideally you want to be using UTF-8 everywhere, of
course, but kakasi doesn't appear to support Unicode, unfortunately.
Typically I go with Shift-JIS when I can't use Unicode, and this seemed
to work well for me when I tried it just now.
Using vim, I created a file with your sample text encoded as UTF-8
(":set encoding=utf-8"), and this translated fine for me:
$ cat input-file
私は馬鹿です
$ iconv -f utf8 -t sjis input-file | kakasi -JK | iconv -f sjis -t utf8
ワタシはバカです
I used a file for input to be certain what the input encoding was,
but it also works fine when typed directly into a UTF-8 xterm:
$ echo '私は馬鹿です' \
| iconv -f utf8 -t sjis | kakasi -JK | iconv -f sjis -t utf8
ワタシはバカです
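If you end up typing that pipeline often, it could be wrapped in a small shell function. This is only a sketch: the name via_sjis is invented here, it assumes an iconv that knows the names UTF-8 and SHIFT_JIS, and it will fail on any input character that Shift-JIS cannot represent.

```shell
#!/bin/sh
# Hypothetical wrapper (the name via_sjis is an assumption): run a
# filter that expects Shift-JIS on UTF-8 input, converting the text to
# Shift-JIS on the way in and back to UTF-8 on the way out.
# Usage: via_sjis COMMAND [ARGS...]
via_sjis() {
    iconv -f UTF-8 -t SHIFT_JIS | "$@" | iconv -f SHIFT_JIS -t UTF-8
}

# Example, assuming kakasi is installed:
#   echo '私は馬鹿です' | via_sjis kakasi -JK
```

Passing the command as arguments keeps the wrapper usable with other kakasi options, or with any other tool that only understands Shift-JIS.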
Incidentally, nkf (Network Kanji Filter, also available as an Ubuntu
package) can be useful as the final filter when working out encoding
issues because it's usually fairly good at guessing the input encoding
and translating it to whatever output encoding you know you need.
cjs
--
Curt Sampson <cjs@example.com> +81 90 7737 2974
To iterate is human, to recurse divine.
- L Peter Deutsch