Mailing List Archive



RE: [tlug] Translating old to new kanji forms using tr





>From: David Riggs <dariggs@example.com>
>Reply-To: tlug@example.com
>To: tlug@example.com
>Subject: [tlug] Translating old to new kanji forms using tr
>Date: Tue, 28 Jun 2005 22:28:29 +0900
>
>I need to go back and forth between the old (旧字) and the modern kanji 
>forms. I have a list of corresponding old and new forms, and the standard 
>utility "tr" works fine for a test case:
>
>echo 仏 | tr 仏 佛
>
>Gives back 佛 just fine.
>
>But as soon as I get more than a handful of characters in the two 
>translation pair lists I get random answers that make no sense. I am 
>setting up a script that simply feeds "tr" the two long lists that I have 
>stuffed into two variables. But I have tried testing "tr" outside of the 
>script and get the same weird results. I am running very standard Debian 
>Sarge 3.1, starting up X with
>
>export XMODIFIERS="@im=kinput2" LC_CTYPE=ja_JP.UTF-8

In EUC encoding each kanji takes two bytes, so two consecutive kanji can 
produce a byte sequence like

A2 A4 B3 A4 (kanji 1, kanji 2)

Now, your list for tr might contain a kanji A2 A4 and a kanji B3 A4, but also 
a kanji A4 B3. If tr doesn't understand where one character ends and the next 
begins, it can mistake the A4 B3 that straddles the two characters for that 
third kanji and foul up the output.
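
To illustrate how byte-level translation can go wrong in the UTF-8 locale from the original post, here is a minimal sketch in Python (not the original poster's script). It assumes a tr that translates single bytes, with a later duplicate in the first set overriding an earlier one; the helper names and the two sample kanji pairs are made up for the illustration.

```python
def bytewise_tr(text, set1, set2, encoding="utf-8"):
    """Mimic a byte-oriented tr: map each byte of set1 to the byte at
    the same position in set2 (assumption: later duplicates win)."""
    src = set1.encode(encoding)
    dst = set2.encode(encoding)
    table = dict(zip(src, dst))  # byte value -> byte value
    out = bytes(table.get(b, b) for b in text.encode(encoding))
    # The result may no longer be valid text, so decode leniently.
    return out.decode(encoding, errors="replace")


def charwise_tr(text, set1, set2):
    """Translate whole characters instead of bytes."""
    return text.translate(str.maketrans(set1, set2))


OLD = "佛國"  # sample old forms (illustrative only)
NEW = "仏国"  # corresponding modern forms

# Old forms alone convert correctly, even byte by byte.
print(bytewise_tr("佛國", OLD, NEW))

# A modern 国 already present in the input shares bytes with the
# table entries, so the byte-wise version rewrites it as well.
print(bytewise_tr("国", OLD, NEW))
print(charwise_tr("国", OLD, NEW))
```

With only these two pairs the old forms convert correctly, but a modern 国 already in the text comes out as a different character in the byte-wise version. The longer the two lists, the more bytes they cover and the more often this kind of collision occurs.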

>Which seems to make all the usual utilities work just fine with kanji 
>inside the "konsole" (or plain old xterm as far as that goes).
>
>
>Anybody tried to do this kind of stuff with "tr"? Or have another solution?

I've done something like this once, some time ago. The solution I used was 
to write a script in Perl. You could just write loads of s/kanji1/kanji2/ 
substitutions (s/A/B/ means "replace A with B" in Perl), or you could stuff 
all the kanji into an associative array and match them using a regular 
expression. I think (not sure) that newer versions of Perl can match a whole 
character at a time when the text is decoded as UTF-8 (for example, \p{Han} 
matches one kanji), so if the above-mentioned character overlap is the cause 
of the problem, then it would be solved that way.
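
The associative-array idea can be sketched roughly as follows, written in Python rather than Perl. The table contents and the convert helper are hypothetical; a real table would be built from the full list of pairs. The sketch assumes each old form corresponds to exactly one new form, so the table can be inverted for the reverse direction.

```python
import re

# Hypothetical sample table: old form -> modern form.
OLD_TO_NEW = {"佛": "仏", "國": "国", "學": "学"}

# Inverting the table gives the reverse direction (assumes a 1:1 mapping).
NEW_TO_OLD = {new: old for old, new in OLD_TO_NEW.items()}


def convert(text, table):
    """Replace every character that is a key of `table` with its value.

    The pattern is a character class built from the table keys, so the
    regex engine matches whole characters, never partial byte sequences.
    """
    pattern = re.compile("[" + "".join(re.escape(ch) for ch in table) + "]")
    return pattern.sub(lambda m: table[m.group(0)], text)


print(convert("佛國の學校", OLD_TO_NEW))
print(convert("仏国の学校", NEW_TO_OLD))
```

Because the text is handled as decoded characters, characters that are not in the table pass through untouched, and the same function serves both directions.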

This is all from a little faded memory, but I hope this is somewhat helpful.

Danny.
