Mailing List Archive



Re: [tlug] Translating old to new kanji forms using tr



>>>>> "David" == David Riggs <dariggs@example.com> writes:

    David> I need to go back and forth between the old (旧字) and the
    David> modern kanji forms. I have a list of corresponding old and
    David> new forms and the standard utility "tr" works fine for a
    David> test case:

tr(1) is byte-oriented, as far as I know, and any resemblance to
success is using up your good karma.  What is happening is that you
are feeding "E4 BB 8F" and "E4 BD 9B" to tr, and it is mapping E4->E4,
BB->BD, and 8F->9B for you byte-by-byte.
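
To make the byte-by-byte mapping concrete, here is a minimal Python
sketch that emulates what a byte-oriented tr does to UTF-8 input.  The
helper name bytewise_tr is made up for illustration.  "E4 BB 8F" and
"E4 BD 9B" decode as UTF-8 to 仏 and 佛; the third character, 仮
(E4 BB AE), is my own example of an innocent bystander that happens to
share the BB byte.

```python
def bytewise_tr(text, old, new):
    """Emulate a byte-oriented tr: map each byte of old to the byte of new
    at the same position, with no notion of character boundaries."""
    table = bytes.maketrans(old.encode("utf-8"), new.encode("utf-8"))
    mapped = text.encode("utf-8").translate(table)
    # errors="replace" because a byte-wise mapping can yield invalid UTF-8
    return mapped.decode("utf-8", errors="replace")


# The test case appears to work: E4 BB 8F -> E4 BD 9B
print(bytewise_tr("仏", "仏", "佛"))

# But 仮 (E4 BB AE) also contains BB, so it is silently rewritten to
# E4 BD AE, which is a different character (U+4F6E).
print(bytewise_tr("仮", "仏", "佛"))
```

With one pair the damage is limited to characters sharing a byte with
the source character.  With a long list of pairs, the byte mappings
from different pairs overwrite each other and most of the text is at
risk.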

As far as I know byte-oriented is the case for all of "the usual
utilities", except that cut(1) claims to know about characters now.

    David> Which seems to make all the usual utilities work just fine
    David> with kanji inside the "konsole" (or plain old xterm as far
    David> as that goes).

That's because usage like "grep '[あ-ん]' file" is probably relatively
unusual for us gaijin.

Your best bet is to use a language like Python or **** that supports
Unicode internally.  They generally have functions that emulate the
standard command line utilities but work on Unicode strings as well as
on unibyte strings.  With **** you can probably write a one-liner.
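
As a rough sketch of the Python route, str.maketrans and str.translate
behave like tr but operate on whole characters rather than bytes.  The
three old/new pairs below are only illustrative (your real list would
replace them), and the names OLD, NEW, to_new and to_old are made up
for this example.

```python
# Illustrative pairs only; substitute the real correspondence list.
OLD = "佛國學"   # old forms
NEW = "仏国学"   # modern forms, in the same order as OLD

# str.maketrans requires both strings to have the same length and maps
# character N of the first to character N of the second.
TO_NEW = str.maketrans(OLD, NEW)
TO_OLD = str.maketrans(NEW, OLD)


def to_new(text):
    """Replace old kanji forms with their modern equivalents."""
    return text.translate(TO_NEW)


def to_old(text):
    """Replace modern kanji forms with their old equivalents."""
    return text.translate(TO_OLD)


print(to_new("佛學"))
print(to_old("国"))
```

Characters that are not in the list pass through untouched, which is
exactly what the byte-wise approach cannot guarantee.  For files, open
them with an explicit encoding (for example encoding="utf-8") so that
Python decodes the bytes to characters before translating.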

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
