Mailing List Archive



Re: [tlug] [OT] Strip Kanji from a document for study purposes



>>>>> "Dave" == Dave M G <Dave> writes:

    Dave> There may be existing software that does what I'm looking
    Dave> for, but I haven't seen it. If you know of a suitable Linux
    Dave> based application, please let me know.

I doubt such a thing exists.  It seems like a common thing to do, but
there's actually endless variation; even if you did find an
application designed for the purpose, you'd probably still need to
script it.  Also, doing it well involves enough grunt work that I
doubt you'll find an open source program (eg, last I heard rikaichan
was not open at all, which is one reason why I don't use it).

    Dave> What I'd like to do is take a Japanese document and convert
    Dave> it into a list of the kanji included, and a list of
    Dave> words. Ideally repetitions would be removed,

This is easy, at least to the 90% accuracy level.  You just assume
that each contiguous batch of kanji starts a word.  Getting past that
would require looking up each possibility for a connective (eg, the め
in 閉め切た), but isn't conceptually that hard.
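
A minimal sketch of that contiguous-kanji heuristic, in Python rather
than Lisp.  The function names are made up for illustration, and the
Unicode range used for "kanji" is an assumption (the basic CJK block
plus the iteration mark 々); trailing hiragana is kept as a rough
stand-in for okurigana, so particles get swallowed too.

```python
import re

# Assumption: "kanji" means the basic CJK Unified Ideographs block plus 々.
KANJI = r"\u4e00-\u9fff\u3005"
HIRAGANA = r"\u3041-\u3096"

KANJI_RUN = re.compile(f"[{KANJI}]+")
# Candidate word: a run of kanji plus any hiragana directly after it.
CANDIDATE = re.compile(f"[{KANJI}]+[{HIRAGANA}]*")


def unique(items):
    """Drop repetitions while keeping first-seen order."""
    return list(dict.fromkeys(items))


def kanji_list(text):
    """Every distinct kanji in the text, in order of first appearance."""
    return unique(ch for run in KANJI_RUN.findall(text) for ch in run)


def candidates(text):
    """Distinct kanji-led chunks; each is only a guess at a word."""
    return unique(CANDIDATE.findall(text))


if __name__ == "__main__":
    sample = "窓を閉め切った。窓を開ける。"
    print(kanji_list(sample))
    print(candidates(sample))
```

On the sample, 閉め切った comes out as two chunks (閉め and 切った), which
is exactly the connective problem described above.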

    Dave> as would particles and other grammatical inflections.

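edict.el (an Emacs add-on based on EDICT) does a halfway job of this
(specifically, you often need to tell it where words begin and end,
and it's much better at inverting morphosyntactic changes than it is
at removing particles).  Kakasi will also do this, quite a bit better.
I've never used kakasi in my own applications, though.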

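I've extended edict.el to simply search for kanji and assume that's a
word beginning, then start applying morphological transformations and
produce a list of candidates at each position.  However, it was only a
proof of concept; quite slow and producing far too many candidates.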

Rather than refine it, I decided that online usage where the user
specified the word was better than batch processing.  Still, if you can
hack Lisp this would be a good place to start.
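
For readers who would rather not start from Lisp, here is a rough
Python sketch of the same idea: treat each kanji as a possible word
beginning, undo a few inflections, and keep whatever the dictionary
recognizes.  It is not the edict.el code; the rule table and the
lexicon below are tiny hypothetical stand-ins, and a real rule set
would be far larger.

```python
# Hypothetical rule table: (inflected ending, dictionary-form ending).
# The ("", "") entry tries the surface string unchanged.
RULES = [("った", "る"), ("った", "う"), ("って", "る"), ("", "")]

# Hypothetical stand-in for a real dictionary such as EDICT.
LEXICON = {"閉め切る", "切る", "窓"}


def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def dictionary_forms(surface):
    """Yield each form obtained by undoing one rule on `surface`."""
    for inflected, base in RULES:
        if surface.endswith(inflected):
            yield surface[: len(surface) - len(inflected)] + base


def candidates_at(text, start, max_len=6):
    """Dictionary words that could begin at position `start`."""
    found = []
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        for form in dictionary_forms(text[start:end]):
            if form in LEXICON and form not in found:
                found.append(form)
    return found


def scan(text):
    """Map each kanji position to its list of candidate words."""
    return {i: candidates_at(text, i)
            for i, ch in enumerate(text) if is_kanji(ch)}


if __name__ == "__main__":
    print(scan("窓を閉め切った"))
```

Every substring is tried against every rule at every kanji position,
which is why this kind of batch scan gets slow and over-generates once
the rule table and dictionary are realistic.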

    Dave> Hiragana and katakana words could be dropped too.

This is easy.

    Dave> My ultimate goal would be to create a list that has
    Dave> definitions and readings.

This is easy, you just search through the EDICT database.
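
A minimal sketch of such a lookup in Python, assuming the usual
EDICT-style line layout of headword, optional [reading], then glosses
between slashes.  The sample lines are illustrative and not copied
from the real file, and parse_edict and lookup are made-up names.

```python
import re

# EDICT-style entry: headword, optional [reading], then /gloss/gloss/.
ENTRY = re.compile(r"^(\S+)\s+(?:\[([^\]]+)\]\s+)?/(.*)/\s*$")


def parse_edict(lines):
    """Build {headword: [(reading, [glosses]), ...]} from entry lines."""
    index = {}
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue  # skip header lines and anything malformed
        word, reading, glosses = m.groups()
        # Kana-only entries have no [reading]; the headword is the reading.
        index.setdefault(word, []).append((reading or word, glosses.split("/")))
    return index


def lookup(index, words):
    """Return (word, entries) for each word the dictionary knows."""
    return [(w, index[w]) for w in words if w in index]


# Illustrative entries only.
SAMPLE = [
    "窓 [まど] /(n) window/",
    "切る [きる] /(v5r,vt) to cut/",
    "ひらがな /(n) hiragana/",
]

if __name__ == "__main__":
    index = parse_edict(SAMPLE)
    for word, entries in lookup(index, ["窓", "切る", "未知"]):
        print(word, entries)
```

To run it against a real copy, something like
parse_edict(open(path, encoding="euc_jp")) should work, on the
assumption that your copy of the file is EUC-JP encoded; check the
encoding first.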

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

