tlug.jp Mailing List Archive
Re: [tlug] [OT] Strip Kanji from a document for study purposes
- Date: Wed, 19 Jul 2006 00:25:04 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] [OT] Strip Kanji from a document for study purposes
- References: <44BCAFF3.6030604@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b27 (linux)
>>>>> "Dave" == Dave M G <Dave> writes:

    Dave> There may be existing software that does what I'm looking
    Dave> for, but I haven't seen it. If you know of a suitable Linux
    Dave> based application, please let me know.

I doubt such a thing exists. It seems like a common thing to do, but actually there's infinite variation; even if you did find an application designed for the purpose, you'd still probably need to script it. Also, doing it well involves enough grunt work that I doubt you'll find an open source program (eg, last I heard rikaichan was not open at all, which is one reason why I don't use it).

    Dave> What I'd like to do is take a Japanese document and convert
    Dave> it into a list of the kanji included, and a list of
    Dave> words. Ideally repetitions would be removed,

This is easy, at least to the 90% accuracy level. You just assume that each contiguous batch of kanji starts a word. Getting past that would require looking up each possibility for a connective (eg, the め in 閉め切 [...]

[remainder of message unreadable in the archive copy]
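The "contiguous batch of kanji" heuristic described above could be sketched roughly as follows. This is only an illustration, not a program from the thread: the function names (`unique_in_order`, `kanji_list`, `kanji_runs`) are made up for the example, and "kanji" is assumed to mean the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves out rarer extension characters and the iteration mark 々.

```python
import re

# Assumption: "kanji" = basic CJK Unified Ideographs block only.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def unique_in_order(items):
    """Drop repetitions while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def kanji_list(text):
    """Every distinct kanji in the text, in order of first appearance."""
    return unique_in_order(ch for run in KANJI_RUN.findall(text) for ch in run)


def kanji_runs(text):
    """Every distinct contiguous run of kanji, taken as a rough 'word' start."""
    return unique_in_order(KANJI_RUN.findall(text))


if __name__ == "__main__":
    sample = "日本語の文書を読む。日本語は難しい。"
    print(kanji_list(sample))
    print(kanji_runs(sample))
```

Because the runs stop at the first kana, a word with okurigana in the middle is split into separate pieces, and inflected endings are dropped. Getting past that is the dictionary-lookup step mentioned in the message, which this sketch does not attempt.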
- Follow-Ups:
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Josh Glover
- References:
- [tlug] [OT] Strip Kanji from a document for study purposes
- From: Dave M G