
Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Mon, 07 Aug 2006 10:23:47 +1000 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
Dave M G <martin@example.com> wrote:
>> I want to divide the first line into three variables, $word, $reading,
>> and $meaning. And I want to divide the second line into two variables,
>> $word and $meaning.
The output from WWWJDIC's text glosser usually comes in two types:
歴史 【れきし】 (n) history; (P); EP
and
ヨーロッパ (n) Europe; (P); EP
If I had to parse this stuff for the purposes you state, I'd use a
simple state machine. Assuming you are using WWWJDIC's default "glossdic",
which has about 800,000 words/expressions, you can assume:
(a) the occurrence of a 【 】 encapsulates a reading, and after that
you are into the translation region.
(b) once you reach a space followed by an ASCII character (usually alphabetic
or a "("), you are into the translation region. If you didn't encounter
a 【 】 pair along the way, the Japanese can be assumed to be kana-only.
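As a rough illustration of rules (a) and (b), here is a minimal Python sketch. The function name parse_gloss_line and the (word, reading, meaning) return shape are my own choices for the example, not anything WWWJDIC provides, and it assumes the line has already been decoded to Unicode and holds a single entry.

```python
def parse_gloss_line(line):
    """Split one glosser line into (word, reading, meaning).

    reading is None when no 【 】 pair is found (kana-only entry).
    Hypothetical helper for illustration; not part of WWWJDIC.
    """
    line = line.strip()
    open_idx = line.find("【")
    close_idx = line.find("】", open_idx + 1) if open_idx != -1 else -1

    if open_idx != -1 and close_idx != -1:
        # Rule (a): the bracket pair holds the reading, and everything
        # after the closing bracket is the translation region.
        word = line[:open_idx].strip()
        reading = line[open_idx + 1:close_idx]
        meaning = line[close_idx + 1:].strip()
        return word, reading, meaning

    # Rule (b): no brackets seen, so look for a space followed by an
    # ASCII character; that marks the start of the translation region.
    for i in range(len(line) - 1):
        nxt = line[i + 1]
        if line[i] == " " and ord(nxt) < 128 and not nxt.isspace():
            return line[:i].strip(), None, line[i + 1:].strip()

    # Nothing matched: treat the whole line as the word.
    return line, None, ""
```

Called on the two sample lines above, this should give 歴史 / れきし / "(n) history; (P); EP" for the first and ヨーロッパ with no reading for the second. It does not attempt the multi-reading name entries.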
The exception to the above is Japanese names, where you get
stuff like
寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA
as 寿康 can be read several ways. Again it can be parsed quite
deterministically. The trigger is the multiple occurrences of 【 】.
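For the names case, one possible sketch in Python is to take everything before the first 【 as the word, then collect each 【reading】 together with the text that follows it up to the next 【. The name parse_name_line and the list-of-pairs return value are assumptions made for this example. Trailing codes such as the final "NA" are left attached to the last gloss rather than interpreted.

```python
import re

# One reading in 【 】 followed by its gloss, which runs up to the next 【.
READING_GLOSS = re.compile(r"【([^】]*)】\s*([^【]*)")


def parse_name_line(line):
    """Return (word, [(reading, gloss), ...]) for a names entry.

    Hypothetical helper for illustration; the list is empty when the
    line contains no 【 】 pair at all.
    """
    line = line.strip()
    first = line.find("【")
    if first == -1:
        return line, []
    word = line[:first].strip()
    pairs = [
        (reading, gloss.strip())
        for reading, gloss in READING_GLOSS.findall(line[first:])
    ]
    return word, pairs
```

A caller could check len(pairs) > 1 to detect the multiple-reading trigger described above and fall back to the single-reading handling otherwise.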
Stephen suggested going to the XML sources. That really doesn't work,
as the glossdic file is built from 24 different files, only two
of which are available as XML. Also you'd miss out on the work
WWWJDIC puts into parsing the text, ducking and weaving around
verb and adjective inflections, etc.
HTH
Jim
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学