Mailing List Archive



Re: [tlug] [OT] Regular Expressions to find Japanese Text



Dave M G <martin@example.com> wrote:
>> I want to divide the first line into three variables, $word, $reading,
>> and $meaning. And I want to divide the second line into two variables,
>> $word and $meaning.

The output from WWWJDIC's text glosser usually comes in two types:

 歴史 【れきし】 (n) history; (P); EP

and 
 ヨーロッパ (n) Europe; (P); EP

If I had to parse this stuff for the purposes you state, I'd use a 
simple state machine. Assuming you are using WWWJDIC's default "glossdic", 
which has about 800,000 words/expressions, you can assume:

(a) a 【 】 pair encloses a reading, and everything after the closing 】
is the translation region.
(b) a space followed by an ASCII character (usually alphabetic or a "(")
also marks the start of the translation region. If you didn't encounter
a 【 】 pair along the way, the Japanese can be assumed to be kana-only.
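
For illustration, rules (a) and (b) above could be sketched as a small
character-by-character state machine in Python. This is only a sketch under
assumptions: the function name parse_gloss_line and the three state names are
made up for this example, it handles one glosser line at a time, and it does
not try to deal with the multi-reading name entries.

```python
def parse_gloss_line(line):
    """Split one glosser line into (word, reading, meaning).

    Hypothetical helper: the name and the state labels are illustrative.
    reading is None when no 【 】 pair was seen (kana-only headword).
    """
    word, reading, meaning = [], [], []
    state = "WORD"
    saw_reading = False
    text = line.strip()

    for i, ch in enumerate(text):
        if state == "WORD":
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if ch == "【":
                # Rule (a): a reading starts here.
                state = "READING"
                saw_reading = True
            elif ch == " " and nxt and nxt != " " and ord(nxt) < 128:
                # Rule (b): space followed by an ASCII character.
                state = "MEANING"
            else:
                word.append(ch)
        elif state == "READING":
            if ch == "】":
                # End of the reading; the rest is translation.
                state = "MEANING"
            else:
                reading.append(ch)
        else:
            meaning.append(ch)

    return (
        "".join(word).strip(),
        "".join(reading) if saw_reading else None,
        "".join(meaning).strip(),
    )
```

Calling parse_gloss_line on the 歴史 line should give a word, a reading and a
meaning, while the ヨーロッパ line should give a word, None and a meaning.
Anything after the first 】, including any later 【 】 pairs, ends up in the
meaning, so name entries need separate handling.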

The exception to the above is Japanese names, where you get
stuff like 

 寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA

as 寿康 can be read several ways. Again it can be parsed quite 
deterministically. The trigger is the multiple occurrences of 【 】.
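
As a rough sketch of the name case, the repeated 【 】 pairs can be collected
with a regular expression, pairing each reading with the text that follows it
up to the next 【. The function name parse_name_line is made up for this
example, and any trailing dictionary code (such as the NA above) is left
attached to the last gloss rather than interpreted.

```python
import re

# One reading in 【 】 followed by everything up to the next 【 (or end of line).
READING_BLOCK = re.compile(r"【([^】]*)】([^【]*)")


def parse_name_line(line):
    """Return (word, [(reading, gloss), ...]) for one glosser line.

    Hypothetical helper for illustration only. The list is empty when the
    line contains no 【 】 pair at all.
    """
    text = line.strip()
    first = text.find("【")
    if first == -1:
        return text, []
    word = text[:first].strip()
    pairs = [
        (reading, gloss.strip())
        for reading, gloss in READING_BLOCK.findall(text[first:])
    ]
    return word, pairs


def has_multiple_readings(line):
    """The trigger mentioned above: more than one 【 】 pair on the line."""
    return len(READING_BLOCK.findall(line)) > 1
```

A caller could test has_multiple_readings first and fall back to the ordinary
word/reading/meaning split when it returns False.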

Stephen suggested going to the XML sources. That really doesn't work,
as the glossdic file is built from 24 different files, only two
of which are available as XML. Also you'd miss out on the work
WWWJDIC puts into parsing the text, ducking and weaving around
verb and adjective inflections, etc.

HTH

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学

