
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Tue, 08 Aug 2006 10:54:19 +1000 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
Dave M G <martin@example.com> wrote:
>> Jim said:
>> > (a) the occurrence of a 【 】 encapsulates a reading, and after that
>> > you are into the translation region.
>> > (b) once you reach a space followed by an ASCII character (usually alphabetic
>> > or a "("), you are into the translation region. If you didn't encounter
>> > a 【 】 pair along the way, the Japanese can be assumed to be kana-only.
>> There seem to be other issues, such as where it starts out by saying
>> "possible inflected verb", and "partial match". Is it the case that
>> sometimes there might be some kind of English text before a Japanese word?
The "Possible inflected verb or adjective" is, I think, the only time
non-Japanese will start it. The [Partial Match] is put at the end when
there is not an exact match between text and entry.
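A rough Python sketch of peeling those two markers off an entry before further parsing. The exact punctuation around the markers in the page output is an assumption here (the sketch tolerates a colon or spaces after the leading one), and strip_markers is a hypothetical helper name, not anything in WWWJDIC:

```python
PREFIX = "Possible inflected verb or adjective"
SUFFIX = "[Partial Match]"

def strip_markers(entry):
    """Return (text, inflected, partial) with the two markers removed.

    Assumption: the leading marker may be followed by ':' or spaces;
    anything else after it is left in the text untouched.
    """
    text = entry.strip()
    inflected = text.startswith(PREFIX)
    if inflected:
        text = text[len(PREFIX):].lstrip(" :")
    partial = text.endswith(SUFFIX)
    if partial:
        text = text[:-len(SUFFIX)].rstrip()
    return text, inflected, partial
```

The two flags are returned rather than thrown away, since a flash card importer may want to skip or label partial matches.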
>> Or is the issue with my parser? In order to pull out definitions, I've
>> selected text that begins with <li> and ends with <br>, as this seems to
>> account for all words extracted from a WWWJDIC search.
That <br> is redundant. I may remove it at some stage. Better to extract
between <li> and the next <li> or the terminal </ul>.
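A minimal sketch of that extraction in Python, using a lookahead so each item ends at the next <li> or at </ul>. The sample markup and its two entries are made up for illustration, and extract_entries is a hypothetical name; the real page may carry attributes or extra tags this does not anticipate:

```python
import re

# Lazily capture everything after <li> up to (but not including)
# the next <li> or the closing </ul>.
ITEM = re.compile(r"<li>(.*?)(?=<li>|</ul>)", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def extract_entries(html):
    """Return the text of each list item with tags and extra whitespace removed."""
    entries = []
    for match in ITEM.finditer(html):
        text = TAG.sub("", match.group(1))   # drops the redundant <br> too
        entries.append(" ".join(text.split()))
    return entries

# Made-up sample, only to show the shape of the input.
sample = "<ul><li>犬 【いぬ】 (n) dog<br>\n<li>猫 【ねこ】 (n) cat<br>\n</ul>"
print(extract_entries(sample))
```

Because the <br> is removed along with any other tag, the result should not change if the <br> later disappears from the output.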
>> > The exception to the above is Japanese names, where you get
>> > stuff like
>> > 寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA
>>
>> Is it only Japanese names that have multiple readings? I would have
>> thought there would also be regular words with multiple readings,
>> especially with verbs with multiple inflections.
Several entries with kanji headwords have multiple readings. They will
be in the 【 】 region with ";" between them.
*In General* the entries in EDICT with multiple headwords/readings are
broken up into their combinations, and the glossdic only has the most common
one. Where the reading affects the meaning, I use a special file
which has hybrids like:
今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/
(2) (int) (こんにちは) hello/good day (daytime greeting)/
Note the [..] gets formatted as 【...】. Note also that there may be
Japanese text in the translation, e.g.
バヤイ (n-adv,n) case; situation; (slangy version of 場合)
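Rules (a) and (b) quoted at the top, plus the ";" between readings, could be sketched in Python roughly as follows. This is only an illustration for ordinary (non-name) entries; split_entry is a hypothetical name, and the first test line assumes the hybrid 今日は entry is displayed with its readings inside 【 】 as described:

```python
import re

READING = re.compile(r"【(.*?)】")
# Rule (b): a space followed by an ASCII letter or "(" starts the translation.
BOUNDARY = re.compile(r" (?=[A-Za-z(])")

def split_entry(entry):
    """Split one entry into headword, list of readings, and translation."""
    reading = READING.search(entry)
    boundary = BOUNDARY.search(entry)
    if reading and (boundary is None or reading.start() < boundary.start()):
        # Rule (a): the first 【 】 pair holds the reading(s).
        headword = entry[:reading.start()].strip()
        readings = [r.strip() for r in reading.group(1).split(";")]
        translation = entry[reading.end():].strip()
    elif boundary:
        # No 【 】 before the translation: treat the Japanese as kana-only.
        headword = entry[:boundary.start()].strip()
        readings = []
        translation = entry[boundary.end():].strip()
    else:
        headword, readings, translation = entry.strip(), [], ""
    return {"headword": headword, "readings": readings, "translation": translation}
```

Only the first 【 】 pair and the first boundary are used, so Japanese text later in the translation (such as 場合 above) stays inside the translation string.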
I do names differently because "glossdic" uses a special version of the
names file in which the readings/transliterations are strung out like that.
Also the merge attempts to put the common readings at the front.
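For name entries laid out like the 寿康 line quoted above, a Python sketch that pairs each reading with the text that follows it. split_name_entry is a hypothetical name, and the sketch makes no attempt to interpret trailing text such as the final "NA"; it is simply left attached to the last transliteration:

```python
import re

# One reading in 【 】 followed by everything up to the next 【.
PAIR = re.compile(r"【(.*?)】\s*([^【]*)")

def split_name_entry(entry):
    """Return (headword, [(reading, transliteration text), ...])."""
    head, sep, rest = entry.partition("【")
    pairs = [(reading, text.strip()) for reading, text in PAIR.findall(sep + rest)]
    return head.strip(), pairs

line = "寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA"
print(split_name_entry(line))
```

Since the common readings are put first, the first pair is a reasonable default when only one reading per card is wanted.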
>> Here's a question that has relevance to the flash card program that I am
>> importing data into:
>>
>> What word (or name) in the WWWJDIC server has the most readings and
>> definitions, and how many does it have?
Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses.
Readings are a bit harder to count, but I think there are entries with
5 or 6.
>> Botond said:
>> > You should also consider the fact that there are edict dictionary files
>> > in other languages also, not just Japanese-English.
>> That is a good consideration for a more generally adopted application.
>> However, even though I'd share the source with anyone who might find it
>> useful, what I'm working on now is for my own purposes and so I can
>> guarantee that I'm only going to be using the Japanese-English dictionaries.
Some people use that function of WWWJDIC with the French and/or
German dictionaries. The same parsing suggestions apply.
Cheers
Jim
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学