Re: [tlug] [OT] Regular Expressions to find Japanese Text

Date: Tue, 08 Aug 2006 10:54:19 +1000 (EST)
From: Jim Breen <Jim.Breen@example.com>
Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text

 Dave M G <martin@example.com> wrote:
>> Jim said:
>> > (a) the occurrence of a 【 】 encapsulates a reading, and after that
>> > you are into the translation region.
>> > (b) once you reach a space followed by an ASCII character (usually alphabetic
>> > or a "("), you are into the translation region. If you didn't encounter
>> > a 【 】 pair along the way, the Japanese can be assumed to be kana-only.

>> There seem to be other issues, such as where it starts out by saying
>> "possible inflected verb", and "partial match". Is it the case that
>> sometimes there might be some kind of English text before a Japanese word?

The "Possible inflected verb or adjective" line is, I think, the only
case where an entry starts with non-Japanese text. The [Partial Match]
tag is put at the end when the text and the entry do not match exactly.
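
A minimal Python sketch of peeling those two markers off an entry before
parsing it. The function name is hypothetical, and it assumes the markers
appear as exactly the literal strings quoted above; whatever follows the
"Possible inflected ..." phrase (other than spaces or a colon) is left in place.

```python
# Assumption: the markers appear as exactly these literal strings.
PREFIX = "Possible inflected verb or adjective"
SUFFIX = "[Partial Match]"

def strip_markers(entry):
    """Return (text, inflected, partial) with the two markers removed."""
    text = entry.strip()
    inflected = text.startswith(PREFIX)
    if inflected:
        # Drop the phrase plus any spaces or colon directly after it.
        text = text[len(PREFIX):].lstrip(" :")
    partial = text.endswith(SUFFIX)
    if partial:
        text = text[:-len(SUFFIX)].rstrip()
    return text, inflected, partial
```

Keeping the two flags rather than discarding them lets a flash card
importer decide later whether to skip inexact matches.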

>> Or is the issue with my parser? In order to pull out definitions, I've
>> selected text that begins with <li> and ends with <br>, as this seems to
>> account for all words extracted from a WWWJDIC search.

That <br> is redundant. I may remove it at some stage. Better to extract
between <li> and the next <li> or the terminal </ul>.
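
Something along these lines should do it in Python. This is only a sketch:
the function name is made up, and it assumes the entries sit in a flat
<ul> list with unclosed <li> items, as described above.

```python
import re

# Capture everything after an <li> up to (not including) the next <li>,
# the closing </ul>, or the end of the input.
ENTRY_RE = re.compile(r"<li>(.*?)(?=<li>|</ul>|\Z)", re.DOTALL | re.IGNORECASE)

def extract_entries(html):
    """Return the text of each <li> entry with tags and extra spaces removed."""
    entries = []
    for raw in ENTRY_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", raw)    # drop <br> and any other tags
        entries.append(" ".join(text.split()))  # collapse runs of whitespace
    return entries
```

Because the <br> is simply stripped as a tag, the result does not change
if it is later removed from the output.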

>> > The exception to the above is Japanese names, where you get
>> > stuff like 
>> >  寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA
>> 
>> Is it only Japanese names that have multiple readings? I would have
>> thought there would also be regular words with multiple readings,
>> especially with verbs with multiple inflections.

Several entries with kanji headwords have multiple readings. They will
be in the 【 】 region with ";" between them. 

*In General* the entries in EDICT with multiple headwords/readings are
broken up into their combinations, and the glossdic only has the most common
one. Where the reading affects the meaning, I use a special file
which has hybrids like:

	今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/
	(2) (int) (こんにちは) hello/good day (daytime greeting)/

Note the [..] gets formatted as 【...】. Note also that there may be
Japanese text in the translation, e.g.

	バヤイ (n-adv,n) case; situation; (slangy version of 場合)
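
Rules (a) and (b) quoted above, plus the ";" separator, can be sketched
in Python roughly as follows. The names are hypothetical, and the sketch
assumes one ordinary (non-name) entry per string with tags already removed.
Because only the first 【 】 pair or the first space-plus-ASCII boundary is
used, Japanese text later in the translation does not confuse it.

```python
import re

# Rule (a): the first 【 】 pair holds the reading(s); the rest is translation.
READING_RE = re.compile(
    r"^(?P<head>.*?)\s*【(?P<readings>[^】]*)】\s*(?P<gloss>.*)$", re.DOTALL)
# Rule (b): no 【 】, so the translation starts at the first run of
# whitespace that is followed by a printable ASCII character.
KANA_ONLY_RE = re.compile(
    r"^(?P<head>.*?)\s+(?P<gloss>[\x21-\x7e].*)$", re.DOTALL)

def split_entry(line):
    """Return (headword, readings, translation) for one entry."""
    line = line.strip()
    m = READING_RE.match(line)
    if m:
        readings = [r.strip() for r in m.group("readings").split(";") if r.strip()]
        return m.group("head"), readings, m.group("gloss")
    m = KANA_ONLY_RE.match(line)
    if m:
        return m.group("head"), [], m.group("gloss")  # kana-only headword
    return line, [], ""  # nothing recognisable as a translation
```

For the kana-only case the empty readings list tells the caller that the
headword is its own reading.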

I do names differently because "glossdic" uses a special version of the 
names file in which the readings/transliterations are strung out like that.
Also the merge attempts to put the common readings at the front.
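
For the names format, a Python sketch that pairs each 【 】 reading with
the text that follows it might look like this. The names are hypothetical,
and it makes no attempt to interpret the type codes or any trailing tokens;
they stay attached to the transliteration they follow.

```python
import re

# Each reading in 【 】 is followed by its transliteration, which runs
# up to the next 【 or the end of the line.
NAME_PAIR_RE = re.compile(r"【([^】]+)】\s*([^【]*)")

def split_name_entry(line):
    """Return (headword, [(reading, transliteration), ...]) for a name entry."""
    head, sep, rest = line.strip().partition("【")
    if not sep:
        return head.strip(), []  # no readings found
    pairs = [(reading, translit.strip())
             for reading, translit in NAME_PAIR_RE.findall(sep + rest)]
    return head.strip(), pairs
```

The pairs come back in the order they appear, so the common readings
that the merge puts at the front stay at the front.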

>> Here's a question that has relevance to the flash card program that I am
>> importing data into:
>> 
>> What word (or name) in the WWWJDIC server has the most readings and
>> definitions, and how many does it have?

Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses.
Readings are a bit harder to count, but I think there are entries with 
5 or 6.

>> Botond said:
>> > You should also consider the fact that there are edict dictionary files
>> > in other languages also, not just Japanese-English.

>> That is a good consideration for a more generally adopted application.
>> However, even though I'd share the source with anyone who might find it
>> useful, what I'm working on now is for my own purposes and so I can
>> guarantee that I'm only going to be using the Japanese-English dictionaries.

Some people use that function of WWWJDIC with the French and/or
German dictionaries. The same parsing suggestions apply.

Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学
