Re: [tlug] [OT] Regular Expressions to find Japanese Text

Date: Tue, 08 Aug 2006 11:42:20 +0900
From: Dave M G <martin@example.com>
Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
References: <200608080054.k780sJYO014497@example.com>
User-agent: Thunderbird 1.5.0.5 (X11/20060728)

Jim,

Thanks for your continued help.
>
> The "Possible inflected verb or adjective" is, I think, the only time
> non-Japanese will start it. The [Partial Match] is put at the end when
> there is not an exact match between text and entry.
>   
Okay, so I should first remove the phrase "Possible inflected verb or
adjective", and then after that I should be safe to parse out kanji,
yomikata, and definitions. I can also remove "partial match", or
anything within a <font> tag, just for cleanliness.
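
A rough sketch of that clean-up order in Python, in case it helps anyone following along. The sample line, the colon after the phrase, and the exact layout of headword, 【reading】 and gloss are my assumptions for illustration, not the real output, and kana-only entries without a 【 】 region would need a second pattern.

```python
import re

# Hypothetical sample line; the real output may differ in detail.
SAMPLE = ('Possible inflected verb or adjective: '
          '食べる 【たべる】 (v1,vt) to eat '
          '<font color="red">[Partial Match]</font>')

# Anything inside a <font> tag, including the tag itself.
FONT_RE = re.compile(r'<font\b[^>]*>.*?</font>', re.IGNORECASE | re.DOTALL)
# The only non-Japanese text expected at the start of an entry.
PREFIX_RE = re.compile(r'Possible inflected verb or adjective:?\s*', re.IGNORECASE)
# A bare marker, in case it is not wrapped in a <font> tag.
PARTIAL_RE = re.compile(r'\[Partial Match\]', re.IGNORECASE)
# Assumed layout: headword, then 【reading】, then the definitions.
ENTRY_RE = re.compile(
    r'^(?P<kanji>[^【\s]+)\s*【(?P<reading>[^】]*)】\s*(?P<gloss>.*)$',
    re.DOTALL)


def clean(text):
    """Strip the markers first, so only the entry itself is left."""
    text = FONT_RE.sub('', text)
    text = PREFIX_RE.sub('', text)
    text = PARTIAL_RE.sub('', text)
    return text.strip()


def parse(text):
    """Return (kanji, yomikata, definitions), or None if the layout differs."""
    m = ENTRY_RE.match(clean(text))
    if m is None:
        return None
    return m.group('kanji'), m.group('reading'), m.group('gloss').strip()
```

Calling parse(SAMPLE) gives the three parts as a tuple; anything that does not fit the assumed layout comes back as None rather than a wrong guess.
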
>
> That <br> is redundant. I may remove it at some stage. Better to extract
> between <li> and the next <li> or the terminal </ul>.
>   
If I may make a suggestion:

I recommend that if you do remove the <br> tag, which is definitely
redundant, you should replace it with a closing </li> tag. This will
make it more compatible with strict XHTML. I think evolving towards
XHTML compliance with your HTML output would be a very good thing.

Also, in the case of parsing as I'm doing, finding the next <li> tag or
terminal </ul> tag might be complicated by line breaks between them. Not
insurmountable, just complicated, and simply not an issue if the HTML
were XHTML compliant.
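
For what it's worth, the line-break problem can probably be handled with a "dot matches newline" flag. Below is a minimal sketch in Python; the function name and the sample markup are made up for illustration. It takes everything from each <li> up to the next <li>, a closing </li> if one ever appears, or the terminal </ul>.

```python
import re

# Lazy match from <li> up to (not including) the next <li>, </li> or </ul>.
# re.DOTALL lets "." cross line breaks inside an item.
ITEM_RE = re.compile(r'<li>(.*?)(?=<li>|</li>|</ul>)',
                     re.IGNORECASE | re.DOTALL)
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)


def list_items(html):
    """Return the text of each list item with <br> and line breaks flattened."""
    items = []
    for body in ITEM_RE.findall(html):
        body = BR_RE.sub(' ', body)        # drop the redundant <br>
        body = re.sub(r'\s+', ' ', body)   # collapse newlines and runs of spaces
        items.append(body.strip())
    return items
```

Because the lookahead also accepts </li>, the same pattern should keep working if the output later gains closing tags.
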

> Several entries with kanji headwords have multiple readings. They will
> be in the 【 】 region with ";" between them. 
>
> *In General* the entries in EDICT with multiple headwords/readings are
> broken up into their combinations, and the glossdic only has the most common
> one. Where the reading affects the meaning, I use a special file
> which has hybrids like:
>
> 	今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/
> 	(2) (int) (こんにちは) hello/good day (daytime greeting)/
>   

Hmm... A little tricky, but I think I can handle it.
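
Here is how I imagine splitting such a hybrid entry, as a sketch only. It assumes the layout shown in the example above (headword, readings in [ ] or 【 】 separated by ";", then "/"-separated glosses grouped under (1), (2), ...); the function and pattern names are my own.

```python
import re

# The example entry quoted above, with its continuation line kept.
LINE = ('今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/\n'
        '\t(2) (int) (こんにちは) hello/good day (daytime greeting)/')

# Headword, then the readings inside [ ] or 【 】, then everything else.
HEAD_RE = re.compile(
    r'^(?P<head>\S+)\s+[\[【](?P<readings>[^\]】]*)[\]】]\s*(?P<rest>.*)$',
    re.DOTALL)
# A numbered sense runs from "(n)" up to the next "(n)" or the end of the text.
SENSE_RE = re.compile(r'\((\d+)\)\s*(.*?)(?=\(\d+\)|$)', re.DOTALL)


def split_entry(line):
    """Return (headword, [readings], {sense number: [glosses]}) or None."""
    m = HEAD_RE.match(line.strip())
    if m is None:
        return None
    # Accept both the ASCII and the full-width semicolon between readings.
    readings = [r.strip() for r in re.split(r'[;；]', m.group('readings'))
                if r.strip()]
    senses = {}
    for num, body in SENSE_RE.findall(m.group('rest')):
        senses[int(num)] = [g.strip() for g in body.split('/') if g.strip()]
    return m.group('head'), readings, senses
```

The part-of-speech tag and the reading in parentheses stay attached to the first gloss of each sense; pulling those apart would need another pass.
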

> Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses.
> Readings are a bit harder to count, but I think there are entries with 
> 5 or 6.
That's very helpful to know.

What do you mean by "grouped in 11 senses"? That there are eleven
semicolons dividing up the 25 meanings?

--
Dave M G

References:
- Re: [tlug] [OT] Regular Expressions to find Japanese Text
  - From: Jim Breen
