Re: [tlug] [OT] Regular Expressions to find Japanese Text



Jim,

Thanks for your continued help.
>
> The "Possible inflected verb or adjective" is, I think, the only time
> non-Japanese will start it. The [Partial Match] is put at the end when
> there is not an exact match between text and entry.
>   
Okay, so I should first remove the phrase "Possible inflected verb or
adjective", and then after that I should be safe to parse out kanji,
yomikata, and definitions. I can also remove "partial match", or
anything within a <font> tag, just for cleanliness.
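
A rough sketch of that cleanup pass in Python, assuming the notes appear literally as "Possible inflected verb or adjective" and "[Partial Match]", and that the <font> tags only wrap such notes. The function name clean_entry and the sample line are made up for illustration; the real output may differ.

```python
import re

def clean_entry(html: str) -> str:
    """Strip the non-dictionary notes from one result line (hypothetical helper)."""
    # Drop anything inside <font>...</font>; assumed to hold only notes.
    text = re.sub(r"<font\b[^>]*>.*?</font>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop the leading note about inflected forms.
    text = re.sub(r"Possible inflected verb or adjective:?", "", text,
                  flags=re.IGNORECASE)
    # Drop a bare "[Partial Match]" that was not inside a <font> tag.
    text = re.sub(r"\[?\s*partial match\s*\]?", "", text, flags=re.IGNORECASE)
    # Collapse leftover runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample line, not real server output.
sample = ('Possible inflected verb or adjective 付ける 【つける】 to attach '
          '<font color="red">[Partial Match]</font>')
print(clean_entry(sample))
```

Doing the removal before the kanji/yomikata/definition parsing means the later patterns never have to account for these notes.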
>
> That <br> is redundant. I may remove it at some stage. Better to extract
> between <li> and the next <li> or the terminal </ul>.
>   
If I may make a suggestion:

I recommend that if you do remove the <br> tag, which is definitely
redundant, you replace it with a closing </li> tag. This will make the
output more compatible with strict XHTML. I think moving your HTML
output towards XHTML compliance would be a very good thing.

Also, when parsing as I'm doing, finding the next <li> tag or the
terminal </ul> tag might be complicated by line breaks between them. Not
insurmountable, just complicated, and simply not an issue if the HTML
were XHTML compliant.
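
For what it's worth, the line-break problem can probably be handled by letting "." match newlines. A minimal sketch in Python, assuming each entry starts at an <li> and runs to the next <li>, a closing </li> if one is ever added, or the terminal </ul>. The names ITEM_RE and extract_items and the sample markup are hypothetical.

```python
import re

# DOTALL lets "." cross line breaks; the lookahead stops at the next
# <li>, a closing </li>, or the terminal </ul> without consuming it.
ITEM_RE = re.compile(r"<li\b[^>]*>(.*?)(?=<li\b|</li>|</ul>)",
                     re.IGNORECASE | re.DOTALL)

def extract_items(html: str) -> list[str]:
    """Return the text of each list item, with <br> and line breaks flattened."""
    items = []
    for raw in ITEM_RE.findall(html):
        # The redundant <br> becomes a space, then whitespace is collapsed.
        text = re.sub(r"<br\s*/?>", " ", raw, flags=re.IGNORECASE)
        items.append(re.sub(r"\s+", " ", text).strip())
    return items

# Made-up markup with unclosed <li> tags and a line break inside an item.
sample = ("<ul>\n<li>今日 【きょう】 today<br>\n"
          "<li>明日 【あした】\n tomorrow<br>\n</ul>")
print(extract_items(sample))
```

Because the pattern accepts both the unclosed and the closed form, it should keep working if the <br> is later replaced by </li>.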

> Several entries with kanji headwords have multiple readings. They will
> be in the 【 】 region with ";" between them. 
>
> *In General* the entries in EDICT with multiple headwords/readings are
> broken up into their combinations, and the glossdic only has the most common
> one. Where the reading affects the meaning, I use a special file
> which has hybrids like:
>
> 	今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/
> 	(2) (int) (こんにちは) hello/good day (daytime greeting)/
>   

Hmm... A little tricky, but I think I can handle it.
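
A first attempt at splitting a hybrid line like the one quoted above, sketched in Python. It assumes the layout "headwords [readings] /gloss/gloss/", that ";" separates multiple headwords and readings, and that a gloss starting with "(1)", "(2)" and so on opens a new sense. Those assumptions come only from the sample; parse_entry and ENTRY_RE are made-up names.

```python
import re

# headwords, then [readings], then the slash-delimited glosses.
ENTRY_RE = re.compile(
    r"^(?P<heads>\S+)\s+\[(?P<readings>[^\]]+)\]\s+/(?P<body>.*)/\s*$",
    re.DOTALL,
)

def parse_entry(line: str):
    """Split one entry into headwords, readings, and senses (hypothetical helper)."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    senses = []
    for gloss in m.group("body").split("/"):
        gloss = gloss.strip()
        if not gloss:
            continue
        # Assumption: a leading "(N)" marks the start of a new sense.
        marker = re.match(r"\((\d+)\)\s*", gloss)
        if marker or not senses:
            senses.append([])
        if marker:
            gloss = gloss[marker.end():]
        senses[-1].append(gloss)
    return {
        "headwords": m.group("heads").split(";"),
        "readings": m.group("readings").split(";"),
        "senses": senses,
    }

line = ("今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/"
        "(2) (int) (こんにちは) hello/good day (daytime greeting)/")
entry = parse_entry(line)
print(entry["readings"])
print(entry["senses"])
```

The part-of-speech tags and the reading restrictions such as (きょうは) are left inside the gloss text here; pulling them out would need another pass.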

> Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses.
> Readings are a bit harder to count, but I think there are entries with 
> 5 or 6.

That's very helpful to know.

What do you mean by "grouped in 11 senses"? Do you mean that there are
eleven semicolons dividing up the 25 meanings?

--
Dave M G



