tlug.jp Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Tue, 08 Aug 2006 11:42:20 +0900
- From: Dave M G <martin@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
- References: <200608080054.k780sJYO014497@example.com>
- User-agent: Thunderbird 1.5.0.5 (X11/20060728)
Jim,

Thanks for your continued help.

> The "Possible inflected verb or adjective" is, I think, the only time
> non-Japanese will start it. The [Partial Match] is put at the end when
> there is not an exact match between text and entry.

Okay, so I should first remove the phrase "Possible inflected verb or adjective", and then after that I should be safe to parse out kanji, yomikata, and definitions.

I can also remove "partial match", or anything within a <font> tag, just for cleanliness.

> That <br> is redundant. I may remove it at some stage. Better to extract
> between <li> and the next <li> or the terminal </ul>.

If I may make a suggestion: I recommend that if you do remove the <br> tag, which is definitely redundant, you should replace it with a closing </li> tag. This will make it more compatible with strict XHTML. I think evolving towards XHTML compliance with your HTML output would be a very good thing.

Also, in the case of parsing as I'm doing, finding the next <li> tag or terminal </ul> tag might be complicated by line breaks between them. Not insurmountable, just complicated, and simply not an issue if the HTML were XHTML compliant.

> Several entries with kanji headwords have multiple readings. They will
> be in the 【 】 region with ";" between them.
>
> *In General* the entries in EDICT with multiple headwords/readings are
> broken up into their combinations, and the glossdic only has the most common
> one. Where the reading affects the meaning, I use a special file
> which has hybrids like:
>
> 今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/
> (2) (int) (こんにちは) hello/good day (daytime greeting)/

Hmm... A little tricky, but I think I can handle it.

> Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses.
> Readings are a bit harder to count, but I think there are entries with
> 5 or 6.

That's very helpful to know. What do you mean by "grouped in 11 senses"? That there are eleven semicolons dividing up the 25 meanings?

--
Dave M G
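
The parsing steps discussed in this message (take each item from one <li> up to the next <li> or the terminal </ul>, drop anything inside a <font> tag, remove the "Possible inflected verb or adjective" note, then split the readings in the 【 】 region on ";") could be sketched roughly as below. This is only an illustration under assumptions: the SAMPLE markup is invented for the example and may not match the real output, the function name extract_entries is hypothetical, and "[Partial Match]" is assumed to sit inside the <font> tag.

```python
import re

# Hypothetical sample markup; the real glossing output may differ.
SAMPLE = """<ul>
<li>Possible inflected verb or adjective: 付ける 【つける】 (v1,vt) to attach<br>
<li>今日は 【きょうは;こんにちは】 (n-t) today/this day; (int) hello/good day
<font color="red">[Partial Match]</font><br>
</ul>"""

NOTE = "Possible inflected verb or adjective"


def extract_entries(html):
    """Return a list of (headword, readings, gloss) tuples."""
    # Capture everything from one <li> up to the next <li> or the closing
    # </ul>. re.DOTALL lets '.' cross line breaks inside an item.
    items = re.findall(r"<li>(.*?)(?=<li>|</ul>)", html,
                       re.DOTALL | re.IGNORECASE)
    entries = []
    for item in items:
        # Drop <font>...</font> spans (assumed to hold "[Partial Match]").
        item = re.sub(r"<font\b[^>]*>.*?</font>", "", item,
                      flags=re.DOTALL | re.IGNORECASE)
        # Drop any remaining tags, such as the redundant <br>.
        item = re.sub(r"<[^>]+>", "", item)
        # Remove the leading note, with an optional colon after it.
        item = re.sub(r"^\s*" + re.escape(NOTE) + r"\s*:?\s*", "", item)
        # Collapse runs of whitespace, including line breaks.
        item = " ".join(item.split())
        # Headword, then readings inside 【 】, then the gloss text.
        m = re.match(r"(\S+)\s*【([^】]*)】\s*(.*)", item)
        if m:
            headword, readings, gloss = m.groups()
            entries.append((headword, readings.split(";"), gloss))
    return entries


if __name__ == "__main__":
    for headword, readings, gloss in extract_entries(SAMPLE):
        print(headword, readings, gloss)
```

Using a lookahead for the next <li> or </ul> together with re.DOTALL is one way around the line-break concern raised above; a real HTML parser would likely be more robust if the markup changes.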
- References:
- Re: [tlug] [OT] Regular Expressions to find Japanese Text
- From: Jim Breen