Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Mon, 07 Aug 2006 13:29:29 +0900
- From: Dave M G <martin@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
- References: <200608070023.k770Nlka025548@example.com>
- User-agent: Thunderbird 1.5.0.5 (X11/20060728)
Botond, Stephen, Jim,

Thank you all for your responses and insights.

Jim said:
> (a) the occurrence of a 【 】 encapsulates a reading, and after that
> you are into the translation region.
> (b) once you reach a space followed by an ASCII character (usually alphabetic
> or a "("), you are into the translation region. If you didn't encounter
> a 【 】 pair along the way, the Japanese can be assumed to be kana-only.

There seem to be other issues, such as entries that start out by saying "possible inflected verb" or "partial match". Is it the case that sometimes there might be some kind of English text before a Japanese word? Or is the issue with my parser?

In order to pull out definitions, I've selected text that begins with <li> and ends with <br>, as this seems to account for all words extracted from a WWWJDIC search.

> The exception to the above is Japanese names, where you get
> stuff like
> 寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA

Is it only Japanese names that have multiple readings? I would have thought there would also be regular words with multiple readings, especially verbs with multiple inflections.

If only names have multiple readings, I may ditch them for the time being, to give study priority to other words. But if regular words also have multiple readings and definitions, then I will come up with a plan to account for them.

Here's a question that has relevance to the flash card program that I am importing data into: What word (or name) in the WWWJDIC server has the most readings and definitions, and how many does it have?

Botond said:
> You should also consider the fact that there are edict dictionary files
> in other languages also, not just Japanese-English.

That is a good consideration for a more generally adopted application. However, even though I'd share the source with anyone who might find it useful, what I'm working on now is for my own purposes, so I can guarantee that I'm only going to be using the Japanese-English dictionaries.

--
Dave M G
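
For reference, here is a minimal Python sketch of how Jim's rules (a) and (b), the <li> ... <br> selection, and the multiple-reading name case might fit together. It is only an illustration under assumptions: the function names extract_entries and parse_entry are hypothetical, the entry layout is inferred solely from the examples quoted in this message, the separator in rule (b) is assumed to be a plain ASCII space, and leading text such as "possible inflected verb" is not handled.

```python
import re

# All patterns below are assumptions based on the examples in this thread,
# not a description of the actual WWWJDIC output format.
LI_BR = re.compile(r"<li>(.*?)<br>", re.DOTALL | re.IGNORECASE)
READING = re.compile(r"【([^】]*)】")
# Rule (b): an ASCII space followed by an ASCII letter or "(".
TRANSLATION_START = re.compile(r" (?=[A-Za-z(])")


def extract_entries(html):
    """Return the text of every <li> ... <br> span, stripped of whitespace."""
    return [chunk.strip() for chunk in LI_BR.findall(html)]


def parse_entry(text):
    """Split one entry into (headword, [(reading, translation), ...]).

    A reading of None means no 【 】 pair was found, so the headword is
    assumed to be kana-only.
    """
    text = text.strip()
    readings = list(READING.finditer(text))

    if not readings:
        # Rule (b) only: everything before the first " X" is the headword.
        start = TRANSLATION_START.search(text)
        if start is None:
            return text, [(None, "")]
        return text[:start.start()].strip(), [(None, text[start.end():].strip())]

    # Rule (a): the headword is whatever precedes the first 【 】 pair.
    headword = text[:readings[0].start()].strip()
    senses = []
    for i, match in enumerate(readings):
        # Each translation runs up to the next 【 】 pair, which covers
        # name entries that list several readings in a row.
        end = readings[i + 1].start() if i + 1 < len(readings) else len(text)
        senses.append((match.group(1), text[match.end():end].strip()))
    return headword, senses
```

Applied to the name entry quoted above, parse_entry would return the headword 寿康 with three (reading, translation) pairs, one per 【 】 pair. Returning a list of pairs even for single-reading words means the flash card import would not need a separate code path if regular words turn out to have multiple readings too.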
- References:
- Re: [tlug] [OT] Regular Expressions to find Japanese Text
- From: Jim Breen