Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Tue, 08 Aug 2006 10:54:19 +1000 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
Dave M G <martin@example.com> wrote:
>> Jim said:
>> > (a) the occurrence of a 【 】 encapsulates a reading, and after that
>> > you are into the translation region.
>> > (b) once you reach a space followed by an ASCII character (usually alphabetic
>> > or a "("), you are into the translation region. If you didn't encounter
>> > a 【 】 pair along the way, the Japanese can be assumed to be kana-only.
>> There seem to be other issues, such as where it starts out by saying
>> "possible inflected verb", and "partial match". Is it the case that
>> sometimes there might be some kind of English text before a Japanese word?

The "Possible inflected verb or adjective" is, I think, the only time
non-Japanese will start it. The [Partial Match] is put at the end
when there is not an exact match between text and entry.

>> Or is the issue with my parser? In order to pull out definitions, I've
>> selected text that begins with <li> and ends with <br>, as this seems to
>> account for all words extracted from a WWWJDIC search.

That <br> is redundant. I may remove it at some stage. Better to extract
between <li> and the next <li> or the terminal </ul>.

>> > The exception to the above is Japanese names, where you get
>> > stuff like
>> > 寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA
>>
>> Is it only Japanese names that have multiple readings? I would have
>> thought there would also be regular words with multiple readings,
>> especially with verbs with multiple inflections.

Several entries with kanji headwords have multiple readings. They will
be in the 【 】 region with ";" between them. *In General* the entries
in EDICT with multiple headwords/readings are broken up into their
combinations, and the glossdic only has the most common one. Where the
reading affects the meaning, I use a special file which has hybrids like:

今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/ (2) (int) (こんにちは) hello/good day (daytime greeting)/

Note the [..] gets formatted as 【...】.
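Rules (a) and (b) quoted above, together with the ";" separator between readings, can be sketched in Python roughly as follows. This is only an illustration, not WWWJDIC's own code: the names `split_entry`, `READING` and `ASCII_START` are hypothetical, the sample line in the comment is constructed for the example, and the sketch assumes the input is the plain text of one ordinary (non-name) entry with any HTML already removed.

```python
import re

# Assumption: a reading group is whatever sits inside the first 【 】 pair.
READING = re.compile(r"【([^】]*)】")
# Rule (b): whitespace followed by a printable ASCII character.
ASCII_START = re.compile(r"\s+(?=[\x21-\x7e])")


def split_entry(line):
    """Split one entry line into (headword, readings, translation)."""
    line = line.strip()
    m = READING.search(line)
    if m:
        # Rule (a): before 【 is the headword, after 】 is the translation.
        headword = line[:m.start()].strip()
        readings = [r.strip() for r in m.group(1).split(";") if r.strip()]
        return headword, readings, line[m.end():].strip()
    # Rule (b): no 【 】 pair, so the Japanese part is taken as kana-only.
    m = ASCII_START.search(line)
    if m:
        return line[:m.start()], [], line[m.end():].strip()
    # Nothing that looks like a translation was found.
    return line, [], ""


# Constructed sample line:
# split_entry("今日は 【きょうは;こんにちは】 (1) (n-t) today")
# → ('今日は', ['きょうは', 'こんにちは'], '(1) (n-t) today')
```

The sketch does not strip a leading "Possible inflected verb or adjective" note or a trailing [Partial Match] marker, and it treats only the first 【 】 pair as readings, so name entries with several pairs need separate handling.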
Note also that there may be Japanese text in the translation, e.g.

バヤイ (n-adv,n) case; situation; (slangy version of 場合)

I do names differently because "glossdic" uses a special version of the
names file in which the readings/transliterations are strung out like
that. Also the merge attempts to put the common readings at the front.

>> Here's a question that has relevance to the flash card program that I am
>> importing data into:
>>
>> What word (or name) in the WWWJDIC server has the most readings and
>> definitions, and how many does it have?

Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses. Readings
are a bit harder to count, but I think there are entries with 5 or 6.

>> Botond said:
>> > You should also consider the fact that there are edict dictionary files
>> > in other languages also, not just Japanese-English.
>> That is a good consideration for a more generally adopted application.
>> However, even though I'd share the source with anyone who might find it
>> useful, what I'm working on now is for my own purposes and so I can
>> guarantee that I'm only going to be using the Japanese-English dictionaries.

Some people use that function of WWWJDIC with the French and/or German
dictionaries. The same parsing suggestions apply.

Cheers

Jim

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
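For the name entries described in the message, where each 【 】 reading is followed by its own transliteration, a parser might pair them up along these lines. This is a rough sketch under assumptions: `split_name_entry` and `NAME_PAIR` are hypothetical names, and the code assumes that everything between one 】 and the next 【 belongs to the preceding reading.

```python
import re

# Each 【reading】 is paired with the text that follows it,
# up to the next 【 or the end of the line.
NAME_PAIR = re.compile(r"【([^】]*)】\s*([^【]*)")


def split_name_entry(line):
    """Return (headword, [(reading, transliteration), ...]) for a name line."""
    first = line.find("【")
    if first == -1:
        # No reading at all: treat the whole line as the headword.
        return line.strip(), []
    headword = line[:first].strip()
    pairs = [(reading.strip(), text.strip())
             for reading, text in NAME_PAIR.findall(line[first:])]
    return headword, pairs
```

Applied to the 寿康 line quoted in the message, this gives the headword 寿康 and three pairs, the first being としやす with "Toshiyasu (g)". Any trailing text after the last transliteration (the "NA" in that line) stays attached to the last pair, since the message does not say how it should be treated.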
- Follow-Ups:
- Re: [tlug] [OT] Regular Expressions to find Japanese Text
- From: Dave M G