TLUG Mailing List

[tlug] [OT] Regular Expressions to find Japanese Text

Date: Sun, 06 Aug 2006 23:22:02 +0900

From: Dave M G <martin@example.com>

Subject: [tlug] [OT] Regular Expressions to find Japanese Text

User-agent: Thunderbird 1.5.0.5 (X11/20060728)
TLUG,
(The following message contains UTF-8 encoded Japanese text. Apologies if it comes out as ASCII gibberish.)
Once again I hope the members of this list will permit me to draw upon the collective experience of the TLUG in handling Japanese text. If I have strayed too far off topic, please accept my apologies.
In a private email conversation, Jim Breen has been helping me understand his WWWJDIC backdoor entry system so that I can create a form which will fetch lists of words from blocks of Japanese text. Using PHP, I've reached a stage where I am able to get a response from the WWWJDIC server and trim it down to just the words and definitions that I need.
The next step is to parse the words and definitions in such a way as to cleanly insert them into a database, which I can then use to create more personalized study lists.
As I'm sure many of this list's members know well already, the output from WWWJDIC looks like this:
気温 【きおん】 (n) atmospheric temperature; (P); EP
について (exp) concerning; along; under; per; KD
I want to divide the first line into three variables, $word, $reading, and $meaning. And I want to divide the second line into two variables, $word and $meaning.
If I can figure out how to extract the first variable, $word, then I can figure out the rest and go on to build more complicated text parsing.
But that first step seems to be a doozy.
The way I see it, I could do it two ways. One is to not rely on any difference between Japanese text and English text, but instead build regular expressions based on things like where spaces divide the first word from the rest of the line. For example, take out all the characters up to the first occurrence of a space, and assume that it's Japanese.
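To make that first approach concrete, here is a minimal sketch in Python (chosen only for illustration; the same pattern should carry over to other regex engines). It assumes every line has the layout seen in the two samples above: a word, an optional reading wrapped in 【 】, then the meaning. The names LINE_RE and split_entry are made up for this example, and real WWWJDIC output may contain layouts this does not handle.

```python
import re

# Assumed layout, inferred only from the two sample lines:
#   WORD <space> [【READING】 <space>] MEANING
LINE_RE = re.compile(r"^(\S+)\s+(?:【([^】]*)】\s+)?(.*)$")

def split_entry(line):
    """Return (word, reading, meaning); reading is None when absent.

    Returns None if the line does not fit the assumed layout at all.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)
```

Nothing here checks that the first field is actually Japanese; it relies purely on position, which is exactly the weakness of this approach.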
But it seems like it would be a lot more sophisticated if I could determine whether a word was Japanese by testing its Unicode value or some similar method. That way I would be less vulnerable to slight variations in the positioning of words in the source material.
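As a rough sketch of what testing the Unicode value could look like, the Python below checks each character's code point against three Unicode blocks: Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF), and CJK Unified Ideographs (U+4E00 to U+9FFF). Treating exactly these blocks as "Japanese" is an assumption made for this example; it leaves out things like half-width katakana and the iteration mark 々, and the function names are hypothetical.

```python
# Unicode blocks treated as "Japanese" in this sketch. This list is an
# assumption for illustration and is not exhaustive.
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji)
)

def is_japanese_char(ch):
    """True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAPANESE_RANGES)

def is_japanese_word(text):
    """True if text is non-empty and every character is in the ranges."""
    return bool(text) and all(is_japanese_char(ch) for ch in text)
```

With the sample output above, is_japanese_word("気温") and is_japanese_word("について") are true, while is_japanese_word("EP") is false.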
Looking at all the multibyte related functions in the PHP manual, it seems there are options for testing the type of encoding, but not for the type of language or character set.
http://jp2.php.net/manual/en/ref.mbstring.php
However, I could be wrong about this (and it would be nice if I was).
Searching the web, I came across this guy's script to test if characters were above the usual ASCII range in Unicode, and could therefore be assumed to be Japanese (since in my case the only 2 options are Japanese or English):
http://www.randomchaos.com/documents/?source=php_and_unicode
But this seems unwieldy, as I think, if I understand it correctly, I'd have to test each individual character. I could use it to test if there was any Japanese at all in a string, but I'm not confident I could use it to extract words.
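One way around the per-character loop, sketched here in Python, is to put the code point ranges into a regular expression character class and let the regex engine find whole runs of matching characters. The ranges (Hiragana, Katakana, CJK Unified Ideographs) are an assumption about what counts as Japanese, and JAPANESE_RUN and japanese_runs are names invented for this example. I believe PCRE-based engines offer something comparable through Unicode script properties when run in UTF-8 mode, but I have not verified that here.

```python
import re

# One or more consecutive characters from the assumed Japanese blocks:
# Hiragana, Katakana, and CJK Unified Ideographs.
JAPANESE_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")

def japanese_runs(text):
    """Return every maximal run of Japanese characters in text, in order."""
    return JAPANESE_RUN.findall(text)

if __name__ == "__main__":
    sample = "気温 【きおん】 (n) atmospheric temperature; (P); EP"
    print(japanese_runs(sample))
```

Because the brackets 【 】 sit outside these ranges, the word and the reading come back as two separate runs without any reliance on where the spaces fall.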
I think this may be more of a regular expressions issue or possibly an encoding handling issue, and not something limited purely to PHP.
In any case, if anyone has any tips for how I might create a logical way of looking at a string and selecting the Japanese words, that would be awesome.
Thank you for your time and help.

--
Dave M G
Follow-Ups:

Re: [tlug] [OT] Regular Expressions to find Japanese Text
From: Botond Botyanszki

Re: [tlug] [OT] Regular Expressions to find Japanese Text
From: Stephen J. Turnbull
