Mailing List Archive



[tlug] [OT] Regular Expressions to find Japanese Text



TLUG,

(The following message contains UTF-8 encoded Japanese text. Apologies if it comes out as ASCII gibberish.)

Once again I hope the members of this list will permit me to draw upon the collective experience of the TLUG in handling Japanese text. If I have strayed too far off topic, please accept my apologies.

In a private email conversation, Jim Breen has been helping me understand his WWWJDIC backdoor entry system, so that I can create a form which fetches lists of words from blocks of Japanese text. Using PHP, I've reached the stage where I can get a response from the WWWJDIC server and trim it down to just the words and definitions that I need.

The next step is to parse the words and definitions in such a way as to cleanly insert them into a database, which I can then use to create more personalized study lists.

As I'm sure many of this list's members know well already, the output from WWWJDIC looks like this:

気温 【きおん】 (n) atmospheric temperature; (P); EP
について (exp) concerning; along; under; per; KD

I want to divide the first line into three variables, $word, $reading, and $meaning. And I want to divide the second line into two variables, $word and $meaning.

If I can figure out how to extract the first variable, $word, then I can figure out the rest and go on to build more complicated text parsing.

But that first step seems to be a doozy.

The way I see it, I could do it two ways. One is to not rely on any difference between Japanese text and English text, but instead build regular expressions based on things like where spaces divide the first word from the rest of the line. For example, take out all the characters up to the first occurrence of a space, and assume that it's Japanese.
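A minimal sketch of this first, position-based approach, written in Python rather than PHP since the idea is not PHP-specific. It assumes every line starts with the headword, optionally followed by a reading in 【】 brackets, with the rest being the meaning. The pattern and the name `parse_entry` are illustrative, not part of WWWJDIC.

```python
import re

# Hypothetical pattern: headword, optional 【reading】, then the rest of the line.
ENTRY = re.compile(
    r"^(?P<word>\S+)\s+(?:【(?P<reading>[^】]*)】\s+)?(?P<meaning>.*)$"
)

def parse_entry(line):
    """Split one WWWJDIC-style line into (word, reading, meaning).

    reading is None when the line has no 【...】 part.
    Returns None if the line does not fit the assumed layout.
    """
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return m.group("word"), m.group("reading"), m.group("meaning")

lines = [
    "気温 【きおん】 (n) atmospheric temperature; (P); EP",
    "について (exp) concerning; along; under; per; KD",
]
for line in lines:
    print(parse_entry(line))
```

This relies entirely on where the spaces and brackets fall, so it would break if the layout of the source material changed.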

But it seems like it would be a lot more sophisticated if I could determine whether a word was Japanese by testing its Unicode value or some similar method. That way I would be less vulnerable to slight variations in the positioning of words in the source material.

Looking at all the multibyte related functions in the PHP manual, it seems there are options for testing the type of encoding, but not for the type of language or character set.
http://jp2.php.net/manual/en/ref.mbstring.php
However, I could be wrong about this (and it would be nice if I were).

Searching the web, I came across this guy's script to test whether characters are above the usual ASCII range in Unicode, and can therefore be assumed to be Japanese (since in my case the only two options are Japanese or English):
http://www.randomchaos.com/documents/?source=php_and_unicode

But this seems unwieldy, as I think, if I understand it correctly, I'd have to test each individual character. I could use it to test if there was any Japanese at all in a string, but I'm not confident I could use it to extract words.
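One way around the per-character test is to let a regular expression match whole runs of characters within given Unicode ranges. The sketch below is in Python for illustration. The ranges chosen (hiragana, katakana, the main CJK Unified Ideographs block, and the 々 iteration mark) are an assumption about what counts as "Japanese" here, and rarer kanji outside those blocks would be missed. The names `JAPANESE_RUN`, `japanese_words`, and `contains_japanese` are hypothetical.

```python
import re

# Assumed ranges:
#   U+3040-U+309F  hiragana
#   U+30A0-U+30FF  katakana
#   U+4E00-U+9FFF  CJK Unified Ideographs (main block)
#   U+3005         々 (iteration mark)
JAPANESE_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3005]+")

def japanese_words(text):
    """Return every unbroken run of Japanese characters in text, in order."""
    return JAPANESE_RUN.findall(text)

def contains_japanese(text):
    """True if text contains at least one character in the assumed ranges."""
    return JAPANESE_RUN.search(text) is not None

print(japanese_words("気温 【きおん】 (n) atmospheric temperature; (P); EP"))
print(contains_japanese("atmospheric temperature"))
```

Because the brackets 【 and 】 fall outside these ranges, the headword and the reading come back as separate runs without any reliance on spacing.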

I think this may be more of a regular expressions issue, or possibly an encoding handling issue, and not relegated purely to PHP.

In any case, if anyone has any tips for how I might create a logical way of looking at a string and selecting the Japanese words, that would be awesome.

Thank you for your time and help.

--
Dave M G

