
Mailing List Archive
[tlug] [OT] Regular Expressions to find Japanese Text
- Date: Sun, 06 Aug 2006 23:22:02 +0900
- From: Dave M G <martin@example.com>
- Subject: [tlug] [OT] Regular Expressions to find Japanese Text
- User-agent: Thunderbird 1.5.0.5 (X11/20060728)
TLUG,
(The following message contains UTF-8 encoded Japanese text. Apologies
if it comes out as ASCII gibberish.)
Once again I hope the members of this list will permit me to draw upon
the collective experience of the TLUG in handling Japanese text. If I
have strayed too far off topic, please accept my apologies.
In a private email conversation, Jim Breen has been helping me
understand his WWWJDIC backdoor entry system, so that I can create a
form which will fetch lists of words from blocks of Japanese text. Using
PHP, I've reached the stage where I can get a response from the WWWJDIC
server and trim it down to just the words and definitions that I need.
The next step is to parse the words and definitions in such a way as to
cleanly insert them into a database, which I can then use to create more
personalized study lists.
As I'm sure many of this list's members know well already, the output
from WWWJDIC looks like this:
気温 【きおん】 (n) atmospheric temperature; (P); EP
について (exp) concerning; along; under; per; KD
I want to divide the first line into three variables, $word, $reading,
and $meaning. And I want to divide the second line into two variables,
$word and $meaning.
If I can figure out how to extract the first variable, $word, then I can
figure out the rest and go on to build more complicated text parsing.
But that first step seems to be a doozy.
The way I see it, I could do it two ways. One is to not rely on any
difference between Japanese text and English text, but instead build
regular expressions based on things like where spaces divide the first
word from the rest of the line. For example, take out all the characters
up to the first occurrence of a space, and assume that it's Japanese.
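
As a rough illustration of this first, position-based approach, here is
a minimal sketch. It is written in Python rather than PHP purely for
illustration, and the pattern and the names ENTRY and parse_entry are
hypothetical. It assumes every line looks like one of the two sample
lines above: a headword, an optional reading inside 【】, then the rest.

```python
import re

# Hypothetical pattern, based only on the two sample lines in this message:
#   group 1: everything up to the first whitespace (assumed to be the headword)
#   group 2: optional reading enclosed in 【 】
#   group 3: the remainder of the line (part of speech, glosses, trailing codes)
ENTRY = re.compile(r"^(\S+)\s+(?:【([^】]*)】\s+)?(.*)$")

def parse_entry(line):
    """Split one output line into word, reading (or None) and meaning."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None  # line does not have the assumed shape
    word, reading, meaning = m.groups()
    return {"word": word, "reading": reading, "meaning": meaning}
```

For the second sample line, which has no reading, the "reading" field
comes back as None, so the same function covers both cases.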
But it seems like it would be a lot more sophisticated if I could
determine whether a word is Japanese by testing its Unicode value or
some similar method. That way I would be less vulnerable to slight
variations in the positioning of words in the source material.
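
One way this second idea might look is a regular expression over Unicode
code point ranges. The sketch below is Python, for illustration only,
and the names JAPANESE_WORD and leading_japanese are hypothetical. The
ranges are an assumption: hiragana and katakana (U+3040 to U+30FF) and
the main CJK ideograph block (U+4E00 to U+9FFF). They are not
exhaustive; the iteration mark 々 (U+3005), for example, falls outside
them.

```python
import re

# Assumed character ranges (not exhaustive):
#   U+3040-U+30FF  hiragana and katakana
#   U+4E00-U+9FFF  CJK unified ideographs (most kanji)
JAPANESE_WORD = re.compile(r"^[\u3040-\u30ff\u4e00-\u9fff]+")

def leading_japanese(line):
    """Return the run of Japanese characters at the start of line, or ''."""
    m = JAPANESE_WORD.match(line)
    return m.group(0) if m else ""
```

Because the match stops at the first character outside the ranges, the
headword is picked out by what it contains rather than by where the
first space happens to fall.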
Looking at all the multibyte related functions in the PHP manual, it
seems there are options for testing the type of encoding, but not for
the type of language or character set.
http://jp2.php.net/manual/en/ref.mbstring.php
However, I could be wrong about this (and it would be nice if I were).
Searching the web, I came across this guy's script to test if characters
were above the usual ASCII range in Unicode, and could therefore be
assumed to be Japanese (since in my case the only 2 options are Japanese
or English):
http://www.randomchaos.com/documents/?source=php_and_unicode
But this seems unwieldy, as I think, if I understand it correctly, I'd
have to test each individual character. I could use it to test if there
was any Japanese at all in a string, but I'm not confident I could use
it to extract words.
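
For what it is worth, a character class in a regular expression applies
the same kind of range test to whole runs of characters at once, so no
explicit per-character loop is needed. A minimal Python sketch follows,
again only for illustration; the names JAPANESE_RUN, japanese_runs and
contains_japanese are hypothetical, and the ranges (kana plus the main
kanji block) are an assumption that may need widening.

```python
import re

# Assumed ranges: hiragana/katakana (U+3040-U+30FF), CJK ideographs (U+4E00-U+9FFF).
JAPANESE_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")

def japanese_runs(text):
    """Return every maximal run of Japanese characters found in text."""
    return JAPANESE_RUN.findall(text)

def contains_japanese(text):
    """True if text holds at least one character in the assumed ranges."""
    return JAPANESE_RUN.search(text) is not None
```

On the first sample line this yields two runs, the headword and the
reading, because the brackets 【 and 】 lie outside the assumed ranges.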
I think this may be more of a regular expressions issue, or possibly an
encoding handling issue, and not one specific to PHP.
In any case, if anyone has any tips for how I might create a logical way
of looking at a string and selecting the Japanese words, that would be
awesome.
Thank you for your time and help.
--
Dave M G