Mailing List Archive



Re: [tlug] [OT] Regular Expressions to find Japanese Text



>>>>> Dave M G writes:

    Dave> The next step is to parse the words and definitions in such
    Dave> a way as to cleanly insert them into a database, which I can
    Dave> then use to create more personalized study lists.

The preferred way to do this (as Jim may have already remarked) is to
suck it out of Jim's XML sources, which gives you the data in
preparsed form (that's just the way XML works), ready for stuffing
into the database du jour.  Look for references to "expat",
"libxml2", and/or "libneon" in the PHP docs.

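To give a rough idea of what "preparsed" buys you, here is a minimal
sketch in Python rather than PHP (purely for illustration).  The
element names (dictionary, entry, word, reading, gloss), the sample
data, and the table layout are all made up; the real XML will
certainly use a different schema, so treat every name as an
assumption.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; the real element names depend on the actual XML source.
SAMPLE_XML = """<dictionary>
  <entry><word>猫</word><reading>ねこ</reading><gloss>cat</gloss></entry>
  <entry><word>犬</word><reading>いぬ</reading><gloss>dog</gloss></entry>
</dictionary>"""


def load_entries(xml_text, conn):
    """Parse the XML and insert one row per <entry> element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (word TEXT, reading TEXT, gloss TEXT)"
    )
    root = ET.fromstring(xml_text)
    # The parser hands back fields already separated; no regexps needed.
    rows = [
        (e.findtext("word"), e.findtext("reading"), e.findtext("gloss"))
        for e in root.iter("entry")
    ]
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_entries(SAMPLE_XML, conn)
```

The point is only that the word and its definition arrive as separate
fields, so nothing depends on where they sit in a line of text.
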
    Dave> But it seems like it would be a lot more sophisticated if I
    Dave> could determine if a word was Japanese by testing its
    Dave> Unicode value or some similar method. That way I would be
    Dave> less vulnerable to slight variabilities in positioning of
    Dave> words in the source material.

Look for "ICU" (originally "IBM Classes for Unicode", now
"International Components for Unicode").  If there is a PHP module
that wraps ICU, it should provide functions and/or regexps for
detecting "Unicode blocks".  Another alternative is to try to convert
the character to JIS X 0208 or JIS X 0212.  If both conversions fail,
you either don't have Japanese or you have an exceedingly rare word.
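
Both tests can be sketched without ICU.  The following is Python, not
PHP, and is only an illustration under a few assumptions: the three
block ranges are the Hiragana, Katakana, and CJK Unified Ideographs
blocks (the last is shared with Chinese), and the conversion test uses
Python's "euc_jp" codec, which as far as I know covers both JIS X 0208
and JIS X 0212.  The function names are invented for this example.

```python
# Code point ranges for three Unicode blocks commonly used in Japanese text.
JAPANESE_BLOCKS = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (shared with Chinese)
)


def in_japanese_block(ch):
    """True if the single character ch falls in one of the blocks above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAPANESE_BLOCKS)


def fits_jis(text):
    """True if every character in text converts to EUC-JP."""
    try:
        text.encode("euc_jp")
        return True
    except UnicodeEncodeError:
        return False
```

Note that the conversion test is weak on its own: ASCII converts too,
and JIS X 0208 also contains non-Japanese characters, so a success
only tells you the text is representable, not that it is Japanese.
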

Both ICU and the XML-related functionality are likely to be packaged
as add-on modules for PHP, rather than being part of the PHP
distribution.

    Dave> But this seems unwieldy, as I think, if I understand it
    Dave> correctly, I'd have to test each individual character. I
    Dave> could use it to test if there was any Japanese at all in a
    Dave> string, but I'm not confident I could use it to extract
    Dave> words.

Extracting Japanese words is a hard problem, unless you're lucky
enough to have them already broken out for you.
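
What is easy is pulling out contiguous runs of Japanese characters,
which is not the same thing as words.  A minimal sketch in Python (not
PHP), assuming the same three Unicode blocks (Hiragana, Katakana, CJK
Unified Ideographs) are a good enough definition of "Japanese"; the
sample strings and function name are invented for illustration.

```python
import re

# One or more consecutive characters from the Hiragana, Katakana,
# or CJK Unified Ideographs blocks.
JAPANESE_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")


def japanese_runs(text):
    """Return every maximal run of Japanese characters found in text."""
    return JAPANESE_RUN.findall(text)


# Works when the source already separates words with non-Japanese text.
runs = japanese_runs("食べる (v1) to eat; 猫 (n) cat")

# A whole sentence comes back as a single run; no word boundaries are found.
sentence = japanese_runs("日本語を勉強する")
```

In the first case each run happens to be one word because the input
already breaks them out; in the second the whole sentence is one run,
and splitting it into words needs a real morphological analyzer.
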

    Dave> In any case, if anyone has any tips for how I might create a
    Dave> logical way of looking at a string and selecting the
    Dave> Japanese words, that would be awesome.

I would start by trying to get my hands on the XML; that's the
RightThang[tm].  If that doesn't work or seems like more annoyance
than you want to go to, trust Jim, and report any failures of the
"whitespace delimits fields in the returned string" rule to him as
bugs. :-)

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

