Re: [tlug] [OT] Regular Expressions to find Japanese Text

Date: Mon, 07 Aug 2006 01:19:29 +0900
From: "Stephen J. Turnbull" <stephen@example.com>
Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
References: <44D5FB0A.6090605@example.com>
Organization: The XEmacs Project
User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b27 (linux)

>>>>> Dave M G writes:

    Dave> The next step is to parse the words and definitions in such
    Dave> a way as to cleanly insert them into a database, which I can
    Dave> then use to create more personalized study lists.

The preferred way to do this (as Jim may have already remarked) is to
suck it out of Jim's XML sources, which means you'll have the data in
preparsed form (that's just the way XML works), ready for stuffing into
the database du jour.  Look for references to "expat", "libxml2", and/or
"libneon" in the PHP docs.

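A minimal sketch of the "preparsed form" idea, in Python rather than PHP
purely for illustration.  The element names (entry, k_ele/keb, r_ele/reb,
sense/gloss) are an assumption modelled on the JMdict layout, and the
sample entry, the table name and the column names are all hypothetical;
check them against the real XML before relying on any of this.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names are assumed, not taken from the post.
SAMPLE = """<JMdict>
  <entry>
    <k_ele><keb>日本語</keb></k_ele>
    <r_ele><reb>にほんご</reb></r_ele>
    <sense><gloss>Japanese (language)</gloss><gloss>Nihongo</gloss></sense>
  </entry>
</JMdict>"""


def parse_entries(xml_text):
    """Return one (word, reading, glosses) tuple per <entry> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        word = entry.findtext("k_ele/keb", default="")
        reading = entry.findtext("r_ele/reb", default="")
        glosses = [g.text or "" for g in entry.iterfind("sense/gloss")]
        rows.append((word, reading, "; ".join(glosses)))
    return rows


def load(rows):
    """Insert the parsed rows into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT, reading TEXT, gloss TEXT)")
    conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)
    return conn


rows = parse_entries(SAMPLE)
conn = load(rows)
```

The point is that the parser hands back the fields already separated, so
nothing depends on where a word happens to sit in a line of text.
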
    Dave> But it seems like it would be a lot more sophisticated if I
    Dave> could determine if a word was Japanese by testing it's
    Dave> Unicode value or some similar method. That way I would be
    Dave> less vulnerable to slight variabilities in positioning of
    Dave> words in the source material.

Look for "ICU" (originally "IBM Classes for Unicode", now
"International Components for Unicode").  If there is a PHP module that wraps ICU,
it should provide functions and/or regexps for detecting "Unicode
blocks".  Another alternative is to try to convert the character to
JIS X 0208 or JIS X 0212.  If those fail, you either don't have
Japanese or you have an exceedingly rare word.
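
Both tests can be sketched in a few lines.  This is Python, not PHP and
not ICU, and it is only an illustration: the block test hard-codes three
Unicode block ranges (Hiragana, Katakana, CJK Unified Ideographs), and the
conversion test assumes that Python's "euc_jp" codec is an acceptable
stand-in for "convert to JIS X 0208 or JIS X 0212".  The function names
are hypothetical.

```python
import re

# Unicode block ranges:
#   Hiragana                U+3040-U+309F
#   Katakana                U+30A0-U+30FF
#   CJK Unified Ideographs  U+4E00-U+9FFF
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def in_japanese_block(ch):
    """True if the single character ch falls in one of the three blocks above."""
    return JAPANESE_CHAR.fullmatch(ch) is not None


def converts_to_jis(ch):
    """True if ch is non-ASCII and can be encoded as EUC-JP.

    Assumption: success with the euc_jp codec is treated as "the character
    exists in the JIS character sets".  ASCII is excluded up front because
    EUC-JP encodes it unchanged.
    """
    if ord(ch) < 0x80:
        return False
    try:
        ch.encode("euc_jp")
    except UnicodeEncodeError:
        return False
    return True
```

Note that a successful conversion is weaker evidence than a block match:
JIS X 0208 also contains Greek and Cyrillic letters and assorted symbols,
so the conversion test is best read the way it is stated above, as "if it
fails, it is probably not Japanese".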

Both ICU and the XML-related functionality are likely to be packaged
as add-on modules for PHP, rather than being part of the PHP
distribution.

    Dave> But this seems unwieldy, as I think, if I understand it
    Dave> correctly, I'd have to test each individual character. I
    Dave> could use it to test if there was any Japanese at all in a
    Dave> string, but I'm not confident I could use it to extract
    Dave> words.

Extracting Japanese words is a hard problem, unless you're lucky
enough to have them already broken out for you.
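
To make the distinction concrete, here is a sketch (Python, for
illustration only) that pulls maximal runs of Japanese-script characters
out of a string with one regexp, so there is no need to loop over
individual characters by hand.  The ranges are the Hiragana, Katakana and
CJK Unified Ideographs blocks, and the sample strings are made up.  It
finds runs, not words: where the source already separates the fields, each
run is a word, but in running text a whole phrase comes back as one run.

```python
import re

# One or more consecutive characters from the Hiragana, Katakana,
# or CJK Unified Ideographs blocks.
JAPANESE_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")


def japanese_runs(text):
    """Return every maximal run of Japanese-script characters in text."""
    return JAPANESE_RUN.findall(text)


# Fields already broken out by the source: each run is a usable word.
fields = japanese_runs("日本語 [にほんご] (n) Japanese language")

# Running text: the whole phrase is a single run, not a list of words.
phrase = japanese_runs("I wrote 東京に行く in my notes")
```

Splitting the second result into words would need a dictionary or a
morphological analyzer, which is the hard part.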

    Dave> In any case, if anyone has any tips for how I might create a
    Dave> logical way of looking at a string and selecting the
    Dave> Japanese words, that would be awesome.

I would start by trying to get my hands on the XML; that's the
RightThang[tm].  If that doesn't work or seems like more trouble
than you want to go to, trust Jim, and report any failures of the
"whitespace delimits fields in the returned string" rule to him as bugs.
:-)

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
