
Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Mon, 07 Aug 2006 01:19:29 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
- References: <44D5FB0A.6090605@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b27 (linux)
>>>>> Dave M G writes:
Dave> The next step is to parse the words and definitions in such
Dave> a way as to cleanly insert them into a database, which I can
Dave> then use to create more personalized study lists.
The preferred way to do this (as Jim may have already remarked) is to
suck it out of Jim's XML sources, which means you'll have the data in
preparsed form (that's just the way XML works) ready for stuffing into the
database du jour. Look for references to "expat", "libxml2", and/or
"libneon" in the PHP docs.
Dave> But it seems like it would be a lot more sophisticated if I
Dave> could determine if a word was Japanese by testing its
Dave> Unicode value or some similar method. That way I would be
Dave> less vulnerable to slight variabilities in positioning of
Dave> words in the source material.
Look for "ICU" (originally "IBM Classes for Unicode", now renamed
"International Components for Unicode"). If there is a PHP module that
wraps ICU,
it should provide functions and/or regexps for detecting "Unicode
blocks". Another alternative is to try to convert the character to
JIS X 0208 or JIS X 0212. If those fail, you either don't have
Japanese or you have an exceedingly rare word.
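
Both tests can be sketched without ICU, again in Python for illustration. The block ranges below are a hand-picked subset (Hiragana, Katakana, CJK Unified Ideographs), not a complete inventory, and the helper names are made up for this example. The conversion test assumes Python's `euc_jp` codec as a stand-in for "convert to JIS X 0208 or JIS X 0212"; to my knowledge that codec covers both sets, but treat that as an assumption.

```python
# Code point ranges of a few Unicode blocks used for Japanese text.
# Illustrative subset only; the ideograph block is shared with Chinese.
JAPANESE_BLOCKS = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_japanese_block(ch):
    """True if the single character ch falls in one of the blocks above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAPANESE_BLOCKS)


def converts_to_jis(ch):
    """True if a non-ASCII character can be encoded as EUC-JP."""
    if ord(ch) < 0x80:
        return False  # ASCII always encodes, so it tells us nothing
    try:
        ch.encode("euc_jp")
    except UnicodeEncodeError:
        return False
    return True
```

Note that the conversion test is looser than the block test: the JIS sets also contain Greek, Cyrillic, and assorted symbols, so a successful conversion only says "could appear in Japanese text", not "is kana or kanji".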
Both ICU and the XML-related functionality are likely to be packaged
as add-on modules for PHP, rather than being part of the PHP
distribution.
Dave> But this seems unwieldy, as I think, if I understand it
Dave> correctly, I'd have to test each individual character. I
Dave> could use it to test if there was any Japanese at all in a
Dave> string, but I'm not confident I could use it to extract
Dave> words.
Extracting Japanese words is a hard problem, unless you're lucky
enough to have them already broken out for you.
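
On the "test each individual character" worry: a regexp character class does the per-character test for you and hands back contiguous runs. A minimal sketch in Python follows, with the same illustrative ranges as a block test would use (the name `japanese_runs` is made up here). It finds runs of Japanese script, not words; splitting a run into words still needs a dictionary or a morphological analyzer.

```python
import re

# One or more consecutive hiragana, katakana, or CJK ideograph characters.
# The ranges are an illustrative subset of the relevant Unicode blocks.
JAPANESE_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")


def japanese_runs(text):
    """Return every maximal run of Japanese-script characters in text."""
    return JAPANESE_RUN.findall(text)


if __name__ == "__main__":
    print(japanese_runs("The word 辞書 (じしょ) means dictionary."))
```

This works when the source already separates words with ASCII punctuation or spaces, which is the "lucky" case mentioned above.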
Dave> In any case, if anyone has any tips for how I might create a
Dave> logical way of looking at a string and selecting the
Dave> Japanese words, that would be awesome.
I would start by trying to get my hands on the XML; that's the
RightThang[tm]. If that doesn't work or seems like more annoyance
than you want to go to, trust Jim, and report any failures of the
"whitespace delimits fields in the returned string" rule to him as bugs.
:-)
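
If you go the whitespace route, something like the following sketch is what I have in mind, shown in Python. The line layout (headword, optional reading in square brackets, then slash-separated glosses) is an assumption modeled on EDICT-style text, and `split_fields` is a made-up name; adjust it to whatever the returned string really looks like.

```python
def split_fields(line):
    """Split an EDICT-style line into (headword, reading, glosses).

    Raises ValueError when the whitespace-delimited layout is not what
    we assumed, so format failures are noticed instead of hidden.
    """
    parts = line.split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected at least two fields: %r" % line)
    headword, rest = parts

    reading = None
    if rest.startswith("["):
        pieces = rest.split(None, 1)
        if len(pieces) != 2:
            raise ValueError("reading without glosses: %r" % line)
        reading = pieces[0].strip("[]")
        rest = pieces[1]

    glosses = [g for g in rest.strip().strip("/").split("/") if g]
    return headword, reading, glosses
```

For example, `split_fields("辞書 [じしょ] /dictionary/lexicon/")` yields the headword, the reading, and a two-item gloss list; any line that raises `ValueError` is a candidate for a bug report.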
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.