Mailing List Archive



Re: [tlug] [OT] Regular Expressions to find Japanese Text



On Sun, 06 Aug 2006 23:22:02 +0900
Dave M G <martin@example.com> wrote:

> If I can figure out how to extract the first variable, $word, then I can 
> figure out the rest and go on to build more complicated text parsing.
Reading all the characters up to the first space would be sufficient.
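
As a rough illustration of the "up to the first space" approach, here is a
minimal sketch in Python (not PHP, purely for brevity). The sample line is
made up for the example and only assumes an edict-like layout of headword,
reading, then glosses; the helper name first_word is hypothetical.

```python
def first_word(line: str) -> str:
    """Return everything before the first space (the whole line if none)."""
    return line.split(" ", 1)[0]


# Made-up sample entry, assuming a layout of: headword [reading] /glosses/
sample = "日本語 [にほんご] /(n) Japanese language/"
word = first_word(sample)
print(word)
```

The same idea in PHP would amount to finding the first space and taking the
substring before it, with no need to look at character values at all.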
 
> But it seems like it would be a lot more sophisticated if I could 
> determine if a word was Japanese by testing its Unicode value or some 
> similar method. That way I would be less vulnerable to slight 
> variabilities in positioning of words in the source material.
The word positions are not very likely to vary.
You should also consider that there are edict dictionary files for other
languages too, not just Japanese-English.

> Looking at all the multibyte related functions in the PHP manual, it 
> seems there are options for testing the type of encoding, but not for 
> the type of language or character set.
If you want to extract Japanese, you can convert the UTF-8 to UTF-32 (with
the function on the page you posted) and then test whether each character
falls into the code ranges of the Unicode characters used in Japanese. I
have some C code if you want it (it can be converted to PHP fairly easily).
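
To make the range test concrete, here is a small sketch in Python rather
than C or PHP. It is not the C code mentioned above. Python strings already
expose code points through ord(), so that stands in for the explicit UTF-32
conversion step. The range list is an assumption for illustration and is
incomplete: it covers the Hiragana, Katakana and CJK Unified Ideographs
blocks only, and the ideograph block is shared with Chinese, so a match
there does not prove a word is Japanese. The function names are
hypothetical.

```python
# Unicode blocks commonly used for Japanese text (illustrative, incomplete).
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji; shared with Chinese)
)


def is_japanese_char(ch: str) -> bool:
    """True if the character's code point falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAPANESE_RANGES)


def is_japanese_word(word: str) -> bool:
    """True if the word is non-empty and every character passes the test."""
    return bool(word) and all(is_japanese_char(ch) for ch in word)


# Made-up sample line, decoded from UTF-8 bytes as a file reader would do.
raw = "日本語 [にほんご] /(n) Japanese language/".encode("utf-8")
text = raw.decode("utf-8")
word = text.split(" ", 1)[0]
print(word, is_japanese_word(word))
```

A real filter would probably want more ranges (punctuation, halfwidth
forms, rarer kanji), depending on what the source material contains.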
