Mailing List Archive

Re: tlug: Udi Manber: Re: Glimpse support for Asian characters



--------------------------------------------------------
tlug note from "Andrew S. Howell" <andy@example.com>
--------------------------------------------------------
>>>>> "Stephen" == Stephen J Turnbull <turnbull@example.com> writes:


    Stephen> >>>>> "Francis" == Francis Brian O'Carroll
    Stephen> <ocarroll@example.com> writes:

    Francis> Did they say they would accept patches for Japanese if we
    Francis> developed them? The code is not copyleft, so we couldn't
    Francis> redistribute, I think.

    Stephen> Not a chance.  The copyright basically amounts to doing a
    Stephen> binary distribution in source as I recall.

>>>>> "Andy" == Andrew S Howell <andy@example.com> writes:


[snip]
    Andy> Doesn't it get a bit complicated, though, if you have text
    Andy> in various encodings? I would think the index would have to
    Andy> be in some canonical format, say EUC. You would have to

    Stephen> Uh-uh.  For this application you want to do it in
    Stephen> Unicode, presumably "raw" UCS-2 for size and speed.
    Stephen> Don't create trouble for people who are bilingual in
    Stephen> Asian languages.

Yeah, Unicode would make more sense.

    Andy> determine the encoding on the fly, both when creating the
    Andy> index and when searching through the text again. Actually, to
    Andy> get the results of the search to display correctly, wouldn't
    Andy> you have to convert it to whatever your terminal was set
    Andy> for?
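
A rough sketch of that detect-then-convert step, in Python. This is only
an illustration and has nothing to do with glimpse's code: the function
names to_unicode and for_terminal are made up here, and the guessing
heuristic (look for an ESC byte, then try EUC-JP, then Shift_JIS) is an
assumption that can misfire on short or ambiguous input.

```python
def to_unicode(raw: bytes) -> tuple[str, str]:
    """Return (text, guessed_encoding) for a run of Japanese bytes."""
    # Hypothetical heuristic: ISO-2022-JP is 7-bit and switches character
    # sets with ESC sequences, so only try it when an ESC byte is present.
    candidates = ["euc_jp", "shift_jis"]
    if b"\x1b" in raw:
        candidates.insert(0, "iso2022_jp")
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits these bytes")


def for_terminal(text: str, terminal_encoding: str) -> bytes:
    """Re-encode a search hit for display; unmappable characters become '?'."""
    return text.encode(terminal_encoding, errors="replace")
```

The index would then store only the decoded text, and each hit would be
passed through for_terminal with whatever encoding the terminal expects.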

    Stephen> And _this_ is the easy part.  Remember, Japanese is an
    Stephen> extremely highly inflected language and does not use
    Stephen> spaces to separate words.  Unless your indexing program
    Stephen> understands Japanese syntax, it is not clear how you
    Stephen> would go about doing the indexing.

    Stephen> Too bad glimpse and wnn aren't written in Java.  Then you
    Stephen> could do "import glimpse.*; import wnn.*;" and only have
    Stephen> about 20,000 lines of code left to modify or write.
    Stephen> (^^;)

    Stephen> Glimpse is fast and the indexes are "small" because
    Stephen> glimpse knows a lot about European languages.  Writing a
    Stephen> glimpse for Japanese looks to me like a major research
    Stephen> project.  You may as well start from scratch, too,
    Stephen> because glimpseindex's code seems to be heavily dependent
    Stephen> on whitespace as word boundaries and stuff like that that
    Stephen> just don't apply to Japanese at all.

Ok, I'll just tell my boss I'm taking a couple years off to write a
good indexing program.... I wonder if I'll have a paycheck when I come
back :)

    Stephen> Probably the best bet would be to index only kanji,
    Stephen> katakana, and romaji.  That ought to keep your indexes to
    Stephen> a moderate size, maybe.  Of course you're going to miss a
    Stephen> lot of hiragana words.
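
One way the "kanji, katakana, and romaji only" idea might look, as a
minimal Python sketch working on text already decoded to Unicode. The
name build_index and the exact character ranges are assumptions made for
illustration (the kanji range covers only the main CJK Unified
Ideographs block), and each run of same-script characters is treated as
one index term, which is cruder than real word segmentation.

```python
import re
from collections import defaultdict

# Runs of kanji, runs of katakana (including the long-vowel mark), or
# runs of ASCII letters. Hiragana and punctuation never match, so they
# are skipped and act as separators.
TOKEN = re.compile(r"[\u4e00-\u9fff]+|[\u30a1-\u30fa\u30fc]+|[A-Za-z]+")


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each indexable run to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for match in TOKEN.finditer(text):
            # lower() folds romaji case and leaves kanji/katakana unchanged.
            index[match.group().lower()].add(name)
    return dict(index)
```

As expected, anything written purely in hiragana never reaches the
index, and kanji compounds are only found if the query matches a whole
run.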

    Stephen> If you decide to do it right, though, don't forget to
    Stephen> include code to handle "henkan typos" like 新学校 for
    Stephen> 神学校.  (^^)

Sounds like the "do it right" approach would require an awful lot of
knowledge of Japanese. On second thought, I think I'll put this on hold
for a while...

Andy