Re: tlug: Udi Manber: Re: Glimpse support for Asian characters



--------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
--------------------------------------------------------
>>>>> "Francis" == Francis Brian O'Carroll <ocarroll@example.com> writes:

    Francis> Did they say they would accept patches for Japanese if we
    Francis> developed them? The code is not copyleft, so we couldn't
    Francis> redistribute, I think.

Not a chance.  The copyright terms basically amount to a binary
distribution in source form, as I recall.

>>>>> "Andy" == Andrew S Howell <andy@example.com> writes:

    Andy> I didn't ask if they would accept patches. Their reply was
    Andy> the one liner I mentioned.

And they were quite right, too.  See below.

    Francis> Glimpse is basically an index plus grep; if your grep
    Francis> supports Japanese you could hack together a prototype
    Francis> jglimpse with a little C programming.

I don't think so.  The grep part is easy, as you point out, although
you'd have to do without "n typos" and "full-word" switches.  The issue 
is the indexing.

    Andy> Doesn't it get a bit complicated, though, if you have text
    Andy> with various encodings? I would think the index would have
    Andy> to be in some canonical format, say EUC. You would have to

Uh-uh.  For this application you want to do it in Unicode, presumably
"raw" UCS-2 for size and speed.  Don't create trouble for people who
are bilingual in Asian languages.

    Andy> determine the encoding on the fly, both when creating the
    Andy> index and when searching through the text again. Actually, to
    Andy> get the results of the search to display correctly, wouldn't
    Andy> you have to convert them to whatever your terminal was set
    Andy> for?
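
For illustration only, here is a minimal Python sketch of that
encoding step: guess the encoding of each file, convert to Unicode
for the index, and re-encode for the terminal on output.  The
candidate list, its order, and the function names are my own
assumptions, not anything from glimpse, and trial decoding like this
is a heuristic that can guess wrong on short inputs.

```python
# Candidate encodings, tried in order.  ISO-2022-JP goes first because it
# is 7-bit: a plain ASCII decode would "succeed" on it and return garbage.
CANDIDATES = ("iso2022_jp", "utf-8", "euc_jp", "shift_jis")

def to_unicode(raw):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in CANDIDATES:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going, marking undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "unknown"

def for_terminal(text, terminal_encoding="euc_jp"):
    """Re-encode a search result for display; the default is an assumption."""
    return text.encode(terminal_encoding, errors="replace")

if __name__ == "__main__":
    sample = "日本語".encode("shift_jis")
    text, enc = to_unicode(sample)
    print(enc, text)
```

The index itself would then hold only Unicode text, whatever the
source files were written in.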

And _this_ is the easy part.  Remember, Japanese is a very highly
inflected language and does not use spaces to separate words.
Unless your indexing program understands Japanese syntax, it is not
clear how you would go about doing the indexing.

Too bad glimpse and wnn aren't written in Java.  Then you could do
"import glimpse.*; import wnn.*;" and only have about 20,000 lines of 
code left to modify or write.  (^^;)

Glimpse is fast and the indexes are "small" because glimpse knows a
lot about European languages.  Writing a glimpse for Japanese looks to
me like a major research project.  You may as well start from scratch,
too, because glimpseindex's code seems to depend heavily on
whitespace as word boundaries and similar assumptions that just don't
apply to Japanese at all.
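
To see the whitespace assumption fail, here is a toy comparison in
Python.  The sample sentences and the function name are made up for
illustration; this is not how glimpseindex is actually written.

```python
def whitespace_words(text):
    """Split on whitespace, the way a naive European-language indexer would."""
    return text.split()

english = "glimpse builds a small index"
japanese = "私は新しい学校に行きます"   # "I go to a new school"

# The English sentence yields one token per word.
print(len(whitespace_words(english)), whitespace_words(english))

# The Japanese sentence has no spaces, so it comes back as a single
# "word" that is useless as an index key.
print(len(whitespace_words(japanese)), whitespace_words(japanese))
```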

Probably the best bet would be to index only kanji, katakana, and
romaji.  That ought to keep your indexes to a moderate size, maybe.
Of course you're going to miss a lot of hiragana words.
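
A rough sketch of that fallback in Python, assuming the text has
already been converted to Unicode.  The character ranges are a
simplification (basic kanji block plus the iteration mark, katakana
letters plus the long-vowel mark, ASCII letters and digits), and the
names are hypothetical.

```python
import re
from collections import defaultdict

# Runs of kanji, katakana, or romaji/digits.  Hiragana is deliberately
# not matched, so it acts as a separator and is never indexed.
TOKEN = re.compile(
    r"[\u4e00-\u9fff\u3005]+"      # kanji and the iteration mark
    r"|[\u30a1-\u30fa\u30fc]+"     # katakana and the long-vowel mark
    r"|[A-Za-z0-9]+"               # romaji and digits
)

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN.findall(text):
            index[token.lower()].add(doc_id)
    return index

if __name__ == "__main__":
    docs = {
        "a": "私は新しい学校でLinuxを使います",
        "b": "グリンプスは索引とgrepです",
    }
    for token, ids in sorted(build_index(docs).items()):
        print(token, sorted(ids))
```

Note that an inflected word loses its hiragana ending (only the
leading kanji of an adjective or verb is indexed), and purely
hiragana words never reach the index at all.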

If you decide to do it right, though, don't forget to include code to
handle "henkan typos" like 新学校 for 神纔瘢韭絎竢躬.  (^^)

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091;  Fax: 55-3849              turnbull@example.com