Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- To: tlug@example.com
- Subject: Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 12 May 1997 13:31:03 +0900
- In-reply-to: Your message of "Sun, 11 May 1997 12:00:04 +0200." <9705110300.AA20711@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug
--------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
--------------------------------------------------------

>>>>> "Francis" == Francis Brian O'Carroll <ocarroll@example.com> writes:

    Francis> did they say they would accept patches for japanese if we
    Francis> developed them? the code is not copyleft, so we couldn't
    Francis> redistribute I think.

Not a chance.  The copyright basically amounts to doing a binary
distribution in source, as I recall.

>>>>> "Andy" == Andrew S Howell <andy@example.com> writes:

    Andy> I didn't ask if they would accept patches. Their reply was
    Andy> the one liner I mentioned.

And they were quite right, too.  See below.

    Francis> Glimpse is basically an index plus grep; if your grep
    Francis> supports japanese you could hack together a prototype
    Francis> jglimpse with a little c programming.

I don't think so.  The grep part is easy, as you point out, although
you'd have to do without the "n typos" and "full-word" switches.  The
issue is the indexing.

    Andy> Doesn't it get a bit complicated though, if you have text
    Andy> with various encodings. I would think that index would have
    Andy> to be in some canonical format, say EUC. You would have to

Uh-uh.  For this application you want to do it in Unicode, presumably
"raw" UCS-2 for size and speed.  Don't create trouble for people who
are bilingual in Asian languages.

    Andy> determine the encoding on the fly, both when creating the
    Andy> index and when searching through the text again. Actually, to
    Andy> get the results of the search to display correctly, wouldn't
    Andy> you have to convert it to whatever your terminal was set
    Andy> for?

And _this_ is the easy part.

Remember, Japanese is an extremely highly inflected language and does
not use spaces to separate words.  Unless your indexing program
understands Japanese syntax, it is not clear how you would go about
doing the indexing.

Too bad glimpse and wnn aren't written in Java.  Then you could do
"import glimpse.*; import wnn.*;" and only have about 20,000 lines of
code left to modify or write.  (^^;)

Glimpse is fast and the indexes are "small" because glimpse knows a
lot about European languages.  Writing a glimpse for Japanese looks to
me like a major research project.  You may as well start from scratch,
too, because glimpseindex's code seems to be heavily dependent on
whitespace as word boundaries and other assumptions that just don't
apply to Japanese at all.

Probably the best bet would be to index only kanji, katakana, and
romaji.  That ought to keep your indexes to a moderate size, maybe.
Of course you're going to miss a lot of hiragana words.

If you decide to do it right, though, don't forget to include code to
handle "henkan typos" like 新学校 for 神[remainder garbled in archive].  (^^)

-- 
Stephen J. Turnbull
Institute of Policy and Planning Sciences    Yaseppochi-Gumi
University of Tsukuba    http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091; Fax: 55-3849    turnbull@example.com
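
The "index only kanji, katakana, and romaji" suggestion in the message can be sketched roughly as follows. This is an illustration only, not anything from glimpse or glimpseindex: the names TERM_RE, extract_terms, and build_index are made up for the example, the Unicode ranges are an assumption about what counts as "kanji" and "katakana", and the input is assumed to be already decoded to Unicode strings.

```python
import re
from collections import defaultdict

# Hypothetical term pattern: a term is a maximal run of one script class.
# Hiragana, punctuation, and digits match nothing and are skipped.
TERM_RE = re.compile(
    r"[\u4E00-\u9FFF\u3005]+"           # kanji (CJK Unified Ideographs, plus the iteration mark)
    r"|[\u30A1-\u30FA\u30FC-\u30FE]+"   # katakana, including the long-vowel mark
    r"|[A-Za-z]+"                       # romaji (ASCII letters only)
)


def extract_terms(text):
    """Return the kanji, katakana, and romaji runs in text, in order."""
    # lower() only affects the romaji runs; kanji and kana have no case.
    return [m.group(0).lower() for m in TERM_RE.finditer(text)]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in extract_terms(text):
            index[term].add(doc_id)
    return index


if __name__ == "__main__":
    print(extract_terms("新しい学校でLinuxのカーネルを読む"))
    # ['新', '学校', 'linux', 'カーネル', '読']
```

The sample output shows the cost mentioned in the message: the inflected words 新しい and 読む are reduced to their bare kanji, and anything written purely in hiragana is not indexed at all.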
- Follow-Ups:
- tlug: Re: Glimpse support for Asian characters
- From: Dennis McMurchy <denismcm@example.com>
- Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- From: "Andrew S. Howell" <andy@example.com>
- References:
- Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- From: "Andrew S. Howell" <andy@example.com>