Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- To: tlug@example.com
- Subject: Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- From: "Andrew S. Howell" <andy@example.com>
- Date: Mon, 12 May 1997 16:05:13 JST
- In-Reply-To: Your message of "Mon, 12 May 1997 13:31:03 JST." <m0wQmlL-000010C@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug
--------------------------------------------------------
tlug note from "Andrew S. Howell" <andy@example.com>
--------------------------------------------------------
>>>>> "Stephen" == Stephen J Turnbull <turnbull@example.com> writes:

    Stephen> --------------------------------------------------------
    Stephen> tlug note from "Stephen J. Turnbull"
    Stephen> <turnbull@example.com>
    Stephen> --------------------------------------------------------

>>>>> "Francis" == Francis Brian O'Carroll <ocarroll@example.com> writes:

    Francis> did they say they would accept patches for japanese if we
    Francis> developed them? the code is not copyleft, so we couldn't
    Francis> redistribute I think.

    Stephen> Not a chance.  The copyright basically amounts to doing a
    Stephen> binary distribution in source as I recall.

>>>>> "Andy" == Andrew S Howell <andy@example.com> writes:

[snip]

    Andy> Doesn't it get a bit complicated though, if you have text
    Andy> with various encodings. I would think that index would have
    Andy> to be in some canonical format, say EUC. You would have

    Stephen> Uh-uh.  For this application you want to do it in
    Stephen> Unicode, presumably "raw" UCS-2 for size and speed.
    Stephen> Don't create trouble for people who are bilingual in
    Stephen> Asian languages.

Yeah, Unicode would make more sense.

    Andy> to determine the encoding on the fly, both when creating the
    Andy> index and when searching through the text again. Actually, to
    Andy> get the results of the search to display correctly, wouldn't
    Andy> you have to convert it to whatever your terminal was set
    Andy> for?

    Stephen> And _this_ is the easy part.  Remember, Japanese is an
    Stephen> extremely highly inflected language and does not use
    Stephen> spaces to separate words.  Unless your indexing program
    Stephen> understands Japanese syntax, it is not clear how you
    Stephen> would go about doing the indexing.

    Stephen> Too bad glimpse and wnn aren't written in Java.  Then you
    Stephen> could do "import glimpse.*; import wnn.*;" and only have
    Stephen> about 20,000 lines of code left to modify or write.
    Stephen> (^^;)

    Stephen> Glimpse is fast and the indexes are "small" because
    Stephen> glimpse knows a lot about European languages.  Writing a
    Stephen> glimpse for Japanese looks to me like a major research
    Stephen> project.  You may as well start from scratch, too,
    Stephen> because glimpseindex's code seems to be heavily dependent
    Stephen> on whitespace as word boundaries and stuff like that that
    Stephen> just don't apply to Japanese at all.

Ok, I'll just tell my boss I'm taking a couple years off to write a
good indexing program.... I wonder if I'll have a paycheck when I come
back :)

    Stephen> Probably the best bet would be to index only kanji,
    Stephen> katakana, and romaji.  That ought to keep your indexes to
    Stephen> a moderate size, maybe.  Of course you're going to miss a
    Stephen> lot of hiragana words.

    Stephen> If you decide to do it right, though, don't forget to
    Stephen> include code to handle "henkan typos" like ?73X9; for ?@
    Stephen> 3X9;.  (^^)

Sounds like the "do it right" approach would require an awful lot of
knowledge of Japanese. On second thought, I think I'll put this on hold
for a while...

Andy

-----------------------------------------------------------------
a word from the sponsor will appear below
-----------------------------------------------------------------
The TLUG mailing list is proudly sponsored by TWICS - Japan's First
Public-Access Internet System.  Now offering 20,000 yen/year flat
rate Internet access with no time charges.  Full line of corporate
Internet and intranet products are available.   info@example.com
Tel: 03-3351-5977   Fax: 03-3353-6096
- References:
- Re: tlug: Udi Manber: Re: Glimpse support for Asian characters
- From: "Stephen J. Turnbull" <turnbull@example.com>