tlug: Two Qs re translation project
- To: tlug@example.com
- Subject: tlug: Two Qs re translation project
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Fri, 28 Jan 2000 16:43:21 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <20000128060241.A508@example.com>
- References: <20000128060241.A508@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
Frank writes:

>> UTF-8 doesn't suffer from this problem, btw.  By design, the
>> head byte is structurally different from the tail byte(s) so a
>> 8-bit clean string search won't deliver a false positive.

FB> Looks like it's time for Frank to go back to school.

FB> Duh, can UTF-8 be interpreted correctly by browsers in common
FB> circulation, and if so (or if it's on a rising wave) what is
FB> the best reference text on it?

Probably most commercial browsers do (IE, Netscape, Mozilla claims to
but I can't test easily).  Lynx will not, unless the user is fortunate
enough to have a UTF-8 font.  But what you would probably prefer to do
is to store your stuff in UTF-8, and spit it back out in whatever code
the request came in.

The reference is _The Unicode Standard, Version 2.0_, or you can wait
for version 3.0 which will be out RSN.  About $50 IIRC,
Addison-Wesley.  Available from Amazon (unless you are obeying rms's
boycott).

FB> Also ... if we move to a new encoding, we'll need a conversion
FB> tool.  Is there a Unix filter that can munge one of the common
FB> Jse encodings into UTF-8?

Ask Jim Breen <jwb@example.com>.  I don't think any of the standard
ones (nkf, kcc, ack, jcode) do yet.  Large tables are required.

FB> After looking around and thinking about our requirements, I've
FB> tentatively settled on MySQL as the database engine to use for
FB> the dictionary of equivalents in our translation project.  If
FB> I should think twice and consider an alternative, please let
FB> us know.

I haven't heard anything bad about MySQL I18N, but I'm not basically a
database person.  All I need are find(1) and egrep(1).

FB> One of the things that I'll need to do with the statutory
FB> material (on the English and on the Japanese side) is quick
FB> searching for words and phrases.  Does anyone have information
FB> on Japanese-capable search engines that can be run under
FB> Linux?
FB> I remember there was a mention of Glimpse for
FB> Japanese, but if I recall correctly, the patch is not being
FB> maintained.

ftp.lab.kdd.co.jp, ftp.ring.gr.jp, and ftp.etl.go.jp are good places
to look.

From the Debian package list.  Get sources from any Debian FTP mirror
or http mirror in dists/unstable/main/sources ("unstable" doesn't
usually mean the application is unstable, it means that your Debian
dependencies are probably incoherent; I also suggest NOT applying any
Debianization patches; there's a good chance they'll introduce
problems rather than fix them).  Or do a web search for the upstream
site.

ht://Dig looks like a poor option.

Package: htdig
Priority: optional
Section: web
Installed-Size: 1700
Maintainer: Gergely Madarasz <gorgo@example.com>
Architecture: i386
Version: 3.1.4-1
Depends: libc6 (>= 2.1), libstdc++2.10, libz1, perl5
Recommends: httpd
Suggests: htdig-doc, catdoc | word2x, pstotext | gs, pstotext | xpdf | xpdf-i
Filename: dists/unstable/main/binary-i386/web/htdig_3.1.4-1.deb
Size: 666526
MD5sum: fb0f2e8bfb85e1a1e953d59aaf903789
Description: WWW search system for an intranet or small internet
 The ht://Dig system is a complete world wide web indexing and searching
 system for a small domain or intranet. This system is not meant to replace
 the need for powerful internet-wide search systems like Lycos, Infoseek,
 Webcrawler and AltaVista. Instead it is meant to cover the search needs for
 a single company, campus, or even a particular sub section of a web site.
 .
 As opposed to some WAIS-based or web-server based search engines, ht://Dig
 can span several web servers at a site. The type of these different web
 servers doesn't matter as long as they understand the HTTP 1.0 protocol.
 .
 Features:
  * Intranet searching
  * It is free
  * Robot exclusion is supported
  * Boolean expression searching
  * Configurable search results
  * Fuzzy searching
  * Searching of HTML and text files
  * Keywords can be added to HTML documents
  * Email notification of expired documents
  * A Protected server can be indexed
  * Searches on subsections of the database
  * Full source code included
  * The depth of the search can be limited
  * Full support for the ISO-Latin-1 character set

SWISH++ claims to have a facility for indexing non-text files.  Maybe
this is generalizable to Japanese.

Package: swish++
Priority: optional
Section: web
Installed-Size: 374
Maintainer: Jim Pick <jim@example.com>
Architecture: i386
Version: 3.0.3-3
Depends: perl5, libc6, libc6 (>= 2.1), libstdc++2.10
Filename: dists/unstable/main/binary-i386/web/swish++_3.0.3-3.deb
Size: 167246
MD5sum: 29f55e40647ce3bea71eff8632775594
Description: Simple Web Indexing System for Humans ++
 The author says - It's an order of magnitude faster than SWISH-E,
 automatically splits/merges large indexing jobs, and has a utility for
 aiding in the indexing of non-text files.

Maybe SWISH-E?

Package: swish-e
Priority: optional
Section: web
Installed-Size: 153
Maintainer: Jim Pick <jim@example.com>
Architecture: i386
Version: 1.1-1
Depends: libc6
Filename: dists/unstable/main/binary-i386/web/swish-e_1.1-1.deb
Size: 60614
MD5sum: 5f97cf1c33bb71a6b5e5ea237fe9a066
Description: Simple Web Indexing System for Humans
 SWISH-Enhanced is a fast, powerful, flexible, and easy to use system for
 indexing collections of Web pages or other text files. Key features include
 the ability to limit searches to certain HTML tags (META, TITLE, comments,
 etc.). The SWISH-E software is free, and we include a package of Perl
 programs that enable anyone who is authorized to create and maintain their
 own indexes (AutoSwish). SWISH-E is an enhanced version of SWISH, which was
 originally written by Kevin Hughes and modified and released with his
 permission.
FB> Or is it logically impossible in EUC-JP encoding to get
FB> crossed up in this way?

As Adrian points out, there's no way to avoid this.  And unless you
want to go deep into the innards of the search engine (and probably
snafu), you can't use the obvious tricks that you could use with grep
like searching for

    "(BOL|<7bit>)(<8bit><8bit>)*<your string here>"

to reduce the false positives.  Gotta get a Japanese- or
Unicode-/UTF-8-capable indexing engine.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting: March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
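
The false-positive problem discussed in the message can be reproduced
in a few lines.  What follows is only a minimal sketch and is not
taken from the thread: it assumes Python's built-in euc_jp and utf-8
codecs, and the sample strings and the helper name aligned_search are
made up for illustration.  The helper is one possible reading of the
grep trick "(BOL|<7bit>)(<8bit><8bit>)*<your string here>".  It treats
every 8-bit run as a sequence of two-byte characters, so three-byte
EUC-JP sequences (the 0x8F prefix) would throw it off, which fits the
message's wording that the trick reduces false positives.

```python
import re


def aligned_search(needle: bytes, haystack: bytes) -> bool:
    """Hypothetical helper: byte search that only accepts matches starting
    on a two-byte boundary inside a run of 8-bit bytes."""
    pattern = re.compile(
        rb"(?:^|[\x00-\x7f])"      # start of data, or a 7-bit byte
        rb"(?:[\x80-\xff]{2})*"    # any number of whole two-byte characters
        + re.escape(needle)        # then the search string itself
    )
    return pattern.search(haystack) is not None


# Sample data (an assumption for illustration): katakana I followed by
# hiragana A.  The hiragana I that we search for does not occur in it.
text = "イあ"
needle = "い"

text_euc = text.encode("euc_jp")      # bytes A5 A4 A4 A2
needle_euc = needle.encode("euc_jp")  # bytes A4 A4

# Naive 8-bit-clean search: the needle straddles the two characters.
naive_euc_hit = needle_euc in text_euc

# Alignment-aware search over the same EUC-JP bytes.
aligned_euc_hit = aligned_search(needle_euc, text_euc)

# The same naive byte search over UTF-8, where head and tail bytes differ.
utf8_hit = needle.encode("utf-8") in text.encode("utf-8")

print("character-level match:", needle in text)
print("naive EUC-JP byte match:", naive_euc_hit)
print("aligned EUC-JP byte match:", aligned_euc_hit)
print("naive UTF-8 byte match:", utf8_hit)
```

With these sample strings the naive byte search over EUC-JP reports a
hit even though the character is not in the text, while the aligned
search and the UTF-8 byte search do not.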
- References:
- tlug: Two Qs re translation project
- From: "Frank Bennett (フランクべネット)" <bennett@example.com>