Mailing List Archive

tlug: Two Qs re translation project



Frank writes:

    >> UTF-8 doesn't suffer from this problem, btw. By design, the
    >> head byte is structurally different from the tail byte(s) so a
    >> 8-bit clean string search won't deliver a false positive.

    FB> Looks like it's time for Frank to go back to school.

    FB> Duh, can UTF-8 be interpreted correctly by browsers in common
    FB> circulation, and if so (or if it's on a rising wave) what is
    FB> the best reference text on it?

Probably most commercial browsers do (IE, Netscape; Mozilla claims to,
but I can't easily test that).  Lynx will not, unless the user is
fortunate enough to have a UTF-8 font.  But what you would probably
prefer to do is to store your stuff in UTF-8, and spit it back out in
whatever code the request came in.
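
A rough sketch of the "store in UTF-8, answer in the requester's code"
idea, in Python (my choice for illustration; nothing in this thread
depends on it).  The function name and the fall-back-to-UTF-8 policy
are assumptions, not anything a particular server does.

```python
def encode_for_client(stored_utf8: bytes, requested_charset: str) -> bytes:
    """Re-encode UTF-8 storage in the charset the client asked for."""
    text = stored_utf8.decode("utf-8")
    try:
        return text.encode(requested_charset)
    except (LookupError, UnicodeEncodeError):
        # Unknown charset name, or text the charset cannot represent:
        # hand back the UTF-8 bytes unchanged (an assumed policy).
        return stored_utf8


stored = "日本語".encode("utf-8")
print(encode_for_client(stored, "euc_jp"))
print(encode_for_client(stored, "shift_jis"))
```

The caller still has to label the response with whichever charset was
actually used.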

The reference is _The Unicode Standard, Version 2.0_, or you can wait
for version 3.0 which will be out RSN.  About $50 IIRC,
Addison-Wesley.  Available from Amazon (unless you are obeying rms's
boycott).

    FB> Also ... if we move to a new encoding, we'll need a conversion
    FB> tool.  Is there a Unix filter that can munge one of the common
    FB> Jse encodings into UTF-8?

Ask Jim Breen <jwb@example.com>.  I don't think any of the
standard ones (nkf, kcc, ack, jcode) do yet.  Converting between the
JIS-based encodings and Unicode requires large mapping tables.
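
Where a language runtime already ships those tables, the filter itself
is short.  A minimal sketch, assuming a Python 3 interpreter (its
standard codecs include euc_jp, shift_jis and iso2022_jp); this is not
one of the tools named above, and the function name is made up.  The
incremental decoder is there so that a character split across two
reads is still converted correctly.

```python
import codecs
import io


def transcode(src, dst, source_encoding="euc_jp", chunk_size=65536):
    """Copy binary stream src to dst, converting to UTF-8 chunk by chunk."""
    decoder = codecs.getincrementaldecoder(source_encoding)(errors="strict")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # final=True lets the decoder complain if input ended mid-character.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


# Demonstration on in-memory streams; the tiny chunk size forces
# two-byte characters to be split across reads.
src = io.BytesIO("翻訳プロジェクト".encode("euc_jp"))
dst = io.BytesIO()
transcode(src, dst, chunk_size=3)
print(dst.getvalue().decode("utf-8"))
```

To use it as a Unix filter, pass sys.stdin.buffer and
sys.stdout.buffer as the two streams.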

    FB> After looking around and thinking about our requirements, I've
    FB> tentatively settled on MySQL as the database engine to use for
    FB> the dictionary of equivalents in our translation project.  If
    FB> I should think twice and consider an alternative, please let
    FB> us know.

I haven't heard anything bad about MySQL I18N, but I'm basically not a
database person.  All I need are find(1) and egrep(1).

    FB> One of the things that I'll need to do with the statutory
    FB> material (on the English and on the Japanese side) is quick
    FB> searching for words and phrases.  Does anyone have information
    FB> on Japanese-capable search engines that can be run under
    FB> Linux?  I remember there was a mention of Glimpse for
    FB> Japanese, but if I recall correctly, the patch is not being
    FB> maintained.

ftp.lab.kdd.co.jp, ftp.ring.gr.jp, and ftp.etl.go.jp are good places
to look.

The entries below are from the Debian package list.  Get sources from
any Debian FTP or HTTP mirror in dists/unstable/main/sources.
("Unstable" doesn't usually mean the application is unstable; it means
that your Debian dependencies are probably incoherent.  I also suggest
NOT applying any Debianization patches; there's a good chance they'll
introduce problems rather than fix them.)  Or do a web search for the
upstream site.

ht://Dig looks like a poor option: the only character set its
description claims to support is ISO-Latin-1.

    Package: htdig
    Priority: optional
    Section: web
    Installed-Size: 1700
    Maintainer: Gergely Madarasz <gorgo@example.com>
    Architecture: i386
    Version: 3.1.4-1
    Depends: libc6 (>= 2.1), libstdc++2.10, libz1, perl5
    Recommends: httpd
    Suggests: htdig-doc, catdoc | word2x, pstotext | gs, pstotext | xpdf | xpdf-i
    Filename: dists/unstable/main/binary-i386/web/htdig_3.1.4-1.deb
    Size: 666526
    MD5sum: fb0f2e8bfb85e1a1e953d59aaf903789
    Description: WWW search system for an intranet or small internet
     The ht://Dig system is a complete world wide web indexing and searching
     system for a small domain or intranet. This system is not meant to
     replace the need for powerful internet-wide search systems like Lycos,
     Infoseek, Webcrawler and AltaVista. Instead it is meant to cover the
     search needs for a single company, campus, or even a particular sub
     section of a web site.
     .
     As opposed to some WAIS-based or web-server based search engines,
     ht://Dig can span several web servers at a site. The type of these different
     web servers doesn't matter as long as they understand the HTTP 1.0
     protocol.
     .
     Features:
        * Intranet searching
        * It is free
        * Robot exclusion is supported
        * Boolean expression searching
        * Configurable search results
        * Fuzzy searching
        * Searching of HTML and text files
        * Keywords can be added to HTML documents
        * Email notification of expired documents
        * A Protected server can be indexed
        * Searches on subsections of the database
        * Full source code included
        * The depth of the search can be limited
        * Full support for the ISO-Latin-1 character set

SWISH++ claims to have a facility for indexing non-text files.  Maybe
this is generalizable to Japanese.

    Package: swish++
    Priority: optional
    Section: web
    Installed-Size: 374
    Maintainer: Jim Pick <jim@example.com>
    Architecture: i386
    Version: 3.0.3-3
    Depends: perl5, libc6, libc6 (>= 2.1), libstdc++2.10
    Filename: dists/unstable/main/binary-i386/web/swish++_3.0.3-3.deb
    Size: 167246
    MD5sum: 29f55e40647ce3bea71eff8632775594
    Description: Simple Web Indexing System for Humans ++
     The author says - It's an order of magnitude faster than SWISH-E,
     automatically splits/merges large indexing jobs, and has a utility
     for aiding in the indexing of non-text files.

Maybe SWISH-E?

    Package: swish-e
    Priority: optional
    Section: web
    Installed-Size: 153
    Maintainer: Jim Pick <jim@example.com>
    Architecture: i386
    Version: 1.1-1
    Depends: libc6
    Filename: dists/unstable/main/binary-i386/web/swish-e_1.1-1.deb
    Size: 60614
    MD5sum: 5f97cf1c33bb71a6b5e5ea237fe9a066
    Description: Simple Web Indexing System for Humans
     SWISH-Enhanced is a fast, powerful, flexible, and easy to use system
     for indexing collections of Web pages or other text files. Key
     features include the ability to limit searches to certain HTML tags
     (META, TITLE, comments, etc.). The SWISH-E software is free, and we
     include a package of Perl programs that enable anyone who is
     authorized to create and maintain their own indexes
     (AutoSwish). SWISH-E is an enhanced version of SWISH, which was
     originally written by Kevin Hughes and modified and released with his
     permission.

 
    FB> Or is it logically impossible in EUC-JP encoding to get
    FB> crossed up in this way?

As Adrian points out, there's no way to avoid this.  And unless you
want to go deep into the innards of the search engine (and probably
snafu), you can't use the obvious tricks that you could use with grep
like searching for "(BOL|<7bit>)(<8bit><8bit>)*<your string here>" to
reduce the false positives.
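
To make the problem and the grep-style trick concrete, here is a
minimal sketch in Python (my choice for illustration).  The strings
are picked so that the bytes of the search string straddle two
characters.  The pattern assumes every 8-bit character is exactly two
bytes, which ignores EUC-JP's half-width katakana (SS2) and JIS X 0212
(SS3) sequences, so it reduces false positives rather than eliminating
them.  The function name is hypothetical.

```python
import re


def aligned_search(haystack: bytes, needle: bytes) -> bool:
    """Accept needle only where it starts on a two-byte character boundary.

    Mirrors "(BOL|<7bit>)(<8bit><8bit>)*<your string here>".
    """
    pattern = rb"(?:^|[\x00-\x7f])(?:[\x80-\xff]{2})*" + re.escape(needle)
    return re.search(pattern, haystack) is not None


haystack = "イあ".encode("euc_jp")  # bytes A5 A4 | A4 A2
needle = "い".encode("euc_jp")      # bytes A4 A4

print(needle in haystack)                                  # True: false positive
print(aligned_search(haystack, needle))                    # False
print(aligned_search("あいう".encode("euc_jp"), needle))   # True: real match
```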

Gotta get a Japanese- or Unicode-/UTF-8-capable indexing engine.
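
For comparison, the same strings in UTF-8, as a small Python sketch
(helper names are made up).  Tail bytes always have the bit pattern
10xxxxxx and head bytes never do, so a byte-level match of a
well-formed search string can only begin on a character boundary.

```python
def is_tail_byte(b: int) -> bool:
    """UTF-8 continuation (tail) bytes always look like 10xxxxxx."""
    return (b & 0xC0) == 0x80


def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every byte offset at which needle occurs in haystack."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


needle = "い".encode("utf-8")

# The pair of characters that produced a false positive in EUC-JP:
print(find_all("イあ".encode("utf-8"), needle))      # []

# A real occurrence; every hit starts on a head byte.
hits = find_all("イあいう".encode("utf-8"), needle)
print(hits)                                           # [6]
print(all(not is_tail_byte("イあいう".encode("utf-8")[i]) for i in hits))
```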

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan

