TLUG Mailing List

Re: [tlug] a japanese dictionary: regex v. db query

Date: Wed, 05 Apr 2006 16:16:17 +0900

From: "Stephen J. Turnbull" <stephen@example.com>

Subject: Re: [tlug] a japanese dictionary: regex v. db query

References: <442EADBF.3060701@example.com> <a2588e100604010926o733cdc7en57335349dac956a8@example.com> <aaa8c4580604011516m6605d116j3df1fb29deabb94b@example.com> <a2588e100604031616v60663498qdb3d3d03bed3a3f6@example.com> <4432084F.6050400@example.com> <87u099h8rt.fsf@example.com> <20060404103045.507444eb.jep200404@example.com>

Organization: The XEmacs Project

User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.4.19 (linux)
>>>>> "Jim" == Jim  <jep200404@example.com> writes:

    Jim> "Stephen J. Turnbull" wrote:

    >> What is the regular expression for "all characters with 16
    >> strokes"?

    Jim> Oh boy. That made me think. That was the right question to
    Jim> highlight the limitations of regexes.

Regexps as such are somewhat limited, but they're actually quite
flexible and powerful when used judiciously.  It's just that they
suffer from the "to a man with a hammer, every problem looks like a
thumb" syndrome, perhaps more so than any other tool I know of.

The issues here are different.  It's easy enough to design a flat file
format like

KANJI<TAB>ON1 ... ONn<TAB>KUN1 ... KUNm<TAB>STROKES<TAB>RADICAL ...

and do "grep -P '^(.)\t[^\t]*\t[^\t]*\t(\d+)\t' db-file" (-P because
plain egrep doesn't grok "\t" or "\d").  The problem is that if you
ever decide to change the format, you break all programs written to
the old standard.  And if you decide to add "\t" as a character in
your database, you're really hosed.  For David, these *probably* don't
much matter, since he'll likely use an existing database like EDICT.
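
As a minimal sketch of the flat-file lookup, here is the same pattern
in Python, with each reading field written as "[^\t]*" so it can hold
more than one character.  The sample rows and the names SAMPLE, LINE
and sixteen are made up for illustration; they are not taken from
EDICT or any real database file.

```python
import re

# Hypothetical sample rows in the layout
# KANJI<TAB>ON readings<TAB>KUN readings<TAB>STROKES<TAB>RADICAL
SAMPLE = (
    "山\tサン\tやま\t3\t山\n"
    "日\tニチ ジツ\tひ か\t4\t日\n"
    "橋\tキョウ\tはし\t16\t木\n"
    "頭\tトウ ズ\tあたま かしら\t16\t頁\n"
)

# Group 1 is the kanji, group 2 the stroke count.  The field positions
# are hard-coded here, which is exactly why a format change breaks it.
LINE = re.compile(r"^(.)\t[^\t]*\t[^\t]*\t(\d+)\t", re.MULTILINE)

strokes = {m.group(1): int(m.group(2)) for m in LINE.finditer(SAMPLE)}
sixteen = [kanji for kanji, count in strokes.items() if count == 16]
print(sixteen)
```

Every program that embeds a pattern like LINE has to be edited if a
column is added or reordered, which is the maintenance problem
described above.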

Finally, queries involving multiple features, connected with logical
operators, scale badly on a flat file: every query is at least a full
scan, O(m*n) where n is the size of the database and m is the number
of features, and it can approach O(n^m) if each feature gets its own
nested pass.  If you have a real database with indexes, on the other
hand, you can generally get that down to something like O(m*lg n),
which can be important in some applications.  Probably not for an
input method, though.
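
To make the contrast concrete, here is a rough sketch of a
multi-feature query answered two ways: by scanning every record, and
by intersecting per-feature indexes.  The records and the helper names
(build_index, query_and, scan_and) are hypothetical, and a Python dict
is only a stand-in for a database index: it is hash-based, so its
lookups do not cost lg n the way a B-tree's would.

```python
from collections import defaultdict

# Hypothetical records; a real table would be loaded from a database.
RECORDS = [
    {"kanji": "山", "strokes": 3, "radical": "山"},
    {"kanji": "林", "strokes": 8, "radical": "木"},
    {"kanji": "橋", "strokes": 16, "radical": "木"},
    {"kanji": "頭", "strokes": 16, "radical": "頁"},
]

def build_index(records, feature):
    """Map each value of one feature to the set of kanji having it."""
    index = defaultdict(set)
    for rec in records:
        index[rec[feature]].add(rec["kanji"])
    return index

# Built once, then reused by every query.
INDEXES = {f: build_index(RECORDS, f) for f in ("strokes", "radical")}

def query_and(**wanted):
    """AND query: one index lookup per feature, then intersect."""
    sets = [INDEXES[f].get(v, set()) for f, v in wanted.items()]
    return set.intersection(*sets) if sets else set()

def scan_and(**wanted):
    """The same query as a full scan that tests every record."""
    return {r["kanji"] for r in RECORDS
            if all(r[f] == v for f, v in wanted.items())}

print(query_and(strokes=16, radical="木"))
```

Both functions return the same set; the difference is that scan_and
touches every record on every query, while query_and only touches the
index entries for the requested values.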

    Jim> I can not think of how to express "all characters with 16
    Jim> strokes" in the present schemes of regexes as I know them.

POSIX character classes, e.g., [:upper:], can easily be table driven.
Write the table, add the extension class, you're done.  I believe that
XML Schema already has a regex spec for Unicode, with category escapes
like \p{Lu}.  Also Emacs Lisp regexps have all the necessary features.

    Jim> Of course one could extend regexes to also match _attributes_
    Jim> of characters, such as brush stroke count. As if the syntax
    Jim> of regexes wasn't "simple" enough already, I shudder at the
    Jim> thought of what the extended regex syntax would be.

Extend POSIX classes with a "FILTER" keyword.  Then the syntax for
"16 stroke kanji" would be "[[:FILTER STROKECOUNT 16:]]", where
STROKECOUNT would be a function of type

typedef int (*filter_function) (widechar);

This could be done reasonably efficiently by constructing a table when
the regexp is compiled (assuming that you're going to be reusing the
regexp).
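
A rough sketch of how that could look, in Python rather than C: the
"[[:FILTER NAME ARG:]]" syntax is expanded into an ordinary bracket
expression once, when the pattern is compiled.  Everything here is an
assumption made for illustration: the STROKES table, the names
strokecount, FILTERS and compile_filtered, and the reading that the
filter function returns a count which is compared with the argument.
No existing regex engine accepts this syntax.

```python
import re

# Hypothetical stroke-count table; a real one would come from a
# kanji database.
STROKES = {"山": 3, "日": 4, "林": 8, "語": 14, "橋": 16, "頭": 16}

def strokecount(ch):
    """Filter function: stroke count of ch, or 0 if ch is unknown."""
    return STROKES.get(ch, 0)

# Registry of filter names that may appear after the FILTER keyword.
FILTERS = {"STROKECOUNT": strokecount}

FILTER_CLASS = re.compile(r"\[\[:FILTER (\w+) (\d+):\]\]")

def compile_filtered(pattern, alphabet=STROKES):
    """Expand each FILTER class into a plain bracket expression, then
    compile.  The filter runs over the alphabet once, here, and never
    again while matching."""
    def expand(match):
        func = FILTERS[match.group(1)]
        wanted = int(match.group(2))
        members = [ch for ch in alphabet if func(ch) == wanted]
        if not members:
            return "(?!)"  # empty class: a construct that never matches
        return "[" + "".join(re.escape(ch) for ch in members) + "]"
    return re.compile(FILTER_CLASS.sub(expand, pattern))

sixteen = compile_filtered("[[:FILTER STROKECOUNT 16:]]")
print(sixteen.findall("山と橋と頭"))
```

The cost of calling the filter is paid once per compiled pattern, so
the approach only pays off if the compiled regexp is reused.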


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.