Mailing List Archive


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlug] a japanese dictionary: regex v. db query



>>>>> "Jim" == Jim  <jep200404@example.com> writes:

    Jim> "Stephen J. Turnbull" wrote:

    >> What is the regular expression for "all characters with 16
    >> strokes"?

    Jim> Oh boy. That made me think. That was the right question to
    Jim> highlight the limitations of regexes.

Regexps as such are somewhat limited, but they're actually quite
flexible and powerful when used judiciously.  It's just that they
suffer from the "to a man with a hammer, every problem looks like a
thumb" syndrome, perhaps more so than any other tool I know of.

The issues here are different.  It's easy enough to design a flat file
format like

KANJI<TAB>ON1 ... ONn<TAB>KUN1 ... KUNm<TAB>STROKES<TAB>RADICAL ...

and do "egrep '^(.)\t[^\t]\t[^\t]\t(\d+)\t' db-file".  The problem is
if you ever decide to change the format, you break all programs
written to the old standard.  And if you decide to add "\t" as a
character in your database, you're really hosed.  For David, these
*probably* don't much matter since he'll probably use an existing
database like EDICT.

Finally, queries involving multiple features, connected with logical
operators, are going to be O(n^m) where n is the size of the database
and m is the number of features.  If you have a real database, on the
other hand, you can generally get that down to something like O(m*lg
n), which can be important in some applications.  Probably not for an
input method, though.

    Jim> I can not think of how to express "all characters with 16
    Jim> strokes" in the present schemes of regexes as I know them.

POSIX character classes, eg, [:UPPER:] can easily be table driven.
Write the table, add the extension class, you're done.  I believe that
XML already has a regex spec for Unicode.  Also Emacs Lisp regexps
have all the necessary features.

    Jim> Of course one could extend regexes to also match _attributes_
    Jim> of characters, such as brush stroke count. As if the syntax
    Jim> of regexes wasn't "simple" enough already, I shudder at the
    Jim> thought of what the extended regex syntax would be.

Extend POSIX classes, with a "FILTER" keyword.  Then the syntax for
"16 stroke kanji" would be "[[:FILTER STROKECOUNT 16:]]", where
STROKECOUNT would be

typedef int (*filter_function) (widechar);

This could be done reasonably efficiently by constructing a table at
compile-time (assuming that you're going to be reusing the regexp).


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


Home | Main Index | Thread Index

Home Page Mailing List Linux and Japan TLUG Members Links