
Mailing List Archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [tlug] a japanese dictionary: regex v. db query
>>>>> "Jim" == Jim <jep200404@example.com> writes:
Jim> "Stephen J. Turnbull" wrote:
>> What is the regular expression for "all characters with 16
>> strokes"?
Jim> Oh boy. That made me think. That was the right question to
Jim> highlight the limitations of regexes.
Regexps as such are somewhat limited, but they're actually quite
flexible and powerful when used judiciously. It's just that they
suffer from the "to a man with a hammer, every problem looks like a
thumb" syndrome, perhaps more so than any other tool I know of.
The issues here are different. It's easy enough to design a flat file
format like
KANJI<TAB>ON1 ... ONn<TAB>KUN1 ... KUNm<TAB>STROKES<TAB>RADICAL ...
and do "egrep '^(.)\t[^\t]\t[^\t]\t(\d+)\t' db-file". The problem is
if you ever decide to change the format, you break all programs
written to the old standard. And if you decide to add "\t" as a
character in your database, you're really hosed. For David, these
*probably* don't much matter since he'll probably use an existing
database like EDICT.
Finally, queries involving multiple features, connected with logical
operators, are going to be O(n^m) where n is the size of the database
and m is the number of features. If you have a real database, on the
other hand, you can generally get that down to something like O(m*lg
n), which can be important in some applications. Probably not for an
input method, though.
Jim> I can not think of how to express "all characters with 16
Jim> strokes" in the present schemes of regexes as I know them.
POSIX character classes, eg, [:UPPER:] can easily be table driven.
Write the table, add the extension class, you're done. I believe that
XML already has a regex spec for Unicode. Also Emacs Lisp regexps
have all the necessary features.
Jim> Of course one could extend regexes to also match _attributes_
Jim> of characters, such as brush stroke count. As if the syntax
Jim> of regexes wasn't "simple" enough already, I shudder at the
Jim> thought of what the extended regex syntax would be.
Extend POSIX classes, with a "FILTER" keyword. Then the syntax for
"16 stroke kanji" would be "[[:FILTER STROKECOUNT 16:]]", where
STROKECOUNT would be
typedef int (*filter_function) (widechar);
This could be done reasonably efficiently by constructing a table at
compile-time (assuming that you're going to be reusing the regexp).
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
Home |
Main Index |
Thread Index