Mailing List Archive
tlug.jp Mailing List tlug archive tlug Mailing List Archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]Re: [tlug] a japanese dictionary: regex v. db query
- Date: Wed, 05 Apr 2006 16:16:17 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] a japanese dictionary: regex v. db query
- References: <442EADBF.3060701@example.com> <a2588e100604010926o733cdc7en57335349dac956a8@example.com> <aaa8c4580604011516m6605d116j3df1fb29deabb94b@example.com> <a2588e100604031616v60663498qdb3d3d03bed3a3f6@example.com> <4432084F.6050400@example.com> <87u099h8rt.fsf@example.com> <20060404103045.507444eb.jep200404@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.4.19 (linux)
>>>>> "Jim" == Jim <jep200404@example.com> writes: Jim> "Stephen J. Turnbull" wrote: >> What is the regular expression for "all characters with 16 >> strokes"? Jim> Oh boy. That made me think. That was the right question to Jim> highlight the limitations of regexes. Regexps as such are somewhat limited, but they're actually quite flexible and powerful when used judiciously. It's just that they suffer from the "to a man with a hammer, every problem looks like a thumb" syndrome, perhaps more so than any other tool I know of. The issues here are different. It's easy enough to design a flat file format like KANJI<TAB>ON1 ... ONn<TAB>KUN1 ... KUNm<TAB>STROKES<TAB>RADICAL ... and do "egrep '^(.)\t[^\t]\t[^\t]\t(\d+)\t' db-file". The problem is if you ever decide to change the format, you break all programs written to the old standard. And if you decide to add "\t" as a character in your database, you're really hosed. For David, these *probably* don't much matter since he'll probably use an existing database like EDICT. Finally, queries involving multiple features, connected with logical operators, are going to be O(n^m) where n is the size of the database and m is the number of features. If you have a real database, on the other hand, you can generally get that down to something like O(m*lg n), which can be important in some applications. Probably not for an input method, though. Jim> I can not think of how to express "all characters with 16 Jim> strokes" in the present schemes of regexes as I know them. POSIX character classes, eg, [:UPPER:] can easily be table driven. Write the table, add the extension class, you're done. I believe that XML already has a regex spec for Unicode. Also Emacs Lisp regexps have all the necessary features. Jim> Of course one could extend regexes to also match _attributes_ Jim> of characters, such as brush stroke count. As if the syntax Jim> of regexes wasn't "simple" enough already, I shudder at the Jim> thought of what the extended regex syntax would be. Extend POSIX classes, with a "FILTER" keyword. Then the syntax for "16 stroke kanji" would be "[[:FILTER STROKECOUNT 16:]]", where STROKECOUNT would be typedef int (*filter_function) (widechar); This could be done reasonably efficiently by constructing a table at compile-time (assuming that you're going to be reusing the regexp). -- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.
- References:
- [tlug] a japanese dictionary
- From: David Stibbe
- Re: [tlug] a japanese dictionary
- From: dabicho
- Re: [tlug] a japanese dictionary
- From: Michael Engel
- Re: [tlug] a japanese dictionary
- From: dabicho
- Re: [tlug] a japanese dictionary
- From: David Stibbe
- Re: [tlug] a japanese dictionary
- From: Stephen J. Turnbull
- [tlug] a japanese dictionary: regex v. db query
- From: Jim
Home | Main Index | Thread Index
- Prev by Date: Re: [tlug] a japanese dictionary: regex v. db query
- Next by Date: Re: [tlug] a japanese dictionary: regex v. db query
- Previous by thread: Re: [tlug] a japanese dictionary: regex v. db query
- Next by thread: Re: [tlug] a japanese dictionary
- Index(es):
Home Page Mailing List Linux and Japan TLUG Members Links