- Date: Fri, 20 Jan 2006 10:58:04 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] re: Searching for kanji strings
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
Steven T, thanks for the FreeWAIS idea -- and yes, storage is cheap.

> Steve S. wrote:
>
>> Would it be worthwhile to just write a simple [C] executable for this?
>
> It is wise to avoid premature optimization.
>
> First make it work _right_, then make it work _fast_.
>
> In the meantime, while using the existing solutions, David will learn
> how good they are for a wide variety of cases, and will also learn
> exactly what needs to be optimized.
>
> David, of the hundreds of megabytes of text, how big is each file?
> What is the longest line in any of those files?
> What is the largest file?
>
> Jim

I am indeed learning a lot as I go through this. To answer the data
questions: each line is 20 to 80 characters (not bytes -- the original
data is Big5, converted to UTF-8 locally). The 326MB is in 2460 files
across 56 folders, with a maximum file size of 2.5MB, many files under
10K, and examples of everything in between. The organization is by
"Taisho" number, which maps to the title of a text. Most happily, the
file names are just numbers, not kanji, so no encoding problems there.

My current Perl script does what I had originally hoped for. It is
invoked as "supersearch kanjistring fileglob". (It is in UTF-8, just
for the "maru" and the hard space in the skipping string.)

    #!/usr/bin/perl -0777 -n
    BEGIN { ($s = shift) =~ s/(.)/$1\[0-9pabc()|。　\n\015]*/g; }
    @matches = m/$s.*[0-9pabc()|。　\n\015]*/gxo;
    if (@matches) {
        print "in file: ", $ARGV, "\n";
        foreach $one (@matches) { print $one, "\n"; }
    }

Briefly, what it does is:

    # 1. grab the first arg, replace every character with
    #    that character + junk-skipper, and put it into $s
    # 2. put all the matches in the entire slurped file
    #    into the @matches array
    # 3. if there is anything in @matches, print out the file name,
    #    then each match, which has at least one line number

Hoping I can make the next step and figure out how to index this heap.
It is really pretty fast for what it does, but it would be a big help
to figure out how to do real indexing and flexible context searching,
as Steven says is done in FreeWAIS.

Thanks,

David Riggs, Kyoto
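
As an aside, the junk-skipper expansion above can be seen in isolation
in the following minimal sketch; the two-kanji query and the sample
text are hypothetical, and the skip class is copied from the script:

    #!/usr/bin/perl
    # Demonstrates the expansion step: each character of the query gets
    # the skip class appended, so junk between kanji cannot break a match.
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    my $skip = '[0-9pabc()|。　\n\015]*';    # skip class from the script
    my $term = '菩薩';                       # hypothetical two-kanji query
    (my $pat = $term) =~ s/(.)/$1$skip/g;    # interleave the skipper

    # Line-number junk between the two kanji is skipped over.
    my $text = '菩12(a)薩戒';
    print "matched: $&\n" if $text =~ /$pat/;

Run against that sample text, this prints "matched: 菩12(a)薩".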
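
On the indexing question: FreeWAIS will do far more, but as one possible
first step, a sketch along the following lines would map each kanji to
the set of files containing it, so the slow junk-skipping scan only has
to run over candidate files. The texts/*.txt layout and the query string
are assumptions for illustration, not the actual Taisho folder structure:

    #!/usr/bin/perl
    # Build a tiny inverted index: kanji => set of files containing it.
    use strict;
    use warnings;
    use utf8;
    use open ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    my %index;                                 # kanji => { filename => 1 }
    for my $file (glob 'texts/*.txt') {        # hypothetical layout
        open my $fh, '<', $file or die "$file: $!";
        while (my $line = <$fh>) {
            $index{$1}{$file} = 1 while $line =~ /(\p{Han})/g;
        }
        close $fh;
    }

    # A file qualifies only if it contains every distinct kanji in the
    # query; the real supersearch regex then runs on just those files.
    my $term = '菩薩';                         # hypothetical query
    my %uniq;
    my @kanji = grep { !$uniq{$_}++ } $term =~ /(\p{Han})/g;
    my %hits;
    $hits{$_}++ for map { keys %{ $index{$_} || {} } } @kanji;
    my @candidates = sort grep { $hits{$_} == @kanji } keys %hits;
    print "candidate files: @candidates\n";

Because the junk-skipper lets arbitrary characters fall between kanji in
a match, the index is built on single kanji rather than bigrams: adjacent
kanji in the query need not be adjacent in the file.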