Mailing List Archive



[tlug] re: Searching for kanji strings



Steven T, thanks for the FreeWAIS idea-- and yes, storage is cheap.

 >Steve S. wrote:
 >
 >> Would it be worth while to just write a simple [C] executable for this?
 >
 >It is wise to avoid premature optimization.
 >
 >First make it work _right_, then make it work _fast_.
 >
 >In the mean time, while using the existing solutions,
 >David will learn how good they are for a wide variety of cases,
 >and will also learn exactly what needs to be optimized.
 >
 >David, of the hundreds of megabytes of text, how big is each file?
 >What is the longest line in any of those files?
 >What is the largest file?
 >
 >Jim


I am indeed learning a lot as I go through this.

To answer the data questions: each line is 20 to 80 characters (not 
bytes-- the original data is big5, converted to utf-8 locally).

The 326MB is in 2460 files in 56 folders, 2.5MB max file size, with many 
less than 10K, and examples of everything in between. The organization 
is by the "Taisho" number, which maps to the title of a text. Most 
happily the file names are just numbers, not kanji, so no encoding 
problems there.
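
Since the line lengths above are counted in characters rather than bytes, here is a minimal sketch of how the two differ for utf-8 text in Perl. The helper name char_and_byte_length is made up for illustration, and the sample string is just the two kanji of "Taisho" written as escapes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Return (characters, bytes) for one line of already-decoded text.
# char_and_byte_length is a hypothetical helper, not part of any script above.
sub char_and_byte_length {
    my ($line) = @_;
    return (length($line), length(encode('UTF-8', $line)));
}

# U+5927 U+6B63: each kanji takes three bytes in utf-8.
my ($chars, $bytes) = char_and_byte_length("\x{5927}\x{6B63}");
print "$chars characters, $bytes bytes\n";   # prints "2 characters, 6 bytes"
```

So an 80-character line of kanji is roughly 240 bytes on disk.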

My current Perl script does what I had originally hoped for. It is 
invoked as "supersearch kanjistring fileglob". (It is in utf-8, just 
for the "maru" and hard space in the skipping string.)

#!/usr/bin/perl -0777 -n
BEGIN {($s = shift) =~ s/(.)/$1\[0-9pabc()|。 \n\015]*/g;}
@matches = m/$s.*[0-9pabc()|。 \n\015]*/gxo;
if (@matches) {print "in file: ", $ARGV, "\n";
	 foreach $one (@matches) {print $one, "\n";}}


Briefly, what it does is:

#1. grab the first arg, replace every character with
#   that character+junk-skipper, and put the result into $s
#2. put all the matches in the entire slurped file
#   into the @matches array
#3. if there is anything in @matches, print out the file name,
#   then each match, which has at least one line number
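
The first two steps can be tried on a small in-memory string. This is only a sketch under a few assumptions: the text is already decoded to characters, the "hard space" is taken to be the ideographic space U+3000, and the sample text with its line numbers is made up. The names build_pattern and find_matches are hypothetical, and quotemeta is an addition to protect regex metacharacters in the search string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Junk that may sit between two kanji: digits, the letters p a b c,
# parentheses, "|", maru (U+3002), ideographic space (U+3000, an
# assumption about the "hard space"), plain space, LF and CR.
my $JUNK = '[0-9pabc()|\x{3002}\x{3000} \n\r]*';

# Step 1: follow every character of the search string with the junk-skipper.
# quotemeta protects regex metacharacters (an addition for this sketch).
sub build_pattern {
    my ($search) = @_;
    my $body = join '', map { quotemeta($_) . $JUNK } split //, $search;
    return qr/$body.*$JUNK/;
}

# Step 2: collect every match in one slurped, already-decoded text.
sub find_matches {
    my ($search, $text) = @_;
    my $re = build_pattern($search);
    return $text =~ /$re/g;
}

# Made-up sample: the two-kanji search string is split across a line break.
my $text = "0001a01 \x{5982}\x{662F}\n0001a02 \x{6211}\x{805E}\x{3002}\n";
my @hits = find_matches("\x{662F}\x{6211}", $text);

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@hits), " match(es)\n";
print "$_\n" for @hits;
```

The match runs from the first kanji of the search string to the end of the line where the string finishes, so the line number "0001a02" that sat between the two kanji is carried along in the output.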


I am hoping to take the next step and figure out how to index this heap. 
The script is really pretty fast for what it does, but it would be a big help 
to have real indexing and flexible context searching, 
as Steven says is done in FreeWAIS.
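
One simple way to start on indexing, sketched here as an assumption and not as a description of how FreeWAIS works, is a table from each adjacent pair of kanji to the files that contain it. A query then only needs the full regex search on files that contain all of its pairs. The names strip_junk, index_text and candidate_files are hypothetical, the junk class assumes the "hard space" is U+3000, and the sample text is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Strip the same junk the search script skips, leaving only the text proper.
sub strip_junk {
    my ($text) = @_;
    $text =~ s/[0-9pabc()|\x{3002}\x{3000} \n\r]+//g;
    return $text;
}

# Add one file's adjacent character pairs to the index:
# $index->{pair}{filename} = 1
sub index_text {
    my ($index, $name, $text) = @_;
    my $clean = strip_junk($text);
    for my $i (0 .. length($clean) - 2) {
        $index->{ substr($clean, $i, 2) }{$name} = 1;
    }
    return;
}

# Return the files that contain every pair of the query. These are only
# candidates: the full regex search still has to confirm each one.
# A query shorter than two characters has no pairs and returns nothing.
sub candidate_files {
    my ($index, $query) = @_;
    my $clean = strip_junk($query);
    my (%count, $needed);
    $needed = 0;
    for my $i (0 .. length($clean) - 2) {
        $needed++;
        my $files = $index->{ substr($clean, $i, 2) } || {};
        $count{$_}++ for keys %$files;
    }
    my @hits = grep { $count{$_} == $needed } keys %count;
    return sort @hits;
}

# Made-up sample: two numbered "files" held in memory.
my %index;
index_text(\%index, '0001', "0001a01 \x{5982}\x{662F}\n0001a02 \x{6211}\x{805E}\n");
index_text(\%index, '0002', "0002a01 \x{6211}\x{805E}\x{5982}\x{662F}\n");

my @files = candidate_files(\%index, "\x{662F}\x{6211}");
print "candidates: @files\n";
```

Holding the whole table in a hash is fine for a test, but for the full 326MB it would probably need to live on disk, for example in a DBM file.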

Thanks,

David Riggs, Kyoto

