[tlug] re: Searching for kanji strings

Date: Fri, 20 Jan 2006 10:58:04 +0900
From: David Riggs <dariggs@example.com>
Subject: [tlug] re: Searching for kanji strings
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2

Steven T, thanks for the FreeWAIS idea-- and yes, storage is cheap.

 >Steve S. wrote:
 >
 >> Would it be worth while to just write a simple [C] executable for this?
 >
 >It is wise to avoid premature optimization.
 >
 >First make it work _right_, then make it work _fast_.
 >
 >In the mean time, while using the existing solutions,
 >David will learn how good they are for a wide variety of cases,
 >and will also learn exactly what needs to be optimized.
 >
 >David, of the hundreds of megabytes of text, how big is each file?
 >What is the longest line in any of those files?
 >What is the largest file?
 >
 >Jim

I am indeed learning a lot as I go through this.

To answer the data questions: each line is 20 to 80 characters (not 
bytes-- the original data is big5, converted to utf-8 locally).

The 326MB is in 2460 files in 56 folders, 2.5MB max file size, with many 
less than 10K, and examples of everything in between. The organization 
is by the "Taisho" number, which maps to the title of a text. Most 
happily the file names are just numbers, not kanji, so no encoding 
problems there.

My current Perl script does what I had originally hoped for. It is 
invoked as "supersearch kanjistring fileglob". (It is in utf-8, just 
for the "maru" and hard space in the skipping string.)

#!/usr/bin/perl -0777 -n
BEGIN {($s = shift) =~ s/(.)/$1\[0-9pabc()|。　\n\015]*/g;}
@matches = m/$s.*[0-9pabc()|。　\n\015]*/gxo;
if (@matches) {print "in file: ", $ARGV, "\n";
	 foreach $one (@matches) {print $one, "\n";}}

Briefly, what it does is:

#1. grab the first arg, replace every character with
#   that character+junk-skipper, and put the result into $s
#2. put all the matches in the entire slurped file
#   into the @matches array
#3. if there is anything in @matches, print out the file name,
#   then each match, which has at least one line number
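
To make step 1 concrete, here is a minimal standalone sketch of how the junk-skipping pattern is built and why it matches across a line break. The helper name skipper_pattern and the sample text are made up for illustration, and the quotemeta call is an addition that is not in the one-liner above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

# Hypothetical helper: after each character of the search string, allow
# any run of "junk" (digits, the letters p a b c, parentheses, |, the
# maru, the full-width space, LF and CR).
sub skipper_pattern {
    my ($search) = @_;
    my $junk = '[0-9pabc()|。　\n\015]*';
    # quotemeta is an addition here, so that regex metacharacters in the
    # search string are matched literally.
    return join '', map { quotemeta($_) . $junk } split //, $search;
}

binmode STDOUT, ':encoding(UTF-8)';

my $pattern = skipper_pattern('如是我聞');

# Made-up sample: the phrase is split over two numbered lines.
my $text = "0001a05|如是\n0001a06|我聞。一時\n";

print "matched\n" if $text =~ /$pattern/;
```

Because the newline and the line-number prefix of the second line consist only of junk characters, the four-character phrase is still found even though it straddles two lines.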

Hoping I can make the next step and figure out how to index this heap. 
It is really pretty fast, for what it does, but it would be a big help 
to figure out how to do real indexing and do flexible context searching, 
as Steven says is done in FreeWAIS.
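
One possible direction for the indexing step, offered only as a rough sketch and not as how FreeWAIS works: record which files contain each adjacent pair of characters (after stripping the junk), then run the slow regex search only on files that contain every pair from the query. The sub names, the in-memory hash, and the sample data are all assumptions for illustration; a real index over 326MB would need to be stored on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

# Same junk set as the search script: line-number debris and punctuation.
my $junk = qr/[0-9pabc()|。　\n\015]/;

# Return the distinct adjacent character pairs of a text, junk removed.
sub bigrams {
    my ($text) = @_;
    (my $clean = $text) =~ s/$junk//g;
    my %seen;
    $seen{ substr($clean, $_, 2) } = 1 for 0 .. length($clean) - 2;
    return keys %seen;
}

# The index maps each pair to the set of file names that contain it.
sub add_to_index {
    my ($index, $name, $text) = @_;
    $index->{$_}{$name} = 1 for bigrams($text);
}

# Files containing every pair of the query (query needs 2+ characters).
# Only these candidates need the full regex scan afterwards.
sub candidates {
    my ($index, $query) = @_;
    my @grams = bigrams($query);
    my %count;
    for my $g (@grams) {
        $count{$_}++ for keys %{ $index->{$g} || {} };
    }
    my @files = sort grep { $count{$_} == @grams } keys %count;
    return @files;
}

# Made-up sample data standing in for two of the numbered files.
my %index;
add_to_index(\%index, '0001', "0001a05|如是\n0001a06|我聞。一時\n");
add_to_index(\%index, '0002', "0002a01|一時佛在\n");

binmode STDOUT, ':encoding(UTF-8)';
print join(',', candidates(\%index, '如是我聞')), "\n";
```

A file can contain every pair without containing the whole phrase, so the candidate list is only a filter; the existing script would still do the final match on those files.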

Thanks,

David Riggs, Kyoto
