Mailing List Archive



Re: [tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.



On 1/14/06, Josh Glover <jmglov@example.com> wrote:
On 14/01/06, David Riggs <dariggs@example.com> wrote:

> On the other hand, instead of searching each time, is there a text
> indexing and search system which works with unicode? All I find googling
> around is commercial stuff which seems orientated towards western languages.

I don't know about this, but Perl's regexp engine handles Unicode and
multi-line strings. Give Perl a whirl. (Sorry.)

-Josh
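
To make Josh's point concrete, here is a minimal sketch of a Unicode-aware match that skips punctuation and line ends. The helper name loose_pattern and the sample text are made up for illustration. The gap class [\s\p{P}]* covers only whitespace and Unicode punctuation, so anything classed as a symbol would need adding by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                            # the kanji literals below are UTF-8 source text

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: turn a literal search string into a regexp that
# tolerates whitespace (including newlines) and punctuation between
# each pair of characters.
sub loose_pattern {
    my ($needle) = @_;
    my $gap  = '[\s\p{P}]*';         # zero or more spaces / punctuation marks
    my $body = join $gap, map { quotemeta($_) } split //, $needle;
    return qr/$body/;
}

# Made-up sample: the phrase is broken by a comma and a newline.
my $text = "これは東京、\n大学の例です。\n";
my $re   = loose_pattern('東京大学');

if ($text =~ $re) {
    # $-[0] is the offset of the match start, counted in characters
    print "found at character offset $-[0]\n";
}
```

Because the text is left untouched, the offset still refers to the original string, which helps if you need to report where the match really is.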

Completely untested and off the top of my head.

As Jim says elsewhere, there are probably optimisations, particularly if you know, for instance, that big chunks of the file aren't interesting.  And if you can afford to load the whole line in at once, there are much more efficient ways to deal with it.  There's also an optimisation when the last character on the line is clearly not part of your match, but it's probably not worth much.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;  # Source file is utf8, not ascii

binmode STDIN, ':encoding(UTF-8)'; # Read multibyte characters, not bytewise (and validate the UTF-8)
binmode STDOUT, ':encoding(UTF-8)'; # ... and write them

my $oldline='';
while(defined(my $line=<STDIN>)) {
   chomp $line;
   $line=~s/\s+//g; # Remove whitespace (\b is a zero-width word boundary and would remove nothing); extend this to remove punctuation; multibyte characters allowed in the regexp as the source file is in utf8
   $oldline.=$line;

   # Won't work if your pattern can match a substring of itself; if it can, you will need
   # to mention the next character as well
   while(my ($throwaway, $match, $remainder)=$oldline=~/(.*?)(regexp here)(.*)/) {
      print "Matched \"$match\" on $.\n"; # Will say line 2000 even if the match started on line 1999.
      $oldline=$remainder;
   }
   # You know the length of what you're trying to match, so trim the front of $oldline to remove
   # characters you definitely don't want.  If you can't do this with substr or another regexp, this algorithm may be crap.
   # Keeps the last 2 characters, which assumes a 3-character pattern (pattern length minus 1 in general).
   # substr counts characters (kanji), not bytes, because STDIN is decoded above.
   $oldline=substr($oldline, -2) if length($oldline)>2;
}
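
If the input is small enough to hold in memory, the "much more efficient ways" mentioned above could look roughly like this (I'm reading that remark as loading the whole input at once). It is only a sketch: the helper name find_in_stripped and the sample text are invented, and the offsets it reports refer to the stripped text, not to lines in the original file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                            # the kanji literals below are UTF-8 source text

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: strip whitespace and punctuation from the whole
# text in one pass, then collect the character offset of every match
# within the stripped copy.
sub find_in_stripped {
    my ($text, $pattern) = @_;
    (my $stripped = $text) =~ s/[\s\p{P}]+//g;
    my @offsets;
    while ($stripped =~ /$pattern/g) {
        push @offsets, $-[0];        # start of this match, in characters
    }
    return @offsets;
}

# In a real script the text would be slurped instead, for example:
#   binmode STDIN, ':encoding(UTF-8)';
#   my $text = do { local $/; <STDIN> };
my $text = "東京、\n大学と東\n京大学。\n";   # made-up sample
my @hits = find_in_stripped($text, qr/東京大学/);
print "matches at stripped offsets: @hits\n";
```

This avoids the carry-over buffer entirely, at the cost of losing line numbers and needing the whole input in memory.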
