Mailing List Archive

Re: tlug: Two Qs re translation project



I've installed freeWAIS-sf with Baba-san's Japanese patch,
and run a simple test against a grep for the same term.
Here are the results, if anyone is interested.  The scripts
I used, both of them Tcl wrappers, are reproduced below.

One thing that emerged is that the two methods turn up different
results.  I was searching for the word "saibansho" (court),
and the Kakasi word-splitter (correctly) leaves
"saikosaibansho" (Supreme Court) as one word unsplit.
One section (section 3) of the new Code of Civil
Procedure (the text base used for the search) contains
this term, and no instance of "saibansho" on its own.
WAIS passed this over.  Grep picked it up.  I'm not sure
yet which is the better behavior for us, but it's a
difference, anyway.
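
A toy model of that difference, as a rough sketch only.  It does
not reflect how freeWAIS-sf or Kakasi work internally; the token
list is a hypothetical stand-in for the word-splitter's output,
and word_hit and substring_hit are made-up helper names.

```tcl
#!/usr/bin/tclsh
# Word-based lookup: the term must equal one whole indexed token.
proc word_hit {term tokens} {
    expr {[lsearch -exact $tokens $term] >= 0}
}

# Substring lookup: the term may sit anywhere inside the text.
proc substring_hit {term text} {
    expr {[string first $term $text] >= 0}
}

# Hypothetical tokens for a section that only mentions the Supreme Court.
set tokens {saikosaibansho kisoku}

puts "word index: [word_hit saibansho $tokens]"
puts "substring:  [substring_hit saibansho [join $tokens " "]]"
```

The word lookup reports 0 and the substring lookup reports 1,
which is the same split seen between WAIS and grep on section 3.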

There is a pretty important difference in performance (about
0.7 seconds of real time for WAIS against about 3.1 for grep):

*****

bash-2.01# time ./TEMP.grep

[snip]

Hits: 237

real    0m3.096s
user    0m1.730s
sys     0m1.370s

*****

bash-2.01# time ./TEMP.wais

[snip]

Hits: 236

real    0m0.702s
user    0m0.630s
sys     0m0.080s

bash-2.01# 

*****

Here are the scripts ...

("grep" suffers extra loss in the way the script is
written: Tcl's "exec" does not pass its arguments through a
shell, so a command-line wildcard is not expanded, and I had
to make a list of files and run grep on them one at a time.)
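
One possible way around this, as a sketch and not the script
that was timed: let Tcl's own "glob" expand the wildcard, then
hand the whole file list to a single grep.  The helper name
count_matching_files is made up for illustration, and the
"-types" and "-directory" options assume Tcl 8.3 or later.

```tcl
#!/usr/bin/tclsh
# Count the plain files in a directory that contain a pattern,
# using one grep process for the whole file list.
proc count_matching_files {pattern dir} {
    # glob does the expansion that a shell would normally do.
    set files [glob -nocomplain -types f -directory $dir *]
    if {[llength $files] == 0} {
        return 0
    }
    # grep exits with status 1 when nothing matches, which exec
    # reports as an error, so catch it and treat it as zero hits.
    if {[catch {eval exec grep -l -- [list $pattern] $files} out]} {
        return 0
    }
    # grep -l prints one matching file name per line.
    return [llength [split $out \n]]
}
```

Usage would be something like
"puts [count_matching_files court ./civpro]".  On Tcl 8.5 or
later the "eval exec" line can instead be written as
"exec grep -l -- $pattern {*}$files".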

*****

#!/usr/bin/tclsh
#
# For grep ...
#
set ifh [open "|find ./civpro/ -mindepth 1 -maxdepth 1 -type f" r]
set count 0
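while {[gets $ifh line]>-1} {
  # The prefix (^|/)([^/][^/])* consumes characters in pairs before the
  # term, presumably to keep the match on a two-byte character boundary.
  catch {exec grep -l --extended-regexp \(^|/\)\(\[^/\]\[^/\]\)*裁判所 $line} error
  # With no match, grep exits nonzero and Tcl reports "child process
  # exited abnormally", so the word "process" marks a miss.
  if {![regexp process $error]} {
#    puts [file tail $error]
    incr count
  }
}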
puts \n\[snip\]\n
puts "Hits: $count"

*****

#!/usr/bin/tclsh
#
# For WAIS ...
#
set ifh [open "|waisq -s /.commons/waisindexes/ \
            -m 2000 \
            -S civpro_ja.src \
            -f - \
            裁判所 | \
          waisq -g" r]
set count 0
while {[gets $ifh line]>-1} {
  if {[regexp ":headline \\\"(\[a-zA-Z0-9=_-\]*)" $line ignore file]} {
#   puts $file
   incr count
  }
}
puts \n\[snip\]\n
puts "Hits: $count"

*****

And, uh ... that's it.

Cheers,
Frank Bennett

--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
