Re: tlug: Two Qs re translation project
- To: tlug@example.com
- Subject: Re: tlug: Two Qs re translation project
- From: "Frank Bennett (フランクべネット)" <bennett@example.com>
- Date: Sat, 29 Jan 2000 16:12:08 +0900
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=iso-2022-jp
- In-Reply-To: <000701bf6968$d3c9db80$10210685@example.com>; from Frank Bennett on Fri, Jan 28, 2000 at 05:22:33PM +0900
- References: <000701bf6968$d3c9db80$10210685@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
I've installed freeWAIS-sf with Baba-san's Japanese patch, and run a simple test against a grep for the same term. Here are the results, if anyone is interested. The scripts I used, both of them Tcl wrappers, are reproduced below.

One thing that emerged is that the two turn up different results. I was searching for the word "saibansho" (court), and the Kakasi word-splitter (correctly) leaves "saikosaibansho" (Supreme Court) unsplit, as one word. One section (section 3) of the new Code of Civil Procedure (the text base used for the search) contains this term, and no instance of "saibansho" on its own. WAIS passed this over. Grep picked it up. That accounts for the one-hit gap in the counts below (237 against 236). I'm not sure yet which is the better behavior for us, but it's a difference, anyway.

There is a pretty important difference in performance; WAIS finishes in under a quarter of grep's wall-clock time:

*****
bash-2.01# time ./TEMP.grep

[snip]

Hits: 237

real 0m3.096s
user 0m1.730s
sys 0m1.370s
*****
bash-2.01# time ./TEMP.wais

[snip]

Hits: 236

real 0m0.702s
user 0m0.630s
sys 0m0.080s
bash-2.01#
*****

Here are the scripts ... ("grep" suffers extra loss in the way the script is written, but Tcl wouldn't recognize a command-line wildcard under the "exec" command, so I had to make a list of files and tackle them one at a time.)

*****
#!/usr/bin/tclsh
#
# For grep ...
#
set ifh [open "|find ./civpro/ -mindepth 1 -maxdepth 1 -type f" r]
set count 0
while {[gets $ifh line]>-1} {
    # grep -l prints the file name on a match. On no match it exits
    # non-zero, and the Tcl error text then mentions the child "process".
    catch {exec grep -l --extended-regexp \(^|/\)\(\[^/\]\[^/\]\)*裁判所 $line} error
    if {![regexp process $error]} {
        # puts [file tail $error]
        incr count
    }
}
puts \n\[snip\]\n
puts "Hits: $count"
*****
#!/usr/bin/tclsh
#
# For WAIS ...
#
set ifh [open "|waisq -s /.commons/waisindexes/ \
    -m 2000 \
    -S civpro_ja.src \
    -f - \
    裁判所 | \
    waisq -g" r]
set count 0
while {[gets $ifh line]>-1} {
    # Count one hit for each result line that carries a :headline field.
    if {[regexp ":headline \\\"(\[a-zA-Z0-9=_-\]*)" $line ignore file]} {
        # puts $file
        incr count
    }
}
puts \n\[snip\]\n
puts "Hits: $count"
*****

And, uh ... that's it.
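A footnote on the differing hit counts, for anyone who wants to see the effect in isolation. The sketch below is only an illustration under stated assumptions: the three sample files and their romanized contents are made up, and grep's -w (whole-word) option is used as a rough stand-in for searching a word-level index. It is not a description of what freeWAIS-sf or Kakasi do internally.

```shell
#!/bin/sh
# Hypothetical sample data: romanized stand-ins for three statute sections.
dir=$(mktemp -d)
printf 'the saibansho shall decide\n'   > "$dir/sec1"
printf 'appeal to the saikosaibansho\n' > "$dir/sec3"
printf 'no relevant term here\n'        > "$dir/sec5"

# Substring match, as plain grep does: sec1 and sec3 both count as hits.
substring_hits=$(grep -l 'saibansho' "$dir"/* | wc -l | tr -d ' ')

# Whole-word match (-w), loosely like looking up one indexed word:
# "saikosaibansho" is a different word, so sec3 is passed over.
word_hits=$(grep -lw 'saibansho' "$dir"/* | wc -l | tr -d ' ')

echo "substring: $substring_hits  whole-word: $word_hits"
rm -rf "$dir"
```

The substring search counts one more file than the whole-word search, the same kind of one-hit gap as in the timings above. Because the shell expands the wildcard itself, each count also comes from a single grep invocation rather than one per file.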
Cheers,

Frank Bennett

--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting: March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp     Sponsor: Global Online Japan