Mailing List Archive

Re: new webpage: rikai.com



On Wed, Sep 13, 2000 at 08:15:09PM +0100, Simon Cozens wrote:
> Ah, I see. I'll have a free equivalent GPLed by the end of next week.

Or sooner. Here's the back-end, which you can use already. It requires ChaSen
(which ships with the Text::ChaSen Perl module, although that isn't installed
by default) and the HTML::Parser module.

Save the script below as "annotate". If you now do

    perl annotate < old.html > new.html

new.html will be a copy of old.html with pop-up boxes giving the deinflected
compound, kana reading and part of speech. Cool, huh? Suggestions (and
patches!) welcome.
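
To give a feel for how the back-end decides what to hand to ChaSen: it splits
each text node on runs of high-bit bytes and passes plain ASCII through
untouched. Here is a minimal, core-Perl-only sketch of just that step. The
split_runs helper and the sample string are made up for illustration and are
not part of the script itself.

```perl
use strict;
use warnings;

# Hypothetical helper: split a byte string into alternating runs of
# ASCII and high-bit (candidate EUC-JP) bytes, tagging each run.
sub split_runs {
    my ($text) = @_;
    my @runs;
    for my $chunk (split /([\x80-\xff]+)/, $text) {
        # split yields an empty leading field if the text starts with high-bit bytes
        next if $chunk eq '';
        my $kind = $chunk =~ /[\x80-\xff]/ ? 'euc' : 'ascii';
        push @runs, [$kind, $chunk];
    }
    return @runs;
}

# The word "Nihongo" in EUC-JP, written as byte escapes so this file stays ASCII.
my $sample = "I read \xC6\xFC\xCB\xDC\xB8\xEC daily";
for my $run (split_runs($sample)) {
    my ($kind, $bytes) = @$run;
    printf "%-5s %d bytes\n", $kind, length $bytes;
}
```

Only the 'euc' runs would go to Text::ChaSen::sparse_tostr; everything else is
printed as-is, which keeps ASCII text and entities out of the analyser.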

ChaSen's installed size is less than 400k, so don't worry about having to drag
down masses of stuff - you don't.

Adding the front-end HTTP proxy is trivial, since there's a module called
POE::Filter::HTTPD which does just that...

-- 
I've looked at the listing, and it's right!
		-- Joel Halpern

---cut here---
use HTML::Parser;
use Text::ChaSen;

$cset = '<META http-equiv=\"Content-Type\" content=\"text/html; charset=EUC-JP\">';
$res = Text::ChaSen::getopt_argv('chasen-perl', '-j', '-F', '%m\t\t%M\t\t%y\t\t%U(%P-)\t\t%T \t%F \n');
$p = HTML::Parser->new( api_version => 3, marked_sections => 1);
$p->handler(text => \&dotext, 'text');
$p->handler(default => sub { print @_ }, 'text'); # Just spit out markup as-is
$p->parse_file(*STDIN);

sub dotext {
    $_ = shift;
    return unless /\S/; # Forget empty things...
    for (split /([\x80-\xff]+)/) { 
        unless (/[\x80-\xff]/) { print $_; next; } # Split out non-EUC
        my @tokens = split /\n/, Text::ChaSen::sparse_tostr($_); # Parse it!
        pop @tokens; # Drop ChaSen's trailing EOS line
        for (@tokens) {
            my ($kanji, $deinflected, $yomi, $pos) = split /\t\t/, $_,4;
            if ($pos eq "\xCC\xA4\xC3\xCE\xB8\xEC") { print $kanji; next; } # Pass unknowns (POS "未知語" as EUC-JP bytes).
            print <<EOF
    <A HREF="javascript:" onMouseOver='
        mywin = window.open("","","width=200,height=200");
        mywin.document.write("$cset<B>Word</B>:$kanji<P><B>Root</B>:$deinflected<P><B>Reading</B>: $yomi<P><B>Part of Speech</B>: $pos");
        mywin.document.close();
    ' 
    onMouseOut='mywin.window.close(); return true;'>
$kanji</A>
EOF
        }
    }
} # Compact, isn't it?

