Mailing List Archive

Re: "My Kanpo" open law project



On Thu, Mar 01, 2001 at 04:36:24PM +0900, Jim Breen wrote:
> >> Date: Thu, 1 Mar 2001 15:48:34 +0900
> >> From: "Frank BENNETT" <bennett@example.com>
> >> 
> >> I have come up with an algorithm for converting vertically-formatted
> >> Japanese PDF text to reading-order text that can be viewed on a
> >> TTY console.  It still has some rough spots, but with a bit more
> >> work, it promises to do a good job of conversion.  If PDF irritates
> >> you as much as it does me, this may interest you.
> 
> Sounds great. Any chance to see some samples?

Your wish is my command set.  A partial converted archive is available for
viewing at:

  http://www.nomolog.nagoya-u.ac.jp/~bennett/rippinhood/

Use username "just", password "lookin" to authorize a connection. I'll leave
offsite password access in place until the 10th -- I'm planning to offer an
advance view to the MOF Printing Bureau anyway, so you-all can share a
password with them until then.

I would appreciate it if this info were confined to members of TLUG, though.
It's not that I'm trying to keep this under wraps, but I _don't_ have approval
from our faculty to serve this stuff to the world at large, and it _would_ be
embarrassing if someone in government complained to my Dean.  If you want
unrestricted access, set up the software and run your own mirror.  :-)

If you do take a look, note that the archive stops on 2 February, which is not
right.  We do have source on file up to the present, but a bug in the cascading
conversion algorithm is holding things up.  I haven't gotten around to
repairing it yet because this _is_ still just a test suite.  However, you can
use what's there to get a feel for the search engine and see what the
conversion filter does with vertical PDF.

Our gateway is VERY slow at the moment.  You might want to try the connection
in the mornings, when things seem to be a little less sluggish from this end ...

By the bye, if you run a search for インターネット, the _last_ two pages in
the list returned show what the algorithm does for orthodox vertically
formatted pages arranged into ranks that do not change height mid-page --
very nice.  The other pages returned show what sort of irritating broken
weirdness happens when tables and other erratica are thrown into the middle of
the page -- common in Kanpo, so something that I need to fix.  Most of the
weirdness that results can be controlled, with a little more effort and
possibly some work (by someone other than ignorant me) on the source code to
xpdf's pdftotext filter. Ultimately, it would be nice to see the entire
formatting algorithm incorporated into xpdf -- it is indifferent to text
direction -- but Python is working very nicely as a prototyping platform, so
that can wait until things stabilize.
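
For anyone who wants a feel for the "ranks" idea above, here is a minimal
Python sketch.  It is NOT the actual conversion code, which is not shown in
this message; it only illustrates one plausible way to get reading order out
of positioned glyphs.  The Glyph record, the function names and the two
tolerance parameters are all hypothetical, and it assumes y grows downward
and that ranks keep a constant height, i.e. the easy case with no tables.

```python
from collections import namedtuple

# Hypothetical glyph record: one character plus its page position.
# Assumption: y grows downward (flip the sign if your extractor differs).
Glyph = namedtuple("Glyph", "char x y")


def group_by(items, key, tolerance):
    """Sort items by key, then start a new group whenever the key jumps
    by more than `tolerance` from the previous item."""
    groups = []
    last = None
    for item in sorted(items, key=key):
        value = key(item)
        if last is not None and value - last <= tolerance:
            groups[-1].append(item)
        else:
            groups.append([item])
        last = value
    return groups


def reading_order(glyphs, rank_gap=20.0, column_tol=2.0):
    """Return one string per vertical column, in reading order.

    rank_gap must be larger than the glyph pitch inside a column and
    smaller than the blank space between two ranks; column_tol absorbs
    small horizontal jitter inside a single column.
    """
    lines = []
    # Ranks are horizontal bands, read from the top of the page down.
    for rank in group_by(glyphs, key=lambda g: g.y, tolerance=rank_gap):
        columns = group_by(rank, key=lambda g: g.x, tolerance=column_tol)
        # Inside a rank, columns are read from right to left.
        for column in reversed(columns):
            # Inside a column, characters are read from top to bottom.
            column.sort(key=lambda g: g.y)
            lines.append("".join(g.char for g in column))
    return lines
```

Feed reading_order() a list of Glyph records for one page and it returns the
columns as plain strings, one per line, ready for a TTY.  A table dropped
into the middle of the page breaks the constant-height assumption, which is
exactly the sort of case that produces the weirdness described above.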

Cheers,
Frank
