Mailing List Archive


"My Kanpo" open law project



Dear TLUG people,

Late last year and early this, I wrote to the list a few times about
converting some access-restricted Japanese PDF files into plain text. As I may
have mentioned, this was part of a larger scheme aimed at freeing up access to
the laws of Japan -- for Japanese people as well as for gringos like myself.
With your help, I have been able to get my tools into a form (just barely) fit
for public inspection.  This is a progress report cum publicity announcement
cum invitation to do extra work^H
                  get involved.  Apart from making Japan a better place,
I have come up with an algorithm for converting vertically-formatted
Japanese PDF text to reading-order text that can be viewed on a
TTY console.  It still has some rough spots, but with a bit more
work, it promises to do a good job of conversion.  If PDF irritates
you as much as it does me, this may interest you.
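
For anyone who wants a feel for the problem, here is a minimal sketch of one
way to do this kind of reordering.  It is not the code from the suite, and it
rests on assumptions of my own: that each glyph has already been extracted
with its page coordinates, that columns run right to left across the page,
and that glyphs within a column run top to bottom.  The `Glyph` type, the
function name and the `column_tolerance` parameter are all hypothetical.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position of the glyph, in points
    y: float  # vertical position; PDF origin is bottom-left, so larger y is higher up


def vertical_to_reading_order(glyphs: List[Glyph],
                              column_tolerance: float = 5.0) -> List[str]:
    """Return one string per vertical column, in reading order.

    Assumes right-to-left columns and top-to-bottom glyphs within a column.
    """
    columns: List[List[Glyph]] = []
    # Walk the glyphs from the right edge of the page to the left.
    for glyph in sorted(glyphs, key=lambda g: -g.x):
        # A glyph joins the current column if it sits close enough to the
        # column's rightmost glyph; otherwise it starts a new column.
        if columns and abs(columns[-1][0].x - glyph.x) <= column_tolerance:
            columns[-1].append(glyph)
        else:
            columns.append([glyph])
    # Within each column, read from the top of the page downwards.
    return ["".join(g.char for g in sorted(col, key=lambda g: -g.y))
            for col in columns]
```

Each returned string is one column, so printing them one per line gives
horizontal text that a TTY can display.  Real pages have headers, tables and
multi-tier layouts that a single pass like this would not handle.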

The target text appears once each week at the following site:

  http://kanpou.pb-mof.go.jp/

This is "Kanpo", the Japanese official gazette, in which all laws, Ministry
regulations and orders, Ministry announcements, and certain judicial
determinations and orders are published.  Contrary to the warning language
published on the site, all of this material is excluded from copyright
protection under section 13 of the Japanese Copyright Act.  This has been
confirmed by the Ministry of Finance Printing Bureau, both to me and to an
editor at the Asahi Shinbun.

The files for each issue of Kanpo last on the site for one week only, and are
afflicted with Ad*b* Acr*b*t access control restrictions that prevent Acr*b*t
from doing much more than displaying the text on your monitor.  No save, no
print, no cut and paste.  (If they could have prevented the information from
entering your brain, I suppose they would have imposed that restriction as
well ...)

The objective is to create a persistent copy of the text content of these
files, under a full text search engine, so that people with an interest in
this material (journalists, lawyers, businessmen, members of the general
public, and bureaucrats themselves) can experience the convenience of
unrestricted access to the law as it rolls off the press.

I have written a (pretty ugly) Python script (my first) that automates the
downloading, conversion and indexing of the archive.  On March 10, the
script and related patches will be "cited" in a short article that I have
placed in the Tokyo Internet Law Review, edited by law students at the
University of Tokyo.  The "cite" will lead to the CVS archive for the suite on
SourceForge, who have allowed the project in after full disclosure of what
it's all about.

The docs and whatnot are still pretty rough, but anyone who is interested can
grab a copy of the sources thus:

  cvs -d:pserver:anonymous@example.com:/cvsroot/mykanpo login
  cvs -d:pserver:anonymous@example.com:/cvsroot/mykanpo co dist

Beware, though, that I need to do some reorganization; the "dist" module
includes the full sources of Python, freeWAIS-sf, kakasi, and some other
weighty items.  If you're on a narrow bandwidth line, you won't want to check
the dist module out en bloc.

The CGI script in the suite will require lots of work to make it capable of
handling the sort of load that an archive of this kind will eventually
attract.  Text is currently stored in disk files, and the WAIS search engine I
used for development launches a fresh instance for every search.  For
production, everything should move into MySQL, accessed through FastCGI, and
that will require lots of rewriting.
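
To illustrate why a long-running process helps, here is a rough sketch of an
index that is built once and then answers many queries, instead of starting
from scratch on every search.  It is a stand-in, not the WAIS or MySQL setup
described above: it keeps everything in memory and tokenizes by overlapping
character pairs (bigrams) rather than by proper word segmentation.  The class
name, the document IDs and the sample text are all made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def bigrams(text: str) -> Set[str]:
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = "".join(text.split())
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


class SearchIndex:
    """Inverted index that is built once and reused for every query."""

    def __init__(self) -> None:
        # Maps each bigram to the set of document IDs that contain it.
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in bigrams(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> List[str]:
        tokens = bigrams(query)
        if not tokens:
            # Queries shorter than two characters produce no bigrams.
            return []
        # A document is a candidate only if it contains every query bigram.
        hits = set.intersection(
            *(self._postings.get(token, set()) for token in tokens)
        )
        return sorted(hits)


# Built once when the server process starts, then shared by all requests.
index = SearchIndex()
index.add("issue-0001", "著作権法の一部を改正する法律")
index.add("issue-0002", "財務省令第十号")
print(index.search("著作権"))
```

Under a per-request CGI, the equivalent of the `index.add` calls would be paid
on every search; under a persistent process they are paid once.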

That's about it.  Thanks to those who offered advice on this.  Comments are,
as ever, welcome.  And if you know of anyone who might be interested in
helping to push this forward, do please put them in touch.

Cheers,
Frank
