Mailing List Archive


tlug: Japanese PDF file contents



Short version:  Is there a PDF->text stripper of some sort that will work
with Japanese PDF files encoded in Shift-JIS?

***

Long version:

Another translation query.  The Japanese government issues updates to the
law through an official registry called "Kanpo".  This publication is now
online at:

  http://kanpou.pb-mof.go.jp/

The actual text of the updates is distributed as a series of PDF files. 

A colleague and I were commiserating with one another about the
difficulties of applying the update to the actual text of the law --- "In
sections 1, 4(3) and 23.2(6)(ii) of the Dogcatcher Investment Assistance
Act, replace the word 'net' with the words 'rope or collar'."

I suggested that it might be possible to generate a patch file from the
Kanpo PDF --- the phrasing is highly structured, so this might work well
enough to save some work for the community.
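
To make the idea concrete, here is a rough sketch in Python of turning
one such instruction into a mechanical edit.  It assumes the English
paraphrase quoted above (written here with double quotes around the
words); the real Kanpo wording is Japanese and would need its own
pattern.  The names INSTRUCTION, parse_instruction and
apply_instruction, and the dict-of-sections layout for the law, are all
made up for illustration.

```python
import re

# Hypothetical pattern for the English-style instruction quoted in this post.
# Real Kanpo text is Japanese and would need a different pattern.
INSTRUCTION = re.compile(
    r'In sections? (?P<sections>.+?) of the (?P<act>.+?), '
    r'replace the words? "(?P<old>[^"]+)" with the words? "(?P<new>[^"]+)"'
)


def parse_instruction(text):
    """Return the parts of an amending instruction, or None if no match."""
    m = INSTRUCTION.search(text)
    if m is None:
        return None
    # Split "1, 4(3) and 23.2(6)(ii)" into individual section labels.
    sections = [s for s in re.split(r',\s*|\s+and\s+', m.group("sections")) if s]
    return {
        "act": m.group("act"),
        "sections": sections,
        "old": m.group("old"),
        "new": m.group("new"),
    }


def apply_instruction(law, instr):
    """Apply the replacement to the named sections of `law`.

    `law` is assumed to be a dict mapping section label -> section text.
    Sections not named in the instruction are left untouched.
    """
    # Whole-word match so that "net" does not also hit "network".
    word = re.compile(r'\b' + re.escape(instr["old"]) + r'\b')
    updated = dict(law)
    for label in instr["sections"]:
        if label in updated:
            updated[label] = word.sub(lambda _m: instr["new"], updated[label])
    return updated
```

Diffing the old and new section text (with difflib, say) would then
give something close to a patch file.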

I've been playing with Tcl's HTTP facilities, and I can see that it
will be a simple matter to walk through the menus and snatch the PDF
files themselves on a daily basis.  However, I can't find anything like
intelligible text in there.  Does anyone know if there is a stripper out
there that can dump just the text of a Japanese PDF to a file so that it
can be made useful to a scripting language?
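
For what it's worth, the kind of wrapper I have in mind would look
roughly like the sketch below.  It assumes the pdftotext command-line
tool (from xpdf or poppler) is installed and that its -enc option
accepts the encoding named; I have not verified that it copes with the
Kanpo files.  It can only help if the PDFs contain real text with
usable character maps rather than scanned page images.  The function
names are made up for illustration.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, encoding="UTF-8"):
    """Build the pdftotext command line: pdftotext -enc ENC in.pdf out.txt."""
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]


def extract_text(pdf_path, txt_path, encoding="UTF-8"):
    """Dump the text of a PDF to txt_path and return it as a string.

    Assumes pdftotext is on PATH; raises RuntimeError if it is not, and
    subprocess.CalledProcessError if the tool itself fails.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, encoding), check=True)
    # Read the dump back using the same encoding it was written in.
    with open(txt_path, encoding=encoding) as fh:
        return fh.read()
```

Something like extract_text("kanpo.pdf", "kanpo.txt") would then hand
plain text to whatever script does the pattern matching; the file names
are placeholders.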

Many thanks for any suggestions,
Frank Bennett

--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan

