Encrypted PDF



Yes, folks, it's time for that topic again.  Periodically, I have come
back to the list with requests for information about uncanning the
text in Japanese PDF files.  Here I am again.

This is a longish message.  The short questions for those in a hurry are:

  o Does anyone here know of a tool other than xpdf/pdftotext for
    extracting text from an encrypted PDF file containing Japanese
    Shift-JIS encoded text?

and

  o Is anyone here friends with Derek Noonburg?
  
Now the full story.  But first, a brief recap:

  The last time we visited our hero, he was attempting to extract Japanese
  plain text from access-restricted PDF files published to the Web by the
  Printing Bureau of the Japanese Ministry of Finance.  (His idea is that the
  law, of all things, should not be published in a form that restricts
  distribution, so he is trying to give the Japanese government a nudge in the
  right direction.) The encoding of the text in the files is Shift-JIS, in
  vertical orientation.  The original files can be found at:

    http://kanpou.pb-mof.go.jp/

The last time I tuned into the list on this, someone (Shimpei?) suggested that
I use xpdf.  I fetched that, compiled pdftotext with the decryption patches
(now merged into the main source tree as of version 0.91), and *thought* that
it had pretty well solved the problem, apart from a few missing vertical-style
characters that I can hack in on my own.

Today, I discovered that pdftotext stops processing many of the target files
before the actual end of the text.  This seems to be associated with the
substitution of ASCII equivalents for special characters, such as "TM", "ae",
"ff" and so forth -- but no such characters exist in the PDF text at the point
where pdftotext thinks it finds them.  I have tried disabling these
substitutions in the source of pdftotext, but the output stops at the same
point anyway.
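
For anyone who wants to poke at the same symptom, here is a minimal sketch of
one way to narrow down where extracted output goes wrong, assuming the
pdftotext output has been saved to disk as raw Shift-JIS bytes.  The helper
name first_invalid_offset is hypothetical (it is not part of xpdf), and the
check only helps if the breakage coincides with a malformed byte sequence,
which I have not confirmed for these files.

```python
from typing import Optional


def first_invalid_offset(data: bytes, encoding: str = "shift_jis") -> Optional[int]:
    """Return the byte offset of the first undecodable sequence, or None.

    Hypothetical diagnostic helper: it reports where the bytes stop being
    valid in the given encoding, not why extraction stopped there.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


if __name__ == "__main__":
    sample = "日本語".encode("shift_jis")  # three two-byte characters
    print(first_invalid_offset(sample))
    print(first_invalid_offset(sample + b"\xff" + sample))
```

Comparing the reported offset against the tail of the output file would at
least show whether the text ends cleanly or in the middle of a character.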

This is now well beyond my meagre computing skills.  I need to find a way to
fix pdftotext for use on this class of PDF file, find another
decryption/extraction tool, or give up on the project as a serious
republication effort.

Xpdf and pdftotext are written by Derek Noonburg.  I patched his source in
order to get around the access restriction on these files.  He has this to say
about that:

  I occasionally get email asking if I can explain how to crack a PDF file, or
  if I can help decrypt a PDF file. I won't help these people because I
  believe that an author's requests relating to the use of his/her work
  should be honored.

  I distribute source code (for Xpdf) under a particular license (the GPL)
  which depends entirely on users' goodwill for its effectiveness. If any of
  my users ever decided to violate the license, I would probably never even
  know about it, much less be able to do anything about it. The only thing I
  can do is trust the users.

  In light of this, it would be very hypocritical of me to, on one hand, ask
  people to honor my licensing restrictions, and, on the other hand, bypass
  (or assist others in bypassing) another author's requested restrictions.

I believe that this is a special case; the ultimate author of Japanese law is
the Japanese public.  All I am trying to do is make it available to them, via
means which are legal in Japan to the best of my knowledge.  I think it's a
persuasive case, but since the author of xpdf doesn't know me from a box of
apples, it would help if I could go to him with an introduction.  Anyone ... ?

Cheers,
Frank
