Mailing List Archive



Re: [tlug] PDF to text converter (was: Anyone alive out here ?)



On Thu, 5 Sep 2024 12:58:01 +0100, Darren Cook <darren@example.com> wrote:

> Quite closely related, I've been wondering what the state of the 
> art for open-source OCR is, particularly of Japanese text.
> ...
> This could then lead on to the greatest unsolved computing 
> challenge of the 21st century, which is a PDF to text converter.

Here is a link that talks about "extracting tabular data from PDF 
files and images of tables."  (I have not tried the software.)
Apparently it uses Tesseract for OCR.

https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
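For anyone who wants to try Tesseract by itself on a Japanese page image, 
here is a minimal sketch. It assumes the tesseract command-line program 
and its "jpn" language data are installed; the function names are made up 
for illustration and are not taken from the linked project.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="jpn", psm=6):
    """Build the argument list for a Tesseract CLI call.

    "stdout" as the output base makes Tesseract print the recognized
    text instead of writing a .txt file. --psm 6 treats the page as a
    single uniform block of text (an assumption about the input).
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="jpn"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Calling ocr_image("page.png") on a scanned page (page.png is a placeholder 
name) should return whatever text Tesseract recognizes. Table layout is a 
separate problem that this sketch does not attempt.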


Here is another link, to a PDF extraction kit called MinerU. 
(Sorry, I haven't tried this one either.)

https://github.com/opendatalab/MinerU

Hope this helps,
jimb.
Jim Blackson



