
Mailing List Archive
Re: [tlug] PDF to text converter (was: Anyone alive out here ?)
On Thu, 5 Sep 2024 12:58:01 +0100, Darren Cook <darren@example.com> wrote:
> Quite closely related, I've been wondering what the state of the
> art for open-source OCR is, particularly of Japanese text.
> ...
> This could then lead on to the greatest unsolved computing
> challenge of the 21st century, which is a PDF to text converter.
Here is a link that describes "extracting tabular data from PDF
files and images of tables." I have not tried the software, but it
apparently uses Tesseract for OCR.
https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
Here is another link, to a PDF extraction tool called MinerU.
(Sorry, I haven't tried this one either.)
https://github.com/opendatalab/MinerU
Hope this helps,
jimb.
Jim Blackson