TLUG Mailing List

Re: [tlug] PDF to text converter (was: Anyone alive out here ?)

Date: Fri, 06 Sep 2024 16:43:22 +0900

From: Jim Blackson <blackson@example.com>

Subject: Re: [tlug] PDF to text converter (was: Anyone alive out here ?)

References: <2d9532be-b1af-42d1-a8cd-8eae13f9f9d5@codewiz.org> <f48593f1-8a3a-440d-9b29-33199b6dcef2@dcook.org>
On Thu, 5 Sep 2024 12:58:01 +0100, Darren Cook <darren@example.com> wrote:

> Quite closely related, I've been wondering what the state of the 
> art for open-source OCR is, particularly of Japanese text.
> ...
> This could then lead on to the greatest unsolved computing 
> challenge of the 21st century, which is a PDF to text converter.

Here is a link that talks about "extracting tabular data from PDF 
files and images of tables."  (I have not tried the software.)
Apparently it uses Tesseract for OCR.

https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html


Here is another link, to a PDF-Extraction-Kit called MinerU.
(Sorry, I haven't tried this one either.)

https://github.com/opendatalab/MinerU

Hope this helps,
jimb.
Jim Blackson