Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



> It is, it just doesn't get much traffic because all linguistic
> problems are straightforward. ;-)

:-) The problem described by Travis is one that comes up in various
places, but I struggle with it most in PDF files. Whenever I tell people
that PDF-to-text extraction is a hard, research-level problem, and that
there are even academic conferences for it, they look at me like I'm
obviously an idiot who hasn't tried applying Algorithms to it.
(https://xkcd.com/1831/)

Another key place it needs to be dealt with is OCR. It also comes up
when pulling paragraphs out of emails. And now I'll add restoring git
commits to the list!

Darren

