Mailing List Archive

Re: [tlug] Unicode/ICU question about joining lines

Darren Cook writes:

 > :-) The problem described by Travis is one that comes up in various
 > places, but I struggle with it most in PDF files. Whenever I tell people
 > that PDF to Text extraction is a hard problem, research-level problem,

 > Another key place it needs to be dealt with is in OCR.

But those are really different problems from Travis's, since in PDF
and OCR input spaces don't exist as coded characters, but rather as
offsets in image space.
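
To illustrate what "offsets in image space" means in practice, here is a
minimal sketch of how an extractor might guess where the spaces go. The
glyph format (character, x_start, x_end), the function name
glyphs_to_text, and the gap_ratio threshold are all hypothetical, chosen
for illustration; real PDF and OCR tools use more elaborate heuristics.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, x_end) in page units.
Glyph = Tuple[str, float, float]


def glyphs_to_text(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text from positioned glyphs.

    A space is inserted wherever the horizontal gap between neighbouring
    glyphs exceeds gap_ratio times the average glyph width. The threshold
    is an assumed heuristic, not a standard value.
    """
    if not glyphs:
        return ""
    # Order glyphs left to right by their starting x coordinate.
    ordered = sorted(glyphs, key=lambda g: g[1])
    avg_width = sum(x1 - x0 for _, x0, x1 in ordered) / len(ordered)
    threshold = gap_ratio * avg_width

    out = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - prev[2]  # distance from previous glyph's right edge
        if gap > threshold:
            out.append(" ")
        out.append(cur[0])
    return "".join(out)


line = [("a", 0.0, 5.0), ("b", 5.5, 10.5), ("c", 14.0, 19.0)]
print(glyphs_to_text(line))
```

Note that the input contains no space character at all; the space in the
result is inferred purely from geometry, which is why a fixed threshold
breaks down for text that mixes scripts with different spacing habits.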

BTW I was amused that Travis pointed out you didn't have to go to
ancient languages to find inconsistency in use of word-separating
spaces in a single script.  Of course it was Japanese!

