Re: [tlug] Character encoding stuff

> (1) In particular, when scraping jigsaw puzzle manufacturer websites, I 
> want to know what characters I'm looking at. ...

I'll mention this as useful for character encoding work, but I don't
know if it helps for what you are doing:

The ICU library is a heavy-duty set of functions, originally developed
by IBM (IIRC). It is built in to PHP 5.4.x, and can be added as a PECL
module for earlier versions.

> But it would be nice to get more than just numbers: stuff like 
> "Cyrillic", "Punctuation" etc. 

Is this a tool to use interactively, to satisfy your curiosity?
Or do you want to normalize/simplify/transliterate, to make your
pattern matching simpler?
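
For either case, here is a rough sketch of the idea, in Python rather
than PHP/ICU, using only the standard unicodedata module. The names
describe(), simplify() and MAJOR_CLASS are just my own for illustration.
Note the standard module has no script property: the first word of the
character name ("CYRILLIC", "LATIN", ...) is only a heuristic hint, and
the simplify() step is one possible normalization, not necessarily the
one you want.

```python
import unicodedata

# Major general-category classes, keyed by the first letter of the
# two-letter code that unicodedata.category() returns (e.g. "Po").
MAJOR_CLASS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch):
    """Return (code point, category code, major class, character name)."""
    cat = unicodedata.category(ch)
    # Some code points (e.g. control characters) have no name.
    name = unicodedata.name(ch, "<unnamed>")
    return ("U+%04X" % ord(ch), cat, MAJOR_CLASS[cat[0]], name)


def simplify(text):
    """Compatibility-decompose (NFKD), then drop combining marks.

    Assumption: losing accents is acceptable for the matching you do.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))


if __name__ == "__main__":
    # Print one line per character of a sample string.
    for ch in "\u041f\u0430\u0437\u043b, 1000!":
        print(*describe(ch))
```

The first function covers the "more than just numbers" wish; the second
folds full-width and accented forms down before pattern matching.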


Darren Cook, Software Researcher/Developer

