Mailing List Archive



Re: [tlug] HTML entity to Unicode conversion



On Fri, 11 Jul 2003, David Riggs wrote:

> How can I convert my diacritics in HTML entities (for example &#363; for u 
> with macron) into utf8 form? I finally got my Mac diacritics into a 
> standard form, and now I would like to change them, and the SJIS kanji into 
> Unicode.
> 
> I hope there is a Perl script or maybe something even simpler out there.

Your example, 363, is the Unicode code point (in decimal) for the u-macron.  
So the conversion would just be from the HTML numeric character reference, 
&#nnn;, into the UTF-8 bytes for that number.  Using regular expressions, 
that would be a simple program in Java (or Perl?).  (Sorry, I don't know of 
one offhand.)
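
For illustration only, here is a minimal sketch of that idea in Python rather 
than Java or Perl.  The function name decode_decimal_refs and the sample 
string are made up for this example, and it assumes every &#nnn; really is a 
decimal Unicode code point.

```python
import re

# Matches decimal numeric character references such as &#363;
DECIMAL_REF = re.compile(r"&#([0-9]+);")


def decode_decimal_refs(text: str) -> str:
    """Replace each &#nnn; with the character at that Unicode code point."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Leave the reference untouched if the number is not a valid
        # Unicode scalar value (too large, or in the surrogate range).
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return DECIMAL_REF.sub(replace, text)


# Hypothetical sample input; encoding the result gives the UTF-8 bytes.
sample = "Ry&#363;koku"
decoded = decode_decimal_refs(sample)
utf8_bytes = decoded.encode("utf-8")
```

The u-macron itself comes out as the two bytes C5 AB in UTF-8.  To convert a 
whole file, one would read it in its current encoding, run the function over 
the text, and write the result back out as UTF-8.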

It would not be so simple if the HTML entities represented some other 
encoding (e.g., ISO-8859-2), or if they were named (e.g., &copy;), 
or if they were in various bases (e.g., &#x16B;).
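
The named and hexadecimal cases can be covered with a lookup table, again 
only as a sketch.  This one uses the html5 name table from Python's standard 
html.entities module.  The function name decode_refs is made up, and the 
choice to leave references for &, <, >, and quotes alone (so the HTML markup 
stays valid) is my own assumption about what is wanted.  It does nothing for 
the other-encoding case.

```python
import re
from html.entities import html5  # maps names like "copy;" to characters

# Matches decimal (&#363;), hexadecimal (&#x16B;) and named (&copy;) references
REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Characters that must stay escaped for the HTML to remain well formed
MARKUP_CHARS = {"&", "<", ">", '"', "'"}


def decode_refs(text: str) -> str:
    """Decode numeric and named references, except markup-significant ones."""

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body.startswith("#"):
            # Numeric reference: base 16 after "#x" or "#X", else base 10
            if body[1] in "xX":
                code_point = int(body[2:], 16)
            else:
                code_point = int(body[1:])
            if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
                return match.group(0)  # not a valid code point, keep as is
            decoded = chr(code_point)
        else:
            # Named reference: unknown names are kept as is
            decoded = html5.get(body + ";")
            if decoded is None:
                return match.group(0)
        if decoded in MARKUP_CHARS:
            return match.group(0)
        return decoded

    return REFERENCE.sub(replace, text)
```

With this, &copy;, &#x16B; and &#363; are all decoded, while &amp; and &lt; 
pass through unchanged.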

11011011

