Mailing List Archive



Re: [tlug] Re: font/char set question



> Just wondering: Do you, or does anyone else, maintain a publicly available
> list of weird hyphens or other Unicode characters that don't strictly
> speaking map neatly back to anything in Shi(f)t-JIS, but in practice can be
> converted to something that does? (Encapsulated in a neat little class
> representing legacy-compatible-UTF8 strings would be best...)

By far the most common one is FULLWIDTH HYPHEN-MINUS (U+FF0D), which
should be turned into the katakana long vowel mark (ー, U+30FC).

Some others that might come up are SMALL HYPHEN-MINUS (U+FE63) and SMALL
EM DASH (U+FE58); convert both to an ASCII hyphen. Then convert
PRESENTATION FORM FOR VERTICAL EN DASH (U+FE32) and PRESENTATION FORM FOR
VERTICAL EM DASH (U+FE31) to an ASCII vertical bar.

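Sample PHP code to do all those conversions (the source file must be
saved as UTF-8):
  $s=str_replace("－","ー",$s);  // U+FF0D -> U+30FC (katakana long vowel mark)
  $s=str_replace("﹣","-",$s);   // U+FE63 -> ASCII hyphen
  $s=str_replace("﹘","-",$s);   // U+FE58 -> ASCII hyphen
  $s=str_replace("︲","|",$s);   // U+FE32 -> ASCII vertical bar
  $s=str_replace("︱","|",$s);   // U+FE31 -> ASCII vertical bar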

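And the round-trip conversion, to catch anything that still does not
survive the trip to Shift-JIS and back:
  $sjis=mb_convert_encoding($s,"SJIS","UTF-8");
  $utf8=mb_convert_encoding($sjis,"UTF-8","SJIS");
  // complain_to_user() stands in for whatever error handling you use
  if($s!==$utf8)complain_to_user();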
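
The replacements and the round-trip check above could be wrapped in a
pair of small helpers. This is only a minimal sketch: the function names
normalize_hyphens() and sjis_round_trips() are made up for illustration,
and it assumes the mbstring extension is loaded. The characters are
written as UTF-8 byte escapes so the result does not depend on how the
source file itself is saved.

```php
<?php
// Hypothetical helpers; only strtr() and mb_convert_encoding() are
// real PHP functions here.

// Replace the hyphen-like characters discussed above with equivalents
// that exist in Shift-JIS.
function normalize_hyphens(string $s): string
{
    static $map = [
        "\xEF\xBC\x8D" => "\xE3\x83\xBC", // U+FF0D -> U+30FC long vowel mark
        "\xEF\xB9\xA3" => "-",            // U+FE63 SMALL HYPHEN-MINUS
        "\xEF\xB9\x98" => "-",            // U+FE58 SMALL EM DASH
        "\xEF\xB8\xB2" => "|",            // U+FE32 vertical en dash
        "\xEF\xB8\xB1" => "|",            // U+FE31 vertical em dash
    ];
    // strtr() with an array replaces every key in a single pass.
    return strtr($s, $map);
}

// True if $s comes back unchanged after UTF-8 -> SJIS -> UTF-8.
function sjis_round_trips(string $s): bool
{
    $sjis = mb_convert_encoding($s, "SJIS", "UTF-8");
    $utf8 = mb_convert_encoding($sjis, "UTF-8", "SJIS");
    return $s === $utf8;
}

// Example: a SMALL HYPHEN-MINUS between digits becomes an ASCII hyphen.
$clean = normalize_hyphens("10\xEF\xB9\xA320"); // "10-20"
if (!sjis_round_trips($clean)) {
    // Report the problem however suits your application.
    echo "Input contains characters with no Shift-JIS equivalent\n";
}
```

Doing the normalization first and the round-trip check second means the
check only complains about characters the table does not know about, so
new entries can be added to the table as they turn up.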

Darren



-- 
Darren Cook
http://dcook.org/mlsn/ (English-Japanese-German-Chinese free dictionary)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)

