Mailing List Archive



Re: [tlug] Kanji file names-- how to change encoding,Mac OS X/Darwin file names




On Jan 21, 2006, at 10:47 PM, Stephen J. Turnbull wrote:

>>>>>> "Alain" == Alain Hoang <hoanga@example.com> writes:
>
>     Alain> These UTF-8 normalization forms and their interactions when
>     Alain> actually trying to deal with them are currently something
>     Alain> that looks like some black magic
>
> It's basically trivial.  In German, you can write ss or you can write
> ß, although the latter, composed, form is canonical.  The
> normalization forms simply dictate maximally composed and minimally
> composed forms, with rules for handling cases where there are multiple
> extrema.  Conformant software is supposed to handle both forms.

	Thanks for the explanation.  I guess it's not really black magic,
just ignorance on my part.  Hopefully I'm less ignorant on this
topic than I was before.  :)

>
>     Alain> The subtle differences of NFD and NFC manifested itself
>     Alain> when I was trying to write some text files using Vietnamese
>     Alain> in OS X then moved them over to a FreeBSD machine and
>     Alain> noticed the accent marks weren't attached.  *sigh*
>
> By "not attached" do you mean "not displayed as composed"?  The
> necessary information to fix that is in the large Unidata table, which
> tells you which characters are composed from others.  If you mean
> "lost", then you have seriously non-conforming software somewhere in
> the pipeline.

	Yes, I meant not composed.  The characters were definitely not
lost, just displayed in decomposed form.  The software was
conforming.  It was just the user who was unaware of the
issues.
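
For anyone who hits the same thing, here is a minimal sketch of the
difference between the composed and decomposed forms, using Python's
standard unicodedata module.  The sample character and the variable
names are only illustrations, not anything from my original files.

```python
import unicodedata

# "ế" as a single precomposed code point (what NFC produces)
composed = "\u1ebf"

# The same letter in NFD: base "e" + combining circumflex + combining acute
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(c)) for c in decomposed])   # ['0x65', '0x302', '0x301']

# The two strings are canonically equivalent but not equal code point by code point
print(composed == decomposed)              # False

# Normalizing back to NFC recomposes the letter
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)              # True
```

Running text through normalize("NFC", ...) before moving it to
another machine should give composed characters, assuming the text
was only decomposed and nothing was lost.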

After trying to recall exactly what I did a while back, I finally
retraced my steps.
I was trying to write something in Vietnamese and display it
in HTML.  For some reason, I decided I wanted to use HTML
escape codes for this, so in SubEthaEdit[1] I typed in something like
this:

tiếng việt

Then I used the Copy as XHTML function to get the HTML
escape sequences, which gave me something like
this:

tie&#770;&#769;ng vie&#803;&#770;t
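
I don't know what the editor does internally, but output like this
is what you would get by decomposing the text (NFD) and then escaping
every non-ASCII code point.  A rough sketch in Python, where
to_numeric_refs is a hypothetical helper written for this example:

```python
import unicodedata

def to_numeric_refs(text, form="NFD"):
    """Normalize text, then write each non-ASCII code point as &#N;."""
    normalized = unicodedata.normalize(form, text)
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in normalized)

word = "ti\u1ebfng vi\u1ec7t"            # "tiếng việt", precomposed

nfd_refs = to_numeric_refs(word, "NFD")  # decomposed: combining marks escaped
nfc_refs = to_numeric_refs(word, "NFC")  # composed: one reference per letter

print(nfd_refs)   # tie&#770;&#769;ng vie&#803;&#770;t
print(nfc_refs)   # ti&#7871;ng vi&#7879;t
```

The NFC version uses one reference per accented letter, so the
browser does not have to stack combining marks itself.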

	When it displayed in Firefox, the rising accent mark (sorry,
I don't know the name given to it in Unicode) was definitely
not displayed over the ê.  I have a small example
at http://samsara.bebear.net/tv2.html

	What was confusing was that Safari showed the composed
form while Firefox did not.

	Looking back on all this, I can attribute it to major
user error on my part: I didn't understand one whit about
Unicode normalization forms back then.  But I figure I should
explain as much as I can remember, in case anyone else ever
runs into a similar issue (unlikely), so they can avoid spending a
couple of hours confused like I did.


Alain

[1] An OS X Text Editor.  I've found it handy for doing quick
and dirty editing in UTF-8.
