
Re: [tlug] Kanji file names-- how to change encoding,Mac OS X/Darwin file names
On Jan 21, 2006, at 10:47 PM, Stephen J. Turnbull wrote:
>>>>>> "Alain" == Alain Hoang <hoanga@example.com> writes:
>
> Alain> These UTF-8 normalization forms and their interactions when
> Alain> actually trying to deal with them are currently something
> Alain> that looks like some black magic
>
> It's basically trivial. In German, you can write ss or you can write
> ß, although the latter, composed, form is canonical. The
> normalization forms simply dictate maximally composed and minimally
> composed forms, with rules for handling cases where there are multiple
> extrema. Conformant software is supposed to handle both forms.
Thanks for the explanation. I guess it's not really black magic
so much as ignorance on my own part. Hopefully, I'm less
ignorant on this topic than I was before. :)
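To make the two forms concrete, here is a minimal sketch using Python's standard unicodedata module. It takes the Vietnamese letter ế as one precomposed code point, decomposes it with NFD, and recomposes it with NFC. The helper name codepoints is made up for this illustration.

```python
import unicodedata

def codepoints(s):
    """Hypothetical helper: list the code points of s as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in s]

composed = "\u1ebf"  # ế as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # base letter + combining marks

print(codepoints(composed))    # ['U+1EBF']
print(codepoints(decomposed))  # ['U+0065', 'U+0302', 'U+0301']

# NFC puts the pieces back together into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings should render as the same letter in conforming software, but they compare unequal code point for code point until one of them is normalized.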
>
> Alain> The subtle differences of NFD and NFC manifested itself
> Alain> when I was trying to write some text files using Vietnamese
> Alain> in OS X then moved them over to a FreeBSD machine and
> Alain> noticed the accent marks weren't attached. *sigh*
>
> By "not attached" do you mean "not displayed as composed"? The
> necessary information to fix that is in the large Unidata table, which
> tells you which characters are composed from others. If you mean
> "lost", then you have seriously non-conforming software somewhere in
> the pipeline.
Yes, I meant not composed. The characters were definitely
not lost, just displayed in a decomposed form. The software was
conforming. It was just the user who was unaware of the
issues.
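For text files carried from one machine to another, one way to get the composed form back is to normalize the contents to NFC. The following is only a sketch under assumptions: the files are UTF-8, and the function names recompose_text and recompose_file are invented for this example rather than taken from any tool mentioned in this thread.

```python
import unicodedata
from pathlib import Path

def recompose_text(text: str) -> str:
    """Return text in NFC, i.e. with combining marks composed where possible."""
    return unicodedata.normalize("NFC", text)

def recompose_file(src: Path, dst: Path) -> None:
    """Read a UTF-8 file, normalize it to NFC, and write the result to dst."""
    dst.write_text(recompose_text(src.read_text(encoding="utf-8")), encoding="utf-8")

# The phrase from this message, first composed, then decomposed as a test input.
phrase = "ti\u1ebfng vi\u1ec7t"
decomposed = unicodedata.normalize("NFD", phrase)

print(len(phrase), len(decomposed))           # 10 14
print(recompose_text(decomposed) == phrase)   # True
```

The length difference comes from ế and ệ each expanding into a base letter plus two combining marks in NFD.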
After trying to recall exactly what I did a while back, I finally
retraced my steps.
I was trying to write something in Vietnamese and display that
in HTML. For some reason, I decided I wanted to use the HTML
escape codes for this, so in SubEthaEdit[1] I typed in something
like this:
tiếng việt
Then I used the Copy as XHTML function to get the HTML
escape sequences, which gave me something like
this:
tiếngviệt
When the page was displayed in Firefox, the rising accent mark
(sorry, I don't know the name given to it in Unicode) was
definitely not displayed over the ê. I have a small example
at http://samsara.bebear.net/tv2.html
What was confusing was that Safari showed the composed
form while Firefox did not.
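A rough sketch of how escaped HTML can end up with detached accents: if the text being escaped is in decomposed form, every combining mark gets its own numeric character reference, and the browser then has to attach the marks itself. That the editor handed over decomposed text is an assumption here, not something this message confirms, and to_numeric_entities is a made-up stand-in, not the Copy as XHTML function.

```python
import unicodedata

def to_numeric_entities(text: str) -> str:
    """Hypothetical escaper: replace each non-ASCII code point with &#N;.

    It does not escape &, < or >, so it is not a complete HTML escaper.
    """
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

phrase = "ti\u1ebfng vi\u1ec7t"                 # composed (NFC) form
nfd = unicodedata.normalize("NFD", phrase)      # decomposed form

print(to_numeric_entities(phrase))  # ti&#7871;ng vi&#7879;t
print(to_numeric_entities(nfd))     # tie&#770;&#769;ng vie&#803;&#770;t
```

Normalizing to NFC before escaping gives one reference per accented letter, which leaves less for the browser's rendering to get wrong.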
Looking back on all this, I can attribute it to major
user error on my part for not understanding one whit about
Unicode normalization forms back then. But I figure I should
explain as much as I can remember in case anyone else ever
runs into a similar issue (unlikely), so they can avoid spending a
couple of hours confused like I did.
Alain
[1] An OS X Text Editor. I've found it handy for doing quick
and dirty editing in UTF-8.