Mailing List Archive


Re: [tlug] I hate encodings!

On Tue, 29 Aug 2006 12:08:01 +0900
"Jeff Madsen" <> wrote:

> Hope that question made sense - you can probably detect my confusion 
> already!

As far as I know, there is no such documentation. But I can give
you some hints on what you should do:

1) Use UTF-8 wherever possible.
	UTF-8 encodes Unicode, which is a superset of most (all?) other
	character sets. Thus you can represent text from nearly any other
	encoding in UTF-8. You should not consider using anything but
	UTF-8 to store data unless you have a special reason to do so.
	(Anything else makes conversions, internationalization [i18n],
	and multilingualization [m17n] very difficult.)

2) Always use UTF-8 internally in your programs, no matter
   what character set your data uses.
	Even if you have to use a non-UTF-8 encoding for your
	data outside your program, it still makes sense to use
	UTF-8 within your program. This makes it easy to switch
	to another encoding later, or to add support for another one.

3) Use iconv and similar libraries to convert between character sets.
	Using a publicly available library for character set
	conversion minimizes your work and gives you a subsystem
	that is already tested and known to work.

4) Be aware that upper case <-> lower case conversions depend
   on the language used.
	Some languages use different characters for the upper case
	version of a letter than most other languages do.
	One example is Turkish: the upper case of "i" is not "I",
	as one would expect, but "İ" (and the lower case of "I" is "ı").
	I know of at least one program where this caused a segfault.
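
To make point 4 concrete, here is a minimal Python sketch of the Turkish
case mapping. The helper names upper_tr and lower_tr are made up for this
example, and the tables cover only the dotted/dotless "i" pairs. Python's
built-in str.upper() and str.lower() ignore the language, so a real program
would more likely rely on a locale-aware library such as ICU.

```python
# Hypothetical helpers: Turkish-aware case mapping for the letter "i" only.
# The Turkish-specific letters are remapped first, then the language-neutral
# built-in mapping handles everything else.
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})
_TR_LOWER = str.maketrans({"İ": "i", "I": "ı"})


def upper_tr(text: str) -> str:
    """Upper-case text following the Turkish dotted/dotless i rules."""
    return text.translate(_TR_UPPER).upper()


def lower_tr(text: str) -> str:
    """Lower-case text following the Turkish dotted/dotless i rules."""
    return text.translate(_TR_LOWER).lower()


# The language-neutral mapping drops the dot on the capital letter.
default_upper = "istanbul".upper()    # "ISTANBUL"
turkish_upper = upper_tr("istanbul")  # "İSTANBUL"
turkish_lower = lower_tr("ISPARTA")   # "ısparta"
```

Code that assumes upper("i") == "I" (for example when comparing keywords
case-insensitively) can break as soon as the conversion follows Turkish
rules.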

You should, of course, have a look at the documentation of the libraries
and programs involved. Reading the locale(5) manpage will also give you
some hints on how languages and everything around them are handled.
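
As a rough illustration of points 2 and 3, the sketch below decodes at the
input boundary, works on a single internal representation, and encodes again
on the way out. It uses Python's built-in codecs instead of iconv, and
Python's str is Unicode text rather than literal UTF-8 bytes, but the
principle is the same. The function names are invented for this example.

```python
# Sketch: convert at the edges, keep one representation inside the program.
# read_text, write_text and convert are hypothetical names for illustration.

def read_text(raw: bytes, source_encoding: str) -> str:
    """Decode external bytes into the program's internal Unicode text."""
    return raw.decode(source_encoding)


def write_text(text: str, target_encoding: str = "utf-8") -> bytes:
    """Encode internal text for storage or transmission."""
    return text.encode(target_encoding)


def convert(raw: bytes, source_encoding: str,
            target_encoding: str = "utf-8") -> bytes:
    """Re-encode a byte string, similar in spirit to iconv -f ... -t ..."""
    return write_text(read_text(raw, source_encoding), target_encoding)


# The same word as it might arrive from two different legacy sources.
sjis_bytes = "日本語".encode("shift_jis")
eucjp_bytes = "日本語".encode("euc_jp")

# Different bytes outside the program, identical text inside it.
same_inside = read_text(sjis_bytes, "shift_jis") == read_text(eucjp_bytes, "euc_jp")

# Store as UTF-8 regardless of where the data came from.
utf8_bytes = convert(sjis_bytes, "shift_jis")
```

Adding support for another input encoding then only means passing a
different source_encoding at the boundary. The code in the middle does not
change.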

				Attila Kinali

