Mailing List Archive



Re: UTF-8: each character is one byte . . . . . . (was: Re: Learn a Variety of Languages) [tlug]



> In UTF-8, all characters contain exactly one byte without the high bit set.

uh?

The Wikipedia page that was linked to shows one example.

"""
For example, the character aleph (א), which is Unicode U+05D0, is
encoded into UTF-8 in this way:

   * It falls into the range of U+0080 to U+07FF. The table shows it
will be encoded using two bytes, 110yyyyy 10zzzzzz.
   * Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
   * The eleven bits are put in their order into the positions marked
by "y"-s and "z"-s: 11010111 10010000.
   * The final result is the two bytes, more conveniently expressed
as the two hexadecimal bytes 0xD7 0x90. That is the encoding of the
character aleph (א) in UTF-8.
"""

The code point U+05D0 is turned into 11010111 10010000. Both bytes
have the high bit set.
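
For what it's worth, here is a rough Python sketch of the two-byte case
from the table quoted above, just to double-check the arithmetic. The
function name utf8_two_byte is made up for this illustration; it only
handles the U+0080 to U+07FF range and is not a general encoder.

```python
def utf8_two_byte(cp):
    """Encode a code point in U+0080..U+07FF as 110yyyyy 10zzzzzz.

    Hypothetical helper for illustration only.
    """
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point outside the two-byte range")
    lead = 0b11000000 | (cp >> 6)        # top five bits fill yyyyy
    cont = 0b10000000 | (cp & 0b111111)  # low six bits fill zzzzzz
    return bytes([lead, cont])


encoded = utf8_two_byte(0x05D0)

# Bit patterns of each byte, zero-padded to eight digits.
bits = [format(b, "08b") for b in encoded]

# Bytes whose high bit (0x80) is NOT set; empty if every byte has it set.
high_bit_clear = [b for b in encoded if (b & 0x80) == 0]

print(bits, encoded.hex(), len(high_bit_clear))
```

Comparing the result against Python's own "\u05d0".encode("utf-8") is a
simple way to confirm that the hand-rolled bit shuffling agrees with a
real encoder.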

Am I misunderstanding something, or can we check this again?

Guillaume