Mailing List Archive



Re: UTF-8: each character is one byte . . . . . . (was: Re: Learn a Variety of Languages) [tlug]




On 1/20/07, Curt Sampson <cjs@example.com> wrote:
On Sat, 20 Jan 2007, Guillaume Proux wrote:

>> > In UTF-8, all characters contain exactly one byte without the high bit set.
>

Yeah. That would probably be better expressed as, "wrong!"

Sure. In fact, every byte of a multi-byte UTF-8 sequence has its high bit set; only bytes in the ASCII range have it clear. This is what allows ASCII-only tools to work with UTF-8 data.
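To make the high-bit rule concrete, here is a minimal sketch that labels each byte of a string by its top bits. It assumes a modern Ruby (1.9 or later) where String#bytes yields integers; the helper name classify_utf8_bytes is made up for illustration and does not come from the thread.

```ruby
# Label each byte of a UTF-8 string by its leading bits.
# classify_utf8_bytes is a hypothetical helper, not part of any library.
def classify_utf8_bytes(str)
  str.bytes.map do |b|
    if b < 0x80
      :ascii          # 0xxxxxxx: high bit clear, plain ASCII
    elsif (b & 0xC0) == 0x80
      :continuation   # 10xxxxxx: inside a multi-byte sequence
    else
      :lead           # 11xxxxxx: first byte of a multi-byte sequence
    end
  end
end

# "a", e-acute (2 bytes), and a 3-byte kanji, written with \u escapes
# so the example does not depend on the source file's encoding.
p classify_utf8_bytes("a\u00e9\u65e5")
```

Only the first byte of the sample is labelled :ascii; every byte belonging to the two multi-byte characters is either :lead or :continuation, so a tool that only looks for ASCII bytes never mistakes part of a multi-byte character for one.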

The following seems to do the trick: it counts the ASCII bytes plus the lead ("counter") byte at the start of each multi-byte sequence.

# Uses String#bytes; the c[0] integer indexing only works on Ruby 1.8.
class String
    def utf8_char_count
        bytes.select { |b| b < 128 || b > 192 }.length
    end
end

or the equivalent of Jim's C version posted earlier:

# Count every byte that is not a continuation byte (10xxxxxx).
class String
    def utf8_char_count
        bytes.select { |b| (b & 192) != 128 }.length
    end
end
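As a rough check on how close the two tests really are, the sketch below compares them over every possible byte value and then against String#length on a sample string. It assumes Ruby 1.9 or later and valid UTF-8 input; the method names are illustrative only.

```ruby
# Range form: ASCII byte, or a byte above 192.
def starts_char_by_range?(b)
  b < 128 || b > 192
end

# Mask form: anything whose top two bits are not 10.
def starts_char_by_mask?(b)
  (b & 192) != 128
end

# Byte values (0..255) on which the two tests give different answers.
def differing_bytes
  (0..255).reject { |b| starts_char_by_range?(b) == starts_char_by_mask?(b) }
end

p differing_bytes

# Compare both counts with Ruby's own character count on valid UTF-8.
sample = "tlug \u6771\u4eac caf\u00e9"
p sample.bytes.count { |b| starts_char_by_range?(b) }
p sample.bytes.count { |b| starts_char_by_mask?(b) }
p sample.length
```

The two tests disagree only on byte 192 (0xC0), which never appears in valid UTF-8 because it could only begin an overlong encoding. On well-formed input both therefore give the same count, and for this sample they match String#length.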

Ian

