TLUG Mailing List

Re: UTF-8: each character is one byte . . . . . . (was: Re: Learn a Variety of Languages) [tlug]

Date: Sat, 20 Jan 2007 17:27:00 +0900

From: "Ian MacLean" <imaclean@example.com>

Subject: Re: UTF-8: each character is one byte . . . . . . (was: Re: Learn a Variety of Languages) [tlug]

References: <45AAFDA9.90504@example.com> <19dd68ba0701160412y2eb95062r6235fed92b752784@example.com> <Pine.NEB.4.64.0701162139360.10912@example.com> <3156339d0701161820lb684aeubcd51914b19a87bf@example.com> <Pine.NEB.4.64.0701171657080.1515@example.com> <3156339d0701180035k2a4f2b70o3bbf00612501470@example.com> <Pine.NEB.4.64.0701201123230.1314@example.com> <20070119230346.6435923f.jep200404@example.com> <19dd68ba0701192031s18a5ac56o28327d22a9c38a39@example.com> <Pine.NEB.4.64.0701201512260.1314@example.com>

On 1/20/07, Curt Sampson <cjs@example.com> wrote:
> On Sat, 20 Jan 2007, Guillaume Proux wrote:
>
>>> In UTF-8, all characters contain exactly one byte without the high bit
>>> set.
>
> Yeah. That would probably be better expressed as, "wrong!"

Sure. In fact, every byte in a UTF-8 sequence has its high bit set, except for bytes in the ASCII range, which encode ASCII characters and nothing else. This is what allows ASCII-only tools to work with UTF-8 data.
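
To make that concrete, here is a small sketch, assuming Ruby 1.9 or later. The helper name `high_bit_flags` and the sample string are made up for illustration.

```ruby
# Hypothetical helper: for each byte of the string, report whether
# its high bit (0x80) is set.
def high_bit_flags(str)
  str.bytes.map { |b| (b & 0x80) != 0 }
end

# "a" is 1 byte, U+00E9 is 2 bytes, U+65E5 is 3 bytes in UTF-8.
sample = "a\u00e9\u65e5"

# Print each byte in binary: only the ASCII byte starts with 0.
sample.bytes.each { |b| printf("%08b\n", b) }

p high_bit_flags(sample)
```

Only the first byte (the ASCII "a") should show a leading 0 bit. The lead bytes start with 11 and the continuation bytes with 10.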

The following seems to do the trick: it counts the ASCII bytes plus the lead ("counter") byte at the start of each multi-byte sequence.

class String
  # Count ASCII bytes (< 128) plus lead bytes (>= 192); skip continuation bytes.
  def utf8_char_count
    bytes.count { |b| b < 128 || b >= 192 }
  end
end

Or the equivalent of Jim's C version posted earlier:

class String
  # Count every byte that is not a continuation byte (bit pattern 10xxxxxx).
  def utf8_char_count
    bytes.count { |b| (b & 192) != 128 }
  end
end
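
As a rough check that the two tests agree, here is a sketch with standalone helpers, assuming Ruby 1.9 or later. The names `count_by_range` and `count_by_mask` are hypothetical, not from the thread. The range test here uses >= 0xC0. The byte 0xC0 never occurs in well-formed UTF-8, so that boundary only matters for malformed input.

```ruby
# Hypothetical standalone helpers, written as plain functions
# rather than String methods.

# Count ASCII bytes plus the lead byte of each multi-byte sequence.
def count_by_range(str)
  str.bytes.count { |b| b < 0x80 || b >= 0xC0 }
end

# Count every byte that is not a continuation byte (10xxxxxx).
def count_by_mask(str)
  str.bytes.count { |b| (b & 0xC0) != 0x80 }
end

sample = "a\u00e9\u65e5"       # 3 characters, 6 bytes in UTF-8

puts sample.bytesize           # 6
puts count_by_range(sample)    # 3
puts count_by_mask(sample)     # 3

# The two predicates agree on all 256 possible byte values.
same = (0..255).all? { |b| (b < 0x80 || b >= 0xC0) == ((b & 0xC0) != 0x80) }
puts same                      # true
```

Both counts assume the input is valid UTF-8. Neither one checks that lead bytes are followed by the right number of continuation bytes.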

Ian
