Mailing List Archive
Re: UTF-8: each character is one byte . . . . . . (was: Re: Learn a Variety of Languages) [tlug]
- Date: Sat, 20 Jan 2007 17:27:00 +0900
- From: "Ian MacLean" <imaclean@example.com>
- Subject: Re: UTF-8: each character is one byte . . . . . . (was: Re: Learn a Variety of Languages) [tlug]
- References: <45AAFDA9.90504@example.com> <19dd68ba0701160412y2eb95062r6235fed92b752784@example.com> <Pine.NEB.4.64.0701162139360.10912@example.com> <3156339d0701161820lb684aeubcd51914b19a87bf@example.com> <Pine.NEB.4.64.0701171657080.1515@example.com> <3156339d0701180035k2a4f2b70o3bbf00612501470@example.com> <Pine.NEB.4.64.0701201123230.1314@example.com> <20070119230346.6435923f.jep200404@example.com> <19dd68ba0701192031s18a5ac56o28327d22a9c38a39@example.com> <Pine.NEB.4.64.0701201512260.1314@example.com>
On 1/20/07, Curt Sampson <cjs@example.com> wrote:
> On Sat, 20 Jan 2007, Guillaume Proux wrote:
>> > In UTF-8, all characters contain exactly one byte without the high bit
>> > set.
>
> Yeah. That would probably be better expressed as, "wrong!"
Sure. In fact, every byte in a UTF-8 sequence outside the ASCII range has its high bit set, which is what lets ASCII-only tools work with UTF-8 data.
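To make the bit patterns concrete, here is a rough sketch that labels each byte of a string by its top bits. The helper name classify_utf8_bytes is made up for illustration, and it assumes a Ruby where String#bytes yields integers (1.8.7 or later).

```ruby
# classify_utf8_bytes is a hypothetical helper, not something from this thread.
# It labels each byte of a UTF-8 string by its top bits.
def classify_utf8_bytes(str)
  str.bytes.map do |b|
    if b < 0x80
      :ascii          # 0xxxxxxx: high bit clear, plain ASCII
    elsif (b & 0xC0) == 0x80
      :continuation   # 10xxxxxx: second or later byte of a sequence
    else
      :lead           # 11xxxxxx: first byte of a multi-byte sequence
    end
  end
end

# "a" is 1 byte, U+00E9 is 2 bytes, U+65E5 is 3 bytes.
p classify_utf8_bytes("a\u00e9\u65e5")
# => [:ascii, :lead, :continuation, :lead, :continuation, :continuation]
```

Only the :ascii entries have the high bit clear, so an ASCII-only tool never mistakes part of a multi-byte character for an ASCII one.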
The following seems to do the trick: it counts the ASCII bytes plus the lead ("counter") byte at the start of each multi-byte sequence.
class String
  # Count ASCII bytes (< 128) plus lead bytes (>= 192); skip continuation bytes.
  def utf8_char_count
    bytes.count { |b| b < 128 || b >= 192 }
  end
end
Or the equivalent of Jim's C version posted earlier:
class String
  # Count every byte whose top two bits are not 10, i.e. not a continuation byte.
  def utf8_char_count
    bytes.count { |b| (b & 192) != 128 }
  end
end
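As a quick sanity check, the sketch below runs both filters over a few sample strings and compares them with Ruby's own String#length. The names count_by_range and count_by_mask are invented here, they take the string as an argument instead of extending String, and the range test uses >= 192 so that it matches the mask test exactly. It assumes valid UTF-8 input and a Ruby where String#bytes yields integers.

```ruby
# Hypothetical standalone versions of the two counters (names are not from the thread).

# ASCII bytes (< 128) plus lead bytes (>= 192).
def count_by_range(str)
  str.bytes.count { |b| b < 128 || b >= 192 }
end

# Every byte whose top two bits are not 10 (not a continuation byte).
def count_by_mask(str)
  str.bytes.count { |b| (b & 192) != 128 }
end

# Samples with 1-, 2-, 3-, and 4-byte characters.
samples = ["hello", "caf\u00e9", "\u65e5\u672c\u8a9e", "\u{1F600}"]

samples.each do |s|
  puts "#{s.bytesize} bytes, #{s.length} chars, " \
       "range=#{count_by_range(s)}, mask=#{count_by_mask(s)}"
end
```

Both counters count code points, so a base letter followed by a combining accent counts as two.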
Ian