Re: [tlug] multibyte tr (or i18n coreutils)
- Date: Sun, 18 Apr 2010 16:52:57 -0400
- From: Jun-Dai Bates-Kobashigawa <jd.lists@example.com>
- Subject: Re: [tlug] multibyte tr (or i18n coreutils)
- References: <20100418175123.GA8181@example.com>
It's an interesting problem, Michal. The Linux boxes I have at hand (CentOS 5.3 and Debian 4.0) only have version 5.97:

$ tr --version
tr (GNU coreutils) 5.97
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jim Meyering.

Clearly it's not handling the multibyte chars:

$ LC_ALL=cs_CZ.UTF-8 echo 'ššš' | tr 'š' '123'
121212
$ echo '僕はサチだ' | tr 'サチ' '純苔'
動蔯贔苔蔠
$ echo '僕はサチだ' | sed s/サチ/純苔/
僕は純苔だ

I'm a bit of a Linux novice, but sed should work as well, yes? It's much more laborious than using tr, but depending on what you're doing, the POSIX equivalence classes might help:

$ echo "sššsqwerty" | sed s/[[=s=]]/1/g
1111qwerty

Or with a bit more typing than your tr command:

$ export REPLACEMENTS="s/[abcč]/2/g s/[dďeěf]/3/g s/[ghií]/4/g s/[jkl]/5/g s/[mnňoó]/6/g s/[pqrřsš]/7/g s/[tťuúův]/8/g s/[wxyýzž]/9/g"
$ export STRING='ššššabe'
$ echo $STRING
ššššabe
$ for arg in $REPLACEMENTS; do export STRING=`echo "$STRING" | sed "$arg"`; done
$ echo $STRING
7777223

* * * * *

Interestingly, on OSX it works fine:

$ echo "ššš" | tr 'š' '12'
111
$ LC_ALL=cs_CZ.UTF-8 echo "ššš" | tr '[a,b,c,č,d,ď,e,ě,f,g,h,i,í,j,k,l,m,n,ň,o,ó,p,q,r,ř,s,š,t,ť,u,ú,ů,v,w,x,y,ý,z,ž]' '[2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9]'
777
$ echo '僕はサチだ' | tr 'サチ' '純苔'
僕は純苔だ

(There's no --version option in the OSX one.)

Incidentally, the man page on my Linux box makes no mention of supporting LC_ALL, but the OSX one does. Also, I'm not sure which version this refers to, but I did find this:

"Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values. This distinction will matter only when some values are not characters, and this is possible only in locales using multibyte encodings when the input contains encoding errors."

So I guess maybe the answer is that the GNU version doesn't, but the BSD (and I'm guessing Solaris) version does?

Jun-Dai

2010/4/18 Michal Hajek <hajek1@example.com>

Hello,
I need something like:
tr \
'[a,b,c,č,d,ď,e,ě,f,g,h,i,í,j,k,l,m,n,ň,o,ó,p,q,r,ř,s,š,t,ť,u,ú,ů,v,w,x,y,ý,z,ž]'\
'[2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9]'
but tr[1] does not seem to understand multibyte characters.
For example:
LC_ALL=cs_CZ.UTF-8 echo "ššš" |tr \
'[a,b,c,č,d,ď,e,ě,f,g,h,i,í,j,k,l,m,n,ň,o,ó,p,q,r,ř,s,š,t,ť,u,ú,ů,v,w,x,y,ý,z,ž]'\
'[2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9]'
gives:
]8]8]8
Is there another simple way of doing the above substitution?
Or is there a way to persuade "tr" to work with UTF-8?
Thanks in advance
Michal
[1]
$ tr --version
tr (GNU coreutils) 8.4
Packaged by Gentoo (8.4 (p1))
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Jim Meyering.
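
For anyone who just needs the substitution and is not tied to tr, here is a minimal sketch in Python 3 that translates by character rather than by byte. It is not something proposed in the thread, only one possible workaround. The names KEYPAD_GROUPS, KEYPAD_TABLE and to_keypad are made up for this illustration, the letter groups are copied from Michal's tr command, and the input is assumed to use precomposed letters (a decomposed "s" followed by a combining caron would need normalizing first).

```python
# Keypad digit for each letter group, copied from the tr command in the
# original question (Czech letters with diacritics included).
KEYPAD_GROUPS = {
    "2": "abcč",
    "3": "dďeěf",
    "4": "ghií",
    "5": "jkl",
    "6": "mnňoó",
    "7": "pqrřsš",
    "8": "tťuúův",
    "9": "wxyýzž",
}

# str.maketrans accepts a dict keyed by single characters (code points),
# so a letter that takes two bytes in UTF-8, such as "š", is one key.
KEYPAD_TABLE = str.maketrans(
    {letter: digit
     for digit, letters in KEYPAD_GROUPS.items()
     for letter in letters}
)


def to_keypad(text: str) -> str:
    """Replace each mapped letter with its digit; leave other characters alone."""
    return text.translate(KEYPAD_TABLE)


if __name__ == "__main__":
    print(to_keypad("ššš"))      # 777
    print(to_keypad("ššššabe"))  # 7777223
    # One character, two UTF-8 bytes: a byte-oriented tool sees two units.
    print(len("š"), len("š".encode("utf-8")))  # 1 2
```

The last line shows why a byte-oriented tr prints two output characters for every "š": the letter is a single code point but two bytes in UTF-8, and each byte is translated separately.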