Mailing List Archive



Re: [tlug] multibyte tr (or i18n coreutils)



It's an interesting problem, Michal.  The Linux boxes I have at hand (CentOS 5.3 and Debian 4.0) only have version 5.97:

$ tr --version
tr (GNU coreutils) 5.97
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jim Meyering.

Clearly it's not handling the multibyte chars:
$ LC_ALL=cs_CZ.UTF-8 echo 'ššš' | tr 'š' '123'
121212
$ echo '僕はサチだ' | tr 'サチ' '純苔'
動蔯贔苔蔠
$ echo '僕はサチだ' | sed s/サチ/純苔/
僕は純苔だ
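
As far as I can tell, the tr results above come from tr working on bytes rather than characters: in UTF-8, 'š' is the two bytes 0xC5 0xA1, so tr 'š' '123' maps 0xC5 to 1 and 0xA1 to 2. Note also that in `LC_ALL=... echo ... | tr ...` the variable is set only for echo, not for tr. A minimal sketch to check this, assuming the terminal and the script are UTF-8 (bytes_of is just a made-up helper name):

```shell
# Hypothetical helper: print the bytes of its argument as hex.
bytes_of() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \t\n'
  echo
}

bytes_of 'š'        # c5a1: one character, two bytes in UTF-8

# Reproduce the 121212 result with the bytes spelled out as octal escapes:
# 0xC5 (\305) maps to 1 and 0xA1 (\241) maps to 2.
printf 'ššš\n' | LC_ALL=C tr '\305\241' '12'
```

The second command prints 121212, the same as the tr 'š' '123' run above.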

I'm a bit of a Linux novice, but sed should work as well, yes?  It's more laborious than tr, but depending on what you're doing, the POSIX equivalence classes might help:
$ echo "sššsqwerty" | sed s/[[=s=]]/1/g
1111qwerty

Or with a bit more typing than your tr command:
$ export REPLACEMENTS="s/[abcč]/2/g s/[dďeěf]/3/g s/[ghií]/4/g s/[jkl]/5/g s/[mnňoó]/6/g s/[pqrřsš]/7/g s/[tťuúův]/8/g s/[wxyýzž]/9/g"
$ export STRING='ššššabe'
$ echo $STRING
ššššabe
$ for arg in $REPLACEMENTS; do export STRING=`echo $STRING | sed $arg`; done
$ echo $STRING
7777223
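
If the locale can't be relied on, one option might be to keep the accented letters out of bracket expressions altogether. A literal 'š' in a pattern is just a fixed two-byte sequence, so it matches correctly even when sed works byte by byte, whereas inside [...] each byte would count as a separate member. A rough sketch, assuming UTF-8 input (keypad is a made-up name, and the mapping follows Michal's original tr sets):

```shell
# Hypothetical helper: map Czech letters to phone-keypad digits.
# Accented letters are separate literal patterns, never inside a bracket
# expression, so the matching is correct even byte by byte (LC_ALL=C).
keypad() {
  printf '%s\n' "$1" | LC_ALL=C sed \
    -e 's/č/2/g' -e 's/[abc]/2/g' \
    -e 's/ď/3/g' -e 's/ě/3/g' -e 's/[def]/3/g' \
    -e 's/í/4/g' -e 's/[ghi]/4/g' \
    -e 's/[jkl]/5/g' \
    -e 's/ň/6/g' -e 's/ó/6/g' -e 's/[mno]/6/g' \
    -e 's/ř/7/g' -e 's/š/7/g' -e 's/[pqrs]/7/g' \
    -e 's/ť/8/g' -e 's/ú/8/g' -e 's/ů/8/g' -e 's/[tuv]/8/g' \
    -e 's/ý/9/g' -e 's/ž/9/g' -e 's/[wxyz]/9/g'
}

keypad 'ššššabe'    # prints 7777223
```

This is a single sed process instead of one per substitution, and it doesn't depend on a cs_CZ.UTF-8 locale being installed.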

* * * * *

Interestingly, on OS X it works fine:
$ echo "ššš" | tr 'š' '12'
111
$ LC_ALL=cs_CZ.UTF-8 echo "ššš" | tr '[a,b,c,č,d,ď,e,ě,f,g,h,i,í,j,k,l,m,n,ň,o,ó,p,q,r,ř,s,š,t,ť,u,ú,ů,v,w,x,y,ý,z,ž]' '[2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9]'
777
$ echo '僕はサチだ' | tr 'サチ' '純苔'
僕は純苔だ


(There's no --version option in the OS X one.)

Incidentally, the man page on my Linux box makes no mention of supporting LC_ALL, but the OS X one does.  I also found this in the GNU coreutils manual, though I'm not sure which version it refers to:
"Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values. This distinction will matter only when some values are not characters, and this is possible only in locales using multibyte encodings when the input contains encoding errors."

http://www.gnu.org/software/coreutils/manual/html_node/tr-invocation.html

So I guess the answer may be that the GNU version doesn't handle multibyte characters, but the BSD (and, I'm guessing, Solaris) version does?

  Jun-Dai



2010/4/18 Michal Hajek <hajek1@example.com>
Hello,

I need something like:
tr \
'[a,b,c,č,d,ď,e,ě,f,g,h,i,í,j,k,l,m,n,ň,o,ó,p,q,r,ř,s,š,t,ť,u,ú,ů,v,w,x,y,ý,z,ž]'\
'[2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9]'

but tr[1] does not seem to understand multibyte characters.
For example:
LC_ALL=cs_CZ.UTF-8 echo "ššš" |tr \
'[a,b,c,č,d,ď,e,ě,f,g,h,i,í,j,k,l,m,n,ň,o,ó,p,q,r,ř,s,š,t,ť,u,ú,ů,v,w,x,y,ý,z,ž]'\
'[2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9]'

gives:
]8]8]8

Is there another simple way of doing the above substitution?

Or is there a way to persuade "tr" to work with UTF-8?

Thanks in advance

Michal

[1]
$ tr --version
tr (GNU coreutils) 8.4
Packaged by Gentoo (8.4 (p1))
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jim Meyering.



