
- Date: Tue, 05 Aug 2003 12:14:36 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] Converting Encoding-- Mac SJIS to UTF8 Linux
I have managed to cobble together a way to get my Macintosh files to Linux,
but I have to do it one folder at a time. Can anyone help me make the Linux
scripts a recursive process, so I don't have to navigate to each folder in
turn and run these two scripts (see below)?
I start out by running my Nisus Mac files, which include both SJIS kanji
and lots of diacritics (like the long o, o with macron, of, say, Kyoto),
through a macro from Nobumi Iyanaga. This macro converts the diacritics into
HTML entities, like &#332; for the capital O with macron. These entities are
pure seven-bit ASCII, so they will pass through any conversion. If you try to
convert the diacritic letters without this step, you will find that they use
the high, non-ASCII half of the byte range, so they come out garbled; worse,
since a high byte can signal the start of a kanji, the diacritic letter often
combines with the following letter and is converted into some kanji or other.
Converting to HTML entities in Nisus (which knows, in its own secret way,
which characters are kanji and which are diacritics) avoids this problem.
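To make the entity idea concrete, here is a tiny sketch of my own (not part
of the Nisus macro itself) showing that the entity form of a macron vowel is
plain seven-bit ASCII:

#!/usr/bin/env perl
# Sketch only: the lowercase o with macron (U+014D), as in Kyoto, becomes
# the seven-bit ASCII string "&#333;", which passes through an
# SJIS-to-UTF-8 conversion untouched.
my $o_macron = "\x{014D}";            # lowercase o with macron
printf "&#%d;\n", ord($o_macron);     # prints &#333;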
I then take my files to Linux and run the recode utility, and then this
little Perl script. This does an in-place conversion, so the original data
is destroyed (run on a copy!). Recode converts from SJIS to UTF-8 in a
flash, assuming that everything is in SJIS. It is probably best to run recode
with the -f switch, to force conversion even if there are odd bits that do
not conform (like old footnote flags or something).
recode sjis..u8 *
myconvert *
Where myconvert is the Perl script in my /bin, as follows:
#!/usr/bin/env perl
# Convert HTML entities to UTF-8 characters, based on Nobumi's approach,
# and change Mac end-of-line hard returns to Unix newlines.
$^I = "";    # edit the files named on the command line in place (no backup)
while (<>) {                               # read every file passed on the command line
    s/(&#)([0-9]+)(;)/pack('U*', $2)/eg;   # numeric entity to UTF-8 character
    s/\015/\012/g;                         # Mac end of line (CR) to Unix newline (LF)
    print;                                 # output goes back to the original file name
}
So, it works just fine: I get both kanji and beautiful diacritics, all in
the same Unicode encoding. But my question is, how do I make this
recursive? I have hundreds of files in scores of directories, and it can
get to be a mess, especially if there is a problem, and errors are all too
easy to make. Is there a way to put a recursive envelope around this? Or
perhaps it is possible to make a TAR file of it all and run this code on
the TAR?
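For illustration, something along the following lines is roughly the kind of
recursive envelope I imagine, but it is an untested sketch (it assumes GNU
recode is on the PATH and uses File::Find from the standard Perl
distribution), and I would welcome a better way:

#!/usr/bin/env perl
# Untested sketch: walk every file under the named directories (or the
# current directory), run recode on it, then do the entity and newline
# fixes in place, just as the two separate steps above do.
use strict;
use warnings;
use File::Find;

find(sub {
    return unless -f $_;                        # skip directories and the like
    system('recode', '-f', 'sjis..u8', $_) == 0
        or warn "recode failed on $File::Find::name\n";
    local @ARGV = ($_);                         # hand this one file to <>
    local $^I   = "";                           # in-place edit, no backup
    while (<>) {
        s/(&#)([0-9]+)(;)/pack('U*', $2)/eg;    # entity to UTF-8 character
        s/\015/\012/g;                          # Mac CR to Unix LF
        print;                                  # back to the original file name
    }
}, @ARGV ? @ARGV : '.');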
Thanks,
David