
- Date: Tue, 05 Aug 2003 12:14:36 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] Converting Encoding-- Mac SJIS to UTF8 Linux
I have managed to cobble together a way to get my Macintosh files to Linux,
but I have to do it one folder at a time. Can anyone help me make the Linux
scripts a recursive process, so I don't have to navigate to each folder in
turn and run these two scripts (see below)?
I start out by running my Nisus Mac files, which include both SJIS kanji
and lots of diacritics (like the long o, o with macron, of, say, Kyoto),
through a macro from Nobumi Iyanaga. This macro converts the diacritics into
HTML entities, like &#332; for the capital O with macron. These entities are
pure seven-bit ASCII, so they will pass through any conversion. If you try to
convert the diacritic letters without this step, you will find that they use
the high, non-ASCII half of the byte range, so they come out garbled; worse,
since a high byte can signal the start of a kanji, the diacritic letter often
combines with the following letter and is converted into some kanji or other.
Converting to HTML entities in Nisus (which knows, in its own secret way,
which characters are kanji and which are diacritics) avoids this problem.
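To make the entity idea concrete, here is a tiny sketch of my own (not part
of the Nisus macro itself) showing that the entity form of a macron vowel is
plain seven-bit ASCII:

#!/usr/bin/env perl
# Sketch only: the lowercase o with macron (U+014D), as in Kyoto, becomes
# the seven-bit ASCII string "&#333;", which passes through an
# SJIS-to-UTF-8 conversion untouched.
my $o_macron = "\x{014D}";            # lowercase o with macron
printf "&#%d;\n", ord($o_macron);     # prints &#333;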
I then take my files to Linux and run the recode utility, and then this
little Perl script. This does an in-place conversion, so the original data
is destroyed (run on a copy!). Recode converts from SJIS to UTF-8 in a
flash, assuming that everything is in SJIS. It is probably best to run recode
with the -f switch, to force conversion even if there are odd bits that do
not conform (like old footnote flags or something).
recode sjis..u8 *
myconvert *
Where myconvert is the Perl script in my /bin, as follows:
#!/usr/bin/env perl
# Convert HTML entities to UTF-8 characters, based on Nobumi's approach,
# and change Mac end-of-line hard returns to Unix newlines.
$^I = "";    # edit the files named on the command line in place (no backup)
while (<>) {                               # read every file passed on the command line
    s/(&#)([0-9]+)(;)/pack('U*', $2)/eg;   # numeric entity to UTF-8 character
    s/\015/\012/g;                         # Mac end of line (CR) to Unix newline (LF)
    print;                                 # output goes back to the original file name
}
So, it works just fine: I get both kanji and beautiful diacritics, all in
the same Unicode encoding. But my question is, how do I make this
recursive? I have hundreds of files in scores of directories, and it can
get to be a mess, especially if there is a problem, and errors are all too
easy to make. Is there a way to put a recursive envelope around this? Or
perhaps it is possible to make a TAR file of it all and run this code on
the TAR?
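For illustration, something along the following lines is roughly the kind of
recursive envelope I imagine, but it is an untested sketch (it assumes GNU
recode is on the PATH and uses File::Find from the standard Perl
distribution), and I would welcome a better way:

#!/usr/bin/env perl
# Untested sketch: walk every file under the named directories (or the
# current directory), run recode on it, then do the entity and newline
# fixes in place, just as the two separate steps above do.
use strict;
use warnings;
use File::Find;

find(sub {
    return unless -f $_;                        # skip directories and the like
    system('recode', '-f', 'sjis..u8', $_) == 0
        or warn "recode failed on $File::Find::name\n";
    local @ARGV = ($_);                         # hand this one file to <>
    local $^I   = "";                           # in-place edit, no backup
    while (<>) {
        s/(&#)([0-9]+)(;)/pack('U*', $2)/eg;    # entity to UTF-8 character
        s/\015/\012/g;                          # Mac CR to Unix LF
        print;                                  # back to the original file name
    }
}, @ARGV ? @ARGV : '.');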
Thanks,
David