tlug.jp Mailing List Archive
[tlug] Converting Encoding -- Mac SJIS to UTF8 Linux
- Date: Tue, 05 Aug 2003 12:14:36 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] Converting Encoding-- Mac SJIS to UTF8 Linux
I have managed to cobble together a way to get my Macintosh files to Linux, but I have to do it one folder at a time. Can anyone help me make the Linux scripts a recursive process, so I don't have to navigate to each folder in turn and run these two scripts (see below)?

I start by running my Nisus Mac files, which include both SJIS kanji and lots of diacritics (like the long o, o + macron, of, say, Kyoto), through a macro from Nobumi Iyanaga. This macro converts the diacritics into HTML entities, like &#332; for Ō. These entities are pure seven-bit ASCII, so they pass through any conversion untouched. If you try to convert the diacritic letters without this step, you will find that they use the high-bit range of the byte, so they come out garbled; worse, since a high byte can signal the start of a kanji, the diacritic letter often combines with the following byte and is converted into some kanji or other. Converting to HTML entities in Nisus (which knows, in its own secret way, which is kanji and which is diacritics) avoids this problem.

I then take my files to Linux, run the recode utility, and then this little Perl script. This does an in-place conversion, so the original data is destroyed (run on a copy!). Recode converts from SJIS to UTF-8 in a flash, assuming that everything is in SJIS. It is probably best to run recode with the -f switch, to force conversion even if there are odd bits that do not conform (like old footnote flags or something):

    recode sjis..u8 *
    myconvert *

Where myconvert is the Perl script in my /bin, as follows:

    #!/usr/bin/env perl
    # Convert HTML entities to UTF-8, based on Nobumi's macro.
    # Also convert Mac end-of-line hard returns to Unix newlines.
    $^I = "";    # in-place edit: <> writes back to the original name (empty string = no backup)
    while (<>) { # read every file passed on the command line
        s/(&#)([0-9]+)(;)/pack('U*', $2)/eg;  # numeric entity to UTF-8 character
        s/\015/\012/g;                        # Mac end of line to Unix newline
        print;                                # output back to the original file
    }

So it works just fine: I get both kanji and beautiful diacritics, all in the same Unicode encoding. But my question is, how do I make this recursive? I have hundreds of files in scores of directories, and it can get to be a mess, especially if there is a problem, and errors are all too easy to make. Is there a way to put a recursive envelope around this? Or perhaps I could make a TAR file of it all and run this code on the TAR?

Thanks, David
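[A sketch of one possible recursive wrapper, not from the original post: it walks the whole tree with find instead of running the two commands per folder. It assumes bash with GNU find, that `recode` and the `myconvert` script above are on the PATH, and that every regular file under the tree really is SJIS text (binary files would be mangled, so run it on a copy, as above).]

```shell
# convert_tree TOP
# Apply the two-step conversion (recode, then myconvert) to every
# regular file under TOP, however deeply nested.
convert_tree() {
    # -print0 / read -d '' keeps filenames with spaces intact
    find "$1" -type f -print0 |
    while IFS= read -r -d '' f; do
        recode -f sjis..u8 "$f"   # SJIS to UTF-8, forced past odd bytes
        myconvert "$f"            # entities to UTF-8, Mac EOL to Unix
    done
}
```

Usage would be `convert_tree /path/to/copy-of-tree`, replacing the per-folder `recode sjis..u8 *; myconvert *` pair.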
- Follow-Ups:
- Re: [tlug] Converting Encoding-- Mac SJIS to UTF8 Linux
- From: Brett Robson
- Re: [tlug] Converting Encoding-- Mac SJIS to UTF8 Linux
- From: Viktor Pavlenko