Mailing List Archive


Re: [tlug] Unzipping archives with japanese filenames in an unknown encoding in a smarter way?



On Fri, Feb 08, 2019 at 04:37:12PM +0900, Claus Aranha wrote:
> Hello!
> 
> What is the best way to unzip a file that may have filenames in
> Japanese in an arbitrary encoding and avoid getting mojibake?
> 
> I can use -O on unzip to tell what encoding I want (UTF-8, EUC,
> Shift_JIS, Windows_31J (???)) but trying the different encodings until
> finding one that works just seems inefficient.
> 
> Is there a better way?

As far as I understand, the -O switch only specifies which encoding
the file names inside the ZIP file are read as; the output file name
encoding will always be the one your system uses. So it's a conversion
from an unknown encoding, and ideally the tool would guess the source
file name encoding automatically.
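
To illustrate the conversion described above, here is a minimal Python
sketch. It is not how unzip implements -O; the function name
convert_name and its parameters are made up for this example. It only
shows the two steps involved: decode the stored bytes with the stated
source encoding, then re-encode them for the local system.

```python
import sys


def convert_name(raw, source_encoding, target_encoding=None):
    """Re-encode a stored file name from the source to the system encoding.

    raw             -- file name bytes as stored in the archive
    source_encoding -- what the user would pass via -O (e.g. "cp932")
    target_encoding -- defaults to the local file system encoding
    """
    if target_encoding is None:
        # Hypothetical default: use whatever this system uses for file names.
        target_encoding = sys.getfilesystemencoding()
    text = raw.decode(source_encoding)   # step 1: interpret the stored bytes
    return text.encode(target_encoding)  # step 2: write them the local way
```

If source_encoding is wrong, step 1 either raises UnicodeDecodeError or
silently produces mojibake, which is exactly the problem being asked
about.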

(The -O switch is an Ubuntu patch BTW. Original versions of unzip don't
have it.)

Detecting the text encoding of an unknown byte sequence (necessary
whenever the source encoding deviates from a standard) is guesswork and
can only rely on heuristics, like the chardet algorithm in libuchardet.
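
A very crude heuristic, much simpler than what libuchardet does, is to
check which candidate encodings can decode the bytes without error. The
sketch below assumes a hand-picked candidate list; CANDIDATES and
plausible_encodings are names invented for this example.

```python
# Hypothetical candidate list; adjust it to the encodings you expect.
CANDIDATES = ("utf-8", "euc_jp", "cp932")


def plausible_encodings(raw, candidates=CANDIDATES):
    """Return every candidate encoding that decodes raw without error."""
    survivors = []
    for enc in candidates:
        try:
            raw.decode(enc)  # strict decoding raises on invalid sequences
        except UnicodeDecodeError:
            continue
        survivors.append(enc)
    return survivors
```

More than one candidate may survive (a pure ASCII name passes all
three), and a clean decode does not prove the result is the intended
text. That is why this stays guesswork.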

I don't know specifically whether any ZIP program supports sniffing
source file name encodings; unzip likely doesn't. If you find a program
that does, that'd be great. Otherwise, I could imagine a wrapper written
in e.g. Python that opens the ZIP file, goes over all file names in it,
applies an encoding detection routine to each (which may yield different
results for different files), and then either calls unzip with the
detected encoding or extracts the files one by one.
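
A rough sketch of such a wrapper, taking the "extract files one by one"
route with Python's standard zipfile module. It relies on zipfile
decoding file names as cp437 by default unless the archive flags them as
UTF-8, so encoding back to cp437 recovers the stored bytes. The detection
step is a plain "first candidate that decodes cleanly" stand-in, not a
real detector, and CANDIDATES, fix_name and extract_all are names I made
up for this example.

```python
import os
import shutil
import zipfile

# Hypothetical candidate order: strictest first, most permissive last.
CANDIDATES = ("utf-8", "euc_jp", "cp932")
UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def fix_name(info):
    """Return a best-guess Unicode file name for one ZipInfo entry."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded correctly by zipfile
    try:
        raw = info.filename.encode("cp437")  # undo zipfile's default decoding
    except UnicodeEncodeError:
        return info.filename  # name did not come from cp437; leave it alone
    for enc in CANDIDATES:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return info.filename  # nothing matched; keep the cp437 interpretation


def extract_all(zip_path, dest):
    """Extract every entry under dest, using the guessed file names."""
    dest_root = os.path.realpath(dest)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = os.path.realpath(os.path.join(dest_root, fix_name(info)))
            # Refuse entries that would land outside the destination.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: %r" % info.filename)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
```

Usage would be something like extract_all("archive.zip", "out"). Each
name is guessed independently, so a mixed archive may end up with
different encodings per file. Swapping the loop in fix_name for a proper
detector would be the obvious next step.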

