Mailing List Archive


Re: [tlug] iconv / Python / unicode question



>>>>> "Frank" == Frank Bennett <bennett@example.com> writes:

    Frank> Is there a toggle in the python unicode object that will
    Frank> just drop non-conforming characters on the floor?  Failing
    Frank> that, is there __any__ filter that will strip out these
    Frank> blocking characters from a file, so that it can be run
    Frank> through these tools without blowing them up?

When converting to Unicode, pass 'ignore' (to drop the offending
bytes) or 'replace' (to substitute a placeholder) as the errors
argument to the built-in function unicode():

http://www.python.org/doc/lib/built-in-funcs.html

unicode(object[, encoding[, errors]])

    Return the Unicode string version of object using one of the
    following modes:

    If encoding and/or errors are given, unicode() will decode the
    object which can either be an 8-bit string or a character buffer
    using the codec for encoding. The encoding parameter is a string
    giving the name of an encoding. Error handling is done according
    to errors; this specifies the treatment of characters which are
    invalid in the input encoding. If errors is 'strict' (the
    default), a ValueError is raised on errors, while a value of
    'ignore' causes errors to be silently ignored, and a value of
    'replace' causes the official Unicode replacement character,
    U+FFFD, to be used to replace input characters which cannot be
    decoded. See also the codecs module.

    [snip]
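
A minimal sketch of the three error modes described above. The
sample bytes are made up for illustration, and the sketch calls the
.decode() method on the byte string rather than unicode() so that it
also runs on later Python versions where unicode() is not available;
the errors argument behaves the same way in both.

```python
# Hypothetical input: valid UTF-8 with one stray 0xFF byte in the middle.
raw = b"abc\xffdef"

# 'ignore' silently drops the byte that cannot be decoded.
dropped = raw.decode("utf-8", "ignore")

# 'replace' substitutes the Unicode replacement character U+FFFD.
replaced = raw.decode("utf-8", "replace")

# 'strict' (the default) raises an error, which is a kind of ValueError.
try:
    raw.decode("utf-8")
    strict_failed = False
except ValueError:
    strict_failed = True
```

Here 'ignore' is the mode that "drops non-conforming characters on
the floor"; 'replace' keeps a visible marker where data was lost.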

When converting a Unicode string to some other encoding, do the same
when calling your_unistring.encode():

http://www.python.org/doc/lib/string-methods.html

encode([encoding[,errors]])

    Return an encoded version of the string. Default encoding is the
    current default string encoding. errors may be given to set a
    different error handling scheme. The default for errors is
    'strict', meaning that encoding errors raise a ValueError. Other
    possible values are 'ignore' and 'replace'. New in version 2.0.

Ben

-- 
Brought to you by the letters Q and S and the number 11.
"Wuzzle means to mix."
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/

