Python: Sanitize a string for unicode? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?
I have a string that I'm trying to make safe for the unicode() function:
>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?
Somewhat related to this question, although I was unable to solve my problem from it.
This also fails:
>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
s.decode('utf-8')
File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte

Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings"; they're byte arrays. So where did your string come from, and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. When I paste it into a Python interpreter, or type it on OS X with Option-[, it doesn't come through.
Looking at your second example, though, you have a byte with hex value 93. That can't be UTF-8 here, because in UTF-8 every byte above 127 is part of a multibyte sequence, and 0x93 can only be a continuation byte, never the start of a sequence. So I'm guessing it's supposed to be Latin-1. The problem is that 0x93 isn't a printable character in Latin-1: the range 0x80 to 0x9f is reserved for control characters. Microsoft saw that range and decided to put printable characters such as "curly quotes" in there. In doing so they created a similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
So, let's assume it is windows-1252. What now? str.decode converts bytes into Unicode, so that's the one you want. Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:
>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>
That's correct, because the opening curly quote is Unicode U+201C. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite function, unicode.encode.
>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'
Curly quotes take 3 bytes to encode in UTF-8. You could use UTF-16 and they'd only be two bytes. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
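For readers on Python 3, the same round trip can be sketched as below. This is only an illustration under the same assumption as above (that the bytes are windows-1252); the variable names are made up for the example. In Python 3 the byte string needs a b prefix and decoding yields a str.

```python
# Python 3 sketch: bytes and text are distinct types, so the decode step is explicit.
raw = b"foo \x93bar bar\x94 weasel"     # bytes as they arrived, assumed to be windows-1252

text = raw.decode("windows-1252")        # bytes -> str (Unicode text)
utf8_bytes = text.encode("utf-8")        # str -> bytes, 3 bytes per curly quote
utf16_bytes = text.encode("utf-16-le")   # str -> bytes, 2 bytes per curly quote

# ASCII has no curly quotes, so a strict encode fails; an error handler
# such as "replace" substitutes "?" instead of raising.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
ascii_lossy = text.encode("ascii", errors="replace")

print(repr(text))
print(utf8_bytes)
print(ascii_lossy)
```

Whether a lossy handler such as "replace" is acceptable depends on what consumes the output; keeping the text as Unicode and encoding to UTF-8 at the boundary avoids the loss entirely.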

EDIT. It looks like your string is encoded in such a way that “ (LEFT DOUBLE QUOTATION MARK) becomes \x93 and ” (RIGHT DOUBLE QUOTATION MARK) becomes \x94. A number of code pages have this mapping; CP1250 is one of them, so you can use this:
s = s.decode('cp1250')
For all the codepages which map “ to \x93 see here (all of them also map ” to \x94, which can be verified here).
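If the linked tables are not at hand, the mapping can also be checked locally. The following is a rough Python 3 sketch; the helper name codecs_mapping and the short candidate list are assumptions made for this example, not an exhaustive list of code pages.

```python
def codecs_mapping(byte, char, names):
    """Return the codec names from `names` that decode `byte` to `char`."""
    matches = []
    for name in names:
        try:
            if byte.decode(name) == char:
                matches.append(name)
        except UnicodeDecodeError:
            # Some code pages leave this byte undefined.
            pass
    return matches

# A hand-picked sample of codecs to probe (hypothetical selection).
candidates = ["latin-1", "cp437", "cp1250", "cp1251", "cp1252", "mac_roman"]

opening = codecs_mapping(b"\x93", "\u201c", candidates)
closing = codecs_mapping(b"\x94", "\u201d", candidates)
print(opening)
print(closing)
```

Several codecs matching means the two quote bytes alone cannot tell you which encoding the data really uses; other non-ASCII bytes in the input are needed to narrow it down.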

Related

Issue in encode/decode in python 3 with non-ascii character

I am trying to use python3 unicode_escape to escape \n in my string, but the challenge is there are non-ascii characters present in the whole string, and if I use utf8 to encode and then decode the bytes using unicode_escape then the special character gets garbled. Is there any way to have the \n escaped with a new line without garbling the special character?
s = "hello\\nworld└--"
print(s.encode('utf8').decode('unicode_escape'))
Expected Result:
hello
world└--
Actual Result:
hello
worldâ--

As user wowcha observes, the unicode-escape codec assumes a latin-1 encoding, but your string contains a character that is not encodable as latin-1.
>>> s = "hello\\nworld└--"
>>> s.encode('latin-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
Encoding the string as utf-8 gets around the encoding problem, but results in mojibake when decoding from unicode-escape.
The solution is to use the backslashreplace error handler when encoding. This will convert the problem character to an escape sequence that can be encoded as latin-1 and does not get mangled when decoded from unicode-escape.
>>> s.encode('latin-1', errors='backslashreplace')
b'hello\\nworld\\u2514--'
>>> s.encode('latin-1', errors='backslashreplace').decode('unicode-escape')
'hello\nworld└--'
>>> print(s.encode('latin-1', errors='backslashreplace').decode('unicode-escape'))
hello
world└--
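The approach above can be wrapped in a small helper. This is a minimal sketch; the function name unescape is made up for the example, and it assumes every backslash sequence in the input is meant to be interpreted.

```python
def unescape(text):
    """Interpret backslash escapes in `text` without mangling non-Latin-1 characters."""
    # Characters outside Latin-1 become \uXXXX escapes, which unicode_escape
    # then turns back into the original characters.
    as_bytes = text.encode("latin-1", errors="backslashreplace")
    return as_bytes.decode("unicode_escape")

s = "hello\\nworld└--"
result = unescape(s)
print(result)
```

Characters that Latin-1 can represent directly, such as é, pass through as single bytes and are decoded unchanged.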

Try removing the second escape backslash and decode using utf8:
>>> s = "hello\nworld└--"
>>> print(s.encode('utf8').decode('utf8'))
hello
world└--

I believe the problem you are having is that unicode_escape assumes its input is 'latin-1', regardless of the encoding you used to produce the bytes.
Looking at the Python documentation for codecs, we see: "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default." This tells us that unicode_escape assumes that your text is ISO Latin-1.
So if we run your code with latin1 encoding we get this error:
s.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
The character named in the error is '\u2514', which is '└'. Put simply, that character cannot be represented in a Latin-1 string, which is why you get a different character when the UTF-8 bytes are decoded as Latin-1.
It is also worth pointing out that your string contains '\\n' and not just '\n'. The extra backslash escapes the backslash itself, so the string holds a literal backslash followed by the letter n rather than a newline character. Perhaps try not using the \\n...
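The difference between the two literals is easy to see by comparing lengths. A small Python 3 sketch, with variable names chosen only for illustration:

```python
escaped = "hello\\nworld"   # backslash + 'n': two characters, no line break
newline = "hello\nworld"    # a single newline character (line feed)

print(len(escaped), len(newline))
print(escaped)
print(newline)
```

If the text arrives from a file or another program with the backslash already in it, changing the literal is not an option and the escape has to be interpreted at run time instead.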

How to initialize a UTF-16 in code?

Using Python3 to minimize the pain when dealing with Unicode, I can print a UTF-8 character as such:
>>> print (u'\u1010')
တ
But when trying to do the same with UTF-16, let's say U+20000, u'\u20000' is the wrong way to initialize the character:
>>> print (u'\u20000')
  0
>>> print (list(u'\u20000'))
['\u2000', '0']
It reads it as 2 UTF-8 characters instead.
I've also tried the big U, i.e. u'\U20000', but it throws some escape error:
>>> print (u'\U20000')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
Big U outside the string didn't work either:
>>> print (U'\u20000')
 0
>>> print (U'\U20000')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape

These are not UTF-8 and UTF-16 literals, but just unicode literals, and they mean the same:
>>> print(u'\u1010')
တ
>>> print(u'\U00001010')
တ
>>> print(u'\u1010' == u'\U00001010')
True
The second form just allows you to specify a code point above U+FFFF.
The easiest way to do this: encode your source file as UTF-8 (or UTF-16), and then you can just write u"တ" and u"𠀀".
UTF-8 and UTF-16 are ways to encode those to bytes. To be technical, in UTF-8 that would be "\xf0\xa0\x80\x80" (which I would probably write as u"𠀀".encode("utf-8")).
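To make the distinction between a code point and its encodings concrete, here is a short Python 3 sketch (the variable names are illustrative). It shows that \u consumes exactly four hex digits, \U exactly eight, and that the same one-character string encodes to different byte sequences.

```python
short_form = "\u1010"        # \u takes exactly 4 hex digits
long_form = "\U00001010"     # \U takes exactly 8 hex digits
astral = "\U00020000"        # U+20000, above U+FFFF
same_astral = chr(0x20000)   # built from the code point number instead
truncated = "\u20000"        # parsed as U+2000 followed by the digit "0"

utf8 = astral.encode("utf-8")        # 4 bytes
utf16 = astral.encode("utf-16-be")   # 4 bytes: a surrogate pair

print(short_form == long_form, astral == same_astral)
print(len(astral), len(truncated))
print(utf8, utf16)
```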

As @Mark Ransom commented, Python's \U notation requires exactly eight hex digits to work.
Therefore, the Python code to use is:
u"\U00020000"
as listed on this page:
Python source code u"\U00020000"

Why is Python trying to automatically encode my Unicode string? [duplicate]

This question already has an answer here:
Unicode error Ordinal not in range
(1 answer)
Closed 6 years ago.
I'm trying to read a bunch of e-mail messages from files that are encoded in ISO-8859-1, then write (parts of) them out to a JSON file with UTF-8 encoding. I've currently got a program that reads them and produces objects with str type properties containing the various fields of the message. I want to convert these str strings (encoded bitstrings) to unicode strings (abstract Unicode objects) so that I can later re-encode them with UTF-8 when I write out the file. So I use the decode method of str, like this:
msg_dict = {u'Id' : message.message_id.decode('iso-8859-1'),
u'Subject' : message.subject.decode('iso-8859-1'),
u'SenderEmail' : message.sender_email.decode('iso-8859-1'),
u'SenderName' : message.sender_name.decode('iso-8859-1'),
u'Date': message.date.isoformat()}
According to the documentation I've read, decode should take the str object, interpret its bytes according to the given encoding, and return a unicode object representing those characters. But when I run my code, I get this error:
File "/home/edward/long/path/omitted/dumpMails.py", line 38, in <module>
u'Subject' : message.subject.decode('iso-8859-1'),
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
How could I be getting an encode error when I call decode? My best guess is that Python has decided to automatically convert the returned unicode back to a str, using the default encoding. But why is it trying to do this? Is it something to do with putting unicodes in a dictionary?

Python will automatically try and encode a value if it is not yet a byte string. You cannot decode a Unicode string, after all, so Python tries to be helpful and tries to make it a bytestring first.
In other words, the string is already decoded to unicode:
>>> decoded = u'åüøî'
>>> decoded.decode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
You'll either have to test if it is already a Unicode string, or if it is always a Unicode string, just don't try to decode it.
Incidentally, you'll see the inverse problem if you have a byte string that you are trying to encode; Python will implicitly decode such a value first, so that it has a unicode object to encode for you:
>>> encoded = u'åüøî'.encode('utf8')
>>> encoded.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Note the Decode in that exception name: the failure happens in the implicit decode, not in the encode you asked for.
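The "test if it is already Unicode" advice can be sketched as follows for Python 3, where the check is against bytes rather than str. The helper name to_text and the sample field values are assumptions made for this example; in Python 2 the equivalent check would be isinstance(value, str).

```python
import json

def to_text(value, encoding="iso-8859-1"):
    """Decode bytes to str; pass already-decoded text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical message fields: one still raw bytes, one already decoded.
fields = {"Subject": b"R\xe9sum\xe9", "SenderName": "Ren\u00e9e"}
msg_dict = {key: to_text(value) for key, value in fields.items()}

# Encode once, at the point where the data leaves the program.
payload = json.dumps(msg_dict, ensure_ascii=False).encode("utf-8")
print(msg_dict)
```

Python 3 never performs the implicit ASCII conversion described above, so calling decode on a str there fails with an AttributeError instead.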

Strange behavior of string format in python 2.7

Working with svn logs in XML format, I accidentally got an error in my script.
Error message is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
By debugging the input data I found what was wrong. Here is an example:
>>> a=u'\u0440\u0435\u044c\u0434\u0437\u0444\u043a\u044b\u0443\u043a \u043c\u0443\u043a\u044b\u0448\u0449\u0442 \u0430\u0448\u0447'
>>> print a
реьдзфкыук мукышщт ашч
>>> print '{}'.format(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
Can you please explain what is wrong with format?
It seems like it sees the u before the string bytes and tries to decode it from UTF-8.
However, in Python 3 the above example works without error.

You are mixing Unicode and byte string values. Use a unicode format:
print u'{}'.format(a)
Demo:
>>> a=u'\u0440\u0435\u044c\u0434\u0437\u0444\u043a\u044b\u0443\u043a \u043c\u0443\u043a\u044b\u0448\u0449\u0442 \u0430\u0448\u0447'
>>> print u'{}'.format(a)
реьдзфкыук мукышщт ашч
In Python 3, strings are unicode values by default; in Python 2, u"..." indicates a unicode value and regular strings ("...") are byte strings.
Mixing byte strings and unicode values results in automatic encoding or decoding with the default codec (ASCII), and that's what happens here. The str.format() method has to encode the Unicode value to a byte string to interpolate it.
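For comparison, Python 3 refuses to mix the two types rather than converting silently. The sketch below uses a shortened sample string and made-up variable names; it shows that formatting text into text just works, while combining bytes with text raises a TypeError.

```python
a = "\u0440\u0435\u044c"   # str (Unicode text)

# Text formatted into text: no encoding step is involved.
formatted = "{}".format(a)

# Bytes and text do not mix: Python 3 raises instead of guessing a codec.
try:
    b"prefix: " + a
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Build the text first, then encode explicitly when bytes are needed.
as_bytes = "prefix: {}".format(a).encode("utf-8")
print(formatted, mixed_ok, as_bytes)
```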

Handle wrongly encoded character in Python unicode string

I am dealing with unicode strings returned by the python-lastfm library.
I assume somewhere on the way, the library gets the encoding wrong and returns a unicode string that may contain invalid characters.
For example, the original string I am expecting in the variable a is "Glück"
>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
File "", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
\xfc is the escaped value 252, which corresponds to the latin1 encoding of "ü". Somehow this gets embedded in the unicode string in a way Python can't handle on its own.
How do I convert this back to a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError, or a string containing the sequence \xfc.

You have to convert your unicode string into a standard string using some encoding e.g. utf-8:
some_unicode_string.encode('utf-8')
Apart from that: this is a dupe of
BeautifulSoup findall with class attribute- unicode encode error
and at least ten other related questions on SO. Research first.

Your unicode string is fine:
>>> import unicodedata
>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'
The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec -- but that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me -- likely because my environment variable settings differ from yours):
>>> print u'Gl\xfcck'
Glück
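One way to make output robust regardless of the terminal is to look at sys.stdout.encoding and escape whatever it cannot represent. The following is a minimal Python 3 sketch; the helper name printable is invented for this example, and falling back to "ascii" when no encoding is reported is an assumption.

```python
import sys

def printable(text, encoding=None):
    """Return `text` with characters the target encoding lacks shown as escapes."""
    # sys.stdout.encoding can be None when output is not a real terminal.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

word = "Gl\xfcck"
print(printable(word))           # depends on the terminal's encoding
print(printable(word, "ascii"))  # always safe: non-ASCII shown as \xfc
```

The string itself is never changed by this; only the representation sent to the terminal is.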

At the beginning of your code, just after imports, add these 3 lines.
import sys # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')
It will override the system default encoding (ascii) for the duration of your program.
Edit: You shouldn't do this unless you are sure of the consequences, see comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')

Do not cast what you get from model fields to a string with str(), as long as it is already a unicode string.
(oops I have totally missed that it is not django-related)

I stumbled upon this bug myself while processing a file of German words that, unknown to me, had been encoded in UTF-8. The problem manifested itself when I started processing the words and some of them raised the encoding error:
# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40)
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
I solved the error by calling the encode method on the string:
>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück
