Python throws UnicodeEncodeError although I am doing str.decode(). Why?

Consider this function:
import htmlentitydefs

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)
It should escape all non-ASCII characters with the corresponding entities from htmlentitydefs. Unfortunately Python throws
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.
But I don't use str.encode(). I only use str.decode(). Am I missing something?

It's a misleading error report that comes from the way Python handles the de/encoding process. You tried to decode an already decoded string a second time, and that confuses the Python machinery, which retaliates by confusing you in turn! ;-) The encoding/decoding work is done, as far as I know, by the codecs module, and somewhere in there lies the origin of these misleading exception messages.
You may check for yourself: either
u'\x80'.encode('ascii')
or
u'\x80'.decode('ascii')
will throw a UnicodeEncodeError, whereas
u'\x80'.encode('utf8')
will not, but
u'\x80'.decode('utf8')
again will!
I guess you are confused by the meaning of encoding and decoding.
To put it simply:
                     decode                 encode
ByteString (ascii)  -------->   UNICODE   --------->  ByteString (utf8)
                     codec                  codec
But why does the decode method take a codec argument? The underlying function cannot guess which codec the ByteString was encoded with, so it takes the codec as a hint. If it is not provided, it assumes you mean sys.getdefaultencoding(), which is used implicitly.
So when you use c.decode('ascii') you are saying that a) you have an (encoded) ByteString (that's why you use decode), b) you want to get a unicode object out of it (that's what you use decode for), and c) the codec the ByteString is encoded with is ascii.
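A minimal sketch of that round trip in Python 2 (the byte values here are just an illustration):

byte_string = 'Tam\xc3\xa1s'                   # bytes, UTF-8 encoded
unicode_string = byte_string.decode('utf-8')   # decode: bytes -> unicode; the codec names the source encoding
print repr(unicode_string)                     # u'Tam\xe1s'
print repr(unicode_string.encode('utf-8'))     # encode: unicode -> bytes again, 'Tam\xc3\xa1s'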
See also:
https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it - and it does so by default using the ASCII encoding.
Edit to add: It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').
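A quick check of that one-liner on the question's sample value (Python 2 console):

>>> text = u'Tam\xe1s Horv\xe1th'
>>> text.encode('ascii', 'xmlcharrefreplace')
'Tam&#225;s Horv&#225;th'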

Python has two types of strings: character-strings (the unicode type) and byte-strings (the str type). The code you have pasted operates on byte-strings. You need a similar function to handle character-strings.
Maybe this:
import htmlentitydefs

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        if (ord(c) < 32) or (ord(c) > 126):
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)
I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character-strings directly to the API and have it figure out how to set the encoding.
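A rough sketch of that approach, with a hypothetical framework-agnostic view function:

def render_page(title, body):
    # work in character strings (unicode) throughout...
    html = u'<html><head><title>{}</title></head><body>{}</body></html>'.format(title, body)
    # ...and encode exactly once, at the boundary to the client
    return html.encode('utf-8')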

This answer always works for me when I have this problem:
def byteify(input):
    '''
    Recursively converts unicode objects in the given input to UTF-8 byte strings.
    '''
    if isinstance(input, dict):
        return {byteify(key): byteify(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
from How to get string objects instead of Unicode ones from JSON in Python?

I found a solution on this site:
import sys

reload(sys)
sys.setdefaultencoding("latin-1")

a = u'\xe1'
print str(a)  # no exception

Decoding a str here makes no sense.
I think you can just check ord(c) > 127.

Related

Converting widechars to system ANSI encoding in Python

I am currently trying to make my screen reader work better with Becky! Internet Mail. The problem I am facing is related to the list view there. This control is not Unicode aware, but the items are custom-drawn on screen, so when someone looks at it the content of all fields looks okay regardless of encoding. When accessed via MSAA or UIA, however, basic ANSI characters and mails encoded with the code page set for non-Unicode programs have their text correct, whereas mails encoded in Unicode do not.
Samples of the text :
Zażółć gęślą jaźń
is represented by:
ZaĹľĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„
In this case it is damaged CP1250 as per answer below.
However:
⚠️
is represented by:
âš ď¸Ź
⏰
is represented by:
âŹ°
and
高生旺
is represented by:
é«ç”źć—ş
I had assumed that these strings are damaged beyond repair; however, when the Unicode beta support in Windows 10 is enabled, they are exposed correctly.
Is it possible to simulate this behavior in Python?
The solution needs to work in both Python 2 and 3.
At the moment I am simply replacing known combinations of these characters with their proper representations, but this is not a very good solution, because the lists of replacements and characters to replace need to be updated with each newly discovered character.
Your UTF-8 is being decoded as cp1250.
What I did in Python 3 is this:
orig = "Zażółć gęślą jaźń"
wrong = "ZaĹľĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„"
for enc in range(437, 1300):
    try:
        res = orig.encode().decode(f"cp{enc}")
        if res == wrong:
            print('FOUND', res, enc)
    except:
        pass
...and the result was the 1250 codepage.
So your solution should be:
import sys

def restore(garbaged):
    # python 3
    if sys.version_info.major > 2:
        return garbaged.encode('cp1250').decode()
    # python 2
    else:
        # is it a string
        try:
            return garbaged.decode('utf-8').encode('cp1250')
        # or is it unicode
        except UnicodeEncodeError:
            return garbaged.encode('cp1250')
EDIT:
The reason why "高生旺" cannot be recovered from é«ç”źć—ş:
"高生旺".encode('utf-8') is b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'.
The problem is the \x98 part. In cp1250 there is no character assigned to that value. If you try this:
"高生旺".encode('utf-8').decode('cp1250')
You will get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2: character maps to <undefined>
The way to get "é«ç”źć—ş" is:
"高生旺".encode('utf-8').decode('cp1250', 'ignore')
But that ignore is the culprit: it causes data loss:
'é«ç”źć—ş'.encode('cp1250') is b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'.
If you compare these two:
b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'
you will see that the \x98 byte is missing, so when you try to restore the original content you will get a UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte.
If you try this:
'é«ç”źć—ş'.encode('cp1250').decode('utf-8', 'backslashreplace')
The result will be '\\xe9\\xab生旺'. \xe9\xab\x98 could be decoded to 高, but from \xe9\xab alone that is not possible.
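A small Python 3 sketch of the two cases described above, recoverable versus lossy:

ok = "Zażółć gęślą jaźń"
round_trip = ok.encode("utf-8").decode("cp1250").encode("cp1250").decode("utf-8")
print(round_trip == ok)        # True: every UTF-8 byte survives the cp1250 detour

lossy = "高生旺".encode("utf-8").decode("cp1250", "ignore")   # the \x98 byte is silently dropped
try:
    lossy.encode("cp1250").decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)                 # invalid continuation byte: the original cannot be restored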

Encoding unicode with python

I'm trying to understand how encoding unicode works in Python 2.7. It's easy to find a solution, but I haven't found any clear explanation of what is going on here. Here is an example.
The introduction
We have a unicode variable we received, called filter_type
filter_type = u'some_välüe'.
We put this into a dict and pass this into the python library urllib.urlencode.
Like so:
urllib.urlencode({"param": ..., "filter_type": filter_type})
The issue.
Inside, urllib.urlencode loops over the data given to it and wraps the keys and values in the str() builtin to get a string representation of each before encoding them into a URL.
We get an error similar to the following:
{UnicodeEncodeError}'ascii' codec can't encode character u'\xf1' in position 42: ordinal not in range(128).
You get this same error by doing str(u'some_välüe').
So after some research and digging into this, it looks like when you wrap unicode values in str() it tries to encode the value with the default encoding that is set (my assumption).
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
The solution.
So we can fix this by encoding these unicode strings with utf-8.
filter_type = u'some_välüe'.encode('utf-8').
The question.
But here is the question. Before, I mentioned that urllib.urlencode wraps keys and values in the str() function.
These values are already encoded now, so...
What does str() do in this case now?
Does the representation of a unicode object change when it's encoded to UTF-8?
If it does, why did str() try to encode the unicode object to ascii (the default) in the first place?
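A small Python 2 sketch of the behaviour being asked about (illustrative values only):

import urllib

filter_type = u'some_v\xe4l\xfce'.encode('utf-8')      # now a plain byte string (str)
print str(filter_type) == filter_type                   # True: str() leaves an already-encoded str unchanged
print urllib.urlencode({'filter_type': filter_type})    # filter_type=some_v%C3%A4l%C3%BCe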

What's the proper way to convert to unicode?

Say you have a string
s = "C:\Users\Eric\Desktop\beeline.txt"
that you want to move to Unicode if it's not.
return s if PY3 or type(s) is unicode else unicode(s, "unicode_escape")
If there's a chance that the string will have a \U in it (i.e., a user directory), you'll probably get Unicode decode errors:
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 3-4: truncated \UXXXXXXXX escape
Is there anything wrong with just forcing it like so:
return s if PY3 or type(s) is unicode else unicode(s.encode('string-escape'), "unicode_escape")
Or is explicitly checking for the existence of \U ok as it's the only corner case?
I want the code to work for both Python 2 & 3.
It works fine with English, but when faced with actual unicode input, the forcible translation might not use the same encoding it would by default, leaving you with unpleasant errors.
I wrapped your code in a function called assert_unicode (replacing the is check with isinstance) and ran a test on text in Hebrew (which simply says 'hello'); check it out:
In [1]: def assert_unicode(s):
            return s if isinstance(s, unicode) else unicode(s, 'unicode_escape')
In [2]: assert_unicode(u'שלום')
Out[2]: u'\u05e9\u05dc\u05d5\u05dd'
In [3]: assert_unicode('שלום')
Out[3]: u'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
You see? Both return a unicode object, but there's still a big difference. If you try printing or working with the second example, it will probably fail (a simple print, for example, failed for me, and I'm using Console2, which is very unicode-friendly).
The solution? Use utf-8. It is the standard these days, and if you make sure everything is treated as utf-8, it should work like a charm for any given language:
In [4]: def assert_unicode(s):
            return s if isinstance(s, unicode) else unicode(s, 'utf-8')
In [5]: assert_unicode(u'שלום')
Out[5]: u'\u05e9\u05dc\u05d5\u05dd'
In [6]: assert_unicode('שלום')
Out[6]: u'\u05e9\u05dc\u05d5\u05dd'
The routine below is similar in spirit to the answer by @yuvi, but it goes through multiple encodings (configurable) and returns the encoding used. It also handles errors more gracefully (by only converting things that are basestring).
#unicode practice, this routine forces stringish objects to unicode
#preferring utf-8 but works through other encodings on error
#return values are the encoded string and the encoding used
def to_unicode_or_bust_multile_encodings(obj, encoding=['utf-8', 'latin-1', 'Windows-1252']):
    successfullyEncoded = False
    for elem in encoding:
        if isinstance(obj, basestring):
            if not isinstance(obj, unicode):
                try:
                    obj = unicode(obj, elem)
                    successfullyEncoded = True
                    #if we succeed then exit early
                    break
                except:
                    #encoding did not work, try the next one
                    pass
    if successfullyEncoded:
        return obj, elem
    else:
        return obj, 'no_encoding_found'
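Example use of the helper above, on a hypothetical latin-1 byte string:

raw = '\xe9t\xe9'                                    # latin-1 bytes for "été"
text, used = to_unicode_or_bust_multile_encodings(raw)
print repr(text), used                               # u'\xe9t\xe9' latin-1  (utf-8 fails first, latin-1 succeeds)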
What's the proper way to convert to unicode?
Here it is:
unicode_string = bytes_object.decode(character_encoding)
Now the question becomes: I have a sequence of bytes, what character encoding should I use to convert them into a Unicode string?
The answer depends on where the bytes come from.
In your case, the bytestring is specified using a Python literal for bytestrings (Python 2), therefore the encoding is the character encoding of your Python source file. If there is no character encoding declaration at the top of the file (a comment that looks like # -*- coding: utf-8 -*-), then the default source encoding is 'ascii' on Python 2 ('utf-8' on Python 3). So the answer in your case is:
if isinstance(s, str) and not PY3:
    return s.decode('ascii')
Or you could use Unicode literals directly (Python 2 and Python 3.3+):
unicode_string = u"C:\\Users\\Eric\\Desktop\\beeline.txt"

Python unicode character in __str__

I'm trying to print cards using their suit unicode character and their values. I tried doing the following:
def __str__(self):
    return u'\u2660'.encode('utf-8')
as suggested in another thread, but I keep getting errors saying UnicodeEncodeError: ascii, ♠, 0, 1, ordinal not in range(128). What can I do to get those suit characters to show up when I print a list of cards?
Where exactly does that UnicodeEncodeError occur? I can think of two possible issues here:
The UnicodeEncodeError occurs in your __unicode__ method.
Your __unicode__ method returns a byte string instead of a unicode object, and that byte string contains non-ASCII characters.
Do you have a __unicode__ method in your class?
I tried this on the Python console according to the actual data from your comment:
>>> u'\u2660'.encode('utf-8')
'\xe2\x99\xa0'
>>> print '\xe2\x99\xa0'
♠
It seems to work. Could you please try to print the same on your console? Maybe your console encoding is the problem.
Depending on how you encoded those "suit symbols" into a byte string, you'll need to turn that byte string back into a unicode string by naming the appropriate codec (for example, the bytestr.decode('latin-1') if latin-1 is how you encoded it!) before making the utf-8 encoding of that unicode string. Plain unicode(something) uses the default encoding, which is ASCII and therefore totally ignorant of any "suit symbols"!-)
As I said back then (3 months ago), I'd go for implementing __unicode__ instead of __str__, but that's just a minor issue of simplicity. The core point is, rather: if your byte string includes anything outside of the limited ASCII encoding, you must know what encoding your byte string uses, and decode it back into Unicode by explicitly using that codec!
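A hypothetical Card class sketching the __unicode__ approach suggested above (Python 2):

class Card(object):
    SUITS = {'spades': u'\u2660', 'hearts': u'\u2665',
             'diamonds': u'\u2666', 'clubs': u'\u2663'}

    def __init__(self, rank, suit):
        self.rank, self.suit = rank, suit

    def __unicode__(self):
        return u'{}{}'.format(self.rank, self.SUITS[self.suit])

    def __str__(self):
        # encode explicitly, and only at the output boundary
        return unicode(self).encode('utf-8')

print Card('A', 'spades')    # prints A♠ on a UTF-8 terminal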
I ran the same code and got
>>> u'\u2660'.encode('utf-8')
'\xe2\x99\xa0'
>>> print ('\xe2\x99\xa0')
â™ 

How to check if a string in Python is in ASCII?

I want to check whether a string is in ASCII or not.
I am aware of ord(); however, when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
I think you are not asking the right question--
A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.
Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer
by trying:
try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not a ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.
def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())
To check, pass the test string:
>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
New in Python 3.7 (bpo32677)
No more tiresome/inefficient ASCII checks on strings; the new built-in str/bytes/bytearray method .isascii() will check if the string is ASCII.
print("is this ascii?".isascii())
# True
Vincent Marchetti has the right idea, but str.decode no longer exists in Python 3. In Python 3 you can make the same test with str.encode:
try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii
Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.
Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0 to 0x10FFFF (1,114,111). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.
Instead of ord(u'é'), try this:
>>> [ord(x) for x in u'é']
That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 770].
Instead of chr() to reverse this, there is unichr():
>>> unichr(233)
u'\xe9'
This character may actually be represented as either a single or multiple unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent (i.e., code point 233)", or "e" (code point 101) followed by "an acute accent on the previous character" (code point 770). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.
Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
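A small illustration of composed versus decomposed forms, per the paragraph above (Python 2):

import unicodedata

composed = u'\u00e9'       # é as a single code point
decomposed = u'e\u0301'    # e followed by a combining acute accent
print len(composed), len(decomposed)                         # 1 2
print unicodedata.normalize('NFC', decomposed) == composed   # True
print unicodedata.normalize('NFD', composed) == decomposed   # True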
The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
Ran into something like this recently - for future reference
import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'
which you could use with:
string_ascii = string.decode(encoding['encoding']).encode('ascii')
How about doing this?
import string

def isAscii(s):
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True
I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).
My first step should have been to check the type of the string; I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.
If you're getting a rude and persistent
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)
particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode; for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe and the Python docs tutorials for a better understanding of how terrible this can be.)
Eventually I determined that what I wanted to do was this:
escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))
Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):
# -*- coding: utf-8 -*-
That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').
>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
To improve on Alexander's solution, from Python 2.6 (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html
from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)
You could use the third-party regex library, which accepts the POSIX standard [[:ASCII:]] definition.
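A minimal sketch, assuming the third-party regex package (pip install regex) and its support for POSIX character classes:

import regex

def is_ascii(s):
    return bool(regex.match(r'^[[:ascii:]]*$', s))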
A string (str type) in Python 2 is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.
However, if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
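A short sketch of that "decode, then check" approach, assuming the bytes are known to be UTF-8:

def has_only_ascii(byte_string, encoding='utf-8'):
    return all(ord(ch) < 128 for ch in byte_string.decode(encoding))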
Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.
>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True
I imagine a regular expression is well-optimized for this.
import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))
To include an empty string as ASCII, change the + to *.
To prevent your code from crashing, you may want to use a try/except to catch TypeErrors:
>>> ord("¶")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
For example
def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False
I use the following to determine if the string is ascii or unicode:
>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>>
Then just use a conditional block to define the function:
def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
