UnicodeDecodeError: 'ascii' codec can't decode '\xc3\xa8' together with '\xe8' - python

I am having this strange problem below:
>>> a=u'Pal-Andr\xe8'
>>> b='Pal-Andr\xc3\xa8'
>>> print "%s %s" % (a,b) # boom
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
>>> print "%s" % a
Pal-Andrè
>>> print "%s" % b
Pal-Andrè
I can print a and b separately, but not both together.
What's the problem? How can I print them both?

The actual problem is
b = 'Pal-Andr\xc3\xa8'
Here b is a byte string (str) literal, not a unicode literal. So when you format them separately, a is treated as a Unicode string and b is treated as a plain byte string.
>>> "%s" % a
u'Pal-Andr\xe8'
>>> "%s" % b
'Pal-Andr\xc3\xa8'
Note that the u prefix is missing from the second result. You can confirm this by checking the types:
>>> type("%s" % b)
<type 'str'>
>>> type("%s" % a)
<type 'unicode'>
But when you format them together, the result has to be a unicode string, so Python 2 implicitly decodes the byte string b with the ASCII codec. The byte \xc3 is not valid ASCII, and that is why the code fails.
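To avoid the exception, b also has to be a unicode string. Declaring it as a unicode literal, like this, makes the error go away, but note that u'\xc3\xa8' is then two separate characters (Ã and ¨) rather than è, so decoding the bytes as UTF-8 is the way to keep the intended text: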
>>> a=u'Pal-Andr\xe8'
>>> b=u'Pal-Andr\xc3\xa8'
>>> "%s" % a
u'Pal-Andr\xe8'
>>> "%s" % b
u'Pal-Andr\xc3\xa8'
>>> "%s %s" % (a, b)
u'Pal-Andr\xe8 Pal-Andr\xc3\xa8'
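A small sketch of what the u prefix does to the escaped bytes, for anyone comparing the two values. The variable names are illustrative, and the Latin-1 round trip is a commonly used repair trick that is assumed here, not something taken from the answer above.

```python
right = u'Pal-Andr\xe8'        # 9 code points, ends in U+00E8
wrong = u'Pal-Andr\xc3\xa8'    # 10 code points, ends in U+00C3 U+00A8

# Each \x escape in a unicode literal is one code point, so the two differ.
same_text = (wrong == right)

# Latin-1 maps code points 0-255 back to the same byte values, which
# recovers the original UTF-8 bytes so they can be decoded properly.
repaired = wrong.encode('latin-1').decode('utf-8')
```

After the round trip, repaired compares equal to right, which suggests the data really was UTF-8 all along.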

I am not sure what the real issue is here, but one thing is for sure: a is a unicode string and b is a byte string.
You will have to encode or decode one of them before printing them both.
Here is an example:
>>> b = b.decode('utf-8')
>>> print u"%s %s" % (a,b)
Pal-Andrè Pal-Andrè

Having a mix of Unicode and byte strings makes the combined print try to promote everything to Unicode strings. You have to decode the byte string with the correct codec, otherwise Python 2 defaults to ASCII. b is a byte string encoded in UTF-8. The format string is promoted as well, which happens to work because it contains only ASCII characters. It is best to use Unicode everywhere:
>>> print u'%s %s' % (a,b.decode('utf8'))
Pal-Andrè Pal-Andrè
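To make the implicit step visible, here is a minimal sketch that performs the ASCII decode explicitly and reports where it fails. The helper name explicit_ascii_decode is hypothetical, and the b'' prefix is used so the snippet behaves the same on Python 2.6+ and Python 3.

```python
a = u'Pal-Andr\xe8'          # text
b = b'Pal-Andr\xc3\xa8'      # the same text, encoded as UTF-8 bytes

def explicit_ascii_decode(data):
    """Do by hand what Python 2 does implicitly when str meets unicode."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        # Report the offset and the offending byte instead of raising.
        return (exc.start, data[exc.start:exc.start + 1])

failure = explicit_ascii_decode(b)   # position 8, byte 0xc3
decoded = b.decode('utf-8')          # the correct codec for this data
combined = u'%s %s' % (a, decoded)
```

The failure position matches the "position 8" in the traceback from the question.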

Related

Python - Can't concatenate more than 1 non-ascii string

I'm trying to create a new string containing more than 1 string with special characters in it. This doesn't work:
# -*- coding: utf-8 -*-
str1 = "I am"
str2 = "español"
str3 = "%s %s %s" % (str1, u'–', str2)
print str3
>> Traceback (most recent call last):
File "myscript.py", line 5, in <module>
str3 = "%s %s %s" % (str1, u'–', str2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
The strange thing is that if I delete the ñ or the – character, it creates the string correctly:
# -*- coding: utf-8 -*-
str1 = "I am"
str2 = "espaol"
str3 = "%s %s %s" % (str1, u'–', str2)
print str3
>> I am – espaol
or:
# -*- coding: utf-8 -*-
str1 = "I am"
str2 = "español"
str3 = "%s %s" % (str1, str2)
print str3
>> I am español
What is wrong about it?
You are mixing Unicode strings and byte strings. Don't do that. Make sure all your strings are of the same type. Preferably, that's unicode.
When mixing str and unicode, Python implicitly will decode or encode one or the other type using the ASCII codec. Avoid implicit operations by explicitly encoding or decoding to make everything one type.
This is what is causing your UnicodeDecodeError exception; you are mixing a unicode value (u'–') with two str objects (byte strings, str1 and str2), but only str1 can be decoded as ASCII. str2 contains UTF-8 data and thus decoding fails. Explicitly creating unicode strings or decoding your data makes things work:
str1 = u"I am" # Unicode strings
str2 = u"español" # Unicode strings
str3 = u"%s %s %s" % (str1, u'–', str2)
print str3
or
str1 = "I am"
str2 = "español"
str3 = u"%s %s %s" % (str1.decode('utf-8'), u'–', str2.decode('utf-8'))
print str3
Note that I used a Unicode string literal for the formatting string too!
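If the inputs may arrive as either type, one possible approach is a small normalising helper applied at the boundary. This is only a sketch: to_text is a hypothetical name, and UTF-8 is assumed as the source encoding because the script declares it in its coding line.

```python
def to_text(value, encoding='utf-8'):
    """Return unicode text, decoding byte strings with the given codec."""
    if isinstance(value, bytes):     # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

str1 = b"I am"
str2 = b"espa\xc3\xb1ol"             # UTF-8 bytes for "español"
str3 = u"%s %s %s" % (to_text(str1), u'\u2013', to_text(str2))
```

Everything on the right-hand side is unicode by the time formatting happens, so no implicit ASCII decode is needed.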
You really should read up on Unicode and codecs and Python. I strongly recommend the following articles:
Ned Batchelder's Pragmatic Unicode
Joel Spolsky's The Absolute Minimum Every Programmer Must Know About Unicode
The Python Unicode HOWTO

Check that a string contains only ASCII characters?

How do I check that a string only contains ASCII characters in Python? Something like Ruby's ascii_only?
I want to be able to tell whether specific string data read from a file is ASCII.
Python 3.7 added methods which do what you want:
str, bytes, and bytearray gained support for the new isascii() method, which can be used to test if a string or bytes contain only the ASCII characters.
Otherwise:
>>> all(ord(char) < 128 for char in 'string')
True
>>> all(ord(char) < 128 for char in 'строка')
False
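The two options above can be combined into one function that prefers isascii() where it exists. This is a rough sketch for text strings only; the name only_ascii is made up for illustration.

```python
def only_ascii(text):
    """Return True if every character of a text string is ASCII."""
    if hasattr(text, 'isascii'):     # available from Python 3.7
        return text.isascii()
    # Fallback for older interpreters: check each code point.
    return all(ord(char) < 128 for char in text)

plain = only_ascii(u'string')
cyrillic = only_ascii(u'\u0441\u0442\u0440\u043e\u043a\u0430')
```

Both branches treat the empty string as ASCII.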
Another version:
>>> def is_ascii(text):
...     if isinstance(text, unicode):
...         try:
...             text.encode('ascii')
...         except UnicodeEncodeError:
...             return False
...     else:
...         try:
...             text.decode('ascii')
...         except UnicodeDecodeError:
...             return False
...     return True
...
>>> is_ascii('text')
True
>>> is_ascii(u'text')
True
>>> is_ascii(u'text-строка')
False
>>> is_ascii('text-строка')
False
>>> is_ascii(u'text-строка'.encode('utf-8'))
False
You can also opt for regex to check for only ascii characters. [\x00-\x7F] can match a single ascii character:
>>> import re
>>> OnlyAscii = lambda s: re.match('^[\x00-\x7F]+$', s) is not None
>>> OnlyAscii('string')
True
>>> OnlyAscii('Tannhäuser')
False
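A slightly stricter variant of the regex, offered as a sketch: '$' also matches just before a trailing newline, and '+' rejects the empty string, so \A...\Z with '*' is used here instead. Whether an empty string should count as ASCII is a design choice assumed for this example, and the name only_ascii_re is hypothetical.

```python
import re

# \A and \Z anchor to the very start and end of the string.
_ASCII_ONLY = re.compile(r'\A[\x00-\x7F]*\Z')

def only_ascii_re(text):
    """Return True if the text consists solely of ASCII characters."""
    return _ASCII_ONLY.match(text) is not None

ok = only_ascii_re(u'string')
umlaut = only_ascii_re(u'Tannh\xe4user')
```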
If you have unicode strings you can use the "encode" function and then catch the exception:
try:
    mynewstring = mystring.encode('ascii')
except UnicodeEncodeError:
    print("there are non-ascii characters in there")
If you have bytes, you can import the chardet module and check the encoding:
import chardet
# Get the encoding
enc = chardet.detect(mystring)['encoding']
A workaround to your problem (on Python 2) would be to try to encode the byte string in a particular encoding, which forces Python to decode it as ASCII first.
For example:
'H€llø'.encode('utf-8')
This will throw the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
Now you can catch the UnicodeDecodeError to determine that the string contains more than just ASCII characters.
try:
    'H€llø'.encode('utf-8')
except UnicodeDecodeError:
    print 'This string contains more than just the ASCII characters.'

Scrapy item pipeline

I am using a Scrapy spider with my own item pipeline:
value['Title'] = item['Title'][0] if ('Title' in item) else ''
value['Name'] = item['Name'][0] if ('CompanyName' in item) else ''
value['Description'] = item['Description'][0] if ('Description' in item) else ''
When I do this, the values come back prefixed with u.
Example: when I pass the value to the output and print it:
value['Title'] = u'hospital'
What is wrong with my code, why am I getting the u, and how do I remove it?
Can anyone help me?
Thanks,
The u means that the string is a unicode string. You can remove the u by passing the string to str: str(u'test'). But you can treat it as a normal string for most purposes. For example:
>>> u'test' == 'test'
True
If you have characters that cannot be represented with plain ascii you should keep the unicode way. If you call str on non ascii characters you will get an exception.
>>> test=u'বাংলা'
>>> test
u'\u09ac\u09be\u0982\u09b2\u09be'
>>> str(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
The u is not part of the string, it is just a way to indicate the type of the string.
>>> type('test')
<type 'str'>
>>> type(u'test')
<type 'unicode'>
See the following question for more details:
What does the 'u' symbol mean in front of string values?
To remove the u sign you may encode the string as ASCII like this: value['Title'].encode("ascii"). This raises UnicodeEncodeError if the value contains non-ASCII characters.
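As a rough illustration of the trade-off between the two codecs, the sketch below encodes both an ASCII-only title and the Bengali example. Choosing UTF-8 as the output encoding is an assumption about what the consumer of the data expects, not something Scrapy mandates.

```python
title = u'hospital'
as_bytes = title.encode('utf-8')          # plain byte string, no u prefix

non_ascii = u'\u09ac\u09be\u0982\u09b2\u09be'
try:
    non_ascii.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                      # ASCII cannot represent this text

utf8_bytes = non_ascii.encode('utf-8')    # UTF-8 can represent any text
round_trip = utf8_bytes.decode('utf-8')
```

Encoding only at the point of output keeps the values as unicode inside the pipeline, which is what the answers above recommend.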

Python regex with unicode strings

I could not match a unicode string in Python 2.7.
The expected result is 749130.
>>> print match("\d+", u'\ufeff749130'.encode('utf-8'))
None
>>> print match("\d+", u'\ufeff749130')
None
>>> print match("\d+", u'\ufeff749130'.decode('utf-8'))
Traceback (most recent call last):
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
No need to use str.decode on a unicode string. As stated in the comments, you may want to use search because match only matches from the beginning of the target string.
>>> print search("\d+", u'\ufeff749130').group()
749130
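The leading u'\ufeff' is a byte order mark. If the data comes from UTF-8 bytes, one option is to drop the mark while decoding, after which match works from the start of the string. This is a sketch under that assumption; the variable names are illustrative.

```python
import re

raw = u'\ufeff749130'.encode('utf-8')   # bytes as they might come from a file

# The 'utf-8-sig' codec decodes UTF-8 and skips a leading BOM if present.
text = raw.decode('utf-8-sig')
matched = re.match(r'\d+', text).group()

# Without stripping the BOM, search still finds the digits.
found = re.search(r'\d+', u'\ufeff749130').group()
```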

How to do string formatting with unicode emdash?

I am trying to do string formatting with a unicode variable. For example:
>>> x = u"Some text—with an emdash."
>>> x
u'Some text\u2014with an emdash.'
>>> print(x)
Some text—with an emdash.
>>> s = "{}".format(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 9: ordinal not in range(128)
>>> t = "%s" %x
>>> t
u'Some text\u2014with an emdash.'
>>> print(t)
Some text—with an emdash.
You can see that I have a unicode string and that it prints just fine. The trouble is when I use Python's new (and improved?) format() function. If I use the old style (using %s) everything works out fine, but when I use {} and the format() function, it fails.
Any ideas of why this is happening? I am using Python 2.7.2.
The new format() is not as forgiving when you mix ASCII and unicode strings ... so try this:
s = u"{}".format(x)
The same way, with an explicit positional index:
>>> s = u"{0}".format(x)
>>> s
u'Some text\u2014with an emdash.'
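A short sketch that puts both styles side by side with a unicode template. As far as I understand it, on Python 2 a byte-string template passed to format() has to produce a byte string, so the argument is encoded with ASCII, whereas the % operator promotes the result to unicode instead. With a unicode template neither conversion is needed.

```python
x = u'Some text\u2014with an emdash.'

s = u'{0}'.format(x)   # unicode template: the result stays unicode
t = u'%s' % x          # old-style formatting with a unicode template

same = (s == t == x)
```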
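Using the following worked well for me. It is a variant on the other answers. Note that the output shown is from Python 3, where every str is Unicode; on Python 2.7 the same call raises the UnicodeEncodeError from the question.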
>>> emDash = u'\u2014'
>>> "a{0}b".format(emDash)
'a—b'
