This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
(34 answers)
The smtplib.server.sendmail function in python raises UnicodeEncodeError: 'ascii' codec can't encode character
(1 answer)
Closed 8 months ago.
I am getting the above error when trying to run an automated email script. If I delete most of the body of the email it works fine. However, once the body exceeds some number of characters (all of them plain letters or digits), I get the error. I tried editing the code in smtplib.py itself, but then my script stopped working.
This is my code:
smtpObj = smtplib.SMTP('smtp endpoint', 587)
smtpObj.ehlo()
smtpObj.starttls()
smtpObj.login('login', sys.argv[1])
for name, email in unmessagedmembers.items():
    body = """Subject: Research\na {} lot of text
n more text""".format(name)
    print('Sending email to {}...'.format(email))
    sendmailStatus = smtpObj.sendmail('me#email.com', email, body)
    if sendmailStatus != {}:
        print('There was a problem sending email to {}: {}'.format(email, sendmailStatus))
smtpObj.quit()
It says the error occurs on this line:
sendmailStatus = smtpObj.sendmail('me#email.com', email, body)
The documentation states: "msg may be a string containing characters in the ASCII range, or a byte string. A string is encoded to bytes using the ascii codec, and lone \r and \n characters are converted to \r\n characters. A byte string is not modified."
Here msg is your body argument. Since body is a str, Python tries to encode it using the ASCII codec when you call smtpObj.sendmail(), and the encoding comes down to:
body.encode('ascii')
If you were to run that line yourself, you're likely to see the same error message.
To avoid it, encode body yourself and then pass the resulting bytes to smtpObj.sendmail():
body = body.encode()  # by default this uses UTF-8 in modern Python
sendmailStatus = smtpObj.sendmail('me#email.com', email, body)
Note that your example code has indentation errors; I assumed all the code from the body assignment onwards is meant to be inside the for block.
However, whether that causes problems on the receiving end is a different matter - you can try replacing any character outside the ASCII range with other characters, or look into adding headers that instruct the recipient to decode the message using the correct encoding.
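In Python 3 an alternative may be to skip manual encoding entirely and build the message with email.message.EmailMessage, which picks a suitable transfer encoding for non-ASCII bodies and headers for you. A minimal sketch, with placeholder addresses and text rather than the asker's real values:

```python
from email.message import EmailMessage

def build_message(name, to_addr):
    msg = EmailMessage()
    msg["Subject"] = "Research"
    msg["From"] = "me@example.com"   # placeholder sender address
    msg["To"] = to_addr
    # set_content() chooses an appropriate encoding even for non-ASCII text
    msg.set_content("Dear {},\na lot of text\nmore text".format(name))
    return msg

msg = build_message("Zoë", "member@example.com")
# smtpObj.send_message(msg) then replaces smtpObj.sendmail(...)
```

With this approach there is no need to call .encode() anywhere yourself, and the correct MIME headers are added so the recipient can decode the message.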
I am currently working with a Python script (App Engine) that takes text input from the user and stores it in the database for redistribution later.
The encoding of the incoming text is unknown, and I need it to end up percent-encoded exactly once.
Example Texts from clients:
This%20is%20a%20test
This is a test
Now in python what I thought I could do is decode it then encode it so both samples become:
This%20is%20a%20test
This%20is%20a%20test
The code that I am using is as follows:
#
# Decode as UTF-8
#
pl = pl.encode('UTF-8')
#
# Unquote the string, then requote to ensure encoding
#
pl = urllib.quote(urllib.unquote(pl))
Where pl is from the POST parameter for payload.
The Issue
The issue is that the input sometimes contains non-ASCII (e.g. Chinese or Arabic) characters, and then I get the following error:
'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
..snip..
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
does anyone know the best solution to process the string given the above issue?
Thanks.
Replace
pl = pl.encode('UTF-8')
with
pl = pl.decode('UTF-8')
since you're trying to decode a byte-string into a string of characters.
A design issue in Python 2 lets you call .encode on a byte string (which is already encoded): Python first automatically decodes it as ASCII, which is why it appears to work for ASCII-only strings and fails only on non-ASCII bytes.
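Under Python 3 (where urllib.quote has moved to urllib.parse.quote) the corrected flow could look like the sketch below; the variable name pl mirrors the question, and the payload bytes are simulated rather than taken from a real POST:

```python
from urllib.parse import quote, unquote

# Simulate a payload arriving as raw bytes of (assumed) UTF-8 text
raw = b"This%20is%20a%20test"

pl = raw.decode("utf-8")    # decode bytes -> str (not encode!)
pl = quote(unquote(pl))     # unquote, then requote exactly once
```

Both an already-quoted payload and a plain one end up quoted once after this round trip, which is the behaviour the question asks for.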
I'm trying to left-align a UTF-8 encoded string with string.ljust. This exception is raised: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128). For example,
s = u"你好"  # a Chinese string
stdout.write(s.encode("UTF-8").ljust(20))
Am I on the right track? Or should I use another approach to format?
Thanks and Best Regards.
Did you post the exact code and the exact error you received? Your code works without throwing an error on both a cp437 and a utf-8 terminal. In any case, you should justify the Unicode string before sending it to the terminal. Note the difference: the Chinese string is 6 bytes long when UTF-8-encoded instead of 2 characters, so ljust on the encoded bytes pads too few spaces:
>>> sys.stdout.write(s.encode('utf-8').ljust(20) + "hello")
你好              hello
>>> sys.stdout.write(s.ljust(20).encode('utf-8') + "hello")
你好                  hello
Note also that Chinese characters are wider than the other characters in typical fixed-width fonts so things may still not line up as you like if mixing languages (see this answer for a solution):
>>> sys.stdout.write("12".ljust(20) + "hello")
12                  hello
Normally you can skip explicit encoding to stdout. Python implicitly encodes Unicode strings to the terminal in the terminal's encoding (see sys.stdout.encoding):
sys.stdout.write(s.ljust(20))
Another option is using print:
print "%20s" % s # old-style
or:
print '{:20}'.format(s) # new-style
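The length difference behind all this is easy to check directly (Python 3 syntax below; the byte-level ljust pads out to 20 bytes, not 20 characters):

```python
s = "你好"

# Padding the encoded bytes counts bytes (6), so only 14 spaces are added;
# padding the text first counts characters (2), so 18 spaces are added.
padded_bytes = s.encode("utf-8").ljust(20)
padded_text = s.ljust(20).encode("utf-8")
```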
I'm trying to read some files using Python 3.2; some of the files may contain Unicode while others do not.
When I try:
file = open(item_path + item, encoding="utf-8")
for line in file:
    print(repr(line))
I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)
I am following the documentation here: http://docs.python.org/release/3.0.1/howto/unicode.html
Why would Python be trying to encode to ascii at any point in this code?
The problem is that repr(line) in Python 3 also returns a Unicode string: it does not convert characters above code point 128 into ASCII escape sequences.
Use ascii(line) instead if you want to see the escape sequences.
Actually, repr(line) is expected to return a string that, if placed in source code, would produce an object with the same value. So the Python 3 behaviour is just fine: there is no need for ASCII escape sequences in source files to express a string containing more than ASCII characters. It is quite natural to use UTF-8 or some other Unicode encoding these days. The truth is that Python 2 produced the escape sequences for such characters.
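A quick illustration of the repr/ascii difference in Python 3 (using an arbitrary sample string):

```python
s = "naïve"
r = repr(s)    # keeps the non-ASCII character as-is
a = ascii(s)   # escapes it, much like Python 2's repr did
```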
What's your output encoding? If you remove the call to print(), does it start working?
I suspect you've got a non-UTF-8 locale, so Python is trying to encode repr(line) as ASCII as part of printing it.
To resolve the issue, you must either encode the string and print the byte array, or set your default encoding to something that can handle your strings (UTF-8 being the obvious choice).
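One way to sidestep the terminal's codec in Python 3 is to encode explicitly and write the bytes to sys.stdout.buffer. A small sketch, where 'utf-8' is just one sensible choice of encoding:

```python
import sys

line = "naïve text"
data = (repr(line) + "\n").encode("utf-8")
sys.stdout.buffer.write(data)   # bypasses sys.stdout's text encoding entirely
```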
I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)

File "clean_up_barnet.py", line 104, in <module>
    cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.
A very easy way around all the "'ascii' codec can't encode character…" issues with csv.writer is to use unicodecsv instead, a drop-in replacement.
Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:
import unicodecsv

f = open('users.csv', 'wb')  # unicodecsv writes encoded bytes, so open in binary mode
w = unicodecsv.writer(f)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
    w.writerow(user)
For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course, if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must first be turned into a string of bytes (and that's where the default ascii codec comes in and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.
So which IS it you're doing -- and which do you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: you got a UnicodeEncodeError from calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
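The options can be compared directly. The sketch below uses Python 3 syntax (str/bytes rather than Python 2's unicode/str), but the codec behaviour is the same; the sample string is made up to contain both a £ and a bullet:

```python
s = "Price: £10 • sale"

ignored = s.encode("latin-1", "ignore")     # the bullet is silently dropped
replaced = s.encode("latin-1", "replace")   # the bullet becomes '?'
utf8 = s.encode("utf-8")                    # every character is representable
```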
xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>
Working with xlrd, I had a line ...xl_data.find(str(cell_value))... which gave the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All the suggestions in the forums were useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose calling str() on cell values from xlrd has specific encoding problems.