Python-How to solve UnicodeEncodeError [duplicate] - python

import urllib, urllib2
from bs4 import BeautifulSoup, Comment
strg=""
iter=1
url='http://www.amazon.in/product-reviews/B00EOPJEYK/ref=cm_cr_pr_top_link_1? ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
rows =soup.find_all('div',attrs={"class" : "reviewText"})
for row in soup.find_all('div',attrs={"class" : "reviewText"}):
    strg = strg +str(iter)+"." + row.text + "\n\n"
    iter=iter+1
with open('outp.txt','w') as f:
    f.write(strg)
    f.close()
I need this code to write the contents of the variable strg to the file outp.txt.
Instead I get this error:
Traceback (most recent call last):
File "C:\Python27\demo_amazon.py", line 14, in <module>
f.write(strg)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 226: ordinal not in range(128)
strg stores the required output. There is some problem in the write statement, I guess. How do I solve this?
Kindly help.
Thank you.

Well, first of all, if you want to get rid of the unicode errors, you should switch to Python 3, which defaults to unicode strings instead of the ascii byte strings of Python 2.
That said, to get rid of the UnicodeEncodeError exception, you should do:
with open('outp.txt','w') as f:
    f.write(strg.encode('utf8'))
As a reference, see that question. And try to use unicode strings as much as possible, so that you change charsets as rarely as possible, by writing u"this is a unicode string" instead of "this is an ascii string".
thus in your for loop:
strg = strg +str(iter)+"." + row.text + "\n\n"
should instead be:
strg = strg +unicode(iter)+u"." + row.text + u"\n\n"
and strg should be defined as strg = u""
N.B.: f.close() in your code is redundant with the use of the with keyword that actually takes care of closing the file when you exit the with block, through the __exit__() method of the File object.
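A minimal sketch of the same fix using io.open, which exists in Python 2.6+ and Python 3 and does the encoding at write time. The helper name write_reviews and the sample review strings are hypothetical and only stand in for the scraped data; the point is the file-writing pattern.

```python
import io
import os
import tempfile


def write_reviews(path, reviews):
    """Number each review and write the result to `path` as UTF-8."""
    # Build the text as unicode from the start (u"" literals also work on Python 2).
    text = u"".join(u"%d.%s\n\n" % (i, review) for i, review in enumerate(reviews, 1))
    # io.open encodes on write, so no manual .encode() call is needed.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Hypothetical sample data containing the character from the traceback (U+2022).
sample = [u"Great phone \u2022 fast delivery", u"Battery could be better"]
out_path = os.path.join(tempfile.mkdtemp(), "outp.txt")
written = write_reviews(out_path, sample)
```

With the encoding given to io.open, the string never passes through the implicit ascii codec, so the bullet character is written as UTF-8 bytes.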

Basically you have a non-ASCII character. I suggest using Unidecode which will try and find the "closest" ASCII character to the offending one. So, for instance it would turn é into e.
So you'd just do
from unidecode import unidecode
f.write(unidecode(strg))
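If installing Unidecode is not an option, a rough approximation is possible with the standard library alone. This sketch is not what Unidecode does internally: it only strips accents through Unicode decomposition and silently drops every character that has no ASCII base form. The name to_ascii is made up for the example.

```python
import unicodedata


def to_ascii(text):
    """Approximate `text` in ASCII by dropping accents (stdlib only)."""
    # NFKD splits a character such as U+00E9 into "e" plus a combining accent ...
    decomposed = unicodedata.normalize("NFKD", text)
    # ... and encoding with 'ignore' then discards whatever is still non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters without a decomposition (the bullet U+2022, CJK text, and so on) simply disappear, so this is lossy in a way that should be acceptable for the data before using it.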

Related

Printing all unicode characters in Python

I've written some code to create all 4-digit combinations of the hexadecimal system, and now I'm trying to use that to print out all the unicode characters that are associated with those values. Here's the code I'm using to do this:
char_list =["0","1","2","3","4","5","6","7","8","9","A","B","C","D","E","F"]
pairs = []
all_chars = []
# Construct pairs list
for char1 in char_list:
    for char2 in char_list:
        pairs.append(char1 + char2)
# Create every combination of unicode characters ever
for pair1 in pairs:
    for pair2 in pairs:
        all_chars.append(pair1 + pair2)
# Print all characters
for code in all_chars:
    expression = "u'\u" + code + "'"
    print "{}: {}".format(code,eval(expression))
And here is the error message I'm getting:
Traceback (most recent call last): File "C:\Users\andr7495\Desktop\unifun.py",
line 18, in <module> print "{}: {}".format(code,eval(expression))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0:
ordinal not in range(128)
The exception is thrown when the code tries to print u"\u0080", however, I can do this in the interactive interpreter without a problem.
I've tried casting the results to unicode and specifying to ignore errors, but it's not helping. I feel like I'm missing a basic understanding about how unicode works, but is there anything I can do to get my code to print out all valid unicode expressions?
import sys
for i in xrange(sys.maxunicode):
    print unichr(i)
You're trying to format a Unicode character into a byte string. You can remove the error by using a Unicode string instead:
print u"{}: {}".format(code,eval(expression))
      ^
The other answers are better at simplifying the original problem, however; you're definitely doing things the hard way.
It is likely a problem with your terminal (cmd.exe is notoriously bad at this). Most of the time when you print, you are printing to a terminal, and that ends up trying to do the encoding. If you run your code in IDLE or some other environment that can render unicode, you should see the characters. Also, you should not use eval; try this instead:
for uni_code in range(...):
    print hex(uni_code),unichr(uni_code)
Here's a rewrite of examples in this article that saves the list to a file.
Python 3.x:
import sys
txtfile = "unicode_table.txt"
print("creating file: " + txtfile)
F = open(txtfile, "w", encoding="utf-16", errors='ignore')
for uc in range(sys.maxunicode):
    line = "%s %s" % (hex(uc), chr(uc))
    print(line, file=F)
F.close()
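A possible variation on the Python 3 listing above, sketched as a generator so the lines can be printed or written anywhere. It skips the surrogate range explicitly instead of relying on errors='ignore', and its default upper bound is sys.maxunicode + 1 so the last code point is included. The function name code_point_lines is hypothetical.

```python
import sys


def code_point_lines(start=0, stop=sys.maxunicode + 1):
    """Yield 'U+XXXX <char>' lines for the code points in [start, stop)."""
    for cp in range(start, stop):
        # Surrogates (U+D800..U+DFFF) are not characters and cannot be encoded.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        yield "U+%04X %s" % (cp, chr(cp))
```

Writing the result is then a matter of opening the file with an explicit encoding, for example open(txtfile, "w", encoding="utf-8"), and writing each line followed by a newline.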

output html in terminal error

I'm trying to print out html content in following way:
from lxml import html
import requests
url = 'http://www.amazon.co.uk/Membrane-Protectors-FujiFilm-FinePix-SL1000/dp/B00D2UVI9C/ref=pd_sim_ph_3?ie=UTF8&refRID=06BDVRBE6TT4DNRFWFVQ'
page = requests.get(url)
print page.text
Then I execute python print_url.py > out, and I get the following error:
print page.text UnicodeEncodeError: 'ascii' codec can't encode
character u'\xa3' in position 113525: ordinal not in range(128)
Could anyone give me some idea? I have had this problem before, but I couldn't figure it out.
Thanks
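Your page.text is not in your local encoding. Instead it is probably unicode. To print the contents of page.text you must first encode them in the encoding that stdout expects. In Python 2, sys.stdout.encoding is None when the output is redirected to a file, so give it a fallback:
import sys
print page.text.encode(sys.stdout.encoding or 'utf8')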
The page contains non-ascii unicode characters. You may get this error if you try to print to a shell that doesn't support them, or because you are redirecting the output to a file and Python is assuming an ascii encoding for the output. I specify this because some shells will have no problem while others will (my current shell/terminal defaults to utf8, for instance).
If you want the output to be encoded as utf8, you should explicitly encode it:
print page.text.encode('utf8')
If you want it to be encoded as something the shell can handle or ascii with non-printable characters removed or replaced, use one of these:
print page.text.encode(sys.stdout.encoding or "ascii", 'xmlcharrefreplace') - replace characters that cannot be encoded with numeric entities
print page.text.encode(sys.stdout.encoding or "ascii", 'replace') - replace characters that cannot be encoded with "?"
print page.text.encode(sys.stdout.encoding or "ascii", 'ignore') - replace characters that cannot be encoded with nothing (delete them)
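To make the difference between these error handlers concrete, here is a small sketch that encodes one short string four ways. The sample text is invented; it only needs to contain the pound sign (U+00A3) from the traceback.

```python
text = u"Price: \xa3 5"  # contains U+00A3 (pound sign), as in the traceback

as_entities = text.encode("ascii", "xmlcharrefreplace")  # numeric entity
as_question = text.encode("ascii", "replace")            # question mark
dropped = text.encode("ascii", "ignore")                 # character removed
as_utf8 = text.encode("utf-8")                           # lossless, two bytes
```

Only the UTF-8 result can be decoded back to the original text; the three ascii variants trade information for compatibility.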

Python: UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-39: ordinal not in range(128)

I have a Twitter log data file and I have to sort it to show a ranking of each user's retweeted tweets.
Here's the code.
import codecs
with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()
tweet_list.pop(0)
facul={}
for t in tweet_list:
    t = t.split('\t')
    t[-2] = int(t[-2])
    if t[-2] <= 0:
        continue
    if not t[0] in facul:
        facul[t[0]] = []
    facul[t[0]].append(t)
def cmp_retweet(a,b):
    if a[-2] < b[-2]:
        return 1
    if a[-2] > b[-2]:
        return -1
    return 0
for f in sorted(facul.keys()):
    facul[f].sort(cmp=cmp_retweet)
    print ('[%s]' %(f))
    for t in facul[f][:5]:
        print ('%d:%s:%s' % (t[-2], t[2], t[-1].strip())
Somehow I got an error saying:
print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
UnicodeEncodeError: 'ascii' codec can't encode characters in position
34-39: ordinal not in range(128)
It looks like the Japanese characters can't be encoded. How can I fix this?
I tried to use sys.setdefaultencoding("utf-8") but then I got an error:
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
This is how I tried it:
import codecs
import sys
sys.setdefaultencoding("utf-8")
with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()
p.s. I am using Python version 2.7.5
The basic issue, as you have discovered, is that ASCII cannot represent much of unicode.
So you have to make a choice on how to handle it:
don't display non-ASCII chars
display non-ASCII chars as some other type of representation
The first choice would look like this:
for t in facul[f][:5]:
    print ('%d:%s:%s' % (
        t[-2],
        t[2].encode('ascii', errors='ignore'),
        t[-1].encode('ascii', errors='ignore').strip()
    ))
While the second choice would replace ignore with something like replace, xmlcharrefreplace, or backslashreplace.
Here's the reference.
The error message is giving you two clues: first, the problem is in the statement
print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
Second, the problem is related to an encode operation. If you don't remember what is meant by "encode", now would be a good time to re-read the Unicode HOWTO in the Python 2.7 docs.
It looks like your list t[] contains Unicode strings. The print() statement is emitting byte strings. The conversion of Unicode strings to byte strings is encoding. Because you aren't specifying an encoding, Python is implicitly doing a default encoding. It uses the ascii codec, which cannot handle any accented or non-Latin characters.
Try splitting that print() statement into two parts. First, insert the unicode t[] values into a unicode format string. Note the use of u'' syntax. Second, encode the unicode string to UTF-8 and print.
s = u'%d:%s:%s' %(t[-2], t[2], t[-1].strip())
print s.encode('utf8')
(I haven't tested this change to your code. Let me know if it doesn't work.)
I think sys.setdefaultencoding() is probably a red herring, but I don't know your environment well.
By the way, the statement, as you write it above, has unbalanced parentheses. Did you drop a right parenthesis when you pasted in the code?
print ('%d:%s:%s' %(t[-2], t[2], t[-1].strip())
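For comparison, here is a sketch of the same ranking written for Python 3, where list.sort() no longer accepts cmp= and a key function is used instead. It assumes, as in the question, that column 0 is the user, column 2 the tweet text, and the second-to-last column the retweet count. The function names top_retweets and format_row are hypothetical.

```python
from collections import defaultdict


def top_retweets(rows, limit=5):
    """Group rows by user and keep each user's most-retweeted rows."""
    by_user = defaultdict(list)
    for row in rows:
        count = int(row[-2])
        if count > 0:
            # Store the count as an int so it sorts numerically.
            by_user[row[0]].append(row[:-2] + [count] + row[-1:])
    return {
        user: sorted(tweets, key=lambda t: t[-2], reverse=True)[:limit]
        for user, tweets in by_user.items()
    }


def format_row(t):
    """Format in unicode first; encode only when the text leaves the program."""
    return u"%d:%s:%s" % (t[-2], t[2], t[-1].strip())
```

The formatted lines stay unicode until output, so the only place an encoding has to be chosen is where they are printed or written, for example format_row(t).encode('utf8') as in the answer above.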

Converting Unicode to in python [duplicate]

I'm a very new Python programmer, working on my first script. The script pulls in text from a plist string, then does some things to it, then packages it up as an HTML email.
From a few of the entries, I'm getting the dreaded Unicode "outside ordinal 128" error.
Having read as much as I can find about encoding and decoding, I know that it is important for me to get the encoding right, but I'm having a difficult time understanding when or how exactly to do this.
The offending variable is first pulled in using plistlib, and converted to HTML from markdown, like this:
entry = result['Entry Text']
donotecontent = markdown2.markdown(entry)
Later, it is put in the email like this:
html = donotecontent + '<br /><br />' + var3
part1 = MIMEText(html, 'html')
msg.attach(part1)
My question is: what is the best way for me to make sure that Unicode characters in this content don't cause this to throw an error? I prefer not to ignore the characters.
Sorry for my broken English. I speak Chinese/Japanese and use CJK characters every day.
Ceron's answer solves almost all of this problem, so I won't go over how to use encode()/decode() again.
When we call str() on a unicode object, it encodes the unicode string to byte data; when we call unicode() on a str object, it decodes the byte data to unicode characters.
In both cases the encoding used is the one returned by sys.getdefaultencoding().
By default, sys.getdefaultencoding() returns 'ascii', so an encoding/decoding exception may be thrown during a str()/unicode() conversion.
If you want to do str <-> unicode conversion with str() or unicode(), and also want the implicit encoding/decoding to use 'utf-8', you can execute the following statements:
import sys # sys.setdefaultencoding is cancelled by site.py
reload(sys) # to re-enable sys.setdefaultencoding()
sys.setdefaultencoding('utf-8')
and it will cause later calls to str() and unicode() to convert any basestring object using utf-8.
However, I would prefer to use encode()/decode() explicitly, because it makes code maintenance easier for me.
Assuming you're using Python 2.x, remember: there are two types of strings, str and unicode. str are byte strings, whereas unicode are unicode strings. unicode strings can be used to represent text in any language, but to store text in a computer or to send it via email, you need to represent that text using bytes. To represent text using bytes, you need a coding format. There are many coding formats. Python 2 uses ascii by default, but ascii can only represent a few characters, mostly English letters. If you try to encode a text with other letters using ascii, you will get the famous "outside ordinal 128" error. For example:
>>> u'Cerón'.encode('ascii')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3:
ordinal not in range(128)
The same happens if you use str(u'Cerón'), because Python uses ascii by default to convert unicode to str.
To make this work, you have to use a different coding format. UTF-8 is a coding format that can express any unicode text as bytes. To convert the u'Cerón' unicode string to bytes you have to use:
>>> u'Cerón'.encode('utf-8')
'Cer\xc3\xb3n'
No errors this time.
Now, back to your email problem. I can see that you're using MIMEText, which accepts an already encoded str argument, in your case the html variable. MIMEText also accepts an argument specifying what kind of encoding is being used. So, in your case, if html is a unicode string, you have to encode it as utf-8 and pass the charset parameter too (because MIMEText uses ascii by default):
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
But be careful: if html is already a str instead of unicode, the encoding can fail. This is one of the problems of Python 2.x: it lets you call encode() on an already encoded str, but it first decodes that str implicitly using ascii, which throws an error as soon as the str contains non-ascii bytes.
Another problem to add to the list is that utf-8 is compatible with ascii characters, and Python will always try to automatically encode/decode strings using ascii. If you're not properly encoding your strings, but you only use ascii characters, things will work fine. However, if for some reason a non-ascii character slips into your message, you will get the error, which makes these bugs harder to detect.
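On Python 3 the same idea looks slightly different, because MIMEText takes text (str) together with a charset and encodes the payload itself. The sketch below shows that variant, not the Python 2 call above. It decodes bytes once at the boundary so the caller never depends on an implicit ascii conversion. The helper name html_part is hypothetical.

```python
from email.mime.text import MIMEText


def html_part(html, encoding="utf-8"):
    """Build an HTML MIME part, accepting either bytes or text."""
    # Decode bytes once, at the boundary, instead of relying on implicit ASCII.
    if isinstance(html, bytes):
        html = html.decode(encoding)
    # On Python 3, MIMEText takes text plus a charset and encodes the payload.
    return MIMEText(html, "html", encoding)


part = html_part(u"<p>Cer\xf3n</p>")
```

The resulting part declares its charset in the Content-Type header, so a mail client knows how to decode the body.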
Remember: You can't decode a unicode, and you can't encode a str
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Check out this excellent tutorial

Python unicode problem

I'm receiving some data from a ZODB (Zope Object Database). I receive a mybrains object. Then I do:
o = mybrains.getObject()
and I receive a "Person" object in my project. Then, I can do
b = o.name
and doing print b on my class I get:
José Carlos
and print b.name.__class__
<type 'unicode'>
I have a lot of "Person" objects. They are added to a list.
names = [o.nome, o1.nome, o2.nome]
Then I try to create a text file with this data.
delimiter = ';'
all = delimiter.join(names) + '\n'
No problem. Now, when I do a print all I have:
José Carlos;Jonas;Natália
Juan;John
But when I try to create a file of it:
f = open("/tmp/test.txt", "w")
f.write(all)
I get an error like this (the positions aren't exactly the same, since I changed the names):
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 84: ordinal not in range(128)
If I can already print it in the "correct" form, why can't I write it to a file? Which encode/decode method should I use to write a file with this data?
I'm using Python 2.4.5 (can't upgrade it)
UnicodeEncodeError: 'ascii' codec
write is trying to encode the string using the ascii codec (which doesn't have a way of encoding accented characters like é or à).
Instead use
import codecs
with codecs.open("/tmp/test.txt",'w',encoding='utf-8') as f:
    f.write(all.decode('utf-8'))
or choose some other codec (like cp1252) which can encode the characters in your string.
PS. all.decode('utf-8') was used above because f.write expects a unicode string. Better than using all.decode('utf-8') would be to convert all your strings to unicode early, work in unicode, and encode to a specific encoding like 'utf-8' late -- only when you have to.
PPS. It looks like names might already be a list of unicode strings. In that case, define delimiter to be a unicode string too: delimiter = u';', so all will be a unicode string. Then
with codecs.open("/tmp/test.txt",'w',encoding='utf-8') as f:
    f.write(all)
should work (unless there is some issue with Python 2.4 that I'm not aware of.)
If 'utf-8' does not work, remember to try other encodings that contain the characters you need, and that your computer knows about. On Windows, that might mean 'cp1252'.
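The "try other encodings" advice can be sketched as a small helper that returns the first codec able to represent the whole text. The function name first_working_encoding and the default candidate list are assumptions made for the illustration; the example is written for Python 3, but the same loop works on a unicode object in Python 2.

```python
def first_working_encoding(text, candidates=("ascii", "cp1252", "utf-8")):
    """Return the first codec in `candidates` that can encode all of `text`."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # This codec cannot represent some character; try the next one.
            continue
        return name
    raise ValueError("no candidate encoding can represent the text")
```

Since utf-8 can encode any unicode text, keeping it as the last candidate means the helper only raises when the list has been narrowed on purpose.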
You told Python to print all, but since all has no fixed computer representation, Python first had to convert all to some printable form. Since you didn't tell Python how to do the conversion, it assumed you wanted ASCII. Unfortunately, ASCII can only handle values from 0 to 127, and all contains values out of that range, hence you see an error.
To fix this use:
all = "José Carlos;Jonas;Natália Juan;John"
import codecs
f = codecs.open("/tmp/test.txt", "w", "utf-8")
f.write(all.decode("utf-8"))
f.close()
