UnicodeEncodeError when fetching URL - Python

I have an issue trying to get all the text nodes in an HTML document using lxml: I get UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I try to detect the encoding of this page (encoding = chardet.detect(response)['encoding']), it says it's utf-8. It seems weird that a single page would be both utf-8 and ascii. Actually, this:
fromstring(response).text_content().encode('ascii', 'replace')
solves the problem.
Here is my code:
from lxml.html import fromstring
import urllib2
import chardet

request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()
encoding = chardet.detect(response)['encoding']
print encoding
print fromstring(response).text_content()
Output:
utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)
What can I do to solve this issue? Keep in mind that I want to do this with a few other pages, so I don't want to encode on an individual basis.
UPDATE:
Maybe there is something else going on here. When I run this script in the terminal, I get correct output, but when I run it inside SublimeText, I get the UnicodeEncodeError.
UPDATE2:
It also happens when I write this output to a file. .encode('ascii', 'replace') works, but I'd like a more general solution.
Regards

Can you try wrapping your string with repr()?
This article might help.
print repr(fromstring(response).text_content())

As far as writing out to a file as said in your edit, I would recommend opening the file with the codecs module:
import codecs
output_file = codecs.open('filename.txt','w','utf8')
I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.
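To make the file-writing side concrete, here is a minimal, self-contained sketch (Python 3 syntax; io.open behaves the same way in Python 2, and the filename and sample text are placeholders standing in for text_content()):

```python
import io

text = u'r\xe9sum\xe9'  # sample unicode text containing a non-ASCII character

# Open the file with an explicit encoding, so no implicit ASCII step occurs
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# Reading it back with the same encoding restores the original string
with io.open('output.txt', 'r', encoding='utf-8') as f:
    assert f.read() == u'r\xe9sum\xe9'
```

The key point is that the encoding is declared once, at open time, instead of sprinkling .encode(...) calls around every write.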

Based on your first update, I would say that the terminal tells Python to output utf-8 while SublimeText expects ascii. So I think the solution lies in finding the right settings in SublimeText.
However, if you cannot change what SublimeText expects, it is better to wrap the encode call you already used in a separate function.
import sys

def smartprint(text):
    if sys.stdout.encoding is None:
        print text
    else:
        print text.encode(sys.stdout.encoding, 'replace')
You can use this function instead of print. Keep in mind that your program's output when run in SublimeText differs from the Terminal: because of 'replace', characters that cannot be encoded are substituted with ?, e.g. é will be shown as ? when this code is run in SublimeText.
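The effect of the 'replace' error handler can be checked directly (Python 3 syntax for this sketch):

```python
# 'replace' substitutes unencodable characters with '?'
assert u'caf\xe9'.encode('ascii', 'replace') == b'caf?'

# 'ignore' silently drops them instead
assert u'caf\xe9'.encode('ascii', 'ignore') == b'caf'
```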

Related

Encoding problems in Python - 'ascii' codec can't encode character '\xe3' when using UTF-8

I've created a program to print out some HTML content. My source file is in utf-8, the server's terminal is in utf-8, and I also use:
out = out.encode('utf8')
to make sure the string is utf-8 encoded.
Despite all that, when I use some characters like "ã", "é" in the string out, I get:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 84: ordinal not in range(128)
It seems to me that the print after:
print("Content-Type: text/html; charset=utf-8 \n\n")
is being forced to use ASCII encoding... but I just don't know why this would be the case.
Thanks a lot.
Here is how I solved the encoding problem with Python 3.4.1:
First I inserted this line in the code to check the output encoding:
print(sys.stdout.encoding)
And I saw that the output encoding was:
ANSI_X3.4-1968
which stands for ASCII and doesn't support characters like 'ã', 'é', etc.
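The limitation can be reproduced directly: the ascii codec raises exactly this error for any character above code point 127 (Python 3 syntax for this sketch):

```python
# 'ã' is code point 0xE3, outside range(128), so the ascii codec rejects it
raised = False
try:
    u'\xe3'.encode('ascii')
except UnicodeEncodeError:
    raised = True
assert raised
```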
So I deleted that check and inserted these lines to change the standard output encoding:
import sys
import codecs

if sys.stdout.encoding != 'UTF-8':
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'UTF-8':
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr.buffer, 'strict')
Here is where I found the information:
http://www.macfreek.nl/memory/Encoding_of_Python_stdout
P.S.: everybody says it's not good practice to change the default encoding. I don't really know about that; in my case it has worked fine, but I'm building a very small and simple webapp.
I guess you should open the file as a unicode-aware stream; that way you might not need to encode manually.
import codecs
file = codecs.open('file.html', 'w', 'utf-8')

Python ─ UTF-8 filename from HTML form via CherryPy

Python header: #!/usr/bin/env python
# -*- coding: utf-8 -*-
# image_upload.py
Cherrypy Config: cherrypy.config.update(
{'tools.encode.on': True,
'tools.encode.encoding': 'utf-8',
'tools.decode.on': True,
},)
HTML Header: <head><meta http-equiv="Content-Type"
content="text/html;charset=ISO-8859-1"></head>
""" Python 2.7.3
Cherrypy 3.2.2
Ubuntu 12.04
"""
With an HTML form, I'm uploading an image file to a database. That works so far without problems. However, if the filename is not 100% ASCII, there seems to be no way to retrieve it in UTF-8. This is weird, because with the HTML text input fields it works without problems, from saving through display. Therefore I assume it's an encoding or decoding problem with the web application framework CherryPy, because the upload is handled by it, like here.
How it works:
The HTML form POSTs the uploaded file to another Python function, which receives the file in the standard dictionary **kwargs. From here you get the filename with extension, like this: filename = kwargs['file'].filename. But that already has the wrong encoding. Up to this point the image hasn't been processed, stored, or used in any way.
I'm asking for a solution which would avoid having to parse the filename and change it back "manually". I guess the result is already in UTF-8, which makes it cumbersome to get right. That's why getting CherryPy to do it might be the best way. But maybe it's even an HTML issue, because the file comes from a form.
Here are the wrongly decoded umlauts; what I need is the input returned as the result.
input → result input → result
ä → ä Ä → Ä
ö → ö Ö → Ö
ü → ü Ü → Ãœ
Following are the failed attempts to get the right result, which would be: "Würfel"
NOTE: img_file = kwargs['file']
original attempt:
result = img_file.filename.rsplit('.',1)[0]
result: "Würfel"
change system encoding:
reload(sys)
sys.setdefaultencoding('utf-8')
result: "Würfel"
encoding attempt 1:
result = img_file.filename.rsplit('.',1)[0].encode('utf-8')
result: "Würfel"
encoding attempt 2:
result = unicode(img_file.filename.rsplit('.',1)[0], 'utf-8')
Error Message:
TypeError: decoding Unicode is not supported
decoding attempt:
result = img_file.filename.rsplit('.',1)[0].decode('utf-8')
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
cast attempt:
result = str(img_file.filename.rsplit('.',1)[0])
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
Trying with your string it seems I can get the filename using latin1 encoding.
>>> s = u'W\xc3\xbcrfel.jpg'
>>> print s.encode('latin1')
Würfel.jpg
>>>
You simply need to use that .encode('latin1') before splitting.
But the problem here is broader: you really need to figure out why your web encoding is latin1 instead of utf8. I don't know CherryPy, but try to ensure you use utf8 everywhere, or you could run into other glitches when serving your application through a webserver like Apache or nginx.
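The round trip behind that suggestion can be sketched as follows (Python 3 syntax; in Python 3 the mis-decoded filename arrives as a str, so the repair is encode-as-latin-1, then decode-as-utf-8):

```python
# The filename as it arrives: UTF-8 bytes that were mis-decoded as Latin-1
mojibake = u'W\xc3\xbcrfel.jpg'

# Reverse the bad decode, then decode the raw bytes correctly
fixed = mojibake.encode('latin-1').decode('utf-8')
assert fixed == u'W\xfcrfel.jpg'  # 'Würfel.jpg'
```

This works because Latin-1 maps each of the first 256 code points to the identical byte value, so encoding with it recovers the original byte stream exactly.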
The problem is that you serve your HTML with charset ISO-8859-1; this confuses browsers, and they use that charset when sending data back to the server. Serve all your HTML always as UTF-8, code in UTF-8, and set your terminal to UTF-8, and you shouldn't have problems.
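Concretely, that means changing the meta tag shown in the question's HTML header from ISO-8859-1 to UTF-8, e.g.:

```html
<head><meta http-equiv="Content-Type"
    content="text/html;charset=UTF-8"></head>
```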

Python Unicode Encode Error ordinal not in range(128) with Euro Sign

I have to read an XML file in Python and grab various things, and I ran into a frustrating error with Unicode Encode Error that I couldn't figure out even with googling.
Here are snippets of my code:
#!/usr/bin/python
# coding: utf-8
from xml.dom.minidom import parseString

with open('data.txt', 'w') as fout:
    # do a lot of stuff
    nameObj = data.getElementsByTagName('name')[0]
    name = nameObj.childNodes[0].nodeValue
    # ... do more stuff
    fout.write(','.join((name, other_fields)))  # other_fields stands in for a bunch of other stuff
This spectacularly crashes when a name entry I am parsing contains a Euro sign. Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 60: ordinal not in range(128)
I understand why the Euro sign screws it up (because it's beyond 128, right?), but I thought doing # coding: utf-8 would fix that. I also tried adding .encode('utf-8') so that the name line looks instead like
name = nameObj.childNodes[0].nodeValue.encode('utf-8')
But that doesn't work either. What am I doing wrong? (I am using Python 2.7.3 if anyone wants to know)
EDIT: Python crashes out on the fout.write() line -- it will go through fine where the name field is like:
<name>United States, USD</name>
But will crap out on name fields like:
<name>France, € </name>
When you open a file in Python 2 using the built-in open function, text is written out with the default ascii codec. To write in another encoding you have to use codecs:
import codecs
fout = codecs.open('data.txt','w','utf-8')
It looks like you're getting Unicode data from your XML parser, but you're not encoding it before writing it out. You can explicitly encode the result before writing it out to the file:
text = ",".join(stuff) # this will be unicode if any value in stuff is unicode
encoded = text.encode("utf-8") # or use whatever encoding you prefer
fout.write(encoded)

Python Decoding Unicode files with 'ÆØÅ'

I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file it's stored as "d√∏r".
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message is "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this gives you the text as unicode objects; if everything else in your program is str, you're going to run into compatibility issues. Either convert everything to unicode, or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break for anything outside the Latin-1 codepage. It may be easier to convert your program to use str in utf-8 encoding internally.
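As an aside, the exact mojibake in the question ("d√∏r") is what you get when the UTF-8 bytes are displayed through the Mac Roman codepage, where 0xC3 is √ and 0xB8 is ∏. The round trip can be reproduced (Python 3 syntax for this sketch):

```python
# UTF-8 bytes of 'dør' (64 C3 B8 72) shown through the Mac Roman codepage
mojibake = u'd\u221a\u220fr'  # 'd√∏r'

# Reverse the bad decode, then decode the raw bytes as UTF-8
fixed = mojibake.encode('mac_roman').decode('utf-8')
assert fixed == u'd\xf8r'  # 'dør'
```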

Encoding issue when trying to scrape a page

I'm using BeautifulSoup to scrape a page that has ISO-8859-1 encoding, however I've run into a little hiccup.
I have a line that reads:
logging.info("Processing [%s]" % (link))
The variable link is one of the values scraped from beautifulsoup. It is a Unicode string and I can print it by typing print link. It shows up on the console exactly the way it was scraped but the line above throws this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
I've been reading up on Unicode, but I can't figure out why it's able to print the string but can't log it.
The string in question is this:
booba-concert-à-bercy
Any ideas on where I'm mucking this up?
Thank you.
logging doesn't like unicode; pass it bytes.
logging.info("Processing [%s]" % (link.encode('utf-8')))
I managed to solve this by adding a file called sitecustomize.py in my Python/Lib/site-packages directory. This file contains two lines: import sys and sys.setdefaultencoding('utf-8').
The default encoding prior to that was ascii, hence the issues. Now I don't need to specify an explicit encoding for the link variable, as it uses the default encoding, i.e. utf-8.
Of course, I'll never see the characters properly until my terminal is in the same encoding, but that won't break my code.
