I'm trying to print out HTML content in the following way:
from lxml import html
import requests
url = 'http://www.amazon.co.uk/Membrane-Protectors-FujiFilm-FinePix-SL1000/dp/B00D2UVI9C/ref=pd_sim_ph_3?ie=UTF8&refRID=06BDVRBE6TT4DNRFWFVQ'
page = requests.get(url)
print page.text
Then I execute python print_url.py > out, and I get the following error:
    print page.text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 113525: ordinal not in range(128)
Could anyone give me some idea? I have had this problem before, but I couldn't figure it out.
Thanks
Your page.text is not in your local encoding; it is a unicode string. To print the contents of page.text you must first encode them in the encoding that stdout expects. In Python 2, sys.stdout.encoding is None when output is redirected to a file, so supply a fallback:
import sys
print page.text.encode(sys.stdout.encoding or 'utf-8')
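The same idea can be wrapped in a small helper. This is only a sketch: the name encode_for_stream and the two stand-in stream classes are hypothetical, the utf-8 fallback is an assumption about what the output file should contain, and the 'replace' handler is a choice made here so that the call never raises.

```python
def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text for a stream, using fallback when no encoding is reported."""
    # A redirected stdout in Python 2 reports encoding=None.
    encoding = getattr(stream, "encoding", None) or fallback
    # 'replace' substitutes '?' for anything the target encoding lacks.
    return text.encode(encoding, "replace")


class RedirectedStream(object):
    """Hypothetical stand-in for stdout redirected to a file."""
    encoding = None


class AsciiTerminal(object):
    """Hypothetical stand-in for a terminal that only accepts ASCII."""
    encoding = "ascii"


to_file = encode_for_stream(u"\xa3100", RedirectedStream())
to_terminal = encode_for_stream(u"\xa3100", AsciiTerminal())
```

With a real program you would pass sys.stdout as the stream and write the returned bytes out.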
The page contains non-ASCII unicode characters. You may get this error if you try to print to a shell that doesn't support them, or because you are redirecting the output to a file and Python assumes an ASCII encoding for the output. I mention this because some shells will have no problem while others will (my current shell/terminal defaults to utf8, for instance).
If you want the output to be encoded as utf8, you should explicitly encode it:
print page.text.encode('utf8')
If you want it to be encoded as something the shell can handle, or as ASCII with the unencodable characters removed or replaced, use one of these:
print page.text.encode(sys.stdout.encoding or "ascii", 'xmlcharrefreplace') - replace unencodable characters with numeric entities
print page.text.encode(sys.stdout.encoding or "ascii", 'replace') - replace unencodable characters with "?"
print page.text.encode(sys.stdout.encoding or "ascii", 'ignore') - replace unencodable characters with nothing (delete them)
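To see what each error handler does, here is a minimal sketch on a short made-up string containing the same pound sign (u'\xa3') as in the error message. The sample text is an assumption for illustration; the handler names are the standard ones accepted by encode().

```python
text = u"Price: \xa35"  # made-up sample containing a pound sign

# Numeric character reference, useful when the output is HTML/XML.
as_entities = text.encode("ascii", "xmlcharrefreplace")

# A question mark in place of each unencodable character.
as_question = text.encode("ascii", "replace")

# Unencodable characters silently dropped.
as_dropped = text.encode("ascii", "ignore")

# UTF-8 can represent every character, so nothing is lost.
as_utf8 = text.encode("utf-8")
```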
I'm sanitizing a pandas dataframe and have encountered a unicode string containing a backslash followed by a u that I need to replace, e.g.
u'\u2014'.replace('\u','')
Result: u'\u2014'
I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.
pandas code
merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)
Error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)
u'\u2014' is actually —. It's not a number; it's a single Unicode character (EM DASH). Try printing it and you will see.
This is the output in ipython:
In [4]: print("val = ", u'\u2014')
val = —
Based on your comment, here is what you are doing wrong:
"-" is not the same as the "EM DASH" Unicode character (u'\u2014').
So you should do the following:
print(u'\u2014'.replace("\u2014",""))
and that will work
EDIT:
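Since you are using Python 2.x, both arguments to replace must be unicode literals (note the u prefix). A plain '\u2014' is a byte string holding a backslash, a u and four digits, so it never matches the dash:
u'\u2014'.replace(u'\u2014', u'')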
Yes, because u'\u2014' is an escape sequence for one Unicode character, not the literal text '\u' followed by '2014'. There is no '\u' substring in it to replace.
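The difference between the escape sequence and the literal text is easy to check. A minimal sketch (the variable names and the sample string are made up for illustration):

```python
em_dash = u"\u2014"    # one character: EM DASH
literal = u"\\u2014"   # six characters: backslash, u, 2, 0, 1, 4

em_dash_length = len(em_dash)
literal_length = len(literal)

# Replacing the character itself works; there is no backslash to search for.
cleaned = u"1994\u20141995".replace(em_dash, u"-")
has_backslash = u"\\" in em_dash
```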
Things that can help:
Converting to ascii using .encode('ascii', 'ignore')
As you are using pandas, you can use 'encoding' parameter and pass 'ascii' there.
Do this instead : u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
Hope this helps.
I am very new to Python. Please help me fix this issue.
I am trying to get the revenue from the link below :
https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898
I am using below commands :
import re
import urllib.request
data=urllib.request.urlopen(url).read()
data1=data.decode("utf-8")
Issue :
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte
Maybe better with requests:
import requests
url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text
The result of downloading the specific URL given in the question, is HTML code. I was able to use BeautifulSoup to scrape the page after using the following Python code to get the data:
import requests
url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"
response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")
print (data)
Please note that I used Python3 in my code example. The syntax for print() may vary a little.
0xa0 or in unicode notation U+00A0 is the character NO-BREAK SPACE. In UTF8 it is represented as b'\xc2\xa0'. If you find it as a raw byte it probably means that your input is not UTF8 encoded but Latin1 encoded.
A quick look on the linked page shows that it is indeed latin1 encoded - but I got a french version...
The rule when you are not sure of the exact encoding is to use the replace error handler:
data1=data.decode("utf-8", errors="replace")
Then all offending bytes are replaced with the REPLACEMENT CHARACTER (U+FFFD, displayed as �). If only a few are found, the page contains some erroneous characters; if almost all non-ASCII characters are replaced, the encoding is not UTF8. It is commonly Latin1 for west European languages, but your mileage may vary for other languages.
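That rule of thumb can be sketched as follows. The helper names decode_with_fallback and count_replacements are hypothetical, and the sample bytes are made up to mimic a Latin1 no-break space. Trying Latin1 second is an assumption that suits west European pages only.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached when latin-1 is not in the list (latin-1 accepts any byte):
    # keep going with U+FFFD markers instead of failing.
    return data.decode(encodings[0], "replace"), encodings[0]


def count_replacements(data, encoding="utf-8"):
    """Count how many U+FFFD markers a lenient decode produces."""
    return data.decode(encoding, "replace").count(u"\ufffd")


latin1_sample = b"price:\xa0100"      # raw 0xa0, as in the error above
utf8_sample = b"price:\xc2\xa0100"    # the same text, UTF-8 encoded

text_a, used_a = decode_with_fallback(latin1_sample)
text_b, used_b = decode_with_fallback(utf8_sample)
bad_bytes = count_replacements(latin1_sample)
```

Both samples decode to the same text; only the encoding that succeeded differs.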
I'm writing a Python program using BeautifulSoup4, and when I fetch an HTML element that contains a stylized quotation mark u'\u2019' I am able to print out the whole element like so:
Code:
print "Using song: %s" % (song_link)
Result:
Using song: Cups (Pitch Perfect’s “When I’m Gone”)
But then when I try to print out just the text of that element, it fails:
Code:
print "Song text: %s" % (song_link.text)
Result:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 30: ordinal not in range(128)
Why is this happening? Why does this work one moment and then not work the next? It is reproducible.
The output of your first case is a byte string. The output of your second case is a Unicode string. Unicode strings are implicitly encoded to the terminal encoding, or ascii if the terminal encoding could not be determined, which results in your error.
Not knowing your environment, you need to determine why printing Unicode strings defaults to encoding in ascii, or explicitly encode the string yourself with .encode('utf8').
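The implicit encoding step can be reproduced without a terminal. This is a sketch only: io.TextIOWrapper over an in-memory buffer stands in for stdout, and the helper name write_as and the shortened song title are made up for illustration.

```python
import io


def write_as(text, encoding):
    """Write text through a stream with the given encoding; return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)   # the text is encoded here, as print does for stdout
    stream.flush()
    data = raw.getvalue()
    stream.detach()      # release the buffer without closing it
    return data


song = u"Pitch Perfect\u2019s"

utf8_bytes = write_as(song, "utf-8")

try:
    write_as(song, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # Same failure as in the question: u'\u2019' has no ASCII form.
    ascii_failed = True
```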
I use python 2.7 and I'm receiving a string from a server (not in unicode!).
Inside that string I find text with unicode escape sequences. For example like this:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uxxxx sequences back to utf-8? The answers I found either dealt with &# or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.
Edit:
<\a> is a typo but I want a tolerance against such typos as well. There should only be reaction to \u
The example text is meant in proper python syntax like this:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
The desired output is in proper python syntax
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
Try
>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'
And then you can encode to utf8 as usual.
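Put together, the full round trip looks like the sketch below, written with a bytes literal so it also runs on Python 3. It also shows, as a contrast, that the related unicode_escape codec interprets other backslash escapes too, so the stray \a becomes a bell character there.

```python
raw = b'<a href = "http://www.mypage.com/\\u0441andmoretext">\\u00b2<\\a>'

# raw_unicode_escape only reacts to \uXXXX (and \UXXXXXXXX) sequences.
text = raw.decode("raw_unicode_escape")
utf8 = text.encode("utf-8")

# unicode_escape also turns \a into the BEL control character (U+0007).
with_bell = raw.decode("unicode_escape")
```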
Python does contain some special string codecs for cases like this.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python, on which your program should perform all textual operations. Note that this codec interprets every backslash escape, not only \uXXXX, so a stray \a is turned into a bell character.
Whenever you output that text again, you convert it to utf-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences you have to:
decode the original string using utf-8
encode back to latin1
decode using "unicode_escape"
work on the text
encode back to utf-8
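As an alternative to the steps above (a different technique, not the codec route), a regular expression can expand only the \uXXXX sequences after a normal utf-8 decode. This is a sketch under assumptions: the helper name expand_u_escapes is hypothetical, the input is assumed to be valid utf-8, and surrogate pairs written as two \uXXXX escapes are not combined.

```python
import re

try:
    unichr          # Python 2
except NameError:
    unichr = chr    # Python 3

# Matches a backslash, a 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def expand_u_escapes(data, encoding="utf-8"):
    """Decode bytes, then replace each \\uXXXX sequence with its character."""
    text = data.decode(encoding)
    return _U_ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)


mixed = b"caf\xc3\xa9 \\u0441 \\u00b2 <\\a>"   # made-up sample input
expanded = expand_u_escapes(mixed)
as_utf8 = expanded.encode("utf-8")
```

Other backslash sequences such as \a are left untouched, which matches the tolerance asked for in the question.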
I am trying to parse this document with Python and BeautifulSoup:
http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine
The seventh Item down has this Text tag:
Rage Against the Machine's 1994–1995 Tour
When I try to print out the text "Rage Against the Machine's 1994–1995 Tour", python is giving me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)
I can resolve it by simply replacing u'\u2013' with '-' like so:
itemText = itemText.replace(u'\u2013', '-')
However what about every character that I have not coded for? I do not want to ignore them nor do I want to list out every possible find and replace.
Surely a library must exist that tries its very best to detect the encoding from a list of common known encodings (however likely it is to get it wrong).
someText = getTextWithUnknownEncoding(someLocation);
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)
Thank you
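Encoding it as UTF-8 before printing should work (itemText is already a unicode string, so calling decode on it would raise the same error):
itemText = itemText.encode('utf-8')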
Normally, you should try to preserve characters as unicode or utf-8. Avoid converting characters to your local codepage, as this results in loss of information.
However, if you must, here are a few things to do. Let's use your example character:
>>> s = u'\u2013'
If you want to print the string e.g. for debugging, you can use repr:
>>> print(repr(s))
u'\u2013'
In an interactive session, you can just type the variable name to achieve the same result:
>>> s
u'\u2013'
If you really want to convert the text to your local codepage, and it is OK that characters outside this codepage are converted to '?', you can use this:
>>> s.encode('latin-1', 'replace')
'?'
If '?' is not good enough, you can use translate to convert selected characters into an equivalent character as in this answer.
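A best-effort ASCII conversion along those lines might look like the sketch below. The mapping table and the helper name to_ascii_best_effort are assumptions chosen for illustration, not a complete list. After the explicit mapping, NFKD normalization splits accented letters into a base letter plus combining marks, and the 'ignore' handler drops whatever still has no ASCII form.

```python
import unicodedata

# Hypothetical, deliberately small table of punctuation look-alikes.
PUNCTUATION_MAP = {
    0x2013: u"-",   # EN DASH
    0x2014: u"-",   # EM DASH
    0x2018: u"'",   # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',   # RIGHT DOUBLE QUOTATION MARK
}


def to_ascii_best_effort(text):
    """Map known punctuation, strip accents, drop anything else non-ASCII."""
    text = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


title = to_ascii_best_effort(u"Rage Against the Machine\u2019s 1994\u20131995 Tour")
word = to_ascii_best_effort(u"caf\xe9")
```

This loses information by design, so it is best kept for display or logging rather than for stored data.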
You may need to explicitly declare your encoding.
On the first line of your file (or after the hashbang, if there is one), add the following line:
# -*- coding: utf-8 -*-
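This 'magic comment' tells Python that the source file itself is UTF-8 encoded, so non-ASCII characters in string literals are decoded correctly. It does not change how data read at runtime is decoded or how output is encoded.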
More details: http://www.python.org/dev/peps/pep-0263/