Python - decode('utf-8') issue

I am very new to Python. Please help me fix this issue.
I am trying to get the revenue from the link below :
https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898
I am using the commands below:
import re
import urllib.request
url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")
Issue:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte

Maybe better with requests:
import requests
url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"  # tell requests to decode the body as UTF-8
data = req.text

The result of downloading the specific URL given in the question is HTML code. I was able to scrape the page with BeautifulSoup after using the following Python code to get the data:
import requests
url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"
response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")
print(data)
Please note that I used Python 3 in my code example; the syntax for print() differs slightly in Python 2.

0xa0, or U+00A0 in Unicode notation, is the character NO-BREAK SPACE. In UTF-8 it is represented as b'\xc2\xa0'. If you find it as a raw byte, it probably means that your input is not UTF-8 encoded but Latin-1 encoded.
A quick look at the linked page shows that it is indeed Latin-1 encoded (though I got a French version of it).
The rule when you are not sure of the exact conversion is to use the replace error handler:
data1=data.decode("utf-8", errors="replace")
Then all offending characters are replaced with the REPLACEMENT CHARACTER (U+FFFD, displayed as �). If only a few are found, the page merely contains some erroneous characters; but if almost all non-ASCII characters are replaced, the encoding is not UTF-8. It is commonly Latin-1 for West European languages, but your mileage may vary for other languages.
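A minimal sketch of that diagnostic, with a made-up byte string:
raw = b'Revenue:\xa012 billion'  # 0xa0 is valid Latin-1 but an invalid UTF-8 start byte
text = raw.decode('utf-8', errors='replace')
print(text)  # 'Revenue:�12 billion'
if '\ufffd' in text:  # replacement characters found, so try Latin-1
    print(raw.decode('latin1'))  # decodes cleanly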

Related

Python 2.7, Requests library, can't get unicode

The documentation for the Requests library says that requests.get() always returns Unicode. But when I check what encoding was detected, I see "windows-1251". That's a problem. When I try to get requests.get(url).text, there's an error, because the URL's content contains Cyrillic symbols.
import requests
url = 'https://www.weblancer.net/jobs/'
r = requests.get(url)
print r.encoding
print r.text
I got something like this:
windows-1251
UnicodeEncodeError: 'ascii' codec can't encode characters in position 256-263: ordinal not in range(128)
Is this a problem with Python 2.7, or is there no problem at all?
Help me.
From the docs:
Requests will automatically decode content from the server. Most
unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers.
requests.get().encoding is telling you the encoding that was used to convert the byte stream from the server into the Unicode text in the response.
In your case it is correct: the headers in the response say that the character set is windows-1251.
The error you are having happens after that: the Python you are using is trying to encode the Unicode into ASCII to print it, and failing.
You can say print r.text.encode(r.encoding) ... which gives the same result as Padraic's suggestion in the comments, that is, r.content.
Note:
requests.get().encoding is a writable attribute: you can set it to whatever you want if requests guessed wrongly.
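For example, reusing the URL from the question:
import requests
r = requests.get('https://www.weblancer.net/jobs/')
print r.encoding  # e.g. 'windows-1251', guessed from the headers
r.encoding = 'utf-8'  # override the guess if it is wrong
text = r.text  # r.text is re-decoded using the new encoding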

Fetching URL and converting to UTF-8 Python

I would like to do my first project in Python, but I have a problem with encoding. When I fetch data it shows encoded bytes instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))
I know the code here is poor, but I tried many options to change the encoding. I wrote z = "ł" to check whether it can print any 'special' letter, and it shows up. I tried to encode it, and that also works as it should. sys.stdin.encoding shows cp852.
The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first before you had text:
test = page.read().decode('utf8')
However, it is up to the server to tell you how the data is encoded. Check for a character set in the headers:
encoding = page.info().get_param('charset')
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
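A short sketch of that header check with Python 3's urllib; get_content_charset() returns None when the server declared no charset:
import urllib.request
page = urllib.request.urlopen("http://olx.pl/")
raw = page.read()
encoding = page.headers.get_content_charset() or 'utf-8'  # fall back to UTF-8
print(raw.decode(encoding, errors='replace'))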
urlopen is returning a bytes object to you. That means it's a raw, encoded stream of bytes. Python 3 prints that in repr format, which uses escape codes for non-ASCII characters. To get the actual Unicode text you have to decode it. The right way to do that would be to inspect the headers and look for the encoding declaration. But for this we can assume UTF-8, and you can simply decode it as such, not encode it.
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test.decode("utf-8")) # <- note change
Now, Python 3 defaults to UTF-8 source encoding, so you can embed non-ASCII literals like this if your editor supports Unicode and saves the file as UTF-8.
z = "ł"
print(z)
Printing it will only work if your terminal supports UTF-8 encoding. On Linux and OSX they do, so this is not a problem there.
The others are correct, but I'd like to offer a simpler solution. Use requests. It's 3rd party, so you'll need to install it via pip:
pip install requests
But it's a lot simpler to use than the urllib libraries. For your particular case, it handles the decoding for you out of the box:
import requests
r = requests.get("http://olx.pl/")
print(r.encoding)
# UTF-8
print(type(r.text))
# <class 'str'>
print(r.text)
# The HTML
Breakdown:
get sends an HTTP GET request to the server and returns the response.
We print the encoding requests thinks the text is in. It chooses this based on the response header Martijn mentions.
We show that r.text is already a decoded text type (unicode in Python 2, str in Python 3).
Then we actually print the response.
Note that we don't have to print the encoding or type; I've just done so for diagnostic purposes to show what requests is doing. requests is designed to simplify a lot of other details of working with HTTP requests, and it does a good job of it.

UnicodeEncodeError: handling special characters

I am trying to scrape a web page. To take care of all characters other than ASCII, I have written this code:
mydata = ''.join([i if ord(i) < 128 else ' ' for i in response.text])
and processed it further using the Beautiful Soup Python library. Now this is not handling some special characters that are on the web page, like [tick] and [star] (I can't show a picture here).
Any clue on how to escape these characters and replace them with a space?
Right now I have this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 62: character maps to <undefined>
It's always preferable to process everything in Unicode, and convert to any specific encoding only before storage or transfer. For example,
s = u"Hi, привет, ciao"
> s
u'Hi, \u043f\u0440\u0438\u0432\u0435\u0442, ciao'
> s.encode('ascii', 'ignore')
'Hi, , ciao'
> s.encode('ascii', 'replace')
'Hi, ??????, ciao'
If you need to replace non-ascii chars specifically with spaces, you can write and register your own conversion error handler, see codecs.register_error().
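A sketch of such a handler (the name 'space_replace' is made up here):
import codecs
def space_replace(error):
    # replace the whole run of unencodable characters with spaces, then resume
    return (u' ' * (error.end - error.start), error.end)
codecs.register_error('space_replace', space_replace)
print u'Hi, \u2713\u2605, ciao'.encode('ascii', 'space_replace')
# Hi,   , ciao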
fp = open("output.txt","w")
gives you a file open for writing text using the default encoding, which in your case is an encoding that doesn't have the character ✓ (probably cp1252), hence the error. Open the file with an encoding that supports it and you'll be fine:
fp = open('output.txt', 'w', encoding='utf-8')
Note also that:
print("result: "+ str(ele))
can fail if your console doesn't support Unicode, which under Windows it likely will not. Use print(ascii(...)) to get an ASCII-safe representation for debugging purposes.
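For instance:
s = 'result: \u2713 done'
print(ascii(s))  # prints 'result: \u2713 done' with the check mark escaped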
The probable reason your attempt to get rid of non-ASCII characters fails is that you are removing them before parsing the HTML, rather than from the values you get after parsing. So a literal ✓ would be removed, but if a character reference like &#10003; were used, it would be left alone, get parsed by bs4, and end up as ✓.
(I am sad that the default reaction to Unicode errors always seems to be to try to get rid of non-ASCII characters completely, instead of fixing the code to handle them correctly.)
You're also extracting text in a pretty weird way, using str() to get markup and then trying to pick out the tags and remove them. This is unreliable—HTML is not that straightforward to parse, which is why BeautifulSoup is a thing—and needless because you already have a perfectly good HTML parser that can give you the pure text in an element (get_text()).
Most of your code is not necessary. requests is already doing the correct decoding for you, BeautifulSoup is doing the text extraction for you, and Python is doing the correct encoding for you when writing to a file:
import requests
from bs4 import BeautifulSoup
#keyterm = input("Enter a keyword to search:")
URL = 'https://www.google.com/search?q=jaguar&num=30'
#NO_OF_LINKS_TO_BE_EXTRACTED = 10
print("Requesting data from %s" % URL)
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly
#print(soup.prettify())
metaM = soup.findAll("span", "st")
#metaM = soup.find("div", { "class" : "f slp" })
with open("output.txt", "w", encoding='utf8') as fp:
    for ele in metaM:
        print("result: %r" % ele)
        fp.write(ele.get_text().replace('\n', ' ') + '\n')

output html in terminal error

I'm trying to print out HTML content in the following way:
from lxml import html
import requests
url = 'http://www.amazon.co.uk/Membrane-Protectors-FujiFilm-FinePix-SL1000/dp/B00D2UVI9C/ref=pd_sim_ph_3?ie=UTF8&refRID=06BDVRBE6TT4DNRFWFVQ'
page = requests.get(url)
print page.text
Then I execute python print_url.py > out, and I get the following error:
print page.text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 113525: ordinal not in range(128)
Could anyone give me some idea? I have had this problem before, but I couldn't figure it out.
Thanks
Your page.text is not in your local encoding; instead it is probably Unicode. To print the contents of page.text you must first encode them in the encoding that stdout expects:
import sys
print page.text.encode(sys.stdout.encoding)
The page contains non-ASCII Unicode characters. You may get this error if you try to print to a shell that doesn't support them, or because you are redirecting the output to a file and an ASCII encoding is assumed for the output. I mention this because some shells will have no problem while others will (my current shell/terminal defaults to utf8, for instance).
If you want the output to be encoded as utf8, you should explicitly encode it:
print page.text.encode('utf8')
If you want it to be encoded as something the shell can handle, or as ASCII with unencodable characters removed or replaced, use one of these:
print page.text.encode(sys.stdout.encoding or "ascii", 'xmlcharrefreplace') - replace nonprintable characters with numeric entities
print page.text.encode(sys.stdout.encoding or "ascii", 'replace') - replace nonprintable characters with "?"
print page.text.encode(sys.stdout.encoding or "ascii", 'ignore') - replace nonprintable characters with nothing (delete them)

Download html without Python unicode errors

I am trying to download page_source to a file. However, every time I get a:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 (or something else) in position 8304: ordinal not in range(128)
I've tried using value.encode('utf-8'), but it seems every time it throws the same exception (in addition to manually trying to replace every non-ascii character). Is there a way to 'pre-process' the html to put it into a 'write-able' format?
There are third party libraries such as BeautifulSoup and lxml that can deal with encoding issues automatically. But here's a crude example using just urllib2:
First, download some web page containing non-ASCII characters:
>>> import urllib2
>>> response = urllib2.urlopen('http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')
>>> data = response.read()
Now have a look for the "charset" at the top of the page:
>>> data[:200]
'<html>\n<head>\n<title>Unicode 2.0 test page</title>\n<meta content="text/html; charset=UTF-8" http-equiv="Content-type"/>\n</head>\n<body>\n<p>This page contains characters from each of the Unicode\ncharact'
If there was no obvious charset, "UTF-8" is usually a good guess, anyway.
Finally, convert the webpage to unicode text:
>>> text = data.decode('utf-8')
I am not sure, but BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) has a function .prettify() that returns well-formatted HTML. You could try using that for "preprocessing".
The problem is probably that you're trying to go str -> utf-8, when you need to go str -> unicode -> utf-8. In other words, try unicode(s, 'utf-8').encode('utf-8').
See http://farmdev.com/talks/unicode/ for more info.
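A minimal Python 2 sketch of that str -> unicode -> utf-8 round trip (the byte string and file name are made up):
raw = 'caf\xc3\xa9'  # UTF-8 encoded bytes for u'café'
text = unicode(raw, 'utf-8')  # str -> unicode
with open('page.html', 'w') as f:
    f.write(text.encode('utf-8'))  # unicode -> utf-8 bytes, safe to write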
