Download html without Python unicode errors - python

I am trying to download page_source to a file. However, every time I get a:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 (or something else) in
position 8304: ordinal not in range(128)
I've tried using value.encode('utf-8'), but it throws the same exception every time (I've also tried manually replacing every non-ASCII character). Is there a way to 'pre-process' the HTML to put it into a 'write-able' format?

There are third-party libraries such as BeautifulSoup and lxml that can deal with encoding issues automatically. But here's a crude example using just urllib2:
First download some webpage containing non-ascii characters:
>>> import urllib2
>>> response = urllib2.urlopen('http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')
>>> data = response.read()
Now look for the "charset" declaration near the top of the page:
>>> data[:200]
'<html>\n<head>\n<title>Unicode 2.0 test page</title>\n<meta
content="text/html; charset=UTF-8" http-equiv="Content-type"/>\n
</head>\n<body>\n<p>This page contains characters from each of the
Unicode\ncharact'
If there is no obvious charset, "UTF-8" is usually a good guess anyway.
Finally, convert the webpage to unicode text:
>>> text = data.decode('utf-8')
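And to answer the original question: once you have unicode text, encode it on the way into the file so the write never trips over ASCII. A minimal sketch ('page.html' is just an example filename):
import io
# io.open (Python 2.6+) takes an encoding; write() then accepts
# unicode directly and encodes it to UTF-8 bytes for you
with io.open('page.html', 'w', encoding='utf-8') as f:
    f.write(text)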

I am not sure, but BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) has a .prettify() method that returns well-formatted HTML. You could try using that for "preprocessing".

The problem is probably that you're trying to go str -> utf-8, when you need to go str -> unicode -> utf-8. In other words, try unicode(s, 'utf-8').encode('utf-8').
See http://farmdev.com/talks/unicode/ for more info.
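A quick interactive sketch of that round trip, assuming the bytes really are UTF-8:
>>> s = '\xc2\xa0'             # raw UTF-8 bytes for U+00A0 (no-break space)
>>> u = unicode(s, 'utf-8')    # str -> unicode
>>> u
u'\xa0'
>>> u.encode('utf-8')          # unicode -> utf-8 bytes, safe to write out
'\xc2\xa0'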

Related

Python - decode ('utf-8') issue

I am very new to Python. Please help me fix this issue.
I am trying to get the revenue from the link below:
https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898
I am using the commands below:
import re
import urllib.request
data=urllib.request.urlopen(url).read()
data1=data.decode("utf-8")
Issue :
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position
10798: invalid start byte
It may be better to use requests:
import requests
url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text
The result of downloading the specific URL given in the question is HTML code. I was able to use BeautifulSoup to scrape the page after using the following Python code to get the data:
import requests
url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"
response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")
print(data)
Please note that I used Python 3 in my code example. The print() syntax differs slightly between Python 2 and 3.
0xa0, or in Unicode notation U+00A0, is the character NO-BREAK SPACE. In UTF-8 it is represented as b'\xc2\xa0'. If you find it as a raw byte, it probably means that your input is not UTF-8 encoded but Latin1 encoded.
A quick look at the linked page shows that it is indeed Latin1 encoded - but I got a French version...
The rule when you are not sure of the exact conversion is to use the replace error handling:
data1=data.decode("utf-8", errors="replace")
then all offending bytes are replaced with the REPLACEMENT CHARACTER (U+FFFD) (displayed as �). If only a few are found, the page contains some erroneous characters; but if almost all non-ASCII characters are replaced, the encoding is not UTF-8. It is commonly Latin1 for west European languages, but your mileage may vary for other languages.
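If you would rather automate that decision than eyeball the output, one common pattern (a sketch, not part of the original answer) is to try strict UTF-8 first and fall back to Latin1, which can decode any byte sequence:
def to_text(data):
    # Strict UTF-8 first; if the page lied about its charset,
    # fall back to Latin1, which maps every byte to a codepoint.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin1")

data1 = to_text(data)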

Python 2.7, Requests library, can't get unicode

The documentation for the Requests library says that requests.get() always returns unicode. But when I try to find out what encoding was returned, I see "windows-1251". That's a problem. When I try to get requests.get(url).text, there's an error, because the URL's content contains Cyrillic characters.
import requests
url = 'https://www.weblancer.net/jobs/'
r = requests.get(url)
print r.encoding
print r.text
I got something like this:
windows-1251
UnicodeEncodeError: 'ascii' codec can't encode characters in position 256-263: ordinal not in range(128)
Is this a problem with Python 2.7, or is there no problem at all? Please help.
From the docs:
Requests will automatically decode content from the server. Most
unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers.
requests.get().encoding is telling you the encoding that was used to convert the bitstream from the server into the Unicode text that is in the response.
In your case it is correct: the headers in the response say that the character set is windows-1251
The error you are having is after that. The python you are using is trying to encode the Unicode into ascii to print it, and failing.
You can say print r.text.encode(r.encoding) ... which is the same result as Padraic's suggestion in comments - that is r.content.
Note:
requests.get().encoding is a writable attribute: you can set it to what you want, if it guessed wrongly.
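For example, a sketch using the question's URL (Python 2 syntax to match):
import requests

r = requests.get('https://www.weblancer.net/jobs/')
print r.encoding        # the guess made from the HTTP headers
r.encoding = 'utf-8'    # override it if you know better
text = r.text           # now decoded with the encoding you set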

Fetching URL and converting to UTF-8 Python

I would like to do my first project in Python, but I have a problem with encoding. When I fetch data, it shows escaped bytes instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))
I know the code here is poor, but I have tried many options to change the encoding. I wrote z = "ł" to check whether it can print a 'special' letter, and it does. I tried encoding it and that also works as it should. sys.stdin.encoding shows cp852.
The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first before you had text:
test = page.read().decode('utf8')
However, it is up to the server to tell you how the received data is encoded. Check for a character set in the headers:
encoding = page.info().get_param('charset')
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
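Putting the two steps together, a minimal sketch that falls back to UTF-8 when the server declares no charset:
import urllib.request

page = urllib.request.urlopen("http://olx.pl/")
raw = page.read()
# Prefer the charset declared in the Content-Type header, if any
encoding = page.info().get_param('charset') or 'utf-8'
text = raw.decode(encoding)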
urlopen is returning a bytes object to you. That means it's a raw, encoded stream of bytes. Python 3 prints that in repr format, which uses escape codes for non-ASCII characters. To get the canonical unicode you would have to decode it. The right way to do that would be to inspect the headers and look for the encoding declaration. But for this we can assume UTF-8, and you can simply decode it as such, not encode it.
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test.decode("utf-8")) # <- note change
Now, Python 3 defaults to UTF-8 source encoding. So you can embed non-ASCII characters like this, provided your editor supports Unicode and saves the file as UTF-8.
z = "ł"
print(z)
Printing it will only work if your terminal supports UTF-8 encoding. On Linux and OSX they do, so this is not a problem there.
The others are correct, but I'd like to offer a simpler solution. Use requests. It's third-party, so you'll need to install it via pip:
pip install requests
But it's a lot simpler to use than the urllib libraries. For your particular case, it handles the decoding for you out of the box:
import requests
r = requests.get("http://olx.pl/")
print(r.encoding)
# UTF-8
print(type(r.text))
# <class 'str'>
print(r.text)
# The HTML
Breakdown:
get sends an HTTP GET request to the server and returns the response.
We print the encoding requests thinks the text is in. It chooses this based on the response header Martijn mentions.
We show that r.text is already a decoded text type (unicode in Python 2 and str in Python 3)
Then we actually print the response.
Note that we don't have to print the encoding or type; I've just done so for diagnostic purposes to show what requests is doing. requests is designed to simplify a lot of other details of working with HTTP requests, and it does a good job of it.

utf8 codec can't decode byte 0x96 in python

I am trying to check if a certain word is on a page for many sites. The script runs fine for say 15 sites and then it stops.
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15344: invalid start byte
I did a search on stackoverflow and found many issues on it but I can't seem to understand what went wrong in my case.
I would like to either solve it or, if there is an error, skip that site. Please advise how I can do this, as I am new and the code below itself took me a day to write. By the way, the site the script halted on was http://www.homestead.com
filetocheck = open("bloglistforcommenting","r")
resultfile = open("finalfile","w")
for countofsites in filetocheck.readlines():
sitename = countofsites.strip()
htmlfile = urllib.urlopen(sitename)
page = htmlfile.read().decode('utf8')
match = re.search("Enter your name", page)
if match:
print "match found : " + sitename
resultfile.write(sitename+"\n")
else:
print "sorry did not find the pattern " +sitename
print "Finished Operations"
As per Mark's comments, I changed the code to use BeautifulSoup:
htmlfile = urllib.urlopen("http://www.homestead.com")
page = BeautifulSoup((''.join(htmlfile)))
print page.prettify()
now I am getting this error
page = BeautifulSoup((''.join(htmlfile)))
TypeError: 'module' object is not callable
I am trying their quick start example from http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start. If I copy paste it then the code works fine.
I FINALLY got it to work. Thank you all for your help. Here is the final code.
import urllib
import re
from BeautifulSoup import BeautifulSoup

filetocheck = open("listfile", "r")
resultfile = open("finalfile", "w")
error = "for errors"
for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    page = BeautifulSoup((''.join(htmlfile)))
    pagetwo = str(page)
    match = re.search("Enter YourName", pagetwo)
    if match:
        print "match found : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename
print "Finished Operations"
The byte at 15344 is 0x96. Presumably at position 15343 there is either a single-byte encoding of a character, or the last byte of a multiple-byte encoding, making 15344 the start of a character. 0x96 is in binary 10010110, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.
Hence the stream is either not UTF-8 or else is corrupted.
Examining the URI you link to, we find the header:
Content-Type: text/html
Since there is no encoding stated, we should use the default for HTTP, which is ISO-8859-1 (aka "Latin 1").
Examining the content we find the line:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
Which is a fall-back mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told that the character encoding is ISO-8859-1.
As such, there's no reason to expect reading it as UTF-8 to work.
For extra fun though, when we consider that in ISO-8859-1 0x96 encodes U+0096 which is the control character "START OF GUARDED AREA" we find that ISO-8859-1 isn't correct either. It seems the people creating the page made a similar error to yourself.
From context, it would seem that they actually used Windows-1252, as in that encoding 0x96 encodes U+2013 (EN-DASH, looks like –).
So, to parse this particular page you want to decode in Windows-1252.
More generally, you want to examine headers when picking character encodings, and while it would perhaps be incorrect in this case (or perhaps not, more than a few "ISO-8859-1" codecs are actually Windows-1252), you'll be correct more often. You still need to have something catch failures like this by reading with a fallback. The decode method takes a second parameter called errors. The default is 'strict', but you can also have 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate), 'backslashreplace' (not appropriate) and you can register your own fallback handler with codecs.register_error().
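The whole analysis fits in two interpreter lines (Python 2 syntax, matching the question):
>>> '\x96'.decode('utf-8', 'replace')   # invalid as UTF-8, so replaced
u'\ufffd'
>>> '\x96'.decode('windows-1252')       # valid Windows-1252: EN DASH
u'\u2013'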
Many web pages are encoded incorrectly. For parsing HTML try BeautifulSoup as it can handle many types of incorrect HTML that are found in the wild.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Emphasis mine.
The site 'http://www.homestead.com' doesn't claim to be sending you utf-8; the response actually claims to be iso-8859-1:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
You must use the correct encoding for the page you actually received, not just guess randomly.
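A sketch of doing that with the question's own Python 2 urllib code, assuming the mimetools-style getparam() accessor and using windows-1252 as the fallback for the reasons given in the other answer:
import urllib

htmlfile = urllib.urlopen("http://www.homestead.com")
# Read the charset from the Content-Type header when it is present
charset = htmlfile.info().getparam('charset') or 'windows-1252'
page = htmlfile.read().decode(charset, 'replace')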

How to work with unicode in Python

I am trying to clean all of the HTML out of a string so the final output is a text file. I have done some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process, and there is a lot of variability in the quality of the underlying HTML. To begin comparing the speed of my solution against one of the alternatives, for example pyparsing, I decided to test replacing \xa0 using the string method replace. I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
The actual line of code was
s=unicodestring.replace('\xa0','')
Anyway, I decided that I needed to preface it with an r, so I ran this line of code:
s=unicodestring.replace(r'\xa0','')
It runs without error, but when I look at a slice of s I see that the \xa0 is still there.
Maybe you should be doing:
s=unicodestring.replace(u'\xa0',u'')
s=unicodestring.replace('\xa0','')
..mixes the byte string '\xa0' with a unicode string, so Python 2 implicitly tries to decode it as ASCII (str being the default string type in Python until version 3.x), and \xa0 is not a valid ASCII byte.
The reason r'\xa0' did not error is that in a raw string, escape sequences have no effect. Rather than turning \xa0 into the corresponding character, it saw the string as a "literal backslash", "literal x" and so on.
The following are the same:
>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'
This is something resolved in Python v3, as the default string type is unicode, so you can just do..
>>> '\xa0'
'\xa0'
I am trying to clean all of the HTML out of a string so the final output is a text file
I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job of both parsing HTML and dealing with Unicode:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
<body>
<h1>
Hi
</h1>
</body>
</html>
Look at the codecs standard library, specifically the encode and decode methods provided in the Codec base class.
There's also a good article here that puts it all together.
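For instance, codecs.open returns a file object that decodes for you as you read (a minimal sketch; 'page.html' is a placeholder filename):
import codecs

# Bytes are decoded to unicode on the fly as you read
with codecs.open('page.html', 'r', encoding='utf-8', errors='replace') as f:
    text = f.read()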
Instead of this, it's better to use standard Python features.
For example:
string = unicode('Hello, \xa0World', 'utf-8', 'replace')
or
string = unicode('Hello, \xa0World', 'utf-8', 'ignore')
where 'replace' substitutes each undecodable byte with the REPLACEMENT CHARACTER (U+FFFD).
But if \xa0 really isn't meaningful to you and you want to remove it, use 'ignore'.
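You can check both modes in the interpreter:
>>> unicode('Hello, \xa0World', 'utf-8', 'replace')
u'Hello, \ufffdWorld'
>>> unicode('Hello, \xa0World', 'utf-8', 'ignore')
u'Hello, World'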
Just a note regarding HTML cleaning: it is very, very hard, since
<
body
>
is a valid way to write HTML. Just an FYI.
You can convert it to unicode in this way:
print u'Hello, \xa0World' # prints Hello, World (the \xa0 shows up as a no-break space)
