python urllib2.urlopen - html text is garbled - why?

The printed HTML comes back as garbled text, instead of what I expect to see in "view source" in the browser.
Why is that? How can I fix it easily?
Thank you for your help.
I get the same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html

I got the same garbled text using curl
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped, so piping it through gunzip shows the correct HTML for me.
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edit by OP: after reading the answer linked above, the revised code is:
import urllib
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
# html now holds the HTML (print it to see)
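For anyone on Python 3, a minimal equivalent sketch, assuming the response really is gzip-compressed as above (urllib2 became urllib.request, and gzip.decompress replaces the StringIO/GzipFile dance):
import gzip
import urllib.request

start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib.request.urlopen(start_url)
# gzip.decompress raises an error if the payload is not actually gzipped
html = gzip.decompress(response.read())
print(html)  # html is bytes here; decode it (e.g. html.decode("utf-8")) for text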

Try the requests library (Python Requests).
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is that the site serves its pages gzip-encoded. To my knowledge, urllib2 doesn't transparently decompress gzip responses, so you end up with compressed HTML from sites that use that encoding. You can confirm this by printing the content headers from the response, like so.
print response.headers
There you will see that the "Content-Encoding" header is gzip. To get around this using the standard urllib library you'd need to use the gzip module, as in the edit above. Mechanize has the same problem because it uses the same urllib machinery. Requests handles this encoding and formats it nicely for you.
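If you want to stay with the standard library, another option is to ask the server not to compress the response at all. A minimal sketch, assuming the server honors the Accept-Encoding header (many do, but it is not guaranteed):
import urllib2

# Ask for an uncompressed response; whether the server honors
# "identity" is up to the server, so this is best-effort only.
req = urllib2.Request("http://www.ncert.nic.in/ncerts/textbook/textbook.htm",
                      headers={"Accept-Encoding": "identity"})
html = urllib2.urlopen(req).read()
print html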

Related

how to work around encoding problems in redirects

A website I try to scrape seems to have an encoding problem. The pages state that they are encoded in UTF-8, but when I try to scrape them and fetch the HTML source using requests, the redirect address contains an encoding that is not UTF-8.
Browsers seem to be tolerant and fix this automatically, but the Python requests package runs into an exception.
My code looks like this:
res = rq.get(url, allow_redirects=True)  # rq is the requests module
This runs into an exception when trying to decode the redirect string in the following code (hidden somewhere in the requests package):
string.decode(encoding)
where string is the redirect string and encoding is 'utf8':
string= b'/aktien/herm\xe8s-aktie'
I found out that the string is in fact encoded in something like Windows-1252. The redirect should actually go to '/aktien/herm%C3%A8s-aktie'.
Now my question: how can I either get requests to be more tolerant about such encoding bugs (like the browsers), or how can I alternatively pass an encoding?
I searched for encoding settings, but from what I have seen so far, requests always determines the encoding automatically from the response.
By the way, the result page of the redirect really does declare UTF-8; it starts with:
<!DOCTYPE html><html lang="de" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8">
You can use the hooks= parameter of requests.get() and explicitly percent-encode the Location HTTP header. For example:
import requests
import urllib.parse
url = "<YOUR URL FROM EXAMPLE>"
def response_hook(hook_data, **kwargs):
    # Percent-encode any non-ASCII characters in the redirect target
    if "Location" in hook_data.headers:
        hook_data.headers["Location"] = urllib.parse.quote(
            hook_data.headers["Location"]
        )
res = requests.get(url, allow_redirects=True, hooks={"response": response_hook})
print(res.url)
Prints:
https://.../herm%C3%A8s-aktie
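As an alternative to the hook, you could stop requests from following the redirect and fix the Location yourself. A minimal sketch of that approach:
import requests
import urllib.parse

url = "<YOUR URL FROM EXAMPLE>"

# Follow the redirect manually: headers are decoded as latin-1, so
# quote() turns the stray byte back into proper UTF-8 percent-encoding.
res = requests.get(url, allow_redirects=False)
if res.is_redirect:
    location = urllib.parse.quote(res.headers["Location"])
    res = requests.get(urllib.parse.urljoin(url, location))
print(res.url)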

is there a way to use python-requests to access web pages which are in fact pdfs?

I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code, but it seems the output that comes back is not properly decoded:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
r.text
The output looks like below:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get the HTML. I also tried with beautifulsoup, but it does not decode it either. I hope someone can help. Thank you.
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the binary data.
PDF files are not easy to deal with programmatically, but you might (for example) save the data to a file:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)  # write the raw bytes, not decoded text
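If what you actually want is the text inside the PDF rather than the raw bytes, you will need a PDF library. A minimal sketch using the third-party pypdf package (pip install pypdf) on the file saved above:
from pypdf import PdfReader

# Extract the text layer page by page; this only works if the PDF
# actually contains text, not just scanned images.
reader = PdfReader('pdf.pdf')
text = '\n'.join(page.extract_text() for page in reader.pages)
print(text)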

Display Google's search result in Python 2.7

I would like to know if there is a way to display Google's search results as output in my Python program. For example, if I type "Electricity" in my program, I want to display Google's search results for it as plain text. Is there a way to do it?
UPDATE
import urllib2
response = urllib2.urlopen("https://en.wikipedia.org/wiki/Machine_learning")
the_page = response.read(bytes)
content = str(the_page)
print the_page
I tried the above code but it shows me errors. If I just type
the_page = response.read()
print the_page
it just prints the HTML of the page, not the text string. How do I get the string alone?
import urllib2
response = urllib2.urlopen("https://en.wikipedia.org/wiki/Machine_learning")
content= response.read()
# Now it's time for parsing *content* to extract the relevant data
# Use regex, HTMLParser from standard library
# or use beautifulsoup, LXML (third party)
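For example, a minimal parsing sketch with beautifulsoup (third party, pip install beautifulsoup4) that strips the tags and leaves the visible text:
import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen("https://en.wikipedia.org/wiki/Machine_learning")
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
# get_text() drops the markup and returns just the page's text nodes
print soup.get_text()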

Python Sometimes Returns Strange Result When Reading HTML from URL

I created a function to read the HTML content from a specific URL. Here is the code:
def __retrieve_html(self, address):
    html = urllib.request.urlopen(address).read()
    Helper.log('HTML length', len(html))
    Helper.log('HTML content', html)
    return str(html)
However, the function does not always return the correct string. In some cases it returns a very long, weird string.
For example, if I use the URL http://www.merdeka.com, sometimes it gives the correct HTML string, but sometimes it returns a result like:
HTML content: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xfdyW\x1c\xb7\xd28\x8e\xffm\x9f\x93\xf7\xa0;y>\xc1\xbeA\xcc\xc2b\x03\x86\x1cl\xb0\x8d1\x86\x038yr\......Very long and much more characters.
It seems to happen only on pages that have a lot of content. For simple pages like the Facebook.com login page and the Google.com index, it never happens. What is this? Where is my mistake, and how do I handle it?
It appears the response from http://www.merdeka.com is gzip-compressed.
Give this a try:
import gzip
import urllib.request
def __retrieve_html(self, address):
    with urllib.request.urlopen(address) as resp:
        html = resp.read()
        Helper.log('HTML length', len(html))
        Helper.log('HTML content', html)
        if resp.info().get('Content-Encoding') == 'gzip':
            html = gzip.decompress(html)
        return html
How to decode the resulting html bytes object I leave as an exercise for you.
Alternatively, you could just use the Requests module: http://docs.python-requests.org/en/latest/
Install it with:
pip install requests
Then execute like:
import requests
r = requests.get('http://www.merdeka.com')
r.text
Requests didn't appear to have any trouble with the response from http://www.merdeka.com
You've got bytes instead of a string because urllib can't decode the response for you. This could be because some sites omit the encoding declaration in their Content-Type header.
For example, google.com has:
Content-Type: text/html; charset=UTF-8
and that http://www.merdeka.com website has just:
Content-Type: text/html
So, you need to manually decode the response, for example with the utf-8 encoding:
html = urllib.request.urlopen(address).read().decode('utf-8')
The problem is that you need to pick the correct encoding, and if it is not in the server headers, you need to guess it somehow.
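One way to make that guess, as a sketch: take the charset from the Content-Type header when the server declares one, and fall back to an assumed default otherwise. This follows the decode-only approach above and assumes the response is not also gzip-compressed:
import urllib.request

def fetch_text(address):
    with urllib.request.urlopen(address) as resp:
        # get_content_charset() reads the charset= parameter of the
        # Content-Type header; falling back to utf-8 is an assumption.
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset, errors='replace')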
See this question for more information: How to handle response encoding from urllib.request.urlopen()
PS: Consider moving from the somewhat dated urllib to the requests lib. It's simpler, trendier and sexier at this time :)

python library to fetch content (http text) from websites?

I'd like something equivalent to urlopen from the urllib Python library to fetch data from the web. urlopen does not seem to work on sites like Google or YouTube, probably because I'm sending incorrect headers. Is there any other Python-based content fetcher I can use?
Here is an example with urllib2 that fetches a web page:
import urllib2
response = urllib2.urlopen('http://google.com/')
html = response.read()
The example is taken from the urllib2 documentation.
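If the problem really is the headers, as the question suspects, a sketch of the usual workaround is to send a browser-like User-Agent with the request (the exact header value below is illustrative):
import urllib2

# Some sites reject urllib2's default user agent; a browser-like
# User-Agent header often gets past that.
req = urllib2.Request('http://google.com/',
                      headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()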
