python urllib2 returns garbage

I am trying to download a web page with Python and access some elements on it. The problem is that when I download the page, the content is garbage. These are the first lines of the page:
‹í}évÛH²æïòSd±ÏmÉ·’¸–%ÕhµÕ%ÙjI¶«JããIÐ(‰îî{æ1æ÷¼Æ¼Í}’ù"à""’‚d÷t»N‰$–\"ãˈŒˆŒÜøqïíîùï'û¬¼­gôÁnžm–úq<ü¹R¹¾¾._›å ìUôv»]¹¡gJÌqÃÍ’‡%z‹[ÎÖ3†[(,jüËȽÚ,í~ÌýX;y‰Ùò×f)æ7q…JzÉì¾F<ÞÅ]­Uª
This problem happens only on the following website: http://kickass.to. Is it possible that they have somehow protected their page? This is my Python code:
import urllib2
import chardet

url = 'http://kickass.to/'
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }

req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
page = response.read()

f = open('page.html', 'w')
f.write(page)
f.close()

print response.headers['content-type']
print chardet.detect(page)
and result:
text/html; charset=UTF-8
{'confidence': 0.0, 'encoding': None}
It looks like an encoding issue, but chardet detects None. Any ideas?

This page is returned in gzip encoding.
(Try printing out response.headers['content-encoding'] to verify this.)
Most likely the website doesn't respect the 'Accept-Encoding' field in the request and simply assumes that the client supports gzip (most modern browsers do).
urllib2 doesn't decompress gzip responses for you, but you can use the gzip module for that, as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage? .
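For example, a minimal sketch of decompressing the body by hand (Python 2; it assumes the server keeps sending gzip data even though the request never advertised support for it):
import urllib2
import gzip
from StringIO import StringIO

url = 'http://kickass.to/'
headers = { 'User-Agent' : 'Mozilla/5.0' }
response = urllib2.urlopen(urllib2.Request(url, None, headers))
page = response.read()

# only decompress if the server actually declared gzip encoding
if response.headers.get('content-encoding') == 'gzip':
    page = gzip.GzipFile(fileobj=StringIO(page)).read()

print page[:200]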

Related

Trying to use FancyURLopener in Python 3 for a PDF, but it gives me a DeprecationWarning

I am trying to access a PDF file from a bank's website for PDF mining, but it keeps returning an HTTP 403 error. As a workaround, I am trying to change my User-Agent to a browser's in order to access (and download) the file.
The code below is part of what I have right now. It returns the following error:
C:\Users\Name\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: DeprecationWarning: MyOpener style of invoking requests is deprecated. Use newer urlopen functions/methods
How do I fix this?
import urllib.request

my_url = 'someurl here'

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()
page = myopener.open(my_url)
page.read()
You can try this:
import urllib2

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    f = open("the_downloaded_file.pdf", 'wb')
    f.write(response.read())
    f.close()

download_file("some url to pdf here")
For me, urllib2 was giving a squiggly line in my VS Code IDE.
I changed urllib2 to urllib and it worked, based on this answer:
from urllib.request import urlopen

def http_get_save(url, encoded_path):
    with urlopen(url) as response:
        body = response.read().decode()
    with open(encoded_path, 'w') as f:
        f.write(body)
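If the bank's server returns 403 because of the default Python User-Agent, the header can also be set through urllib.request.Request, which sidesteps FancyURLopener entirely. A minimal sketch (the URL and output filename are placeholders):
import urllib.request

my_url = 'someurl here'  # placeholder, as in the question
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'}

# Request carries the custom User-Agent without the deprecated opener classes
req = urllib.request.Request(my_url, headers=headers)
with urllib.request.urlopen(req) as response:
    with open('the_downloaded_file.pdf', 'wb') as f:
        f.write(response.read())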

Getting information in inspect element

I'm trying to find all the information shown under "inspect" in a browser (for example Chrome). Currently I can get the page "source", but it doesn't contain everything that inspect contains.
When I tried using
with urllib.request.urlopen(section_url) as url:
    html = url.read()
I got the following error message: "urllib.error.HTTPError: HTTP Error 403: Forbidden".
Now I'm assuming this is because the URL I'm trying to get uses https instead of http, and I was wondering if there is a specific way to get that information over https, since the normal methods aren't working.
Note: I've also tried this, but it didn't show me everything
f = requests.get(url)
print(f.text)
You need to send a browser User-Agent so the site doesn't treat you as a robot.
import urllib.request, urllib.error, urllib.parse
url = 'http://www.google.com' #Input your url
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(req)
html = response.read()
response.close()
adapted from https://stackoverflow.com/a/3949760/6622817
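The same header also works with requests, which the question already uses; a short sketch (the URL is a placeholder):
import requests

url = 'http://www.google.com'  # placeholder
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'}

# requests sends the custom User-Agent along with the GET
f = requests.get(url, headers=headers)
print(f.text[:200])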

How to Google in Python Using urllib or requests

What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply do res = requests.get("https://www.google.com/#q=" + query), it doesn't come back with the same HTML I see when I inspect the Google page in Safari. The same happens with urllib, and something similar happens with Bing. I am familiar with AJAX, but it seems that that is now deprecated.
In Python, if you do not set the User-Agent header on HTTP requests yourself, Python adds a default one, which Google can detect and may block.
Try the following and see if it helps.
import urllib.request

yourUrl = "post it here"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

req = urllib.request.Request(yourUrl, headers=headers)
page = urllib.request.urlopen(req)
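If you also need to build the search URL from a query string, urllib.parse.urlencode handles the escaping; a minimal sketch (the query text is just an example):
import urllib.parse
import urllib.request

query = "python urllib example"
# Google's search endpoint takes the query in the 'q' parameter
url = 'https://www.google.com/search?' + urllib.parse.urlencode({'q': query})
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as page:
    html = page.read().decode('utf-8', errors='replace')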

Reading a Walmart product page with urllib doesn't work when using a "user-agent" string

I'm building a Django-based website where some data is dynamically loaded via Ajax from a user-specified URL. For this I'm using urllib2 and later BeautifulSoup. I came across something strange with Walmart links. Take a look:
import urllib2
url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
# 1 - read the url without user-agent string
opened_url = urllib2.urlopen(url_to_parse)
print len(opened_url.read())
# prints 309316
# 2 - read the url with user-agent string
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' }
req = urllib2.Request(url_to_parse, '', headers)
opened_url = urllib2.urlopen(req)
print len(opened_url.read())
# prints 0
My question is: why is zero printed in case #2? I use the user-agent approach to deal with other websites (like Amazon) without problems.
Wget is able to get the page content with no problems, by the way.
Your problem is not the User-Agent, it is your data parameter.
From the docs:
data may be a string specifying additional data to send to the server,
or None if no such data is needed.
Passing data that is not None (even an empty string) turns the request into a POST, and it seems Walmart does not like that empty POST. Change your call to this:
req = urllib2.Request(url_to_parse, None, headers)
Now both ways print the same value.
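A way to avoid the mistake entirely is to pass headers as a keyword argument and leave data at its default of None; a small sketch:
import urllib2

url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' }

# headers= keeps data at its default (None), so the request stays a GET
req = urllib2.Request(url_to_parse, headers=headers)
opened_url = urllib2.urlopen(req)
print len(opened_url.read())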

Python http download page source

Hello there,
I was wondering if it is possible to connect to an HTTP host (for example google.com) and download the source of the web page?
Thanks in advance.
Use urllib2 to download the page.
Google will block the request if it looks like a robot, so add a User-Agent header to it.
import urllib2
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request('http://www.google.com', None, headers)
response = urllib2.urlopen(req)
page = response.read()
response.close() # it's always safe to close an open connection
You can also use pyCurl
import sys
import pycurl

class ContentCallback:
    def __init__(self):
        self.contents = ''

    def content_callback(self, buf):
        self.contents = self.contents + buf

t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback)
curlObj.perform()
curlObj.close()
print t.contents
You can use the urllib2 module.
import urllib2
url = "http://somewhere.com"
page = urllib2.urlopen(url)
data = page.read()
print data
See the doc for more examples
The documentation of httplib (low-level) and urllib (high-level) should get you started. Choose the one that's more suitable for you.
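For instance, a minimal httplib sketch (Python 2; in Python 3 the module is called http.client):
import httplib

# open a connection, issue a plain GET and read the body
conn = httplib.HTTPConnection('www.google.com')
conn.request('GET', '/')
resp = conn.getresponse()
print resp.status, resp.reason
data = resp.read()
conn.close()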
Using requests package:
# Import requests
import requests
#url
url = 'https://www.google.com/'
# Create the binary string html containing the HTML source
html = requests.get(url).content
or with the urllib
from urllib.request import urlopen
#url
url = 'https://www.google.com/'
# Create the binary string html containing the HTML source
html = urlopen(url).read()
So here's another approach to this problem, using mechanize. I found it bypasses a website's robot-checking. I commented out set_all_readonly because for some reason it wasn't recognized as an attribute in mechanize.
import mechanize
url = 'http://www.example.com'
br = mechanize.Browser()
#br.set_all_readonly(False) # allow everything to be written to
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] # [('User-agent', 'Firefox')]
response = br.open(url)
print response.read() # the text of the page
response1 = br.response() # get the response again
print response1.read() # can apply lxml.html.fromstring()
