urllib2 does not read entire page

urllib2 does not read entire page - python

A portion of code that I have that will parse a web site does not work.
I can trace the problem to the .read function of my urllib2.urlopen object.
page = urllib2.urlopen('http://magiccards.info/us/en.html')
data = page.read()
Until yesterday, this worked fine; but now the length of the data is always 69496 instead of 122989, however when I open smaller pages my code works fine.
I have tested this on Ubuntu, Linux Mint and windows 7. All have the same behaviour.
I'm assuming that something has changed on the web server; but the page is complete when I use a web browser. I have tried to diagnose the issue with wireshark but the page is received as complete.
Does anybody know why this may be happening or what I could try to determine the issue?

The page seems to be misbehaving unless you request the content encoded as gzip. Give this a shot:
import urllib2
import zlib
request = urllib2.Request('http://magiccards.info/us/en.html')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
As Nathan suggested, you could also use the great Requests library, which accepts gzip by default.
import requests
data = requests.get('http://magiccards.info/us/en.html').text

Yes, the server is closing connection and you need keep-alive to be sent. urllib2 does not have that facility ( :-( ). There used be urlgrabber which you could use have a HTTPHandler that works alongside with urllib2 opener. But unfortunately, I dont find that working too. At the moment, you could be other libraries, like requests as demonstrated in the other answer or httplib2.
import httplib2
h = httplib2.Http(".cache")
resp, content = h.request("http://magiccards.info/us/en.html", "GET")
print len(content)

Related

Python : Extract requests query parameters that any given URL receives

Background:
Typically, if I want to see what type of requests a website is getting, I would open up chrome developer tools (F12), go to the Network tab and filter the requests I want to see.
Example:
Once I have the request URL, I can simply parse the URL for the query string parameters I want.
This is a very manual task and I thought I could write a script that does this for any URL I provide. I thought Python would be great for this.
Task:
I have found a library called requests that I use to validate the URL before opening.
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure of how to get the requests that the URL I enter receives. Is this possible in python? A point in the right direction would be great. Once I know how to access these request headers, I can easily parse through.
Thank you.

You can use the urlparse method to fetch the query params
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in python2.7

Python3 Download Incorrectly Encoded Image From URL

The problem I am currently having is trying to download an image that displays as an animated gif, but appears encoded as a jpg. I say that it appears to be encoded as a jpg because the file extension and mime-type are both .jpg add image/jpeg.
When downloading the file to my local machine (Mac OSX), then attempting to open the file I get the error:
The file could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
While I realize that some people would maybe just ignore that image, if it can be fixed, I'm looking for a solution to do that, not just ignore it.
The url in question is here:
http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg
Here is my code, and I am open to suggestions:
from PIL import Image
import requests
response = requests.get(media, stream = True)
response.raise_for_status()
with open(uploadedFile, 'wb') as img:
for chunk in response.iter_content(chunk_size=1024):
if chunk:
img.write(chunk)
img.close()

According to Wheregoes, the link of the image:
http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg
receives a 302 redirect to the page that contains it:
http://www.supergrove.com/gif-images/gif-images-22-1000-about-gif-on-pinterest/
Therefore, your code is trying to download a web page as an image.
I tried:
r = requests.get(the_url, headers=headers, allow_redirects=False)
But it returns zero content and status_code = 302.
(Indeed that was obvious it should happen ...)
This server is configured in a way that it will never fulfill that request.
Bypassing that limitation sounds illegal difficult, to the best of my -limited perhaps- knowledge.

Had to answer my own question in this case, but the answer to this problem, was to add a referer for the request. Most likely an htaccess file preventing some direct file access on the image's server unless the request came from their own server.
from fake_useragent import UserAgent
from io import StringIO,BytesIO
import io
import imghdr
import requests
# Set url
mediaURL = 'http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg'
# Create a user agent
ua = UserAgent()
# Create a request session
s = requests.Session()
# Set some headers for the request
s.headers.update({ 'User-Agent': ua.chrome, 'Referrer': media })
# Make the request to get the image from the url
response = s.get(mediaURL, allow_redirects=False)
# The request was about to be redirected
if response.status_code == 302:
# Get the next location that we would have been redirected to
location = response.headers['Location']
# Set the previous page url as referer
s.headers.update({'referer': location})
# Try the request again, this time with a referer
response = s.get(mediaURL, allow_redirects=False, cookies=response.cookies)
print(response.headers)
Hat tip to #raratiru for suggesting the use of allow_redirects.
Also noted in their answer is that the image's server might be intentionally blocking access to prevent general scrapers from viewing their images. Hard to tell, but regardless, this solution works.

Python CookieJar saves cookie, but doesn't send it to website

I am trying to login to website using urllib2 and cookiejar. It saves the session id, but when I try to open another link, which requires authentication it says that I am not logged in. What am I doing wrong?
Here's the code, which fails for me:
import urllib
import urllib2
import cookielib
cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
# Gives response saying that I logged in succesfully
response = opener.open("http://site.com/login", "username=testuser&password=" + md5encode("testpassword"))
# Gives response saying that I am not logged in
response1 = opener.open("http://site.com/check")

Your implementation seems fine... and should work.
It should be sending in the correct cookies, but I see it as the case when the site is actually not logging you in.
How can you say that its not sending the cookies or may be cookies that you are getting are not the one that authenticates you.
Use : response.info() to see the headers of the responses to see what cookies you are receiving actually.
The site may not be logging you in because :
Its having a check on User-agent that you are not setting, since some sites open from 4 major browsers only to disallow bot access.
The site might be looking for some special hidden form field that you might not be sending in.
1 piece of advise:
from urllib import urlencode
# Use urlencode to encode your data
data = urlencode(dict(username='testuser', password=md5encode("testpassword")))
response = opener.open("http://site.com/login", data)
Moreover 1 thing is strange here :
You are md5 encoding your password before sending it over. (Strange)
This is generally done by the server before comparing to database.
This is possible only if the site.com implements md5 in javascript.
Its a very rare case, since only may be 0.01 % websites do that..
Check that - that might be the problem, and you are providing the hashed form and not the actual password to the server.
So, server would have been again calculating a md5 for your md5 hash.
Check out.. !!
:)

I had a similar problem with my own test server, which worked fine with a browser, but not with the urllib2.build_opener solution.
The problem seems to be in urllib2. As these answers suggest, it's easy to use more powerful mechanize library instead of urllib2:
cookieJar = cookielib.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cookieJar)
opener = mechanize.build_opener(*browser.handlers)
And the opener will work as expected!

urllib2 basic authentication oddites

I'm slamming my head against the wall with this one. I've been trying every example, reading every last bit I can find online about basic http authorization with urllib2, but I can not figure out what is causing my specific error.
Adding to the frustration is that the code works for one page, and yet not for another.
logging into www.mysite.com/adm goes absolutely smooth. It authenticates no problem. Yet if I change the address to 'http://mysite.com/adm/items.php?n=201105&c=200' I receive this error:
<h4 align="center" class="teal">Add/Edit Items</h4>
<p><strong>Client:</strong> </p><p><strong>Event:</strong> </p><p class="error">Not enough information to complete this task</p>
<p class="error">This is a fatal error so I am exiting now.</p>
Searching google has lead to zero information on this error.
The adm is a frame set page, I'm not sure if that's relevant at all.
Here is the current code:
import urllib2, urllib
import sys
import re
import base64
from urlparse import urlparse
theurl = 'http://xxxxxmedia.com/adm/items.php?n=201105&c=200'
username = 'XXXX'
password = 'XXXX'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl,username,password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
urllib2.install_opener(opener)
pagehandle = urllib2.urlopen(theurl)
url = 'http://xxxxxxxmedia.com/adm/items.php?n=201105&c=200'
values = {'AvAudioCD': 1,
'AvAudioCDDiscount': 00, 'AvAudioCDPrice': 50,
'ProductName': 'python test', 'frmSubmit': 'Submit' }
#opener2 = urllib2.build_opener(urllib2.HTTPCookieProcessor())
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
This is just one of the many versions I've tried. I've followed every example from Urllib2 Missing Manual but still receive the same error.
Can anyone point to what I'm doing wrong?

Run into a similar problem today. I was using basic authentication on the website I am developing and I couldn't authenticate any users.
Here are a few things you can use to debug your problem:
I used slumber.in and httplib2 for testing purposes. I ran both from ipython shell to see what responses I was receiving.
Slumber actually uses httplib2 beneath the covers so they acted similarly. I used tcpdump and later tcpflow (which shows information in a much more readable form) to see what was really being sent and received. If you want a GUI, see wireshark or alternatives.
I tested my website with curl and when I used curl with my username/password it worked correctly and showed the requested page. But slumber and httplib2 were still not working.
I tested my website and browserspy.dk to see what were the differences. Important thing is browserspy's website works for basic authentication and my web site did not, so I could compare between the two. I read in a lot of places that you need to send HTTP 401 Not Authorized so that the browser or the tool you are using could send the username/password you provided. But what I didn't know was, you also needed the WWW-Authenticate field in the header. So this was the missing piece.
What made this whole situation odd was while testing I would see httplib2 send basic authentication headers with most of the requests (tcpflow would show that). It turns out that the library does not send username/password authentication on the first request. If "Status 401" AND "WWW-Authenticate" is in the response, then the credentials are sent on the second request and all the requests to this domain from then on.
So to sum up, your application may be correct but you might not be returning the standard headers and status code for the client to send credentials. Use your debug tools to find which is which. Also, there's debug mode for httplib2, just set httplib2.debuglevel=1 so that debug information is printed on the standard output. This is much more helpful then using tcpdump because it is at a higher level.
Hope this helps someone.

About an year ago, I went thro' the same process and documented how I solved the problem - The direct and simple way to authentication and the standard one. Choose what you deem fit.
HTTP Authentication in Python
There is an explained description, in the missing urllib2 document.

From the HTML you posted, it still think that you authenticate successfully but encounter an error afterwards, in the processing of your POST request. I tried your URL and failing authentication, I get a standard 401 page.
In any case, I suggest you try again running your code and performing the same operation manually in Firefox, only this time with Wireshark to capture the exchange. You can grab the full text of the HTTP request and response in both cases and compare the differences. In most cases that will lead you to the source of the error you get.

I also found the passman stuff doesn't work (sometimes?). Adding the base64 user/pass header as per this answer https://stackoverflow.com/a/18592800/623159 did work for me. I am accessing jenkins URL like this: http:///job//lastCompletedBuild/testR‌‌eport/api/python
This works for me:
import urllib2
import base64
baseurl="http://jenkinsurl"
username=...
password=...
url="%s/job/jobname/lastCompletedBuild/testReport/api/python" % baseurl
base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
request = urllib2.Request(url)
request.add_header("Authorization", "Basic %s" % base64string)
result = urllib2.urlopen(request)
data = result.read()
This doesn't work for me, error 403 each time:
import urllib2
baseurl="http://jenkinsurl"
username=...
password=...
##urllib2.HTTPError: HTTP Error 403: Forbidden
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username,password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passman)))
req = urllib2.Request(url)
result = urllib2.urlopen(req)
data = result.read()

Cookie Problem in Python

I'm working on a simple HTML scraper for Hulu in python 2.6 and am having problems with logging on to my account. Here's my code so far:
import urllib
import urllib2
from cookielib import CookieJar
#make a cookie and redirect handlers
cookies = CookieJar()
cookie_handler= urllib2.HTTPCookieProcessor(cookies)
redirect_handler= urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)#make opener w/ handlers
#build the url
login_info = {'username':USER,'password':PASS}#USER and PASS are defined
data = urllib.urlencode(login_info)
req = urllib2.Request("http://www.hulu.com/account/authenticate",data)#make the request
test = opener.open(req) #open the page
print test.read() #print html results
The code compiles and runs, but all that prints is:
Login.onError("Please \074a href=\"/support/login_faq#cant_login\"\076enable cookies\074/a\076 and try again.");
I assume there is some error in how I'm handling cookies, but just can't seem to spot it. I've heard Mechanize is a very useful module for this type of program, but as this seems to be the only speed bump left, I was hoping to find my bug.

What you're seeing is a ajax return. It is probably using javascript to set the cookie, and screwing up your attempts to authenticate.

The error message you are getting back could be misleading. For example the server might be looking at user-agent and seeing that say it's not one of the supported browsers, or looking at HTTP_REFERER expecting it to be coming from hulu domain. My point is there are two many variables coming in the request to keep guessing them one by one
I recommend using an http analyzer tool, e.g. Charles or the one in Firebug to figure out what (header fields, cookies, parameters) the client sends to server when you doing hulu login via a browser. This will give you the exact request that you need to construct in your python code.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.