I am trying to write a script that will download a bunch of files from a website that uses REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check whether the response was a 302. If it is not, the script will "pass"; if it is, it will need to parse out the Location header shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the Location header value, I will have to make another GET request to download that file. I will also have to maintain my session cookie in order to download it.
Can someone point me in the right direction for the best library to use for this? I am having trouble finding out how to parse the 302 response and add a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
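A minimal sketch of that flow using the requests library; the URL and output filename here are assumptions pieced together from the capture above, not tested against the real server:
import requests

# Hypothetical URL assembled from the GET request shown above
url = "http://testServer.com/test/download/id/5774/format/testTitle"

# A Session persists cookies (e.g. PHPSESSID) across requests
session = requests.Session()

# allow_redirects=False keeps requests from following the 302 automatically,
# so the raw redirect response can be inspected
resp = session.get(url, allow_redirects=False)

if resp.status_code == 302:
    # The Location header carries the actual download URL
    download_url = resp.headers["Location"]
    file_resp = session.get(download_url)
    with open("picture.jpg", "wb") as fh:
        fh.write(file_resp.content)
else:
    pass  # not a redirect; nothing to download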
import urllib.request as ur
import urllib.error as ue

'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response
body (or up to the next amt bytes) as raw bytes: there is no way for urlopen()
to automatically determine the encoding of the byte stream it receives from
the HTTP server, so any decoding is left to the caller.
'''

url = "http://www.example.org/images/{}.jpg"
dst = ""
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)
try:
    for x in arr:
        print((str(x)+"). ").ljust(4), end="")
        # HTTPResponse object; .read() returns the response body as bytes
        hrio = ur.urlopen(url.format(x))
        # "wb" (not "b+w"): write the raw bytes to a fresh file
        with open(dst+str(x)+".jpg", "wb") as fh:
            fh.write(hrio.read())
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    print("\t[REQUEST INCOMPLETE]\t", end="")
    print("<Error ~ [{}]>".format(e))
I am attempting to download a .xlsx file from https://matmatch.com/ using Python Requests; however, my script downloads the HTML content of the page rather than the xlsx document I want. I used LivehttpHeaders to look at the requests the page makes, and I found the GET request it sends to download the file:
http://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=cd21d8a7-ba5e-037e-f830-08600989cb0a
Using this in Python with a requests.get call still results in downloading the raw HTML instead of the file. Here is the code I am currently using:
import requests
url = "https://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=360f528b-53be-72c1-25dd-dcf9401633a3"
r = requests.get(url, allow_redirects=True)
open('copper.csv', 'wb').write(r.content)
Are there any glaring issues in my code, or am I just not understanding how HTML works?
Cheers
EDIT: Here is the page I am trying to download from using the "download as .xlsx" button
https://matmatch.com/materials/mitf1194-astm-b196-grade-c17200-tb00
EDIT: Here is the LivehttpHeaders summary of the GET request. Do I need to include some of these headers in my request?
GET /api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=41db0a0e-5091-0b06-8195-f0e0639286d6 HTTP/1.1
Host: matmatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
HTTP/1.1 200
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Connection: keep-alive
Content-Disposition: attachment; filename="mitf1194-astm-b196-grade-c17200-tb00.xlsx"
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Date: Mon, 22 Mar 2021 00:37:42 GMT
Expires: 0
Filename: mitf1194-astm-b196-grade-c17200-tb00.xlsx
Pragma: no-cache
Server: nginx
Strict-Transport-Security: max-age=15724800; includeSubDomains;
transfer-encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
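One hedged guess, since the URL itself says expires-in=30min: the download-token is short-lived, so a stale token returns the HTML page instead of the spreadsheet. A sketch that uses a freshly copied token plus the captured headers (the token value is a placeholder):
import requests

# Placeholder token: copy a fresh one from the browser's request
url = ("https://matmatch.com/api/downloads/materials/MITF1194/xlsx"
       "?expires-in=30min&download-token=<fresh-token>")

# Headers mirrored from the LivehttpHeaders capture above
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/89.0.4389.90 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

r = requests.get(url, headers=headers, allow_redirects=True)
# Save with an .xlsx extension to match the Content-Type of the response
with open("mitf1194-astm-b196-grade-c17200-tb00.xlsx", "wb") as f:
    f.write(r.content)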
I am trying to download a file using the Python requests module by logging in to the site first. I am able to log in, but when I send a GET request to download the file it shows me the login page again.
Code:
import requests

login_url = 'https://seller.flipkart.com/login'
manifest_url = 'https://seller.flipkart.com/order_management/manifest.pdf'
username = 'username#gmail.com'
password = 'password'
params = {'sellerId': 'seller_id'}
payload = {'authName': 'flipkart',
           'username': username,
           'password': password}
ses = requests.Session()
ses.post(login_url, data=payload, headers={'Content-Type': 'application/x-www-form-urlencoded', 'Connection': 'keep-alive'})
response = ses.get(manifest_url, params=params, headers={'Content-Type': 'application/pdf', 'Connection': 'keep-alive'})
print response.status_code
print response.url
print response.content
On running this code I get the HTML of the login page as content.
I used Fiddler and captured the data below:
Request URL: https://seller.flipkart.com/order_management/manifest.pdf?sellerId=seller_id
Request Method: GET
sellerId: seller_id
# Request Headers
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Referer: https://seller.flipkart.com/order_management?sellerId=seller_id
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
# Response Headers
Server: nginx
Date: Wed, 30 Dec 2015 13:12:31 GMT
Content-Type: application/pdf
Content-Length: 3652
Connection: keep-alive
X-XSS-Protection: 1; mode=block
strict-transport-security: max-age=31536000; preload
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Cache-Control: private, no-cache, no-store, must-revalidate
Expires: -1
Pragma: no-cache
X-Req-Id: REQ-14d7434a-e429-40e4-801f-6010d7c0b48c
X-Host-Id: 0008
content-disposition: attachment; filename=Manifest-seller_id-30-Dec-2015-18-42-30.pdf
vary: Accept-Encoding
How do I download the file?
Set stream=True and then write the contents to a file.
import re

# Send the request with stream=True so the body can be written out in chunks
r = ses.get(manifest_url, ..., stream=True)

# Pull the filename out of the Content-Disposition header
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

# Write the content to file chunk by chunk
with open(fname, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive chunks
            f.write(chunk)
Docs.
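Note that streaming with iter_content writes the PDF to disk in chunks instead of buffering the whole response body in memory, which is the point of stream=True here.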
I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I have now stumbled upon a website where fetching RSS feeds fails (example feed). The returned result does not contain the expected keys, and the values are some HTML code.
I tried simply reading the feed URL with urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed. If I add a custom header like
headers = {
    'Content-type': 'text/xml',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = urllib2.Request(url, headers=headers)
I don't get the 405 error anymore, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same goes for "View Page Source". feedparser.parse also allows setting agent and request_headers; I tried various agents. I'm still not able to correctly read the XML, let alone the parsed feed, from feedparser.
What am I missing here?
So, this feed yields a 405 error when the client making the request does not use a User-Agent. Try this:
$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
While without the UA, you get:
$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
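Translated to Python, any non-empty agent string should do; a sketch using feedparser's agent parameter (UA string borrowed from the question):
import feedparser

# Any non-empty User-Agent appears to satisfy this server
d = feedparser.parse('http://www.propertyguru.com.sg/rss',
                     agent='Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0')
print d.feed.title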
I've scraped many websites and have often wondered why the response headers displayed in Firebug and the response headers returned by urllib.urlopen(url).info() are often different in that Firebug reports MORE headers.
I encountered an interesting one today. I'm scraping a website by following a "search url" that fully loads (returns a 200 status code) before redirecting to a final page. The easiest way to perform the scrape would be to read the Location response header and make another request. However, that particular header is absent when I run urllib.urlopen(url).info().
Here is the difference:
Firebug headers:
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection : keep-alive
Content-Encoding : gzip
Content-Length : 2433
Content-Type : text/html
Date : Fri, 05 Oct 2012 15:59:31 GMT
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Location : /catalog/display/1292/index.html
Pragma : no-cache
Server : Apache/2.0.55
Set-Cookie : PHPSESSID=9b99dd9a4afb0ef0ca267b853265b540; path=/
Vary : Accept-Encoding,User-Agent
X-Powered-By : PHP/4.4.0
Headers returned by my code:
Date: Fri, 05 Oct 2012 17:16:23 GMT
Server: Apache/2.0.55
X-Powered-By: PHP/4.4.0
Set-Cookie: PHPSESSID=39ccc547fc407daab21d3c83451d9a04; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Type: text/html
Connection: close
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras
import scrape_tools

tools = scrape_tools.tool_box()
db = tools.db_connect()
cursor = db.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cursor.execute("SELECT data FROM table WHERE variable = 'Constant' ORDER BY data")
for row in cursor:
    url = 'http://www.website.com/search/' + row['data']
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-us,en;q=0.5',
        'Connection': 'keep-alive',
        'Host': 'www.website.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'
    }
    post_params = {
        'query': row['data'],
        'searchtype': 'products'
    }
    post_args = urllib.urlencode(post_params)
    soup = tools.request(url, post_args, headers)
    print tools.get_headers(url, post_args, headers)
Please note: scrape_tools is a module I wrote myself. The code contained in the module to retrieve headers is (basically) as follows:
class tool_box:
    def get_headers(self, url, post, headers):
        file_pointer = urllib.urlopen(url, post, headers)
        return file_pointer.info()
Is there a reason for the discrepancy? Am I making a silly mistake in my code? How can I retrieve the missing header data? I'm fairly new to Python, so please forgive any dumb errors.
Thanks in advance. Any advice is much appreciated!
Also...Sorry about the wall of code =\
You're not getting the same kind of response for the two requests. For example, the response to the Firefox request contains a Location: header, so it's probably a 302 Moved Temporarily or a 301. Those don't contain any actual body data, but instead cause Firefox to issue a second request to the URL in the Location: header (urllib doesn't do that).
The Firefox response also uses Connection: keep-alive, while the urllib request got answered with Connection: close.
Also, the Firefox response is gzipped (Content-Encoding: gzip), while the urllib one is not. That's probably because your Firefox sends an Accept-Encoding: gzip, deflate header with its request.
Don't rely on Firebug to tell you HTTP headers (even though it does so truthfully most of the time); use a sniffer like Wireshark to inspect what's actually going over the wire.
You're obviously dealing with two different responses.
There could be several reasons for this. For one, web servers are supposed to respond differently depending on which Accept-Language, Accept-Encoding, etc. headers the client sends in its request. Then there's also the possibility that the server does some kind of User-Agent sniffing.
Either way, capture your urllib requests as well as the Firefox ones with Wireshark, and first compare the requests themselves (not the headers, but the actual GET / HTTP/1.0 part). Are they really the same? If yes, move on to comparing request headers, and start manually setting the same headers for the urllib request until you figure out which headers make a difference.
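As a sketch of that last step (my own illustration, not the asker's scrape_tools code): urllib2.Request accepts the headers dict directly, and a custom redirect handler keeps the 302 visible so its Location header can be read:
import urllib
import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    # Hand back the 302 response itself instead of following it
    def http_error_302(self, req, fp, code, msg, headers):
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(NoRedirectHandler())
request = urllib2.Request(url, post_args, headers)  # same values as above
response = opener.open(request)
print response.code                          # 302 if the server redirected
print response.info().getheader('Location')  # the header missing earlier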
The following URL (and others like it) can be opened in a browser but causes urllib2.urlopen to throw a 404 exception: http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=1
geturl() returns the same URL (no redirect). The headers are copied and pasted from Firebug. I tried passing the headers as a dictionary to Request, but got the same result. wget opens the URL in the console but not from the script.
The code:
import socket
import urllib2

source_url = 'http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2'
try:
    socket.setdefaulttimeout(10)
    hdrs = [('Host', 'store.ovi.com'),
            ('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;(+http://code.google.com/appengine)'),
            ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
            ('Accept-Language', 'en-us,en;q=0.5'),
            ('Accept-Encoding', 'gzip,deflate'),
            ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
            ('Keep-Alive', '115'),
            ('Connection', 'keep-alive'),
            ('Cookie', 'JNPRSESSID=4u4devdrt7eb6e0qem3gin47i2; s_cc=true; undefined_s=First%20Visit; s_nr=1282817443274; s_sq=%5B%5BB%5D%5D; view=Grid; menu=menuOpen; OVI_DEVICE=b5130'),
            ('Cache-Control', 'max-age=0')]
    ree = urllib2.Request(source_url)
    opener = urllib2.build_opener()
    opener.addheaders = hdrs  # extra headers belong on the opener, not the Request
    htmlSource = opener.open(ree).read()
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.headers
The error output:
404
Not Found
Date: Sat, 28 Aug 2010 00:36:57 GMT
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Keep-Alive: timeout=7, max=333
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
What, if anything, am I doing incorrectly? Is this a bug? And if so, is there a workaround? Thanks!
Given a URL like:
http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2
The bit that browsers fetch is just:
http://store.ovi.com/
Everything to the right of that is a 'fragment identifier', which is not passed to the server at all (evidently, if you try, it will get confused). Instead, the HTML returned for the / URL will include a load of JavaScript that reads the #... data at the client side and fills in the page content using a bunch of XMLHttpRequests.
Webapps implemented like this are a big old pain to scrape, because you can't just take the HTML content of the main page. Instead you have to either analyse the script to find out where it gets the actual data from, or you have to hook up a real browser in order to execute all the scripts and see what document objects you're left with. They're also typically bad for accessibility and SEO.
Luckily for you this site appears to be putting something in the fragment that's also a valid path. So it looks like you can get the dynamic page data from the URL:
http://store.ovi.com/applications?categoryId=20&fragment=1&page=1
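A quick way to test that theory (the rewritten URL from above, nothing else changed):
import urllib2

# The fragment data now travels in the path/query, so the server can see it
url = 'http://store.ovi.com/applications?categoryId=20&fragment=1&page=1'
html = urllib2.urlopen(url).read()
print len(html)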