Python requests downloading page html content instead of .xlsx file - python

I am attempting to download a .xlsx file from https://matmatch.com/ using Python Requests, but my script downloads the HTML content of the page rather than the .xlsx document I want. I used Live HTTP Headers to inspect the requests the page makes and found the GET request it sends to download the file:
http://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=cd21d8a7-ba5e-037e-f830-08600989cb0a
Using this URL in Python with requests.get still results in downloading the raw HTML instead of the file. Here is the code I am currently using:
import requests
url = "https://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=360f528b-53be-72c1-25dd-dcf9401633a3"
r = requests.get(url, allow_redirects=True)
open('copper.csv', 'wb').write(r.content)
Are there any glaring issues in my code, or am I just not understanding how HTML works?
Cheers
EDIT: Here is the page I am trying to download from using the "download as .xlsx" button
https://matmatch.com/materials/mitf1194-astm-b196-grade-c17200-tb00
EDIT: Here is the Live HTTP Headers summary of the GET request. Do I need to include some of these headers in my request?
GET /api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=41db0a0e-5091-0b06-8195-f0e0639286d6 HTTP/1.1
Host: matmatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
HTTP/1.1 200
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Connection: keep-alive
Content-Disposition: attachment; filename="mitf1194-astm-b196-grade-c17200-tb00.xlsx"
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Date: Mon, 22 Mar 2021 00:37:42 GMT
Expires: 0
Filename: mitf1194-astm-b196-grade-c17200-tb00.xlsx
Pragma: no-cache
Server: nginx
Strict-Transport-Security: max-age=15724800; includeSubDomains;
transfer-encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
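One detail worth noting from the URL itself: the download-token is generated per download and expires (the expires-in=30min parameter), so a token copied from the browser may already be invalid by the time the script runs, in which case the server serves an HTML page instead of the spreadsheet. A minimal sketch of a more defensive attempt, assuming a freshly copied token, mirroring a couple of the browser headers from the capture above and checking the Content-Type before writing anything to disk:

import requests

# Hypothetical fresh token copied from the browser's network tab just before running
url = ("https://matmatch.com/api/downloads/materials/MITF1194/xlsx"
       "?expires-in=30min&download-token=<fresh-token>")

headers = {
    # Mirror the browser User-Agent from the Live HTTP Headers capture
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
    "Accept": "*/*",
}

r = requests.get(url, headers=headers, allow_redirects=True)

# The real file comes back with the spreadsheet MIME type seen in the response above;
# anything else (e.g. text/html) means the token was rejected or has expired.
content_type = r.headers.get("Content-Type", "")
if "spreadsheetml" in content_type:
    with open("copper.xlsx", "wb") as f:
        f.write(r.content)
else:
    print("Got", content_type, "instead of an .xlsx file; the token has probably expired")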

Related

Need help scraping tracking number details from UPS website with Python without API

I want to build a simple Python program that gets updates for UPS tracking numbers. I couldn't get an account number with them, so I can't use their API, and I decided to try web scraping.
Here's an example of a tracking number:
https://www.ups.com/track?loc=en_US&tracknum=1Z0X118AYW08592000&requester=WT/trackdetails
I want to get the scheduled delivery date. The problem is that what the requests module scrapes (which matches what shows when I view the page source) doesn't include the information inside a tag called app-root, and that's where the delivery date is.
I found a similar post that solves this problem for FedEx, but I can't get it to work with the UPS website: Parsing HTML does not output desired data (tracking info for FedEx)
I installed an extension called HTTP Trace that shows all the requests going through my browser, but I can't find one that carries the UPS tracking data. This is what I got from the extension when I searched for the tracking number; any ideas what I can do here?
https://wwwapps.ups.com/WebTracking/track?loc=en_IL
HTMLVersion: 5.0
loc: en_IL
track.x: Track
trackNums: 1Z0X118AYW08592000
ups-search: 1Z0X118AYW08592000
POST https://wwwapps.ups.com/WebTracking/track?loc=en_IL
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
HTTP/1.1 302 Moved Temporarily
Redirect to: https://www.ups.com/track?loc=en_IL&tracknum=1Z0X118AYW08592000&requester=WT
Server: Apache
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains
Cache-Control: no-store, no-cache
Pragma: no-cache
Location: https://www.ups.com/track?loc=en_IL&tracknum=1Z0X118AYW08592000&requester=WT
Content-Length: 365
Content-Type: text/html
Date: Thu, 14 Jan 2021 00:32:34 GMT
Connection: keep-alive
Server-Timing: cdn-cache; desc=MISS
Server-Timing: edge; dur=164
Server-Timing: origin; dur=23
Debug-AK-TLS: No bypass
GET https://www.ups.com/track?loc=en_IL&tracknum=1Z0X118AYW08592000&requester=WT
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
HTTP/1.1 200 OK
Server: Apache
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains
Cache-Control: no-store, no-cache
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Debug-AK-TLS: No bypass
X-Akamai-Transformed: 9 9152 0 pmb=mTOE,1mRUM,1
Date: Thu, 14 Jan 2021 00:32:34 GMT
Content-Length: 10947
Connection: keep-alive
Vary: Accept-Encoding
Server-Timing: cdn-cache; desc=MISS
Server-Timing: edge; dur=182
Server-Timing: origin; dur=201
https://www.facebook.com/tr/?id=969628123173894&ev=PageView&dl=https%3A%2F%2Fwww.ups.com%2Ftrack%3Floc%3Den_IL%26tracknum%3D1Z0X118AYW08592000%26requester%3DWT%2Ftrackdetails&rl=https%3A%2F%2Fwww.ups.com%2F&if=false&ts=1610584355509&sw=1920&sh=1080&v=2.9.32&r=stable&a=tmtealium&ec=0&o=30&fbp=fb.1.1598067407332.38393503&it=1610584355413&coo=false&dpo=LDU&dpoco=0&dpost=0&rqm=GET
GET https://www.facebook.com/tr/?id=969628123173894&ev=PageView&dl=https%3A%2F%2Fwww.ups.com%2Ftrack%3Floc%3Den_IL%26tracknum%3D1Z0X118AYW08592000%26requester%3DWT%2Ftrackdetails&rl=https%3A%2F%2Fwww.ups.com%2F&if=false&ts=1610584355509&sw=1920&sh=1080&v=2.9.32&r=stable&a=tmtealium&ec=0&o=30&fbp=fb.1.1598067407332.38393503&it=1610584355413&coo=false&dpo=LDU&dpoco=0&dpost=0&rqm=GET
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Accept: image/avif,image/webp,image/apng,image/*,*/*;q=0.8
HTTP/1.1 302
Redirect to: https://cx.atdmt.com/?c=1479770850078954307&f=AYzL_IHfyiIJ9HIa7oqq8XcmRPtLo6M0aKkForULuTS_d5qgkpmUtO1x4Rmi3jkdZ4EPRHG7qxKZDTiWb-BA5MYf&id=969628123173894&l=3&v=0
cache-control: no-cache, no-store, must-revalidate
pragma: no-cache
expires: 0
date: Thu, 14 Jan 2021 00:32:36 GMT
location: https://cx.atdmt.com/?c=1479770850078954307&f=AYzL_IHfyiIJ9HIa7oqq8XcmRPtLo6M0aKkForULuTS_d5qgkpmUtO1x4Rmi3jkdZ4EPRHG7qxKZDTiWb-BA5MYf&id=969628123173894&l=3&v=0
content-type: text/plain
content-length: 0
server: proxygen-bolt
alt-svc: h3-29=":443"; ma=3600,h3-27=":443"; ma=3600
GET https://cx.atdmt.com/?c=1479770850078954307&f=AYzL_IHfyiIJ9HIa7oqq8XcmRPtLo6M0aKkForULuTS_d5qgkpmUtO1x4Rmi3jkdZ4EPRHG7qxKZDTiWb-BA5MYf&id=969628123173894&l=3&v=0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
Accept: image/avif,image/webp,image/apng,image/*,*/*;q=0.8
HTTP/1.1 200
x-fb-rlafr: 0
content-type: image/gif
date: Wed, 13 Jan 2021 16:32:37 PST
x-content-type-options: nosniff
report-to: {"group":"coep_report","max_age":86400,"endpoints":[{"url":"https:\/\/www.facebook.com\/browser_reporting\/"}]}
cache-control: public, max-age=0
content-encoding: br
x-frame-options: DENY
cross-origin-resource-policy: cross-origin
expires: Wed, 13 Jan 2021 16:32:37 PST
vary: Accept-Encoding
cross-origin-embedder-policy-report-only: require-corp;report-to="coep_report"
pragma: public
x-fb-debug: ySiesinmQSMWtIWGg5+rMp+g66R70GGiqJJC3M0DowZMGuFf14OidRiX02DfG99gXxjUSjCaEtHosxh/9tl/hQ==
I honestly do not believe this is possible. I checked how UPS loads its site: it loads the frontend first with a plain GET request for the page, then calls an API to grab the dates. For example, the delivered-on date is served by this API endpoint ("https://www.ups.com/track/api/Track/GetStatus?loc=en_US"), which needs a bunch of headers and some Akamai/security cookies (which may prevent you from scraping it).
If you really do not want to use an API, I would suggest something like Selenium, provided you do not need it to be quick and do not have many links to work with.
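A minimal Selenium sketch of that approach, assuming chromedriver is available locally; the selector is a placeholder and would need to be pointed at whichever element inside app-root actually holds the scheduled delivery date:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ("https://www.ups.com/track?loc=en_US"
       "&tracknum=1Z0X118AYW08592000&requester=WT/trackdetails")

driver = webdriver.Chrome()  # assumes a local chromedriver install
try:
    driver.get(url)
    # Wait for the JavaScript app to render content inside <app-root>
    element = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "app-root"))
    )
    # Placeholder: inspect the rendered page and narrow this down to the
    # element that actually contains the scheduled delivery date
    print(element.text)
finally:
    driver.quit()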
Your only choice is either Selenium or the API, but that comes with a hitch. From what I can tell on the UPS website, their API only allows queries at night. They only want "emergencies" hitting their API, which is preposterous, since I would imagine the web requests are hitting the same API.

Open URL using python requests only then proceed to download file

I'm trying to download a file from a website using Python's requests module.
However, the site will only let me download the file if the download link is clicked directly from the download page.
So using requests, I tried hitting the download page's URL first with requests.get() and then downloading the file. Unfortunately this doesn't work: a message asking me to open the download page first simply gets written into file.torrent.
import requests

def download(username, password):
    with requests.Session() as session:
        session.post('https://website.net/forum/login.php', data={'login_username': username, 'login_password': password})
        # Download page URL
        requests.get('https://website.net/forum/viewtopic.php?t=2508126')
        # The download URL itself
        response = requests.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)

download(username='XXXXX', password='YYYYY')
Response when downloading directly from the download page (works):
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Disposition: attachment; filename="[website.net].t2508126.torrent"
Content-Length: 33641
Content-Type: application/x-bittorrent; name="[website.net].t2508126.torrent"
Date: Thu, 14 Feb 2019 07:57:08 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 07:57:09 GMT
Pragma: no-cache
Server: nginx
Set-Cookie: bb_dl=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/forum/; domain=.website.net
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131012-3061288864-1; bb_dl=2508126
Host: website.net
Referer: https://website.net/forum/viewtopic.php?t=2508126
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
Response when opening the download link on its own (doesn't work):
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Type: text/html; charset=windows-1251
Date: Thu, 14 Feb 2019 08:03:29 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 08:03:29 GMT
Pragma: no-cache
Server: nginx
Transfer-Encoding: chunked
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131390-3061288864-1
Host: website.net
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
This works for me:
data={'login_username': username, 'login_password': password, 'login': ''}
and using session.get() instead of requests.get()
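Putting those two changes together with the question's own code, a sketch of the corrected function (the extra 'login' form field comes from the answer above, and every request now goes through the session so the login cookies are actually sent):

import requests

def download(username, password):
    with requests.Session() as session:
        # Post the login form through the session, including the extra 'login' field
        session.post('https://website.net/forum/login.php',
                     data={'login_username': username,
                           'login_password': password,
                           'login': ''})
        # Visit the download page first through the same session,
        # so the site sees the page was opened before the download
        session.get('https://website.net/forum/viewtopic.php?t=2508126')
        # Then request the file itself with the session cookies attached
        response = session.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)

download(username='XXXXX', password='YYYYY')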

Urllib2 not returning entire data

I'm trying to download Icecast JSON status data from a server using Python.
This is my code (after several different attempts):
import urllib2

def checkStream(url):
    request = urllib2.Request(url)
    request.add_header("Connection", "keep-alive")
    request.add_header("Cache-Control", "max-age=0")
    request.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
    request.add_header("Upgrade-Insecure-Requests", "1")
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    request.add_header("Accept-Encoding", "gzip, deflate, sdch")
    response = urllib2.urlopen(request)
    line = response.read()
    print line
    return

checkStream("http://108.168.175.149:10128/status-json.xsl")
The problem is that my response is printed like this
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type
Access-Control-Allow-Methods: GET, OPTIONS, HEAD
{"icestats":{"admin":"icemaster#localhost","banned_IPs":0,"build":20141112090605,"host":"pro02.caster.fm","location":"Earth","outgoing_kbitrate":3799,"server_id":"Icecast 2.3.3-kh11","server_start":"05/Oct/2015:10:43:46 -0500","stream_kbytes_read":104422400,"stream_kbytes_sent":5123403693,"source":[{"audio_codecid":2,"audio_info":"ice-samplerate=44100;ice-bitrate=96;ice-channels=2","bitrate":96,"connected":33748,"genre":"Various","ice-bitrate":96,"ice-channels":2,"ice-samplerate":44100,"incoming_bitrate":95920,"listener_peak":153,"listeners":42,"listenurl":"http://pro02.caster.fm:10128/live","mpeg_channels":2,"mpeg_samplerate":44100,"outgoing_kbitrate":3883,"queue_size":358609,"se
The JSON is cut short by 272 bytes at the end, which is exactly the size of the response headers that are being returned as part of the data.
If I open the link in Chrome, the response looks fine.
I also tested with the requests library, with no luck:
>>> import requests
>>> r = requests.get("http://108.168.175.149:10128/status-json.xsl")
>>> r.text
u'Expires: Thu, 19 Nov 1981 08:52:00 GMT\r\nCache-Control: no-store, no-cache, must-revalidate\r\nPragma: no-cache\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type\r\nAccess-Control-Allow-Methods: GET, OPTIONS, HEAD\r\n\r\n{"icestats":{"admin":"icemaster#localhost","banned_IPs":0,"build":20141112090605,"host":"pro02.caster.fm","location":"Earth","outgoing_kbitrate":3844,"server_id":"Icecast 2.3.3-kh11","server_start":"05/Oct/2015:10:43:46 -0500","stream_kbytes_read":104438630,"stream_kbytes_sent":5124109510,"source":[{"audio_codecid":2,"audio_info":"ice-samplerate=44100;ice-bitrate=96;ice-channels=2","bitrate":96,"connected":35133,"genre":"Various","ice-bitrate":96,"ice-channels":2,"ice-samplerate":44100,"incoming_bitrate":95920,"listener_peak":153,"listeners":43,"listenurl":"http://pro02.caster.fm:10128/live","mpeg_channels":2,"mpeg_samplerate":44100,"outgoing_kbitrate":3837,"queue_size":164258,"se'
>>>
How can I retrieve the complete data?
The server you are requesting this from is running an ancient version of an Icecast fork.
This bug was fixed and the fix was released in mainline long ago. I'd recommend upgrading (or telling the operator to upgrade) the server to the latest official Icecast version from http://icecast.org

Unable to download a file after login using python request

I am trying to download a file using the Python requests module by logging in to the site first. I am able to log in, but when I send a GET request to download the file it shows me the login page again.
Code:
import requests

login_url = 'https://seller.flipkart.com/login'
manifest_url = 'https://seller.flipkart.com/order_management/manifest.pdf'
username = 'username@gmail.com'
password = 'password'

params = {'sellerId': 'seller_id'}
payload = {'authName': 'flipkart',
           'username': username,
           'password': password}

ses = requests.Session()
ses.post(login_url, data=payload, headers={'Content-Type': 'application/x-www-form-urlencoded', 'Connection': 'keep-alive'})
response = ses.get(manifest_url, params=params, headers={'Content-Type': 'application/pdf', 'Connection': 'keep-alive'})

print response.status_code
print response.url
print response.content
On running this code I get the HTML of the login page as the content.
I used Fiddler and captured the data below:
Request URL: https://seller.flipkart.com/order_management/manifest.pdf?sellerId=seller_id
Request Method: GET
sellerId: seller_id
# Request Headers
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Referer: https://seller.flipkart.com/order_management?sellerId=seller_id
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
# Response Headers
Server: nginx
Date: Wed, 30 Dec 2015 13:12:31 GMT
Content-Type: application/pdf
Content-Length: 3652
Connection: keep-alive
X-XSS-Protection: 1; mode=block
strict-transport-security: max-age=31536000; preload
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Cache-Control: private, no-cache, no-store, must-revalidate
Expires: -1
Pragma: no-cache
X-Req-Id: REQ-14d7434a-e429-40e4-801f-6010d7c0b48c
X-Host-Id: 0008
content-disposition: attachment; filename=Manifest-seller_id-30-Dec-2015-18-42-30.pdf
vary: Accept-Encoding
How do I download the file?
Set stream=True and then write the contents to a file.
import re

# Send the request with 'stream=True' (other arguments as in the question)
r = ses.get(manifest_url, params=params, stream=True)

# Pull the filename out of the Content-Disposition header
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

# Write the content to file in chunks
with open(fname, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
Docs.
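stream=True keeps requests from reading the whole response into memory up front; iter_content then writes the PDF to disk in 1 KB chunks, which is why the empty keep-alive chunks are filtered out. One caveat: some servers quote the filename in Content-Disposition (as in the matmatch response further up), so it can be worth stripping surrounding quotes before using it as a path.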

download files with python (REST URL)

I am trying to write a script that will download a bunch of files from a website that uses REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check whether the response was a 302. If it is not, it will "pass"; if it is, it needs to parse out the location header shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the location parameter, I will have to make another GET request to download that file. I will also have to maintain the cookie for my session in order to download the file.
Can someone point me in the right direction as to which library is best to use for this? I am having trouble figuring out how to parse the 302 response and add a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
import urllib.request as ur
import urllib.error as ue

'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response body, or up to
the next amt bytes. This is because there is no way for urlopen() to automatically determine
the encoding of the byte stream it receives from the http server.
'''

url = "http://www.example.org/images/{}.jpg"
dst = ""
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)

try:
    for x in arr:
        print(str(x)+"). ".ljust(4), end="")
        # HTTPResponse iterable object (returns the response header and body, together, as bytes)
        hrio = ur.urlopen(url.format(x))
        fh = open(dst+str(x)+".jpg", "b+w")
        fh.write(hrio.read())
        fh.close()
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    print("\t[REQUEST INCOMPLETE]\t", end="")
    print("<Error ~ [{}]>".format(e))
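That answer covers the plain download loop but not the 302 handling or the session cookie that the question asks about. A sketch of that part with the requests library (URLs and the cookie value are placeholders taken from the question; allow_redirects=False stops requests from following the redirect so the Location header can be read directly):

import requests

session = requests.Session()
# Reuse the session cookie captured in the browser (placeholder value from the question)
session.cookies.set("PHPSESSID", "fa844952890e9091d968c541caa6965f", domain="testServer.com")

url = "http://testServer.com/test/download/id/5774/format/testTitle"

# Don't follow the redirect automatically, so the 302 and its Location header can be inspected
resp = session.get(url, allow_redirects=False)

if resp.status_code == 302:
    file_url = resp.headers["Location"]
    # Second request through the same session (cookies included) to fetch the file itself
    file_resp = session.get(file_url)
    with open("picture.jpg", "wb") as f:
        f.write(file_resp.content)
else:
    pass  # not a 302, so skip this one as described in the question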
