I am using Python 2.7.3 and requests (requests==2.10.0). I am trying to download a zip file from a link. The website's certificate cannot be verified, but I just want to download the zip, so I used verify=False.
link = 'https://webapi.yanoshin.jp/rde.php?https%3A%2F%2Fdisclosure.edinet-fsa.go.jp%2FE01EW%2Fdownload%3Fuji.verb%3DW0EZA104CXP001006BLogic%26uji.bean%3Dee.bean.parent.EECommonSearchBean%26lgKbn%3D2%26no%3DS1007NMV'
r = requests.get(link, timeout=10, verify=False)
print r.content
# 'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!\xf9\x04\x01\x00\x00\x01\x00,\x00\x00\x00\x00\x01\x00\x01\x00#\x02\x02L\x01\x00;'
print r.headers
# {'Content-Length': '43', 'Via': '1.0 localhost (squid/3.1.19)', 'X-Cache': 'MISS from localhost', 'X-Cache-Lookup': 'MISS from localhost:3128', 'Server': 'Apache', 'Connection': 'keep-alive', 'Date': 'Mon, 06 Jun 2016 07:59:52 GMT', 'Content-Type': 'image/gif'}
However, when I tried with Firefox and Chromium and chose to trust that certificate, I was able to download the zip file. wget --no-check-certificate [that link] also produces a zip file of the correct size.
(I wrote that GIF to disk and checked: it has no real content and is far too small.)
Maybe it is a header issue? I do not know. I can use wget of course. Just want to figure out the reason behind this and make this work.
(Browser will download some zip, 23.4KB)(wget [link] -O test.zip will download the zip file as well)
The server is trying to block scripts from downloading ZIP files; you'll see the same issue when using curl:
$ curl -sD - -o /dev/null "https://webapi.yanoshin.jp/rde.php?https%3A%2F%2Fdisclosure.edinet-fsa.go.jp%2FE01EW%2Fdownload%3Fuji.verb%3DW0EZA104CXP001006BLogic%26uji.bean%3Dee.bean.parent.EECommonSearchBean%26lgKbn%3D2%26no%3DS1007NUS"
HTTP/1.1 302 Found
Server: nginx
Date: Mon, 06 Jun 2016 08:56:20 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/7.0.7
Location: https://disclosure.edinet-fsa.go.jp/E01EW/download?uji.verb=W0EZA104CXP001006BLogic&uji.bean=ee.bean.parent.EECommonSearchBean&lgKbn=2&no=S1007NUS
Notice the text/html response.
The server seems to be looking for browser-specific Accept and User-Agent headers; copying the Accept header Chrome sends, plus adding a minimal User-Agent string, seems to be enough to fool the server:
>>> r = requests.get(link, timeout=10, headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}, verify=False)
# ... two warnings about ignoring the certificate ...
>>> r.headers
{'Content-Length': '14078', 'Content-Disposition': 'inline;filename="Xbrl_Search_20160606_175759.zip"', 'Set-Cookie': 'FJNADDSPID=3XWzlS; expires=Mon, 05-Sep-2016 08:57:59 GMT; path=/; secure, JSESSIONID=6HIMAP1I60PJ2P9HC5H3AC1N68PJAOR568RJIEB5CGS3I0UITOI5A08000P00000.E01EW_001; Path=/E01EW; secure', 'Connection': 'close', 'X-UA-Compatible': 'IE=EmulateIE9', 'Date': 'Mon, 06 Jun 2016 08:57:59 GMT', 'Content-Type': 'application/octet-stream'}
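Since the server answers with a tiny GIF instead of an error status, it may help to check the payload before writing it to disk. The following is only a minimal sketch in Python 3; looks_like_zip and describe_payload are hypothetical helper names, and the check assumes the usual ZIP signatures (PK\x03\x04, or PK\x05\x06 for an empty archive).

```python
# Minimal sketch (Python 3). looks_like_zip and describe_payload are
# hypothetical helpers, not part of requests.

# Signatures that normally start a ZIP file: a regular archive and an
# empty archive (end-of-central-directory record only).
ZIP_MAGIC = (b"PK\x03\x04", b"PK\x05\x06")


def looks_like_zip(body):
    """Return True if the raw response body starts with a ZIP signature."""
    return body.startswith(ZIP_MAGIC)


def describe_payload(headers, body):
    """Summarise what the server actually sent, based on headers and body."""
    content_type = headers.get("Content-Type", "unknown")
    if looks_like_zip(body):
        return "zip archive, %d bytes (%s)" % (len(body), content_type)
    return "not a zip: %s, %d bytes" % (content_type, len(body))


# First bytes of the placeholder GIF shown in the question.
gif_start = b"GIF89a\x01\x00\x01\x00\x80"
print(describe_payload({"Content-Type": "image/gif"}, gif_start))
```

With requests, the same check could be applied to r.headers and r.content before saving the file.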
I am using the Python requests library to get all the headers from a website. However, requests only seems to give me the response headers, and I also need the request headers.
Is there a way to get the request headers with the requests library, or should I use a different library to get them?
my code:
import requests
r = requests.get("https://google.com", allow_redirects = False)
for key in r.headers:
    print(key, ": ", r.headers[key])
output:
Location : https://www.google.com/
Content-Type : text/html; charset=UTF-8
Date : Wed, 19 Feb 2020 13:08:27 GMT
Expires : Fri, 20 Mar 2020 13:08:27 GMT
Cache-Control : public, max-age=2592000
Server : gws
Content-Length : 220
X-XSS-Protection : 0
X-Frame-Options : SAMEORIGIN
Alt-Svc : quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000
The response object contains a request object, which is the request that produced the response.
This requests.models.PreparedRequest object is accessible through the request property of the response object, and its headers are in the headers property of the request object.
See this example:
>>> import requests
>>> r = requests.get("http://google.com")
>>> r.request.headers
{'Connection': 'keep-alive', 'User-Agent': 'python-requests/2.22.0', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate'}
I am attempting to download a .xlsx file from https://matmatch.com/ using Python Requests, but my script downloads the HTML content of the page rather than the xlsx document I want. I used LivehttpHeaders to look at the requests the page makes and found the GET request the page sends to download the file:
http://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=cd21d8a7-ba5e-037e-f830-08600989cb0a
Using this URL in Python with requests.get still downloads the raw HTML instead of the file. Here is the code I am currently using:
import requests
url = "https://matmatch.com/api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=360f528b-53be-72c1-25dd-dcf9401633a3"
r = requests.get(url, allow_redirects=True)
open('copper.csv', 'wb').write(r.content)
Are there any glaring issues in my code or am I just not understanding how html works?
Cheers
EDIT: Here is the page I am trying to download from using the "download as .xlsx" button
https://matmatch.com/materials/mitf1194-astm-b196-grade-c17200-tb00
EDIT: Here is the LivehttpHeaders summary of the GET request. Do I need to include some of these headers in my request?
GET /api/downloads/materials/MITF1194/xlsx?expires-in=30min&download-token=41db0a0e-5091-0b06-8195-f0e0639286d6 HTTP/1.1
Host: matmatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
HTTP/1.1 200
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Connection: keep-alive
Content-Disposition: attachment; filename="mitf1194-astm-b196-grade-c17200-tb00.xlsx"
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Date: Mon, 22 Mar 2021 00:37:42 GMT
Expires: 0
Filename: mitf1194-astm-b196-grade-c17200-tb00.xlsx
Pragma: no-cache
Server: nginx
Strict-Transport-Security: max-age=15724800; includeSubDomains;
transfer-encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I have now stumbled upon a website where fetching RSS feeds fails (example feed). The returned result does not contain the expected keys, and the values are HTML code.
I tried simply reading the feed URL with urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed error. If I add a custom header like
headers = {
    'Content-type': 'text/xml',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}
# pass the custom headers along with the request
request = urllib2.Request(url, headers=headers)
I don't get the 405 error anymore, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same is true when I look at "View Page Source". feedparser.parse also allows setting agent and request_headers, and I tried various agents. I'm still not able to read the XML correctly, let alone get the parsed feed from feedparser.
What am I missing here?
So, this feed yields a 405 error when the client making the request does not use a User-Agent. Try this:
$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
While without the UA, you get:
$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
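The same request can be sketched in Python. This is a minimal example, assuming Python 3's urllib.request rather than the urllib2 module used in the question; build_feed_request and fetch_feed are hypothetical names, and the User-Agent value is arbitrary.

```python
import urllib.request


def build_feed_request(url, user_agent="Mozilla/5.0"):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_feed(url):
    """Fetch the feed and return its Content-Type and raw body."""
    request = build_feed_request(url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type"), response.read()


# Building the request does not touch the network, so the header can be
# inspected before anything is sent. urllib stores header names capitalized.
request = build_feed_request("http://www.propertyguru.com.sg/rss")
print(request.header_items())
```

The body returned by fetch_feed could then be handed to feedparser.parse, provided the server accepts the chosen User-Agent.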
Basically, I have a list of URLs of some video files, and I want to find the size of those videos without downloading them, using urllib.
[u'https://fbcdn-video-a.akamaihd.net/hvideo-ak-frc3/v/985732_10102527799850656_17701053_n.mp4?oh=a4d452753fd4cc90aeca55b3e1b23d4f&oe=5222F54B&__gda__=1378022845_fc6b392b6b1238ab60bde944da7a1cfe', u'https://fbcdn-video-a.akamaihd.net/hvideo-ak-ash4/v/1039184_10102527799376606_136270614_n.mp4?oh=d3198aa784f5da432d56236135fffa4b&oe=5222F6C7&__gda__=1378023085_1c5de4e6d733269f70643fc3a25c09e5']
Can it be done using the info() method of urllib?
Is there any way I can get their size?
Thanks in advance
Although @sberry's answer is perfectly valid, I'm just translating it to Python, since that is the tag on your question.
>>> import requests
>>> r = requests.head(url)
>>> print r.headers
{'accept-ranges': 'bytes',
'cache-control': 'max-age=467354',
'connection': 'keep-alive',
'content-length': '37475248',
'content-type': 'video/mp4',
'date': 'Sun, 01 Sep 2013 07:26:21 GMT',
'expires': 'Fri, 06 Sep 2013 17:15:35 GMT',
'last-modified': 'Fri, 09 Aug 2013 18:51:33 GMT'}
>>> video_size = r.headers.get('content-length')
If you would rather not use requests, you can use httplib2 or urllib2 (the latter is in the standard library, although using it this way is a bit hacky).
import httplib2
r = httplib2.Http()
response, _ = r.request(url, 'HEAD')
video_size = response.get('content-length')
# or with urllib2
import urllib2
r = urllib2.Request(url)
# here, we override the Request.get_method() instance method
# so that it returns 'HEAD' instead of 'GET'
r.get_method = lambda: 'HEAD'
response = urllib2.urlopen(r)
# the response headers are available through response.info()
video_size = response.info().getheader('Content-Length')
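For reference, here is a rough equivalent of the HEAD approach in Python 3, where urllib.request.Request accepts a method argument directly. It is only a sketch: build_head_request, parse_content_length and remote_size are hypothetical names, and it assumes the server reports a Content-Length for HEAD requests.

```python
import urllib.request


def build_head_request(url):
    """Create a request that asks only for the headers, not the body."""
    return urllib.request.Request(url, method="HEAD")


def parse_content_length(headers):
    """Return Content-Length as an int, or None if missing or malformed."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def remote_size(url):
    """Return the size in bytes reported by the server, if any."""
    with urllib.request.urlopen(build_head_request(url), timeout=10) as response:
        return parse_content_length(response.headers)
```

Calling remote_size(url) for each URL in the list would give the sizes without downloading the videos; a None result means the server did not report one.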
You can issue a HEAD request instead of a GET.
For example,
curl -I https://some.url.com
HTTP/1.1 200 OK
Content-Type: image/jpeg
Last-Modified: Sun, 01 Sep 2013 05:04:13 GMT
Content-Length: 83909
Date: Sun, 01 Sep 2013 06:17:34 GMT
Connection: keep-alive
Cache-Control: max-age=1209600
I've scraped many websites and have often wondered why the response headers displayed in Firebug and the response headers returned by urllib.urlopen(url).info() are often different in that Firebug reports MORE headers.
I encountered an interesting one today. I'm scraping a website by following a "search url" that fully loads (returns a 200 status code) before redirecting to a final page. The easiest way to perform the scrape would be to read the Location response header and make another request. However, that particular header is absent when I run urllib.urlopen(url).info().
Here is the difference:
Firebug headers:
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection : keep-alive
Content-Encoding : gzip
Content-Length : 2433
Content-Type : text/html
Date : Fri, 05 Oct 2012 15:59:31 GMT
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Location : /catalog/display/1292/index.html
Pragma : no-cache
Server : Apache/2.0.55
Set-Cookie : PHPSESSID=9b99dd9a4afb0ef0ca267b853265b540; path=/
Vary : Accept-Encoding,User-Agent
X-Powered-By : PHP/4.4.0
Headers returned by my code:
Date: Fri, 05 Oct 2012 17:16:23 GMT
Server: Apache/2.0.55
X-Powered-By: PHP/4.4.0
Set-Cookie: PHPSESSID=39ccc547fc407daab21d3c83451d9a04; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Type: text/html
Connection: close
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras
import scrape_tools
tools = scrape_tools.tool_box()
db = tools.db_connect()
cursor = db.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
cursor.execute("SELECT data FROM table WHERE variable = 'Constant' ORDER BY data")
for row in cursor:
    url = 'http://www.website.com/search/' + row['data']
    headers = {
        'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding' : 'gzip, deflate',
        'Accept-Language' : 'en-us,en;q=0.5',
        'Connection' : 'keep-alive',
        'Host' : 'www.website.com',
        'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'
    }
    post_params = {
        'query' : row['data'],
        'searchtype' : 'products'
    }
    post_args = urllib.urlencode(post_params)
    soup = tools.request(url, post_args, headers)
    print tools.get_headers(url, post_args, headers)
Please note: scrape_tools is a module I wrote myself. The code contained in the module to retrieve headers is (basically) as follows:
class tool_box:
    def get_headers(self, url, post, headers):
        file_pointer = urllib.urlopen(url, post, headers)
        return file_pointer.info()
Is there a reason for the discrepancy? Am I making a silly mistake in my code? How can I retrieve the missing header data? I'm fairly new to Python, so please forgive any dumb errors.
Thanks in advance. Any advice is much appreciated!
Also...Sorry about the wall of code =\
You're not getting the same kind of response for the two requests. For example, the response to the Firefox request contains a Location: header, so it's probably a 302 Moved Temporarily or a 301. Those usually carry no meaningful body data, but instead cause your Firefox to issue a second request to the URL in the Location: header. Note that urllib.urlopen normally follows such redirects on its own, in which case info() returns the headers of the final response, and that response has no Location: header.
The Firefox response also uses Connection : keep-alive while the urllib request got answered with Connection: close.
Also, the Firefox response is gzipped (Content-Encoding : gzip), while the urllib one is not. That's probably because your Firefox sends an Accept-Encoding: gzip, deflate header with its request.
Don't rely on Firebug to tell you HTTP headers (even though it does so truthfully most of the time), but use a sniffer like Wireshark to inspect what's actually going over the wire.
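If the Location: header itself is what you need, one option is to stop the client from following the redirect and read the header from the 3xx response. The following is a minimal sketch assuming Python 3's urllib.request (the question uses Python 2's urllib); NoRedirect and fetch_location are hypothetical names.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request;
        # the 3xx response is then raised as an HTTPError.
        return None


def fetch_location(url, data=None):
    """Return (status code, Location header) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        response = opener.open(url, data=data, timeout=10)
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.code, err.headers.get("Location")
        raise
    # No redirect happened, so there is normally no Location header.
    return response.status, response.headers.get("Location")
```

Here data would be the URL-encoded POST body as bytes, for example urllib.parse.urlencode(post_params).encode().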
You're obviously dealing with two different responses.
There could be several reasons for this. For one, web servers are supposed to respond differently depending on which Accept-Language, Accept-Encoding etc. headers the client sends in its request. Then there's also the possibility that the server does some kind of User-Agent sniffing.
Either way, capture your requests with urllib as well as the ones with Firefox using Wireshark and first compare the requests (not the headers, but the actual GET / HTTP/1.0 part). Are they really the same? If yes, move on to comparing request headers and start manually setting the same headers for the urllib request until you figure out which headers make a difference.
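Once both sets of headers have been captured, the comparison itself can be automated. A minimal sketch in Python 3, with hypothetical helper names normalize and diff_headers; header names are compared case-insensitively, and the sample values are a subset of the headers listed in the question.

```python
def normalize(headers):
    """Lower-case header names and strip whitespace so lookups match."""
    return {name.strip().lower(): value.strip() for name, value in headers.items()}


def diff_headers(browser, script):
    """Report which headers are missing on either side or differ in value."""
    a, b = normalize(browser), normalize(script)
    return {
        "only_in_browser": sorted(set(a) - set(b)),
        "only_in_script": sorted(set(b) - set(a)),
        "different": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


# Subset of the headers shown in the question.
firebug = {
    "Connection": "keep-alive",
    "Content-Encoding": "gzip",
    "Content-Length": "2433",
    "Content-Type": "text/html",
    "Location": "/catalog/display/1292/index.html",
    "Server": "Apache/2.0.55",
}
from_urllib = {
    "Connection": "close",
    "Content-Type": "text/html",
    "Server": "Apache/2.0.55",
}

print(diff_headers(firebug, from_urllib))
```

The same function works for request headers: feed it the headers Firefox sends and the ones your script sends, then add the missing ones to the script one at a time.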