I've scraped many websites and have often wondered why the response headers displayed in Firebug and the response headers returned by urllib.urlopen(url).info() are often different in that Firebug reports MORE headers.
I encountered an interesting one today. I'm scraping a website by following a "search url" that fully loads (returns a 200 status code) before redirecting to a final page. The easiest way to perform the scrape would be to read the Location response header and make another request. However, that particular header is absent when I run urllib.urlopen(url).info().
Here is the difference:
Firebug headers:
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection : keep-alive
Content-Encoding : gzip
Content-Length : 2433
Content-Type : text/html
Date : Fri, 05 Oct 2012 15:59:31 GMT
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Location : /catalog/display/1292/index.html
Pragma : no-cache
Server : Apache/2.0.55
Set-Cookie : PHPSESSID=9b99dd9a4afb0ef0ca267b853265b540; path=/
Vary : Accept-Encoding,User-Agent
X-Powered-By : PHP/4.4.0
Headers returned by my code:
Date: Fri, 05 Oct 2012 17:16:23 GMT
Server: Apache/2.0.55
X-Powered-By: PHP/4.4.0
Set-Cookie: PHPSESSID=39ccc547fc407daab21d3c83451d9a04; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Type: text/html
Connection: close
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras
import scrape_tools
tools = scrape_tools.tool_box()
db = tools.db_connect()
cursor = db.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
cursor.execute("SELECT data FROM table WHERE variable = 'Constant' ORDER BY data")
for row in cursor:
    url = 'http://www.website.com/search/' + row['data']
    headers = {
        'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding' : 'gzip, deflate',
        'Accept-Language' : 'en-us,en;q=0.5',
        'Connection' : 'keep-alive',
        'Host' : 'www.website.com',
        'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'
    }
    post_params = {
        'query' : row['data'],
        'searchtype' : 'products'
    }
    post_args = urllib.urlencode(post_params)
    soup = tools.request(url, post_args, headers)
    print tools.get_headers(url, post_args, headers)
Please note: scrape_tools is a module I wrote myself. The code contained in the module to retrieve headers is (basically) as follows:
class tool_box:
    def get_headers(self, url, post, headers):
        file_pointer = urllib.urlopen(url, post, headers)
        return file_pointer.info()
Is there a reason for the discrepancy? Am I making a silly mistake in my code? How can I retrieve the missing header data? I'm fairly new to Python, so please forgive any dumb errors.
Thanks in advance. Any advice is much appreciated!
Also...Sorry about the wall of code =\
You're not getting the same kind of response for the two requests. For example, the response to the Firefox request contains a Location: header, so it's probably a 302 Moved temporarily or a 301. Those don't contain any actual body data, but instead cause your Firefox to issue a second request to the URL in the Location: header (urllib doesn't do that).
The Firefox response also uses Connection : keep-alive while the urllib request got answered with Connection: close.
Also, the Firefox response is gzipped (Content-Encoding : gzip), while the urllib one is not. That's probably because your Firefox sends an Accept-Encoding: gzip, deflate header with its request.
Don't rely on Firebug to tell you HTTP headers (even though it does so truthfully most of the time), but use a sniffer like Wireshark to inspect what's actually going over the wire.
You're obviously dealing with two different responses.
There could be several reasons for this. For one, web servers are supposed to respond differently depending on which Accept-Language, Accept-Encoding and similar headers the client sends in its request. Then there's also the possibility that the server does some kind of User-Agent sniffing.
Either way, capture your urllib requests as well as the Firefox ones using Wireshark and first compare the request lines (not the headers, but the actual GET / HTTP/1.0 part). Are they really the same? If yes, move on to comparing the request headers, and start manually setting the same headers on the urllib request until you figure out which headers make a difference.
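As a rough way to run that comparison from Python, the sketch below sends a single request with http.client, which does not follow redirects, so a 301/302 status and its Location header stay visible. It is written for Python 3, and the function name fetch_raw_response is hypothetical (it is not part of the question's scrape_tools module).

```python
import http.client
import urllib.parse


def fetch_raw_response(url, body=None, headers=None):
    """Send one request and return (status, reason, header list).

    Redirects are NOT followed, so a 3xx response is returned as-is.
    """
    parts = urllib.parse.urlsplit(url)
    conn_class = (
        http.client.HTTPSConnection
        if parts.scheme == "https"
        else http.client.HTTPConnection
    )
    conn = conn_class(parts.netloc, timeout=10)

    # http.client wants only the path and query, not the full URL.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    try:
        method = "POST" if body is not None else "GET"
        conn.request(method, path, body=body, headers=headers or {})
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        return response.status, response.reason, response.getheaders()
    finally:
        conn.close()
```

Calling it repeatedly with one request header added at a time, and printing the returned status and headers after each call, should show which header changes the server's answer.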
Related
I am using the Python requests library to get all the headers from a website. However, requests only seems to return the response headers, and I also need the request headers.
Is there a way to get the request headers within the requests library, or should I use a different library to get them?
my code:
import requests
r = requests.get("https://google.com", allow_redirects = False)
for key in r.headers:
    print(key, ": ", r.headers[key])
output:
Location : https://www.google.com/
Content-Type : text/html; charset=UTF-8
Date : Wed, 19 Feb 2020 13:08:27 GMT
Expires : Fri, 20 Mar 2020 13:08:27 GMT
Cache-Control : public, max-age=2592000
Server : gws
Content-Length : 220
X-XSS-Protection : 0
X-Frame-Options : SAMEORIGIN
Alt-Svc : quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000
The response object contains a request object, which is the request that produced the response.
This requests.models.PreparedRequest object is accessible through the request property of the response object, and its headers are in the headers property of the request object.
See this example:
>>> import requests
>>> r = requests.get("http://google.com")
>>> r.request.headers
{'Connection': 'keep-alive', 'User-Agent': 'python-requests/2.22.0', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate'}
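If the goal is to see the request headers before anything is sent, one possible approach is to let a Session prepare the request and read the headers from the resulting PreparedRequest. This is a minimal sketch; the function name preview_request_headers and the X-Demo header mentioned below are made up for illustration.

```python
import requests


def preview_request_headers(url, extra_headers=None):
    """Return the headers requests would send for a GET, without sending it."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=extra_headers or {})
    # prepare_request merges the session defaults (User-Agent, Accept, ...)
    # with the headers given on the Request itself.
    prepared = session.prepare_request(request)
    return dict(prepared.headers)
```

For example, preview_request_headers("https://google.com", {"X-Demo": "1"}) returns the default headers plus X-Demo. When a request was redirected, each response in r.history also has its own request attribute, so the headers of every hop can be inspected the same way.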
I am using an API that receives a PDF file and does some analysis, but I always receive a 500 response.
I initially tested it using Postman and the request goes through: I receive a 200 response with the corresponding JSON information. SSL verification should be turned off.
However, when I try to make the request via Python, I always get a 500 response.
Python code written by me:
import requests
url = "https://{{BASE_URL}}/api/v1/documents"
fin = open('/home/train/aab2wieuqcnvn3g6syadumik4bsg5.0062.pdf', 'rb')
files = {'file': fin}
r = requests.post(url, files=files, verify=False)
print (r)
#r.text is empty
Python code, produced by the Postman:
import requests
url = "https://{{BASE_URL}}/api/v1/documents"
payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"file\"; filename=\"aab2wieuqcnvn3g6syadumik4bsg5.0062.pdf\"\r\nContent-Type: application/pdf\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"
headers = {
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'Content-Type': "application/x-www-form-urlencoded",
    'cache-control': "no-cache",
    'Postman-Token': "65f888e2-c1e6-4108-ad76-f698aaf2b542"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
I have masked the API link as {{BASE_URL}} for confidentiality.
Response by Postman:
{
    "id": "5e69058e2690d5b0e519cf4006dfdbfeeb5261b935094a2173b2e79a58e80ab5",
    "name": "aab2wieuqcnvn3g6syadumik4bsg5.0062.pdf",
    "fileIds": {
        "original": "5e69058e2690d5b0e519cf4006dfdbfeeb5261b935094a2173b2e79a58e80ab5.pdf"
    },
    "creationDate": "2019-06-20T09:41:59.5930472+00:00"
}
Response by Python:
Response<500>
UPDATE:
I tried a GET request and it works fine: I receive the JSON response from it. I guess the problem is in posting the PDF file. Are there any other options for posting a file to an API?
Postman Response RAW:
POST /api/v1/documents
Content-Type: multipart/form-data; boundary=--------------------------375732980407830821611925
cache-control: no-cache
Postman-Token: 3e63d5a1-12cf-4f6b-8f16-3d41534549b9
User-Agent: PostmanRuntime/7.6.0
Accept: */*
Host: {{BASE_URL}}
cookie: c2b8faabe4d7f930c0f28c73aa7cafa9=736a1712f7a3dab03dd48a80403dd4ea
accept-encoding: gzip, deflate
content-length: 3123756
file=[object Object]
HTTP/1.1 200
status: 200
Date: Thu, 20 Jun 2019 10:59:55 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Location: /api/v1/files/95463e88527ecdc94393fde685ab1d05fa0ee0b924942f445b14b75e983c927e
api-supported-versions: 1.0
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Referrer-Policy: strict-origin
{"id":"95463e88527ecdc94393fde685ab1d05fa0ee0b924942f445b14b75e983c927e","name":"aab2wieuqcnvn3g6syadumik4bsg5.0062.pdf","fileIds":{"original":"95463e88527ecdc94393fde685ab1d05fa0ee0b924942f445b14b75e983c927e.pdf"},"creationDate":"2019-06-20T10:59:55.7038573+00:00"}
CORRECT REQUEST
So, eventually - the correct code is the following:
import requests
files = {
    'file': open('/home/train/aab2wieuqcnvn3g6syadumik4bsg5.0062.pdf', 'rb'),
}
response = requests.post('{{BASE_URL}}/api/v1/documents', files=files, verify=False)
print (response.text)
A 500 error indicates an internal server error, not an error with your script.
If you're receiving a 500 error (as opposed to a 400 error, which indicates a bad request), then theoretically your script is fine and it's the server-side code that needs to be adjusted.
In practice, it could still be due to a bad request though.
If you're the one running the API, then you can check the error logs and debug the code line-by-line to figure out why the server is throwing an error.
In this case though, it sounds like it's a third-party API, correct? If so, I recommend looking through their documentation to find a working example or contacting them if you think it's an issue on their end (which is unlikely but possible).
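Since a 500 can still be triggered by the request itself, it may help to look at what requests would actually send and compare it with the raw Postman request. The sketch below builds the multipart upload without sending it. The field name file comes from the question; the function name build_upload_request and the explicit application/pdf content type are assumptions, not something the API is known to require.

```python
import requests


def build_upload_request(url, filename, fileobj, content_type="application/pdf"):
    """Prepare (but do not send) a multipart/form-data upload of one file."""
    # A (filename, fileobj, content_type) tuple lets requests set the
    # filename and the per-part Content-Type explicitly.
    files = {"file": (filename, fileobj, content_type)}
    return requests.Request("POST", url, files=files).prepare()
```

The returned object exposes headers and body, so the generated Content-Type (including the boundary) and the part headers can be printed and checked against the Postman capture. To send it afterwards, requests.Session().send(prepared, verify=False) can be used.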
I'm trying to download icecast json status data from a server using python.
This is my code (after different attempts).
import urllib2

def checkStream(url):
    request = urllib2.Request(url)
    request.add_header("Connection", "keep-alive")
    request.add_header("Cache-Control", "max-age=0")
    request.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
    request.add_header("Upgrade-Insecure-Requests", "1")
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    request.add_header("Accept-Encoding", "gzip, deflate, sdch")
    response = urllib2.urlopen(request)
    line = response.read()
    print line
    return

checkStream("http://108.168.175.149:10128/status-json.xsl")
The problem is that my response is printed like this
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type
Access-Control-Allow-Methods: GET, OPTIONS, HEAD
{"icestats":{"admin":"icemaster#localhost","banned_IPs":0,"build":20141112090605,"host":"pro02.caster.fm","location":"Earth","outgoing_kbitrate":3799,"server_id":"Icecast 2.3.3-kh11","server_start":"05/Oct/2015:10:43:46 -0500","stream_kbytes_read":104422400,"stream_kbytes_sent":5123403693,"source":[{"audio_codecid":2,"audio_info":"ice-samplerate=44100;ice-bitrate=96;ice-channels=2","bitrate":96,"connected":33748,"genre":"Various","ice-bitrate":96,"ice-channels":2,"ice-samplerate":44100,"incoming_bitrate":95920,"listener_peak":153,"listeners":42,"listenurl":"http://pro02.caster.fm:10128/live","mpeg_channels":2,"mpeg_samplerate":44100,"outgoing_kbitrate":3883,"queue_size":358609,"se
The JSON response is cut short by 272 bytes, which is exactly the size of the response headers that end up in the returned data.
If I open the link in Chrome, the response appears OK.
I also tested with the requests library, with no luck.
>>> import requests
>>> r = requests.get("http://108.168.175.149:10128/status-json.xsl")
>>> r.text
u'Expires: Thu, 19 Nov 1981 08:52:00 GMT\r\nCache-Control: no-store, no-cache, must-revalidate\r\nPragma: no-cache\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type\r\nAccess-Control-Allow-Methods: GET, OPTIONS, HEAD\r\n\r\n{"icestats":{"admin":"icemaster#localhost","banned_IPs":0,"build":20141112090605,"host":"pro02.caster.fm","location":"Earth","outgoing_kbitrate":3844,"server_id":"Icecast 2.3.3-kh11","server_start":"05/Oct/2015:10:43:46 -0500","stream_kbytes_read":104438630,"stream_kbytes_sent":5124109510,"source":[{"audio_codecid":2,"audio_info":"ice-samplerate=44100;ice-bitrate=96;ice-channels=2","bitrate":96,"connected":35133,"genre":"Various","ice-bitrate":96,"ice-channels":2,"ice-samplerate":44100,"incoming_bitrate":95920,"listener_peak":153,"listeners":43,"listenurl":"http://pro02.caster.fm:10128/live","mpeg_channels":2,"mpeg_samplerate":44100,"outgoing_kbitrate":3837,"queue_size":164258,"se'
>>>
How can I retrieve the complete data?
The server you are requesting this from is running an ancient version of an Icecast fork.
This bug was fixed long ago, and the fix has been released in mainline. I'd recommend upgrading the server (or telling the operator to upgrade it) to the latest official Icecast version from http://icecast.org
I am trying to write a script that will download a bunch of files from a website that has REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check whether it was a 302 response. If it is not, it will "pass"; if it is, it will need to parse out the location header shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the location parameter, I will have to make another GET request to download that file. I will also have to maintain the cookie for my session in order to download the file.
Can someone point me in the right direction for what library is best to use for this? I am having trouble finding out how to parse the 302 response and adding a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
import urllib.request as ur
import urllib.error as ue
'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response body, or up to
the next amt bytes, as bytes. urlopen() does not decode the body because it has no way to
automatically determine the encoding of the byte stream it receives from the http server.
The response headers are available separately through the headers attribute.
'''
url = "http://www.example.org/images/{}.jpg"
dst = ""
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)
try:
    for x in arr:
        print(str(x)+"). ".ljust(4),end="")
        hrio = ur.urlopen(url.format(x)) # HTTPResponse object; read() returns the response body as bytes
        # write the image in binary mode; the with block closes the file
        with open(dst+str(x)+".jpg","wb") as fh:
            fh.write(hrio.read())
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    print("\t[REQUEST INCOMPLETE]\t",end="")
    print("<Error ~ [{}]>".format(e))
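The loop above downloads from fixed URLs. For the flow described in the question (check for a 302, read the location header, then download with the same session cookies), a minimal standard-library sketch might look like the following. The names NoRedirect, make_opener, get_redirect_target and download are hypothetical, and the sketch assumes the session cookie is set by the server rather than copied from a browser.

```python
import http.cookiejar
import shutil
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses automatically."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def make_opener():
    """Build an opener that keeps cookies and does not follow redirects."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(
        NoRedirect, urllib.request.HTTPCookieProcessor(jar)
    )


def get_redirect_target(opener, url):
    """Return the absolute Location URL if url answers 302, otherwise None."""
    try:
        with opener.open(url, timeout=10):
            return None  # 2xx response: nothing to follow
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location") if err.code == 302 else None
        err.close()
        # Location may be relative, so resolve it against the request URL.
        return urllib.parse.urljoin(url, location) if location else None


def download(opener, url, dest_path):
    """Fetch url with the opener's cookies and stream the body to dest_path."""
    with opener.open(url, timeout=10) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

A caller would create one opener, call get_redirect_target for each download URL, skip the ones that return None, and pass the rest to download with the same opener so that cookies received along the way are sent back.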
The following url (and others like it) can be opened in a browser but causes urllib2.urlopen to throw a 404 exception: http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=1
geturl() returns the same url (no redirect). The headers are copied and pasted from firebug. I tried passing in the headers as a dictionary to Request, but got the same result. wget opens the url in the console but not from the script.
the code:
import socket
import urllib2

source_url = 'http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2'
try:
    socket.setdefaulttimeout(10)
    hdrs = [('Host','store.ovi.com'),('User-Agent','Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;(+http://code.google.com/appengine)'),('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),('Accept-Language','en-us,en;q=0.5'),('Accept-Encoding','gzip,deflate'),('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7'),('Keep-Alive','115'),('Connection','keep-alive'),('Cookie','JNPRSESSID=4u4devdrt7eb6e0qem3gin47i2; s_cc=true; undefined_s=First%20Visit; s_nr=1282817443274; s_sq=%5B%5BB%5D%5D; view=Grid; menu=menuOpen; OVI_DEVICE=b5130'),('Cache-Control','max-age=0')]
    ree = urllib2.Request(source_url)
    ree.addheaders = hdrs
    opener = urllib2.build_opener()
    htmlSource = opener.open(ree).read()
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.headers
The error output:
404
Not Found
Date: Sat, 28 Aug 2010 00:36:57 GMT
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Keep-Alive: timeout=7, max=333
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
What, if anything, am I doing incorrectly? Is this a bug? And if so, is there a workaround? Thanks!
Given a URL like:
http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2
The bit that browsers fetch is just:
http://store.ovi.com/
Everything to the right of that is a ‘fragment identifier’, which is not passed to the server at all (evidently, if you try, it will get confused). Instead, the HTML returned for the / URL will include a load of JavaScript that reads the #... data at the client side and fills in the page content using a bunch of XMLHttpRequests.
Webapps implemented like this are a big old pain to scrape, because you can't just take the HTML content of the main page. Instead you have to either analyse the script to find out where it gets the actual data from, or you have to hook up a real browser in order to execute all the scripts and see what document objects you're left with. They're also typically bad for accessibility and SEO.
Luckily for you this site appears to be putting something in the fragment that's also a valid path. So it looks like you can get the dynamic page data from the URL:
http://store.ovi.com/applications?categoryId=20&fragment=1&page=1
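The split between the part that is sent to the server and the fragment can be reproduced with urllib.parse. The sketch below is for Python 3 and uses a hypothetical helper name, fragment_to_path_url. It only makes sense for sites like this one, where the fragment happens to be a valid path on the same server.

```python
from urllib.parse import urldefrag, urljoin


def fragment_to_path_url(url):
    """Turn 'http://host/#/path?query' into 'http://host/path?query'.

    If the fragment does not look like a path, return the URL without it.
    """
    # urldefrag separates what a client sends from the '#...' part.
    base, fragment = urldefrag(url)
    if not fragment.startswith("/"):
        return base
    # Resolve the fragment as an absolute path against the base URL.
    return urljoin(base, fragment)
```

Passing the URL from the question through this function gives the fetchable address shown above, while urldefrag on its own shows that only http://store.ovi.com/ would have been requested.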