I am trying to open an HTML page with the Python requests library, but my code opens the site's root page instead, and I don't understand how to solve the problem.
import requests
scraping = requests.request("POST", url = "http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print scraping.content
Thank you for any suggestions!
You can easily see that the server is redirecting to the main page.
➜ ~ http -v http://www.pollnet.it/WeeklyReport_it.aspx\?ID\=69
GET /WeeklyReport_it.aspx?ID=69 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.pollnet.it
User-Agent: HTTPie/0.9.3
HTTP/1.1 302 Found
Content-Length: 131
Content-Type: text/html; charset=utf-8
Date: Sun, 07 Feb 2016 11:24:52 GMT
Location: /default.asp
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
On further checking, it can be seen that the web server uses session cookies.
➜ ~ http -v http://www.pollnet.it/default_it.asp
HTTP/1.1 200 OK
Cache-Control: private
Content-Encoding: gzip
Content-Length: 9471
Content-Type: text/html; Charset=utf-8
Date: Sun, 07 Feb 2016 13:21:41 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/
Vary: Accept-Encoding
X-Powered-By: ASP.NET
It means that every time the main page is visited, the server sends a "Set-Cookie" header, which instructs the browser to set certain cookies. Then every time the browser asks for a Weekly Report, the server validates the session cookie.
Normally, the requests package does not keep cookies between requests, but for scraping we can use a Session object, which saves the cookies between page requests.
import requests
# create a Session object
s = requests.Session()
# first visit the main page
s.get("http://www.pollnet.it/default_it.asp")
# then we can visit the weekly report pages
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print(r.text)
# another page
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=89")
print(r.text)
But here is some advice: the web server may only allow a fixed number of pages (maybe 10, maybe 15) to be opened with a given Session object. Either validate r.text immediately each time (for example, check that the response body is not suspiciously short), or create a new Session object every 5 or 6 pages.
More info on Session objects here.
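The advice above can be sketched roughly as follows. This is only an illustration: `fetch_pages`, `make_session`, `pages_per_session` and `min_length` are hypothetical names, and the thresholds are guesses that would need tuning against the real site.

```python
def fetch_pages(urls, make_session, home_url, pages_per_session=5, min_length=1000):
    """Fetch each URL, starting a fresh session every `pages_per_session` pages.

    Returns a dict mapping each URL to its page text, or to None when the
    body is shorter than `min_length` (a hint that the server did not return
    the real report).
    """
    pages = {}
    session = None
    for index, url in enumerate(urls):
        if index % pages_per_session == 0:
            # New session: visit the main page first so the server can set
            # its session cookie before any report page is requested.
            session = make_session()
            session.get(home_url)
        text = session.get(url).text
        pages[url] = text if len(text) >= min_length else None
    return pages

# Possible use (assumes the third-party requests package and network access):
#   import requests
#   pages = fetch_pages(report_urls, requests.Session,
#                       "http://www.pollnet.it/default_it.asp")
```

Passing the session factory in as an argument keeps the helper independent of requests itself, so it can be tried out with a stub object before pointing it at the real server.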
Related
Is it possible to get the 'content-length' of a web page if the server doesn't include content-length as a header (even if the whole page is downloaded) without downloading the whole content?
I am doing some scraping using python requests and only need to classify data based on size of response.
Below are the response headers the server sends.
HTTP/1.1 200 OK
Cache-Control: no-cache,private
Content-Type: text/html
Server: Microsoft-IIS/7.2
X-Powered-By: ASP.NET
Date: Fri, 04 Jul 2017 11:28:40 GMT
Connection: close
I need to log into a website using Python, but the login page requires a session ID cookie in the request header. Using Google developer tools along with a web client (hurl.it), I was able to determine the request header format that the web server accepts:
Accept: */*
Accept-Encoding: gzip, deflate
Content-Length: 85
Content-Type: application/x-www-form-urlencoded
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
User-Agent: runscope/0.1
email=******%40beatswitch.com&login_token=545a46230b291&password=*****&submission=
I have produced the following request using the Python requests module:
POST /my-ade/login/ HTTP/1.1
Host: www.amsterdam-dance-event.nl
Content-Length: 85
Accept-Encoding: gzip,deflate
Accept: */*
User-Agent: runscope/0.1
Connection: keep-alive
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
Content-Type: application/x-www-form-urlencoded
login_token=545a46230b291&password=*****&email=******%40beatswitch.com&submission='
When I send the former request with hurl.it, everything works perfectly and the web server lets me log in, but the almost identical request with the same parameters fails in Python. When using Python's requests, the web server presents an error page. Any help would be highly appreciated. I need a solution desperately.
EDIT:
Here is the code:
import requests
from bs4 import BeautifulSoup  # assuming the bs4 package; the original does not show its imports
session = requests.Session()
# Open the login page to get sessionID and login_token
loginURL = "https://www.amsterdam-dance-event.nl/my-ade/login/"
loginReq = session.get(loginURL)
loginSoup = BeautifulSoup(loginReq.text)
loginToken = loginSoup.find('input',attrs={'name':'login_token'})['value']
sessionID = loginReq.cookies['www_amsterdam-dance-event_nl_session']
cookie = 'www_amsterdam-dance-event_nl_session='+sessionID
# Construct the header and post it to the webserver
headers = {'Content-Length':'85','Accept':'*/*','User-Agent':' runscope/0.1','Content-Type':'application/x-www-form-urlencoded','Accept-Encoding':'gzip,deflate','Cookie':cookie}
payload = {'email':'*******#beatswitch.com','password':'********','login_token':loginToken,'submission':''}
loggedinReq = session.post(loginURL,headers=headers,data=payload)
I found the solution, thanks to Md. Mohsin. I was trying to handle the request headers and cookies manually, while the requests module can handle them by itself. So I REMOVED the following line from the code (and stopped passing headers to session.post), let requests take total control, and everything worked as intended:
headers = {'Content-Length':'85','Accept':'*/*','User-Agent':' runscope/0.1','Content-Type':'application/x-www-form-urlencoded','Accept-Encoding':'gzip,deflate','Cookie':cookie}
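The thread does not say which of the hand-written headers was at fault. One plausible candidate (an assumption, not something confirmed above) is the fixed Content-Length of 85: the length of a form-encoded body depends on the field values, so a hard-coded number can easily stop matching. A minimal standard-library sketch with made-up field values shows the effect; `form_body_length` is a hypothetical helper.

```python
from urllib.parse import urlencode

def form_body_length(fields):
    """Return the byte length of the form-encoded body for `fields`."""
    # urlencode percent-encodes everything non-ASCII, so ASCII is safe here.
    return len(urlencode(fields).encode("ascii"))

# Hypothetical values, for illustration only.
short = {"email": "a@example.com", "password": "pw",
         "login_token": "545a46230b291", "submission": ""}
longer = dict(short, password="a-much-longer-password")

print(form_body_length(short))
print(form_body_length(longer))
```

When the headers dictionary is left out, requests works out Content-Length and Content-Type from the data argument, and the Session object sends back the cookies it received.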
I am trying to write a script that will download a bunch of files from a website that has REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check whether the response is a 302. If it is not, it will "pass"; if it is, it will need to parse out the location parameter shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the location parameter, I will have to make another GET request to download that file. I will also have to maintain the cookie for my session in order to download the file.
Can someone point me in the right direction for what library is best to use for this? I am having trouble finding out how to parse the 302 response and add a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
import urllib.request as ur
import urllib.error as ue
'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response body, or up to
the next amt bytes. The body comes back as bytes because there is no way for urlopen() to
automatically determine the encoding of the byte stream it receives from the http server.
'''
url = "http://www.example.org/images/{}.jpg"
dst = ""  # destination prefix; empty means the current directory
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)
try:
    for x in arr:
        print(str(x)+"). ".ljust(4),end="")
        hrio = ur.urlopen(url.format(x)) # HTTPResponse object; read() returns the response body as bytes
        # write the body to disk in binary mode; the file is closed automatically
        with open(dst+str(x)+".jpg","wb") as fh:
            fh.write(hrio.read())
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    print("\t[REQUEST INCOMPLETE]\t",end="")
    print("<Error ~ [{}]>".format(e))
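For the 302 check described in the question, one possible approach is to turn off automatic redirect following and inspect the status code and Location header directly. The helper below is a minimal sketch using only the standard library; `redirect_target` is a hypothetical name, and the commented usage assumes the third-party requests package, whose `allow_redirects=False` option stops it from following the redirect on its own.

```python
from urllib.parse import urljoin

def redirect_target(status_code, headers, request_url):
    """Return the absolute redirect URL for a 302 response, or None otherwise."""
    if status_code != 302:
        return None
    # Header names are case-insensitive, so compare them in lower case.
    for name, value in headers.items():
        if name.lower() == "location":
            # Location may be relative; resolve it against the requested URL.
            return urljoin(request_url, value)
    return None

# Possible use with a requests.Session, which also keeps the cookies it is sent:
#   r = session.get(url, allow_redirects=False)
#   target = redirect_target(r.status_code, r.headers, url)
#   if target is not None:
#       data = session.get(target).content
```

Since requests follows redirects by default, a plain `session.get(url)` may already be enough when the 302 itself does not need to be examined.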
I'm trying to use Mechanize to automate interactions with a very picky legacy system. In particular, after the first login page the authorization must be sent with every request, or it knocks you out of the system. Unfortunately, Mechanize seems content to send the authorization only after first getting a 401 Unauthorized error. Is there any way to have it send authorization every time?
Here's some sample code:
br.add_password("http://example.com/securepage", "USERNAME", "PASSWORD", "/MYREALM")
br.follow_link(link_to_secure_page) # where the url is the previous URL
Here's the response I get from debugging Mechanize:
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 401 Unauthorized\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:04 GMT
header: Connection: close
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=US-ASCII
header: Content-Length: 210
header: WWW-Authenticate: Basic realm="/MYREALM"
header: Cache-control: no-cache
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nAuthorization: Basic VVNFUk5BTUU6UEFTU1dPUkQ=\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:07 GMT
header: Connection: close
header: Last-Modified: Thu, 08 Dec 2011 03:08:06 GMT
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=UTF-8
header: Content-Length: 33333
header: Cache-control: no-cache
The problem is that, contrary to what should happen in a modern web application with a GET request, hitting the 401 error first gets me the wrong page. I've confirmed with curl and urllib2 that if I hit the URL directly, passing in the auth header on the first request, I get the correct page.
Any hints on how to tell mechanize to always send the auth headers and avoid the first 401 error? This needs to be fixed on the client side. I can't modify the server.
from base64 import b64encode
import mechanize
url = 'http://192.168.3.5/table.js'
username = 'admin'
password = 'password'
# I have had to add a trailing newline ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))
br = mechanize.Browser()
# I needed to change to Mozilla for mine, but most do not
# br.addheaders= [('User-agent', 'Mozilla/5.0')]
br.addheaders.append(
    ('Authorization', 'Basic %s' % b64login )
)
br.open(url)
r = br.response()
data = r.read()
print data
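The header value built above can be checked against the debug trace in the question. The sketch below is written for Python 3 (where b64encode needs bytes) and uses only the standard library; `basic_auth_header` is a hypothetical helper name. For the credentials USERNAME and PASSWORD it produces the same token that Mechanize sent on its second attempt.

```python
from base64 import b64encode

def basic_auth_header(username, password):
    """Build the (name, value) pair for an HTTP Basic Authorization header."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = b64encode(credentials).decode("ascii")
    return ("Authorization", "Basic " + token)

print(basic_auth_header("USERNAME", "PASSWORD"))
# → ('Authorization', 'Basic VVNFUk5BTUU6UEFTU1dPUkQ=')
```

A tuple in this shape is what gets appended to `br.addheaders` in the code above, so the header goes out on the first request instead of only after a 401.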
The following url (and others like it) can be opened in a browser but causes urllib2.urlopen to throw a 404 exception: http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=1
geturl() returns the same url (no redirect). The headers are copied and pasted from firebug. I tried passing in the headers as a dictionary to Request, but got the same result. wget opens the url in the console but not from the script.
The code:
import socket
import urllib2
source_url = 'http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2'
try:
    socket.setdefaulttimeout(10)
    hdrs = [('Host','store.ovi.com'),('User-Agent','Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;(+http://code.google.com/appengine)'),('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),('Accept-Language','en-us,en;q=0.5'),('Accept-Encoding','gzip,deflate'),('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7'),('Keep-Alive','115'),('Connection','keep-alive'),('Cookie','JNPRSESSID=4u4devdrt7eb6e0qem3gin47i2; s_cc=true; undefined_s=First%20Visit; s_nr=1282817443274; s_sq=%5B%5BB%5D%5D; view=Grid; menu=menuOpen; OVI_DEVICE=b5130'),('Cache-Control','max-age=0')]
    ree = urllib2.Request(source_url)
    ree.addheaders = hdrs
    opener = urllib2.build_opener()
    htmlSource = opener.open(ree).read()
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.headers
The error output:
404
Not Found
Date: Sat, 28 Aug 2010 00:36:57 GMT
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Keep-Alive: timeout=7, max=333
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
What, if anything, am I doing incorrectly? Is this a bug? And if so, is there a workaround? Thanks!
Given a URL like:
http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2
The bit that browsers fetch is just:
http://store.ovi.com/
Everything to the right of that is a ‘fragment identifier’, which is not passed to the server at all (evidently, if you try, it will get confused). Instead, the HTML returned for the / URL will include a load of JavaScript that reads the #... data at the client side and fills in the page content using a bunch of XMLHttpRequests.
Webapps implemented like this are a big old pain to scrape, because you can't just take the HTML content of the main page. Instead you have to either analyse the script to find out where it gets the actual data from, or you have to hook up a real browser in order to execute all the scripts and see what document objects you're left with. They're also typically bad for accessibility and SEO.
Luckily for you this site appears to be putting something in the fragment that's also a valid path. So it looks like you can get the dynamic page data from the URL:
http://store.ovi.com/applications?categoryId=20&fragment=1&page=1
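The split described above can be reproduced with the standard library. This is a small Python 3 sketch: `urldefrag` separates the part a browser actually requests from the fragment, and `urljoin` then treats the fragment as a path. That last step only makes sense because this particular site happens to put a valid path in the fragment; it is not a general rule.

```python
from urllib.parse import urldefrag, urljoin

source_url = "http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2"

# Everything after '#' is the fragment and is never sent to the server.
fetched_url, fragment = urldefrag(source_url)
print(fetched_url)   # the URL the browser actually requests
print(fragment)      # the part handled by client-side JavaScript

# Site-specific assumption: the fragment is also a valid path on the server.
data_url = urljoin(fetched_url, fragment)
print(data_url)
```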