Fetch a page with cookies using Python requests library - python

I'm just studying the requests library(http://docs.python-requests.org/en/latest/),
and got a problem on how to fetch a page with cookies using requests.
for example:
url2= 'https://passport.baidu.com'
parsedCookies={'PTOKEN': '412f...', 'BDUSS': 'hnN2...', ...} #Sorry that the cookies value is replaced by ... for instance of privacy
req = requests.get(url2, cookies=parsedCookies)
text=req.text.encode('utf-8','ignore')
f=open('before.html','w')
f.write(text)
f.close()
req.close()
when I use the codes above to fetch the page, it just saves the login page to 'before.html' instead of logined page, it refers that actually I haven't logged in successfully.
But if I use URLlib2 to fetch the page, it works properly as expected.
parsedCookies="PTOKEN=412f...;BDUSS=hnN2...;..." #Different format but same content with the aboved cookies
req = urllib2.Request(url2)
req.add_header('Cookie', parsedCookies)
ret = urllib2.urlopen(req)
f=open('before_urllib2.html','w')
f.write(ret.read())
f.close()
ret.close()
When I use these codes, it saves the logined page in before_urllib2.html.
--
Are there any mistakes in my code?
Any reply would be grateful.

You can use Session object to get what you desire:
url2='http://passport.baidu.com'
session = requests.Session() # create a Session object
cookie = requests.utils.cookiejar_from_dict(parsedCookies)
session.cookies.update(cookie) # set the cookies of the Session object
req = session.get(url2, headers=headers,allow_redirects=True)
If you use the requests.get function, it doesn't send cookies for the redirected page. Instead, if you use the Session().get function, it will maintain and send cookies for all http requests, this is what the concept "session" exactly means.
Let me try to elaborate to you what happens here:
When I sent cookies to http://passport.baidu.com/center and set the parameter allow_redirects as false, the returned status code is 302 and one of the headers of the response is 'location': '/center?_t=1380462657' (This is a dynamic value generated by server, you can replace it with what you get from server):
url2= 'http://passport.baidu.com/center'
req = requests.get(url2, cookies=parsedCookies, allow_redirects=False)
print req.status_code # output 302
print req.headers
But when I set the parameter allow_redirects as True, it still doesn't redirect to the page (http://passport.baidu.com/center?_t=1380462657) and the server return the login page. The reason is that the requests.get doesn't send cookies for the redirected page, here is http://passport.baidu.com/center?_t=1380462657, so we can login successfully. That is why we need the Session object.
If I set url2 = http://passport.baidu.com/center?_t=1380462657, it will return the page you want. One solution is use the above code to get the dynamic location value and form a path to you account like http://passport.baidu.com/center?_t=1380462657 , then you can get the desired page.
url2= 'http://passport.baidu.com' + req.headers.get('location')
req = session.get(url2, cookies=parsedCookies, allow_redirects=True )
But this is cumbersome, so when dealing with cookies, Session object do excellent job for us!

Related

How to store headers and cookies from python requests

My propose is to login at my application through python requests. I was able to get a token, that is expected, but passing it by GET isn't enough. So, i want to store the request in a cookie, pass the token, and maybe the browser can login.
So, let's resume what i did (this is pseudo code)
session = requests.Session()
session.get('<url>salt')
r = session.get('<url>login', params={username, password})
r.headers['token']
I discovered this by looking the requests while login. The token is passed to the application after. So, how can i store the "r" as a cookie?
you can simply access your session cookie using:
client = requests.session()
cook = client.cookies
extracted_token_value = client.cookies['token']
#this will print your cookie and token
print cook.text
print extracted_token_value
#updating your header now:
client.headers.update({'New Header': 'extracted_token_value')
BR

Remote form post Python and cookies headers

I have a script to post a form from a distant web page.
This webpage sends a cookie, and I like to get this and insert it in the header to post the form.
How could I do that ?
Thank you
If you want to get cookies from GET request and use it in POST, you can try to use session() as below:
session = requests.session()
response = session.get(URL) # send GET to get cookies
cookies = response.cookies.get_dict() # Grab cookies
post = session.post(URL, cookies=cookies) # Send POST with grabbed cookies
Thanks a lot,
When I go to the website from a web browser I have this :
Cookie: __qca=P0-1275531178-1484650046149; _ga=GA1.2.1669629019.1484650088; optimizelyEndUserId=oeu1486719631164r0.37280650070167387; optimizelySegments=%7B%224541596561%22%3A%22ff%22%2C%224547081444%22%3A%22none%22%2C%224502945149%22%3A%22direct%22%2C%224547981473%22%3A%22false%22%7D; optimizelyBuckets=%7B%7D; IRF_1322=%7Bvisits%3A1%2Cuser%3A%7Btime%3A1486719631639%2Cref%3A%22direct%22%2Cpv%3A1%2Ccap%3A%7B%7D%2Cv%3A%7B%7D%7D%2Cvisit%3A%7Btime%3A1486719631639%2Cref%3A%22direct%22%2Cpv%3A1%2Ccap%3A%7B%7D%2Cv%3A%7B%7D%7D%2Clp%3A%22http%3A%2F%2Flastingmemoriesbylisa.zenfolio.com%2F%22%2Cdebug%3A0%2Ca%3A1486719631639%7D; zf_5y_visitor=s7fL-1Wogsx-E1mBu-WkZbl-5gcAAAAATer6DQw9WNnr; zf_10y_tz=120; zf_pat=256887569$labellevuephotography$$396676290$492530544; zf_lsc=NPUPdGc1+bbM3gpqJ/i3nH4o...0
but when I use my script I online have :
zf_5y_visitor=s7fL-1Wogsx-E1mBu-WkZbl-5gcAAAAATer6DQw9WNnr; zf_10y_tz=120;
Is that possible to get the other parts ? At least the one starting with "zf"

Retrieve OAuth code in redirect URL provided as POST response

Python newbie here, so I'm sure this is a trivial challenge...
Using Requests module to make a POST request to the Instagram API in order to obtain a code which is used later in the OAuth process to get an access token. The code is usually accessed on the client-side as it's provided at the end of the redirect URL.
I have tried using Request's response history method, like this (client ID is altered for this post):
OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
OAuth_AccessRequest = requests.post(OAuthURL)
ResHistory = OAuth_AccessRequest.history
for resp in ResHistory:
print resp.status_code, resp.url
print OAuth_AccessRequest.status_code, OAuth_AccessRequest.url
But the URLs this returns are not revealing the code number string, instead, the redirect just looks like this:
302 https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.dashboard.com/whathappened&response_type=code
200 https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%cb0096f08a3848e67355f%26redirect_uri%3Dhttps%3A//www.smashboarddashboard.com/whathappened%26response_type%3Dcode
Where if you do this on the client side, using a browser, code would be replaced with the actual number string.
Is there a method or approach I can add to the POST request that will allow me to have access to the actual redirect URL string that appears in the web browser?
It should work in a browser if you are already logged in at Instagram. If you are not logged in you are redirected to a login page:
https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%3Dcb0096f08a3848e67355f%26redirect_uri%3Dhttps%3A//www.smashboarddashboard.com/whathappened%26response_type%3Dcode
Your Python client is not logged in and so it is also redirected to Instagram's login page as shown by the value of OAuth_AccessRequest.url :
>>> import requests
>>> OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
>>> OAuth_AccessRequest = requests.get(OAuthURL)
>>> OAuth_AccessRequest
<Response [200]>
>>> OAuth_AccessRequest.url
u'https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%3Dcb0096f08a3848e67355f%26redirect_uri%3Dhttps%3A//www.smashboarddashboard.com/whathappened%26response_type%3Dcode'
So, to get to the next step, your Python client needs to login. This requires that the client extract and set fields to be posted back to the same URL. It also requires cookies and that the Referer header be properly set. There is a hidden CSRF token that must be extracted from the page (you could use BeautifulSoup for example), and form fields username and password must be set. So you would do something like this:
import requests
from bs4 import BeautifulSoup
OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
session = requests.session() # use session to handle cookies
OAuth_AccessRequest = session.get(OAuthURL)
soup = BeautifulSoup(OAuth_AccessRequest.content)
form = soup.form
login_data = {form.input.attrs['name'] : form.input['value']}
login_data.update({'username': 'your username', 'password': 'your password'})
headers = {'Referer': OAuth_AccessRequest.url}
login_url = 'https://instagram.com{}'.format(form.attrs['action'])
r = session.post(login_url, data=login_data, headers=headers)
>>> r
<Response [400]>
>>> r.json()
{u'error_type': u'OAuthException', u'code': 400, u'error_message': u'Invalid Client ID'}
Which looks like it will work once provided a valid client ID.
As an alternative, you could look at mechanize which will handle the form submission for you, including the hidden CSRF field:
import mechanize
OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
br = mechanize.Browser()
br.open(OAuthURL)
br.select_form(nr=0)
br.form['username'] = 'your username'
br.form['password'] = 'your password'
r = br.submit()
response = r.read()
But this doesn't work because the referer header is not being set, however, you could use this method if you can figure out a solution to that.

python's requests do not handle cookies correctly?

I have problem with simple authorization and upload API script.
When authorized, client receives several cookies, including PHPSESSID cookie (in browser).
I use requests.post method with form data for authorization:
r = requests.post(url, headers = self.headers, data = formData)
self.cookies = requests.utils.dict_from_cookieja(r.cookies)
Headers are used for custom User-Agent only.
Authorization is 100% fine (there is a logout link on the page).
Later, i try to upload data using the authorized session cookies:
r = requests.post(url, files = files, data = formData, headers = self.headers, cookies = self.cookies)
But site rejects the request. If we compare the requests from script and google chrome (using Wireshark), there is no differences in request body.
Only difference is that 2 cookies sent by requests class, while google chrome sends 7.
Update: Double checked, first request receives 7 cookies. post method just ignore half...
My mistake in code was that i was assigning cookies from each next API request to the session cookies dictionary. On each request since logged in, cookies was 'reset' by upcoming response cookies, that's was the problem. As auth cookies are assigned only at login request, they were lost at the next request.
After each authorized request i use update(), not assigning.
self.cookies.update( requests.utils.dict_from_cookiejar(r.cookies) )
Solves my issue, upload works fine!

Python web scraping requests follow redirect

I'm trying to scrape a web site with the requests module.
Using chrome and inspect elements, I go to the url, fill in a form and click the continue button. Chrome's inspect elements (network documents) shows what chrome sent with post. It also shows multiple cookies. The site redirects to a url with among other things a session ID.
To simulate this, I try using requests. I take the form data from inspect elements and reformat it to a dictionary. I use requests.session to include the cookies.
import requests
form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')
payload = {}
for item in form_data:
key, value = item.split('=')
if value:
payload[key] = value
with requests.session() as s:
r = s.post('https://www.aa.com/homePage.do', params = payload, allow_redirects=True)
print r.headers
print r.history
print r.url
print r.status_code
with open('x.htm', 'wb') as f:
f.write(r.text.encode('utf8'))
requests, however, does not appear to follow the redirect. history is empty and the url appears to be the data I sent rather than what the site returned. x.htm shows a web page, but does not contain the info I expected.
From http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history I expected r.url to contain the redirected url and r.history to contain an http response code.
What am I doing wrong?
ok what you do seems to be wrong. i am not sure how you decided to sent a post on https://www.aa.com/homePage.do, but that seems to be a get and doesnt take the params you send. when you click search your browser sends this post: https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do;jsessionid=XXXXXXXXXXXXXXXXXXX and parameters:
currentCalForm=dep
currentCodeFrom=
tripType=roundTrip
originAirport=LAX
flightParams.flightDateParams.travelMonth=10
flightParams.flightDateParams.travelDay=24
flightParams.flightDateParams.searchTime=040001
destinationAirport=JFK
returnDate.travelMonth=10
returnDate.travelDay=31
returnDate.searchTime=400001
adultPassengerCount=1
adultPassengerCount=1
childPassengerCount=0
hotelRoomCount=1
serviceclass=coach
searchTypeMode=matrix
awardDatesFlexible=true
originAlternateAirportDistance=0
destinationAlternateAirportDistance=0
discountCode=
flightSearch=revenue
dateChanged=false
fromSearchPage=true
advancedSearchOpened=false
numberOfFlightsToDisplay=10
searchCategory=
aairpassSearchType=false
moreOptionsIndicator=
seniorPassengerCount=0
youngAdultPassengerCount=0
infantPassengerCount=0
passengerCount=1
This will then give you an html back. preety mach you have to send all requests send in the browser. it might be easier for you to do it with selenium.
i found this using httpfox probably is similar to chrome networks.

Categories