I need to log into a website using Python, but the login page requires a session ID cookie in the request header. Using Google developer tools along with a web client (hurl.it), I was able to determine the request header format that the web server accepts:
Accept: */*
Accept-Encoding: gzip, deflate
Content-Length: 85
Content-Type: application/x-www-form-urlencoded
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
User-Agent: runscope/0.1
email=******%40beatswitch.com&login_token=545a46230b291&password=*****&submission=
I have produced the following request using the Python requests module:
POST /my-ade/login/ HTTP/1.1
Host: www.amsterdam-dance-event.nl
Content-Length: 85
Accept-Encoding: gzip,deflate
Accept: */*
User-Agent: runscope/0.1
Connection: keep-alive
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
Content-Type: application/x-www-form-urlencoded
login_token=545a46230b291&password=*****&email=******%40beatswitch.com&submission='
When I load the former request header with hurl.it, everything works perfectly and the web server lets me log in, but the almost identical request with the same parameters fails in Python. When I use Python's requests, the web server presents an error page. Any help would be highly appreciated. I need a solution desperately.
EDIT:
Here is the code:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Open the login page to get the session ID and login_token
loginURL = "https://www.amsterdam-dance-event.nl/my-ade/login/"
loginReq = session.get(loginURL)
loginSoup = BeautifulSoup(loginReq.text, 'html.parser')
loginToken = loginSoup.find('input',attrs={'name':'login_token'})['value']
sessionID = loginReq.cookies['www_amsterdam-dance-event_nl_session']
cookie = 'www_amsterdam-dance-event_nl_session='+sessionID

# Construct the headers by hand and post them to the web server
headers = {'Content-Length':'85','Accept':'*/*','User-Agent':' runscope/0.1','Content-Type':'application/x-www-form-urlencoded','Accept-Encoding':'gzip,deflate','Cookie':cookie}
payload = {'email':'*******@beatswitch.com','password':'********','login_token':loginToken,'submission':''}
loggedinReq = session.post(loginURL,headers=headers,data=payload)
I found the solution, thanks to Md. Mohsin. I was trying to handle the request headers and cookies manually while the requests module can handle them by itself. So I REMOVED the following line from the code and let requests take total control, and everything worked as intended:
headers = {'Content-Length':'85','Accept':'*/*','User-Agent':' runscope/0.1','Content-Type':'application/x-www-form-urlencoded','Accept-Encoding':'gzip,deflate','Cookie':cookie}
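A minimal offline sketch of what "letting requests take control" means here. One plausible reason the hand-written headers failed (an assumption, not confirmed by the post) is that a hard-coded Content-Length of 85 no longer matches the body once the token or credentials change. The form values below are placeholders, and the request is only prepared, never sent:

```python
import requests

# Placeholder form values; the real login_token comes from the login page.
payload = {
    "email": "user@example.com",
    "password": "secret",
    "login_token": "545a46230b291",
    "submission": "",
}

session = requests.Session()
request = requests.Request(
    "POST",
    "https://www.amsterdam-dance-event.nl/my-ade/login/",
    data=payload,
)
# prepare_request() builds the request the session would send,
# without opening a connection.
prepared = session.prepare_request(request)

print(prepared.body)
print(prepared.headers["Content-Type"])
print(prepared.headers["Content-Length"])
```

Content-Type and Content-Length are derived from the encoded body, so they cannot disagree with it. In the real script, session.post(loginURL, data=payload) does the same preparation and also sends the cookies the session collected from the earlier GET.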
Related
I am scraping search results from Google using the people_also_ask module. The module itself doesn't have a method for using proxies, so I manually added proxies to the module. When I got blocked by Google, I printed the status, and it said that my IP address was banned from sending requests. The code I added to the people_also_ask module to use proxies is:
proxies = {
    'http': "http://username:password@ip:port"
}
response = SESSION.get(URL, params=params, headers=HEADERS, proxies=proxies)
I know it is an illegal activity, but I want to know why this happens, mainly for educational purposes. I think the code that extracts the data is irrelevant, so I am adding simple code that sends requests using the people_also_ask module:
import people_also_ask as paa
queries = ["how to boil eggs","how to make cake","price of poco f1","price of wooden table","best soap in us","how much tesla worth"]
for query in queries:
    questions = paa.get_related_questions(query, 40)
Note: The changes were made in the first function, named search(), in google.py of the people_also_ask module.
Note: I can run searches from the browser without any problem. Why does Google allow me to use it from the browser but block the script?
The answer is quite simple. Although it is a proxy service, it doesn't guarantee 100% anonymity. When you send the HTTP GET request via the proxy server, the request sent by your program to the proxy server is:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Now, when the proxy server sends this request to the actual destination, it sends:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Via: 1.1 naxserver (squid/3.1.8)
X-Forwarded-For: 122.126.64.43
Cache-Control: max-age=18000
Connection: keep-alive
As you can see, the proxy puts your IP (in my case, 122.126.64.43) in the X-Forwarded-For HTTP header, and hence the website knows that the request was sent on behalf of 122.126.64.43.
Read more about this header at: https://www.rfc-editor.org/rfc/rfc7239
If you want to host your own squid proxy server and want to disable setting X-Forwarded-For header, read: http://www.squid-cache.org/Doc/config/forwarded_for/
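To check whether a given proxy leaks the client address this way, one could inspect the headers that actually reach the destination. The sketch below is only an illustration: the function name revealed_client_ips is hypothetical, and the header values are copied from the example above.

```python
def revealed_client_ips(headers):
    """Return the addresses listed in X-Forwarded-For, if any."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    # The header may carry a comma-separated chain of addresses.
    return [ip.strip() for ip in forwarded.split(",") if ip.strip()]


# Headers as the destination server received them in the example above.
received = {
    "Host": "www.whatsmybrowser.org",
    "User-Agent": "python-requests/2.10.0",
    "Via": "1.1 naxserver (squid/3.1.8)",
    "X-Forwarded-For": "122.126.64.43",
}

print(revealed_client_ips(received))
```

An empty list means this header is absent; it does not prove the proxy is anonymous, since other headers such as Via can still reveal that a proxy is in use.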
I don't deserve any credit for this answer; I copied it from the following post: Python Requests module - proxy not working
I am very new to APIs.
I have to make a POST request to an API with the following "format":
content-type: multipart/form-data
Content-Disposition: form-data; name=""; filename=""
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Form data:
file = file.xlsx
How can I perform the API request using Python?
Using the requests library, can I do it like this?
requests.post(
    'api_url',
    headers = {'Content-Type':'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'},
    data = {"filename.xlsx": open(filepath, "rb")}
)
Thanks
I prefer urllib3's PoolManager, as it makes timeouts, retries, etc. easy to manage:
import urllib3
from urllib3.util import Retry, Timeout
http_client = urllib3.PoolManager(retries=Retry(connect=5, read=2, redirect=5),
                                  timeout=Timeout(connect=5.0, read=10.0),
                                  num_pools=2)
data = {'asd': 'asd'}
request = http_client.request('POST', "http://localhost:8081", fields=data, encode_multipart=True)
This will give you:
>nc -l 127.0.0.1 8081
POST / HTTP/1.1
Host: localhost:8081
Accept-Encoding: identity
Content-Length: 125
Content-Type: multipart/form-data; boundary=6ce0c07687204c761cc1e5a6d6f6046e
User-Agent: python-urllib3/1.26.4
--6ce0c07687204c761cc1e5a6d6f6046e
Content-Disposition: form-data; name="asd"
asd
--6ce0c07687204c761cc1e5a6d6f6046e--
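For the spreadsheet upload in the question, a comparable sketch with requests might look as follows. It assumes the API expects the form field to be named file (taken from the "Form data" line, since the question leaves name="" blank), and it uses a hypothetical URL and placeholder bytes instead of a real workbook. Passing files= lets requests build the multipart body and boundary, so no Content-Type header is set by hand. The request is only prepared, not sent:

```python
import requests

XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Placeholder bytes; in practice pass open(filepath, "rb") instead.
file_content = b"placeholder spreadsheet bytes"

request = requests.Request(
    "POST",
    "http://localhost:8081/upload",  # hypothetical endpoint
    # (filename, content, content type) for the form field named "file"
    files={"file": ("file.xlsx", file_content, XLSX_MIME)},
)
prepared = request.prepare()

print(prepared.headers["Content-Type"])
print(prepared.body.decode("latin-1"))
```

With a real endpoint, requests.post(api_url, files=...) with the same files argument would send this request directly.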
Normally, when sending an HTTP request, the actual traffic looks like this:
GET /abc?hello HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.7
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
However, I would like to send URLs without the leading slash, for example:
GET abc?hello HTTP/1.1
GET ftp://abc?hello HTTP/1.1
I understand that this is not compliant with the RFCs, but I need to send such requests from Python for testing purposes.
I have checked requests, urllib, urllib2 and urllib3, but haven't figured out how to do it.
Can anyone help me out?
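One way to sidestep any normalisation done by the higher-level libraries is to write the request bytes to a plain socket. This is only a sketch under assumptions: the helper names build_raw_request and send_raw are hypothetical, and the host and port are taken from the example traffic above.

```python
import socket


def build_raw_request(target, host):
    """Build an HTTP/1.1 GET whose request target is sent verbatim."""
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_raw(target, host="localhost", port=8080, timeout=5.0):
    """Send the raw request and return the bytes the server replies with."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_raw_request(target, f"{host}:{port}"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Inspect the bytes without contacting any server.
print(build_raw_request("abc?hello", "localhost:8080"))
```

Calling send_raw("abc?hello") or send_raw("ftp://abc?hello") would then put exactly that string on the request line; the reply is returned unparsed, so status line and headers have to be split by hand.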
There are many different ways of reading web pages in python.
I focused on the following methods:
Retrieve a page
Opening Socket
Making Request
Example of Retrieving a page:
from urllib.request import urlretrieve
url = 'http://ce.sharif.edu/courses'
file_name = 'courses.html'
urlretrieve(url, file_name)
Example of Opening Socket:
from urllib.request import urlopen
url = 'http://ce.sharif.edu/courses'
socket = urlopen(url)
text = str(socket.read())
socket.close()
Example of Making Request:
>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
So the question is: what are the main differences between the above methods, and when should each be used?
The third method is using a different library than the above. This doesn't look like a problem, but let's see the content of the request as seen from the server side.
1)
GET /example1.html HTTP/1.1
Accept-Encoding: identity
Host: XXXXXXXX
Connection: close
User-Agent: Python-urllib/3.5
2)
GET /example2.html HTTP/1.1
Accept-Encoding: identity
Connection: close
Host: XXXXXXXX
User-Agent: Python-urllib/3.5
No noticeable difference between 1) and 2).
3)
GET /example3 HTTP/1.1
Host: XXXXXXXX
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.18.4
The last is slightly different; this means there is at least the possibility of obtaining different results in the response, and this will depend on the server configuration.
Accept-Encoding: gzip, deflate
This may result in the server compressing the response, which means less data transferred.
Connection: keep-alive
The server will keep the connection open for reusing with subsequent requests (possibly more efficient).
User-Agent:
Many web servers adapt the content depending on the identified client software. I don't think there will be any difference in this particular case, however it can't be ruled out completely.
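To narrow the gap between the urllib methods and requests, the extra headers can be set on a urllib Request by hand. The sketch below uses a hypothetical URL and sends nothing; it also includes a small hypothetical helper, decode_body, because urllib, unlike requests, does not decompress gzip responses on its own.

```python
import gzip
import urllib.request

# The only header urllib's opener adds by default is its User-Agent.
opener = urllib.request.build_opener()
print(opener.addheaders)

# Hypothetical URL; the headers mirror part of what requests sent in 3).
request = urllib.request.Request(
    "http://localhost:8080/example1.html",
    headers={"Accept-Encoding": "gzip", "Accept": "*/*"},
)
print(request.header_items())


def decode_body(raw, content_encoding):
    """Undo gzip compression if the server applied it."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw


# Demonstrate the helper on locally compressed data.
print(decode_body(gzip.compress(b"<html>...</html>"), "gzip"))
```

With a real server, decode_body(response.read(), response.headers.get("Content-Encoding")) would be applied to the object returned by urlopen(request).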
I am trying to open an HTML page with the Python requests library, but my code opens the site's root page instead, and I don't understand how to solve the problem.
import requests
scraping = requests.request("POST", url = "http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print(scraping.content)
Thank you for any suggestions!
You can see easily that the server is redirecting to the main page.
➜ ~ http -v http://www.pollnet.it/WeeklyReport_it.aspx\?ID\=69
GET /WeeklyReport_it.aspx?ID=69 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.pollnet.it
User-Agent: HTTPie/0.9.3
HTTP/1.1 302 Found
Content-Length: 131
Content-Type: text/html; charset=utf-8
Date: Sun, 07 Feb 2016 11:24:52 GMT
Location: /default.asp
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
On further checking, it can be seen that the web server uses session cookies.
➜ ~ http -v http://www.pollnet.it/default_it.asp
HTTP/1.1 200 OK
Cache-Control: private
Content-Encoding: gzip
Content-Length: 9471
Content-Type: text/html; Charset=utf-8
Date: Sun, 07 Feb 2016 13:21:41 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/
Vary: Accept-Encoding
X-Powered-By: ASP.NET
It means that every time the main page is visited, the server sends a "Set-Cookie" header, which instructs the browser to set certain cookies. Then every time the browser asks for a Weekly Report, the server validates the session cookie.
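As an illustration of what the client does with that header, the standard library can parse the Set-Cookie value shown above and build the Cookie header that would accompany later requests. This is a simplified sketch; a real client also honours the path, domain and expiry attributes.

```python
from http.cookies import SimpleCookie

# Set-Cookie value copied from the response shown above.
set_cookie = "ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/"

cookie = SimpleCookie()
cookie.load(set_cookie)

# Build the Cookie header a client would send back on later requests.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)
print(cookie_header)
```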
Normally, the top-level functions of the requests package do not keep cookies between requests, but for scraping we can use a Session object, which keeps the cookies across page requests.
import requests
# create a Session object
s= requests.Session()
# first visit the main page
s.get("http://www.pollnet.it/default_it.asp")
# then we can visit the weekly report pages
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print(r.text)
# another page
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=89")
print(r.text)
But here is some advice - the web server may only allow a fixed number of pages (maybe 10, maybe 15) to be opened with a given Session object. Either validate r.text immediately each time (for example, check that the response body is not suspiciously short), or create a new Session object every 5 or 6 pages.
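The advice above could be sketched roughly as follows. The function name fetch_reports, the 5-page limit and the 500-character threshold are all assumptions chosen for illustration; the session is passed in as a factory so the helper works with requests.Session or any object offering a requests-style get().

```python
def fetch_reports(session_factory, main_url, report_urls,
                  pages_per_session=5, min_length=500):
    """Fetch report pages, renewing the session every few pages.

    session_factory: callable returning a requests-style session,
    for example requests.Session. Pages shorter than min_length
    characters are stored as None (probably a redirect or error page).
    """
    results = {}
    session = None
    for index, url in enumerate(report_urls):
        if index % pages_per_session == 0:
            # New session: visit the main page first to get a fresh cookie.
            session = session_factory()
            session.get(main_url)
        text = session.get(url).text
        results[url] = text if len(text) >= min_length else None
    return results
```

A call might look like fetch_reports(requests.Session, "http://www.pollnet.it/default_it.asp", urls), where urls is the list of WeeklyReport_it.aspx addresses; any entry that comes back as None is worth retrying with a fresh session.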
More info on Session objects can be found in the requests documentation.