Cookies and HTTP requests - Python

I have this URL whose content is produced in this way (PHP; it's supposed to generate a random cookie on every request):
setcookie('token', md5(time()), time() + 99999);
if (isset($_COOKIE['token'])) {
    echo 'Cookie: ' . $_COOKIE['token'];
    die();
}
echo 'Cookie not set yet';
As you can see, the cookie changes on every reload/refresh of the page. Now I have a Python 3 script with three requests that are completely independent of each other:
import requests

def get_req_data(req):
    print('\n\ntoken: ', req.cookies['token'])
    print('headers we sent: ', req.request.headers)
    print('headers server sent back: ', req.headers)

url = 'http://migueldvl.com/heya/login/tests2.php'
headers = {
    "User-agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
    "Referer": 'https://www.google.com'
}

req1 = requests.get(url, headers=headers)
get_req_data(req1)
req2 = requests.get(url, headers=headers)
get_req_data(req2)
req3 = requests.get(url, headers=headers)
get_req_data(req3)
How can it be that we sometimes get the same cookie in different requests, when it is clearly programmed to change on every request?
If we:
import time
and add a
time.sleep(1) # wait one second before the next request
between requests, the cookie changes every time, which is the right and expected behaviour. But my question is: why do we need the time.sleep(1) to be certain the cookie changes? Wouldn't separate requests be enough on their own?
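For reference, PHP's time() returns whole Unix seconds, so md5(time()) can only change once per second; a minimal Python sketch of the same calculation (the helper name is purely illustrative):

import hashlib
import time

def php_style_token():
    # Mirrors PHP's md5(time()): hash the current Unix timestamp, truncated to whole seconds
    return hashlib.md5(str(int(time.time())).encode()).hexdigest()

a = php_style_token()
b = php_style_token()   # same second, so usually the same token
time.sleep(1)
c = php_style_token()   # next second, so a different token
print(a == b, a == c)   # typically prints: True False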

Related

Why does URL return response from Postman but not Python?

I'm trying to pull data from this URL. It works in Postman but not locally in Python (3.8); the connection closes without a response. I also tried manually passing in my browser cookies but got the same result.
Below is the code generated by Postman after copying the XHR request as cURL (bash):
import requests
url = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?tradeDate=10/07/2022"
payload={}
headers = {
'Cookie': 'ak_bmsc=40201F1BC4FD8456EED03A38A16CBC95~000000000000000000000000000000~YAAQj2V0aGgBgqeDAQAAPzYKsxFH4BM3CXIxGLs0BpfFzUiVR7t+Ul6Q9U64ItnBxPPhosD8CEBZ03QGfv4XHioHnh1Hzn3E0Kc17EV4dAMLsUySAsUwh3Q+MD9zf5gNh4nCZXkoP+ChCHkYJ+uR1qxnPRZK8yu4USf8by8Js6LcoO3X2WPWkHw5LAsBcImL5hdhYDCX9n2bS3j/vHRyT2cg6iE0YLrAK6eLwgp6w8EFN9JhRKyL8AGYcYEJm6Rxk2EFQ62cG12uSW5pSl/h5yF/Z5qF8+0xXi3yhcBZ9vEvz9W8YPw9gbreYAvURg4wZtkxtxJyBkgfwlGkbc+NnzcErzlmH2b9ZYjs+vuP3GK0zP/c1e3BKgVEz/iQ; bm_sv=4E1D62DAE9E148686F96340796FD4A79~YAAQj2V0aDr/hKeDAQAAuChCsxGk32eAruqs2a29BNi48QW5E1rqQqbyowotXKQ1+hoMqvIsxi/uXHUQ+csp+U4/P6dMDker8yWYw80MxnzYfQ0k1UMD4VtKUGthUwGgBHrP42vpUbUMkVXVgjJh6OQrEwEFyP9T/wZGi8HraSMtkUJ2fmySYJtHS5Hvxr5oGlv9RtG2zlsq30gBxaJI1Y/j5HTh1hIKLsmI/VmrrTU9kI3M4zgoAF+TU8C1tWGG8bhr~1'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
EDIT: I had to modify the cURL as I was getting an error importing it into Postman. This is the cURL (bash) I used in Postman that returned the proper response:
curl 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?strategy=DEFAULT&tradeDate=10/07/2022&pageSize=500&isProtected&_t=1665158458937' \
Any ideas on how to fix the request? None of the other SO threads seemed to have the answer.
You need to provide the User-Agent header, and that's actually all you need for this URL:
import requests
AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'
URL = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?tradeDate=10/07/2022"
headers = {'User-Agent': AGENT}
(r := requests.get(URL, headers=headers)).raise_for_status()
print(r.json())
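Note that the walrus operator (:=) requires Python 3.8+, and raise_for_status() raises requests.HTTPError on a 4xx/5xx response rather than letting a failed request pass silently.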

Python cloudscraper requests slow, with 403 responses

I am using the Cloudscraper Python library in order to obtain a JSON response from a URL.
The problem is that I have to retry the same request 2-3 times before I get the correct output; the first responses come back with a 403 HTTP status code.
Here is my code:
import json
from time import sleep

import cloudscraper

url = "https://www.endpoint.com/api/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Accept": "*/*",
    "Content-Type": "application/json"
}

def get_json_response():  # wrapped in a function so the final return is valid
    json_response = 0
    while json_response == 0:
        try:
            scraper = cloudscraper.create_scraper()
            r = scraper.get(url, headers=headers)
            json_response = json.loads(r.text)
        except:
            print(r.status_code)
            sleep(2)
    return json_response
What can I do in order to optimize my code and prevent the 403 responses?
You could use a real browser to get around part of the bot detection; here is an example with Playwright:
import json

from playwright.sync_api import sync_playwright

API_URL = 'https://www.soraredata.com/api/players/info/29301348132354218386476497174231278066977835432352170109275714645119105189666'

with sync_playwright() as p:
    # Webkit is fastest to start and hardest to detect
    browser = p.webkit.launch(headless=True)
    page = browser.new_page()
    page.goto(API_URL)
    # Use evaluate instead of `content` not to import bs4 or lxml
    html = page.evaluate('document.querySelector("pre").innerText')
    try:
        data = json.loads(html)
    except:
        # Still might fail sometimes
        data = None

print(data)
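Note that Playwright's bundled browsers need to be installed once before this will run, e.g. with python -m playwright install webkit.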
The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.
If you have no authorization, I would suggest first of all checking whether the URL you are sending the request to needs any sort of permission to authorize the request.
However, you do get a response on the 2nd or 3rd try; what happens is that some servers take a few seconds before returning the answer, so they require the client to wait roughly 5 seconds before the response is delivered.
I would suggest adding a delay, which can be passed as an argument to create_scraper():
scraper = cloudscraper.create_scraper(delay=10)
If it is successful, then reduce the delay until it can no longer be reduced.
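As a rough sketch (the function name and retry limit are illustrative, and this assumes the endpoint behaves like the one above), the delay can be combined with a bounded retry loop:

import json
from time import sleep

import cloudscraper

def fetch_json(url, headers, max_retries=5):
    # Reuse a single scraper so any solved-challenge cookies are kept between tries
    scraper = cloudscraper.create_scraper(delay=10)
    for attempt in range(max_retries):
        r = scraper.get(url, headers=headers)
        if r.status_code == 200:
            return json.loads(r.text)
        print('attempt', attempt + 1, 'failed with HTTP', r.status_code)
        sleep(2)
    return None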

Python read Cookies from response?

In Python I have:
cookies = dict(PHPSESSID='PHPSESSID=djsgkjhsdjkhj34',
               authchallenge='sdifhshdfiuh34234234',
               rishum='skdhfuihuisdhf-' + '10403111')
try:
    response = requests.get(url, headers=headers, cookies=cookies, allow_redirects=False)
But I'm looking to use these cookie values for the first request only and then use the new ones the server sets. How can I do that?
The solutions I found don't use default cookies for the first request.
Note: I can't log in to the website automatically since it uses an auth challenge, so each time I log in manually and set those cookies for the first request only; then, when the server updates them, I want the updates to affect my current cookies.
Example of how my website works:
At first I log in using reCAPTCHA and then get temporary cookies.
For the first request in my app I want to use these temp cookies (I already know them).
Later, with each request, I need to use the cookies from the previous response (they change with each request).
My current code:
def main():
    start_time = time.time()
    keep_running = True
    while keep_running:
        keep_running = execute_data()
        time.sleep(5.0 - ((time.time() - start_time) % 5.0))

def execute_data():
    url = 'https://me.happ.com/rishum/register/confirm'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:84.0) Gecko/20100101 Firefox/84.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'close'
    }
    cookies = dict(rishum='dsfsdf21312zxcasd-' + '39480523')
    try:
        response = requests.get(url, headers=headers, cookies=cookies, allow_redirects=False)
You've almost got it but are a bit off on your dictionary implementation.
This is what you are looking for:
cookies = {
    "PHPSESSID": "djsgkjhsdjkhj34",
    "authchallenge": "sdifhshdfiuh34234234",
    "rishum": "skdhfuihuisdhf-" + "10403111"
}
try:
    response = requests.get(url, headers=headers, cookies=cookies, allow_redirects=False)
Edit: I see now that this isn't the issue, but rather that you want to update cookies during a session; here is a simple example of how to do so with requests.Session:
from requests import Session

s = Session()
s.cookies["foo"] = "bar"
r = s.get('https://google.com')

print("Before:")
for cookie in s.cookies:
    print(cookie)
print()

s.cookies["bo"] = "baz"

print("After: ")
for cookie in s.cookies:
    print(cookie)
Edit #2:
To further answer your question, here is a better example of how you can update cookies (all of them, if needed) in a loop:
from requests import Session, cookies

s = Session()
b = s.get('https://google.com')

for cookie in s.cookies:
    print(cookie.value)

# Iterate over a copy of the jar so we can safely modify it inside the loop
for cookie in list(s.cookies):
    # We already have this cookie's info in the `cookie` variable, so delete it from the cookie jar
    del s.cookies[cookie.name]
    # You can update the values HERE
    # ...
    # Example:
    cookie_value = cookie.value.upper()
    # Then save the new cookie to the cookie jar.
    updated_cookie = cookies.create_cookie(domain=cookie.domain, name=cookie.name, value=cookie_value)
    s.cookies.set_cookie(updated_cookie)

for cookie in s.cookies:
    print(cookie.value)
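Applied to the asker's flow (known temporary cookies on the first request only), the Session can simply be seeded once; a sketch with placeholder values, assuming the server refreshes the cookies via Set-Cookie on every response:

import requests

s = requests.Session()
# Seed the jar with the manually obtained cookie; it is only sent as-is
# until the server overwrites it in a response.
s.cookies.set('rishum', 'dsfsdf21312zxcasd-39480523')

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:84.0) Gecko/20100101 Firefox/84.0'}
response = s.get('https://me.happ.com/rishum/register/confirm',
                 headers=headers, allow_redirects=False)

# Any Set-Cookie headers from the response have updated s.cookies,
# so the next s.get() automatically uses the fresh values.
print(dict(s.cookies))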

POST request always returns "Disallowed Key Characters"

I want to retrieve atmospheric particulate matter values from a table (sadly the site is not in English, so feel free to ask about anything). I failed with the combination of BeautifulSoup and a GET request sent with requests, since the table is filled dynamically with Bootstrap and a parser like BeautifulSoup can't find values that haven't been inserted yet.
With Firebug I checked every angle of the page, and I found out that selecting a different day in the table sends a POST request (the site, as you can see in the Referer, is http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/, where the table is):
POST /temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini HTTP/1.1
Host: www.arpat.toscana.it
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/26-12-2016
Content-Length: 114
Cookie: [...]
DNT: 1
Connection: keep-alive
With the following params:
v_data_osservazione=26-12-2016&v_tipo_bollettino=regionale&v_zona=&csrf_test_name=b88d2517c59809a529b6f8141256e6ca
The data in the response are in JSON format.
So I started to craft my own POST request, in order to directly get the JSON data that fills the table.
In the params, in addition to the date, a csrf_test_name is required: here I discovered the site is protected against CSRF; in order to build correct params I need a CSRF token, so I perform a GET request to the site (see the Referer in the POST request for the URL) and read the CSRF token from the cookie like this:
r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
At the end of the day, with my CSRF token and POST request ready, I send it... and with status code 200 I always get Disallowed Key Characters.!
Searching for this error I always see posts about CodeIgniter, which (I think) is not what I need: I tried every combination of headers and parameters, yet nothing changed. Before giving up on BeautifulSoup and requests and starting to learn Selenium, I'd like to figure out what the problem is: Selenium is too high-level, while low-level libraries like BeautifulSoup and requests let me learn a lot of useful things, so I'd prefer to keep learning with these two.
Here's the code:
from requests import get, post
from bs4 import BeautifulSoup
import datetime
import json

url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/" # + %d-%m-%Y
yesterday = datetime.date.today() - datetime.timedelta(1)
date_object = datetime.datetime.strptime(str(yesterday), '%Y-%m-%d')
yesterday_string = str(date_object.strftime('%d-%m-%Y'))
full_url = url + yesterday_string
print("REFERER " + full_url)

r = get(url)
csrf_token = r.cookies["csrf_cookie_name"]
print(csrf_token)

# preparing headers for POST request
headers = {
    "Host": "www.arpat.toscana.it",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "X-Requested-With": "XMLHttpRequest", # XHR
    "Referer": full_url,
    "DNT": "1",
    "Connection": "keep-alive"
}

# preparing POST parameters (to be inserted in the request's body)
payload_string = "v_data_osservazione=" + yesterday_string + "&v_tipo_bollettino=regionale&v_zona=&csrf_test_name=" + csrf_token
print(payload_string)

# data -- (optional) Dictionary, bytes, or file-like object to send in the body of the Request.
# json -- (optional) json data to send in the body of the Request.
req = post("http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini",
           headers=headers, json=payload_string)

print("URL " + req.url)
print("RESPONSE:")
print('\t' + str(req.status_code))
print("\tContent-Encoding: " + req.headers["Content-Encoding"])
print("\tContent-type: " + req.headers["Content-type"])
print("\tContent-Length: " + req.headers["Content-Length"])
print('\t' + req.text)
This code works for me:
I use requests.Session() and it keeps all cookies
I use data= instead of json=
finally, I don't need all the commented-out elements
to compare the browser's requests with the code's requests I used the Charles web debugging proxy application
code:
import requests
import datetime

#proxies = {
#    'http': 'http://localhost:8888',
#    'https': 'http://localhost:8888',
#}

s = requests.Session()
#s.proxies = proxies # for test only

date = datetime.datetime.today() - datetime.timedelta(days=1)
date = date.strftime('%d-%m-%Y')

# --- main page ---

url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/index/regionale/"
print("REFERER:", url + date)
r = s.get(url)

# --- data ---

csrf_token = s.cookies["csrf_cookie_name"]

#headers = {
#    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
#    "Host": "www.arpat.toscana.it",
#    "Accept": "*/*",
#    "Accept-Language": "en-US,en;q=0.5",
#    "Accept-Encoding": "gzip, deflate",
#    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
#    "X-Requested-With": "XMLHttpRequest", # XHR
#    "Referer": url,
#    "DNT": "1",
#    "Connection": "keep-alive"
#}

payload = {
    'csrf_test_name': csrf_token,
    'v_data_osservazione': date,
    'v_tipo_bollettino': 'regionale',
    'v_zona': None,
}

url = "http://www.arpat.toscana.it/temi-ambientali/aria/qualita-aria/bollettini/aj_dati_bollettini"
r = s.post(url, data=payload) #, headers=headers)

print('Status:', r.status_code)
print(r.json())

Consecutive urllib2 POST gives 404

The problem that I have, and am trying to solve with Python, is making consecutive POST requests (completing an online form) on a website (specifically, a free online demo of an API at http://demo.travelportuniversalapi.com). I have not been able to get the results page so far; I have been at this for two days now.
The code I employ is:
import sys
import urllib, urllib2, cookielib
from BeautifulSoup import BeautifulSoup
import re

class website:
    def __init__(self):
        self.host = 'demo.travelportuniversalapi.com'
        self.ua = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0'
        self.session = cookielib.CookieJar()  # session becomes an instance of the cookielib CookieJar
        pass

    def get(self):
        try:
            url = 'http://demo.travelportuniversalapi.com/(S(cexfuhghvlzyzx5n0ysesra1))/Search'  # this varies every 20 minutes
            data = None
            headers = {'User-Agent': self.ua}
            request = urllib2.Request(url, data, headers)
            self.session.add_cookie_header(request)
            response = urllib2.urlopen(request)
            self.session.extract_cookies(response, request)

            url = response.geturl()
            data = {'From': 'lhr', 'To': 'ams', 'Departure': '9/4/2013', 'Return': '9/6/2013'}
            headers = {'User-Agent': self.ua,
                       "Content-type": "application/x-www-form-urlencoded; charset=UTF-8",
                       }
            request = urllib2.Request(url, urllib.urlencode(data), headers, 20)
            self.session.add_cookie_header(request)
            response = urllib2.urlopen(request, timeout=30)  # HTTP Error 404: Not Found - this is where the error occurs
            self.session.extract_cookies(response, request)
        except urllib2.URLError as e:
            print >> sys.stderr, e
            return None

rt = website()
rt.get()
The error that I receive at the last urllib2.Request is HTTP Error 404: Not Found. I am not sure my cookies are working.
Monitoring HTTP packets with a browser add-on, I noticed the following header when the POST is sent from the browser: 'X-Requested-With: XMLHttpRequest'. Is this relevant?
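If that header does matter, it can be added alongside the existing headers in the POST step of get() above; a sketch of just that change (whether this endpoint actually checks for it is only an assumption):

headers = {'User-Agent': self.ua,
           'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
           # header the browser sends with its AJAX POST
           'X-Requested-With': 'XMLHttpRequest'}
request = urllib2.Request(url, urllib.urlencode(data), headers, 20)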
