I am trying to access a site with bot prevention.
With the following script using requests I can access the site:
request = requests.get(url, headers={**HEADERS, 'Cookie': cookies})
and I get the desired HTML. But when I use aiohttp:
async def get_data(session: aiohttp.ClientSession, url, cookies):
    async with session.get(url, timeout=5, headers={**HEADERS, 'Cookie': cookies}) as response:
        text = await response.text()
        print(text)
I get the bot prevention page as a response.
These are the headers I use for all the requests:
HEADERS = {
    'User-Agent': 'PostmanRuntime/7.29.0',
    'Host': 'www.dnb.com',
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br'
}
I have compared the request headers of both requests.get and aiohttp and they are identical.
Is there any reason the results are different? If so, why?
EDIT: I've checked the httpx module, and the problem occurs there as well, both with httpx.Client() and httpx.AsyncClient().
response = httpx.request('GET', url, headers={**HEADERS, 'Cookie': cookies})
doesn't work either (not asynchronous).
I tried capturing packets with Wireshark to compare requests and aiohttp.
Server:
import http.server

server = http.server.HTTPServer(("localhost", 8080),
                                http.server.SimpleHTTPRequestHandler)
server.serve_forever()
with requests:
import requests
url = 'http://localhost:8080'
HEADERS = {'Content-Type': 'application/json'}
cookies = ''
request = requests.get(url,headers={**HEADERS,'Cookie': cookies})
requests packet:
GET / HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.27.1
Accept-Encoding: gzip, deflate, br
Accept: */*
Connection: keep-alive
Content-Type: application/json
Cookie:
with aiohttp:
import aiohttp
import asyncio
url = 'http://localhost:8080'
HEADERS = {'Content-Type': 'application/json'}
cookies = ''
async def get_data(session: aiohttp.ClientSession, url, cookies):
    async with session.get(url, timeout=5, headers={**HEADERS, 'Cookie': cookies}) as response:
        text = await response.text()
        print(text)

async def main():
    async with aiohttp.ClientSession() as session:
        await get_data(session, url, cookies)

asyncio.run(main())
aiohttp packet:
GET / HTTP/1.1
Host: localhost:8080
Content-Type: application/json
Cookie:
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.10 aiohttp/3.8.1
If the site seems to accept packets from requests, then you could try making the aiohttp packet identical by setting the headers:
HEADERS = {
    'User-Agent': 'python-requests/2.27.1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Content-Type': 'application/json',
    'Cookie': ''
}
If you haven't already, I suggest capturing the request with Wireshark to make sure aiohttp isn't altering your headers.
You can also try other user-agent strings, or send the headers in a different order. The order is not supposed to matter, but some sites check it anyway as part of their bot protection (for example in this question).
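For example, here is a minimal sketch of that idea, with the header names and values simply copied from the requests packet captured above (the URL points at the local test server, so swap in your real target and Cookie value):

import asyncio
import aiohttp

MATCHED_HEADERS = {
    'User-Agent': 'python-requests/2.27.1',
    # note: aiohttp can only decode a 'br' response body if the Brotli package is installed
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Content-Type': 'application/json',
    'Cookie': ''
}

async def main():
    async with aiohttp.ClientSession() as session:
        # passing the headers explicitly overrides aiohttp's auto-generated defaults
        async with session.get('http://localhost:8080', headers=MATCHED_HEADERS) as response:
            print(await response.text())

asyncio.run(main())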
Related
I am trying to log in to www.zalando.it using the requests library, but every time I try to post my data I get a 403 error. I looked at the login call in the network tab on Zalando and mine is the same. These are just dummy data; you can test by creating a test account.
Here is the code for the login function:
import requests
import data
from bs4 import BeautifulSoup
home_page_link = "https://www.zalando.it/"
login_api_schema = "https://accounts.zalando.com/api/login/schema"
login_api_post = "https://accounts.zalando.com/api/login"
headers = {
'Host': 'www.zalando.it',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Connection' : 'close',
'Upgrade-Insecure-Requests': '1'
}
with requests.Session() as s:
    s.headers.update(headers)
    r = s.get(home_page_link)

    # fetch these cookies: frsx, Zalando-Client-Id
    cookie_dict = s.cookies.get_dict()

    # update the headers
    # remove this header for the xhr requests
    del s.headers['Upgrade-Insecure-Requests']

    # these 2 are taken from some response cookies
    s.headers['x-xsrf-token'] = cookie_dict['frsx']
    s.headers['x-zalando-client-id'] = cookie_dict['Zalando-Client-Id']

    # I didn't pay attention to where these came from,
    # just saw them and manually added them
    s.headers['x-zalando-render-page-uri'] = '/'
    s.headers['x-zalando-request-uri'] = '/'

    # this is sent as a response header and is needed to
    # track future requests/responses
    s.headers['x-flow-id'] = r.headers['X-Flow-Id']

    # only accept json data from xhr requests
    s.headers['Accept'] = 'application/json'

    # when clicking the login button this request is sent
    # (I didn't test without this request)
    r = s.get(login_api_schema)

    # add an origin header
    s.headers['Origin'] = 'https://www.zalando.it'

    # finally log in; this should return a 201 response with a cookie
    login_data = {'email': data.email,
                  'request': data.request,
                  'secret': data.secret}
    r = s.post(login_api_post, json=login_data)

    print(r.status_code)
    print(r.headers)
I also used Fiddler to try to sniff the data traffic, but the HTTPS request is not performed and generates the following exception:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='accounts.zalando.com', port=443): Max retries exceeded with url: /api/login
(Caused by ProxyError('Your proxy appears to only use HTTP and not HTTPS, try changing your proxy URL to be HTTP. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#https-proxy-error-http-proxy',
SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1091)'))))
For the HTTP request I instead get a 301 error.
Maybe this answer helps you.
I think your website might detect whether a request is sent with JavaScript.
I'm able to successfully POST some CSV data to an external server using the httr package in R, with a POST request like so:
request <- POST(url, body = upload_file('my_table.csv'), verbose())
The detail provided by the verbose() option above tells me that the post request headers look like this:
-> User-Agent: libcurl/7.68.0 r-curl/4.3.1 httr/1.4.2
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
-> Content-Type: text/csv
-> Content-Length: 77
->
What I'm trying to do is emulate this code with the Python requests module (because the rest of the package is in Python), and I'm using the following code:
response = requests.post(url, files = {'file': open('my_table.csv','rb')})
However, I'm getting an Error: File type not supported error from the server when I do so. When I look at the details of my POST request in Python, I see the following headers:
{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '226', 'Content-Type': 'multipart/form-data; boundary=b8f99a72145547743d035be5d9c1e983'}
What is the cleanest way for me to upload and post this CSV data such that the server might accept it?
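One thing worth trying (a sketch, on the assumption that the server expects the raw CSV body with a text/csv content type, exactly as httr's upload_file() sends it) is to pass the file as the request body instead of as a multipart upload:

import requests

url = 'https://example.com/upload'  # hypothetical placeholder; use the same endpoint as the httr call

# Send the raw CSV bytes as the body and set Content-Type explicitly,
# so requests does not build a multipart/form-data request.
with open('my_table.csv', 'rb') as f:
    response = requests.post(url, data=f, headers={'Content-Type': 'text/csv'})

print(response.status_code, response.text)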
My goal is to access a website that uses HTTP authentication using Python. I can open the website from my web browser, and the headers tell me that I should use HTTPDigestAuth:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
Authorization: Digest username="user", realm="CG Downloads", nonce="cTN0pKxqBQA=e46ad250f42f73e9076ebc97c417f0d38bac094a", uri="/fileadmin/teaching/2017/WS/adip/exercises/adip-uebung-00-.pdf", algorithm=MD5, response="5a57ddbcd1b20444100a91b1967a2782", qop=auth, nc=00000001, cnonce="5a6b041b4113bb9a"
Connection: keep-alive
Cookie: fe_typo_user=76b7e7e25372f782d94e91b51b854568
Host: cg.cs.uni-bonn.de
Referer: http://cg.cs.uni-bonn.de/de/lehre/ws-2017/vorlesung-algorithmisches-denken-und-imperative-programmierung/
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
However, when I try to go to the page using requests and HTTPDigestAuth, I get "401 Unauthorized" as a response.
import logging
import requests
from requests.auth import HTTPDigestAuth
try:
    import httplib
except ImportError:
    import http.client as httplib
httplib.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)
url = 'http://cg.cs.uni-bonn.de/fileadmin/teaching/2017/WS/adip/exercises/adip-uebung-00-.pdf'
response = requests.get(url, auth=HTTPDigestAuth('user', 'pass'),
timeout=10)
print(response.status_code)
print(response.headers)
print(response.text)
Am I using the wrong authorization method or is my code wrong? I appreciate any help you can give me.
EDIT:
I am trying to access sites on cg.cs.uni-bonn.de, for example http://cg.cs.uni-bonn.de/fileadmin/teaching/2017/WS/adip/exercises/adip-uebung-00-.pdf
You just need to pass your actual values to auth=HTTPDigestAuth(...), like this:
user = 'admin'       # change to your username
password = '123456'  # change to your password
...
response = requests.get(url, auth=HTTPDigestAuth(user, password),
                        timeout=10)
...
Due to requests adding unwanted headers, I decided to prepare the request manually and use Session.send().
Sadly, the following code produces the wrong request:
import requests
ARCHIVE_URL = "http://10.0.0.10/post/tmp/archive.zip"
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
    'Host': '10.0.0.10'
}
DataToSend = 'data'
req = requests.Request('POST', ARCHIVE_URL, data=DataToSend, headers=headers)
prepped = req.prepare()
s = requests.Session()
response = s.send(prepped)
If I look at the request using Fiddler, I get this:
GET http://10.0.0.10/tmp/archive.zip HTTP/1.1
Accept-Encoding: identity
Connection: Keep-Alive
Host: 10.0.0.10
Cache-Control: no-cache
Content-Type: application/x-www-form-urlencoded
What am I missing?
Since the prepared request is not connected to the session when you use req.prepare() instead of s.prepare_request(req) (where s is the session), you must specify the request headers yourself; no defaults come from the session object.
Use s.prepare_request(req) instead of req.prepare(), or specify the headers dictionary explicitly.
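A minimal sketch of that suggestion, reusing the URL, body and headers from the question above:

import requests

ARCHIVE_URL = "http://10.0.0.10/post/tmp/archive.zip"
DataToSend = 'data'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
    'Host': '10.0.0.10'
}

s = requests.Session()
req = requests.Request('POST', ARCHIVE_URL, data=DataToSend, headers=headers)

# prepare_request() merges the session's default headers and cookies into the
# request, unlike req.prepare(), which knows nothing about the session.
prepped = s.prepare_request(req)
response = s.send(prepped)
print(response.status_code)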
POST /search HTTP/1.1
Host: chatango.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: cookies_enabled.chatango.com=yes; fph.chatango.com=http; id.chatango.com=programmable; auth.chatango.com={MY AUTH KEY - I already have this}
Connection: keep-alive
Referer: http://st.chatango.com/flash/sellers_external.swf
Content-Type: application/x-www-form-urlencoded
Content-Length: 27
s=B&ama=99&ami=13&t=25&f=20
I'd really like to know how to send this via Python. I haven't found anything except how to send the data part, and I don't understand how I'm supposed to send the cookie data, since I have it stored in a variable obtained through an API, which gets it through sockets.
You can add new headers in the request() method:
HTTPConnection.request(method, url[, body[, headers]])
See request documentation.
To add a cookie, just add the Cookie header.
Here is a POST example from the Python site:
import httplib, urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
headers = {"Content-type": "application/x-www-form-urlencoded",
"Accept": "text/plain"}
conn = httplib.HTTPConnection("musi-cal.mojam.com:80")
conn.request("POST", "/cgi-bin/query", params, headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
conn.close()
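For the specific request shown above, a sketch using Python 3's http.client might look like the following (auth_value is a hypothetical name for the variable that already holds your auth key):

import http.client
import urllib.parse

auth_value = '...'  # the auth key you already obtained through the API

params = urllib.parse.urlencode({'s': 'B', 'ama': 99, 'ami': 13, 't': 25, 'f': 20})
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://st.chatango.com/flash/sellers_external.swf',
    # the Cookie header is just a string, so the stored value can be concatenated in
    'Cookie': 'cookies_enabled.chatango.com=yes; fph.chatango.com=http; '
              'id.chatango.com=programmable; auth.chatango.com=' + auth_value,
}

conn = http.client.HTTPConnection('chatango.com', 80)
conn.request('POST', '/search', params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
conn.close()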