I am trying to send a request with the POST method to an API. My code looks like the following:
import urllib.request
import json
url = "https://api.cloudflareclient.com/v0a745/reg"
referrer = "e7b507ed-5256-4bfc-8f17-2652d3f0851f"
body = {"referrer": referrer}
data = json.dumps(body).encode('utf8')
headers = {'User-Agent': 'okhttp/3.12.1'}
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
status_code = response.getcode()
print (status_code)
Actually it works fine, but I want to use the "requests" library instead, as it's more up to date and more flexible with proxies. Here is that code:
import requests
import json
url = "https://api.cloudflareclient.com/v0a745/reg"
referrer = "e7b507ed-5256-4bfc-8f17-2652d3f0851f"
data = {"referrer": referrer}
headers = {'User-Agent': 'okhttp/3.12.1'}
req = requests.post(url, headers=headers, json=data)
status_code = req.status_code
print (status_code)
But it returns a 403 status code. How can I fix it?
Keep in mind that this API is open to everyone, so you can just run the code with no worries.
EDIT-1: I have tried removing json.dumps(body).encode('utf8') (or just .encode('utf8')) from the second code, following @tomasz-wojcik's advice, but I am still getting 403 while the first code still works!
EDIT-2: I tried making the request with Postman, which succeeded and returned a 200 status code. Postman generated the following Python code:
import requests
url = "https://api.cloudflareclient.com/v0a745/reg"
payload = "{\"referrer\": \"e7b507ed-5256-4bfc-8f17-2652d3f0851f\"}"
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'okhttp/3.12.1',
    'Host': 'api.cloudflareclient.com'
}
response = requests.request("POST", url, headers=headers, data=payload)
status_code = response.status_code
print (status_code)
If you run that code outside of Postman, it still returns a 403 status code. I'm a little confused; I'm starting to think that maybe the "requests" library isn't actually changing the User-Agent in the second code.
EDIT-3: I have looked into it and found out that it works on Python 2.7.16 but doesn't work on Python 3.8.5!
EDIT-4: Some developers report that the second code works on Python 3.6 too, but the main question is why it works on other versions and not on 3.8 or 3.7.
Python versions that returned a 403 status code (second code): 3.8.5 & 3.7
Python versions that returned a 200 status code (second code): 3.6 & 2.7.16
The issue seems to be with how the host is handling SSL. Newer versions of requests use certifi, which in your case is having issues with the host server. I downgraded requests to an earlier version (2.1.0) and it worked. You can pin the version in your requirements.txt and it should work with any Python version.
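For example, the pin in requirements.txt would look like this (2.1.0 is simply the version that worked here; adjust as needed):
# requirements.txt
requests==2.1.0
Then reinstall with pip install -r requirements.txt.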
https://requests.readthedocs.io/en/master/user/advanced/#ca-certificates
Before version 2.16, Requests bundled a set of root CAs that it trusted, sourced from the Mozilla trust store.
The certificates were only updated once for each Requests version. When certifi was not installed, this led to extremely out-of-date certificate bundles when using significantly older versions of Requests.
For the sake of security we recommend upgrading certifi frequently!
Related
I am trying to make a request to a server using Python Requests and it returns a 403. The page works fine using my browser and using urllib.
The headers are identical. I even tried using an ordered dict to make sure the header ordering is identical, but it still won't work.
Then I tried to see the SSL differences, and found that the main difference between the 3 (my browser, requests, and urllib) is that requests doesn't support TLS session tickets.
url="https://www.howsmyssl.com/a/check"
import requests
req = requests.get(url=url)
print(req.text)
import urllib.request
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
print(response.read())
The Cipher Suite is almost identical across the 3. The TLS version is 1.3 across all. But the session_ticket_supported is true only for the browser and urllib (both of which work) and is false for requests (which returns 403).
So I assumed that the problem is there.
I dug deeper and learned that requests is actually using urllib3, but I got stuck at confirming which SSL adapter they use and how to configure it.
Any ideas on how to enable TLS session tickets for requests? Or maybe I am looking in the wrong place here?
PS. I am using Python 3.9.13 and the latest versions for all packages
PSS. curl also supports session tickets on my system and can access the server fine
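For anyone poking at the same thing, here is a rough sketch of where the TLS configuration can be injected into requests: a transport adapter that passes a custom ssl.SSLContext down to urllib3. SSLContextAdapter is a made-up name for this example, and this alone does not guarantee session-ticket reuse; it only shows where the SSL settings live.
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager

class SSLContextAdapter(HTTPAdapter):
    # Transport adapter that hands a custom ssl.SSLContext down to urllib3.
    def __init__(self, ssl_context=None, **kwargs):
        self._ssl_context = ssl_context
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        kwargs['ssl_context'] = self._ssl_context
        self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                       block=block, **kwargs)

ctx = ssl.create_default_context()
# Tweak ctx here (ciphers, options, ...); whether session tickets end up being
# used still depends on the underlying OpenSSL build.

session = requests.Session()
session.mount('https://', SSLContextAdapter(ssl_context=ctx))
print(session.get('https://www.howsmyssl.com/a/check').text)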
I have repeatedly received <Response [403]> despite adding headers obtained from the Chrome developer tools. Would someone with more experience be able to tell me if it's possible to access the following URL with Python Requests? And if not, is there a suggested alternative approach?
url='https://www.phosphosite.org/proteinAction.action?id=5848&showAllSites=true'
from bs4 import BeautifulSoup
import requests
url='https://www.phosphosite.org/proteinAction.action?id=5848&showAllSites=true'
headers={'user-agent': 'Mozilla/5.0....'}
result = requests.get(url, headers=headers)
print(result.request.headers)
print(result)
That happens because your header key is lowercase.
How to solve it?
Change your header key from:
'user-agent'
to:
'User-Agent'
and you will get a <Response [200]> result.
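Applied to the snippet above:
import requests
url = 'https://www.phosphosite.org/proteinAction.action?id=5848&showAllSites=true'
headers = {'User-Agent': 'Mozilla/5.0....'}  # capitalised header key
result = requests.get(url, headers=headers)
print(result)  # <Response [200]>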
This worked, though I had to exit all sites and restart my kernel in order to get the 200 response.
import cloudscraper
scraper = cloudscraper.create_scraper(browser={'custom': 'ScraperBot/1.0','platform':'windows','mobile':'False'}) # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("https://www.phosphosite.org/proteinAction.action?id=5848&showAllSites=true").text) # => "<!DOCTYPE html><html><head>..."
I am trying to log in to a website by passing a username and password. It says the session cookie is missing. I am a beginner with APIs, so I don't know if I have missed something here. The website is http://testing-ground.scraping.pro/login
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr':'admin','pwd':'12345'})
print(req.data.decode('utf-8'))
There are two issues in your code that make you unable to log in successfully.
The content-type issue
In your code you are using urllib3 to send data with the content type multipart/form-data. The website, however, seems to only accept the content type application/x-www-form-urlencoded.
Try the following cURL commands:
curl -v -d "usr=admin&pwd=12345" http://testing-ground.scraping.pro/login?mode=login
curl -v -F "usr=admin" -F "pwd=12345" http://testing-ground.scraping.pro/login?mode=login
For the first one, the content-type in your request header is application/x-www-form-urlencoded, so the website takes it and logs you in (with a 302 Found response).
The second one, however, sends data with content-type multipart/form-data. The website doesn't take it and therefore rejects your login request (with a 200 OK response).
The cookie issue
Another issue is that urllib3 follows redirects by default. More importantly, cookies are not handled (i.e. stored and sent in the following requests) by urllib3 by default. Thus, the second request won't contain the cookie tdsess=TEST_DRIVE_SESSION, and therefore the website returns the message that you're not logged in.
If you only care about the login request, you can try the following code:
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr':'admin','pwd':'12345'}, encode_multipart=False, redirect=False)
print(req.data.decode('utf-8'))
The encode_multipart=False instructs urllib3 to send data with content-type application/x-www-form-urlencoded; the redirect=False tells it not to follow the redirect, so that you can see the response of your initial request.
If you do want to complete the whole login process, however, you need to save the cookie from the first response and send it in the second request. You can do that with urllib3 by hand, as sketched below, or use the Requests library as described next.
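A rough (untested) sketch of the manual approach with urllib3, reading the redirect target from the Location header instead of hard-coding it:
import urllib3
from urllib.parse import urljoin

http = urllib3.PoolManager()
login_url = 'http://testing-ground.scraping.pro/login?mode=login'

# First request: log in, but don't follow the redirect so Set-Cookie stays visible.
login = http.request('POST', login_url,
                     fields={'usr': 'admin', 'pwd': '12345'},
                     encode_multipart=False, redirect=False)
cookie = login.headers.get('Set-Cookie', '').split(';')[0]  # e.g. tdsess=TEST_DRIVE_SESSION

# Second request: follow the redirect manually, sending the cookie back.
next_url = urljoin(login_url, login.headers.get('Location', ''))
welcome = http.request('GET', next_url, headers={'Cookie': cookie})
print(welcome.data.decode('utf-8'))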
Use the Requests library
I'm not sure if you have any particular reason to use urllib3. Urllib3 will definitely work if you implement it well, but I would suggest trying the Requests library, which is much easier to use. For your case, the following code with Requests will work and get you to the welcome page:
import requests
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = requests.post(url, data={'usr':'admin','pwd':'12345'})
print(req.text)
import requests
auth_credentials = ("admin", "12345")
url = "http://testing-ground.scraping.pro/login?mode=login"
response = requests.post(url=url, auth=auth_credentials)
print(response.text)
I have noticed that for some websites' API URLs, the response in the browser is returned via a service worker, which has caused problems in scraping those APIs.
For example, consider the following:
https://www.sephora.co.id/api/v2.3/products?filter[category]=makeup/face/bronzer&page[size]=30&page[number]=1&sort=sales&include=variants,brand
The data appears when the URL is pasted into a browser. However, it gives me a 422 error when I try to automate the collection of that data in Python with the following code:
import requests
#API url
url = 'https://www.sephora.co.id/api/v2.3/products?filter[category]=makeup/face/bronzer&page[size]=30&page[number]=1&sort=sales&include=variants,brand'
#The response is always 422
response = requests.get(url)
I have noticed that calling the API URL in the browser returns a response via a service worker. Therefore my question is: is there a way around this to get a 200 response via the Python requests library?
The server appears to require the Accept-Language header.
The code below now returns 200.
import requests
url = 'https://www.sephora.co.id/api/v2.3/products?filter[category]=makeup/face/bronzer&page[size]=30&page[number]=1&sort=sales&include=variants,brand'
headers = {'Accept-Language': 'en-gb'}
response = requests.get(url, headers=headers)
(Ascertained by checking a successful request via a browser, adding all its headers AS IS to the Python request, and then removing them one by one.)
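That elimination can also be scripted. A rough sketch follows; the header set here is illustrative, and in practice you would paste everything the browser actually sent:
import requests

url = 'https://www.sephora.co.id/api/v2.3/products?filter[category]=makeup/face/bronzer&page[size]=30&page[number]=1&sort=sales&include=variants,brand'
# Illustrative subset; in practice, copy the full header set from the
# browser's developer tools (Network tab) for a successful request.
browser_headers = {
    'Accept-Language': 'en-gb',
    'User-Agent': 'Mozilla/5.0',
}

# Drop one header at a time and see which removal breaks the request.
for name in browser_headers:
    trimmed = {k: v for k, v in browser_headers.items() if k != name}
    status = requests.get(url, headers=trimmed).status_code
    print('without', name, '->', status)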
I am trying to log into a website. When I look at print(g.text) I am not getting back the web page I expect, but instead a Cloudflare page that says 'Checking your browser before accessing'.
import requests
import time
s = requests.Session()
s.get('https://www.off---white.com/en/GB/')
headers = {'Referer': 'https://www.off---white.com/en/GB/login'}
payload = {
    'utf8': '✓',
    'authenticity_token': '',
    'spree_user[email]': 'EMAIL#gmail.com',
    'spree_user[password]': 'PASSWORD',
    'spree_user[remember_me]': '0',
    'commit': 'Login'
}
r = s.post('https://www.off---white.com/en/GB/login', data=payload, headers=headers)
print(r.status_code)
g = s.get('https://www.off---white.com/en/GB/account')
print(g.status_code)
print(g.text)
Why is this occurring when I have set the session?
You might want to try this:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text) # => "<!DOCTYPE html><html><head>..."
It does not require a Node.js dependency.
All credit goes to this PyPI page.
This is due to the fact that the page uses Cloudflare's anti-bot page (or IUAM).
Bypassing this check is quite difficult to do on your own, since Cloudflare changes its techniques periodically. Currently, they check whether the client supports JavaScript, which can be spoofed.
I would recommend using the cfscrape module to bypass this. To install it, use pip install cfscrape. You'll also need to install Node.js.
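Basic usage looks roughly like this (somesite.com is a placeholder):
import cfscrape
scraper = cfscrape.create_scraper()  # behaves like a requests.Session
print(scraper.get('http://somesite.com').text)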
You can pass a requests session into create_scraper() like so:
import cfscrape
import requests
session = requests.Session()
session.headers = ...
scraper = cfscrape.create_scraper(sess=session)
I had the same problem because they implemented Cloudflare on the API; I solved it this way:
import cloudscraper
import json
scraper = cloudscraper.create_scraper()
r = scraper.get("MY API").text
y = json.loads(r)
print (y)
You can scrape any Cloudflare-protected page by using this tool. Node.js is mandatory for the code to work correctly.
Download Node.js from this link: https://nodejs.org/en/
import cfscrape #pip install cfscrape
scraper = cfscrape.create_scraper()
res = scraper.get("https://www.example.com").text
print(res)
curl and hx avoid this problem. But how?
I found that they work with HTTP/2 by default, while the requests library uses only HTTP/1.1.
So, for testing, I installed httpx with the h2 Python library (to support HTTP/2 requests), and it works if I do: httpx --http2 'https://some.url'.
So the solution is to use a library that supports HTTP/2, for example httpx with h2.
It's not a complete solution, since it won't help solve Cloudflare's anti-bot ("I'm Under Attack Mode", or IUAM) challenge.
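From Python, the equivalent of that CLI call is roughly the following (HTTP/2 support needs the optional extra, pip install "httpx[http2]"; https://some.url is the same placeholder as above):
import httpx

with httpx.Client(http2=True) as client:
    response = client.get('https://some.url')  # placeholder URL
    print(response.http_version)  # shows "HTTP/2" when the server negotiates it
    print(response.status_code)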