Python3 - Requests - BS4 - Cloudflare -> 403 Forbidden when not using a local proxy

The code below isn't working. It gets a 403 error because the site uses Cloudflare.
When I route the request through any local HTTP proxy (Burp Suite, Fiddler, etc.), I can see the csrfToken and it works.
Why does it only work through a local proxy?
import requests
from bs4 import BeautifulSoup
headerIstek = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041",
    "Sec-Fetch-Site": "none",
    "Accept-Language": "tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7"
}
istekLazim = {"ref":"","display_type":"popup","loc":""}
istekLogin = requests.get("https://www.example.com/join/login-popup/", headers=headerIstek, cookies={"ud_rule_vars":""}, params=istekLazim, verify=False)
soup = BeautifulSoup(istekLogin.text, "html.parser")
print(istekLogin.request.headers)
csrfToken = soup.find("input", {"name":"csrfmiddlewaretoken"})["value"]
print(csrfToken)

Cloudflare performs JavaScript checks in the browser and returns a session if the checks succeed. If you want to run a one-off script to download content from a Cloudflare-protected server, add a session cookie from a previously validated session you obtained with your browser.
The session cookie is named __cfduid. You can get it by fetching a resource with your browser and then opening the developer tools and the Network panel. Once you inspect the request, you can see the cookies your browser sent to the server.
Then you can use that cookie for requests using your script:
cookies = {
    "__cfduid": "xd0c0985ed80ffbc4dd29d1612168766",
}
response = requests.get(image_url, cookies=cookies)
response.raise_for_status()

Related

Heroku BeautifulSoup + cloudscraper don't bypass cloudflare on server side

I'm using BeautifulSoup + cloudscraper to scrape a site. The problem is that it works locally but doesn't work on a Heroku server.
It looks like when I launch the script on the Heroku server, JS or cookies are not enabled. That's why cloudscraper can bypass Cloudflare locally but not on Heroku.
My code:
import requests
import cloudscraper
from bs4 import BeautifulSoup
session = requests.session()
scraper = cloudscraper.create_scraper(browser='chrome', sess=session)
contract_page = scraper.get(
    "https://bscscan.com/token/0x30e650783b4046c64dcf3b7b78854f3d4a87b058",
    headers={
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
        'Cache-Control': "no-cache",
    })
soupa = BeautifulSoup(contract_page.content, 'html.parser')
print(soupa)
tokenholders = soupa.find(id='ContentPlaceHolder1_tr_tokenHolders').get_text()
Printing soupa gives me this HTML page:
Does anyone have an idea how to enable JS or cookies on a Heroku server that runs the script?
After many tries I found a solution: to bypass it, we have to use a proxy to change the IP of the Heroku server.
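As a minimal sketch of that fix: requests routes traffic per URL scheme through a proxies mapping, and cloudscraper wraps a requests session, so the same session can be reused. The proxy URL below is hypothetical; substitute one you actually control.

```python
import requests

# Hypothetical proxy endpoint -- replace with a real proxy you control.
proxy_url = "http://user:pass@proxy.example.com:8080"

session = requests.Session()
# Route both http and https traffic through the proxy.
session.proxies.update({"http": proxy_url, "https": proxy_url})

# The same session can then be handed to cloudscraper:
#   scraper = cloudscraper.create_scraper(browser='chrome', sess=session)
#   contract_page = scraper.get("https://bscscan.com/token/0x30e650783b4046c64dcf3b7b78854f3d4a87b058")
```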

Web Scraping - Cloudflare Issues

I am trying to scrape https://www.carsireland.ie/search#q?%20scraper%20python=&toggle%5Bpoa%5D=false&page=1 (I had built a scraper, but then they did a total overhaul of their website). The new website has a new format and uses Cloudflare for the usual security. The following code returns a 403 error, specifically referencing this error:
"https://www.cloudflare.com/5xx-error-landing"
The code which I have built so far is as follows:
from requests_html import HTMLSession
session = HTMLSession()
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'referer': 'https://www.google.com/'
}
# url of search page
url = 'https://www.carsireland.ie/search#q?sortBy=vehicles_prod%2Fsort%2Fpoa%3Aasc%2Cupdated%3Adesc&page=1'
# create a session with the url
r = session.get(url, headers=header)
# render the url
data = r.html.render(sleep=1, timeout=20)
# Check the response
print(r.text)
I would really appreciate any help to correct the Cloudflare issues I am having.
This problem can be fixed by simply changing the referer property in the header to the link you are going to scrape.
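As a sketch, here is the asker's header dict with the referer pointed at the target site instead of google.com. No request is sent; preparing the request locally just shows the headers that would go out.

```python
import requests

url = 'https://www.carsireland.ie/search'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    # The referer now points at the site being scraped, not google.com.
    'referer': 'https://www.carsireland.ie/',
}

# Prepare the request locally to inspect the headers that would be sent.
req = requests.Request('GET', url, headers=header).prepare()
print(req.headers['referer'])  # https://www.carsireland.ie/
```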

Unable to fetch a response - Request library Python

I am unable to fetch a response from this URL, although it works in the browser, even in incognito mode. I'm not sure why it is not working: the request just keeps running without any output and without errors. I even tried setting the 'user-agent' request header, but again received no response.
Following is the code used:
import requests
response = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ')
print(response.text)
I want html text from the response page for further use.
The server is checking whether you are sending the request from a web browser. If not, it returns nothing. Try this:
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
r=requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ', timeout=3, headers=headers)
print(r.text)

How to implement ajax request using Python Request

I'm trying to log into a website using Python requests. Unfortunately, it always shows this error when printing the content:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
For reference, my code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
May I know what is missing and how I can get the HTML of the welcome page from this?
UPDATE
I'm trying to make an API call after login authentication. However, I'm not able to get the login authentication to succeed, so I can't get the response from the API call. My guess is that it fails because of multi-factor authentication. I need to know how I can implement this.
For example: www.abc.com is the URL of the website. The login is done through a JS form submission, hence the URL is specified in the ajax part. On success, a third authentication party (Okta) also verifies the credentials, and finally the home page is reached. Then I need to call the real API for my task.
But it is not working.
import requests
import sys

class Login:
    def sendRequestWithAuthentication(self, loginDetails, requestDetails):
        user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
        action_url = loginDetails['action_url'] if 'action_url' in loginDetails.keys() else None
        pay_load = loginDetails['payload'] if 'payload' in loginDetails.keys() else None
        session_requests = requests.session()
        if action_url and pay_load:
            act_resp = session_requests.post(action_url, data=pay_load, headers=user_agent, verify=False, files=[])
            print(act_resp)
            auth_cookies = act_resp.cookies.get_dict()
            url, method, request_payload = requestDetails['url'], requestDetails['method'], requestDetails['payload']
            querystring = requestDetails['querystring']
            response = session_requests.get(url, headers=user_agent, cookies=auth_cookies, data=request_payload, params=querystring)
            print(response)
            return response.json()
In the above, the action URL is the API given in the ajax part, and in the second request the URL is the API address for that GET.
In short, may I know how I can implement multi-factor authentication with Python requests?
My doubts:
Do we need the cookies from the login form page included in the login request?
How do I implement multi-factor authentication with Python requests? (Here we don't need any PIN or similar; it is done through RSA.) Is a certificate needed for login, since it now raises "unable to validate the SSL certificate"?
Can you give a dummy example API that implements this kind of scenario?
No, you're making it too complex. This code worked:
import requests
login_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php"
welcome_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php"
payload = 'user_email=test#phpzag.com&password=test&login_button='
login_headers = {
    'x-requested-with': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',  # it's urlencoded instead of form-data
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
}
s = requests.Session()
login = s.post(login_url, headers=login_headers, data=payload) # post requests
welcome = s.get(welcome_url, headers=login_headers)
print(welcome.text)
Result:
.....Hello, <br><br>Welcome to the members page.<br><br>
TL;DR
Change the part of your code that says data=payload to json=payload, and it should work.
Direct answer to your question
How [does one] implement [an] AJAX request using Python Requests?
You cannot do that. An AJAX request is specifically referring to a Javascript-based HTTP request. To quote from W3 school's AJAX introduction page, "AJAX = Asynchronous JavaScript And XML".
Indirect answer to your question
What I believe you're asking is how to perform auth/login HTTP requests using the popular python package, requests. The short answer— unfortunately, and like most things—is that it depends. Various auth pages handle the auth requests differently, and so you might have to do different things in order to authenticate against the specific web service.
Based on your code
Based on your code, I'm going to assume that the login page is looking for a POST request with the authentication details (e.g. credentials) in the form of a JSON object, and that the 406 error coming back means you're sending data with an Accept header that doesn't align with how the server wants to respond.
When using requests, using the data parameter to the request function will send the data "raw"; that is, it'll send it in the native data format it is (like in cases of binary data), or it'll translate it to standard HTML form data if that format doesn't work (e.g. key1=value1&key2=value2&key3=value3, this form has the MIME type of application/x-www-form-urlencoded and is what requests will send when data has not been specified with an accept header). I'm going to make an educated guess based on the fact that you put your credentials into a dictionary that the login form is expecting a POST request with a JSON-formatted body (most modern web apps do this), and you were under the impression that setting the data parameter to requests will make this into a JSON object. This is a common gotcha/misconception with requests that has bitten me before. What you want is instead to pass the data using the json parameter.
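The difference can be seen without sending anything, by preparing the two requests locally and comparing the Content-Type each one sets (the URL and credentials below are placeholders):

```python
import requests

payload = {'user_email': 'user@example.com', 'password': 'secret'}  # placeholder credentials
login_url = 'https://example.com/login'  # placeholder URL

# data= encodes the dict as an HTML form body
form_req = requests.Request('POST', login_url, data=payload).prepare()
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded

# json= serializes the dict to a JSON body
json_req = requests.Request('POST', login_url, json=payload).prepare()
print(json_req.headers['Content-Type'])  # application/json
```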
Your code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
Fixed (and cleaned up) code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Test script to login to php web app.
"""
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {
    'user_email': 'test#phpzag.com',
    'password': 'test'
}
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
session = requests.Session()
auth_response = session.post(
    url=LOGIN_URL,
    json=payload,  # <--- THIS IS THE IMPORTANT BIT. Note: data param changed to json param
    headers=headers
)
response = session.get(
    'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',
    headers=headers,
    cookies=auth_response.cookies.get_dict()  # TODO: not sure this is necessary, since the session object should maintain the cookies throughout the session
)
print(response.content)
Check out this section of the requests documentation on POST requests, if you scroll down a bit from there you'll see the docs talk about the github API which expects JSON and how to handle that.
Auth can be tricky overall. Sometimes things will want "basic auth", which requests will expect you to pass as a tuple to the auth parameter, sometimes they'll want a bearer token / OAUTH thing which can get headache-inducing-ly complicated/annoying.
Hope this helps!
You are missing the user agent that the server (Apache?) requires.
Try this:
import requests
from requests import Session
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
s = requests.Session()
x=s.get(URL, headers=user_agent)
x=s.post(LOGIN_URL, data=payload, headers=user_agent)
print(x.content)
print(x.status_code)
Take a look at Requests: Basic Authentication
import requests
requests.post(URL, auth=('user', 'pass'))
# If there are some cookies you need to send
cookies = dict(cookies_are='working')
requests.post(URL, auth=('user', 'pass'), cookies=cookies)

403 Forbidden Error when scraping a site, user-agents already used and updated. Any ideas?

As the title above states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've got a user agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a generic requests.get.
Here's the code I'm using, which 403s. Same result with requests.get etc.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)
So yeah, I assume I'm not setting the user agent right, so the site can tell I am scraping. But I'm not sure what I'm missing, or how to find out.
I got all the headers from DevTools and started removing them one by one. I found that the request needs only Accept-Language; it needs neither User-Agent nor a Session.
import requests
url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
