Web Scraping - Cloudflare Issues - python

I am trying to scrape https://www.carsireland.ie/search#q?%20scraper%20python=&toggle%5Bpoa%5D=false&page=1 (I had built a scraper but then they did a total overhaul of their website). The new website has a new format and has Cloudflare to provide the usual security. I have the following code which returns a 403 error, particularly referencing this error:
"https://www.cloudflare.com/5xx-error-landing"
The code which I have built so far is as follows:
from requests_html import HTMLSession
session = HTMLSession()
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'referer': 'https://www.google.com/'
}
# url of search page
url = 'https://www.carsireland.ie/search#q?sortBy=vehicles_prod%2Fsort%2Fpoa%3Aasc%2Cupdated%3Adesc&page=1'
# create a session with the url
r = session.get(url, headers=header)
# render the url
data = r.html.render(sleep=1, timeout=20)
# Check the response
print(r.text)
I would really appreciate any help in resolving the Cloudflare issues I am having.

This problem can be fixed by simply changing the referer property in the header to the URL you are going to scrape.
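As a rough sketch of that suggestion, the referer can be derived from the target URL instead of being hard-coded. The helper names `build_headers` and `fetch` are made up for illustration, and there is no guarantee that this header alone satisfies Cloudflare on any given site:

```python
def build_headers(url):
    """Return request headers whose referer is the page being scraped."""
    return {
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        ),
        # The change suggested above: point the referer at the target page
        # rather than at https://www.google.com/.
        "referer": url,
    }


def fetch(url):
    """Hypothetical helper: GET the page with the headers built above."""
    # Imported here so build_headers() can be used without requests_html.
    from requests_html import HTMLSession

    session = HTMLSession()
    return session.get(url, headers=build_headers(url))
```

Calling `fetch(url)` with the search URL from the question would then send that same URL as the referer.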

Related

I cannot parse a JavaScript rendered webpage with Python Requests

I tried to write a little app for parsing this page: https://apps.microsoft.com/store/category/Business
I cannot get a full html code. The tag body is not full.
import requests
def get_data(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    }
    req = requests.get(url, headers=headers)
    with open("index.html", "w") as file:
        file.write(req.text)
get_data("https://apps.microsoft.com/store/category/Business")
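You cannot parse this page from the response that requests returns, because the page is rendered client-side by JavaScript: requests downloads only the initial HTML and never runs the scripts that fill in the body.
You need a tool that drives a real browser, such as:
pyppeteer
Selenium
Alternatively, try to reverse engineer the page and call its APIs directly.
(Or check whether Microsoft has a public API you can call to get the info you want.)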
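A minimal sketch of the Selenium route, assuming Selenium 4 and a local Chrome installation are available. The function names are illustrative, and waiting for `document.readyState` is only a rough signal: a page that loads its data later may need an explicit wait for a specific element instead.

```python
def render_page(url, timeout=20):
    """Load url in headless Chrome and return the HTML after scripts ran."""
    # Imported lazily: Selenium and a local Chrome install are assumed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the browser reports the document as fully loaded.
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        return driver.page_source
    finally:
        driver.quit()


def save_html(html, path="index.html"):
    """Write the page to disk as UTF-8 so non-ASCII text survives."""
    with open(path, "w", encoding="utf-8") as file:
        file.write(html)
```

With these, `save_html(render_page("https://apps.microsoft.com/store/category/Business"))` would replace the `get_data` call from the question.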

Is there a way to change the geolocation of the request in requests_html?

I am trying to scrape some Facebook post stats (number of reactions, comments, views or shares) with requests_html package. I used this approach because I had to render the initial page source since Facebook uses some lazy loading scripts.
I managed to do it but it depends on the geolocation. For example, if I make the request from my local machine the comment count span will be in my country's language, but if I use a UK VPN it will be displayed in English (E.g. '5 comments').
The goal is to deploy the code to the cloud and standardize it on a single language so it is more robust.
I tried sending the Accept-Language header as shown in the snippet below, but without success.
Please note that the xpath might change if the page structure changes.
from requests_html import HTMLSession
session = HTMLSession()
headers = {
    'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8',
    'Content-type': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}
r = session.get(url, headers=headers)
r.html.render(timeout=10, wait=2, sleep=2)
xpath = '/html/body/div[1]/div/div[1]/div/div[3]/div/div/div/div[1]/div[2]/div[1]/div/div/div[1]/div[2]/div[2]/div/div/div[2]/div/div[3]/span/div/span'
comments = r.html.xpath(xpath, first=True).text
Thank you in advance!

503 Error When Trying To Crawl One Single Website Page | Python | Requests

Goal:
I am trying to scrape the HTML from this page: https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=.
(note - I will eventually want to paginate and scrape all job listings from this page)
My issue:
I get a 503 error when I try to scrape the page using Python and Requests. I am working out of Google Colab.
Initial Code:
import requests
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = requests.get(url)
print(response)
Attempted solutions:
Using 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
Implementing this code I found in another thread:
import requests
def getUrl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    }
    res = requests.get(url, headers=headers)
    res.raise_for_status()
getUrl('https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=')
I am able to access the website via my browser.
Is there anything else I can try?
Thank you
That page is protected by Cloudflare. There are a few options for getting past it; using cloudscraper seems to work:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = scraper.get(url).text
print(response)
In order to use it, you'll need to install it:
pip install cloudscraper
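Before reaching for a bypass, it can help to confirm that the 503 really comes from Cloudflare rather than from the site itself. The helper below is a heuristic sketch, not an official check: it relies on the `Server: cloudflare` and `CF-RAY` response headers that Cloudflare normally adds, and the function name is invented for illustration.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: was this error response served by Cloudflare itself?"""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in lowered.get("server", "").lower() or "cf-ray" in lowered
    )
    return status_code in (403, 503) and served_by_cloudflare
```

With requests this would be called as `looks_like_cloudflare_block(response.status_code, response.headers)`.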

How to implement ajax request using Python Request

I'm trying to log into a website using Python Requests. Unfortunately, it always shows this error when I print the response content:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
For reference, my code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
May I know what is missing, and how I can get the HTML of the welcome page?
UPDATE
I'm trying to make an API call after login authentication. However, the login itself does not succeed, so I cannot get a response from the API call. I think it fails because of multi-factor authentication. How can I implement this?
For example, say www.abc.com is the URL of the website. The login is done through a JavaScript form submission, so the login URL is specified in the ajax part. When that succeeds, a third-party authentication service (Okta) also verifies the credentials, and only then does the user reach the home page. After that I need to call the real API for my task.
But it is not working.
import requests
import sys
class Login:
    def sendRequestWithAuthentication(self,loginDetails,requestDetails):
        user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
        action_url=loginDetails['action_url'] if 'action_url' in loginDetails.keys() else None
        pay_load=loginDetails['payload'] if 'payload' in loginDetails.keys() else None
        session_requests = requests.session()
        if action_url and pay_load:
            act_resp=session_requests.post(action_url, data=pay_load, headers=user_agent,verify=False,files=[ ])
            print(act_resp)
            auth_cookies=act_resp.cookies.get_dict()
        url,method,request_payload = requestDetails['url'],requestDetails['method'],requestDetails['payload']
        querystring=requestDetails['querystring']
        response=session_requests.get(url,headers=user_agent,cookies=auth_cookies,data=request_payload,params=querystring)
        print(response)
        return response.json()
In the above, the action URL is the API given in the ajax part, and in the second request the URL is the API address for that GET.
In short, how can I implement multi-factor authentication with Python Requests?
My questions:
Do we need to include the cookies from the login form page in the login request?
How do I implement multi-factor authentication with Python Requests? (Here we don't need a PIN or anything similar; it is done through RSA.) Is a certificate needed for login, since it now raises an error about being unable to validate the SSL certificate?
Can you give a dummy example API that implements this kind of scenario?
No, you are making it too complex. This code worked:
import requests
login_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php"
welcome_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php"
payload = 'user_email=test#phpzag.com&password=test&login_button='
login_headers = {
    'x-requested-with': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded', # it's urlencoded instead of form-data
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
}
s = requests.Session()
login = s.post(login_url, headers=login_headers, data=payload) # post requests
welcome = s.get(welcome_url, headers=login_headers)
print(welcome.text)
Result:
.....Hello, <br><br>Welcome to the members page.<br><br>
TL;DR
Change the part of your code that says data=payload to json=payload, and it should work.
Direct answer to your question
How [does one] implement [an] AJAX request using Python Requests?
Strictly speaking, you cannot: an AJAX request is by definition an HTTP request made from JavaScript in the browser. To quote W3Schools' AJAX introduction page, "AJAX = Asynchronous JavaScript And XML". What you can do from Python is send the same HTTP request that the page's JavaScript would send.
Indirect answer to your question
What I believe you're asking is how to perform auth/login HTTP requests using the popular Python package, requests. The short answer, unfortunately and like most things, is that it depends. Different auth pages handle auth requests differently, so you might have to do different things to authenticate against a specific web service.
Based on your code
Based on your code and on the server's response, I'm going to assume that the login page expects a POST request carrying the credentials as a JSON object. The response is a 406 (Not Acceptable) error, which normally means the server cannot produce a response in a format that the request said it would accept.
When you use requests, the data parameter sends the body more or less "raw": a string or bytes object is sent as-is (useful for binary data), while a dictionary is translated to standard HTML form data (e.g. key1=value1&key2=value2&key3=value3; this format has the MIME type application/x-www-form-urlencoded, which requests sets as the Content-Type when data is a dictionary). My educated guess, based on the fact that you put your credentials into a dictionary, is that the login form expects a POST request with a JSON-formatted body (many modern web apps do this), and that you were under the impression that passing the dictionary as data would turn it into a JSON object. This is a common misconception with requests that has bitten me before. What you want instead is to pass the dictionary using the json parameter.
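To make the difference concrete, here is a small stand-alone sketch of the two body formats. It uses only the standard library to imitate what `data=` (with a dictionary) and `json=` put on the wire; the function names are illustrative and not part of requests.

```python
import json
from urllib.parse import urlencode

payload = {"user_email": "test#phpzag.com", "password": "test"}


def as_form_body(fields):
    """Body and Content-Type comparable to requests' data=<dict>."""
    return urlencode(fields), "application/x-www-form-urlencoded"


def as_json_body(fields):
    """Body and Content-Type comparable to requests' json=<dict>."""
    return json.dumps(fields), "application/json"


form_body, form_type = as_form_body(payload)
json_body, json_type = as_json_body(payload)
print(form_type, form_body)
print(json_type, json_body)
```

Which of the two a given login endpoint accepts depends on the server, so it is worth checking the request the browser actually sends in DevTools before picking one.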
Your code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
Fixed (and cleaned up) code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Test script to login to php web app.
"""
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {
    'user_email': 'test#phpzag.com',
    'password': 'test'
}
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
session = requests.Session()
auth_response = session.post(
    url=LOGIN_URL,
    json=payload, # <--- THIS IS THE IMPORTANT BIT. Note: data param changed to json param
    headers=headers
)
response = session.get(
    'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',
    headers=headers,
    cookies=auth_response.cookies.get_dict() # Probably redundant: a Session keeps cookies from earlier responses on its own
)
print(response.content)
Check out the section of the requests documentation on POST requests. If you scroll down a bit from there, you'll see the docs discuss the GitHub API, which expects JSON, and how to handle that.
Auth can be tricky overall. Sometimes a service wants "basic auth", which requests expects you to pass as a tuple to the auth parameter; sometimes it wants a bearer token or an OAuth flow, which can get complicated and annoying.
Hope this helps!
You are missing the User-Agent header that the server (Apache?) requires.
Try this:
import requests
from requests import Session
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
s = requests.Session()
x=s.get(URL, headers=user_agent)
x=s.post(LOGIN_URL, data=payload, headers=user_agent)
print(x.content)
print(x.status_code)
Take a look at Requests: Basic Authentication
import requests
requests.post(URL, auth=('user', 'pass'))
# If there are some cookies you need to send
cookies = dict(cookies_are='working')
requests.post(URL, auth=('user', 'pass'), cookies=cookies)
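For context, HTTP Basic authentication just adds an Authorization header to the request. The sketch below builds that header by hand with the standard library, which is roughly what passing `auth=('user', 'pass')` produces, and shows the bearer-token variant some APIs expect instead. Both function names are illustrative.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization value that HTTP Basic auth sends."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


def bearer_auth_header(token):
    """Authorization value for APIs that expect a bearer token instead."""
    return f"Bearer {token}"
```

Either value can be sent with requests as `headers={'Authorization': value}`. Note that Base64 is an encoding, not encryption, so Basic auth should only be used over HTTPS.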

403 Forbidden Error when scraping a site, user-agents already used and updated. Any ideas?

As the title states, I am getting a 403 error. The generated URLs are valid: I can print them and open them in my browser just fine.
I've got a user agent, the exact one my browser sends when accessing the page I want to scrape, copied straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a plain requests.get.
Here's the code I'm using, which returns a 403. I get the same result with requests.get etc.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)
So yeah, I assume I'm not creating the user agent right, so the site can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I copied all the headers from DevTools and removed them one by one. It turns out the request needs only Accept-Language; it needs neither User-Agent nor a Session.
import requests
url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
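The "remove headers one by one" approach can also be automated. The sketch below is one possible way to do it, not the answerer's actual procedure: `minimal_headers` and the `works` callback are hypothetical names, and `works` is assumed to return True when a request with the given headers still succeeds.

```python
def minimal_headers(all_headers, works):
    """Drop headers one at a time, keeping only those the request needs.

    works is a callable that takes a headers dict and returns True when
    the request still succeeds with exactly those headers.
    """
    needed = dict(all_headers)
    for name in list(all_headers):
        # Try the request without this one header.
        trial = {k: v for k, v in needed.items() if k != name}
        if works(trial):
            needed = trial  # the request survived without this header
    return needed
```

In practice `works` could wrap something like `requests.get(url, headers=h).status_code == 200`; how success is detected is an assumption that depends on the site. Each call sends one request per header, so it is best run sparingly.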
