I am using the Cloudscraper Python library to obtain a JSON response from a URL.
The problem is that I have to retry the same request 2-3 times before I get the correct output; the first responses come back with a 403 HTTP status code.
Here is my code:
import json
from time import sleep

import cloudscraper

url = "https://www.endpoint.com/api/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Accept": "*/*",
    "Content-Type": "application/json"
}

def get_json_response():
    json_response = 0
    while json_response == 0:
        try:
            scraper = cloudscraper.create_scraper()
            r = scraper.get(url, headers=headers)
            json_response = json.loads(r.text)
        except Exception:
            print(r.status_code)
            sleep(2)
    return json_response
What can I do in order to optimize my code and prevent the 403 responses?
You could use a real browser to get past part of the bot detection. Here is an example with Playwright:
import json

from playwright.sync_api import sync_playwright

API_URL = 'https://www.soraredata.com/api/players/info/29301348132354218386476497174231278066977835432352170109275714645119105189666'

with sync_playwright() as p:
    # WebKit is the fastest to start and the hardest to detect
    browser = p.webkit.launch(headless=True)
    page = browser.new_page()
    page.goto(API_URL)
    # Use evaluate() instead of content() to avoid importing bs4 or lxml
    html = page.evaluate('document.querySelector("pre").innerText')
    try:
        data = json.loads(html)
    except json.JSONDecodeError:
        # Still might fail sometimes
        data = None
    print(data)
The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.
If you were getting no successful responses at all, I would suggest first checking whether the URL you are sending the request to requires any sort of permission to authorize the request.
However, you do get a response on the second or third attempt. What happens is that some servers take a few seconds before returning the answer, so they require the browser to wait roughly 5 seconds before the real response is served.
I would suggest adding a delay, which can be passed as an argument to create_scraper():
scraper = cloudscraper.create_scraper(delay=10)
If that is successful, reduce the delay step by step until you find the smallest value that still works.
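Putting the retry loop and the delay together, a minimal sketch could look like the following (the endpoint is the same placeholder URL as in your question; the retry count, delay and pause values are assumptions to tune for your target):
from time import sleep

import cloudscraper

URL = "https://www.endpoint.com/api/"  # same placeholder endpoint as above

def fetch_json(url, retries=5, challenge_delay=10, pause=2):
    # reuse one scraper so the Cloudflare clearance cookie is kept between attempts
    scraper = cloudscraper.create_scraper(delay=challenge_delay)
    for attempt in range(1, retries + 1):
        r = scraper.get(url)
        if r.status_code == 200:
            try:
                return r.json()
            except ValueError:
                pass  # body was not valid JSON, retry
        print("attempt", attempt, "failed with status", r.status_code)
        sleep(pause)
    return None

print(fetch_json(URL))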
Related
I am trying to log in to www.ebay-kleinanzeigen.de using the requests library, but every time I try to post my data (on the register page it's the same as on the login page) I get a 403 error.
Here is the code for the register function:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'user-agent': user_agent, 'Referer': 'https://www.ebay-kleinanzeigen.de'}

with requests.Session() as c:
    url = 'https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html'
    c.headers = headers
    hp = c.get(url, headers=headers)
    soup = BeautifulSoup(hp.content, 'html.parser')
    crsf = soup.find('input', {'name': '_csrf'})['value']
    print(crsf)
    payload = dict(email='test.email#emailzz1.de', password='test123', passwordConfirmation='test123',
                   _marketingOptIn='on', _crsf=crsf)
    page = c.post(url, data=payload, headers=headers)
    print(page.text)
    print(page.url)
    print(page.status_code)
Is the problem that I need some more headers? Aren't a user-agent and a referrer enough?
I have tried adding all the requested headers, but then I get no response at all.
I have managed to create a script that successfully completes the registration form you're trying to fill in, using the mechanicalsoup library. Note that you will have to check your email account manually for the message they send you to complete registration.
I realise this doesn't actually answer the question of why your requests-based approach returned a 403 Forbidden error, but it does complete your task without encountering the same error.
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html")
browser.select_form('#registration-form')
browser.get_current_form().print_summary()

browser["email"] = "mailuser#emailprovider.com"
browser["password"] = "testSO12345"
browser["passwordConfirmation"] = "testSO12345"

response = browser.submit_selected()
rsp_code = response.status_code
# print(response.text)
print("Response code:", rsp_code)
if rsp_code == 200:
    print("Success! Opening a local debug copy of the page... (no CSS formatting)")
    browser.launch_browser()
else:
    print("Failure!")
Sometimes when I try to get the HTML code from a website with this code
import requests
url = "https://sit2play.com"
response = requests.get(url)
print response.content
I get this response:
<h3 class="ielte9">
The browser you're using is not supported. Please use a different browser like Chrome or Firefox.
How can I avoid this and get the real page content?
Add your user agent to the header of the request with
headers = {
'User-Agent': 'YOUR USER AGENT',
}
response = requests.get(url, headers=headers)
You can find your own user agent string on many websites (for example, any "what is my user agent" page).
Edit
If the solution above doesn't work for you, which might be because you are using an old version of requests, try this one:
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'YOUR USER AGENT',
})
response = requests.get(url, headers=headers)
I am new to python and programming and would really appreciate any help here.
I am trying to log in to this website using the code below, and I just cannot get beyond the first page.
Below is the code I have been trying...
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect')
soup = BeautifulSoup(response.text, 'html.parser')
formtoken = soup.find('input', {'name': '__RequestVerificationToken'}).get('value')

payload = {'UserName': username, 'Password': password, '__RequestVerificationToken': formtoken}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}

with requests.Session() as s:
    p = s.post('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect', data=payload, headers=headers)
    r = s.get('http://www.dell.com/account/', headers=headers)
    print r.text
I am just not able to get beyond the login page. What parameters do I need to send apart from the login details? I also tried checking the form data in the Chrome dev tools, but it is encrypted (see the Form Data dev-tool screenshot).
Any help here is highly appreciated.
EDIT
I have edited the code to pass the token in the payload as suggested below, but I have had no luck yet.
You are not following the correct approach for making the POST request.
Steps you can follow:
First, make a GET request to your URL.
Extract the access token (here, the __RequestVerificationToken) from the response.
Use that access token in your POST request, as sketched below.
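A rough sketch of those three steps against the login URL from your question (the token and form field names are taken from your own snippet, the credentials are placeholders, and whether this is enough to authenticate depends on how the form data is encrypted, as you noted):
import requests
from bs4 import BeautifulSoup

login_url = ('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous'
             '?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com'
             '&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}

with requests.Session() as s:
    # 1. GET the login page inside the same session
    login_page = s.get(login_url, headers=headers)

    # 2. extract the verification token from the form
    soup = BeautifulSoup(login_page.text, 'html.parser')
    token = soup.find('input', {'name': '__RequestVerificationToken'}).get('value')

    # 3. POST the credentials together with that token
    payload = {
        'UserName': 'your-username',   # placeholder
        'Password': 'your-password',   # placeholder
        '__RequestVerificationToken': token,
    }
    result = s.post(login_url, data=payload, headers=headers)
    print(result.status_code)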
I'm currently working on a scraper, with Python 2.7, BeautifulSoup, Requests, json, etc., to analyze data from a website and make charts.
I want to run a search with specific keywords and then scrape the prices of the different items to compute an average value.
So I tried BeautifulSoup to scrape the JSON response as I usually do, but the response it gives me is:
{"data":{"uuid":"YNp-EuXHrw","index_name":"Listing","default_name":null,"query":"supreme box logo","filters":{"strata":["basic","grailed","hype"]}}}
My request goes to https://www.grailed.com/api/searches, a URL I found on the index page when making a search.
I figured out that "uuid":"YNp-EuXHrw" (a different value every time) defines the URL that will show the items' data, as in https://www.grailed.com/feed/YNp-EuXHrw
So I'm making a request to scrape the uuid from the API with:
response = s.post(url, headers=headers, json=payload)
res_json = json.loads(response.text)
print response
id = res_json['data']['uuid']
But the problem is, when I make a request to
https://www.grailed.com/feed/YNp-EuXHrw
or whatever the uuid is, I get <Response [500]>.
My whole code is:
import BeautifulSoup, requests, re, string, time, datetime, sys, json

s = requests.session()

url = "https://www.grailed.com/api/searches"
payload = {
    "index_name": "Listing_production",
    "query": "supreme box logo sweatshirts",
    "filters": {"strata": ["grailed", "hype", "basic"], "category_paths": [], "sizes": [], "locations": [], "designers": [], "min_price": "null", "max_price": "null"}
}
headers = {
    "Host": "www.grailed.com",
    "Connection": "keep-alive",
    "Content-Length": "217",
    "Origin": "null",
    "x-api-version": "application/grailed.api.v1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "content-type": "application/json",
    "accept": "application/json",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4",
}

response = s.post(url, headers=headers, json=payload)
res_json = json.loads(response.text)
print response
id = res_json['data']['uuid']

urlID = "https://www.grailed.com/feed/" + str(id)
print urlID
response = s.get(urlID, headers=headers, json=res_json)
print response
As you can see, when you do the request through Chrome or any other browser, the URL quickly changes from
grailed.com
to
grailed.com/feed/uuid
So I've tried to make a GET request to this URL, but I just get Response 500.
What can I do to scrape the data shown on the uuid URL, since it doesn't even appear in the Network requests?
I hope I was pretty clear; sorry for my English.
Install PhantomJS:
http://phantomjs.org/
Not a full solution, but I hope this helps.
pip install selenium
npm install phantomjs
test.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# path to the phantomjs driver
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
driver.set_window_size(1120, 550)
driver.get("https://www.grailed.com/")

try:
    # you want to wait until the page is rendered
    element = WebDriverWait(driver, 1).until(
        EC.presence_of_all_elements_located((By.XPATH, '//*[@id="homepage"]/div/div[3]/div[1]/div/form/label/input'))
    )
    element = driver.find_element_by_xpath('//*[@id="homepage"]/div/div[3]/div[1]/div/form/label/input')
    if element.is_displayed():
        element.send_keys('search this')
    else:
        print('no element')
except Exception as e:
    print(e)

print(driver.current_url)
driver.quit()
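Note that PhantomJS is no longer maintained and recent Selenium releases have dropped support for it, so if the snippet above fails to start, the same flow can be run with a headless Chrome or Firefox instead. A minimal sketch, assuming chromedriver is available on your PATH (the XPath is copied from the example above and may no longer match the live page):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')           # run without opening a window
driver = webdriver.Chrome(options=options)   # chromedriver must be on PATH
driver.set_window_size(1120, 550)
driver.get("https://www.grailed.com/")

# the rest of the PhantomJS example works the same way, e.g.:
search_box = driver.find_element(By.XPATH, '//*[@id="homepage"]/div/div[3]/div[1]/div/form/label/input')
print(driver.current_url)
driver.quit()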
I have this URL, whose content is produced in the following way (PHP; it's supposed to generate a random cookie on every request):
setcookie('token', md5(time()), time() + 99999);
if (isset($_COOKIE['token'])) {
    echo 'Cookie: ' . $_COOKIE['token'];
    die();
}
echo 'Cookie not set yet';
As you can see, the cookie changes on every reload/refresh of the page. Now I have a Python 3 script with three requests that are completely independent of each other:
import requests

def get_req_data(req):
    print('\n\ntoken: ', req.cookies['token'])
    print('headers we sent: ', req.request.headers)
    print('headers server sent back: ', req.headers)

url = 'http://migueldvl.com/heya/login/tests2.php'
headers = {
    "User-agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
    "Referer": 'https://www.google.com'
}

req1 = requests.get(url, headers=headers)
get_req_data(req1)
req2 = requests.get(url, headers=headers)
get_req_data(req2)
req3 = requests.get(url, headers=headers)
get_req_data(req3)
How can it be that we sometimes get the same cookie in different requests, when it is clearly programmed to change on every request?
If we
import time
and add a
time.sleep(1)  # wait one second before the next request
between the requests, the cookie changes every time, which is the right and expected behaviour. But my question is: why do we need this time.sleep(1) to be certain the cookie changes? Wouldn't separate requests be enough?
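One detail worth keeping in mind when reading the PHP snippet above: PHP's time() has one-second resolution, so md5(time()) only produces a new value once per second, and two requests that reach the server within the same second therefore get the same token. A small standalone illustration of that granularity (a sketch using only the Python standard library, not part of the original scripts):
import hashlib
import time

def php_style_token():
    # mimic PHP's md5(time()): hash the current Unix timestamp in whole seconds
    return hashlib.md5(str(int(time.time())).encode()).hexdigest()

a = php_style_token()
b = php_style_token()   # almost certainly the same second -> same token
time.sleep(1)
c = php_style_token()   # a new second -> a different token

print(a == b)  # usually True
print(a == c)  # usually False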