I am sending a request to some url. I Copied the curl url to get the code from curl to python tool. So all the headers are included, but my request is not working and I recieve status code 403 on printing and error code 1020 in the html output. The code is
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
# 'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
}
response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)
print(response.status_code)
print(response.cookies.get_dict())
with open("test.html",'w') as f:
f.write(response.text)
I also get cookies but not getting the desired response. I know I can do it with selenium but I want to know the reason behind this. Thanks in advance.
Note:
I have installed all the libraries installed with request with same version as computer and still not working and throwing 403 error
The site is protected by cloudflare which aims to block, among other things, unauthorized data scraping. From What is data scraping?
The process of web scraping is fairly simple, though the
implementation can be complex. Web scraping occurs in 3 steps:
First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
When the website responds, the scraper parses the HTML document for a specific pattern of data.
Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
You can use urllib instead of requests, it seems to be able to deal with cloudflare
req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_headers('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')
r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
f.write(r)
It works on my machine, so I am not sure what the problem is.
However, when I want send a request which does not work, I often try if it works using playwright. Playwright uses a browser driver and thus mimics your actual browser when visiting the page. It can be installed using pip install playwright. When you try it for the first time it may give an error which tells you to install the drivers, just follow the instruction to do so.
With playwright you can try the following:
from playwright.sync_api import sync_playwright
url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/69.0.3497.100 Safari/537.36"
)
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page(user_agent=ua)
page.goto(url)
page.wait_for_timeout(1000)
html = page.content()
print(html)
A downside of playwright is that it requires the installation of the chromium (or other) browsers. This is a downside as it may complicate deployment, as the browser can not simply be added to requirements.txt, and a container image is required.
Try running Burp Suite's Proxy to see all the headers and other data like cookies. Then you could mimic the request with the Python module. That's what I always do.
Good luck!
Had the same problem recently.
Using the javascript fetch-api with Selenium-Profiles worked for me.
example js:
fetch('http://example.com/movies.json')
.then((response) => response.json())
.then((data) => console.log(data));o
Example Python with Selenium-Profiles:
headers = {
"accept": "application/json",
"accept-encoding": "gzip, deflate, br",
"accept-language": profile["cdp"]["useragent"]["acceptLanguage"],
"content-type": "application/json",
# "cookie": cookie_str, # optional
"sec-ch-ua": "'Google Chrome';v='107', 'Chromium';v='107', 'Not=A?Brand';v='24'",
"sec-ch-ua-mobile": "?0", # "?1" for mobile
"sec-ch-ua-platform": "'" + profile['cdp']['useragent']['userAgentMetadata']['platform'] + "'",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"user-agent": profile['cdp']['useragent']['userAgent']
}
answer = driver.requests.fetch("https://www.example.com/",
options={
"body": json.dumps(post_data),
"headers": headers,
"method":"POST",
"mode":"same-origin"
})
I don't know why this occurs, but I assume cloudfare and others are able to detect, whether a request is made with javascript.
Related
Please look over my code. What am I doing wrong that my request response is empty? Any pointers?
URL in question: (should generate a results page)
https://www.ucr.gov/enforcement/343121222
But I cannot replicate it with python requests. Why?
import requests
headers = {'Host': 'www.ucr.gov',
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:105.0) Gecko/20100101 Firefox/105.0',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip, deflate, br',
'Connection' : 'keep-alive'
}
data = {
'scheme': 'https;',
'host': 'www.ucr.gov',
'filename': '/enforcement/3431212'
}
url="https://www.ucr.gov/enforcement/3431212"
result = requests.get(url, params=data, headers=headers)
print(result.status_code)
print(result.text)
The page at the link you provided is fully rendered on the client side using JavaScript. This means that you won't be able to obtain the same response using a simple HTTP request.
In this case a common solution is performing something that is called Headless Scraping, which involves automating a headless browser in order to access a specific website content as a regular client would. In Python headless scraping can be implemented using several libraries, including Selenium and Puppeteer.
I'm trying to get text from this site. It is just a simple plain site with only text. When running the code below, the only thing it prints out is a newline. I should say that websites content/text is dynamic, so it changes over a few minutes. My requests module version is 2.27.1. I'm using Python 3.9 on Windows.
What could be the problem?
import requests
url='https://www.spaceweatherlive.com/includes/live-data.php?object=solar_flare&lang=EN'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
}
content=requests.get(url, headers=headers)
print(content.text)
This is the example of how the website should look.
That particular server appears to be gating responses not on the User-Agent, but on the Accept-Encoding settings. You can get a normal response with:
import requests
url = "https://www.spaceweatherlive.com/includes/live-data.php?object=solar_flare&lang=EN"
headers = {
"Accept-Encoding": "gzip, deflate, br",
}
content = requests.get(url, headers=headers)
print(content.text)
Depending on how the server responds over time, you might need to install the brotli package to allow requests to decompress content compressed with it.
You just need to add user-agent like below.
import requests
url = "https://www.spaceweatherlive.com/includes/live-data.php?object=solar_flare&lang=EN"
payload={}
headers = {
'User-Agent': 'PostmanRuntime/7.29.0',
'Accept': '*/*',
'Cache-Control': 'no-cache',
'Host': 'www.spaceweatherlive.com',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive'
}
response = requests.get(url, headers=headers)
print(response.text)
I am currently using Python requests to scrape data from a website and using Postman as a tool to help me do it.
To those not familiar with Postman, it sends a get request and generates a code snippet to be used in many languages, including Python.
By using it, I can get data from the website quite easily, but it seems as like the 'Cookie' aspect of headers provided by Postman changes with time, so I can't automate my code to run anytime. The issue is that when the cookie is not valid I get an access denied message.
Here's an example of the code provided by Postman:
import requests
url = "https://wsloja.ifood.com.br/ifood-ws-v3/restaurants/7c854a4c-01a4-48d8-b3d4-239c6c069f6a/menu"
payload = {}
headers = {
'access_key': '69f181d5-0046-4221-b7b2-deef62bd60d5',
'browser': 'Windows',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
'Accept': 'application/json, text/plain, */*',
'secret_key': '9ef4fb4f-7a1d-4e0d-a9b1-9b82873297d8',
'Cache-Control': 'no-cache, no-store',
'X-Ifood-Session-Id': '85956739-2fac-4ebf-85d3-1aceda9738df',
'platform': 'Desktop',
'app_version': '8.37.0',
'Cookie': 'session_token=TlNUXzMyMjJfMTU5Nzg1MDE5NTIxNF84NDI5NTA2NDQ2MjUxMg==; _abck=AD1745CB8A0963BF3DD67C8AF7932007~-1~YAAQtXsGYH8UUe9zAQAACZ+IAgStbP4nYLMtonPvQ+4UY+iHA3k6XctPbGQmPF18spdWlGiDB4/HbBvDiF0jbgZmr2ETL8YF+f71Uwhsj+L8K+Fk4PFWBolAffkIRDfSubrf/tZOYRfmw09o59aFuQor5LeqxzXkfVsXE8uIJE0P/nC1JfImZ35G0OFt+HyIgDUZMFQ54Wnbap7+LMSWcvMKF6U/RlLm46ybnNnT/l/NLRaEAOIeIE3/JdKVVcYT2t4uePfrTkr5eD499nyhFJCwSVQytS9P7ZNAM4rFIPnM6kPtwcPjolLNeeU=~-1~-1~-1; ak_bmsc=129F92B2F8AC14A400433647B8C29EA3C9063145805E0000DB253D5F49CE7151~plVgguVnRQTAstyzs8P89cFlKQnC9ISQCH9KPHa8xYPDVoV2iQ/Hij2PL9r8EKEqcQfzkGmUWpK09ZpU0tL/llmBloi+S+Znl5P5/NJeV6Ex2gXqBu1ZCxc9soMWWyrdvG+0FFvSP3a6h3gaouPh2O/Tm4Ghk9ddR92t380WBkxvjXBpiPzoYp1DCO4yrEsn3Tip1Gan43IUHuCvO+zkRmgrE3Prfl1T/g0Px9mvLSVrg=; bm_sz=3106E71C2F26305AE435A7DA00506F01~YAAQRTEGyfky691zAQAAGuDbBggFW4fJcnF1UtgEsoXMFkEZk1rG8JMddyrxP3WleKrWBY7jA/Q08btQE43cKWmQ2qtGdB+ryPtI2KLNqQtKM5LnWRzU+RqBQqVbZKh/Rvp2pfTvf5lBO0FRCvESmYjeGvIbnntzaKvLQiDLO3kZnqmMqdyxcG1f51aoOasrjfo=; bm_sv=B4011FABDD7E457DDA32CBAB588CE882~aVOIuceCgWY25bT2YyltUzGUS3z5Ns7gJ3j30i/KuVUgG1coWzGavUdKU7RfSJewTvE47IPiLztXFBd+mj7c9U/IJp+hIa3c4z7fp22WX22YDI7ny3JxN73IUoagS1yQsyKMuxzxZOU9NpcIl/Eq8QkcycBvh2KZhhIZE5LnpFM='
}
response = requests.request("GET", url, headers=headers, data = payload)
print(response.text.encode('utf8'))
Here's just the Cookie part where I get access denied:
'Cookie': 'session_token=TlNUXzMyMjJfMTU5Nzg1MDE5NTIxNF84NDI5NTA2NDQ2MjUxMg==; _abck=AD1745CB8A0963BF3DD67C8AF7932007~-1~YAAQtXsGYH8UUe9zAQAACZ+IAgStbP4nYLMtonPvQ+4UY+iHA3k6XctPbGQmPF18spdWlGiDB4/HbBvDiF0jbgZmr2ETL8YF+f71Uwhsj+L8K+Fk4PFWBolAffkIRDfSubrf/tZOYRfmw09o59aFuQor5LeqxzXkfVsXE8uIJE0P/nC1JfImZ35G0OFt+HyIgDUZMFQ54Wnbap7+LMSWcvMKF6U/RlLm46ybnNnT/l/NLRaEAOIeIE3/JdKVVcYT2t4uePfrTkr5eD499nyhFJCwSVQytS9P7ZNAM4rFIPnM6kPtwcPjolLNeeU=~-1~-1~-1; ak_bmsc=129F92B2F8AC14A400433647B8C29EA3C9063145805E0000DB253D5F49CE7151~plVgguVnRQTAstyzs8P89cFlKQnC9ISQCH9KPHa8xYPDVoV2iQ/Hij2PL9r8EKEqcQfzkGmUWpK09ZpU0tL/llmBloi+S+Znl5P5/NJeV6Ex2gXqBu1ZCxc9soMWWyrdvG+0FFvSP3a6h3gaouPh2O/Tm4Ghk9ddR92t380WBkxvjXBpiPzoYp1DCO4yrEsn3Tip1Gan43IUHuCvO+zkRmgrE3Prfl1T/g0Px9mvLSVrg=; bm_sz=3106E71C2F26305AE435A7DA00506F01~YAAQRTEGyfky691zAQAAGuDbBggFW4fJcnF1UtgEsoXMFkEZk1rG8JMddyrxP3WleKrWBY7jA/Q08btQE43cKWmQ2qtGdB+ryPtI2KLNqQtKM5LnWRzU+RqBQqVbZKh/Rvp2pfTvf5lBO0FRCvESmYjeGvIbnntzaKvLQiDLO3kZnqmMqdyxcG1f51aoOasrjfo=; bm_sv=B4011FABDD7E457DDA32CBAB588CE882~aVOIuceCgWY25bT2YyltUzGUS3z5Ns7gJ3j30i/KuVUgG1coWzGavUdKU7RfSJewTvE47IPiLztXFBd+mj7c9U/IJp+hIa3c4z7fp22WX23E755znZL76c0V/amxbHU9BUnrEff3HGcsniyh5mU+C9XVmtNRLd8oT1UW9WUg3qE=' }
Which is slightly different from the one before.
How could I get through this by somehow having python get the session token?
Apparently just removing 'Cookie' from headers does the job.
I using python requests module to grab data from one website.
At first time i run script, all works fine, data is ok. Then, if run script again, it's return the same data, however this data changed on website if opened in browser. Whenever i run script, data still the same. BUT!
After 5 or 6 minutes, if run script again, data was updated. Looks like requests caching info.
If using the browser, every time hit refresh, data updates correctly.
r = requests.get('https://verysecretwebsite.com', headers=headers)
r.text
Actually i use following header:
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.gismeteo.ru/weather-orenburg-5159/now/',
'DNT': '1',
'Connection': 'false',
'Upgrade-Insecure-Requests': '1',
'Cache-Control': 'no-cache, max-age=0',
'TE': 'Trailers'}
but with no luck.
I try grub this link https://www.gismeteo.ru/weather-orenburg-5159/now/ with section "data-dateformat="G:i"
In your code you haven't set any headers. This means that requests will always send its default User-Agent header like User-Agent: python-requests/2.22.0 and use no caching directives like Cache-Control.
The remote server of your website may have different caching policies for client applications. Remote server can respond with different data or use different caching time based on User-Agent and/or Cache-Control headers of your request.
So try to check what headers your browser uses (F12 in Chrome) to make requests to your site and then add them to your request. You can also add Cache-Control directive to force server to return the most recent data.
Example:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36",
"Cache-Control": "no-cache, max-age=0", # disable caching
}
r = requests.get("https://www.mysecretURL.com", headers=headers)
The requests.get() method doesn't cache data by default (from this StackOverflow post) I'm not entirely sure of the reason for the lag, as refreshing your browser is essentially identical to calling requests.get(). You could try creating a loop that automatically collects data every 5-10 seconds or so, and that should work fine (and keep you from having to manually run the same lines of code). Hope this helps!
I am trying to login to a site called grailed.com and follow a certain product. The code below is what I have tried.
The code below succeeds in logging in with my credentials. However whenever I try to follow a product (the id in the payload is the id of the product) the code runs without any errors but fails to follow the product. I am confused at this behavior. Is it a similar case to Instagram (where Instagram blocks any attempt to interact programmatically with their site and force you to use their API (grailed.com does not have a API for the public to use AFAIK)
I tried the following code (which looks exactly like the POST request sent when you follow on the site).
headers/data defined here
r = requests.Session()
v = r.post("https://www.grailed.com/api/sign_in", json=data,headers = headers)
headers = {
'authority': 'www.grailed.com',
'method': 'POST',
"path": "/api/follows",
'scheme': 'https',
'accept': 'application/json',
'accept-encoding': 'gzip, deflate, br',
"content-type": "application/json",
"x-amplitude-id": "1547853919085",
"x-api-version": "application/grailed.api.v1",
"x-csrf-token": "9ph4VotTqyOBQzcUt8c3C5tJrFV7VlT9U5XrXdbt9/8G8I14mGllOMNGqGNYlkES/Z8OLfffIEJeRv9qydISIw==",
"origin": "https://www.grailed.com",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
payload = {
"id": "7917017"
}
b = r.post("https://www.grailed.com/api/follows",json = payload,headers = headers)
If API is not designed to be public, you are most likely missing csrf token in your follow headers.
You have to find an CSRF token, and add it to /api/follows POST.
taking fast look at code, this might be hard as everything goes inside javascript.