I have a scraper that worked without an issue for 18 months until today. Now I get a 403 response from hltv.org and can't seem to fix it. My code is below, so the answer is not the usual "just add headers". If I print response.text it says something about captchas, so I assume I'd have to bypass a captcha, or my IP is blocked? Please help :)
import requests
url = 'https://www.hltv.org/matches'
headers = {
"Accept-Language": "en-US,en;q=0.5",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Referer": "http://thewebsite.com",
"Connection": "keep-alive"}
response = requests.get(url, headers=headers)
print(response)
EDIT: This remains a mystery to me, but today my code started working again on my main PC. I did not make any changes to the code.
KokoseiJ could not reproduce the problem, but Booboo did. The code also worked on my old PC, which I dug out of storage, but not on my main PC. Anyway, thanks to all who tried to help me with this issue.
I am posting this not as a solution, but as something that did not work and may still be useful information.
I went to https://www.hltv.org/matches, then brought up Chrome's Inspector, reloaded the page, and looked at the request headers Chrome (supposedly) used for the GET request. Some of the header names begin with a ':', which requests considers illegal. But looking around Stack Overflow, I found a way to get around that (supposedly for Python 3.7 and greater). See the accepted answer and comments here for details.
This still resulted in a 403 error. Perhaps somebody might spot an error in this (or not).
These were the headers shown by the Inspector:
:authority: www.hltv.org
:method: GET
:path: /matches
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
cookie: MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1%2C%22lan%22:false%2C%22teams%22:[]}
dnt: 1
pragma: no-cache
sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
And the code:
import requests
import http.client
import re
# Relax http.client's header-name validation so the HTTP/2-style
# pseudo-headers (':authority', ':method', ...) are not rejected by requests.
http.client._is_legal_header_name = re.compile(rb'\S[^:\r\n]*').fullmatch
url = 'https://www.hltv.org/matches'
headers = {
':authority': 'www.hltv.org',
':method': 'GET',
':path': '/matches',
':scheme': 'https',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'no-cache',
'cookie': 'MatchFilter={%22active%22:false%2C%22live%22:false%2C%22stars%22:1%2C%22lan%22:false%2C%22teams%22:[]}',
'dnt': '1',
'pragma': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.text)
print(response)
I also came across this issue recently.
My solution was using the js-fetch library (see answer).
I assume Cloudflare and others have found some way to detect whether a request is made by a browser (JS) or by another programming language.
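For what it's worth, you can at least detect this situation in code instead of eyeballing the HTML. A minimal sketch (the 'captcha' substring check is an assumption based on the question's description of the error page; adjust it for your target site):
import requests

url = 'https://www.hltv.org/matches'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# Anti-bot layers usually answer with a 403 plus a challenge page instead of
# the real content. The 'captcha' marker is an assumption taken from what the
# question reports seeing in response.text.
if response.status_code == 403 and 'captcha' in response.text.lower():
    print('Blocked by an anti-bot challenge; header tweaks alone will not help.')
else:
    print(response.status_code)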
Related
I'm trying to get an HTML response in Python with requests, but I only get a 403, while in the Chrome browser the link works fine and the page loads: https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview
I've copied the exact headers of the successfully loaded page from Chrome Developer Tools -> Network recording, but no luck (below).
import requests
headers = {
'authority': 'www.dell.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
response = requests.get('https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview', headers=headers)
print(response.status_code)
Also, requests.get('https://www.dell.com/support/home/en-us?lwp=rt') returns 200 with no problem.
I can't figure out what the difference between a browser and a Python request might be in this case.
UPDATE: Python 3.7.3, run from a Jupyter Notebook, but yes, that hardly matters. I tried running it from the Python console as well.
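One difference worth testing is cookies: a browser lands on the service-tag page with cookies already set from earlier navigation, while a bare requests.get starts cold. A minimal sketch that warms up a Session on the URL above that does return 200 before requesting the blocked page (whether Dell's check actually keys on cookies is an assumption):
import requests

session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'})

# This URL returns 200 (see above) and may set cookies the next request needs.
session.get('https://www.dell.com/support/home/en-us?lwp=rt')

response = session.get('https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview')
print(response.status_code)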
I tried to scrape some data from a national betting site called https://tippmix.hu.
I wanted to get the data from this specific page: https://www.tippmix.hu/sportfogadas#?sportid=999&countryid=99999999&competitionid=45975&page=1. The data is dynamically loaded, so I inspected the page and found the specific POST request that loads the data I need as JSON.
I opened Python and used the requests library to make a POST request to https://api.tippmix.hu/tippmix/search with the exact same headers and data the browser sent. Unfortunately, my code returned the whole JSON file, as if I had not specified any parameters.
Here is my code (the header_convert function converts the copied header string into a dictionary):
import requests
from header_convert import header_convert
events_url = "https://api.tippmix.hu/tippmix/search"
data = {"fieldValue": "",
"sportId": "999",
"competitionGroupId": "99999999",
"competitionId": "45975",
"type": "0",
"date": "0001-01-01T00:00:00.000Z",
"hitsPerPage": "20",
"page": "1",
"minOdds": "null",
"maxOdds": "null"}
raw_headers = """Accept: application/json, text/plain, */*
Content-Type: application/x-www-form-urlencoded
Origin: https://www.tippmix.hu
Content-Length: 182
Accept-Language: en-us
Host: api.tippmix.hu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15
Referer: https://www.tippmix.hu/
Accept-Encoding: gzip, deflate, br
Connection: keep-alive"""
headers = header_convert(raw_headers)
print(headers)
page = requests.post(events_url, data=data, headers=headers)
print(page.content)
Here are my headers:
{'Accept': 'application/json, text/plain, */*', 'Content-Type': 'application/x-www-form-urlencoded', 'Origin': 'https://www.tippmix.hu', 'Content-Length': '182', 'Accept-Language': 'en-us', 'Host': 'api.tippmix.hu', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15', 'Referer': 'https://www.tippmix.hu/', 'Accept-Encoding': 'gzip, deflate, br', 'Connection': 'keep-alive'}
I wonder if someone could help me.
Thank you!
You need to post the data as JSON. Updated with working code:
import requests
import json
url = 'https://api.tippmix.hu/tippmix/search'
data = {"fieldValue":"",
"sportId":999,
"competitionGroupId":99999999,
"competitionId":45975,
"type":0,
"date":"0001-01-01T00:00:00.000Z",
"hitsPerPage":20,
"page":1,
"minOdds":"null",
"maxOdds":"null"}
headers = {
'Accept':'application/json, text/plain, */*',
'Content-Type':'application/x-www-form-urlencoded',
'Host':'api.tippmix.hu',
'Origin':'https://www.tippmix.hu',
'Referer':'https://www.tippmix.hu/',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
resp = requests.post(url, headers=headers, data=json.dumps(data)).json()
print(resp)
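As an aside, requests can serialize the body itself via the json= keyword. Because the explicit Content-Type header above takes precedence over the one json= would set, this one-liner sends the same request as the json.dumps call:
resp = requests.post(url, headers=headers, json=data).json()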
I have been trying to access the website https://www.dickssportinggoods.com/f/tents-accessories with the requests module, but it just keeps processing and does not stop, while the same website works fine in a browser. Scrapy gives a timeout error for the same website. Is there something that should be taken into account when accessing websites like these? Thanks.
For sites like these you can try adding the extra headers that your browser sends. Following these steps worked for me -
Open the link in an incognito window with the network tab open.
Copy the first request by right-clicking -> Copy -> Copy as cURL.
Go to https://curl.trillworks.com/. Paste the curl command to get the equivalent Python requests code.
Now try removing headers one by one until the request still works with a minimal set (a sketch automating this step follows after the reference image below).
Image for reference - https://i.stack.imgur.com/vRS98.png
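The last step can be automated with a small greedy helper. A sketch (treating a 200 status as success is an assumption; some anti-bot setups return 200 with a challenge page, so adapt the check to your site):
import requests

def minimize_headers(url, headers):
    # Greedily drop one header at a time; keep the drop if the page still loads.
    working = dict(headers)
    for name in list(working):
        trial = {k: v for k, v in working.items() if k != name}
        if requests.get(url, headers=trial).status_code == 200:
            working = trial  # this header was not needed
    return working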
Edit -
import requests
headers = {
'authority': 'www.dickssportinggoods.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
print(response.text)
Have you tried adding headers?
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
response.raise_for_status()
print(response.text)
Thanks to @Marcel and @Sonal, but apart from the headers, it only worked when I put the request in a try/except block.
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
session = requests.Session()

def fetch(link):
    try:
        r = session.get(link, headers=headers, stream=True)
        return r
    except requests.exceptions.ConnectionError:
        # Connection refused; there is no response object to return.
        return None
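Calling it then looks like this (the URL is the one from the question):
r = fetch('https://www.dickssportinggoods.com/f/tents-accessories')
if r is not None:
    print(r.status_code)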
This is my first post on Stack Overflow, so please bear with me.
I am writing a function that makes a request via a REST API and then returns the values, but I'm having trouble with the authentication part.
The authentication is a JWT bearer token, which is needed to retrieve the data (though I do not need to log in, so in that regard it is an unauthenticated API).
def get__price(jwt, cookie):
headers = {
'authority': 'www.dextools.io',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'accept': 'application/json',
'authorization': f'Bearer {jwt}', # HERE IS THE VAR I NEED
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
'content-type': 'application/json',
'sec-gpc': '1',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.dextools.io/app/uniswap/pair-explorer/0x0d4a11d5eeaac28ec3f61d100daf4d40471f1852',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
#'cookie': f'__cfduid={cookie}; ai_user=hizb^|2021-04-03T00:16:45.460Z; ai_session=5vAmv^|1617443356577.045^|1617443356577.045',
}
params = (
('v', '1.9.1'),
('pair', '0x0d4a11d5eeaac28ec3f61d100daf4d40471f1852'),
('ts', '1617443384-0')
)
try:
response = requests.get('https://www.dextools.io/api/uniswap/1/pairexplorer', headers=headers, params=params)
except Exception as e:
print(f"ERROR: {e}")
I've tried making a request to https://www.dextools.io to get a JWT token, but it doesn't seem to work using Sessions.
Maybe it is of no importance, but I can find this JWT token in the browser under Developer Tools > Local Storage > (website url) > t, where t contains my eyJxxxxxxxxxxxxxxx token.
Any help would be appreciated, thanks.
Hello. Looking at the website's network requests, I was able to get the data with the code below. The JWT token generated below is valid for roughly 6 to 8 minutes; you can reuse it until then, and after that you need a new one by calling the login URL again, as in the code. You might also need to grab a new password value from the network requests if the website ever blocks the current one.
Code:
import time
import requests
s = requests.session()
headersdict = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
'Referer': 'https://www.dextools.io/app/uniswap/pair-explorer/0x0d4a11d5eeaac28ec3f61d100daf4d40471f1852',
'Origin': 'https://www.dextools.io'}
s.headers.update(headersdict)
# Password captured from the site's own login request; if it ever stops
# working, grab a fresh value from the browser's network tab.
payload = {"id": "anyone", "password": "TfY6WC6F4L4+S6xwvPo8QoHlYZ50rK2DrJnEAWBoMqU="}
s1 = s.post("https://www.dextools.io/back/user/login", json=payload)
jwt = s1.headers["X-Auth"]
headersdict = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
'Referer': 'https://www.dextools.io/app/uniswap/pair-explorer/0x0d4a11d5eeaac28ec3f61d100daf4d40471f1852',
'Origin': 'https://www.dextools.io',
'authorization': f'Bearer {jwt}'}
s.headers.update(headersdict)
params = (
('v', '1.9.1'),
('pair', '0x0d4a11d5eeaac28ec3f61d100daf4d40471f1852'),
('ts', f'{time.time()}-0')
)
response = s.get('https://www.dextools.io/api/uniswap/1/pairexplorer', params=params)
print(response.text)
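Since the token goes stale after a few minutes, a small wrapper that re-logs-in on demand can help for longer runs. A minimal sketch built on the same endpoints as the code above (the 6-minute TTL is a conservative reading of the reported 6-8 minute lifetime, and JwtSession is a name I made up):
import time
import requests

LOGIN_URL = 'https://www.dextools.io/back/user/login'
PAYLOAD = {'id': 'anyone', 'password': 'TfY6WC6F4L4+S6xwvPo8QoHlYZ50rK2DrJnEAWBoMqU='}
TOKEN_TTL = 6 * 60  # seconds; the token reportedly lives 6 to 8 minutes

class JwtSession:
    """Requests wrapper that re-logs-in when the cached JWT goes stale."""

    def __init__(self):
        self.session = requests.Session()
        self.token = None
        self.fetched_at = 0.0

    def _refresh(self):
        # Same login call as above; the fresh token comes back in X-Auth.
        resp = self.session.post(LOGIN_URL, json=PAYLOAD)
        self.token = resp.headers['X-Auth']
        self.fetched_at = time.time()

    def get(self, url, **kwargs):
        if self.token is None or time.time() - self.fetched_at > TOKEN_TTL:
            self._refresh()
        headers = kwargs.pop('headers', {})
        headers['authorization'] = f'Bearer {self.token}'
        return self.session.get(url, headers=headers, **kwargs)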
Let me know if you have any questions :)
I have bought a little WiFi relay module. Though its interface is in Chinese, which I do not read, I have worked out how to open and close the relay from the buttons on the home page of the embedded web server.
I then used Postman Interceptor to capture the 'open' and 'close' actions, and I can now replay the captured POST from Postman to make the action happen.
However, the Python script from Postman's 'generate code' feature doesn't work and, from my limited understanding, doesn't include the right info.
import requests
url = "http://192.168.4.1/"
payload = ""
headers = {
'origin': "http://192.168.4.1",
'upgrade-insecure-requests': "1",
'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
'content-type': "application/x-www-form-urlencoded",
'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
'dnt': "1",
'referer': "http://192.168.4.1/",
'accept-encoding': "gzip, deflate",
'accept-language': "en-US,en;q=0.8",
'cache-control': "no-cache",
'postman-token': "bece04e7-ee50-3764-ca50-e86d07ebc0f3"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
The output when I select 'HTTP' instead of 'Python Requests' is:
POST / HTTP/1.1
Host: 192.168.4.1
Origin: http://192.168.4.1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
DNT: 1
Referer: http://192.168.4.1/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cache-Control: no-cache
Postman-Token: 0bd42b4f-067d-b5be-dd1c-b7e689196043
open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD
Could someone suggest how to modify the Python code so it correctly sends the POST that works from within Postman itself?
Your Python code is missing the POST data that carries the command to the device, which is listed at the bottom of the HTTP request above.
Put open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD into the payload variable in the Python code:
import requests
url = "http://192.168.4.1/"
payload = "open_relay=%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD"
headers = {
'origin': "http://192.168.4.1",
'upgrade-insecure-requests': "1",
'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
'content-type': "application/x-www-form-urlencoded",
'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
'dnt': "1",
'referer': "http://192.168.4.1/",
'accept-encoding': "gzip, deflate",
'accept-language': "en-US,en;q=0.8",
'cache-control': "no-cache",
'postman-token': "bece04e7-ee50-3764-ca50-e86d07ebc0f3"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
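A note on that payload string: the %EF%BF%BD runs decode to U+FFFD replacement characters, i.e. the original bytes (probably a GBK-encoded Chinese label) were already mangled once by Postman's display, so passing the percent-encoded string through unchanged, as above, is the safe option. If you want to inspect the raw bytes the device actually receives, a purely diagnostic snippet:
from urllib.parse import unquote_to_bytes

# Decode the percent-encoding to see the opaque byte sequence being sent.
raw = unquote_to_bytes('%EF%BF%BD%F2%BF%AA%BC%CC%B5%EF%BF%BD%EF%BF%BD%EF%BF%BD')
print(raw)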