I need to parse a site protected by DDoS-GUARD.
In Firefox devtools I found a GET request to https://stolichki.ru/cities/all
If I open this URL in Firefox, it returns a JSON object.
But Python requests returns an HTML page with a 403 status.
import requests
response_raw = requests.get('https://stolichki.ru/cities/all')
print(response_raw.text)
I tried changing the request headers:
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
'Connection': 'keep-alive',
'DNT': '1',
'Host': 'stolichki.ru',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Sec-GPC': '1',
'TE': 'trailers',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0'
}
But it didn't help.
Please don't suggest the grab library.
I want to use a proxy with Python requests. To test whether my request goes through the proxy, I send a request to jsonip.com. The response contains my real IP instead of the proxy's, and the proxy provider's website shows "no activity". Am I connecting to the proxy correctly? Here is the code:
import time, requests, random
from requests.auth import HTTPProxyAuth
auth = HTTPProxyAuth("muyjgovw", "mtpysgrb3nkj")
def reqs():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        # 'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.google.com/',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-User': '?1',
    }
    prox = [{"http": "http://64.137.58.19:6265"}]
    proxies = random.choice(prox)
    response = requests.get('https://jsonip.com/', headers=headers, proxies=proxies)
    print(response.status_code)
    print(response.json())
reqs()
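You have to add an "https" entry to the proxies dict as well. The request URL uses HTTPS, and requests picks the proxy by the URL's scheme, so a dict with only an "http" key is ignored for this request: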
import time, requests, random
from requests.auth import HTTPProxyAuth
auth = HTTPProxyAuth("muyjgovw", "mtpysgrb3nkj")
def reqs():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        # 'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.google.com/',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-User': '?1',
    }
    prox = [{"http": "http://64.137.58.19:6265",
             "https": "http://64.137.58.19:6265"}]
    proxies = random.choice(prox)
    response = requests.get('https://jsonip.com/', headers=headers, proxies=proxies)
    print(response.status_code)
    print(response.json())
reqs()
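To see why the extra "https" key matters, here is a minimal sketch using `requests.utils.select_proxy`, which, as far as I know, is the helper requests uses internally to pick a proxy for a URL. The proxy address is the one from the question. The credentials-in-URL form at the end is only an assumption about how this provider expects authentication (the `auth` object in the code above is never passed to `requests.get`), and `USER` / `PASSWORD` are placeholders.

```python
from requests.utils import select_proxy

PROXY = "http://64.137.58.19:6265"
url = "https://jsonip.com/"

http_only = {"http": PROXY}
both = {"http": PROXY, "https": PROXY}

# No key matches the "https" scheme, so no proxy is selected
# and the request would go out directly from your own IP.
print(select_proxy(url, http_only))

# With an "https" key, the proxy is selected for HTTPS URLs too.
print(select_proxy(url, both))

# Assumption: the provider accepts credentials embedded in the proxy URL.
# USER and PASSWORD are placeholders, not real values.
authed = {
    scheme: "http://USER:PASSWORD@64.137.58.19:6265"
    for scheme in ("http", "https")
}
```

If the first call prints None for your own proxies dict, the request is not going through the proxy at all, which would match the "no activity" shown by the provider.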
I am trying to use Python requests to download a PDF from PeerJ. For example, https://peerj.com/articles/1.pdf.
My code is simply:
r = requests.get('https://peerj.com/articles/1.pdf')
However, the Response object returned displays as <Response [432]>, which indicates an HTTP 432 error. As far as I know, that error code is not assigned.
When I examine r.text or r.content, there is some HTML which says that it's an error 432 and gives a link to the same PDF, https://peerj.com/articles/1.pdf.
I can view the PDF when I open it in my browser (Chrome).
How do I get the actual PDF (as a bytes object, like I should get from r.content)?
I opened the site you mentioned with the developer tools open in Firefox, copied the HTTP request headers from there, and passed them to the headers parameter of requests.get:
a = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.5',
'Connection': 'keep-alive',
'Host': 'peerj.com',
'Referer': 'https://peerj.com/articles/1.pdf',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
r = requests.get('https://peerj.com/articles/1.pdf', headers= a)
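Whether copied browser headers are enough may depend on the site's bot protection, so it can help to confirm that `r.content` really is a PDF before saving it, rather than the HTML error page. A minimal sketch: `looks_like_pdf` is a hypothetical helper name, and the two byte strings are made-up samples standing in for `r.content`.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF signature."""
    return content.startswith(b"%PDF-")


# Made-up samples standing in for r.content in the two cases.
pdf_sample = b"%PDF-1.5\n%binary data follows"
html_sample = b"<html><body>Error 432</body></html>"

print(looks_like_pdf(pdf_sample))
print(looks_like_pdf(html_sample))
```

With a real response you would call `looks_like_pdf(r.content)` and only write the bytes to a file opened in `'wb'` mode when it returns True.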
I'm trying to fetch the JSON data so I can search for available camp rentals. The only way seems to be a request with headers; otherwise I get a "Not Authorize" message when just using the URL. Unfortunately I'm having no luck this way either, since I keep getting a "Session has expired" message. I'm not a web developer, so I'm not sure what the cause is. Any help would be greatly appreciated. Thank you.
import time
import sys
import requests
url = "https://reservations.piratecoveresort.com/irmdata/api/irm?sessionID=_rdpirm01&arrival=2021-10-26&departure=2021-10-28&people1=1&people2=0&people3=0&people4=0&promocode=&groupnum=&rateplan=RACK&changeResNum=&roomtype=&roomnum=&propertycode=&locationcode=&preferences=&preferences=&preferences=&preferences=&preferences=WTF&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&masterType=&page=&start=0&limit=12&multiRoom=false"
payload={}
headers = {
'authority': 'reservations.piratecoveresort.com',
'method': 'GET',
'path': '/irmdata/api/irm?sessionID=_rdpirm01&arrival=2021-10-26&departure=2021-10-28&people1=1&people2=0&people3=0&people4=0&promocode=&groupnum=&rateplan=RACK&changeResNum=&roomtype=&roomnum=&propertycode=&locationcode=&preferences=&preferences=&preferences=&preferences=&preferences=WTF&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&masterType=&page=&start=0&limit=12&multiRoom=false',
'scheme': 'https',
'accept': 'application/json, text/plain, */*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'authentication': '',
'content-type': 'application/json; charset=utf-8',
'cookie': 'rdpirm01=',
'dnt': '0',
'referer': 'https://reservations.piratecoveresort.com/irmng/',
#'sec-ch-ua': "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99",
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': "Windows",
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Result
Session Expired
You're getting "Session Expired" because the session cookie (and possibly the authentication token too) has expired. You can fix this by using a requests Session, which stores the cookies the server sets and sends them back on later requests. Read more here:
https://docs.python-requests.org/en/master/user/advanced/
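As a rough illustration of what a Session does with cookies, here is a self-contained sketch that runs against a tiny local server instead of the real booking site. Everything about the server is an assumption made for the demo: the `/start` and `/api` paths, the `sessionid=abc123` cookie, and the response bodies are hypothetical stand-ins, not the real site's behaviour.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in: /start sets a cookie, other paths require it."""

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        if self.path == "/start":
            self._reply(200, b"welcome", set_cookie="sessionid=abc123; Path=/")
        elif "sessionid=abc123" in cookie:
            self._reply(200, b"rooms")
        else:
            self._reply(401, b"Session Expired")

    def _reply(self, status, body, set_cookie=None):
        self.send_response(status)
        if set_cookie:
            self.send_header("Set-Cookie", set_cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port for the demo server.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

with requests.Session() as session:
    session.trust_env = False  # local demo only: ignore proxy environment variables
    cold = session.get(base + "/api")   # no cookie stored yet
    session.get(base + "/start")        # server sets the session cookie
    warm = session.get(base + "/api")   # the Session sends the cookie back

print(cold.status_code, cold.text)
print(warm.status_code, warm.text)

server.shutdown()
server.server_close()
```

For the real site, the equivalent would be to load the page named in the referer header with the Session first, then request the API URL with the same Session, without hardcoding the cookie header. Whether that page is the one that issues the cookie is an assumption you would need to check in the browser's network tab.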
When I send a GET request, the response is not human-readable.
My code:
get_log_head = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'sec-ch-ua': '"Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
get_login = session_request.get("https://www.ex.com", headers=get_log_head)
print(get_login.text)
How can I fix this?
This is because the web server is sending you a Brotli-compressed response, since you set 'accept-encoding': 'gzip, deflate, br'.
The requests module handles gzip and deflate natively and decodes them for you automatically (as described in its documentation), but it decodes Brotli only if a Brotli package (brotli or brotlicffi) is installed. Try changing your accept-encoding to
'accept-encoding': 'gzip, deflate'
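Another option, sketched below, is to leave Accept-Encoding out of the copied browser headers and let requests advertise only what it can decode. `requests.utils.default_headers()` returns the headers requests sends by default; the exact Accept-Encoding value depends on the installed versions and on whether a Brotli package is present. The `browser_headers` dict is a shortened, illustrative copy of the headers above.

```python
import requests

# Headers requests would send on its own, including the
# Accept-Encoding value it is able to decode in this environment.
defaults = requests.utils.default_headers()
print(defaults["Accept-Encoding"])

# Shortened, illustrative copy of the browser headers from the question.
browser_headers = {
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Drop the browser's Accept-Encoding so requests falls back to its default.
safe_headers = {
    name: value
    for name, value in browser_headers.items()
    if name.lower() != "accept-encoding"
}
print(safe_headers)
```

Passing `safe_headers` to `session_request.get(...)` keeps the browser-like user agent while avoiding an encoding the client may not be able to decode.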
To get a human readable response I used this code:
import requests
session_request = requests.session()
get_log_head = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-ch-ua': '"Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
get_login = session_request.get("https://www.panago.com", headers=get_log_head)
print(get_login.text)
I want to scrape Facebook company pages for their date (if they have one).
The problem is that when I try to retrieve the HTML, I get what looks like the Hebrew version of it (I'm located in Israel).
This is part of the result:
�1u�9X�/.������~�O+$B\^����y�����e�;�+
Code:
import requests
from bs4 import BeautifulSoup
headers = {'accept': '*/*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'referer': 'https',
'sec-fetch-mode': 'no-cors',
'sec-fetch-site': 'cross-site',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}
url = 'https://www.facebook.com/pg/google/about/'
def fetch(URL):
    try:
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except:
        print('Could not retrieve data, or connect')
fetch(url)
Is there a way to request the English version of the site? Is there a subdomain for it, or should I use a proxy in the request?
What you are seeing isn't the Hebrew version of the site, but a compressed response from the server that requests did not decode (most likely Brotli, since the request advertises br). As a quick solution, you can remove the accept-encoding header from the request:
import requests
from bs4 import BeautifulSoup
headers = {
'accept': '*/*',
# 'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'referer': 'https',
'sec-fetch-mode': 'no-cors',
'sec-fetch-site': 'cross-site',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}
url = 'https://www.facebook.com/pg/google/about/'
def fetch(URL):
    try:
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except:
        print('Could not retrieve data, or connect')
fetch(url)
Prints the uncompressed page:
<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
<head><meta charset="utf-8" /><meta name="referrer" content="origin-when-crossorigin" id="meta_referrer" /><script>window._cstart=+new Date();</script><script>function envFlush(a){function b(b){for(var c in a)b[
...and so on.
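To illustrate why a compressed body looks like random characters, here is a small standard-library sketch. It uses gzip as a stand-in because Brotli needs a third-party package; the effect of printing undecoded bytes is the same. The HTML string is a made-up sample, not Facebook's actual response.

```python
import gzip

# Made-up sample page standing in for the server's HTML.
html = '<!DOCTYPE html><html lang="en"><head><meta charset="utf-8" /></head></html>'
compressed = gzip.compress(html.encode("utf-8"))

# Treating compressed bytes as text yields replacement characters
# and noise, much like the output shown in the question.
garbled = compressed.decode("utf-8", errors="replace")
print(garbled[:40])

# Decompressing first recovers the original page.
restored = gzip.decompress(compressed).decode("utf-8")
print(restored == html)
```

This is why the language of the page is not the issue: the bytes simply were not decompressed before being printed as text.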