I ran into this problem two days ago and I am still trying to solve it. The bottom line is that I want to use the requests library to send a request, but the response that comes back is not the one that should come. When I tried another library (http.client), the correct response came back, but that library does not support sessions. How can I get the correct response to this request through the requests library? (urllib has the same problem.)
import http.client
conn = http.client.HTTPSConnection("external-api.mediabilling.yandex.ru")
payload = ''
headers = {
    'Accept': '*/*',
    'Referer': 'https://payment-widget.ott.yandex.ru/',
    'Host': 'external-api.mediabilling.yandex.ru',
    'sec-ch-ua-platform': '"Windows"',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Origin': 'https://payment-widget.ott.yandex.ru',
    'Content-Type': 'application/json'
}
conn.request("POST", "/v12/promo-code/info?code=7VBKMNRPYV&language=ru&platform=web", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Correct response (via http.client):
{
    "timestamp": 1674771545,
    "status": 403,
    "error": {
        "name": "security",
        "message": "Access denied"
    },
    "path": "/api/v12/promo-code/info"
}
import httpx

with httpx.Client() as session:
    url = 'https://external-api.mediabilling.yandex.ru/v12/promo-code/info?code=7VBKMNRPYV&language=ru&platform=web'
    response = session.post(url, headers=headers)
    print(response.read())
Wrong response (via httpx):
<!DOCTYPE html>
<html>
<head>
<title>403</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
</html>
The headers are identical! The first request gets a JSON answer, the second gets an HTML answer. I have already tried all the solutions I could find.
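For completeness, this is roughly what the requests attempt looks like (a sketch reusing the same headers dict as above); it also comes back with the HTML page instead of the JSON body:

import requests

# Same headers dict as in the http.client example above
with requests.Session() as session:
    url = 'https://external-api.mediabilling.yandex.ru/v12/promo-code/info?code=7VBKMNRPYV&language=ru&platform=web'
    response = session.post(url, headers=headers)
    print(response.text)  # HTML 403 page instead of the JSON error object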
Related
I am trying to scrape a website but I cannot obtain the Bearer token with my user and password (please tell me if there is any way I can share these with you in a private manner).
Here is my code...
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'http://200.75.4.210:8080/'}
login = {
    'Password': "CXXXXXX",
    'Usuario': "eXXXXXX"}
response = requests.post('http://200.75.4.210:8080/CIODCH/login.aspx', headers=headers, data=login).json()
Can anybody help me find the cause of the error?
You can get the pre-JSON result with:
response = requests.post('http://200.75.4.210:8080/CIODCH/login.aspx', headers=headers, data=login)
This can be analysed further in the REPL:
response.status_code # 200
response.content # b'\r\n\r\n<!DOCTYPE html>\r\n\r\n<html>\r\n<head>\r\n <meta charset="utf-8" />\r\n <meta conte...
This looks like HTML, not JSON. How about sending the application/json Accept header?
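A minimal sketch of that suggestion (same URL and credentials as in the question; whether the endpoint actually answers with JSON when asked is an assumption):

import requests

# Building on the headers and login dicts from the question
headers['Accept'] = 'application/json'  # explicitly ask the server for JSON

response = requests.post('http://200.75.4.210:8080/CIODCH/login.aspx', headers=headers, data=login)
print(response.status_code)
print(response.headers.get('Content-Type'))  # check what the server actually returned
if 'json' in response.headers.get('Content-Type', ''):
    print(response.json())
else:
    print(response.text[:500])  # still HTML: the login probably needs a proper form/session flow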
I am trying to send a GET request with Python like this:
import requests

url = "internal_url"  # I replaced all internal URLs
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0", "Accept": "*/*", "Accept-Language": "en-GB,en;q=0.5", "Accept-Encoding": "gzip, deflate", "X-Requested-With": "XMLHttpRequest", "Connection": "close", "Referer": "internal url"}
r = requests.get(url, headers=headers)
print(r.text)
As a response I am expecting JSON data, but instead I get this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">
      function getCookie(c_name) { // Local function for getting a cookie value
        if (document.cookie.length > 0) {
          c_start = document.cookie.indexOf(c_name + "=");
          if (c_start!=-1) {
            c_start=c_start + c_name.length + 1;
            c_end=document.cookie.indexOf(";", c_start);
            if (c_end==-1)
              c_end = document.cookie.length;
            return unescape(document.cookie.substring(c_start,c_end));
          }
        }
        return "";
      }
      function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie
        var exdate = new Date();
        exdate.setDate(exdate.getDate()+expiredays);
        document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
      }
      function getHostUri() {
        var loc = document.location;
        return loc.toString();
      }
      setCookie('STRING redacted', IP-ADDRESS redacted, 10);
      try {
        location.reload(false);
      } catch (err1) {
        try {
          location.reload();
        } catch (err2) {
          location.href = getHostUri();
        }
      }
</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>
</body>
</html>
When I changed the request to go through the Burp Suite proxy so I can see it, it suddenly works and I get the correct response:
proxies = {"http": "127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
r = requests.get(url, headers=headers, verify=False, proxies=proxies)
My browser displays the correct results as text when I visit the link itself, no Burp Suite proxy needed.
I think it's possible that it has to do with the company proxy, but even when I run the request with the company proxies supplied it still does not work.
Is there something I am missing?
EDIT:
After some more searching it seems like I get redirected when I don't use any proxies in Python. That doesn't happen when I go through the Burp Suite proxy.
After a few days and some outside help I finally found the solution. Posting it here for the future.
My problem was that I was using a partially qualified domain name instead of a fully qualified domain name.
So, for example: myhost instead of myhost.example.com.
Burp Suite and the browser were handling that name resolution for me, but in Python I had to spell it out myself.
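In code the fix is just a matter of writing out the full hostname (a sketch; myhost.example.com and the path are placeholders for the real internal URL):

import requests

# Partially qualified name: the browser and Burp resolve it via the search domain,
# but plain requests ends up redirected to the cookie-bounce page.
# url = "http://myhost/some/endpoint"

# Fully qualified name: the request works from requests directly.
url = "http://myhost.example.com/some/endpoint"

r = requests.get(url)
print(r.status_code)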
This is the Spotify documentation I'm following. Out of the 3 options of 'Authorization Flows', I'm trying the 'Authorization Code Flow'.
I finished step 1, "Have your application request authorization".
I am stuck at step 2, "Have your application request refresh and access tokens".
It asks me to make a POST request that contains the parameters encoded in application/x-www-form-urlencoded, as defined in the OAuth 2.0 specification. Here is what I've done so far with my limited knowledge and Google searching.
import requests
import base64
from html import unescape

url = "https://accounts.spotify.com/api/token"
params = {
    "grant_type": "authorization_code",
    "code": <authorization code I got from step 1>,
    "redirect_uri": "http://127.0.0.1:5000/",
}
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Authorization": base64.b64encode("{}:{}".format(CLIENT_ID, CLIENT_SECRET).encode('UTF-8')).decode('ascii')
}
html = requests.request('post', url, headers=headers, params=params, data=None)
print(html.text)
Result, with response code 400:
{"error":"invalid_client"}
What should I do to make it work? I thought I got all the params right.
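For reference, a hedged sketch of how the token request is usually shaped for this flow per the OAuth 2.0 spec and Spotify's docs: the Authorization header carries a "Basic " prefix in front of the base64 value, and the fields go in the request body (data=) rather than the query string. CLIENT_ID, CLIENT_SECRET and AUTHORIZATION_CODE are assumed to be defined elsewhere.

import base64
import requests

url = "https://accounts.spotify.com/api/token"

# "Basic " + base64(client_id:client_secret) — note the prefix
auth = base64.b64encode("{}:{}".format(CLIENT_ID, CLIENT_SECRET).encode()).decode()
headers = {
    "Authorization": "Basic {}".format(auth),
    "Content-Type": "application/x-www-form-urlencoded",
}
payload = {
    "grant_type": "authorization_code",
    "code": AUTHORIZATION_CODE,           # the code obtained in step 1
    "redirect_uri": "http://127.0.0.1:5000/",  # must match the one used in step 1
}

# Send the fields in the body, not as query params
response = requests.post(url, headers=headers, data=payload)
print(response.status_code, response.text)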
I am trying to scrape this URL:
https://www.bloomberg.com/news/articles/2019-06-03/a-tesla-collapse-would-boost-european-carmakers-bernstein-says
I only want to scrape the title and the posted date, but Bloomberg always blocks me and thinks that I am a robot.
Sample response that I received:
<!doctype html>
<html>
<head>
<title>Bloomberg - Are you a robot?</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
Any idea how I can make the website believe that the request is coming from a browser, using Scrapy?
This is what I've done so far:
def parse(self, response):
    yield scrapy.Request(
        'https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker',
        headers={
            'X-Crawlera-Session': 'create',
            'Referrer': "https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker",
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'accept-language': 'en-US,en;q=0.9,fr;q=0.8,ro;q=0.7,ru;q=0.6,la;q=0.5,pt;q=0.4,de;q=0.3',
            'cache-control': 'max-age=0',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        },
        callback=self.parse_sub)

def parse_sub(self, response):
    print(response.text)
I also use Crawlera, and I added it to settings.py:
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
CRAWLERA_APIKEY = 'API_KEY'
Please help me, thank you.
You need to use headers, mainly to specify a User-Agent, which tells the website general information about the browser and device. There is a massive User-Agent list on GitHub if you need help finding one.
You can specify headers for a specific request like this:
yield Request(url, callback=..., headers={"User-Agent": "user_agent", "Referer": "url_here"})  # plus any other headers you need
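If the same headers should apply to every request, they can also live in settings.py instead of being repeated on each Request (a sketch; the User-Agent string is just an example, and headers alone may not be enough for a site like Bloomberg):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'upgrade-insecure-requests': '1',
}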
Here is the code I am working with.
import requests

headers = {'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.8',
           'Cookie': 'Cookie:PHPSESSID=vev1ekv3grqhh37e8leu1coob1',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
           'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
           'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
           }

with requests.Session() as c:
    url = 'http://172.31.13.135/tpo/spp/'
    c.get(url, headers=headers)
    payload = {'regno': 'myregno', 'password': 'mypassword'}
    c.post(url, data=payload, headers=headers)
    r = c.get('http://172.31.13.135/tpo/spp/home.php', headers=headers)
    print r.content
I get the following message when I run this script.
<script>
alert("Session timeout !");
window.location = "logout.php";
</script><script>
alert("Unauthorised Access!");
window.location = "index.php";
</script>
<!DOCTYPE html>
<html lang="en">
How do I deal with this "session timeout" issue?
Many thanks in advance.
It is really tough to answer when I can't visit the website you are scraping, so here is my guess:
1) Try removing the Cookie entry from your headers; you don't need it, because requests.Session() will collect cookies of its own when it visits url = 'http://172.31.13.135/tpo/spp/' for the first time.
So your headers will be:
headers = {'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.8',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
           'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
           'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
           }
2) Make sure the 'If-Modified-Since' field in the headers really is static with exactly the value you have there; if it is supposed to change, set the date and time programmatically at run time.
3) I am not sure why you have 'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=' in the headers. Try the headers without it. But if you do have to keep it, make sure that this auth value is static too and does not change every time.
Let me know if that helps.
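Put together, a minimal sketch of point 1 (same URLs and form fields as in the question, with the Cookie header dropped so the Session manages cookies itself, and the Proxy-Authorization / If-Modified-Since headers left out to try, per points 2 and 3):

import requests

headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
}

with requests.Session() as c:
    url = 'http://172.31.13.135/tpo/spp/'
    c.get(url, headers=headers)  # first visit: the session stores the PHPSESSID cookie automatically
    payload = {'regno': 'myregno', 'password': 'mypassword'}
    c.post(url, data=payload, headers=headers)
    r = c.get('http://172.31.13.135/tpo/spp/home.php', headers=headers)
    print(r.content)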
You could also just pass the timeout argument in your get and post calls:
timeout = 10  # ten seconds

with requests.Session() as c:
    url = 'http://172.31.13.135/tpo/spp/'
    c.get(url, headers=headers, timeout=timeout)
    payload = {'regno': 'myregno', 'password': 'mypassword'}
    c.post(url, data=payload, headers=headers, timeout=timeout)
    r = c.get('http://172.31.13.135/tpo/spp/home.php', headers=headers, timeout=timeout)
    print r.content
You can tweak the maximum time you want to wait for a response this way.
For more about requests, see the docs.
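If the server takes longer than that, requests raises an exception you can catch; a small sketch, assuming the same session, headers and URL as above:

from requests.exceptions import Timeout

try:
    r = c.get('http://172.31.13.135/tpo/spp/home.php', headers=headers, timeout=timeout)
except Timeout:
    print("No response within {} seconds".format(timeout))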