I am unable to fetch a response from this URL, even though it works in a browser (including incognito mode). The request just keeps running with no output and no error. I also tried setting a 'user-agent' request header, but still received no response.
Following is the code used:
import requests
response = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ')
print(response.text)
I want html text from the response page for further use.
The server is checking whether the request is coming from a web browser. If not, it doesn't return anything. Try this:
import requests

# A browser-like User-Agent is enough for this server to respond
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
r = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ', timeout=3, headers=headers)
print(r.text)
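If the request still hangs or comes back empty, it can also help to confirm what the server actually returned before using the body (a small sketch, not part of the original answer):
print(r.status_code)                    # 200 on success, 403 if the server blocked the request
print(r.headers.get('Content-Type'))    # e.g. text/html for the archive page
r.raise_for_status()                    # raise an exception on 4xx/5xx responses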
Related
I'm new to using requests and I can't seem to get https://www.costco.com to respond to a simple requests.get command.
I don't understand why; I believe it is because the site knows I'm not on a browser.
I don't get any response, not even a 404.
So I tried with a simple header, and it worked on my local machine.
headers = {"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
page = requests.get("https://www.costco.com", headers=headers, timeout=10)
But when I put this in an AWS Lambda function it went back to no response.
Is there a way to know why it won't respond at all, e.g. which headers it is expecting?
Note that I have no trouble getting a response from https://google.com.
I am trying to send an HTTP GET request to a certain website, for example https://www.united.com, but it gets stuck with no response.
Here is the code:
from urllib.request import urlopen
url = 'https://www.united.com'
resp = urlopen(url, timeout=10)
Every time, it times out. But the same code works for other URLs, for example https://www.aa.com.
So I wonder what is behind https://www.united.com that keeps me from getting the HTTP request through. Thank you!
Update:
Adding a request header still doesn't work for this site:
from urllib.request import Request, urlopen

url = 'https://www.united.com'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
)
resp = urlopen(req, timeout=3)
The united.com server might only be responding to certain user-agent strings or request headers and blocking everything else. You have to send headers or a user-agent string that their server accepts. This varies from website to website: sites that want to add extra security to their applications can be very specific about which user-agents (i.e. which clients) are allowed to access their resources.
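If the User-Agent alone is not enough, it can help to mirror the fuller set of headers a real browser sends. Below is a minimal sketch with urllib, assuming the server accepts a typical browser header set; the exact values are illustrative and should be copied from your own browser's developer tools:
from urllib.request import Request, urlopen

url = 'https://www.united.com'
# Illustrative browser-like headers; copy the real values from the
# Network tab of your browser's developer tools if these are rejected.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
}
req = Request(url, headers=headers)
resp = urlopen(req, timeout=10)
print(resp.getcode())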
My code isn't working. It gets a 403 error because the site uses Cloudflare.
When I use a local HTTP proxy (Burp Suite, Fiddler, etc.), I can see the csrfToken and it works.
Why does it work when I use a local proxy?
import requests
from bs4 import BeautifulSoup
headerIstek = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041",
    "Sec-Fetch-Site": "none",
    "Accept-Language": "tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7"
}
istekLazim = {"ref":"","display_type":"popup","loc":""}
istekLogin = requests.get("https://www.example.com/join/login-popup/", headers=headerIstek, cookies={"ud_rule_vars":""}, params=istekLazim, verify=False)
soup = BeautifulSoup(istekLogin.text, "html.parser")
print(istekLogin.request.headers)
csrfToken = soup.find("input", {"name":"csrfmiddlewaretoken"})["value"]
print(csrfToken)
Cloudflare performs JavaScript checks in the browser and returns a session if the checks are successful. If you want to run a one-off script to download stuff off of a Cloudflare-protected server, add a session cookie from a previously validated session you obtained using your browser.
The session cookie is named __cfduid. You can get it by fetching a resource using your browser and then opening the developer tools and the network panel. Once you inspect the request, you can see the cookies your browser sent to the server.
Then you can use that cookie for requests using your script:
import requests

cookies = {
    "__cfduid": "xd0c0985ed80ffbc4dd29d1612168766",  # value copied from the browser
}

# image_url is the Cloudflare-protected resource you want to fetch
response = requests.get(image_url, cookies=cookies)
response.raise_for_status()
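Note that these Cloudflare cookies expire and may be tied to the IP address and User-Agent that obtained them, so the value may need to be refreshed from the browser periodically (and the same User-Agent sent along with it) for the script to keep working.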
I am trying to download PDFs with Python, using the requests library. The following code works for some PDF documents but throws an error for a few of them.
from pathlib import Path
import requests
filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url, verify=False)
filename.write_bytes(response.content)
The following is the exact response (response.content); however, I can download the same document using the Chrome browser without any error:
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.rolls-royce.com/%7e/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf" on this server.<P>\nReference #18.36ad4d68.1562842755.6294c42\n</BODY>\n</HTML>\n'
Is there any way to get around this?
You get 403 Forbidden because requests sends a User-Agent: python-requests/2.19.1 header by default, and the server denies the request.
You can copy the correct value for this header from your browser and everything will be fine.
For example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
r = requests.get(url, headers=headers)
print(r.status_code) # 200
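Once the request goes through, the write_bytes call from the question can be used to save the PDF. A minimal sketch, reusing the path, headers, and URL from above:
from pathlib import Path
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'

r = requests.get(url, headers=headers)
r.raise_for_status()  # stop here if the server still refuses the request

# Write the PDF bytes to disk, as in the original question
Path('c:/temp.pdf').write_bytes(r.content)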
As the title states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've set a user agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a generic requests.get.
Here's the code I'm using that 403s. Same result with requests.get etc.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)  # URL is generated elsewhere in my script
So yeah, I assume I'm not setting the user agent right, so it can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I copied all the headers from DevTools and started removing them one by one, and I found the request needs only Accept-Language; it doesn't need User-Agent and it doesn't need a Session.
import requests
url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
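The same trial-and-error can be automated. Here is a hypothetical sketch (the minimal_headers helper is not part of the original answer): it drops each header in turn and keeps only the ones the server actually insists on.
import requests

def minimal_headers(url, full_headers):
    # Hypothetical helper: drop each header in turn and keep only the
    # ones the server actually requires for a successful response.
    needed = dict(full_headers)
    for name in list(full_headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if requests.get(url, headers=trial, timeout=10).ok:
            needed = trial  # the server still responds without this header
    return needed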