I'm new to using requests and I can't seem to get https://www.costco.com to respond to a simple requests.get call. I don't understand why; I suspect the site can tell I'm not a browser.
I don't get any response at all, not even a 404.
So I tried with a simple header, and it worked on my local machine:
headers = {"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
page = requests.get("https://www.costco.com", headers=headers, timeout=10)
But when I put this in an AWS Lambda function, it went back to no response.
Is there a way to find out why it won't respond at all, e.g. which headers it is after?
Note that I have no trouble getting a response from https://google.com.
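For what it's worth, wrapping the call in exception handling shows whether "no response" means a timeout or a dropped connection, which narrows down what the server is doing. A minimal sketch:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}

try:
    page = requests.get("https://www.costco.com", headers=headers, timeout=10)
    # If we get here, the server did answer, just maybe not with a 200
    print(page.status_code)
except requests.exceptions.Timeout:
    print("Timed out: the server accepted the connection but never answered")
except requests.exceptions.ConnectionError as e:
    print(f"Connection failed or was reset: {e}")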
Related
I am trying to send an HTTP GET request to a certain website, for example https://www.united.com, but it gets stuck with no response.
Here is the code:
from urllib.request import urlopen

url = 'https://www.united.com'
resp = urlopen(url, timeout=10)
Every time, it times out. But the same code works for other URLs, for example https://www.aa.com.
So I wonder what is behind https://www.united.com that keeps my HTTP request from getting through. Thank you!
Update:
Adding a request header still doesn't work for this site:
from urllib.request import Request, urlopen

url = 'https://www.united.com'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
)
resp = urlopen(req, timeout=3)
The server behind united.com may respond only to certain user-agent strings or request headers and block everything else. You have to send headers or a user-agent string that their server allows. This varies from website to website: sites that add extra security to their applications can be very specific about which user-agents may access a given resource.
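If the User-Agent alone isn't enough, sending the fuller set of headers a real browser includes sometimes gets past such checks. A sketch along those lines; the header values are just typical browser defaults, and which of them (if any) united.com actually checks is an assumption:

from urllib.request import Request, urlopen

url = 'https://www.united.com'
req = Request(
    url,
    headers={
        # Typical headers a desktop browser sends with a page request
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
    },
)
resp = urlopen(req, timeout=10)
print(resp.status)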
I am unable to fetch a response from this URL, while it works in the browser, even in incognito mode. I'm not sure why it is not working: it just keeps running without any output and no errors. I even tried setting a 'user-agent' request header, but again received no response.
Following is the code used:
import requests
response = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ')
print(response.text)
I want html text from the response page for further use.
The server is checking whether you are sending the request from a web browser. If not, it doesn't return anything. Try this:
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
r = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ', timeout=3, headers=headers)
print(r.text)
I am attempting to run my code on an AWS EC2 (Ubuntu) instance. The code works perfectly fine on my local machine but doesn't seem to be able to connect to the website from inside the server.
I'm assuming it has something to do with the headers. I have installed Firefox and Chrome on the server, but that doesn't seem to change anything.
Any ideas on how to fix this problem would be appreciated.
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
}

# Making a GET request
response = requests.get("https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v", headers=HEADERS)  # hangs here, can't make the request from the server

# Print the response status
print(response.status_code)
Output:
There isn't one; it just stays blank until I KeyboardInterrupt.
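One thing worth doing when a request hangs indefinitely is passing a timeout, so the failure surfaces as an exception instead of a silent hang. A minimal sketch; the timeout values here are arbitrary:

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
}

try:
    # timeout=(connect, read): fail fast instead of blocking forever
    response = requests.get(
        "https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v",
        headers=HEADERS,
        timeout=(5, 15),
    )
    print(response.status_code)
except requests.exceptions.Timeout:
    # A hang that ends in a read timeout usually means the server accepted
    # the connection but is deliberately not answering this client.
    print("Request timed out")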
I am trying to send a GET request to an API using Node.js. I don't have control over the server side. The API requires two things to authenticate the request. I get those two values by logging in manually and then copying them over from Chrome to my script:
A cookie
The user-agent that was used to perform the login
While this whole thing used to work a couple of weeks or months ago, I now keep getting a status 401 (unauthorized). I asked a friend for help, who isn't a pro in Node but is pretty good with Python. He built the same request in Python and, to both our surprise, it works perfectly fine.
So here I am with two scripts that are supposed to perform an absolutely identical action, but with different outcomes. The request headers are identical, and since the Python request works fine, it's confirmed that they are valid and sufficient to authenticate the request. Both scripts run on the same machine under Windows 10.
Script in Node.js (returns a 401 - unauthorized):
const request = require("request");

const url = "https://api.rollbit.com/steam/market?query&order=1&showTradelocked=false&showCustomPriced=true&min=0&max=4294967295"
const headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Cookie': '__Secure-RollbitSession=JWDEFp________HaLLfT'
}

request.get(url, {headers: headers, json: true}, function(err, resp, body){
    console.log(" > Response status in JS: " + resp.statusCode)
})
Same script in Python (returns a 200 - success):
import requests

url = "https://api.rollbit.com/steam/market?query&order=1&showTradelocked=false&showCustomPriced=true&min=0&max=4294967295"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Cookie': '__Secure-RollbitSession=JWDEFp________HaLLfT'
}

r = requests.request("GET", url, headers=headers)
print(" > Response status: in PY:", r.status_code)
Things I've tried:
I intercepted both requests from the scripts above with HTTP Toolkit to see whether Python was adding something to the headers.
Node.js request - returned 401
Python request - returned 200
As seen in the intercepted results, Python adds some Accept-Encoding and Accept headers. I copied the full, exact set of headers Python sends into my Node.js script, but I still get the same result (401), even though the (once again) intercepted requests now look identical.
I'm on the newest Python and tried Node 10.x, 12.18.0 and also the latest release.
At this point I don't know what else to try. I don't really need this to work, but it's completely bugging me that it fails for mysterious reasons, and I would really like to find out what is happening.
Thank you!
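As a supplement to intercepting with HTTP Toolkit, requests can show exactly what it would put on the wire before sending, which makes a line-by-line diff against the Node output easier. A sketch of that debugging step, reusing the values from above; this only inspects, it doesn't fix anything:

import requests

url = "https://api.rollbit.com/steam/market?query&order=1&showTradelocked=false&showCustomPriced=true&min=0&max=4294967295"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Cookie': '__Secure-RollbitSession=JWDEFp________HaLLfT'
}

# Prepare the request through a Session so the library's default headers
# (Accept, Accept-Encoding, etc.) are merged in, then print the final set
# without actually sending anything.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url, headers=headers))
print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")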
As the title above states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've got a user-agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried a generic requests.get.
Here's the code I'm using that 403s. Same result with requests.get etc.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)  # URL is one of the generated links mentioned above
So yeah, I assume I'm not creating the user-agent right, so the site can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I got all the headers from DevTools and started removing them one by one, and I found it needs only Accept-Language: it doesn't need User-Agent, and it doesn't need a Session.
import requests
url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'

headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
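For anyone debugging a similar block, that elimination can be automated: drop one header at a time and re-test. A hedged sketch; the two entries in full_headers are placeholders for whatever you copied from DevTools, and the status code alone may not be a sufficient success check for every site:

import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'

# Paste the full header set copied from DevTools here; these two entries
# are illustrative placeholders.
full_headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0',
}

# Drop one header at a time; if the request still succeeds without it,
# the server doesn't require that header.
for name in list(full_headers):
    trimmed = {k: v for k, v in full_headers.items() if k != name}
    r = requests.get(url, headers=trimmed, timeout=10)
    print(f"without {name}: {r.status_code}")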