I am trying to get https://www.amazon.com content with the Python Requests library, but I get a server error straight away. Here is the code:
import requests
response = requests.get('https://www.amazon.com')
print(response)
This code returns <Response [503]>. Can anyone tell me why this is happening and how to fix it?
Amazon requires that you specify a User-Agent HTTP header before it returns a 200 response:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
response = requests.get('https://www.amazon.com', headers=headers)
print(response)
Prints:
<Response [200]>
Try this:
import requests
headers = {'User-Agent': 'Mozilla 5.0'}
response = requests.get('https://www.amazon.com', headers=headers)
print(response)
You have not shown the code that actually pulls the info out of the response. It should look like this:
import requests
response = requests.get('https://www.amazon.com')
print(response.content)
You can also use response.json(), response.status_code, or response.text in place of response.content.
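For example, a quick sketch of those attributes, reusing the User-Agent fix from the answer above (note that amazon.com returns HTML, so .json() would only apply to JSON endpoints):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
response = requests.get('https://www.amazon.com', headers=headers)

print(response.status_code)    # numeric status, e.g. 200
print(response.text[:300])     # body decoded to str (first 300 characters)
print(response.content[:300])  # raw body as bytes
# response.json() parses the body as JSON; it only makes sense for JSON
# endpoints, not for an HTML page like this one.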
I'm trying to pull data from this URL. It works in Postman but not locally in Python (3.8). The connection closes without a response. Also tried manually passing in my browser cookies but had the same results.
Below is the code generated by Postman after copying the XHR request as cURL (bash):
import requests
url = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?tradeDate=10/07/2022"
payload={}
headers = {
'Cookie': 'ak_bmsc=40201F1BC4FD8456EED03A38A16CBC95~000000000000000000000000000000~YAAQj2V0aGgBgqeDAQAAPzYKsxFH4BM3CXIxGLs0BpfFzUiVR7t+Ul6Q9U64ItnBxPPhosD8CEBZ03QGfv4XHioHnh1Hzn3E0Kc17EV4dAMLsUySAsUwh3Q+MD9zf5gNh4nCZXkoP+ChCHkYJ+uR1qxnPRZK8yu4USf8by8Js6LcoO3X2WPWkHw5LAsBcImL5hdhYDCX9n2bS3j/vHRyT2cg6iE0YLrAK6eLwgp6w8EFN9JhRKyL8AGYcYEJm6Rxk2EFQ62cG12uSW5pSl/h5yF/Z5qF8+0xXi3yhcBZ9vEvz9W8YPw9gbreYAvURg4wZtkxtxJyBkgfwlGkbc+NnzcErzlmH2b9ZYjs+vuP3GK0zP/c1e3BKgVEz/iQ; bm_sv=4E1D62DAE9E148686F96340796FD4A79~YAAQj2V0aDr/hKeDAQAAuChCsxGk32eAruqs2a29BNi48QW5E1rqQqbyowotXKQ1+hoMqvIsxi/uXHUQ+csp+U4/P6dMDker8yWYw80MxnzYfQ0k1UMD4VtKUGthUwGgBHrP42vpUbUMkVXVgjJh6OQrEwEFyP9T/wZGi8HraSMtkUJ2fmySYJtHS5Hvxr5oGlv9RtG2zlsq30gBxaJI1Y/j5HTh1hIKLsmI/VmrrTU9kI3M4zgoAF+TU8C1tWGG8bhr~1'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
EDIT: I had to modify the cURL as I was getting an error importing it into Postman. This is the cURL (bash) I used in Postman that returned the proper response:
curl 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?strategy=DEFAULT&tradeDate=10/07/2022&pageSize=500&isProtected&_t=1665158458937' \
Any ideas how to fix the request? None of the other SO threads seemed to have the answer.
You need to provide the User-Agent header, and that's actually all you need for this URL:
import requests
AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'
URL = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?tradeDate=10/07/2022"
headers = {'User-Agent': AGENT}
(r := requests.get(URL, headers=headers)).raise_for_status()
print(r.json())
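If you'd rather not use the walrus operator (which requires Python 3.8+), the same request can be written as a plain two-step version:
import requests

AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'
URL = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/10047/FUT?tradeDate=10/07/2022"

r = requests.get(URL, headers={'User-Agent': AGENT})
r.raise_for_status()  # raises requests.HTTPError for any 4xx/5xx response
print(r.json())       # the endpoint returns JSON, so .json() gives a dict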
I have very limited knowledge of web crawling/scraping and am trying to write a crawler for this URL. However, when I do the usual printing of the response text from the server, I get this:
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
I don't think there's anything wrong with the code, as it works on other websites I've tried it on. I was hoping you folks here could help me figure this out. And this is just a hunch, but is this caused by the URL not ending in .xml?
import requests
url = 'https://phys.org/rss-feed/'
res = requests.get(url)
print(res.text[:500])
Try using BeautifulSoup and a User-Agent header to make your request look like a real browser's:
import requests
from bs4 import BeautifulSoup  # the "lxml" parser below requires lxml to be installed
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
soup = BeautifulSoup(resp.content, "lxml")
print(soup)
Just masking alone also works:
import requests
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
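Since the target is an RSS feed, you can go one step further and pull out the individual entries once the masked request succeeds. A sketch, assuming the feed uses the standard RSS <item>, <title> and <link> tags:
import requests
from bs4 import BeautifulSoup

URL = 'https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

resp = requests.get(URL, headers={"user-agent": USER_AGENT})
soup = BeautifulSoup(resp.content, "xml")  # the "xml" parser handles RSS; requires lxml

for item in soup.find_all("item"):
    # standard RSS item fields; assumed to be present in this feed
    print(item.title.get_text(), item.link.get_text())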
I am trying to log in to www.ebay-kleinanzeigen.de using the requests library, but every time I try to post my data (on the register page it's the same as on the login page) I get a 403 error.
Here is the code for the register function:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'user-agent': user_agent, 'Referer': 'https://www.ebay-kleinanzeigen.de'}
with requests.Session() as c:
    url = 'https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html'
    c.headers = headers
    hp = c.get(url, headers=headers)
    soup = BeautifulSoup(hp.content, 'html.parser')
    crsf = soup.find('input', {'name': '_csrf'})['value']
    print(crsf)
    payload = dict(email='test.email@emailzz1.de', password='test123', passwordConfirmation='test123',
                   _marketingOptIn='on', _crsf=crsf)
    page = c.post(url, data=payload, headers=headers)
    print(page.text)
    print(page.url)
    print(page.status_code)
Is the problem that I need more headers? Aren't a user-agent and a referrer enough? I have tried adding all the requested headers, but then I get no response.
I have managed to create a script that successfully completes the registration form you're trying to fill in, using the mechanicalsoup library. Note that you will have to check your email account manually for the confirmation email they send you to complete registration.
I realise this doesn't actually answer the question of why the requests-based approach returned a 403 Forbidden error, but it does complete your task without hitting the same error.
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html")
browser.select_form('#registration-form')
browser.get_current_form().print_summary()
browser["email"] = "mailuser#emailprovider.com"
browser["password"] = "testSO12345"
browser["passwordConfirmation"] = "testSO12345"
response = browser.submit_selected()
rsp_code = response.status_code
#print(response.text)
print("Response code:",rsp_code)
if rsp_code == 200:
    print("Success! Opening a local debug copy of the page... (no CSS formatting)")
    browser.launch_browser()
else:
    print("Failure!")
I am trying to scrape the following page using Python 3, but I keep getting HTTP Error 400: Bad Request. I have looked at some of the previous answers suggesting urllib.quote, which didn't work for me since it's Python 2. I also tried the following code, as suggested by another post, and it still didn't work.
import urllib.request
from requests.utils import requote_uri

url = requote_uri('http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01')
with urllib.request.urlopen(url) as response:
    html = response.read()
The server denies queries whose User-Agent HTTP header doesn't look like a real browser's.
Just pick a browser's User-Agent string and set it as a header on your query:
import urllib.request
url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
}
request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
    html = response.read()
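If you prefer the requests library over urllib, the same User-Agent fix applies; a sketch:
import requests

url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
}

response = requests.get(url, headers=headers)
response.raise_for_status()  # raises if the server still returns 4xx/5xx
html = response.text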
import requests
import webbrowser
from bs4 import BeautifulSoup
url = 'https://www.gamefaqs.com'
#headers={'User-Agent': 'Mozilla/5.0'}
headers ={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
response = requests.get(url, headers)
response.status_code is returning 403.
I can browse the website using Firefox/Chrome, so it seems to be a coding error.
I can't figure out what mistake I'm making.
Thank you.
This works if you make the request through a Session object.
import requests
session = requests.Session()
response = session.get('https://www.gamefaqs.com', headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)
Output:
200
Using the headers keyword argument works for me. In your code, requests.get(url, headers) passes the dict as the second positional parameter (params), so no User-Agent header is actually sent:
import requests
headers={'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.gamefaqs.com', headers=headers)
Try using a Session.
import requests
session = requests.Session()
url = 'https://www.gamefaqs.com'
response = session.get(url, headers={'user-agent': 'Mozilla/5.0'})
print(response.status_code)
If the request still returns 403 Forbidden (even with a session object and a user-agent in the headers), you may need to add more headers:
headers = {
    'user-agent': "Mozilla/5.0 ...",
    'accept': 'text/html,application...',
    'referer': 'https://...',
}
r = session.get(url, headers=headers)
In Chrome, the request headers can be found under Network > Headers > Request Headers in the Developer Tools (press F12 to open them).
The reason is that some websites check for a user-agent or for the presence of specific headers before accepting the request.
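For illustration, a fuller header set might look like the sketch below; the values are only examples, so copy the real strings from your own DevTools session:
import requests

session = requests.Session()
headers = {
    # illustrative values only; take the real strings from your browser's DevTools
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
    'referer': 'https://www.gamefaqs.com/',
}
r = session.get('https://www.gamefaqs.com', headers=headers)
print(r.status_code)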