I'm having trouble fetching the zomato.com website using Python and the requests library.
import requests
r = requests.get('https://www.zomato.com/san-antonio')
print(r.status_code)
I run this script and get no response. I'm guessing the problem is HTTPS, but I tried some other HTTPS websites and they worked like a charm: 200 was printed to the console.
Am I missing something here?
You'll need to pretend you're coming from an actual browser:
import requests
r = requests.get('https://www.zomato.com/san-antonio', headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
print(r.status_code)
# returns: 200
Related
I am trying to scrape the website https://www.nseindia.com/ in Python.
However, when I try to load it using requests, the call simply hangs. Below is the code I am using.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get('https://www.nseindia.com/', headers=headers)
The requests.get call simply hangs; I'm not sure what I am doing wrong here. The same URL works perfectly in Chrome or any other browser.
Appreciate any help.
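No answer is recorded for this one, but the symptom fits the same pattern as the other questions in this thread. A sketch of how to confirm what requests sends by default, and how an explicit timeout turns an indefinite hang into a catchable error — whether nseindia.com accepts this particular header set is untested:

```python
import requests

# requests identifies itself as "python-requests/<version>" unless told
# otherwise -- this is what servers key on when they drop or stall scripts.
print(requests.utils.default_headers()["User-Agent"])

# A browser-like User-Agent, plus a timeout so a silently dropped
# connection raises an exception instead of hanging forever.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36",
}
# r = requests.get("https://www.nseindia.com/", headers=headers, timeout=10)
```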
I am downloading PDFs with Python, using the requests library. The following code works for some PDF documents, but it throws an error for a few others.
from pathlib import Path
import requests
filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url, verify=False)
filename.write_bytes(response.content)
The following is the exact response (response.content). However, I can download the same document in a Chrome browser without any error:
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.rolls-royce.com/%7e/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf" on this server.<P>\nReference #18.36ad4d68.1562842755.6294c42\n</BODY>\n</HTML>\n'
Is there any way around this?
You get 403 Forbidden because requests sends a User-Agent: python-requests/2.19.1 header by default, and the server denies such requests.
You can copy the correct value for this header from your browser, and everything will be fine.
For example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
r = requests.get(url, headers=headers)
print(r.status_code) # 200
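One refinement to the snippet in the question: check the status before writing response.content to disk, otherwise the "Access Denied" HTML page gets saved with a .pdf extension. raise_for_status does this in one line — the Response below is built by hand purely to demonstrate the exception without a network call:

```python
import requests

# Simulate the blocked response from the question; in real use this
# object comes back from requests.get().
resp = requests.models.Response()
resp.status_code = 403

try:
    resp.raise_for_status()               # no-op on 2xx, raises HTTPError on 4xx/5xx
    # filename.write_bytes(resp.content)  # only reached on success
except requests.HTTPError as err:
    print("Not saving file:", err)
```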
I am trying to download an image from a URL with a python requests session, also adding a User-Agent header, but I'm still getting a 403 Forbidden error. Please help.
My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
s = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#logged in! cookies saved for future requests.
r = s.get("https://example.com/homepage", headers=headers)
#cookies sent automatically!
soup = BeautifulSoup(r.text, 'html.parser')
te = s.get("https://www.example.com/" + soup.find(class_='yes').find('a').get('href'), headers=headers).text
s.get('https://img.example.com/exampleimgcode.jpg', stream=True, headers=headers)
out[]: <Response [403]>
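No answer is recorded here either. Since a User-Agent is already being sent, one common culprit for image hosts specifically is a missing Referer header: some CDNs reject image requests that don't appear to come from a page on the site. This is an assumption to test against the real site, not a confirmed fix, and the example.com URLs are the placeholders from the question:

```python
import requests

s = requests.Session()
s.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/39.0.2171.95 Safari/537.36",
    # Assumption: the image host checks which page the request came from.
    "Referer": "https://example.com/homepage",
})

# Prepare (don't send) the request, to inspect exactly what would go out.
prepared = s.prepare_request(
    requests.Request("GET", "https://img.example.com/exampleimgcode.jpg"))
print(prepared.headers["Referer"])
```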
If I go to http://boxinsider.cratejoy.com/feed/ I can see the XML just fine. But when I try to access it using python requests, I get a 403 error.
blog_url = 'http://boxinsider.cratejoy.com/feed/'
headers = {'Accept': 'text/html,application/xml'}
blog_request = requests.get(blog_url, timeout=10, headers=headers)
Any ideas on why?
Because it's hosted by WPEngine and they filter user agents.
Try this:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})
I'm trying to download pages from the site
http://statsheet.com/
like this
url = 'http://statsheet.com'
urllib2.urlopen(url)
I have tried the Python modules urllib, urllib2, and requests, but I only get error messages like "got a bad status line", "BadStatusLine", or similar.
Is there any way to get around this?
You need to specify a common browser user agent e.g.
wget -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36" http://statsheet.com
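Since the rest of this thread uses requests, the same wget command translates directly: the -U flag becomes the User-Agent header. Sketched here by preparing the request without sending it, so you can see the header that would go over the wire:

```python
import requests

user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/31.0.1650.34 Safari/537.36")

prepared = requests.Session().prepare_request(
    requests.Request("GET", "http://statsheet.com",
                     headers={"User-Agent": user_agent}))
print(prepared.headers["User-Agent"])

# To actually fetch the page:
# r = requests.get("http://statsheet.com", headers={"User-Agent": user_agent})
```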
Related question/answer:
Changing user agent on urllib2.urlopen