Python requests session image download - 403 Forbidden

I am trying to download an image from a URL using a python requests.Session, and I am adding a User-Agent header as well, but I am still getting a 403 Forbidden error. Please help.
My code:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# logged in! cookies saved for future requests.
r = s.get("https://example.com/homepage", headers=headers)

# cookies sent automatically!
soup = BeautifulSoup(r.text, 'html.parser')
te = s.get("https://www.example.com/" + soup.find(class_='yes').find('a').get('href'), headers=headers).text
s.get('https://img.example.com/exampleimgcode.jpg', stream=True, headers=headers)
Out[]: <Response [403]>
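No answer is shown for this one, but note that the code above already sends a browser User-Agent, which is the fix the related answers below suggest. One additional thing worth trying is a Referer header - this is an assumption, since many image hosts reject requests that don't appear to come from one of their own pages:

import requests

s = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    # assumption: the image host checks which page the request came from
    'Referer': 'https://www.example.com/',
}
r = s.get('https://img.example.com/exampleimgcode.jpg', headers=headers, stream=True)
print(r.status_code)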

Related

Extracting the redirected link from a URL

I am trying to extract the redirected link of a URL: when I open the link I am redirected to another page, and I want to store that final page's link. I first tried the urllib module, but it gave no response; with requests I get a 503:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
print(response)  # Output: <Response [503]>
So, how can I extract this link?
You can use cloudscraper to handle the Cloudflare challenge and follow the redirect:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
r = scraper.get(url)
print(r.url)
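Note that cloudscraper is a third-party package (pip install cloudscraper); create_scraper() returns an object that behaves like a requests.Session but solves Cloudflare's JavaScript challenge before handing back the response.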
Alternatively, you can use the requests library:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
print(response.url)
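If the plain requests call gets past the 503, you can also inspect the full redirect chain; response.history is standard requests behavior, sketched here with the same URL:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)  # redirects are followed by default

# each intermediate hop is a Response object with its own status and URL
for hop in response.history:
    print(hop.status_code, hop.url)
print(response.url)  # the final URL after all redirects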

Unable to read website intermittently with requests

I tried to read a website using Python requests.
However, it sometimes succeeds but sometimes fails.
Here is my code:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = 'https://hn.house.ifeng.com/homedetail/25762.shtml'
res = requests.get(url, headers=headers, timeout=10, verify=False)
What are the possible reasons, and how can I solve it?
Thank you very much.
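One common cause of intermittent failures is a flaky server or connection, and a standard mitigation is to retry with backoff. A minimal sketch using requests' HTTPAdapter with urllib3's Retry - an assumption, since the question doesn't show the actual error:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = 'https://hn.house.ifeng.com/homedetail/25762.shtml'

s = requests.Session()
# retry up to 3 times on connection errors and 5xx responses, with exponential backoff
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))

res = s.get(url, headers=headers, timeout=10, verify=False)
print(res.status_code)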

Access denied while downloading PDF using Python Requests

I am trying to download PDFs with Python, using the requests library. The following code works for some PDF documents, but it throws an error for a few others.
from pathlib import Path
import requests
filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url, verify=False)
filename.write_bytes(response.content)
Following is the exact response (response.content); however, I can download the same document using the Chrome browser without any error:
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.rolls-royce.com/%7e/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf" on this server.<P>\nReference #18.36ad4d68.1562842755.6294c42\n</BODY>\n</HTML>\n'
Is there any way to get around this?
You get 403 Forbidden because requests sends the header User-Agent: python-requests/2.19.1 by default, and the server denies such requests.
You can copy the correct value for this header from your browser, and everything will be fine.
For example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
r = requests.get(url, headers=headers)
print(r.status_code) # 200
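Combining this with the question's own code, the file write stays the same once the request carries a browser User-Agent; raise_for_status() is added here so an error page is never silently written to disk:

from pathlib import Path
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'

r = requests.get(url, headers=headers)
r.raise_for_status()  # fail loudly on 403/404 instead of writing an error page
Path('c:/temp.pdf').write_bytes(r.content)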

Timeout when fetching an https website using Python

I have trouble fetching the zomato.com website using Python and the requests library.
import requests
r = requests.get('https://www.zomato.com/san-antonio')
print(r.status_code)
I run this script and get no response. I'm guessing that the problem is https, but I tried it with some other https websites and it worked like a charm, and 200 was printed to the console.
Am I missing something here?
You'll need to pretend you're coming from an actual browser:
import requests
r = requests.get('https://www.zomato.com/san-antonio', headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
print(r.status_code)
# returns: 200

Python3 Connection reset by peer

urllib.error.URLError: urlopen error [Errno 54] Connection reset by peer
I got this error when trying to fetch notino.com. I guess the site uses some clever way to block screen scrapers. I tried adding a header and a cookie, but it doesn't work:
from urllib.request import urlopen
url = "https://www.notino.com"
html = urlopen(url)
An auto-bot detection mechanism is most likely dropping your connection. You should provide a User-Agent header to fake a browser visit - worked for me:
>>> import requests
>>> url = 'https://www.notino.com'
>>> response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'})
>>> response.status_code
200
This example uses the requests module.
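Since every failure on this page comes down to the default python-requests User-Agent, a session-level header saves repeating it on each call; setting headers on a Session is standard requests behavior:

import requests

s = requests.Session()
# headers set here are merged into every request made through this session
s.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'})

response = s.get('https://www.notino.com')
print(response.status_code)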
