I'm trying to build a web scraper from a tutorial I watched.
Replicating the tutorial's code gives me an error. Here is the code:
import requests
import bs4
r = requests.get("http://www.pyclass.com/example.html", headers={"User-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"})
c=r.content
The error says "SyntaxError: invalid character in identifier", and the word headers is highlighted.
I really need to send headers so that I can fetch the data by impersonating a web browser; without them I get a 406 error.
Try the code below, which builds the headers dictionary first and then passes it to requests.get.
import requests
head={"User-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}
r=requests.get("http://www.example.com/", headers=head)
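One common cause of this error (an assumption here, since the original file is not shown) is an invisible non-ASCII character, such as a non-breaking space or a curly quote, pasted in from a web page. Retyping the line by hand removes it. A minimal sketch for locating such characters follows; find_non_ascii and the sample string are hypothetical names made up for illustration:

```python
import unicodedata


def find_non_ascii(source):
    """Return (line_no, col_no, char, unicode_name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # unicodedata.name falls back to "UNKNOWN" for unnamed code points
                hits.append((line_no, col_no, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical sample: a non-breaking space (U+00A0) sits right before "headers".
sample = 'r = requests.get("http://www.example.com/",\u00a0headers=head)'

for hit in find_non_ascii(sample):
    print(hit)
```

To check a real script, pass it the file's text, for example find_non_ascii(open("scraper.py", encoding="utf-8").read()), where scraper.py stands in for your own file name.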
Related
I was trying to download some data using the Python requests library as follows:
import requests
head = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
session = requests.session()
session.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=fobhav&date=02-07-2011&section=FO', headers=head)
r= session.get('https://www1.nseindia.com/content/historical/DERIVATIVES/2011/AUG/fo02AUG2011bhav.csv.zip', headers=head)
print(r.status_code)
print(r.content)
But the above code gives me the following output:
403
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www1.nseindia.com/content/historical/DERIVATIVES/2011/AUG/fo02AUG2011bhav.csv.zip" on this server.<P>\nReference #18.661c2017.1662218167.332744f\n</BODY>\n</HTML>\n'
Why am I getting "Access Denied"? Anyone who simply visits the website can select the date and download the data.
EDIT
Site to visit to get the url: https://www1.nseindia.com/products/content/derivatives/equities/archieve_fo.htm
Select 'bhavcopy' and a date to get the link.
I am unable to fetch a response from this URL, although it works in a browser, even in incognito mode. The script just keeps running without any output and without any errors. I also tried setting the 'user-agent' request header, but again received no response.
Following is the code used:
import requests
response = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ')
print(response.text)
I want the HTML text from the response page for further use.
The server is checking whether the request comes from a web browser. If it does not, the server returns nothing. Try this:
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
r=requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ', timeout=3, headers=headers)
print(r.text)
I want to download PDFs with Python using the requests library. The following code works for some PDF documents but fails for a few others.
from pathlib import Path
import requests
filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url,verify=False)
filename.write_bytes(response.content)
The following is the exact response (response.content). However, I can download the same document in the Chrome browser without any error.
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.rolls-royce.com/%7e/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf" on this server.<P>\nReference #18.36ad4d68.1562842755.6294c42\n</BODY>\n</HTML>\n'
Is there any way to get around this?
You get 403 Forbidden because requests sends a default header of the form User-Agent: python-requests/2.19.1, and the server denies requests that carry it.
Copy the User-Agent value your browser sends and pass it in the headers; the request should then go through.
For example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
r = requests.get(url, headers=headers)
print(r.status_code) # 200
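Because the denial arrives as an HTML page rather than as an exception, the original script happily writes that page to temp.pdf. It may be worth checking the payload before saving it. A minimal sketch, assuming the server sends a well-formed PDF (which starts with the bytes %PDF-); looks_like_pdf is a hypothetical helper name, not part of requests:

```python
def looks_like_pdf(data):
    """Return True if the payload begins with the PDF magic bytes."""
    # Some servers prepend whitespace, so strip leading ASCII whitespace first.
    return data.lstrip().startswith(b"%PDF-")


# The body returned when the server rejects the request is HTML, not a PDF.
denied = b"<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n</BODY>\n</HTML>\n"
print(looks_like_pdf(denied))          # False
print(looks_like_pdf(b"%PDF-1.7\n"))   # True

# Possible use with the request above (not executed here):
#     if r.status_code == 200 and looks_like_pdf(r.content):
#         filename.write_bytes(r.content)
```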
I am trying to use the code below to access websites in Python 3 using urllib:
import urllib.request  # needed so that urllib.request is available
url = "http://www.goal.com"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
r = urllib.request.Request(url=url, headers=headers)
urllib.request.urlopen(r).read(1000)
It works fine when it accesses "yahoo.com", but it always returns error 403 when accessing sites such as "goal.com" and "hkticketing.com.hk", and I cannot figure out what I am missing. I would appreciate your help.
In Python 2.x, you can use urllib2 to fetch the contents. Build an opener, set its addheaders attribute to add the header information, then call its open method, read the contents, and print them.
import urllib2
import sys
print sys.version
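opener = urllib2.build_opener()
# addheaders is a list of (name, value) tuples sent with every request made by this opener
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
print opener.open('http://hkticketing.com.hk').read()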
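The urllib2 snippet above runs on Python 2 only, while the question targets Python 3. A rough Python 3 equivalent using urllib.request might look like the sketch below. make_opener and fetch are hypothetical helper names, and the timeout value is an arbitrary assumption. A browser-like User-Agent is not guaranteed to be enough: some sites return 403 for other reasons (cookies, other headers, IP filtering).

```python
import urllib.request

UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0"


def make_opener(user_agent):
    """Build an opener that sends the given User-Agent with every request."""
    opener = urllib.request.build_opener()
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch(opener, url, timeout=10):
    """Open the URL and return the raw response body as bytes."""
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


opener = make_opener(UA)

# Example call (performs a real network request, so it is left commented out):
#     print(fetch(opener, "http://hkticketing.com.hk")[:1000])
```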
For example, I can easily make the request look like it comes from Firefox:
import urllib2
header = {"User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1"}
req = urllib2.Request("http://google.com", None, header)  # the class name is Request, with a capital R
response = urllib2.urlopen(req)
I was wondering: is there a way to add OS information to the header, or somewhere else, so that the request looks like it comes from a certain OS?
The OS is also part of the user agent string, inside the parentheses. Try the string below.
Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0
More details on the user agent: https://developer.mozilla.org/en-US/docs/Gecko_user_agent_string_reference
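To see which part of a user agent string carries the OS, the parenthesised section can be pulled apart. This is only an illustrative sketch that assumes the common "Mozilla/5.0 (platform; ...) ..." layout; platform_tokens is a hypothetical helper, and real user agent strings vary too much for this to be a reliable parser.

```python
import re


def platform_tokens(user_agent):
    """Return the semicolon-separated tokens from the first parenthesised group."""
    match = re.search(r"\(([^)]*)\)", user_agent)
    if match is None:
        return []
    return [token.strip() for token in match.group(1).split(";")]


ua = "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
print(platform_tokens(ua))  # ['X11', 'Linux i686', 'rv:10.0']
```

Swapping "X11; Linux i686" for, say, "Windows NT 6.1; WOW64" in that group is what makes the request appear to come from a different OS.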