Bad request trying to scrape page using Python 3

I am trying to scrape the following page using Python 3, but I keep getting HTTP Error 400: Bad Request. I have looked at some previous answers suggesting urllib.quote, which didn't work for me since it's Python 2. I also tried the following code, as suggested by another post, and it still didn't work.
from requests.utils import requote_uri
import urllib.request

url = requote_uri('http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01')
with urllib.request.urlopen(url) as response:
    html = response.read()

The server denies requests whose User-Agent HTTP header doesn't look like a browser's.
Just pick a real browser's User-Agent string and set it as a header on your request:
import urllib.request

url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
}
request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
    html = response.read()
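The same fix works with the requests library, if you prefer it over urllib; a minimal equivalent sketch, assuming requests is installed:
import requests

url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

# requests applies the header and handles the connection for you.
response = requests.get(url, headers=headers)
html = response.text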

Related

Python requests is giving "Access Denied"

I was trying to download some data using the Python requests library as follows:
import requests

head = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
session = requests.Session()
session.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=fobhav&date=02-07-2011&section=FO', headers=head)
r = session.get('https://www1.nseindia.com/content/historical/DERIVATIVES/2011/AUG/fo02AUG2011bhav.csv.zip', headers=head)
print(r.status_code)
print(r.content)
But the above code gives me the following output:
403
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www1.nseindia.com/content/historical/DERIVATIVES/2011/AUG/fo02AUG2011bhav.csv.zip" on this server.<P>\nReference #18.661c2017.1662218167.332744f\n</BODY>\n</HTML>\n'
Why am I getting "Access Denied"? Anyone who visits the website can simply select the date and download the data.
EDIT
Site to visit to get the URL: https://www1.nseindia.com/products/content/derivatives/equities/archieve_fo.htm
Select 'bhavcopy' and a date to get the link.
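One thing worth trying, in the same spirit as the other answers here: some servers check more than the User-Agent, for example the Referer header and cookies set when you visit an earlier page. A minimal sketch along those lines; the extra headers are assumptions, not a confirmed fix for this site:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
    # Assumption: the archive may also check where the request came from.
    'Referer': 'https://www1.nseindia.com/products/content/derivatives/equities/archieve_fo.htm',
}

with requests.Session() as session:
    session.headers.update(headers)
    # Visit the archive page first so the session picks up any cookies the server sets.
    session.get('https://www1.nseindia.com/products/content/derivatives/equities/archieve_fo.htm')
    r = session.get('https://www1.nseindia.com/content/historical/DERIVATIVES/2011/AUG/fo02AUG2011bhav.csv.zip')
    print(r.status_code)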

Unable to fetch a response - Requests library Python

I am unable to fetch a response from this URL, although it works in a browser, even in incognito mode. Not sure why it is not working; it just keeps running without any output and without errors. I even tried setting a 'user-agent' request header, but again received no response.
The following is the code used:
import requests
response = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ')
print(response.text)
I want html text from the response page for further use.
The server is checking whether you are sending the request from a web browser; if not, it doesn't return anything, and since requests has no default timeout, the call just hangs. Try this, which sets a browser User-Agent and a three-second timeout:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
r = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ', timeout=3, headers=headers)
print(r.text)

Amazon.com returns status 503

I am trying to get the https://www.amazon.com content with the Python Requests library, but I got a server error instantly. Here is the code:
import requests
response = requests.get('https://www.amazon.com')
print(response)
And this code returns <Response [503]>. Can anyone tell me why this is happening and how to fix it?
Amazon requires that you specify a browser-like User-Agent HTTP header before it returns a 200 response:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
response = requests.get('https://www.amazon.com', headers=headers)
print(response)
Prints:
<Response [200]>
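The 503 without headers is expected: by default, requests identifies itself with a User-Agent of the form python-requests/<version>, which Amazon's front end rejects. You can check what your installation sends:
import requests

# Prints something like 'python-requests/2.31.0', depending on the installed version.
print(requests.utils.default_user_agent())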
Try this:
import requests
headers = {'User-Agent': 'Mozilla 5.0'}
response = requests.get('https://www.amazon.com', headers=headers)
print(response)
You are printing the response object itself rather than the content you want from it.
The code should be like this:
import requests
response = requests.get('https://www.amazon.com')
print(response.content)
You can also use json, status_code, or text in place of content.

Getting information in inspect element

I'm trying to get all the information shown in a browser's "Inspect" panel (for example in Chrome). Currently I can get the page "source", but it doesn't contain everything that the inspector contains.
When I tried using
with urllib.request.urlopen(section_url) as url:
    html = url.read()
I got the following error message: "urllib.error.HTTPError: HTTP Error 403: Forbidden"
Now I'm assuming this is because the URL I'm trying to get is an HTTPS URL instead of an HTTP one, and I was wondering if there is a specific way to get that information over HTTPS, since the normal methods aren't working.
Note: I've also tried this, but it didn't show me everything:
f = requests.get(url)
print(f.text)
You need to set a User-Agent to make the server think you're a browser and not a robot.
import urllib.request, urllib.error, urllib.parse

url = 'http://www.google.com'  # Input your url
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = {'User-Agent': user_agent}

req = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(req)
html = response.read()
response.close()
adapted from https://stackoverflow.com/a/3949760/6622817
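Note that headers alone won't make the fetched HTML match the "Inspect" panel: the inspector shows the DOM after JavaScript has run, while urllib and requests only return the raw HTML the server sends. For JavaScript-rendered content you need browser automation; a minimal sketch using Selenium, assuming Selenium and a matching browser driver are available:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.google.com')  # Input your url
# page_source reflects the DOM after scripts have run, closer to what "Inspect" shows.
html = driver.page_source
driver.quit()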

Trying to log in and scrape a website through ASP.NET

I have written a program that logs into one of my company's websites and then scrapes data, with the aim of making data collection quicker. It uses requests and Beautiful Soup.
I can get it to print out the HTML code for a page, but I can't get it to log in past the .aspx page and then print the HTML of the page after it.
Below is the code I'm using, along with my headers and form data. Any help would be appreciated.
import requests
from bs4 import BeautifulSoup

URL = "http://mycompanywebsiteloginpage.co.uk/Login.aspx"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0 Iceweasel/44.0.2"}
username = "myusername"
password = "mypassword"

s = requests.Session()
s.headers.update(headers)
r = s.get(URL)
soup = BeautifulSoup(r.content)
VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
EVENTTARGET = soup.find(id="__EVENTTARGET")['value']
EVENTARGUEMENT = soup.find(id="__EVENTARGUMENT")['value']

login_data = {"__VIEWSTATE": VIEWSTATE,
              "ctl00$ContentPlaceHolder1$_tbEngineerUsername": username,
              "ctl00$ContentPlaceHolder1$_tbEngineerPassword": password,
              "ctl00$ContentPlaceHolder1$_tbSiteOwnerEmail": "",
              "ctl00$ContentPlaceHolder1$_tbSiteOwnerPassword": "",
              "ctl00$ContentPlaceHolder1$tbAdminName": username,
              "ctl00$ContentPlaceHolder1$tbAdminPassword": password,
              "__EVENTVALIDATION": EVENTVALIDATION,
              "__EVENTTARGET": EVENTTARGET,
              "--EVENTARGUEMENT": EVENTARGUEMENT}

r = s.post(URL, data=login_data)
r = requests.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/")
print(r.url)
print(r.text)
FORM DATA
__VIEWSTATE:"DAwNEAIAAA4BBQAOAQ0QAgAADgEFAw4BDRACDwEBBm9ubG9hZAFkU2hvd1BhbmVsKCdjdGwwMF9Db250ZW50UGxhY2VIb2xkZXIxX19wbkFkbWluaXN0cmF0b3JzJywgZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoJ2FkbWluTG9naW5MaW5rJykpOwAOAQUBDgENEAIAAA4DBQEFBwULDgMNEAIMDwEBDUFsdGVybmF0ZVRleHQBDldEU0kgRGFzaGJvYXJkAAAAAA0QAgAADgIFAAUBDgINEAIPAQEEVGV4dAEEV0RTSQAAAA0QAgwPAQEHVmlzaWJsZQgAAAAADRACDwECBAABBFdEU2kAAAAAAABCX8QugS7ztoUJMfDmZ0s20ZNQfQ=="
ctl00$ContentPlaceHolder1$_tbEngineerUsername:"myusername"
ctl00$ContentPlaceHolder1$_tbEngineerPassword:"mypassword"
ctl00$ContentPlaceHolder1$_tbSiteOwnerEmail:""
ctl00$ContentPlaceHolder1$_tbSiteOwnerPassword:""
ctl00$ContentPlaceHolder1$tbAdminName:"myusername"
ctl00$ContentPlaceHolder1$tbAdminPassword:"mypassword"
__EVENTVALIDATION:"HQABAAAA/////wEAAAAAAAAADwEAAAAKAAAACBzHEFXh+HCtf3vdl8crWr6QZnmaeK7pMzThEoU2hwqJxnlkQDX2XLkLAOuKEnW/qBMtNK2cdpQgNxoGtq65"
__EVENTTARGET:"ctl00$ContentPlaceHolder1$_btAdminLogin"
__EVENTARGUMENT:""
REQUEST COOKIES
ASP.NET_SessionId:"11513CDDE31AF267CCD87BAB"
RESPONSE HEADERS
Cache-Control:"private"
Connection:"Keep-Alive"
Content-Length:"123"
Content-Type:"text/html; charset=utf-8"
Date:"Thu, 28 Jul 2016 13:37:45 GMT"
Keep-Alive:"timeout=15, max=91"
Location:"/Secure/"
Server:"Apache/2.2.14 (Ubuntu)"
x-aspnet-version:"2.0.50727"
REQUEST HEADERS
Host:"mycompanywebsite.co.uk"
User-Agent:"Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0 Iceweasel/44.0.2"
Accept:"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
Accept-Language:"en-US,en;q=0.5"
Accept-Encoding:"gzip, deflate"
Referer:"http://mycompanywebsiteloginpage/Login.aspx"
Cookie:"ASP.NET_SessionId=F11CB47B137ADB66D2274758"
Connection:"keep-alive"
Change the line
r = requests.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/")
to use your session object:
r = s.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/")
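The bare requests.get call starts from an empty cookie jar, so the ASP.NET_SessionId cookie obtained during login is never sent; the Session object stores that cookie and replays it automatically. A minimal illustration of the difference (hypothetical form data, reusing the URLs above):
import requests

s = requests.Session()
# The login response sets the ASP.NET_SessionId cookie in the session's cookie jar.
s.post("http://mycompanywebsiteloginpage.co.uk/Login.aspx", data={"user": "myusername"})
print(s.cookies)  # the session keeps whatever cookies the server set

# The session sends those cookies automatically on every later request.
r = s.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/")

# A bare requests.get() would start with no cookies, so the server would treat
# it as a brand-new, unauthenticated visitor and bounce it back to the login page.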
