I have implemented code to download bhav copies for all dates on the stock market. After scraping about two years of data, it seems my IP got blocked.
This code doesn't work for me:
import urllib.request
url = 'https://www1.nseindia.com/content/historical/DERIVATIVES/2014/APR/fo01APR2014bhav.csv.zip'
response = urllib.request.urlopen(url)
It gives the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I would like to know how I can use some proxy to get the data. Any help would be really appreciated.
import urllib.request
proxy_host = '1.2.3.4:8080' # host and port of your proxy
url = 'https://www1.nseindia.com/content/historical/DERIVATIVES/2014/APR/fo01APR2014bhav.csv.zip'
req = urllib.request.Request(url)
req.set_proxy(proxy_host, 'http')
response = urllib.request.urlopen(req)
For more flexibility, you can use a ProxyHandler (see https://docs.python.org/3/library/urllib.request.html):
proxy_handler = urllib.request.ProxyHandler({'http': '1.2.3.4:3128'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# Either use the opener directly, or install it so plain urlopen() goes through the proxy
urllib.request.install_opener(opener)
response = opener.open(url)
This works fine:
import requests
headers = {
'authority': 'www.nseindia.com',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
'accept': '*/*',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.nseindia.com/content/',
'accept-language': 'en-US,en;q=0.9,lb;q=0.8',
}
url = "https://www1.nseindia.com/content/historical/DERIVATIVES/2014/APR/fo01APR2014bhav.csv.zip"
r = requests.get(url, headers=headers)
with open("data.zip", "wb") as f:
    f.write(r.content)
If you have proxies:
proxy = {"http": "x.x.x.x:pppp",
         "https": "x.x.x.x:pppp"}
r = requests.get(url, headers=headers, proxies=proxy)
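Once data.zip is saved, the bhav copy CSV can be read out with the standard-library zipfile module. A minimal sketch (the member name is not assumed, it is read from the archive itself):
import io
import zipfile

# Open the downloaded archive and read each CSV member (usually a single bhav copy file)
with zipfile.ZipFile("data.zip") as zf:
    for name in zf.namelist():
        with zf.open(name) as member:
            text = io.TextIOWrapper(member, encoding="utf-8").read()
            print(name, "->", len(text.splitlines()), "lines")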
You don't need to use proxies to download this file. The code below will work like a charm:
import urllib.request
url = 'https://www1.nseindia.com/content/historical/DERIVATIVES/2014/APR/fo01APR2014bhav.csv.zip'
req = urllib.request.Request(url)
# Add referer header to bypass "HTTP Error 403: Forbidden"
req.add_header('Referer', 'https://www.nseindia.com')
res = urllib.request.urlopen(req)
# Save it into file.zip
with open("file.zip", "wb") as f:
f.write(res.read())
If you want free proxies, visit https://free-proxy-list.net/ and then follow pyd's answer at https://stackoverflow.com/a/63328368/8009647.
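For example, a minimal sketch that cycles through a few free proxies until one works (the proxy addresses below are placeholders, not live servers):
import requests

url = "https://www1.nseindia.com/content/historical/DERIVATIVES/2014/APR/fo01APR2014bhav.csv.zip"
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://www.nseindia.com"}

# Placeholder proxy addresses; replace with live entries from free-proxy-list.net
proxies_to_try = ["1.2.3.4:8080", "5.6.7.8:3128"]

for proxy in proxies_to_try:
    try:
        r = requests.get(url, headers=headers,
                         proxies={"http": proxy, "https": proxy}, timeout=10)
        if r.ok:
            with open("file.zip", "wb") as f:
                f.write(r.content)
            break
    except requests.RequestException:
        continue  # dead or blocked proxy, try the next one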
I'm trying to make a GET request to an API using HTTPBasicAuth.
I've tested the following in Postman, and received the correct response
URL:"https://someapi.data.io"
username:"username"
password:"password"
And this returns me the data I expect, and all is well.
When I try this in Python, however, I get a 403 error kicked back, along with:
"error_type": "ACCESS DENIED", "message": "Please confirm api-key, api-secret, and permission is correct."
Below is my code:
import requests
from requests.auth import HTTPBasicAuth
URL = 'https://someapi.data.io'
authBasic = HTTPBasicAuth(username='username', password='password')
r = requests.get(URL, auth=authBasic)
print(r)
I honestly can't tell why this isn't working, since the same username and password pass in Postman using HTTPBasicAuth.
You have not supplied all the required parameters; Postman fills these in automatically for you.
To make this work with Python requests, specify all the required parameters yourself, for example:
import requests

headers = {
'Host': 'sub.example.com',
'User-Agent': 'Chrome v22.2 Linux Ubuntu',
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'X-Requested-With': 'XMLHttpRequest'
}
url = 'https://sub.example.com'
response = requests.get(url, headers=headers)
It could be because the user-agent is not defined.
Try the following:
import requests
from requests.auth import HTTPBasicAuth
URL = 'https://someapi.data.io'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
authBasic=HTTPBasicAuth(username='username', password='password')
r = requests.get(URL, auth = authBasic, headers=headers)
print(r)
I want to read a URL in Python but I get errors with different approaches:
import urllib.request
link = "http://data.europa.eu/esco/isco/C0110"
f = urllib.request.urlopen(link)
myfile = f.read()
print(myfile)
HTTPError: HTTP Error 406: Not Acceptable
link = "http://data.europa.eu/esco/isco/C0110"
f = requests.get(link)
print(f)
<Response [406]>
Any idea?
In this particular case you can overcome HTTP 406 by providing appropriate headers as follows:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Accept-Encoding': '*',
'Accept': 'text/html',
'Accept-Language': '*'}
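Then pass the dict to requests (reusing the headers defined above):
import requests

link = "http://data.europa.eu/esco/isco/C0110"
f = requests.get(link, headers=headers)  # headers dict from above
print(f.status_code)   # should be 200 instead of 406 once the Accept headers are in place
print(f.text[:200])    # first part of the returned page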
The link is broken/invalid as a direct page address: as per the site, http://data.europa.eu/esco/isco/C0110 is a URI (an identifier), not a URL.
It seems they have an API set up for the data. You can either:
- check out the API and configure it (https://ec.europa.eu/esco/portal/api), or
- use a module like BeautifulSoup4 to scrape the page you want the content from (a short sketch follows below).
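If you go the scraping route, a minimal sketch with requests and BeautifulSoup4 looks like this (the User-Agent/Accept headers and the title lookup are only illustrative; the selectors you actually need depend on the page):
import requests
from bs4 import BeautifulSoup

link = "http://data.europa.eu/esco/isco/C0110"
headers = {"User-Agent": "Mozilla/5.0", "Accept": "text/html"}

resp = requests.get(link, headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")
# Print the page title as a quick sanity check before writing real selectors
print(soup.title.get_text(strip=True) if soup.title else "no <title> found")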
I am trying to send a file for a candidate in a POST request to the naturalHR API.
I have tried the same request using Postman and it worked fine. But when I try to integrate the API's POST request in Python to attach the file, I get an error saying the cv parameter should be a file (the API's error response).
Source Code:
from pprint import pprint
import json
import requests
import urllib.request
headers = {
'accept': 'application/json',
'Authorization': api_key,
'Host': 'api02.naturalhr.net',
'Referer': 'https://api02.naturalhr.net/api/documentation',
'Content-type': 'multipart/form-data',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
payLoad = dict()
payLoad["firstname"] = json_of_vals['firstname']
payLoad["surname"] = json_of_vals['surname']
payLoad["email"] = json_of_vals['email']
payLoad["cv"] = "Path/To/PDF_File"
files = {'file': "outfilename.pdf"}
api_url = "https://api02.naturalhr.net/api/v1/candidate"
res = requests.post(api_url, files=files, headers=headers, data=request_data)
print(res.content)
Please don't mark this as a duplicate of an already-answered question here, because I have already tested using files as the request's argument, like:
res = requests.post(api_url, files=files, headers=headers, data=request_data)
Edit:
The answer I had already tried: Using Python Requests to send file and JSON in single request
I was adding the header 'accept': 'application/json', which should not be there. I tried it using only the User-Agent and the API key, and it worked perfectly fine as per the requirements.
Corrected Code:
from pprint import pprint
import json
import requests
import urllib.request
headers = {
'Authorization': api_key,
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
payLoad = dict()
payLoad["firstname"] = json_of_vals['firstname']
payLoad["surname"] = json_of_vals['surname']
payLoad["email"] = json_of_vals['email']
files = {'file': open("PATH/TO/FILE/FROM/LOCAL/DRIVE", "rb")}  # open the file in binary mode so its contents are uploaded
api_url = "https://api02.naturalhr.net/api/v1/candidate"
res = requests.post(api_url, headers=headers, data=payLoad, files=files)
print("Status Code is: ", res.status_code)
print("Returned JSON Response is:\n")
pprint(res.text)
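If the endpoint still complains that the cv parameter should be a file, a variant worth trying is to upload the open file handle under that key (the cv field name, filename, and content type below are assumptions based on the original error message, not something confirmed by the API docs):
# Assumption: the API expects the upload under the "cv" field, per the original error message
with open("PATH/TO/FILE/FROM/LOCAL/DRIVE", "rb") as cv_file:
    files = {"cv": ("candidate_cv.pdf", cv_file, "application/pdf")}
    res = requests.post(api_url, headers=headers, data=payLoad, files=files)
    print(res.status_code, res.text)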
I know there are tons of threads and videos on how to do this; I've gone through them all, and I'm in need of a little more advanced guidance.
I am trying to log into this webpage where I have an account so I can send a request to download a report.
First I send the GET request to the login page, then send the POST request, but when I print(resp.content) I get the HTML of the login page back. I do get a 200 status code, but I can't get to the index page. No matter what page I try to get after the POST, it keeps redirecting me back to the login page.
Here are a couple of things I'm not sure I did correctly:
For the headers, I just put in everything that was listed when I inspected the page.
I'm not sure if I need to do something with the cookies.
Below is my code:
import requests
import urllib.parse
url = 'https://myurl.com/login.php'
next_url = 'https://myurl.com/index.php'
username = 'myuser'
password = 'mypw'
headers = {
'Host': 'url.myurl.com',
'Connection': 'keep-alive',
'Content-Length': '127',
'Cache-Control': 'max-age=0',
'Origin': 'https://url.myurl.com',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'https://url.myurl.com/login.php?redirect=1',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.8',
'Cookie': 'PHPSESSID=3rgtou3h0tpjfts77kuho4nnm3'
}
login_payload = {
'XXX_login_name': username,
'XXX_login_password': password,
}
login_payload = urllib.parse.urlencode(login_payload)
r = requests.Session()
r.get(url, headers = headers)
r.post(url, headers = headers, data = login_payload)
resp = r.get(next_url, headers = headers)
print(resp.content)
You don't need to send separate requests for authorization and file download. You can send a single POST that specifies the credentials. Also, in most cases you don't need to send headers. In general your code should look like the following:
import requests
from requests.auth import HTTPBasicAuth

url_to_download = "http://some_site/download?id=100500"
response = requests.post(url_to_download, auth=HTTPBasicAuth('your_login', 'your_password'))
with open('C:\\path\\to\\save\\file', 'wb') as my_file:  # response.content is bytes, so open in binary mode
    my_file.write(response.content)
There are a few more fields in the form data to post:
import requests
data = {"redirect": "1",
"XXX_login_name": "your_username",
"XXX_login_password": "your_password",
"XXX_actionSUBMITLOGIN": "Login",
"XXX_login_php": "1"}
with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
    r1 = s.get("https://eym.sicomasp.com/login.php")
    s.headers["cookie"] = r1.headers["Set-Cookie"]
    pst = s.post("https://eym.sicomasp.com/login.php", data=data)
    print(pst.history)
You may get redirected to index.php automatically after the POST; check pst.history and pst.content to see exactly what is happening.
So I figured out what my problem was, just in case anyone has the same issue in the future. I'm sure different websites have different requirements, but in this case the Cookie header I was sending in the request was blocking it. What I did was grab my cookies from the headers AFTER I logged in, update my headers, and then send the request. This is what ended up working:
(Also, the form data needs to be URL-encoded.)
import requests
import urllib.parse
headers = {
'Host' : 'eym.sicomasp.com',
'Content-Length' : '62',
'Origin' : 'https://eym.sicomasp.com',
'Upgrade-Insecure-Requests' : '1',
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'Referer' : 'https://eym.sicomasp.com/login.php?redirect=1',
'Cookie' : 'PHPSESSID=vdn4er761ash4sb765ud7jakl0; SICOMUSER=31+147234553'
} #Additional cookie information after logging in ^^^^
data = {
'XXX_login_name': 'myuser',
'XXX_login_password': 'mypw',
}
data = urllib.parse.urlencode(data)
with requests.Session() as s:
    s.headers.update(headers)
    resp = s.post('https://eym.sicomasp.com/index.php', data=data)
    print(resp.content)
I am trying to log in to http://site24.way2sms.com/content/index.html
This is the script I've written.
import urllib
import urllib2
url = 'http://site21.way2sms.com/content/index.html'
values = {'username' : 'myusername',
'password' : 'mypassword'}
headers = {'Accept':'*/*',
'Accept-Encoding':'gzip, deflate, sdch',
'Accept-Language':'en-US,en;q=0.8',
'Cache-Control':'max-age=0',
'Connection':'keep-alive',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'If-Modified-Since':'Fri, 13 Nov 2015 17:47:23 GMT',
'Referer':'https://packetforger.wordpress.com/2013/09/13/changing-user-agent-in-python-requests-and-requesocks-and-using-it-in-an-exploit/',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers=headers)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
I am getting a response from the website, but it looks encrypted or garbled, something like:
��:�����G��ʯ#��C���G�X�*�6�?���ך��5�\���:�tF�D1�٫W��<�bnV+w\���q�����$�Q��͇���Aq`��m�*��Օ���)���)�
in my Ubuntu terminal. How can I fix this?
Am I being logged in correctly?
Please help.
The form on that page doesn't post back to the same URL, it posts to http://site21.way2sms.com/content/Login.action.
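A minimal sketch that posts the credentials to that action URL with requests (the username/password field names are taken from the values dict above; the exact field names the form expects are not verified here):
import requests

login_url = "http://site21.way2sms.com/content/Login.action"
payload = {"username": "myusername", "password": "mypassword"}

with requests.Session() as s:
    s.headers["User-Agent"] = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                               "(KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36")
    resp = s.post(login_url, data=payload)
    # requests transparently decompresses gzip/deflate responses, so resp.text is readable
    print(resp.status_code)
    print(resp.text[:300])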