I know how to save a page's source code using urllib2:
import urllib2
page = urllib2.urlopen('http://example.com')
page_source = page.read()
with open('page_source.html', 'w') as fid:
    fid.write(page_source)
But how do I save the page source using urllib3? Do I use PoolManager?
Use .data, like this:
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
# r.data is bytes, so open the file in binary mode
with open('page_source.html', 'wb') as fid:
    fid.write(r.data)
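In Python 3, `r.data` is a `bytes` object, so the output file should either be opened in binary mode (`'wb'`) or the bytes should be decoded before being written as text. A minimal sketch of both options follows. The byte string stands in for `r.data`, and the helper names `save_raw` and `save_text` and the UTF-8 default are assumptions made for illustration.

```python
import os
import tempfile


def save_raw(body: bytes, path: str) -> int:
    """Write the response body unchanged; return the number of bytes written."""
    with open(path, "wb") as fid:
        return fid.write(body)


def save_text(body: bytes, path: str, encoding: str = "utf-8") -> int:
    """Decode the body, then write it as text; return the number of characters."""
    text = body.decode(encoding)
    with open(path, "w", encoding=encoding, newline="") as fid:
        return fid.write(text)


# Stand-in for r.data as returned by http.request('GET', ...)
body = "<html><body>caf\u00e9</body></html>".encode("utf-8")

out_dir = tempfile.mkdtemp()
raw_path = os.path.join(out_dir, "page_source.html")
text_path = os.path.join(out_dir, "page_source.txt")
n_bytes = save_raw(body, raw_path)
n_chars = save_text(body, text_path)
```

Binary mode is usually the safer choice when the page's encoding is unknown, because the bytes land on disk exactly as the server sent them.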
I am trying to convert R code to Python and am having trouble with one line (code snippet 1).
I have tried every variation I could think of (requests, wget, urllib.request, etc.), but the Python code (snippet 2) creates a blank file with none of the contents.
(1)
downloader = download.file(url = 'https://www.equibase.com/premium/eqbLateChangeXMLDownload.cfm', destfile = 'C:/Users/bnewell/Desktop/test.xml', quiet = TRUE) # DOWNLOADING XML FILE FROM SITE
unfiltered = xmlToList(xmlParse(download_file))
(2)
import requests
URL = 'https://www.equibase.com/premium/eqbLateChangeXMLDownload.cfm'
response = requests.head(URL, allow_redirects=True)
import requests, shutil
URL = 'https://www.equibase.com/premium/eqbLateChangeXMLDownload.cfm'
page = requests.get(URL, stream=True, allow_redirects=True,
                    headers={'user-agent': 'MyPC'})
with open("File.xml", "wb") as f:
    page.raw.decode_content = True
    shutil.copyfileobj(page.raw, f)
Manually adding a user-agent header makes the file download, for a reason I'm not sure about.
I use shutil to copy the raw stream to the file; this could be replaced by page.iter_content.
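The `iter_content` alternative mentioned above might look like the following sketch. `stream_to_file` is a hypothetical helper, and `FakeResponse` is only a stand-in so the example runs without a network connection; a real response from `requests.get(URL, stream=True)` offers `iter_content` with the same `chunk_size` keyword.

```python
import os
import tempfile


def stream_to_file(response, path, chunk_size=8192):
    """Write a streamed response to path chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                written += f.write(chunk)
    return written


class FakeResponse:
    """Offline stand-in for requests.Response (an assumption for this demo)."""

    def __init__(self, payload):
        self._payload = payload

    def iter_content(self, chunk_size=1):
        for i in range(0, len(self._payload), chunk_size):
            yield self._payload[i:i + chunk_size]


payload = b"<changes><change id='1'/></changes>" * 1000
target = os.path.join(tempfile.mkdtemp(), "File.xml")
total = stream_to_file(FakeResponse(payload), target, chunk_size=4096)
```

One practical difference: `iter_content` decodes gzip and deflate transfer encodings for you, whereas reading `page.raw` directly needs `decode_content = True` to get the same result.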
Try to actually make a GET request:
import requests
URL = 'https://www.equibase.com/premium/eqbLateChangeXMLDownload.cfm'
# allow_redirects is a keyword argument of requests.get, not an HTTP header
response = requests.get(URL, allow_redirects=True)
Then you can access what you are downloading with response.raw, response.text, response.content etc.
For more details see the actual docs
Try something like this instead:
import os
import requests
url = "https://......"
r = requests.get(url, stream=True, allow_redirects=True)
if r.status_code != 200:
    print("Download failed:", r.status_code, r.headers, r.text)
file_path = r"C:\data\...."
with open(file_path, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024 * 8):
        if chunk:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
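The snippet above reports a failed download but still goes on to write whatever body came back. A variation that stops before the output file is even created could look like this sketch. `save_if_ok`, `DownloadError`, and `FakeResponse` are hypothetical names; the fake response only exists so the example runs offline.

```python
import os
import tempfile


class DownloadError(Exception):
    """Raised when the server does not answer with HTTP 200."""


def save_if_ok(response, path, chunk_size=1024 * 8):
    """Write the body to path only when the status code is 200."""
    if response.status_code != 200:
        raise DownloadError("Download failed: HTTP %s" % response.status_code)
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                f.write(chunk)
    return path


class FakeResponse:
    """Offline stand-in for requests.Response (an assumption for this demo)."""

    def __init__(self, status_code, payload=b""):
        self.status_code = status_code
        self._payload = payload

    def iter_content(self, chunk_size=1):
        for i in range(0, len(self._payload), chunk_size):
            yield self._payload[i:i + chunk_size]


out_dir = tempfile.mkdtemp()
ok_path = save_if_ok(FakeResponse(200, b"<xml/>"), os.path.join(out_dir, "ok.xml"))

bad_path = os.path.join(out_dir, "bad.xml")
try:
    save_if_ok(FakeResponse(403), bad_path)
    failed = False
except DownloadError:
    failed = True
```

With a real `requests` response, calling `r.raise_for_status()` before opening the file has a similar effect for 4xx and 5xx answers.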
I have been trying to download a CSV file and a ZIP file from these links:
** https://nseindia.com/content/fo/fo.zip
** https://nseindia.com/archives/nsccl/sett/FOSett_prce_17052019.csv
The following code fails with HTTP Error 403: Forbidden:
import urllib.request
csv_url = 'https://nseindia.com/archives/nsccl/sett/FOSett_prce_17052019.csv'
urllib.request.urlretrieve(csv_url, '17_05.csv')
The problem is that the website's server blocks the default User-Agent of Python-urllib (Python-urllib/3.7). You can get around the block by changing the User-Agent header:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
csv_url = 'https://nseindia.com/archives/nsccl/sett/FOSett_prce_17052019.csv'
urllib.request.urlretrieve(csv_url, '17_05.csv')
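If you would rather not install a global opener, the header can also be attached to a single request with `urllib.request.Request`. The sketch below shows that approach. `build_request` and `download` are hypothetical helper names, and the `file://` URL is only a stand-in so the example runs without contacting the site; whether a given User-Agent string is accepted depends on the server.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Attach the User-Agent to this one request instead of a global opener."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, user_agent="Mozilla/5.0"):
    """Fetch url with the custom User-Agent and stream the body to dest."""
    with urllib.request.urlopen(build_request(url, user_agent)) as resp, \
            open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


# The header travels with the request object (urllib stores the key as "User-agent").
csv_url = 'https://nseindia.com/archives/nsccl/sett/FOSett_prce_17052019.csv'
req = build_request(csv_url)  # built only, not sent
sent_agent = req.get_header("User-agent")

# Offline demo: a file:// URL stands in for the remote CSV.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "source.csv")
with open(source, "wb") as f:
    f.write(b"symbol,price\nABC,1.5\n")
copied = download(pathlib.Path(source).as_uri(), os.path.join(work_dir, "17_05.csv"))
```

For the real file, the call would be `download(csv_url, '17_05.csv')`. Unlike `install_opener`, this leaves other `urllib` calls in the same program untouched.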
This gets the content of the CSV file and parses it into rows:
import csv
import requests
CSV_URL = 'https://nseindia.com/archives/nsccl/sett/FOSett_prce_17052019.csv'
with requests.Session() as s:
    download = s.get(CSV_URL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)
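One caveat: `splitlines()` discards the line endings, so a line break inside a quoted field can be lost. A sketch that avoids this by passing the decoded text to `csv.reader` through `io.StringIO` is shown below. `parse_csv_bytes` is a hypothetical helper, and the sample bytes are made up to stand in for `download.content`.

```python
import csv
import io


def parse_csv_bytes(content, encoding="utf-8"):
    """Decode downloaded CSV bytes and return the rows as lists of strings."""
    text = content.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself, including newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=","))


# Made-up sample standing in for download.content
sample = b'symbol,note\r\nABC,"first line\nsecond line"\r\nXYZ,plain\r\n'
rows = parse_csv_bytes(sample)
for row in rows:
    print(row)
```

If the file is not UTF-8, pass the right `encoding`, or fall back on the `text` attribute of the response, which uses the encoding that `requests` detected.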
Install the package requests.
pip install requests
Then use requests.get to download the file and write it to the desired location:
import requests
csv_url = 'https://nseindia.com/archives/nsccl/sett/FOSett_prce_17052019.csv'
r = requests.get(csv_url, allow_redirects=True)
with open('test.csv', 'wb') as f:
    f.write(r.content)
I am trying to use requests to download an SSRS report. The following code will download an empty Excel file:
import requests
url = 'http://MY REPORT URL HERE/ReportServer?/REPORT NAME HERE&rs:Format=EXCELOPENXML'
s = requests.Session()
s.post(url, data={'_username': 'username', '_password': 'password'})
r = s.get(url)
output_file = r'C:\Saved Reports\File.xlsx'
with open(output_file, 'wb') as downloaded_file:
    for chunk in r.iter_content(100000):
        downloaded_file.write(chunk)
I have successfully used requests_ntlm to complete this task, but I am wondering why the above code is not working as intended. The Excel file turns out to be empty; I feel it is due to an issue with logging in and passing those cookies to the GET request.
I was able to get this to work, but for PDFs. I found the solution here
Here's a piece of my code:
import requests
from requests_ntlm import HttpNtlmAuth
session = requests.Session()
# domain, uid, pwd, reporturl, outputlocation and mdcProp are defined elsewhere
session.auth = HttpNtlmAuth(domain+uid, pwd)
response = session.get(reporturl, stream=True)
print(response.status_code)
with open(outputlocation+mdcProp+'.pdf', 'wb') as pdf:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            pdf.write(chunk)
session.close()
I am trying to download the following file:
https://e4ftl01.cr.usgs.gov/MOLA/MYD14A2.006/2017.10.24/MYD14A2.A2017297.h19v01.006.2017310142443.hdf
I first need to sign into the following site before doing so:
https://urs.earthdata.nasa.gov
After reviewing my browser's web console, I believe it's using a cookie to allow me to download the file. How can I do this using Python? I found out how to retrieve the cookies:
import os, requests
username = 'user'
password = 'pwd'
url = 'https://urs.earthdata.nasa.gov'
r = requests.get(url, auth=(username,password))
cookies = r.cookies
How can I then use this to download the HDF file? I've tried the following but always receive a 401 error.
url2 = "https://e4ftl01.cr.usgs.gov/MOLA/MYD14A2.006/2017.10.24/MYD14A2.A2017297.h19v01.006.2017310142443.hdf"
r2 = requests.get(url2, cookies=r.cookies)
Have you tried simple basic authentication?
from requests.auth import HTTPBasicAuth
url2='https://e4ftl01.cr.usgs.gov/MOLA/MYD14A2.006/2017.10.24/MYD14A2.A2017297.h19v01.006.2017310142443.hdf'
requests.get(url2, auth=HTTPBasicAuth('user', 'pass'))
or read this example
To download a file using the Requests library with the browser's cookies, you can use the following function:
import browser_cookie3
import requests
import shutil
import os
cj = browser_cookie3.brave()
def download_file(url, root_des_path='./'):
    local_filename = url.split('/')[-1]
    local_filename = os.path.join(root_des_path, local_filename)
    # r = requests.get(link, cookies=cj)
    with requests.get(url, cookies=cj, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename
# link is the URL of the file to download
a = download_file(link)
In this example, cj holds the cookies of the Brave browser (you can use Firefox or Chrome instead). These cookies are then passed to Requests to download the file.
Note that you need to install the browser_cookie3 library:
pip install browser-cookie3
I need to download a file from an external source. I am using Basic authentication to log in to the URL:
import requests
response = requests.get('<external url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print (html)
data = requests.get('<API URL to download the attachment>', auth=('<username>', '<password>'), stream=True)
print (data.content)
I am getting the output below:
<url to download the binary data>
\x00\x00\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\xcb\x00\x00\x1e\x00\x1e\x00\xbe\x07\x00\x00.\xcf\x05\x00\x00\x00'
I expect to be able to download the Word document from that URL within the same session.
Working solution
import requests
import shutil
response = requests.get('<url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print (html)
data = requests.get('<url>', auth=('<username>', '<password>'), stream=True)
with open("C:/myfile.docx", 'wb') as f:
    data.raw.decode_content = True
    shutil.copyfileobj(data.raw, f)
I am able to download the file as it is.
When you want to download a file directly you can use shutil.copyfileobj():
https://docs.python.org/2/library/shutil.html#shutil.copyfileobj
You already are passing stream=True to requests which is what you need to get a file-like object back. Just pass that as the source to copyfileobj().
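To see what `copyfileobj()` does in isolation, here is a small offline sketch. `io.BytesIO` stands in for the file-like `data.raw` (an assumption for the demo, as are the random payload and the chunk size). The copy proceeds in fixed-size chunks, so the whole body never has to sit in memory as a single string.

```python
import io
import os
import shutil
import tempfile

payload = os.urandom(256 * 1024)   # pretend this is the attachment
source = io.BytesIO(payload)       # stand-in for data.raw

dest_path = os.path.join(tempfile.mkdtemp(), "myfile.docx")
with open(dest_path, "wb") as f:
    # The optional third argument is the chunk size in bytes.
    shutil.copyfileobj(source, f, 64 * 1024)

size_on_disk = os.path.getsize(dest_path)
```

With a real streamed response, `source` would simply be `data.raw`, with `data.raw.decode_content = True` set first if the server may compress the body.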