How to download NASA satellite OPeNDAP data using Python

I have tried requests, pydap, urllib, and netcdf4 and keep getting either redirect errors or permission errors when trying to download the following NASA data:
GLDAS_NOAH025SUBP_3H: GLDAS Noah Land Surface Model L4 3 Hourly 0.25 x 0.25 degree Subsetted V001 (http://disc.sci.gsfc.nasa.gov/uui/datasets/GLDAS_NOAH025SUBP_3H_V001/summary?keywords=Hydrology)
I am attempting to download about 50k files. Here is an example of one, which works when pasted into the Google Chrome browser (if you have a proper username and password):
http://hydro1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FGLDAS_V1%2FGLDAS_NOAH025SUBP_3H%2F2016%2F244%2FGLDAS_NOAH025SUBP_3H.A2016244.2100.001.2016256190725.grb&FORMAT=TmV0Q0RGLw&BBOX=-11.95%2C28.86%2C-0.62%2C40.81&LABEL=GLDAS_NOAH025SUBP_3H.A2016244.2100.001.2016286201048.pss.nc&SHORTNAME=GLDAS_NOAH025SUBP_3H&SERVICE=SUBSET_GRIB&VERSION=1.02&LAYERS=AAAB&DATASET_VERSION=001
Does anyone have experience getting OPeNDAP NASA data from the web using Python? I am happy to provide more information if desired.
Here is the requests attempt, which gives a 401 error:
import requests

def httpdownload():
    '''loop through each line in the text file and open the url'''
    httpfile = open(pathlist[0] + "NASAdownloadSample.txt", "r")
    for line in httpfile:
        print line
        outname = line[-134:-122] + ".hdf"
        print outname
        username = ""
        password = "*"
        r = requests.get(line, auth=(username, password), stream=True)
        print r.text
        print r.status_code
        with open(pathlist[0] + outname, 'wb') as out:
            out.write(r.content)
        print outname, "finished"  # keep track of progress
And here is the pydap attempt, which gives a redirect error:
import install_cas_client
from pydap.client import open_url

def httpdownload():
    '''read the first url from the text file and open it with pydap'''
    username = ""
    password = ""
    httpfile = open(pathlist[0] + "NASAdownloadSample.txt", "r")
    fileone = httpfile.readline()
    # embed the credentials in the url: http://username:password@host/...
    filetot = fileone[:7] + username + ":" + password + "@" + fileone[7:]
    print filetot
    dataset = open_url(filetot)

I did not find a solution using Python, but given the information I have now, it should be possible. I used wget with a .netrc file and a cookie file, as shown below (https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20Download%20Data%20Files%20from%20HTTP%20Service%20with%20wget):
#!/bin/bash
cd <path to output files>
touch ~/.netrc
echo "machine urs.earthdata.nasa.gov login <username> password <password>" >> ~/.netrc
chmod 0600 ~/.netrc
touch ~/.urs_cookies
wget --content-disposition --trust-server-names --load-cookies ~/.urs_cookies \
     --save-cookies ~/.urs_cookies --auth-no-challenge=on --keep-session-cookies \
     -i <path to text file of url list>
Hope it helps anyone else working with NASA data from this server.
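For anyone who wants to stay in Python, the same approach should translate fairly directly. Here is a minimal sketch (not tested against all 50k files), assuming the ~/.netrc file created above already holds the urs.earthdata.nasa.gov credentials: requests falls back to ~/.netrc when no auth argument is given, and a Session carries the URS cookies through the redirects:

import requests

url = "<remote url of file>"        # one line from the url list
outname = "<local destination>"     # where to save the file

with requests.Session() as session:
    # credentials are read from ~/.netrc; the session keeps the URS cookies
    r = session.get(url, stream=True)
    r.raise_for_status()
    with open(outname, "wb") as out:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)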

I realize it's a bit late to answer for the original poster, but I stumbled across this question while trying to do the same thing, so I'll leave my solution here. It seems the NASA server uses redirects and Basic Authorization in a way the standard libraries don't expect. When you download from (for example) https://hydro1.gesdisc.eosdis.nasa.gov, you'll get redirected to https://urs.earthdata.nasa.gov for authentication. That server sets an authentication token as a cookie and redirects you back to download the file. If you're not handling cookies properly, you'll be stuck in an infinite redirection loop. If you're not handling authentication and redirection properly, you'll get an access denied error.
To get around this problem, chain an HTTPRedirectHandler, an HTTPCookieProcessor, and an HTTPBasicAuthHandler (backed by an HTTPPasswordMgrWithDefaultRealm) into an opener, then either install it as the default opener or use that opener directly.
from urllib import request

username = "<your username>"
password = "<your password>"
url = "<remote url of file>"
filename = "<local destination of file>"

redirectHandler = request.HTTPRedirectHandler()
cookieProcessor = request.HTTPCookieProcessor()
passwordManager = request.HTTPPasswordMgrWithDefaultRealm()
passwordManager.add_password(None, "https://urs.earthdata.nasa.gov", username, password)
authHandler = request.HTTPBasicAuthHandler(passwordManager)

opener = request.build_opener(redirectHandler, cookieProcessor, authHandler)
request.install_opener(opener)
request.urlretrieve(url, filename)
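If you would rather not install a global opener, the same opener object can be used directly; a small sketch of that variant:

# use the opener directly instead of installing it globally
with opener.open(url) as response, open(filename, "wb") as out:
    out.write(response.read())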


Accessing an uploaded file's url

My application sends and receives data to/from my phone, basically 2-way communication through the Pushbullet API.
I am trying to take a file from my phone and, when it's uploaded, do something with it (play it, for example, if it's an audio file).
But when I upload the file on my phone, then list the pushes on my computer and get that exact push, the file URL is restricted.
I got the following XML error response showing "Access Denied" as the message:
403: Forbidden
How would I approach this?
Here is the code for the application:
import time
from urllib.request import urlopen

import pushbullet
from playsound import playsound

def play_sound(url):
    # open the url and then write the contents into a local file
    open("play.mp3", 'wb').write(urlopen(url).read())
    # play sound through the playsound library
    playsound("play.mp3", False)

pb = pushbullet.Pushbullet(API_KEY, ENCRYPTION_PASSWORD)
pushes = pb.get_pushes()
past_pushes = len(pushes)

while True:
    time.sleep(3)
    # checks for new pushes on the phone and then scans them for commands
    pushes = pb.get_pushes()
    number_pushes = len(pushes) - past_pushes
    if number_pushes != 0:
        past_pushes = len(pushes) - number_pushes
        try:
            for i in range(number_pushes):
                push = pushes[i]
                push_body = push.get("body")
                if push_body is not None:
                    play = False
                    if push_body == "play":
                        play = True
                elif play:
                    # only runs if the user has asked to play something beforehand
                    play = False
                    url = push.get('file_url')
                    # play sound from url
                    # this is where I get my 403: forbidden error
                    if url is not None and ".mp3" in url:
                        play_sound(url)
        except Exception as e:
            print(e)
From the docs...
To authenticate for the API, use your access token in a header like Access-Token: <your_access_token_here>.
You're using urlopen(url) without any header information, so the request is denied. Try something like the following:
from urllib.request import Request, urlopen

req = Request('https://dl.pushbulletusercontent.com/...')
req.add_header('Access-Token', '<your token here>')
content = urlopen(req).read()

with open('sound.mp3', 'wb') as f:
    f.write(content)
Reference: How do I set headers using python's urllib?

Google Drive API: How to download files from Google Drive?

import json
import requests

access_token = ''
session = requests.Session()

r = session.request('get', 'https://www.googleapis.com/drive/v3/files?access_token=%s' % access_token)
response_text = str(r.content, encoding='utf-8')
files_list = json.loads(response_text).get('files')

files_id_list = []
for item in files_list:
    files_id_list.append(item.get('id'))

for item in files_id_list:
    file_r = session.request('get', 'https://www.googleapis.com/drive/v3/files/%s?alt=media&access_token=%s' % (item, access_token))
    print(file_r.content)
I use the above code and Google shows:
We're sorry ...
... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.
I don't know whether this method can't be used for downloading at all, or where the problem is.
The reason you are getting this error is that you are requesting the data in a loop, which sends many requests to Google's server in quick succession, hence the error:
We're sorry ... ... but your computer or network may be sending automated queries
Also, the access_token should not be placed in the URL; put it in the Authorization header instead. You can try this on the OAuth Playground site (https://developers.google.com/oauthplayground).
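A minimal sketch of that change, assuming the token is valid and has Drive scope; the token moves from the query string into a standard Authorization: Bearer header:

import requests

access_token = '<your access token>'
session = requests.Session()
# send the token in the Authorization header instead of the URL
session.headers.update({'Authorization': 'Bearer %s' % access_token})

r = session.get('https://www.googleapis.com/drive/v3/files')
for item in r.json().get('files', []):
    file_r = session.get('https://www.googleapis.com/drive/v3/files/%s?alt=media' % item['id'])
    print(file_r.content)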

Why is Python script to download .xlsx from Sharepoint failing only for some URLs?

Using the Python Office365-REST-Python-Client, I have written the following Python function to download Excel spreadsheets from Sharepoint (based on the answer at How to read SharePoint Online (Office365) Excel files in Python with Work or School Account?):
import sys
from urlparse import urlparse
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.file import File

xmlErrText = "<?xml version=\"1.0\" encoding=\"utf-8\"?><m:error"

def download(sourceURL, destPath, username, password):
    print "Download URL: {}".format(sourceURL)

    urlParts = urlparse(sourceURL)
    baseURL = urlParts.scheme + "://" + urlParts.netloc
    relativeURL = urlParts.path
    if len(urlParts.query):
        relativeURL = relativeURL + "?" + urlParts.query

    ctx_auth = AuthenticationContext(baseURL)
    if ctx_auth.acquire_token_for_user(username, password):
        try:
            ctx = ClientContext(baseURL, ctx_auth)
            web = ctx.web
            ctx.load(web)
            ctx.execute_query()
        except:
            print "Failed to execute Sharepoint query (possibly bad username/password?)"
            return False
        print "Logged into Sharepoint: {0}".format(web.properties['Title'])

        response = File.open_binary(ctx, relativeURL)
        if response.content.startswith(xmlErrText):
            print "ERROR response document received. Possibly permissions or wrong URL? Document content follows:\n\n{}\n".format(response.content)
            return False
        else:
            with open(destPath, 'wb') as f:
                f.write(response.content)
            print "Downloaded to: {}".format(destPath)
    else:
        print ctx_auth.get_last_error()
        return False
    return True
This function works fine for some URLs but fails for others, printing the following "file does not exist" document content on failure (newlines and whitespace added for readability):
<?xml version="1.0" encoding="utf-8"?>
<m:error xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<m:code>
-2130575338, Microsoft.SharePoint.SPException
</m:code>
<m:message xml:lang="en-US">
The file /sites/path/to/document.xlsx does not exist.
</m:message>
</m:error>
I know that the username and password are correct. Indeed changing the password results in a completely different error.
I have found that this error can occur when either the document doesn't exist, or when there are insufficient permissions to access the document.
However, using the same username/password, I can download the document with the same URL in a web browser.
Note that this same function consistently works fine for some .xlsx URLs in the same Sharepoint repository, but consistently fails for some other .xlsx URLs in that same Sharepoint repository.
My only guess is that there are some more fine-grained permissions that need to be managed. But I'm completely ignorant of these if they exist.
Can anybody help me to resolve why the failure is occurring and figure out how to get it working for all the required files that I can already download in a web browser?
Additional Notes From Comments Below
The failures are consistent for some URLs. The successes are consistent for other URLs. That is, for any one URL, the result is always the same; it does not come and go.
The files have not moved or been deleted. I can download them using browsers/PCs which have never accessed those files previously.
The source of the URLs is Sharepoint itself. Doing a search in Sharepoint includes those files in the results list with a URL below each file. This is the URL that I'm using for each file. (For some files the script works and for others it does not; for all files the browser works for the same URL.)
The URLs are all correctly encoded. In particular, spaces are encoded with %20.
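One possibility worth testing (an educated guess, not a confirmed fix): File.open_binary expects a server-relative URL, so a path that is still percent-encoded may need to be decoded before the call, even though the encoded form works in a browser:

from urllib import unquote  # Python 2; use urllib.parse.unquote in Python 3

# hypothetical fix: decode %20 and friends before handing the path over
relativeURL = unquote(urlParts.path)
response = File.open_binary(ctx, relativeURL)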

Python two-step login

I want to download files from a dashboard environment in a Python script and manipulate the data in the files. The dashboard environment requires me to log in twice: first into a corporate account and then into a personal account. I can log into the corporate account, but the login into my personal account fails even though I provide the correct credentials.
This is the script I'm trying to use. The parts between the asterisks are changed for privacy reasons:
import csv
import requests

URL_Login = '*baseurl of the dashboard*'
CSV_URL = '*baseurl of the dashboard*/auto/reports/responses/?sheet=1528&item=4231&format=csv'

with requests.Session() as s:
    download = s.get(URL_Login, auth=("*corporate account name*", "*corporate password*"))
    download = s.get(CSV_URL, auth=("*personal account name*", "*personal password*"))

    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)
I get the following error message:
401 - Unauthorized: Access is denied due to invalid credentials.
You do not have permission to view this directory or page using the credentials that you supplied.
Can anything else trigger the 401, because I am very sure I'm providing the correct credentials?
To give the page time to process the first request, try a delay before the second download = s.get(...) statement, such as time.sleep(3) for a 3-second wait (lengthen this to a maximum of about 7 seconds by trial and error if 3 seconds does not work). Import time first, of course.
If that still does not work, it means your requests.Session() needs to be invoked again, so try:
import csv
import time
import requests

URL_Login = '*baseurl of the dashboard*'
CSV_URL = '*baseurl of the dashboard*/auto/reports/responses/?sheet=1528&item=4231&format=csv'

with requests.Session() as s:
    download = s.get(URL_Login, auth=("*corporate account name*", "*corporate password*"))

time.sleep(3)

with requests.Session() as t:
    download = t.get(CSV_URL, auth=("*personal account name*", "*personal password*"))
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)
A third intervention, if the second does not work either, is to go back to your original code and make sure the cookies from the first login persist, as recommended by the first answer here. Note that the config={'verbose': sys.stderr} argument sometimes suggested for this only existed in very old versions of requests; in current versions a Session already persists cookies on its own, so check s.cookies after the first request to confirm the login actually set one.
Leave a comment here if anything goes wrong.
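If the 401 persists, it can help to confirm which of the two requests is actually rejected and whether redirects are involved; a small debugging sketch using the standard requests response attributes:

import requests

with requests.Session() as s:
    first = s.get(URL_Login, auth=("*corporate account name*", "*corporate password*"))
    # status of the corporate login and any redirect hops it went through
    print(first.status_code, [r.url for r in first.history])
    second = s.get(CSV_URL, auth=("*personal account name*", "*personal password*"))
    # status of the personal login; a 401 here points at the second auth step
    print(second.status_code, [r.url for r in second.history])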

How do I pass event arguments to python scripts that run in response to Splunk alerts?

This is my script, but it is not working: it says that sys.argv[8] is out of index range.
Splunk:
Your alert can trigger a shell script or batch file, which must be located in $SPLUNK_HOME/bin/scripts. Use the following attribute/value pairs:
action.script =
Splunk currently enables you to pass arguments to scripts both as command line arguments and as environment variables. This is because command line arguments don't always work with certain interfaces, such as Windows.
The values available in the environment are as follows:
SPLUNK_ARG_0 Script name
SPLUNK_ARG_1 Number of events returned
SPLUNK_ARG_2 Search terms
SPLUNK_ARG_3 Fully qualified query string
SPLUNK_ARG_4 Name of saved search
SPLUNK_ARG_5 Trigger reason (for example, "The number of events was greater than 1")
SPLUNK_ARG_6 Browser URL to view the saved search
SPLUNK_ARG_8 File in which the results for this search are stored (contains raw results)
SPLUNK_ARG_7 is not used for historical reasons.
These can be referenced in UNIX shell as $SPLUNK_ARG_0 and so on, or in Microsoft batch files via %SPLUNK_ARG_0% and so on. In other languages (perl, python, and so on), use the language native methods to access the environment.
#!/usr/bin/python
# Install the requests package for python
import requests
import csv, gzip, sys

# Set the request parameters
url = 'https://xxxxxxxxdev.service-now.com/api/now/table/new_call'
user = 'xxxxx'
pwd = 'xxxxxx'

event_count = int(sys.argv[1])  # number of events returned
results_file = sys.argv[8]      # file with search results

# Set proper headers
headers = {"Content-Type": "application/json", "Accept": "application/json"}

for row in csv.DictReader(openany(results_file)):
    output = "{"
    for name, val in row.iteritems():
        if output != "{":
            output += ","
        output += '"' + name + '":"' + val + '"'
    output += "}"

    # Do the HTTP request
    response = requests.post(url, auth=(user, pwd), headers=headers, data='{"short_description":"Theo\'s Test for Splunk to SN","company":"company\'s domain","u_source":"Service Desk","contact_type":"Alert","description":"Please place detailed alert detail including recommended steps"}')

    # Check for HTTP codes other than 201
    if response.status_code != 201:
        print('Status:', response.status_code, 'Headers:', response.headers, 'Error Response:', response.json())
        exit()

    # Decode the JSON response into a dictionary and use the data
    # resp = response.json()
    # print('Status:', response.status_code, 'Headers:', response.headers, 'Response:', response.json())
    print response.headers['location']
I see you are using the openany function, but you haven't defined it in your code. Could that be causing the issue?
Otherwise, this should definitely be working; it matches my sample code and the code in the Splunk docs:
import csv
import gzip
import sys

def openany(p):
    # open the results file, transparently decompressing .gz files
    if p.endswith(".gz"):
        return gzip.open(p)
    else:
        return open(p)

results = sys.argv[8]
for row in csv.DictReader(openany(results)):
    # do something with row["field"]
    pass
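If sys.argv[8] is still out of range, the Splunk docs quoted in the question offer a fallback: the same values are exposed as environment variables on platforms where command-line arguments are not passed. A hedged sketch of that approach:

import os

# fall back to the SPLUNK_ARG_* environment variables described in the
# Splunk docs when sys.argv is not populated
event_count = int(os.environ.get("SPLUNK_ARG_1", "0"))  # number of events returned
results_file = os.environ.get("SPLUNK_ARG_8")           # file with raw search results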
