Code to download data at once from different directories using Python

I am new to Python.
I want to download data through code from this URL: "ftp://cddis.nasa.gov/gnss/products/ionex/". The files that I want have this format: "codgxxxx.xxx.Z".
All these files sit inside a directory for each year, as shown in the screenshots.
How can I download just those files using Python?
Until now I have been using wget with a command like wget "ftp://cddis.nasa.gov/gnss/products/ionex/2008/246/codg0246.07i.Z" for each one of the files, but that is too tedious.
Can anyone help me, please?
Thank you

Since you know the directory structure on the FTP server, this can be pretty easy to accomplish without having to use ftplib.
It would be cleaner to actually retrieve a directory listing from the server, as described in this question.
(I don't seem to be able to connect to that nasa URL though)
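If anonymous FTP listing happens to work, a rough ftplib sketch for grabbing every CODG file from one day's directory could look like this (untested; the host and path come from the question, the filename filter is an assumption):

from ftplib import FTP

HOST = "cddis.nasa.gov"
REMOTE_DIR = "gnss/products/ionex/2008/246"   # <year>/<day-of-year>, per the example URL

ftp = FTP(HOST)
ftp.login()                                   # anonymous login; the server may refuse it
ftp.cwd(REMOTE_DIR)

for name in ftp.nlst():                       # list the files in that day's directory
    if name.startswith("codg") and name.endswith(".Z"):
        with open(name, "wb") as f:
            ftp.retrbinary("RETR " + name, f.write)

ftp.quit()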
I would recommend reading here for more on how to actually perform an FTP download.
But something like this may work. (full disclosure: I haven't tested it)
import urllib.request
import urllib.error

YEARS_TO_DOWNLOAD = 12
BASE_URL = "ftp://cddis.nasa.gov/gnss/products/ionex/"
FILE_PATTERN = "codg{}.{}.Z"
SAVE_DIR = "/home/your_name/nasa_ftp/"
year = 2006
three_digit_number = 0

for i in range(YEARS_TO_DOWNLOAD):
    target = FILE_PATTERN.format(year + i, str(three_digit_number).zfill(3))
    try:
        urllib.request.urlretrieve(BASE_URL + target, SAVE_DIR + target)
    except urllib.error.URLError as e:
        print("An error occurred trying to download {}.\nReason: {}".format(target, e))
    else:
        print("{} -> {}".format(target, SAVE_DIR + target))

print("Download finished!")

Related

How to filter filenames with extension on API call?

I was working with the Python Confluence API to download attachments from a Confluence page, and I need to download only files with the .mpp extension. I tried glob and direct parameters, but it didn't work.
Here is my code:
import glob
import os

import requests
from requests.auth import HTTPBasicAuth

# `confluence` is an already-configured Confluence client instance
file_name = glob.glob("*.mpp")
attachments_container = confluence.get_attachments_from_content(page_id=33110, start=0, limit=1, filename=file_name)
print(attachments_container)
attachments = attachments_container['results']
for attachment in attachments:
    fname = attachment['title']
    download_link = confluence.url + attachment['_links']['download']
    r = requests.get(download_link, auth=HTTPBasicAuth(confluence.username, confluence.password))
    if r.status_code == 200:
        if not os.path.exists('phoenix'):
            os.makedirs('phoenix')
        fname = ".\\phoenix\\" + fname
glob.glob() operates on your local folder, so you can't use that as a filter for get_attachments_from_content(). Also, don't specify a limit of 1, since that gets you just one (the first) attachment. Specify a high limit or whatever default will include all of them. (You may have to paginate results.)
However, you can exclude the files you don't want by checking the title of each attachment before you download it, which you have as fname = attachment['title'].
attachments_container = confluence.get_attachments_from_content(page_id=33110, limit=1000)
attachments = attachments_container['results']
for attachment in attachments:
    fname = attachment['title']
    if not fname.lower().endswith('.mpp'):
        # skip file if it's not got that extension
        continue
    download_link = ...
    # rest of your code here
Also, your code looks like a copy-paste from this answer but you've changed the actual "downloading" part of it. So if your next StackOverflow question is going to be "how to download a file from confluence", use that answer's code.
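If a page has more attachments than one call returns, a hedged sketch of paginating with the same start/limit parameters used in the question's own call might look like this (it assumes an empty results list marks the end; the page size of 50 is arbitrary):

def iter_mpp_attachments(confluence, page_id, page_size=50):
    # Yield .mpp attachments from a Confluence page, fetching them batch by batch
    start = 0
    while True:
        container = confluence.get_attachments_from_content(page_id=page_id, start=start, limit=page_size)
        results = container['results']
        if not results:
            break
        for attachment in results:
            if attachment['title'].lower().endswith('.mpp'):
                yield attachment
        start += page_size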

is there a way to download multiple files using the requests module

I want to download multiple .hdr files from a website called hdrihaven.com.
My knowledge of python is not that great but here is what I have tried so far:
import requests
url = 'https://hdrihaven.com/files/hdris/'
resolution = '4k'
file = 'pump_station' #would need to be every file
url_2k = url + file + '_' + resolution + '.hdr'
print(url_2k)
r = requests.get(url_2k, allow_redirects=True)
open(file + resolution + '.hdr', 'wb').write(r.content)
Ideally, file would just loop over every file in the directory.
Thanks for your answers in advance!
EDIT
I found a script on github that does what I need: https://github.com/Alzy/hdrihaven_dl. I edited it to fit my needs here: https://github.com/ktkk/hdrihaven-downloader. It uses the technique of looping through a list of all available files as proposed in the comments.
I have found that the requests module, as well as urllib, is extremely slow compared to native downloading from e.g. Chrome. If anyone has an idea as to how I can speed these up, please let me know.
There are 2 ways you can do this:
You can use a URL to fetch a listing of all the files and iterate through it in a loop to download them individually. This of course only works if such a URL exists.
You can pass in individual URL to a function that can download them in parallel/bulk.
For example:
import os
import requests
from time import time
from multiprocessing.pool import ThreadPool

def url_response(url):
    path, url = url
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-user-group/816/")]

start = time()
for x in urls:
    url_response(x)
print(f"Time to download: {time() - start}")
This code snippet is taken from here Download multiple files (Parallel/bulk download). Read on there for more information on how you can do this.
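Note that the snippet above downloads the files one after another even though it imports ThreadPool; a hedged sketch of the parallel variant of the same idea could be:

import requests
from multiprocessing.pool import ThreadPool

def url_response(job):
    # Each job is a (path, url) tuple; stream the response straight to disk
    path, url = job
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-user-group/816/")]

# Four worker threads fetch the (path, url) pairs concurrently
with ThreadPool(4) as pool:
    pool.map(url_response, urls)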

Download File That Uses Download Attribute With Urllib

I am trying to parse a webpage and download a series of csv files that are in zip folders. When I click the link on the website, I can download it without any trouble. However, whenever I paste the URL into my browser (i.e. example.com/file.zip), I get a 400 Bad Request error. I am not sure, but I suspect this is because the link uses a download attribute.
The problem is that when I use urllib.request.urlretrieve to download the zip files, I cannot. My code is fairly simple:
# Look at a specific folder on my computer
# Compare the zip files in that folder to the zip files on the website
# Whatever is on the website, but not on my local machine,
# is added to a dictionary called remoteFiles
for remoteFile in remoteFiles:
    try:
        filename = ntpath.basename(remoteFile)
        urllib.request.urlretrieve(remoteFile, filename)
        print('finished downloading: ' + filename)
    except Exception as e:
        print('error with file: ' + filename)
        print(e)
Here is a PasteBin link to my full .py file. Whenever I run it I get this error:
HTTP Error 400: Bad Request
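One thing worth checking here (a guess, not a confirmed cause): urlretrieve sends a fairly bare request, and some servers answer 400 unless the URL is properly percent-encoded and a browser-like User-Agent is present. A hedged sketch of fetching one remote file that way:

import urllib.parse
import urllib.request

def fetch(url, filename):
    # Percent-encode unsafe characters in the path/query while keeping the URL structure
    safe_url = urllib.parse.quote(url, safe=':/?&=')
    req = urllib.request.Request(safe_url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp, open(filename, 'wb') as f:
        f.write(resp.read())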

How to extract images from a PDF in pure Python?

I'm developing a service in which I now need to extract images from a PDF file. From a Linux command line I can extract images using the Poppler library like this:
pdfimages my_file.pdf /tmp/image
Since I'm using the Python Flask framework and I want to run my service on Heroku I want to extract the images using pure Python (or any library that can run on Heroku in a Flask system).
So does anybody know how I can extract images from pdf in pure Python? I prefer open source solutions, but I'm willing to pay for it if needed (as long as it works under my own control on Heroku).
import minecart
import os
from NumberOfPages import getPageNumber

def extractImages(filename):
    # making a new directory if it doesn't exist
    new_dir_name = filename[:-4]
    if not os.path.exists(new_dir_name):
        os.makedirs(new_dir_name + '/images')
        os.makedirs(new_dir_name + '/text')
    # open the target file
    pdf_file = open(filename, 'rb')
    # parse the document through the minecart.Document function
    doc = minecart.Document(pdf_file)
    # getting the number of pages in the pdf file
    num_pages = getPageNumber(filename)
    # getting the list of all the pages
    page = doc.get_page(num_pages)
    count = 0
    for page in doc.iter_pages():
        for i in range(len(page.images)):
            try:
                im = page.images[i].as_pil()  # requires pillow
                name = new_dir_name + '/images/image_' + str(count) + '.jpg'
                count = count + 1
                im.save(name)
            except:
                print('Error encountered at %s' % filename)
    doc_name = new_dir_name + '/images/info.txt'
    with open(doc_name, 'a') as x:
        print(x.write('Number of images in document: {}'.format(count)))
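For completeness, a hypothetical call (assuming my_file.pdf sits in the working directory) would be:

# Writes extracted images to my_file/images/ and an image count to my_file/images/info.txt
extractImages('my_file.pdf')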

Python equivalent of a given wget command

I'm trying to create a Python function that does the same thing as this wget command:
wget -c --read-timeout=5 --tries=0 "$URL"
-c - Continue from where you left off if the download is interrupted.
--read-timeout=5 - If there is no new data coming in for over 5 seconds, give up and try again. Given -c, this means it will try again from where it left off.
--tries=0 - Retry forever.
Those three arguments used in tandem result in a download that cannot fail.
I want to duplicate those features in my Python script, but I don't know where to begin...
There is also a nice Python module named wget that is pretty easy to use. Keep in mind that the package has not been updated since 2015 and has not implemented a number of important features, so it may be better to use other methods. It depends entirely on your use case. For simple downloading, this module is the ticket. If you need to do more, there are other solutions out there.
>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532
>>> filename
'razorback.mp3'
Enjoy.
However, if wget doesn't work (I've had trouble with certain PDF files), try this solution.
Edit: You can also use the out parameter to use a custom output directory instead of current working directory.
>>> output_directory = <directory_name>
>>> filename = wget.download(url, out=output_directory)
>>> filename
'razorback.mp3'
urllib.request should work.
Just set it up in a while (not done) loop: check whether a local file already exists, and if it does, send a GET with a Range header specifying how far you got in downloading the local file.
Be sure to use read() to append to the local file until an error occurs.
This is also potentially a duplicate of Python urllib2 resume download doesn't work when network reconnects
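A rough sketch of that idea with urllib.request (untested; the URL is a placeholder, and it assumes the server honours Range requests):

import os
import urllib.error
import urllib.request

url = "http://example.com/bigfile.bin"   # placeholder
local = "bigfile.bin"
done = False

while not done:
    # Resume from however many bytes are already on disk
    offset = os.path.getsize(local) if os.path.exists(local) else 0
    req = urllib.request.Request(url, headers={"Range": "bytes={}-".format(offset)})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp, open(local, "ab") as f:
            while True:
                chunk = resp.read(8192)
                if not chunk:            # empty read: the download is complete
                    done = True
                    break
                f.write(chunk)
    except urllib.error.HTTPError as e:
        if e.code == 416:                # Range not satisfiable: file already complete
            done = True
    except OSError:
        pass                             # timeout or connection drop: loop and resume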
I had to do something like this on a version of linux that didn't have the right options compiled into wget. This example is for downloading the memory analysis tool 'guppy'. I'm not sure if it's important or not, but I kept the target file's name the same as the url target name...
Here's what I came up with:
python -c "import requests; r = requests.get('https://pypi.python.org/packages/source/g/guppy/guppy-0.1.10.tar.gz') ; open('guppy-0.1.10.tar.gz' , 'wb').write(r.content)"
That's the one-liner; here it is a little more readable:
import requests
fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)
This worked for downloading a tarball; I was able to extract and use the package after downloading it.
EDIT:
To address a question, here is an implementation with a progress bar printed to STDOUT. There is probably a more portable way to do this without the clint package, but this was tested on my machine and works fine:
#!/usr/bin/env python
from clint.textui import progress
import requests

fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname

r = requests.get(url, stream=True)
with open(fname, 'wb') as f:
    total_length = int(r.headers.get('content-length'))
    for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
        if chunk:
            f.write(chunk)
            f.flush()
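One caveat with the snippet above (a general observation, not specific to this URL): if the server sends no Content-Length header, r.headers.get('content-length') returns None and the int() call fails. A small guarded variant:

import requests
from clint.textui import progress

fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname

r = requests.get(url, stream=True)
with open(fname, 'wb') as f:
    total_length = r.headers.get('content-length')
    if total_length is None:
        # No length advertised: stream to disk without a progress bar
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    else:
        for chunk in progress.bar(r.iter_content(chunk_size=1024),
                                  expected_size=(int(total_length) / 1024) + 1):
            if chunk:
                f.write(chunk)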
A solution that I often find simpler and more robust is to simply execute a terminal command within python. In your case:
import os
url = 'https://www.someurl.com'
os.system(f'wget -c --read-timeout=5 --tries=0 "{url}"')
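If the nested shell quoting is a concern (it is easy to get wrong), a hedged alternative is to skip the shell entirely with subprocess:

import subprocess

url = 'https://www.someurl.com'
# Passing an argument list avoids shell quoting issues altogether
subprocess.run(["wget", "-c", "--read-timeout=5", "--tries=0", url], check=True)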
import urllib2
import time

max_attempts = 80
attempts = 0
sleeptime = 10  # in seconds, no reason to continuously try if network is down

# while True:  # possibly dangerous
while attempts < max_attempts:
    time.sleep(sleeptime)
    try:
        response = urllib2.urlopen("http://example.com", timeout=5)
        content = response.read()
        f = open("local/index.html", 'w')
        f.write(content)
        f.close()
        break
    except urllib2.URLError as e:
        attempts += 1
        print type(e)
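The snippet above is Python 2 (urllib2 and the print statement). A hedged Python 3 translation of the same retry loop would be:

import time
import urllib.error
import urllib.request

max_attempts = 80
attempts = 0
sleeptime = 10  # in seconds, no reason to continuously retry if the network is down

while attempts < max_attempts:
    time.sleep(sleeptime)
    try:
        response = urllib.request.urlopen("http://example.com", timeout=5)
        content = response.read()
        with open("local/index.html", 'wb') as f:   # response.read() returns bytes
            f.write(content)
        break
    except urllib.error.URLError as e:
        attempts += 1
        print(type(e))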
For Windows and Python 3.x, my two cents about renaming the file on download:
Install the wget module: pip install wget
Use wget:
import wget
wget.download('Url', 'C:\\PathToMyDownloadFolder\\NewFileName.extension')
A working command-line example:
python -c "import wget; wget.download(""https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz"", ""C:\\Users\\TestName.TestExtension"")"
Note: 'C:\\PathToMyDownloadFolder\\NewFileName.extension' is not mandatory. By default the file is not renamed, and the download folder is your local path.
Here's the code adapted from the torchvision library:
import os
import urllib.error
import urllib.request

def download_url(url, root, filename=None):
    """Download a file from a url and place it in root.

    Args:
        url (str): URL to download file from
        root (str): Directory to place downloaded file in
        filename (str, optional): Name to save the file under. If None, use the basename of the URL
    """
    root = os.path.expanduser(root)
    if not filename:
        filename = os.path.basename(url)
    fpath = os.path.join(root, filename)
    os.makedirs(root, exist_ok=True)
    try:
        print('Downloading ' + url + ' to ' + fpath)
        urllib.request.urlretrieve(url, fpath)
    except (urllib.error.URLError, IOError) as e:
        if url[:5] == 'https':
            url = url.replace('https:', 'http:')
            print('Failed download. Trying https -> http instead.'
                  ' Downloading ' + url + ' to ' + fpath)
            urllib.request.urlretrieve(url, fpath)
If you are OK with taking a dependency on the torchvision library, then you can also simply do:
from torchvision.datasets.utils import download_url
download_url('http://something.com/file.zip', '~/my_folder')
Let me improve the example with threads, in case you want to download many files.
import math
import random
import threading

import requests
from clint.textui import progress

# You must define a proxy list
# I suggest https://free-proxy-list.net/
proxies = {
    0: {'http': 'http://34.208.47.183:80'},
    1: {'http': 'http://40.69.191.149:3128'},
    2: {'http': 'http://104.154.205.214:1080'},
    3: {'http': 'http://52.11.190.64:3128'}
}

# You must define the list of files you want to download
videos = [
    "https://i.stack.imgur.com/g2BHi.jpg",
    "https://i.stack.imgur.com/NURaP.jpg"
]

downloaderses = list()

def downloaders(video, selected_proxy):
    print("Downloading file named {} by proxy {}...".format(video, selected_proxy))
    r = requests.get(video, stream=True, proxies=selected_proxy)
    nombre_video = video.split("/")[3]
    with open(nombre_video, 'wb') as f:
        total_length = int(r.headers.get('content-length'))
        for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
            if chunk:
                f.write(chunk)
                f.flush()

for video in videos:
    selected_proxy = proxies[math.floor(random.random() * len(proxies))]
    t = threading.Thread(target=downloaders, args=(video, selected_proxy))
    downloaderses.append(t)

for _downloaders in downloaderses:
    _downloaders.start()
easy as py:
class Downloder():
    def download_manager(self, url, destination='Files/DownloderApp/', try_number="10", time_out="60"):
        # threading.Thread(target=self._wget_dl, args=(url, destination, try_number, time_out)).start()
        if self._wget_dl(url, destination, try_number, time_out) == 0:
            return True
        else:
            return False

    def _wget_dl(self, url, destination, try_number, time_out):
        import subprocess
        command = ["wget", "-c", "-P", destination, "-t", try_number, "-T", time_out, url]
        try:
            download_state = subprocess.call(command)
        except Exception as e:
            print(e)
        # if download_state == 0 => successful download
        return download_state
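A hypothetical call (the URL is a placeholder) would be:

# Returns True when wget exits with status 0
ok = Downloder().download_manager("https://example.com/file.zip")
print(ok)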
TensorFlow makes life easier: tf.keras.utils.get_file returns the path of the downloaded file.
import tensorflow as tf

file_path = tf.keras.utils.get_file(origin='https://storage.googleapis.com/tf-datasets/titanic/train.csv',
                                    fname='train.csv',
                                    untar=False, extract=False)
print(file_path)
