Multi-threading for downloading NCBI files in Python

I have recently taken on the task of downloading a large collection of files from the NCBI database, and I sometimes have to create multiple databases. The code below downloads all the viruses from the NCBI website. My question: is there any way to speed up the process of downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multi-threading but could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (I am new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I sometimes get this with certain combinations of retstart and retmax. It crashes the program, and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup
#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0,Count,retmax):
    #Create the position string
    position = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + position + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax

To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen
def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename
def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None
def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))
if __name__ == "__main__":
    main()
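The generate_urls placeholder above is deliberately left open. As a rough sketch only, here is one way the (url, filename) pairs for the efetch loop from the question might be produced, written in Python 3 syntax. The helper name efetch_batches, its parameters and the one-file-per-batch naming scheme are assumptions made for illustration; they are not part of the original answer.

```python
from urllib.parse import urlencode

BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'  # base URL from the question

def efetch_batches(count, retmax, database, webenv, query_key,
                   rettype='fasta', retmode='text', prefix='batch'):
    """Yield (url, filename) pairs, one per retstart/retmax window.

    Hypothetical helper: the names and the file naming scheme are assumptions.
    """
    for retstart in range(0, count, retmax):
        params = urlencode({
            'db': database, 'WebEnv': webenv, 'query_key': query_key,
            'rettype': rettype, 'retmode': retmode,
            'retstart': retstart, 'retmax': retmax,
        })
        url = BASE + 'efetch.fcgi?' + params
        # One file per batch, so that threads never write to the same file.
        filename = '{}_{:09d}.fasta'.format(prefix, retstart)
        yield url, filename
```

Because each batch goes to its own zero-padded file, the pieces can be concatenated in retstart order afterwards. NCBI publishes usage guidelines that limit request rates, so a pool much smaller than 20 may be the safer choice.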

You should use multithreading; it is well suited to download tasks.
"these files take more than 10 seconds to download and I do not know how to handle stalling"
I don't think this will be a problem, because multithreading is meant for exactly this kind of I/O-bound work. While one thread is blocked waiting for its download, CPython releases the global interpreter lock, so the other threads can carry on with their work.
Anyway, you should at least try it and see what happens.
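To make the point about I/O-bound work concrete, here is a small sketch in Python 3 that uses time.sleep as a stand-in for a slow download, so it runs without any network access. The function names and the 0.2 second delay are made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(n, delay=0.2):
    """Stand-in for a blocking network call; sleeps instead of downloading."""
    time.sleep(delay)
    return n

def run_sequential(items):
    """Run the fake downloads one after another."""
    return [fake_download(n) for n in items]

def run_threaded(items, workers=8):
    """Run the fake downloads in a thread pool; map() keeps input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_download, items))

def timed(func, *args):
    """Return (result, elapsed seconds) for func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling timed(run_sequential, list(range(8))) and timed(run_threaded, list(range(8))) should show the threaded run finishing in roughly the time of one download rather than eight. For real downloads, stalling can be bounded by passing a timeout, for example urlopen(url, timeout=30), so that a hung connection raises an error instead of blocking its thread forever.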

There are two other ways to approach your task: 1. Use processes instead of threads; multiprocessing is the module to use. 2. Use an event-based approach; gevent is a suitable module.
The 502 error is not your script's fault. A simple pattern like the following can be used to retry:
try_count = 3
while try_count > 0:
    try:
        download_task()
        break  # success, stop retrying
    except urllib2.HTTPError:
        clean_environment_for_retry()
        try_count -= 1
In the except clause, you can refine the handling to do particular things depending on the specific HTTP status code.
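As a sketch of what refining by status code could look like, the Python 3 version below retries only errors that are plausibly transient and waits a little longer before each new attempt. The set of retryable codes, the function names and the doubling delay are assumptions, not something the NCBI service prescribes.

```python
import time
import urllib.error
import urllib.request

RETRYABLE = {500, 502, 503, 504}  # assumption: treat these codes as transient

def fetch(url, timeout=30):
    """Download url and return the body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def fetch_with_retry(url, tries=3, delay=1.0, fetcher=fetch, sleep=time.sleep):
    """Call fetcher(url), retrying transient HTTP errors with growing delays."""
    for attempt in range(1, tries + 1):
        try:
            return fetcher(url)
        except urllib.error.HTTPError as err:
            # Give up on the last attempt, or on errors a retry will not fix.
            if attempt == tries or err.code not in RETRYABLE:
                raise
            sleep(delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x, ... the delay
```

With this shape, a 502 on one retstart/retmax window no longer kills the whole run, while a 404 or 400 still surfaces immediately. The fetcher and sleep parameters exist only so the function can be exercised without a network.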

Related

Python -> Get all valid media download url's from a webfolder

I have a website with a link structure like this:
https://example.com/assets/contents/1627347928.mp4
https://example.com/assets/contents/1627342345.mp4
https://example.com/assets/contents/1627215324.mp4
I want to use Python to get all the links to download. When I access the folder /assets/contents/ I get a 404 error, so I can't list all the media in this web folder, but I know that all the MP4 file names have 10 characters and all of them start with "1627" ("1627******.mp4").
Can I write a loop to check all the links on that website and get all the valid ones? Thanks! I am new to Python.
With the code below I can check whether an MP4 exists by looking at the headers of a file, but how do I make a loop that checks all the links and downloads them automatically, or just shows me the valid links?
import requests
link = 'https://example.com/assets/contents/1627347923.mp4'
r = requests.get(link, stream=True)
print(r.headers)
Print whether the file exists or not:
import requests
names = [ 1627347923, 1627347924, 1627347925]
base = 'https://example.com/assets/contents/{}.mp4'
for item in names:
    link = base.format(item)
    print(link)
    r = requests.head(link, allow_redirects=True)
    if r.status_code == 200:
        print("found {}.mp4".format(item))
        #open('{}.mp4'.format(item), 'wb').write(r.content)
    else:
        print("File not found or error getting headers")
Or try to download it:
import requests
names = [ 1627347923, 1627347924, 1627347925]
base = 'https://example.com/assets/contents/{}.mp4'
for item in names:
    link = base.format(item)
    print(link)
    # uncomment below to download
    #r = requests.get(link, allow_redirects=True)
    #open('{}.mp4'.format(item), 'wb').write(r.content)
Yes, you can run a loop and check the status code (or catch the error if requests.get() raises one) to find all the files, but there are some problems that might stop you from choosing that approach:
Your files are in the format "1627******.mp4", which means a for loop would have to check 10^6 entries if all the * are digits, which is not efficient. If you plan to include letters and special characters, it would be highly inefficient.
What if in the future you have more than 10^6 files? Your format will have to change, and so will your code.
A much simpler, more straightforward and more efficient solution would be to have a place to store your data, a file or better a database, which you can query to get all your files and the necessary details.
Also, a 404 error means the page you are trying to reach was not found; in your case, it essentially means it doesn't exist.
Sample code to check whether a link exists:
import requests
files = []
links = ["https://www.youtube.com/","https://docs.python.org","https://qewrt.org"]
for i in links:
    try:
        # requests.get() raises only on connection errors; raise_for_status()
        # also raises on 4xx/5xx responses such as 404. Otherwise the link is kept.
        requests.get(i).raise_for_status()
        files.append(i)
    except requests.exceptions.RequestException:
        print(i+" doesnt exist")
print(files)
Building on this, and based on your condition, here is a check of whether each file in the given format exists:
import requests
file_prefix = 'https://example.com/assets/contents/1627'
file_lists = []
for i in range(10**6):
    suffix = (6-len(str(i)))*"0"+str(i)+".mp4"
    file_name = file_prefix+suffix
    try:
        # raise_for_status() turns a 404 into an exception, so missing files are skipped
        requests.get(file_name).raise_for_status()
        file_lists.append(file_name)
    except requests.exceptions.RequestException:
        continue
for i in file_lists:
    print(i)
Based on all your code and LMC's code, I wrote something that tests all the MP4 files and shows me the headers. How can I pick only the links that point to a valid MP4 file, like the link above?
import requests
file_prefix = 'https://example.com/assets/contents/1627'
file_lists = []
for i in range(10**6):
    suffix = (6-len(str(i)))*"0"+str(i)+".mp4"
    file_name = file_prefix+suffix
    try:
        requests.get(file_name)
        file_lists.append(file_name)
        r = requests.get(file_name, stream=True)
        print(file_name)
        print(r.headers)
    except:
        continue
for i in file_lists:
    print(i)
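One possible way to keep only the valid links is to send a HEAD request and look at both the status code and the Content-Type header. The sketch below uses only the standard library (Python 3) and rests on an assumption about the server: that real files are served with a Content-Type starting with video/mp4, while missing ones return 404 or an HTML page. All function names here are made up for the example.

```python
import urllib.error
import urllib.request

def candidate_urls(prefix, start, stop):
    """Yield prefix + zero-padded 6-digit number + '.mp4' for start <= n < stop."""
    for n in range(start, stop):
        yield '{}{:06d}.mp4'.format(prefix, n)

def looks_like_mp4(status, content_type):
    """True for a 200 response whose Content-Type header says video/mp4."""
    return status == 200 and (content_type or '').lower().startswith('video/mp4')

def head(url, timeout=10):
    """Send a HEAD request; return (status, content_type), or (None, None) on failure."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get('Content-Type')
    except urllib.error.HTTPError as err:
        return err.code, None      # e.g. 404: the server answered, file is missing
    except OSError:
        return None, None          # connection problems, timeouts, DNS failures

def find_valid(urls, head_func=head):
    """Return only the URLs whose HEAD response looks like an MP4 file."""
    return [url for url in urls if looks_like_mp4(*head_func(url))]
```

For instance, find_valid(candidate_urls('https://example.com/assets/contents/1627', 0, 1000)) would probe the first thousand candidates. A HEAD request avoids downloading the video body, but a million sequential requests will still take a long time.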

is there a way to download multiple files using the requests module

I want to download multiple .hdr files from a website called hdrihaven.com.
My knowledge of python is not that great but here is what I have tried so far:
import requests
url = 'https://hdrihaven.com/files/hdris/'
resolution = '4k'
file = 'pump_station' #would need to be every file
url_2k = url + file + '_' + resolution + '.hdr'
print(url_2k)
r = requests.get(url_2k, allow_redirects=True)
open(file + resolution + '.hdr', 'wb').write(r.content)
Ideally, file would just loop over every file in the directory.
Thanks for your answers in advance!
EDIT
I found a script on github that does what I need: https://github.com/Alzy/hdrihaven_dl. I edited it to fit my needs here: https://github.com/ktkk/hdrihaven-downloader. It uses the technique of looping through a list of all available files as proposed in the comments.
I have found that the requests module as well as urllib are extremely slow compared to downloading natively from, e.g., Chrome. If anyone has an idea as to how I can speed these up, please let me know.
There are 2 ways you can do this:
You can use a URL to fetch a list of all the files and loop over it to download them individually. This of course only works if such a URL exists.
You can pass individual URLs to a function that downloads them in parallel/bulk.
For example:
import os
import requests
from time import time
from multiprocessing.pool import ThreadPool
def url_response(url):
    path, url = url
    r = requests.get(url, stream = True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)
urls = [("Event1", "https://www.python.org/events/python-events/805/"),("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-user-group/816/")]
start = time()
for x in urls:
    url_response (x)
print(f"Time to download: {time() - start}")
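This code snippet is taken from the article "Download multiple files (Parallel/bulk download)"; read on there for more information on how you can do this. Note that the loop in this excerpt still downloads the files one at a time: ThreadPool is imported but not yet used.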
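A parallel variant could look roughly like the following, which hands the (path, url) pairs to a ThreadPool instead of looping over them. It is a sketch using the standard library in place of requests; the function names and the default of 4 workers are arbitrary choices for the example.

```python
import urllib.request
from multiprocessing.pool import ThreadPool

def fetch_bytes(url):
    """Download url and return the body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()

def save(job, fetcher=fetch_bytes):
    """Download one (path, url) pair and return the path written."""
    path, url = job
    data = fetcher(url)
    with open(path, 'wb') as f:
        f.write(data)
    return path

def download_all(jobs, workers=4, fetcher=fetch_bytes):
    """Download every (path, url) pair using a pool of worker threads."""
    with ThreadPool(workers) as pool:
        # imap_unordered yields each result as soon as its download finishes.
        return sorted(pool.imap_unordered(lambda job: save(job, fetcher), jobs))
```

Calling download_all(urls) with the urls list from the snippet above would fetch the three pages concurrently. The speed-up depends on the server and the connection; it will not make a single large file download faster.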

Best way to create a download link for a file in Flask?

In my project, when a user clicks a link, an AJAX request sends the information required to create a CSV. The CSV takes a long time to generate and so I want to be able to include a download link for the generated CSV in the AJAX response. Is this possible?
Most of the answers I've seen return the CSV in the following way:
return Response(
csv,
mimetype="text/csv",
headers={"Content-disposition":
"attachment; filename=myplot.csv"})
However, I don't think this is compatible with the AJAX response I'm sending with:
return render_json(200, {'data': params})
Ideally, I'd like to be able to send the download link in the params dict. But I'm also not sure if this is secure. How is this problem typically solved?
I think one solution may be the futures library (pip install futures). The first endpoint can queue up the task and send the file name back, and then another endpoint can be used to retrieve the file. I also included gzip because it might be a good idea if you are sending larger files. I think more robust solutions use Celery or RabbitMQ or something along those lines. However, this is a simple solution that should accomplish what you are asking for.
from flask import Flask, jsonify, Response
from uuid import uuid4
from concurrent.futures import ThreadPoolExecutor
import time
import os
import gzip
app = Flask(__name__)
# Global variables used by the thread executor, and the thread executor itself
NUM_THREADS = 5
EXECUTOR = ThreadPoolExecutor(NUM_THREADS)
OUTPUT_DIR = os.path.dirname(os.path.abspath(__file__))
# this is your long running processing function
# takes in your arguments from the /queue-task endpoint
def a_long_running_task(*args):
    time_to_wait, output_file_name = int(args[0][0]), args[0][1]
    output_string = 'sleeping for {0} seconds. File: {1}'.format(time_to_wait, output_file_name)
    print(output_string)
    time.sleep(time_to_wait)
    filename = os.path.join(OUTPUT_DIR, output_file_name)
    # here we are writing to a gzipped file to save space and decrease size of file to be sent on network
    with gzip.open(filename, 'wb') as f:
        # binary mode expects bytes, so encode the text first
        f.write(output_string.encode('utf-8'))
    print('finished writing {0} after {1} seconds'.format(output_file_name, time_to_wait))
# This is a route that starts the task and then gives them the file name for reference
@app.route('/queue-task/<wait>')
def queue_task(wait):
    output_file_name = str(uuid4()) + '.csv'
    EXECUTOR.submit(a_long_running_task, [wait, output_file_name])
    return jsonify({'filename': output_file_name})
# this takes the file name and returns if exists, otherwise notifies it is not yet done
@app.route('/getfile/<name>')
def get_output_file(name):
    file_name = os.path.join(OUTPUT_DIR, name)
    if not os.path.isfile(file_name):
        return jsonify({"message": "still processing"})
    # read without gzip.open to keep it compressed
    with open(file_name, 'rb') as f:
        resp = Response(f.read())
    # set headers to tell encoding and to send as an attachment
    resp.headers["Content-Encoding"] = 'gzip'
    resp.headers["Content-Disposition"] = "attachment; filename={0}".format(name)
    resp.headers["Content-type"] = "text/csv"
    return resp
if __name__ == '__main__':
    app.run()
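On the security concern raised in the question: since /getfile/<name> serves whatever file name it is given from OUTPUT_DIR (which here also contains the application source), it may be worth accepting only names that the queue endpoint could have generated. The check below is a minimal sketch of that idea; is_job_filename is a hypothetical helper, not part of Flask or of the answer above.

```python
import uuid

def is_job_filename(name):
    """Return True only for names of the form '<uuid4>.csv'."""
    stem, dot, extension = name.rpartition('.')
    if dot != '.' or extension != 'csv':
        return False
    try:
        parsed = uuid.UUID(stem)
    except ValueError:
        return False
    # Require the canonical hyphenated form so that no other spelling
    # of the same UUID (braces, urn: prefix, upper case) is accepted.
    return str(parsed) == stem and parsed.version == 4
```

In get_output_file, a request whose name fails this check could be answered with a 404 before the file system is touched. Random UUID names are also hard to guess, which limits who can fetch a given CSV, though that is not a substitute for real access control.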

download files from the internet using urllib issue

I am trying to download data files from a website using urllib.
My code is
import urllib
url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list=['4260514','4260512','4260519']
parameter_list=['ecrec','ecday','flowrec','flowcday']
for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common+'A'+site+'/'+'a'+site+'_'+parameter+'.zip'
            urllib.urlretrieve(url,'A'+site+'_'+parameter+'.zip')
        except ValueError:
            break
My issue is that some sites do not have all the parameter files. For example, with my code, site 1 doesn't have flowcday, but Python still creates the zip file with no content. How can I stop Python from creating these files if there's no data?
Many thanks,
I think urllib2.urlopen may be more suitable in this situation. It raises HTTPError for a missing file before anything is written to disk.
import urllib2
from urllib2 import HTTPError
url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list=['4260514','4260512','4260519']
parameter_list=['ecrec','ecday','flowrec','flowcday']
for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common+'A'+site+'/'+'a'+site+'_'+parameter+'.zip'
            name ='A'+site+'_'+parameter+'.zip'
            req = urllib2.urlopen(url)
            with open(name,'wb') as fh:
                fh.write(req.read())
        except HTTPError as e:
            # HTTPError (unlike the more general URLError) carries the status code
            if e.code==404:
                print name + ' not found. moving on...'
                pass
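For readers on Python 3, where urllib2 has become urllib.request and urllib.error, the same idea might be written as below. This is a sketch rather than a drop-in replacement: download_if_exists and read_url are names invented for the example, and unlike the loop above it re-raises HTTP errors other than 404 instead of ignoring them.

```python
import urllib.error
import urllib.request

def read_url(url, timeout=30):
    """Download url and return the body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def download_if_exists(url, path, reader=read_url):
    """Save url to path; create no file if the server answers 404.

    Returns True if the file was written, False if it was missing.
    """
    try:
        data = reader(url)  # the request is made before any file is opened
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors are not silently ignored
    with open(path, 'wb') as fh:
        fh.write(data)
    return True
```

Because the local file is opened only after the download has succeeded, a missing parameter file leaves nothing behind on disk.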

Access a local file, but ensure it is up-to-date

How can I use the Python standard library to get a file object, silently ensuring it's up-to-date from some other location?
A program I'm working on needs to access a set of files locally; they're
just normal files.
But those files are local cached copies of documents available at remote
URLs — each file has a canonical URL for that file's content.
(I write here about HTTP URLs, but I'm looking for a solution that isn't specific to any particular remote fetching protocol.)
I'd like an API for ‘get_file_from_cache’ that looks something like:
file_urls = {
    "/path/to/foo.txt": "http://example.org/spam/",
    "other/path/bar.data": "https://example.net/beans/flonk.xml",
}
for (filename, url) in file_urls.items():
    infile = get_file_from_cache(filename, canonical=url)
    do_stuff_with(infile.read())
If the local file's modification timestamp is not significantly
earlier than the Last-Modified timestamp for the document at the
corresponding URL, get_file_from_cache just returns the file object
without changing the file.
The local file might be out of date (its modification timestamp may be
significantly older than the Last-Modified timestamp from the
corresponding URL). In that case, get_file_from_cache should first
read the document's contents into the file, then return the file
object.
The local file may not yet exist. In that case, get_file_from_cache
should first read the document content from the corresponding URL,
create the local file, and then return the file object.
The remote URL may not be available for some reason. In that case,
get_file_from_cache should simply return the file object, or if that
can't be done, raise an error.
So this is something similar to an HTTP object cache. Except where those
are usually URL-focussed with the local files a hidden implementation
detail, I want an API that focusses on the local files, with the remote
requests a hidden implementation detail.
Does anything like this exist in the Python library, or as simple code
using it? With or without the specifics of HTTP and URLs, is there some
generic caching recipe already implemented with the standard library?
This local file cache (ignoring the specifics of URLs and network access)
seems like exactly the kind of thing that is easy to get wrong in
countless ways, and so should have a single obvious implementation
available.
Am I in luck? What do you advise?
From a quick Googling I couldn't find an existing library that can do this, although I'd be surprised if there weren't such a thing. :)
Anyway, here's one way to do it using the popular Requests module. It'd be pretty easy to adapt this code to use urllib / urllib2, though.
#! /usr/bin/env python
''' Download a file if it doesn't yet exist in offline cache, or if the online
    version is more than age seconds newer than the cached version.
    Example code for
    http://stackoverflow.com/questions/26436641/access-a-local-file-but-ensure-it-is-up-to-date
    Written by PM 2Ring 2014.10.18
'''
import sys
import os
import email.utils
import requests
cache_path = 'offline_cache'
#Translate local file names in cache_path to URLs
file_urls = {
    'example1.html': 'http://www.example.com/',
    'badfile': 'http://httpbin.org/status/404',
    'example2.html': 'http://www.example.org/index.html',
}
def get_headers(url):
    resp = requests.head(url)
    print "Status: %d" % resp.status_code
    resp.raise_for_status()
    for k,v in resp.headers.items():
        print '%-16s : %s' % (k, v)
def get_url_mtime(url):
    ''' Get last modified time of an online file from the headers
        and convert to a timestamp
    '''
    resp = requests.head(url)
    resp.raise_for_status()
    t = email.utils.parsedate_tz(resp.headers['last-modified'])
    return email.utils.mktime_tz(t)
def download(url, fname):
    ''' Download url to fname, setting mtime of file to match url '''
    print >>sys.stderr, "Downloading '%s' to '%s'" % (url, fname)
    resp = requests.get(url)
    #print "Status: %d" % resp.status_code
    resp.raise_for_status()
    t = email.utils.parsedate_tz(resp.headers['last-modified'])
    timestamp = email.utils.mktime_tz(t)
    #print 'last-modified', timestamp
    with open(fname, 'wb') as f:
        f.write(resp.content)
    os.utime(fname, (timestamp, timestamp))
def open_cached(basename, mode='r', age=0):
    ''' Open a cached file.
        Download it if it doesn't yet exist in cache, or if the online
        version is more than age seconds newer than the cached version.'''
    fname = os.path.join(cache_path, basename)
    url = file_urls[basename]
    #print fname, url
    if os.path.exists(fname):
        #Check if online version is sufficiently newer than offline version
        file_mtime = os.path.getmtime(fname)
        url_mtime = get_url_mtime(url)
        if url_mtime > age + file_mtime:
            download(url, fname)
    else:
        download(url, fname)
    return open(fname, mode)
def main():
    for fname in ('example1.html', 'badfile', 'example2.html'):
        print fname
        try:
            with open_cached(fname, 'r') as f:
                for i, line in enumerate(f, 1):
                    print "%3d: %s" % (i, line.rstrip())
        except requests.exceptions.HTTPError, e:
            print >>sys.stderr, "%s '%s' = '%s'" % (e, file_urls[fname], fname)
        print
if __name__ == "__main__":
    main()
Of course, for real-world use you should add some proper error checking.
You may notice that I've defined a function get_headers(url) which never gets called; I used it during development & figured it might come in handy when expanding this program, so I left it in. :)
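One case from the question that the program above does not cover is the remote URL being unavailable while a cached copy exists. The refresh decision could be separated out roughly as follows (Python 3 syntax). This is only a sketch: the function names are invented here, and treating a missing remote timestamp as "use the cache" is one reading of the requirement, not the only possible one.

```python
import email.utils
import os

def parse_last_modified(header_value):
    """Convert an HTTP Last-Modified header to a POSIX timestamp."""
    return email.utils.parsedate_to_datetime(header_value).timestamp()

def needs_download(local_mtime, remote_mtime, age=0):
    """Decide whether the cached copy should be refreshed.

    local_mtime is None when the file is not cached yet; remote_mtime is
    None when the remote timestamp could not be obtained.
    """
    if local_mtime is None:
        return True            # nothing cached: must download
    if remote_mtime is None:
        return False           # remote unavailable: fall back to the cache
    return remote_mtime > local_mtime + age

def local_mtime_or_none(path):
    """Return the file's modification time, or None if it does not exist."""
    return os.path.getmtime(path) if os.path.exists(path) else None
```

A caller would wrap the HEAD request in a try/except, pass None as remote_mtime when it fails, and only then decide whether to call download. When neither a cached file nor the remote document is available, the download attempt itself raises, which matches the "raise an error" case in the question.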
