Python: Download multiple .gz files from single URL

I am having trouble downloading multiple network files from an online directory. I am using a virtual Linux environment (Lubuntu) in VMware.
My aim is to access a subfolder and download all the .gz files it contains into a new local directory that is different from the home directory. I tried multiple solutions, and this is the closest I got:
import os
from urllib2 import urlopen, URLError, HTTPError

def dlfile(url):
    # Open the url
    try:
        f = urlopen(url)
        print "downloading " + url

        # Open our local file for writing
        with open(os.path.basename(url), "wb") as local_file:
            local_file.write(f.read())

    # handle errors
    except HTTPError, e:
        print "HTTP Error:", e.code, url
    except URLError, e:
        print "URL Error:", e.reason, url

def main():
    # Iterate over image ranges
    for index in range(100, 250, 5):
        url = ("http://data.ris.ripe.net/rrc00/2016.01/updates20160128.0%d.gz"
               % (index))
        dlfile(url)

if __name__ == '__main__':
    main()
The online directory requires no authentication; a link can be found here.
I tried string manipulation and a loop over the filenames, but it gave me the following error:
HTTP Error: 404 http://data.ris.ripe.net/rrc00/2016.01/updates20160128.0245.gz

Look at the URL.
Good URL: http://data.ris.ripe.net/rrc00/2016.01/updates.20160128.0245.gz
Bad URL (your code): http://data.ris.ripe.net/rrc00/2016.01/updates20160128.0245.gz
The dot between "updates" and the date is missing.
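For what it's worth, here is a minimal sketch of how the corrected URLs could be built, and how the files could be saved into a separate local directory as the question asks. It assumes the files follow the updates.YYYYMMDD.HHMM.gz naming at 5-minute intervals, reuses the dlfile() function from the question, and the target directory path is made up:

import os

target_dir = os.path.expanduser("~/ris-updates")   # hypothetical target directory
if not os.path.isdir(target_dir):
    os.makedirs(target_dir)
os.chdir(target_dir)   # dlfile() writes into the current working directory

base = "http://data.ris.ripe.net/rrc00/2016.01/"
for hour in range(1, 3):                 # 01:00 .. 02:55, adjust as needed
    for minute in range(0, 60, 5):
        # note the dot between "updates" and the date
        url = base + "updates.20160128.%02d%02d.gz" % (hour, minute)
        dlfile(url)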

Related

Download File That Uses Download Attribute With Urllib

I am trying to parse a webpage and download a series of CSV files that are in zip folders. When I click the link on the website, I can download it without any trouble. However, whenever I paste the URL into my browser (i.e. example.com/file.zip), I get a 400 Bad Request error. I am not sure, but I have deduced that the issue is caused by the link using a download attribute.
The problem now is that when I use urllib.request.urlretrieve to download the zip files, I cannot. My code is fairly simple:
# Look at a specific folder on my computer
# Compare the zip files in that folder to the zip files on the website
# Whatever is on the website but not on my local machine
# is added to a dictionary called remoteFiles

for remoteFile in remoteFiles:
    try:
        filename = ntpath.basename(remoteFile)
        urllib.request.urlretrieve(remoteFile, filename)
        print('finished downloading: ' + filename)
    except Exception as e:
        print('error with file: ' + filename)
        print(e)
Here is a PasteBin link to my full .py file. Whenever I run it I get this error:
HTTP Error 400: Bad Request
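One possible cause (an assumption, since the server isn't identified here) is that the site rejects requests that don't look like they come from a browser. A minimal sketch of downloading with an explicit User-Agent header via urllib.request instead of urlretrieve, reusing remoteFile and filename from the loop above:

import urllib.request

# Sketch only: send a browser-like User-Agent; whether this helps for this
# particular server is an assumption.
req = urllib.request.Request(
    remoteFile,
    headers={"User-Agent": "Mozilla/5.0"},
)
with urllib.request.urlopen(req) as response, open(filename, "wb") as out:
    out.write(response.read())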

Download file matching filename with largest number

I want to download, in Python, a file from an index directory whose name matches the pattern "foo_14_bar.dat", but only the one with the largest number in place of the 14.
For example, if there were several files in the directory:
"asdf_234_asdf.dat"
"foo_21_bar.dat"
"foo_9_bar.dat"
I would download the second file.
How can I accomplish this? It seems like beautiful soup with urllib2 would be useful, but I can't quite think of how to implement it.
Edit: Should I first get a list of text that matches the pattern, and then find the largest number, and then download it? Or is there a better/cleaner way?
Going to use requests and beautifulsoup4 for this.
Install them like so:
sudo pip install requests
sudo pip install beautifulsoup4
Here is the script that we will use:
#!/usr/bin/env python

import sys
import requests
from bs4 import BeautifulSoup

# URL of your index directory
url = 'https://example.com/open_dir/'
# Delimiter used to split the filenames
delimiter = '_'

# Issue a GET request to the URL, exit if an error occurs
try:
    r = requests.get(url)
except Exception as e:
    print '[-] Error: %s' % str(e)
    sys.exit(1)

# Feed the HTML response to BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

# Collect the file links from the directory listing
files = []
for table in soup.findAll('table'):
    for tr in table.findAll('tr'):
        for td in tr.findAll('td'):
            for a in td.findAll('a'):
                if a['href'] not in files:
                    if not a['href'].endswith('/'):
                        files.append(a['href'])

# Make sure the files list is not empty
if not files:
    print '[-] Error: no files found'
    sys.exit(1)

# Find the file with the largest number between the delimiters
filename = ''
largest = 0
for f in files:
    try:
        fParts = f.split(delimiter)
        number = int(fParts[1])          # compare as integers, not strings
        if number > largest:
            filename = f
            largest = number
    except Exception as e:
        print '[-] Error, file probably doesn\'t match pattern: %s' % str(e)

print '[+] Largest file: %s' % filename
Tested and working.
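Since the question also asks about downloading the match, here is a small follow-up sketch that fetches the selected file with requests, assuming the hrefs in the listing are relative to the index URL (url and filename come from the script above):

import requests
from urlparse import urljoin   # urllib.parse.urljoin on Python 3

download_url = urljoin(url, filename)
r = requests.get(download_url, stream=True)
with open(filename, 'wb') as out:
    for chunk in r.iter_content(chunk_size=8192):
        out.write(chunk)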

download files from the internet using urllib issue

I am trying to download data files from a website using urllib.
My code is
import urllib

url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list = ['4260514', '4260512', '4260519']
parameter_list = ['ecrec', 'ecday', 'flowrec', 'flowcday']

for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common + 'A' + site + '/' + 'a' + site + '_' + parameter + '.zip'
            urllib.urlretrieve(url, 'A' + site + '_' + parameter + '.zip')
        except ValueError:
            break
My issue is that some sites do not have all the parameter files. For example, with my code, the first site doesn't have flowcday, but Python still creates a zip file with no content. How can I stop Python from creating these files when there is no data?
Many thanks,
I think urllib2.urlopen is more suitable in this situation: it raises an HTTPError for a missing file, so you can skip it instead of writing an empty zip.
import urllib2
from urllib2 import URLError

url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list = ['4260514', '4260512', '4260519']
parameter_list = ['ecrec', 'ecday', 'flowrec', 'flowcday']

for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common + 'A' + site + '/' + 'a' + site + '_' + parameter + '.zip'
            name = 'A' + site + '_' + parameter + '.zip'
            req = urllib2.urlopen(url)
            with open(name, 'wb') as fh:
                fh.write(req.read())
        except URLError, e:
            if e.code == 404:
                print name + ' not found. moving on...'
                pass

Multi-threading for downloading NCBI files in Python

So recently I have taken on the task of downloading a large collection of files from the NCBI database. However, I have run into cases where I have to create multiple databases. The code below works to download all the viruses from the NCBI website. My question is: is there any way to speed up the process of downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multi-threading but could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling. (I am new to programming.) Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I get this sometimes with certain combinations of retstart and retmax. It crashes the program and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup

# This is the search query into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
# This is the database that you are searching.
database = 'protein'
# This is the output file for the data.
output = 'sample.fasta'
# This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

# Create the search string from the information above.
esearch = 'esearch.fcgi?db=' + database + '&term=' + SearchQuery + '&usehistory=y'
# Create your esearch url.
url = base + esearch
# Fetch your esearch using urllib2.
print url
content = urllib2.urlopen(url)
# Open the response in BeautifulSoup.
doc = BeautifulSoup(content)
# Grab the number of hits in the search.
Count = int(doc.find('count').string)
# Grab the WebEnv, i.e. the history of this search from usehistory.
WebEnv = doc.find('webenv').string
# Grab the QueryKey.
QueryKey = doc.find('querykey').string

# Set the maximum number of records to fetch at a time. The default is 500.
retmax = 10000
# Create the fetch string.
efetch = 'efetch.fcgi?db=' + database + '&WebEnv=' + WebEnv + '&query_key=' + QueryKey
# Select the output format and file format of the files.
# For the table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
# Create the options string for efetch.
options = '&rettype=' + format + '&retmode=' + type

# Loop from 0 to Count, counting by retmax. Use xrange over range.
for i in xrange(0, Count, retmax):
    # Create the position string.
    position = '&retstart=' + str(i) + '&retmax=' + str(retmax)
    # Create the efetch URL.
    url = base + efetch + position + options
    print url
    # Grab the results.
    response = urllib2.urlopen(url)
    # Write output to file.
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    # Gives a sense of where you are.
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python

import shutil
from contextlib import closing
from multiprocessing.dummy import Pool  # use threads
from urllib2 import urlopen

def generate_urls(some, params):  # XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else:  # success
        return (url, filename), None

def main():
    pool = Pool(20)  # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
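The generate_urls() stub above is left as a placeholder. Purely as an illustration, here is what it might look like for the efetch loop in the question; base, efetch, options, retmax and Count are the values built in the question's script, and the batch_*.fasta filenames are invented for this sketch:

def generate_urls(count, retmax):
    # Yield one (url, filename) pair per efetch batch.
    for retstart in xrange(0, count, retmax):
        position = '&retstart=%d&retmax=%d' % (retstart, retmax)
        url = base + efetch + position + options
        filename = 'batch_%d.fasta' % retstart   # hypothetical per-batch output file
        yield url, filename

main() would then call generate_urls(Count, retmax) in place of generate_urls(some, params), and the per-batch files could be concatenated afterwards if a single output file is needed.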
You should use multithreading; it is the right approach for download tasks.
As for "these files take more than 10 seconds to download and I do not know how to handle stalling": I don't think this will be a problem, because Python's multithreading handles exactly this kind of I/O-bound work. While one thread is waiting for a download to complete, the CPU lets the other threads do their work. Anyway, you'd better at least try it and see what happens.
There are two other ways to tackle the task: 1. use processes instead of threads (the multiprocessing module), or 2. go event-based (gevent is the right module for that).
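For completeness, a minimal gevent sketch (assuming gevent is installed; download_one() and url_list are placeholders for your existing single-file download code and your list of URL/filename pairs):

from gevent import monkey
monkey.patch_all()          # make urllib2's blocking sockets cooperative
import gevent

def fetch(url, filename):
    download_one(url, filename)   # placeholder: your existing download code

jobs = [gevent.spawn(fetch, url, name) for url, name in url_list]
gevent.joinall(jobs)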
The 502 error is not your script's fault. The following pattern can be used to retry:
try_count = 3
while try_count > 0:
    try:
        download_task()
        break                      # success, stop retrying
    except urllib2.HTTPError:
        clean_environment_for_retry()
        try_count -= 1
Inside the except clause you can inspect the concrete HTTP status code and handle particular errors differently.
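For example, a sketch that retries only on a 502 and re-raises anything else (download_task() and clean_environment_for_retry() are the same placeholders as above):

import urllib2

try_count = 3
while try_count > 0:
    try:
        download_task()
        break                          # success, stop retrying
    except urllib2.HTTPError as e:
        if e.code == 502:              # Bad Gateway: transient, retry
            clean_environment_for_retry()
            try_count -= 1
        else:
            raise                      # anything else is a real error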

How do I download a zip file in python using urllib2?

Two-part question. I am trying to download multiple archived Cory Doctorow podcasts from the Internet Archive, the old ones that do not come into my iTunes feed. I have written the script, but the downloaded files are not properly formatted.
Q1 - What do I change to download the zip mp3 files?
Q2 - What is a better way to pass the variables into URL?
# and the base url.
def dlfile(file_name, file_mode, base_url):
    from urllib2 import Request, urlopen, URLError, HTTPError

    # Create the url and the request
    url = base_url + file_name + mid_url + file_name + end_url
    req = Request(url)

    # Open the url
    try:
        f = urlopen(req)
        print "downloading " + url

        # Open our local file for writing
        local_file = open(file_name, "wb" + file_mode)
        # Write to our local file
        local_file.write(f.read())
        local_file.close()

    # handle errors
    except HTTPError, e:
        print "HTTP Error:", e.code, url
    except URLError, e:
        print "URL Error:", e.reason, url

# Set the range
var_range = range(150, 153)
# Iterate over image ranges
for index in var_range:
    base_url = 'http://www.archive.org/download/Cory_Doctorow_Podcast_'
    mid_url = '/Cory_Doctorow_Podcast_'
    end_url = '_64kb_mp3.zip'
    # Create file name based on known pattern
    file_name = str(index)
    dlfile(file_name, "wb", base_url)
This script was adapted from here
Here's how I'd deal with the URL building and downloading. I make sure to name the file as the basename of the URL (the last bit after the trailing slash), and I also use a with statement to open the file to write to. This uses a context manager, which is nice because it closes the file when the block exits. In addition, I use a template to build the string for the URL. urlopen doesn't need a Request object, just a string.
import os
from urllib2 import urlopen, URLError, HTTPError

def dlfile(url):
    # Open the url
    try:
        f = urlopen(url)
        print "downloading " + url

        # Open our local file for writing
        with open(os.path.basename(url), "wb") as local_file:
            local_file.write(f.read())

    # handle errors
    except HTTPError, e:
        print "HTTP Error:", e.code, url
    except URLError, e:
        print "URL Error:", e.reason, url

def main():
    # Iterate over image ranges
    for index in range(150, 151):
        url = ("http://www.archive.org/download/"
               "Cory_Doctorow_Podcast_%d/"
               "Cory_Doctorow_Podcast_%d_64kb_mp3.zip" %
               (index, index))
        dlfile(url)

if __name__ == '__main__':
    main()
An older solution on SO along the lines of what you want:
download a zip file to a local drive and extract all files to a destination folder using python 2.5
Python and urllib
