I am trying to download data files from a website using urllib.
My code is
import urllib

url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list = ['4260514', '4260512', '4260519']
parameter_list = ['ecrec', 'ecday', 'flowrec', 'flowcday']

for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common + 'A' + site + '/' + 'a' + site + '_' + parameter + '.zip'
            urllib.urlretrieve(url, 'A' + site + '_' + parameter + '.zip')
        except ValueError:
            break
My issue is that some sites do not have all the parameter files. For example, with my code, site 1 doesn't have flowcday, but Python still creates the zip file with no content. How can I stop Python from creating these files when there is no data?
Many thanks,
I think maybe urllib2.urlopen is more suitable in this situation.
import urllib2
from urllib2 import HTTPError

url_common = 'http://apps.waterconnect.sa.gov.au/SiteInfo/Data/Site_Data/'
site_list = ['4260514', '4260512', '4260519']
parameter_list = ['ecrec', 'ecday', 'flowrec', 'flowcday']

for site in site_list:
    for parameter in parameter_list:
        try:
            url = url_common + 'A' + site + '/' + 'a' + site + '_' + parameter + '.zip'
            name = 'A' + site + '_' + parameter + '.zip'
            req = urllib2.urlopen(url)
            with open(name, 'wb') as fh:
                fh.write(req.read())
        except HTTPError as e:  # HTTPError (unlike a plain URLError) carries the status code
            if e.code == 404:
                print name + ' not found. moving on...'
Related
I am trying to write a function that takes a url and a path and downloads a file to that path IF it's a text file.
import urllib.request
import re
import os

mcBethURL = 'https://ia802707.us.archive.org/1/items/macbeth02264gut/0ws3410.txt'

def download_file(url, path, local_filename):
    try:
        url_type = urllib.request.urlopen(url).info()['content-type']
        if bool(re.search('t[e]*xt', url_type)):
            local_filename = url.split('/')[-1]
            location = os.path.join("/{}/{}".format(path, local_filename))
            urllib.request.urlretrieve(url, path, filename=local_filename)
        else:
            print('No text file found at given URL, download aborted!')
    # some more exceptions here yet not relevant
    except:
        print('invalid url')

download_file(mcBethURL, '/home/wilma/PycharmProjects/Uni', 'mcBeth')
urllib.request.urlretrieve(url, path, filename=local_filename) doesn't work (it ends up printing 'invalid url'), while urllib.request.urlretrieve(url, filename=local_filename) works, but then I cannot specify a path. I added the path argument after looking at How to download to a specific directory?
Do you have an idea why I cannot call urlretrieve with both a path variable and a name for the file the download should be saved in?
Looking at What command to use instead of urllib.request.urlretrieve?, it seems urllib.request.urlretrieve is on the way out, and you might consider using shutil.copyfileobj or requests.get instead. From the docs, this example seems relevant to the legacy interface you are using:
import urllib.request
local_filename, headers = urllib.request.urlretrieve('http://python.org/')
html = open(local_filename)
html.close()
In the docs, urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None) has no separate path parameter: your positional path argument gets bound to filename, and the filename= keyword you pass as well then conflicts with it (the resulting error is swallowed by your bare except, which is why you see 'invalid url').
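As a sketch of what I mean (not your exact code, and the function names are just placeholders): build the full destination path yourself with os.path.join and hand it to urlretrieve as filename, or use the requests/shutil combination instead of the legacy interface.

import os
import shutil
import urllib.request
import requests

def download_with_urlretrieve(url, path, local_filename):
    # the only destination parameter urlretrieve accepts is `filename`,
    # so join the directory and the file name before the call
    target = os.path.join(path, local_filename)
    urllib.request.urlretrieve(url, filename=target)
    return target

def download_with_requests(url, path, local_filename):
    # non-legacy alternative: stream the response body straight to disk
    target = os.path.join(path, local_filename)
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(target, 'wb') as fh:
            shutil.copyfileobj(response.raw, fh)
    return target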
Basically, my goal is to fetch the filename, extension and the content of an image by its url. And my function should work for both of these urls:
easy case:
https://image.shutterstock.com/image-photo/bright-spring-view-cameo-island-260nw-1048185397.jpg
hard case (does not end with filename.extension):
https://images.unsplash.com/photo-1472214103451-9374bd1c798e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
Currently, what I have looks like this:
import urllib.request
from os.path import splitext, basename

def get_filename_from_url(url):
    result = urllib.request.urlretrieve(url)
    filename, file_ext = splitext(basename(result.path))
    print(filename, file_ext)
This works fine for the easy case. But apparently there is no solution for the hard-case url. I have a feeling that I can use Python's requests module and parse the header to find the mimetype, and then use the mimetypes module's guess_extension functionality to extract the necessary data. So I went on to try this:
import requests
response = requests.get(url, stream=True)
Here, someone seems to describe the clue, but the problem is that with the hard-case url I get something strange in the response's header items, and maybe my key issue is that I don't know the correct way to parse the response header to extract what I need.
I've tried a third approach using urlparse:
import os
from urllib.parse import urlparse

result = urlparse(url)
print(os.path.basename(result.path))  # 'photo-1472214103451-9374bd1c798e'
which yields the filename, but again, the extension is missing here...
The ideal solution would be to get the filename, file extension and file content in one go, preferably while being able to validate that the url actually points to an image, not something else...
UPD:
The second element, result[1], in result = urllib.request.urlretrieve(self.url) seems to contain the Content-Type, but I can't figure out how to extract it correctly.
One way is to query the content type:
>>> from urllib.request import urlopen
>>> response = urlopen(url)
>>> response.info().get_content_type()
'image/jpeg'
or using urlretrieve as in your edit:
>>> response = urllib.request.urlretrieve(url)
>>> response[1].get_content_type()
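If you want the filename, extension and content in one go, something along these lines should work (a rough sketch, not a canonical recipe; fetch_image is my own name): take the name from the URL path, guess the extension from the Content-Type with mimetypes.guess_extension, and check that the type is an image before reading the body.

import mimetypes
import os
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_image(url):
    response = urlopen(url)
    content_type = response.info().get_content_type()   # e.g. 'image/jpeg'
    if not content_type.startswith('image/'):
        raise ValueError('URL does not point to an image: %s' % content_type)
    data = response.read()
    # name from the URL path, extension guessed from the Content-Type header
    name = os.path.basename(urlparse(url).path) or 'download'
    ext = mimetypes.guess_extension(content_type) or ''
    return name, ext, data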
I am building a Rust program in which the user types in a command, and then the program reads the command and responds accordingly. One of these commands is to download a file from a set site.
I have a .py file with the following code that I made a while ago that downloads files from a set site:
import urllib
import urllib2
import requests

url = 'http://www.blog.pythonlibrary.org/wpcontent/uploads/2012/06/wxDbViewer.zip'

print "downloading with urllib"
urllib.urlretrieve(url, "code.zip")

print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.zip", "wb") as code:
    code.write(data)

print "downloading with requests"
r = requests.get(url)
with open("code3.zip", "wb") as code:
    code.write(r.content)
The URLs in the code are not ones that I will be using; they are examples.
If the Rust program sets the site it needs to go to as a variable, is there a way that I could send that variable to my Python program? I know you can send data from Python to Rust:
Passing a list of strings from Python to Rust
http://www.joesacher.com/blog/2017/08/24/ptr-types/
Is there a way to do this in the other direction?
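One straightforward way (this is my own sketch, not something from the linked posts) is to have the Rust program spawn the Python script with std::process::Command and pass the URL as a command-line argument; the Python side then just reads it from sys.argv:

# download.py -- minimal sketch; expects the URL as the first command-line argument
import sys
import urllib

if len(sys.argv) < 2:
    sys.exit('usage: download.py <url>')

url = sys.argv[1]
urllib.urlretrieve(url, 'code.zip')

On the Rust side, something like Command::new("python").arg("download.py").arg(&site_url).status() would hand the variable across (site_url being whatever variable your Rust program holds).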
So recently I have taken on the task of downloading a large collection of files from the NCBI database. However, I have run into times where I have to create multiple databases. This code here works to download all the viruses from the NCBI website. My question is: is there any way to speed up the process of downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multi-threading and could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I get this sometimes with certain combinations of retstart and retmax. This crashes the program and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup
#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0, Count, retmax):
    #Create the position string
    position = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + position + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool  # use threads
from urllib2 import urlopen

def generate_urls(some, params):  #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else:  # success
        return (url, filename), None

def main():
    pool = Pool(20)  # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
You should use multithreading; it's the right approach for download tasks.
"these files take more than 10 seconds to download and I do not know how to handle stalling"
I don't think this would be a problem, because Python's multithreading handles it; or rather, multithreading is exactly what this kind of I/O-bound work calls for. When a thread is waiting for a download to complete, the CPU lets the other threads do their work.
Anyway, you'd better at least try it and see what happens.
Two other ways to approach your task: 1. use processes instead of threads; multiprocessing is the module to use. 2. Go event-based; gevent is the right module (a rough sketch follows below).
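As a rough gevent sketch (assuming gevent is installed, and that urls and download_task(url) are your own pieces of code):

from gevent import monkey
monkey.patch_all()  # make sockets (and therefore urllib2) cooperative
import gevent

jobs = [gevent.spawn(download_task, url) for url in urls]
gevent.joinall(jobs)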
A 502 error is not your script's fault. Simply put, the following pattern could be used to retry:
try_count = 3
while try_count > 0:
    try:
        download_task()
        break  # success, no need to retry
    except urllib2.HTTPError:
        clean_environment_for_retry()
        try_count -= 1
In the except clause, you can refine the handling and do particular things according to the concrete HTTP status code.
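For example, a possible refinement of the loop above (download_task is still a placeholder for your own download code) that retries only on 502 and re-raises anything else:

import time
import urllib2

try_count = 3
while try_count > 0:
    try:
        download_task()
        break                      # success, stop retrying
    except urllib2.HTTPError as e:
        if e.code == 502:          # Bad Gateway is usually transient: wait, then retry
            try_count -= 1
            time.sleep(5)
        else:
            raise                  # any other HTTP error is a real problem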
I have made this simple download manager, but the problem is it won't work on complex urls, when pages are redirected.
def str(d):
    for i in range(len(d)):
        if d[-i] == '/':
            x = -i
            break
    s = []
    l = len(d) + x + 1
    print d[l], d[len(d)-1]
    s = d[l:]
    return s

import urllib2

url = raw_input()
filename = str(url)
webfile = urllib2.urlopen(url)
data = webfile.read()
fout = open(filename, "w")
fout.write(data)
fout.close()
webfile.close()
it wouldn't work for http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&ved=0CG0QFjAI&url=http%3A%2F%2Fwww.iasted.org%2Fconferences%2Fformatting%2FPresentations-Tips.ppt&ei=clfWTpjZEIblrAfC8qWXDg&usg=AFQjCNEIgqx6x4ULHFXzzYDzCITuUJOczA&sig2=0VtKXPvoDnIq-lIR4S9LEQ
while it would work for http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt
and both links are for the same file.
How do I solve the problem of redirection?
I think redirection is not a problem here: urllib2 already follows redirects automatically, and Google only sends you to a different page in case of error.
Try this script:
url1 = 'http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&ved=0CG0QFjAI&url=http%3A%2F%2Fwww.iasted.org%2Fconferences%2Fformatting%2FPresentations-Tips.ppt&ei=clfWTpjZEIblrAfC8qWXDg&usg=AFQjCNEIgqx6x4ULHFXzzYDzCITuUJOczA&sig2=0VtKXPvoDnIq-lIR4S9LEQ'
url2 = 'http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt'

from urlparse import urlsplit
from urllib2 import urlopen

for url in [url1, url2]:
    split = urlsplit(url)
    filename = split.path[split.path.rfind('/')+1:]
    if not filename:
        filename = split.query[split.query.rfind('/')+1:]
    f = open(filename, 'wb')  # binary mode, so the .ppt is not mangled
    f.write(urlopen(url).read())
    f.close()

# Yields 2 files: url and Presentations-Tips.ppt [both are ppt files]
The above script works every time.
In general, you handle redirection by using urllib2.HTTPRedirectHandler, like this:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
res = opener.open('http://example.com/some/url/')
However, it doesn't look like this will work for the Google URL you've given in your example, because rather than including a Location header in the response, the Google result looks like this:
<script>window.googleJavaScriptRedirect=1</script><script>var a=parent,b=parent.google,c=location;if(a!=window&&b){if(b.r){b.r=0;a.location.href="http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt";c.replace("about:blank");}}else{c.replace("http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt");};</script><noscript><META http-equiv="refresh" content="0;URL='http://www.iasted.org/conferences/formatting/Presentations-Tips.ppt'"></noscript>
...which is to say, it uses a JavaScript redirect, which substantially complicates your life. You could use Python's re module to extract the correct location from this block.
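For example, a rough sketch along those lines (google_url stands for the long Google URL from the question, and the exact regular expression is an assumption about the page Google happens to serve, so treat it as a starting point rather than a guaranteed parser):

import re
import urllib2

html = urllib2.urlopen(google_url).read()
# look for the c.replace("http://...") call in the JavaScript redirect block
match = re.search(r'c\.replace\("(http[^"]+)"\)', html)
if match:
    real_url = match.group(1)
    data = urllib2.urlopen(real_url).read()
    with open(real_url.split('/')[-1], 'wb') as f:
        f.write(data)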