I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at roughly 30 KB per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of the file (already covered for Python here).
Just start several threads and fetch a different range with each, and you're done ;)
import threading
import urllib2

url = "http://bigfile.com/bigfile.bin"
chunk_size = 1 << 20  # 1 MiB per request
parts = {}

def download(url, start):
    req = urllib2.Request(url)
    # Range end is inclusive, so subtract 1 to avoid overlapping chunks
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
# Initialize threads
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)

# Join threads back (order doesn't matter, you just want them all)
for t in threads:
    t.join()

# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (servers where a PHP script is responsible for serving the data, in particular, often don't implement handling for it).
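If you want to probe for Range support up front, one rough heuristic (a sketch only; the Accept-Ranges header is advisory, and some servers honour Range requests without advertising it) is to issue a HEAD request and inspect the response headers:
import urllib2

def supports_byte_ranges(url):
    # urllib2 has no HEAD helper, so override get_method on the Request
    req = urllib2.Request(url)
    req.get_method = lambda: 'HEAD'
    response = urllib2.urlopen(req)
    # Servers that support partial content usually answer "Accept-Ranges: bytes"
    return response.info().getheader('Accept-Ranges', 'none').lower() == 'bytes'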
Here's a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool # use threads
from urllib2 import HTTPError, Request, urlopen
def download_chunk(url, byterange):
req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
try:
return urlopen(req).read()
except HTTPError as e:
return b'' if e.code == 416 else None # treat range error as EOF
except EnvironmentError:
return None
def main():
url, filename = sys.argv[1:]
pool = Pool(4) # define number of concurrent connections
chunksize = 1 << 16
ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
if not s:
break # error or EOF
file.write(s)
if len(s) != chunksize:
break # EOF (servers with no Range support end up here)
if __name__ == "__main__":
main()
The end of the file is detected when the server returns an empty body, a 416 HTTP status code, or a response whose size is not exactly chunksize.
It supports servers that don't understand the Range header (in that case everything is downloaded in a single request; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself; a sketch of that variant follows at the end of this answer).
It lets you change the number of concurrent connections (pool size) and the number of bytes requested in a single HTTP request independently of each other.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
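The temporary-file variant mentioned above might look roughly like this (download_chunk_to_file is a name made up here; main() would then copy the file at the returned path into the output, delete it, and use os.path.getsize() for the chunk-size check):
import tempfile
from urllib2 import HTTPError, Request, urlopen

def download_chunk_to_file(url, byterange):
    # Same request logic as download_chunk(), but the body is spooled to a
    # temporary file and its path is returned instead of the content itself.
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        response = urlopen(req)
    except HTTPError as e:
        return '' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        while True:
            block = response.read(1 << 20)  # read 1 MiB at a time
            if not block:
                break
            tmp.write(block)
        return tmp.name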
This solution requires the Linux utility aria2c, but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the HTTP directory listing for the location MY_HTTP_LOC. I tested this script against a lighttpd/1.4.26 HTTP server, but you can easily modify it to work with other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess
MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"
# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()
# extract relevant URL segments from source code
rgxp = '(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp,str(page))
files = []
for match in results:
files.append(match[1])
# download (using aria2c) files
for afile in files:
if os.path.exists(afile) and not os.path.exists(afile+'.aria2'):
print 'Skipping already-retrieved file: ' + afile
else:
print 'Downloading file: ' + afile
subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC+str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads and can do a lot more; it also gives you a download progress bar by default.
For more info, check this answer.
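For reference, basic usage looks like this (the same pattern is shown in more detail in the pySmartDL answer further down; the URL and destination are just placeholders):
from pySmartDL import SmartDL

url = "http://bigfile.com/bigfile.bin"
dest = "./bigfile.bin"
obj = SmartDL(url, dest)  # downloads with multiple threads by default
obj.start()               # blocks until finished and shows a progress bar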
Related
I am downloading a lot of files from a website and want to run the downloads in parallel because they are heavy. Unfortunately I can't really share the website, because to access the files I need a username and password which I can't share. The code below is my code; I know it can't really be run without the website and my username and password, but I am 99% sure I am not allowed to share that information.
import os
import requests
from multiprocessing import Process
dataset="dataset_name"
################################
def down_file(dspath, file, savepath, ret):
webfilename = dspath+file
file_base = os.path.basename(file)
    file = os.path.join(savepath, file_base)
print('...Downloading',file_base)
req = requests.get(webfilename, cookies = ret.cookies, allow_redirects=True, stream=True)
filesize = int(req.headers['Content-length'])
with open(file, 'wb') as outfile:
chunk_size=1048576
for chunk in req.iter_content(chunk_size=chunk_size):
outfile.write(chunk)
return None
################################
##Download files
def download_files(filelist, c_DateNow):
## Authenticate
url = 'url'
values = {'email' : 'email', 'passwd' : "password", 'action' : 'login'}
ret = requests.post(url, data=values)
## Path to files
dspath = 'datasetwebpath'
    savepath = os.path.join(path_script, dataset, c_DateNow)  # path_script is defined elsewhere in this module
    os.makedirs(savepath, exist_ok=True)
#"""
processes = [Process(target=down_file, args=(dspath, file, savepath, ret)) for file in filelist]
print(["dspath, %s, savepath, ret\n"%(file) for file in filelist])
# kick them off
for process in processes:
print("\n", process)
process.start()
# now wait for them to finish
for process in processes:
process.join()
#"""
####### This works and it's what i want to parallelize
"""
##Download files
for file in filelist:
down_file(dspath, file, savepath, ret)
#"""
################################
def main(c_DateNow, c_DateIni, c_DateFin):
## Other code
files=["list of web file addresses"]
print(" ...Files being downladed\n ", "\n ".join(files), "\n")
## Doanlad files
download_files(files, c_DateNow)
I want to download 25 files. When I run the code, all the print lines that appeared earlier in the program are printed again, even though execution is nowhere near them. I am also getting the following error constantly:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
I googled the error and don't know how to fix it. Does it have to do with there not being enough cores? Is there a way to stop the Process depending on how many cores I have available? Or is it something else entirely?
In a question here, I read that the Process has to be created under the __main__ guard, but this code is a module that gets imported by another script, so when I run it, I run it like this:
import this_code
import another1_code
import another2_code
#Step1
another1_code.main()
#Step2
c_DateNow, c_DateIni, c_DateFin = another2_code.main()
#Step3
this_code.main(c_DateNow, c_DateIni, c_DateFin)
#step4
## More code
So I need the process to be within a function and not in __main__
I appreciate any help or suggestions on how to correctly parallelize the above code in a way that allows me to use it as a module in another code.
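For reference, the "proper idiom in the main module" that the error message mentions only needs to live in the script you actually run, not inside this module. A minimal sketch, assuming a hypothetical top-level script that does the imports shown above:
# run_all.py (hypothetical name): the script that is actually executed.
# Process objects may still be created inside this_code.download_files();
# they just have to be started from a call chain that originates under
# this guard, so that child processes can re-import the modules safely.
import this_code
import another1_code
import another2_code

if __name__ == "__main__":
    another1_code.main()
    c_DateNow, c_DateIni, c_DateFin = another2_code.main()
    this_code.main(c_DateNow, c_DateIni, c_DateFin)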
Good day,
I am working on a directory scanner and trying to speed it up as much as possible. I have been looking into using multiprocessing; however, I do not believe I am using it correctly.
from multiprocessing import Pool
import requests
import sys
def dir_scanner(wordlist=sys.argv[1],dest_address=sys.argv[2],file_ext=sys.argv[3]):
print(f"Scanning Target: {dest_address} looking for files ending in {file_ext}")
# read a wordlist
dir_file = open(f"{wordlist}").read()
dir_list = dir_file.splitlines()
# empty list for discovered dirs
discovered_dirs = []
# make requests for each potential dir location
for dir_item in dir_list:
req_url = f"http://{dest_address}/{dir_item}.{file_ext}"
req_dir = requests.get(req_url)
print(req_url)
if req_dir.status_code==404:
pass
else:
print("Directroy Discovered ", req_url)
discovered_dirs.append(req_url)
with open("discovered_dirs.txt","w") as f:
        for directory in discovered_dirs:
            print(directory, file=f)
if __name__ == '__main__':
with Pool(processes=4) as pool:
dir_scanner(sys.argv[1],sys.argv[2],sys.argv[3])
Is the above example the correct usage of Pool? Ultimately I am attempting to speed up the requests that are being made to the target.
UPDATE: Perhaps not the most elegant solution, but:
from multiprocessing import Pool
import requests
import sys
# USAGE EXAMPLE: python3 dir_scanner.py <wordlist> <target address> <file extension>
discovered_dirs = []
# read in the wordlist
dir_file = open(f"{sys.argv[1]}").read()
dir_list = dir_file.splitlines()
def make_request(dir_item):
    # create a GET request URL based on an item from the wordlist
    req_url = f"http://{sys.argv[2]}/{dir_item}.{sys.argv[3]}"
    return req_url, requests.get(req_url)
# map the requests made by make_requests to speed things up
with Pool(processes=4) as pool:
for req_url, req_dir in pool.map(make_request, dir_list):
# if the request resp is a 404 move on
if req_dir.status_code == 404:
pass
# if not a 404 resp then add it to the list
else:
print("Directroy Discovered ", req_url)
discovered_dirs.append(req_url)
# create a new file and append it with directories that were discovered
with open("discovered_dirs.txt","w") as f:
    for directory in discovered_dirs:
        print(directory, file=f)
Right now, you are creating a pool and not using it.
You can use pool.map to distribute the requests across multiple processes:
...
def make_request(dir_item):
req_url = f"http://{dest_address}/{dir_item}.{file_ext}"
return req_url, requests.get(req_url)
with Pool(processes=4) as pool:
for req_url, req_dir in pool.map(make_request, dir_list):
print(req_url)
if req_dir.status_code == 404:
pass
else:
print("Directroy Discovered ", req_url)
discovered_dirs.append(req_url)
...
In the example above, the function make_request is executed in the worker processes.
The Python documentation gives a lot of examples.
I have a use case, where a large remote file needs to be downloaded in parts, by using multiple threads.
Each thread must run simultaneously (in parallel), grabbing a specific part of the file.
The expectation is to combine the parts into a single (original) file, once all parts were successfully downloaded.
Perhaps using the requests library could do the job, but then I am not sure how I would multithread this into a solution that combines the chunks together.
url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"} # first megabyte
r = get(url, headers=headers)
I was also thinking of using curl, with Python orchestrating the downloads, but I am not sure that's the correct way to go. It just seems too complex and strays away from a vanilla Python solution. Something like this:
curl --range 200000000-399999999 -o file.iso.part2
Can someone explain how you'd go about something like this? Or post a code example of something that works in Python 3? I usually find the Python-related answers quite easily, but the solution to this problem seems to be eluding me.
Here is a version using Python 3 with asyncio. It's just an example and can be improved, but you should be able to get everything you need from it.
get_size: send a HEAD request to get the size of the file
download_range: download a single chunk
download: download all the chunks and merge them
import asyncio
import concurrent.futures
import functools
import requests
import os
# WARNING:
# Here I'm pointing to a publicly available sample video.
# If you are planning on running this code, make sure the
# video is still available as it might change location or get deleted.
# If necessary, replace it with a URL you know is working.
URL = 'https://download.samplelib.com/mp4/sample-30s.mp4'
OUTPUT = 'video.mp4'
async def get_size(url):
response = requests.head(url)
size = int(response.headers['Content-Length'])
return size
def download_range(url, start, end, output):
headers = {'Range': f'bytes={start}-{end}'}
response = requests.get(url, headers=headers)
with open(output, 'wb') as f:
for part in response.iter_content(1024):
f.write(part)
async def download(run, loop, url, output, chunk_size=1000000):
file_size = await get_size(url)
chunks = range(0, file_size, chunk_size)
tasks = [
run(
download_range,
url,
start,
start + chunk_size - 1,
f'{output}.part{i}',
)
for i, start in enumerate(chunks)
]
await asyncio.wait(tasks)
with open(output, 'wb') as o:
for i in range(len(chunks)):
chunk_path = f'{output}.part{i}'
with open(chunk_path, 'rb') as s:
o.write(s.read())
os.remove(chunk_path)
if __name__ == '__main__':
executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
loop = asyncio.new_event_loop()
run = functools.partial(loop.run_in_executor, executor)
asyncio.set_event_loop(loop)
try:
loop.run_until_complete(
download(run, loop, URL, OUTPUT)
)
finally:
loop.close()
The best way I found is to use a module called pySmartDL.
Step 1: pip install pySmartDL
Step 2: to download the file you could use
from pySmartDL import SmartDL
obj = SmartDL(url, destination)
obj.start()
Note: this gives you a download meter by default.
In case you need to hook the download progress to a GUI, you could use
import time

obj = SmartDL(url, dest, progress_bar=False)
obj.start(blocking=False)
while not obj.isFinished():
    download_percentage = round(obj.get_progress() * 100, 2)
    time.sleep(0.2)
    print(download_percentage)
If you want to use more threads, you can use
obj = SmartDL(url, destination, threads=7)  # by default threads = 5
obj.start()
You can find many more features on the project page:
Downloads: http://pypi.python.org/pypi/pySmartDL/
Documentation: http://itaybb.github.io/pySmartDL/
Project page: https://github.com/iTaybb/pySmartDL/
Bugs and Issues: https://github.com/iTaybb/pySmartDL/issues
You could use grequests to download in parallel.
import grequests
URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600 # 100 MB
HEADERS = []
_start, _stop = 0, 0
for x in range(4):  # file size is > 300 MB, so we download it in 4 parts
    _start = _stop
    _stop = CHUNK_SIZE * (x + 1)
    # the Range end is inclusive, so subtract 1 to avoid overlapping chunks
    HEADERS.append({"Range": "bytes=%s-%s" % (_start, _stop - 1)})
rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'ab') as f:
for download in downloads:
print(download.status_code)
f.write(download.content)
PS: I did not check whether the ranges are correctly determined or whether the downloaded md5sum matches! This should just show in general how it could work.
You can also use ThreadPoolExecutor (or ProcessPoolExecutor) from concurrent.futures instead of using asyncio. The following shows how to modify bug's answer by using ThreadPoolExecutor:
Bonus: the following snippet also uses tqdm to show a progress bar of the download. If you don't want to use tqdm, just comment out the with tqdm(total=file_size . . . block. More information on tqdm is here; it can be installed with pip install tqdm. Btw, tqdm can also be used with asyncio.
import requests
import concurrent.futures
from concurrent.futures import as_completed
from tqdm import tqdm
import os
def download_part(url_and_headers_and_partfile):
url, headers, partfile = url_and_headers_and_partfile
response = requests.get(url, headers=headers)
# setting same as below in the main block, but not necessary:
chunk_size = 1024*1024
# Need size to make tqdm work.
size=0
with open(partfile, 'wb') as f:
for chunk in response.iter_content(chunk_size):
if chunk:
size+=f.write(chunk)
return size
def make_headers(start, chunk_size):
end = start + chunk_size - 1
return {'Range': f'bytes={start}-{end}'}
url = 'https://download.samplelib.com/mp4/sample-30s.mp4'
file_name = 'video.mp4'
response = requests.get(url, stream=True)
file_size = int(response.headers.get('content-length', 0))
chunk_size = 1024*1024
chunks = range(0, file_size, chunk_size)
my_iter = [[url, make_headers(chunk, chunk_size), f'{file_name}.part{i}'] for i, chunk in enumerate(chunks)]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
jobs = [executor.submit(download_part, i) for i in my_iter]
with tqdm(total=file_size, unit='iB', unit_scale=True, unit_divisor=chunk_size, leave=True, colour='cyan') as bar:
for job in as_completed(jobs):
size = job.result()
bar.update(size)
with open(file_name, 'wb') as outfile:
for i in range(len(chunks)):
chunk_path = f'{file_name}.part{i}'
with open(chunk_path, 'rb') as s:
outfile.write(s.read())
os.remove(chunk_path)
I am retrieving data files from an FTP server in a loop with the following code:
response = urllib.request.urlopen(url)
data = response.read()
response.close()
compressed_file = io.BytesIO(data)
gin = gzip.GzipFile(fileobj=compressed_file)
Retrieving and processing the first few works fine, but after a few requests I am getting the following error:
530 Maximum number of connections exceeded.
I tried closing the connection (see the code above) and using a sleep() timer, but neither worked. What am I doing wrong here?
Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without properly ensuring the connections get closed.
ftplib is more appropriate, I think.
Since I happen to be working on the same data you are(were)... Here is a very specific answer decompressing the .gz files and passing them into ish_parser (https://github.com/haydenth/ish_parser).
I think it is also clear enough to serve as a general answer.
import ftplib
import io
import gzip
import ish_parser # from: https://github.com/haydenth/ish_parser
ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()
# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)
with ftplib.FTP(host=ftp_host) as ftpconn:
ftpconn.login()
for year in YEARS:
ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
print(ftp_file)
# read the whole file and save it to a BytesIO (stream)
response = io.BytesIO()
try:
ftpconn.retrbinary('RETR '+ftp_file, response.write)
except ftplib.error_perm as err:
if str(err).startswith('550 '):
print('ERROR:', err)
else:
raise
# decompress and parse each line
response.seek(0) # jump back to the beginning of the stream
with gzip.open(response, mode='rb') as gzstream:
for line in gzstream:
parser.loads(line.decode('latin-1'))
This does read the whole (compressed) file into memory, which could probably be avoided with some clever wrappers and/or yield tricks, but it works fine for a year's worth of hourly weather observations.
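If memory did become a problem, one way to avoid buffering the whole file (a sketch only; retr_gzip_lines and handle_line are names made up here) is to decompress inside the retrbinary() callback with zlib.decompressobj:
import zlib

def retr_gzip_lines(ftpconn, ftp_file, handle_line):
    # Decompress the gzip stream as it arrives, so neither the compressed nor
    # the decompressed file is ever held in memory in one piece.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+MAX_WBITS accepts gzip framing
    leftover = [b'']

    def on_chunk(chunk):
        data = leftover[0] + decomp.decompress(chunk)
        lines = data.split(b'\n')
        leftover[0] = lines.pop()  # keep the trailing partial line for the next chunk
        for line in lines:
            handle_line(line)

    ftpconn.retrbinary('RETR ' + ftp_file, on_chunk)
    if leftover[0]:
        handle_line(leftover[0])  # flush a final line that had no trailing newline
Used in the loop above, it would be called as retr_gzip_lines(ftpconn, ftp_file, lambda line: parser.loads(line.decode('latin-1'))).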
Probably a pretty nasty workaround, but it worked for me. I made a script (here called test.py) which does the request (see the code above). The code below is used in the loop I mentioned and calls test.py:
from subprocess import call
with open('log.txt', 'a') as f:
call(['python', 'test.py', args[0], args[1]], stdout=f)
Recently I have been working on a tiny crawler for downloading images from a URL.
I use urlopen() from urllib2 together with open()/f.write():
Here is the code snippet:
# the list for the images' urls
imglist = re.findall(regImg,pageHtml)
# iterate to download images
for index in xrange(1,len(imglist)+1):
img = urllib2.urlopen(imglist[index-1])
f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
print('To Read...')
# potential timeout, may block for a long time
# so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
f.write(img.read())
f.close()
print('Image %d is ready !' % index)
In the code above, img.read() can potentially block for a long time; I would like to retry/re-open the image URL when that happens.
I am also concerned about the efficiency of the code above: if the number of images to download is fairly large, using a thread pool to download them seems better.
Any suggestions? Thanks in advance.
P.S. I found that the read() method on the img object may block, so adding a timeout parameter to urlopen() alone seems useless. But I found that the file object has no timeout version of read(). Any suggestions on this? Thanks very much.
urllib2.urlopen() has a timeout parameter which is used for all blocking operations (connection buildup, socket reads, etc.).
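For example (a sketch; the URL and the 30-second value are arbitrary placeholders):
import socket
import urllib2

img_url = 'http://example.com/image.jpg'  # stands in for imglist[index - 1]
try:
    response = urllib2.urlopen(img_url, timeout=30)  # timeout is set on the socket,
    data = response.read()                           # so it covers connect and reads
except (socket.timeout, urllib2.URLError) as e:
    print('Request timed out or failed: %s' % e)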
This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same. The url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist in the filesystem.
import threading
import urllib
import Queue
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
# prepare the queue
queue = Queue.Queue()
for url_and_path in url_and_path_list:
queue.put(url_and_path)
# start the requested number of download threads to download the files
threads = []
for _ in range(num_concurrent):
t = DownloadThread(queue, skip_existing)
t.daemon = True
t.start()
queue.join()
class DownloadThread(threading.Thread):
def __init__(self, queue, skip_existing):
super(DownloadThread, self).__init__()
self.queue = queue
self.skip_existing = skip_existing
def run(self):
while True:
#grabs url from queue
url, path = self.queue.get()
if self.skip_existing and exists(path):
# skip if requested
self.queue.task_done()
continue
try:
urllib.urlretrieve(url, path)
except IOError:
print "Error downloading url '%s'." % url
#signals to queue job is done
self.queue.task_done()
When you create the connection with urllib2.urlopen(), you can give it a timeout parameter.
As described in the docs:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
With this you can enforce a maximum waiting duration and catch the exception that gets raised.
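Combined with the retry the question asks about, a sketch might look like this (the timeout and retry counts are arbitrary choices):
import socket
import urllib2

def fetch_with_retry(url, timeout=30, retries=3):
    # Retry the whole download a few times; the per-attempt timeout bounds
    # how long any single blocking operation may hang.
    for attempt in range(1, retries + 1):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except (socket.timeout, urllib2.URLError) as e:
            print('Attempt %d of %d failed for %s: %s' % (attempt, retries, url, e))
    return None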
The way I crawl a huge batch of documents is by having a batch processor which crawls and dumps constant-sized chunks.
Suppose you need to crawl a known batch of, say, 100K documents. You can have some logic that generates constant-size chunks of, say, 1000 documents, each of which is downloaded by a thread pool. Once a whole chunk is crawled, you can bulk-insert it into your database and then proceed with the next 1000 documents, and so on (a sketch follows after the list below).
Advantages you get by following this approach:
You get the advantage of the thread pool speeding up your crawl rate.
It's fault tolerant in the sense that you can continue from the chunk where it last failed.
You can generate chunks on the basis of priority, i.e. important documents are crawled first. That way, if you are unable to complete the whole batch, the important documents have already been processed and the less important ones can be picked up on the next run.
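A minimal sketch of that approach (fetch_document and bulk_insert are hypothetical stand-ins for your own download and database code; concurrent.futures ships with Python 3 and is available as the futures backport for Python 2):
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # documents per chunk

def crawl_in_chunks(doc_urls, fetch_document, bulk_insert, workers=16):
    # Walk the batch in constant-size chunks; each chunk is downloaded by a
    # thread pool and then written to the database in one bulk insert, so a
    # failure costs at most one chunk of work.
    for start in range(0, len(doc_urls), CHUNK_SIZE):
        chunk = doc_urls[start:start + CHUNK_SIZE]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            documents = list(pool.map(fetch_document, chunk))
        bulk_insert(documents)
        print('finished chunk starting at document %d' % start)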
An ugly hack that seems to work.
import os, socket, threading, errno
def timeout_http_body_read(response, timeout = 60):
def murha(resp):
os.close(resp.fileno())
resp.close()
# set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
t = threading.Timer(timeout, murha, (response,))
try:
t.start()
body = response.read()
t.cancel()
except socket.error as se:
if se.errno == errno.EBADF: # murha happened
return (False, None)
raise
return (True, body)