Multithreaded file download in Python, updating the shell with download progress

In an attempt to learn multithreaded file downloading, I wrote this small script:
import urllib2
import os
import sys
import time
import threading
urls = ["http://broadcast.lds.org/churchmusic/MP3/1/2/nowords/271.mp3",
"http://s1.fans.ge/mp3/201109/08/John_Legend_So_High_Remix(fans_ge).mp3",
"http://megaboon.com/common/preview/track/786203.mp3"]
url = urls[1]
def downloadFile(url, saveTo=None):
    file_name = url.split('/')[-1]
    if not saveTo:
        saveTo = '/Users/userName/Desktop'
    try:
        u = urllib2.urlopen(url)
    except urllib2.URLError, er:
        print("%s" % er.reason)
    else:
        f = open(os.path.join(saveTo, file_name), 'wb')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            sys.stdout.write('%s\r' % status)
            time.sleep(.2)
            sys.stdout.flush()
        if file_size_dl == file_size:
            print r"Download Completed %s%% for file %s, saved to %s" % (file_size_dl * 100. / file_size, file_name, saveTo)
        f.close()
        return


def synchronusDownload():
    urls_saveTo = {urls[0]: None, urls[1]: None, urls[2]: None}
    for url, saveTo in urls_saveTo.iteritems():
        th = threading.Thread(target=downloadFile, args=(url, saveTo), name="%s_Download_Thread" % os.path.basename(url))
        th.start()

synchronusDownload()
But it seems that the second download only begins after the first thread has finished with its file, and only then moves on to the next one, as the shell output shows as well.
My plan was to start all downloads simultaneously and print the updating progress of each file as it is downloaded.
Any help will be greatly appreciated.
Thanks.

This is a common problem and here are the steps typically taken:
1.) Use Queue.Queue to create a queue of all the urls you would like to visit.
2.) Create a class that inherits from threading.Thread. It should have a run method that grabs a url from the queue and gets the data.
3.) Create a pool of threads based on your class to act as "workers".
4.) Don't exit the program until queue.join() has completed (a minimal sketch of these steps follows below).
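A minimal sketch of those four steps, assuming Python 2 (Queue and urllib2, as in your code) and reusing the urls list from the question; it reads each response in one go just to keep the example short:
import Queue
import threading
import urllib2

queue = Queue.Queue()

class DownloadWorker(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.daemon = True          # workers die with the main program

    def run(self):
        while True:
            url, save_path = self.queue.get()   # grab a url from the queue
            try:
                data = urllib2.urlopen(url).read()
                with open(save_path, 'wb') as f:
                    f.write(data)
            finally:
                self.queue.task_done()

# 1) queue up the urls, 3) start a small pool of workers, 4) wait for queue.join()
for url in urls:
    queue.put((url, url.split('/')[-1]))
for _ in range(3):
    DownloadWorker(queue).start()
queue.join()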

Your functions are actually running in parallel. You can verify this by printing at the start of the function: three outputs will be printed as soon as your program starts (see the small demonstration after the sample output below).
What's happening is that your first two files are so small that they are completely downloaded before the scheduler switches threads. Try putting bigger files in your list:
urls = [
"http://www.wswd.net/testdownloadfiles/50MB.zip",
"http://www.wswd.net/testdownloadfiles/20MB.zip",
"http://www.wswd.net/testdownloadfiles/100MB.zip",
]
Program output:
Downloading: 100MB.zip Bytes: 104857600
Downloading: 20MB.zip Bytes: 20971520
Downloading: 50MB.zip Bytes: 52428800
Download Completed 100.0% for file 20MB.zip, saved to .
Download Completed 100.0% for file 50MB.zip, saved to .
Download Completed 100.0% for file 100MB.zip, saved to .
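For example, the interleaving is easy to see with a stub in place of the real download (this is only a demonstration of the point above, not code from the question):
import threading
import time

def downloadFile(url, saveTo=None):
    # stand-in for the real download, just to show that all threads enter at once
    print "%s started %s" % (threading.current_thread().name, url)
    time.sleep(1)
    print "%s finished %s" % (threading.current_thread().name, url)

for u in ["a.mp3", "b.mp3", "c.mp3"]:
    threading.Thread(target=downloadFile, args=(u,)).start()
All three "started" lines appear immediately, before any "finished" line.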

Related

Multiprocessing not spawning all the requested processes

I'm working with OrcaFlex (FEM software for offshore analysis, though that should not be relevant here). I wrote a script to check whether the simulations I've run completed successfully (a simulation can fail by not reaching convergence). Since I'm talking about thousands of files, I tried to parallelise the checks with multiprocessing. My code follows. Sorry, I can't produce a working example for you, but I'll try to explain it in detail. I created a class derived from multiprocessing.Process and overrode run() to perform the checks on the simulation files.
Then, in __main__, I set a number of processes, split the files accordingly, and start the execution.
The problem is that the processes do not all spawn together; instead they start at what appears to be a random interval from one another. Is this how it is supposed to work, or am I missing something?
What I mean by not spawning all together is that I see:
[Info/Worker-1] child process calling self.run()
and for example:
[Info/Worker-4] child process calling self.run()
after about 10 min of the program running.
Thanks in advance for any help/suggestion.
import os
import subprocess
import glob
import multiprocessing
import logging
import sys
import OrcFxAPI as of
class Worker(multiprocessing.Process):

    myJobs = []

    def setJobs(self, jobList):
        self.myJobs = jobList

    @staticmethod
    def changedExtensionFileName(oldFileName, newExtension):
        return '.'.join((os.path.splitext(oldFileName)[0], newExtension))

    def run(self):
        failed = []
        model = of.Model(threadCount=1)
        for job in self.myJobs:
            try:
                print('%s starting' % job)
                sys.stdout.flush()
                model.LoadSimulation(job)
                if model.state == of.ModelState.SimulationStoppedUnstable:
                    newJob = job.replace('.sim', '.dat')
                    failed.append(newJob)
                    with open('Failed_Sim.txt', 'a') as f:
                        f.write(f'{newJob}\n')
                    model.LoadData(newJob)
                    model.general.ImplicitConstantTimeStep /= 2
                    model.SaveData(newJob)
                    print(f'{job} has failed, reducing time step')
            except of.DLLError as err:
                print('%s ERROR: %s' % (job, err))
                sys.stdout.flush()
                with open(self.changedExtensionFileName(job, 'FAIL'), 'w') as f:
                    f.write('%s error: %s' % (job, err))
        return


if __name__ == '__main__':
    import re

    sim_file = [f for f in os.listdir() if re.search(r'\d\d\d\d.*\.sim', f)]

    # begin multiprocessing
    multiprocessing.log_to_stderr()
    logger = multiprocessing.get_logger()
    logger.setLevel(logging.INFO)

    corecount = 14
    workers = []
    chunkSize = int(len(sim_file) / corecount)
    chunkRemainder = int(len(sim_file) % corecount)
    print('%s jobs found, dividing across %s workers - %s each, remainder %s' % (str(len(sim_file)), str(corecount), chunkSize, chunkRemainder))

    start = 0
    for coreNum in range(0, corecount):
        worker = Worker()
        workers.append(worker)
        end = start + chunkSize
        if chunkRemainder > 0:
            chunkRemainder -= 1
            end += 1
        if end > len(sim_file):
            end = len(sim_file)
        worker.setJobs(sim_file[start:end])
        worker.start()
        start = end
        if start >= len(sim_file):
            break

    for worker in workers:
        worker.join()

    print('Done...')
OK, since no one has put up their hand to answer this with a minor tweak (which I don't know how to do!), here comes the larger rework proposal...
import os
import OrcFxAPI as of

def worker(inpData):
    # The worker process
    failed1 = []
    failed2 = []
    model = of.Model(threadCount=1)  # create the model inside the worker, as in the original Worker.run()
    for job in inpData:
        # I'm not sure of the data shape of the chunks: has your original method split them into
        # coherent chunks capable of being processed independently? My step here could be wrong.
        try:
            # print('%s starting' % job)  # prints won't appear on the console from worker processes on Windows, so commented them all out
            model.LoadSimulation(job)
            if model.state == of.ModelState.SimulationStoppedUnstable:
                newJob = job.replace('.sim', '.dat')
                failed1.append(newJob)
                # I'd recommend we pass the list "failed1" back to the master and write the text file there,
                # otherwise you could have several processes updating the text file at once, leading to possible loss of data
                # with open('Failed_Sim.txt', 'a') as f:
                #     f.write(f'{newJob}\n')
                model.LoadData(newJob)
                model.general.ImplicitConstantTimeStep /= 2
                model.SaveData(newJob)
                # print(f'{job} has failed, reducing time step')
        except of.DLLError as err:
            # print('%s ERROR: %s' % (job, err))
            # sys.stdout.flush()
            # with open(changedExtensionFileName(job, 'FAIL'), 'w') as f:
            #     f.write('%s error: %s' % (job, err))
            failed2.append(job)
    # Note I've made two failed lists to pass back, one for each failure type
    return failed1, failed2


if __name__ == "__main__":
    import re
    import multiprocessing as mp

    nCPUs = mp.cpu_count()
    sim_file = [f for f in os.listdir() if re.search(r'\d\d\d\d.*\.sim', f)]

    # Make the chunks
    chunkSize = int(len(sim_file) / nCPUs)
    chunkRemainder = int(len(sim_file) % nCPUs)
    print('%s jobs found, dividing across %s workers - %s each, remainder %s' % (len(sim_file), nCPUs, chunkSize, chunkRemainder))

    chunks = []
    start = 0
    for iChunk in range(0, nCPUs):
        end = start + chunkSize
        if chunkRemainder > 0:
            chunkRemainder -= 1
            end += 1
        if end > len(sim_file):
            end = len(sim_file)
        chunks.append(sim_file[start:end])
        start = end

    # Send to workers
    pool = mp.Pool(processes=nCPUs)
    futA = []
    for iChunk in range(0, nCPUs):
        futA.append(pool.apply_async(worker, args=(chunks[iChunk],)))

    # Gather results
    if futA:
        failedDat = []
        failedSim = []
        for fut in futA:
            resA, resB = fut.get()
            failedDat.extend(resA)
            failedSim.extend(resB)
        pool.close()
        if failedDat:
            print("Following jobs failed, reducing timesteps:")
            print(failedDat)
        if failedSim:
            print("Following sims failed due to errors:")
            print(failedSim)
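As a side note (my suggestion, not part of the question), if every job can be checked independently, multiprocessing.Pool.map with a per-job function removes the manual chunking entirely. A rough sketch, reusing the of, mp, nCPUs and sim_file names from the code above:
def check_one(job):
    # hypothetical per-job variant of worker(): return the job name on failure, else None
    model = of.Model(threadCount=1)
    try:
        model.LoadSimulation(job)
        if model.state == of.ModelState.SimulationStoppedUnstable:
            return job
    except of.DLLError:
        return job
    return None

# inside the existing "if __name__ == '__main__':" block:
pool = mp.Pool(processes=nCPUs)
failures = [job for job in pool.map(check_one, sim_file) if job is not None]
pool.close()
pool.join()
Pool.map hands jobs out to the worker processes itself, so a slow chunk can no longer leave an otherwise idle core waiting.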

Why does my program stop downloading very large files before finishing the full download?

I'm using this method to download files from some HTTP links. The files generally exceed 20 gigabytes in size, and the program seems to work correctly, but then it moves on to the next function before the download has actually finished.
import urllib2
import os

def saveFile(url):
    print url
    file_name = url.split('/')[-1]

    # check if the file was already downloaded previously to save time
    if os.path.isfile(file_name):
        print("File already exists, skipping download...")
        return file_name

    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()
    return file_name
Did you check the process manager? Maybe memory usage went over its limit and an out-of-memory error occurred.
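If memory is not the cause, another thing worth checking (my suggestion, not part of the answer above) is whether the connection simply dropped: the read loop ends quietly as soon as u.read() returns an empty string, even if fewer bytes than Content-Length have arrived. A sketch of detecting that and resuming with a Range request, assuming the server supports byte ranges:
import urllib2

def resume_download(url, file_name, already_have, total_size):
    # re-open the connection asking only for the missing tail of the file
    req = urllib2.Request(url)
    req.add_header('Range', 'bytes=%d-' % already_have)
    u = urllib2.urlopen(req)
    with open(file_name, 'ab') as f:          # append to what was written so far
        while already_have < total_size:
            chunk = u.read(8192)
            if not chunk:
                break                          # dropped again; the caller can retry
            f.write(chunk)
            already_have += len(chunk)
    return already_have
Comparing file_size_dl against file_size after saveFile() returns would at least tell you whether the download really finished.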

Python: How to calculate multithreaded download speed

I wrote a multi-threaded HTTP downloader. It can now download a file faster than a single-threaded downloader, and the MD5 sum is correct. However, the speed it reports is so fast that I do not believe it is the true value.
The unit is not printed yet, but I am sure it is KB/s. Please take a look at how my code measures the speed.
    # Setup the slaver
    def _download(self):
        # Start downloading partial content while the queue is not empty
        while not self.configer.down_queue.empty():
            data_range = self.configer.down_queue.get()
            headers = {
                'Range': 'bytes={}-{}'.format(*data_range)
            }
            response = requests.get(
                self.configer.url, stream=True,
                headers=headers
            )
            start_point = data_range[0]
            for bunch in response.iter_content(self.block_size):
                _time = time.time()
                with self.file_lock:
                    with open(
                            self.configer.path, 'r+b',
                            buffering=1
                            ) as f:
                        f.seek(start_point)
                        f.write(bunch)
                        f.flush()
                start_point += self.block_size
                self.worker_com.put((
                    threading.current_thread().name,
                    int(self.block_size / (time.time() - _time))
                ))
            self.configer.down_queue.task_done()

    # speed monitor
    def speed_monitor(self):
        while len(self.thread_list) > 0:
            try:
                info = self.worker_com.get_nowait()
                self.speed[info[0]] = info[1]
            except queue.Empty:
                time.sleep(0.1)
                continue
            sys.stdout.write('\b'*64 + '{:10}'.format(self.total_speed)
                             + ' thread num ' + '{:2}'.format(self.worker_count))
            sys.stdout.flush()
If you need more information, please visit my GitHub repository. I would appreciate it if you could point out my error. Thanks.
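One way to sanity-check the figure (a sketch of an alternative measurement, not code from the repository) is to also track aggregate throughput, that is total bytes received by all threads divided by wall-clock time, and compare it against the per-block numbers the monitor sums up:
import threading
import time

class SpeedMeter(object):
    """Aggregate transfer counter shared by all download threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._bytes = 0
        self._start = time.time()

    def add(self, n):
        # called by each worker right after receiving a chunk of n bytes
        with self._lock:
            self._bytes += n

    def speed_kbps(self):
        # average throughput in KB/s since the download started
        elapsed = time.time() - self._start
        return (self._bytes / 1024.0) / elapsed if elapsed > 0 else 0.0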

how to use threads to grab multiple chunks of a file concurrently from server but write to disk atomically?

I am stuck on a difficult problem and cannot figure out which way to go; in the attempts I made over the whole day I have posted several times, and this is not a duplicate question. I need clarity on how I can use multiple threads to grab multiple chunks of a file simultaneously from a server, but write them to disk atomically by locking the file write operation to single-thread access while the next thread waits for the lock to be released.
import argparse,logging, Queue, os, requests, signal, sys, time, threading
import utils as _fdUtils
DESKTOP_PATH = os.path.expanduser("~/Desktop")
appName = 'FileDownloader'
logFile = os.path.join(DESKTOP_PATH, '%s.log' % appName)
_log = _fdUtils.fdLogger(appName, logFile, logging.DEBUG, logging.DEBUG, console_level=logging.DEBUG)
queue = Queue.Queue()
STOP_REQUEST = threading.Event()
maxSplits = threading.BoundedSemaphore(3)
threadLimiter = threading.BoundedSemaphore(5)
lock = threading.Lock()
Update 1:
def _grabAndWriteToDisk(url, saveTo, first=None, queue=None, mode='wb', irange=None):
    """ Function to download file..

        Args:
            url(str): url of file to download
            saveTo(str): path where to save file
            first(int): starting byte of the range
            queue(Queue.Queue): queue object to set status for file download
            mode(str): mode of file to be downloaded
            irange(str): range of byte to download
    """
    fileName = url.split('/')[-1]
    filePath = os.path.join(saveTo, fileName)
    fileSize = int(_fdUtils.getUrlSizeInBytes(url))
    downloadedFileSize = 0 if not first else first
    block_sz = 8192
    irange = irange if irange else '0-%s' % fileSize

    # print mode
    resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
    fileBuffer = resp.raw.read()
    with open(filePath, mode) as fd:
        downloadedFileSize += len(fileBuffer)
        fd.write(fileBuffer)
        status = r"%10d [%3.2f%%]" % (downloadedFileSize, downloadedFileSize * 100. / fileSize)
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write('%s\r' % status)
        time.sleep(.05)
        sys.stdout.flush()

    if downloadedFileSize == fileSize:
        STOP_REQUEST.set()
        if queue:
            queue.task_done()
        _log.info("Download Completed %s%% for file %s, saved to %s",
                  downloadedFileSize * 100. / fileSize, fileName, saveTo)


class ThreadedFetch(threading.Thread):
    """ docstring for ThreadedFetch
    """
    def __init__(self, queue):
        super(ThreadedFetch, self).__init__()
        self.queue = queue
        self.lock = threading.Lock()

    def run(self):
        threadLimiter.acquire()
        try:
            items = self.queue.get()
            url = items[0]
            saveTo = DESKTOP_PATH if not items[1] else items[1]
            split = items[-1]

            # grab split chunks in separate thread.
            if split > 1:
                maxSplits.acquire()
                try:
                    sizeInBytes = int(_fdUtils.getUrlSizeInBytes(url))
                    byteRanges = _fdUtils.getRange(sizeInBytes, split)
                    mode = 'wb'
                    th = threading.Thread(target=_grabAndWriteToDisk, args=(url, saveTo, first, self.queue, mode, _range))
                    _log.info("Pulling for range %s using %s", _range, th.getName())
                    th.start()
                    # _grabAndWriteToDisk(url, saveTo, first, self.queue, mode, _range)
                    mode = 'a'
                finally:
                    maxSplits.release()
            else:
                while not STOP_REQUEST.isSet():
                    self.setName("primary_%s" % url.split('/')[-1])
                    # if downloading the whole file in a single chunk there is no need
                    # to start a new thread, so download directly here.
                    _grabAndWriteToDisk(url, saveTo, 0, self.queue)
        finally:
            threadLimiter.release()


def main(appName, flag='with'):
    args = _fdUtils.getParser()
    urls_saveTo = {}
    if flag == 'with':
        _fdUtils.Watcher()
    elif flag != 'without':
        _log.info('unrecognized flag: %s', flag)
        sys.exit()

    # spawn a pool of threads, and pass them the queue instance;
    # each url will be downloaded concurrently
    for i in xrange(len(args.urls)):
        t = ThreadedFetch(queue)
        t.daemon = True
        t.start()

    split = 4
    try:
        for url in args.urls:
            # TODO: put split as value of url as tuple with saveTo
            urls_saveTo[url] = args.saveTo
        # populate queue with data
        for url, saveTo in urls_saveTo.iteritems():
            queue.put((url, saveTo, split))
        # wait on the queue until everything has been processed
        queue.join()
        _log.info('Finished all downloads.')
    except (KeyboardInterrupt, SystemExit):
        _log.critical('! Received keyboard interrupt, quitting threads.')
I expect multiple threads to grab the chunks, but in the terminal I see the same thread grabbing every range:
INFO - Pulling for range 0-25583 using Thread-1
INFO - Pulling for range 25584-51166 using Thread-1
INFO - Pulling for range 51167-76748 using Thread-1
INFO - Pulling for range 76749-102331 using Thread-1
INFO - Download Completed 100.0% for file 607800main_kepler1200_1600-1200.jpg
but what I am expecting is:
INFO - Pulling for range 0-25583 using Thread-1
INFO - Pulling for range 25584-51166 using Thread-2
INFO - Pulling for range 51167-76748 using Thread-3
INFO - Pulling for range 76749-102331 using Thread-4
Please do not mark this as a duplicate without understanding it...
Please recommend if there is a better approach to doing what I am trying to do.
Cheers :/
If you want the downloads to happen simultaneously and only the writing to the file to be atomic, then don't put the lock around downloading and writing to the file; put it around just the write part.
As every thread uses its own file object to write, I don't see the need to lock that access anyway. What you do have to make sure of is that every thread writes to the correct offset, so you need a seek() call on the file before writing the data chunk. Otherwise you would have to write the chunks in file order, which would make things more complicated.
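A minimal sketch of that idea (the names here are illustrative, not taken from the question's utils module): each thread requests its own byte range and writes it at the matching offset with seek(); no lock is needed because the ranges do not overlap:
import requests
import threading

def fetch_chunk(url, path, first, last):
    # each worker asks the server for its own byte range...
    resp = requests.get(url, headers={'Range': 'bytes=%d-%d' % (first, last)}, stream=True)
    with open(path, 'r+b') as f:          # file must already exist at full size
        f.seek(first)                      # ...and writes it at the matching offset
        for block in resp.iter_content(8192):
            f.write(block)

def download(url, path, size, splits=4):
    # pre-allocate the file so every thread can seek within it
    with open(path, 'wb') as f:
        f.truncate(size)
    step = size // splits
    threads = []
    for i in range(splits):
        first = i * step
        last = size - 1 if i == splits - 1 else first + step - 1
        t = threading.Thread(target=fetch_chunk, args=(url, path, first, last))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()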

Python: multiple files download by turn

A loop in the script downloads and saves files (with curl). But the loop iterations go by too quickly, so the download and save actions have no time to complete their work, and as a result the files come out broken.
def get_images_thread(table):

    class LoopThread(threading.Thread):

        def run(self):
            global db
            c = db.cursor()
            c.execute(""" SELECT * FROM js_stones ORDER BY stone_id LIMIT 1
                """)
            ec = EasyCurl(table)

            while(1):
                stone = c.fetchone()
                if stone == None:
                    break
                img_fname = stone[2]
                print img_fname
                url = "http://www.jstone.it/" + img_fname
                fname = url.strip("/").split("/")[-1].strip()
                ec.perform(url, filename="D:\\Var\\Python\\Jstone\\downloadeble_pictures\\"+fname,
                           progress=ec.textprogress)
This is an excerpt from the examples for the PycURL library,
# Make a queue with (url, filename) tuples
queue = Queue.Queue()
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue.queue) + 1)
    queue.put((url, filename))

# Check args
assert queue.queue, "no URLs given"
num_urls = len(queue.queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"


class WorkerThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while 1:
            try:
                url, filename = self.queue.get_nowait()
            except Queue.Empty:
                raise SystemExit
            fp = open(filename, "wb")
            curl = pycurl.Curl()
            curl.setopt(pycurl.URL, url)
            curl.setopt(pycurl.FOLLOWLOCATION, 1)
            curl.setopt(pycurl.MAXREDIRS, 5)
            curl.setopt(pycurl.CONNECTTIMEOUT, 30)
            curl.setopt(pycurl.TIMEOUT, 300)
            curl.setopt(pycurl.NOSIGNAL, 1)
            curl.setopt(pycurl.WRITEDATA, fp)
            try:
                curl.perform()
            except:
                import traceback
                traceback.print_exc(file=sys.stderr)
                sys.stderr.flush()
            curl.close()
            fp.close()
            sys.stdout.write(".")
            sys.stdout.flush()


# Start a bunch of threads
threads = []
for dummy in range(num_conn):
    t = WorkerThread(queue)
    t.start()
    threads.append(t)

# Wait for all threads to finish
for thread in threads:
    thread.join()
If you're asking what I think you're asking,
from time import sleep
sleep(1)
should "solve" (it's hacky to the max!) your problem. See the documentation for time.sleep. I would check that this really is your problem, though; it seems catastrophically unlikely that pausing for a few seconds would stop files from coming out broken. Some more detail would be nice, too.
os.waitpid()
might also help.
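A less hacky option (my suggestion, echoing the start/join pattern in the PycURL excerpt above) is to keep references to the download threads and join() them before touching the files; a sketch, assuming LoopThread from the question is in scope:
threads = [LoopThread() for _ in range(4)]   # hypothetical pool size
for t in threads:
    t.start()
for t in threads:
    t.join()    # block until every download thread has finished writing its files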
