I'm creating a Python script which accepts a path to a remote file and a number n of threads. The file's size will be divided by the number of threads; when each thread completes, I want it to append the fetched data to a local file.
How do I manage it so that the threads append to the local file in the order in which they were generated, so that the bytes don't get scrambled?
Also, what if I want to download several files simultaneously?
You could coordinate the work with locks &c, but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.
I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance, and load on the remote server, by experimenting); every worker thread waits at the same global Queue.Queue instance, call it workQ for example, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).
A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), offset from which it is requested to get data from it, number of bytes to get from it (note that this works just as well for one or multiple files ot fetch).
The main thread pushes all work requests to the workQ (just workQ.put((url, offset, numbytes)) for each request) and waits for results to come to another Queue instance, call it resultQ (each result will also be a triple: identifier of the file, starting offset, string of bytes that are the results from that file at that offset).
As each working thread satisfies the request it's doing, it puts the results into resultQ and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.
There are several ways to terminate the operation: for example, a special work request may be asking the thread receiving it to terminate -- the main thread puts on workQ just as many of those as there are working threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, having the worker threads daemonic so they just go away when the main thread terminates, and so forth).
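A minimal Python 3 sketch of that layout (queue.Queue rather than the Python 2 Queue.Queue; the URL, chunk size, output filename, and worker count are placeholders, and the server is assumed to honour HTTP Range headers):

import queue
import threading
import urllib.request

def worker(work_q, result_q):
    while True:
        wr = work_q.get()
        if wr is None:                      # sentinel work request: terminate
            return
        url, offset, numbytes = wr
        req = urllib.request.Request(url)
        req.add_header("Range", "bytes=%d-%d" % (offset, offset + numbytes - 1))
        result_q.put((url, offset, urllib.request.urlopen(req).read()))

work_q, result_q = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(work_q, result_q)) for _ in range(4)]
for t in workers:
    t.start()

chunk, total = 1 << 20, 10 * (1 << 20)      # placeholder sizes: 1 MiB chunks, 10 MiB file
requests = [("http://example.com/big.bin", off, chunk) for off in range(0, total, chunk)]
for wr in requests:
    work_q.put(wr)

with open("big.bin", "wb") as f:            # writer: seek to each offset and write in place
    for _ in requests:
        _, offset, data = result_q.get()
        f.seek(offset)
        f.write(data)

for _ in workers:
    work_q.put(None)                        # one termination request per worker
for t in workers:
    t.join()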
You need to fetch completely separate parts of the file on each thread. Calculate each chunk's start and end positions based on the number of threads; obviously the chunks must not overlap.
For example, if the target file were 3000 bytes long and you wanted to fetch it using three threads:
Thread 1: fetches bytes 1 to 1000
Thread 2: fetches bytes 1001 to 2000
Thread 3: fetches bytes 2001 to 3000
You would pre-allocate an empty file of the original size, and write back to the respective positions within the file.
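For example (a hypothetical sketch: the file is pre-sized with truncate, and each thread then writes its chunk at the right offset):

def preallocate(path, size):
    with open(path, "wb") as f:
        f.truncate(size)                 # zero-filled file of the final size

def write_chunk(path, offset, data):
    with open(path, "r+b") as f:         # r+b updates in place without truncating
        f.seek(offset)
        f.write(data)

preallocate("target.bin", 3000)
write_chunk("target.bin", 2000, b"x" * 1000)   # e.g. the third thread's chunk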
You can use a thread-safe "semaphore" (really a shared counter), like this:
import threading

class Counter:
    counter = 0
    lock = threading.Lock()

    @classmethod
    def inc(cls):
        with cls.lock:   # the increment itself is not atomic, so guard it with a lock
            cls.counter += 1
            return cls.counter
Using Counter.inc() returns an incremented number across threads, which you can use to keep track of the current block of bytes.
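For example (a hypothetical sketch that builds on the Counter class above; fetch_block is just a stub standing in for whatever actually downloads one block):

def fetch_block(n):
    print("fetching block %d" % n)   # placeholder for the real download of block n

def worker():
    block = Counter.inc()            # each thread safely claims the next block number
    fetch_block(block)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()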
That being said, there's usually no need to split a file download into several threads, because the download stream is far slower than writing to disk: each write will have finished long before the next chunk has been downloaded.
The best and least resource-hungry way is simply to have a download file descriptor linked directly to a file object on disk.
for "download several files simultaneously", I recommond this article: Practical threaded programming with Python . It provides a simultaneously download related example by combining threads with Queues, I thought it's worth a reading.
I'm using python's multiprocessing to analyse some large texts. After some days trying to figure out why my code was hanging (i.e. the processes didn't end), I was able to recreate the problem with the following simple code:
import multiprocessing as mp

for y in range(65500, 65600):
    print(y)

    def func(output):
        output.put("a" * y)

    if __name__ == "__main__":
        output = mp.Queue()
        process = mp.Process(target=func, args=(output,))
        process.start()
        process.join()
As you can see, if the item to put in the queue gets too large, the process just hangs.
It doesn't freeze, if I write more code after output.put() it will run, but still, the process never stops.
This starts happening when the string gets to 65500 chars, depending on your interpreter it may vary.
I was aware that mp.Queue has a maxsize argument, but after some searching I found out that it refers to the Queue's size in number of items, not the size of the items themselves.
Is there a way around this?
The data I need to put inside the Queue in my original code is very very large...
Your queue fills up with no consumer to empty it.
From the definition of Queue.put:
If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available.
Assuming there is no deadlock possible between producer and consumer (and assuming your original code does have a consumer, since your sample doesn't), eventually the producers should be unblocked and terminate. Check the code of your consumer (or add it to the question, so we can have a look).
Update
"This is not the problem, because the queue has not been given a maxsize, so put should succeed until you run out of memory."
This is not the behavior of Queue. As elaborated in this ticket, the part blocking here is not the queue itself, but the underlying pipe. From the linked resource (inserts between "[]" are mine):
A queue works like this:
- when you call queue.put(data), the data is added to a deque, which can grow and shrink forever
- then a thread pops elements from the deque, and sends them so that the other process can receive them through a pipe or a Unix socket (created via socketpair). But, and that's the important point, both pipes and Unix sockets have a limited capacity (used to be 4k - pagesize - on older Linux kernels for pipes, now it's 64k, and between 64k-120k for Unix sockets, depending on tunable sysctls).
- when you do queue.get(), you just do a read on the pipe/socket
[..] when size [becomes too big] the writing thread blocks on the write syscall.
And since a join is performed before dequeing the item [note: that's your process.join], you just deadlock, since the join waits for the sending thread to complete, and the write can't complete since the pipe/socket is full!
If you dequeue the item before waiting for the submitter process, everything works fine.
Update 2
"I understand. But I don't actually have a consumer (if it is what I'm thinking it is); I will only get the results from the queue when the process has finished putting them into the queue."
Yeah, this is the problem. The multiprocessing.Queue is not a storage container. You should use it exclusively for passing data between "producers" (the processes that generate the data that enters the queue) and "consumers" (the processes that "use" that data). As you now know, leaving the data there is a bad idea.
"How can I get an item from the queue if I cannot even put it there first?"
put and get hide away the work of squeezing the data through the pipe in pieces and reassembling it, so you only need to set up a loop in your "main" process to get items out of the queue and, for example, append them to a list. The list is in the memory space of the main process and does not clog the pipe.
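A minimal sketch of that fix, reusing the hanging example from the question (the only change is that get is called before join, so the feeder thread can flush the pipe):

import multiprocessing as mp

def func(output):
    output.put("a" * 70000)        # large enough to fill the underlying pipe buffer

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=func, args=(output,))
    process.start()
    results = [output.get()]       # drain the queue first...
    process.join()                 # ...then join: no deadlock
    print(len(results[0]))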
I'm trying to do something in Python 2.7, and I can't quite figure it out.
What I want is to carry out two sets of actions simultaneously, and in addition there is some need for the two threads to communicate with each other.
More specifically: I want to send a series of HTTP requests, and at the same time (in parallel) send a similar series of HTTP requests. This way I don't have to wait for a (potentially delayed) response, because the other series can just continue on.
The thing is, the number of requests per second cannot exceed a certain value; let's say one request per second. So I need to make sure that the combined request-frequency of the two parallel threads does not exceed this value.
Any help would be appreciated. Apologies if the solution is obvious, I'm still pretty new to python.
Raymond Hettinger gave a really good keynote talk about the proper way to think about concurrency and multithreading here: https://www.youtube.com/watch?v=Bv25Dwe84g0&t=2
And his notes can be found here: https://dl.dropboxusercontent.com/u/3967849/pyru/_build/html/index.html
What I recommend, which is from the talk, is to use an atomic message queue to "talk" between the threads. However, the talk and Raymond's work are based on Python 3.5 or 3.6. The queue module (https://docs.python.org/3/library/queue.html) will help you significantly.
A common way to enforce your rate-limiting requirement is to use a Token Bucket approach.
Specifically in Python, you'd have a queue shared between the threads, and a 3rd thread (perhaps the original initiating thread) which puts one token object into the queue per second. (That is, it's a simple loop: wait 1 second, put an object, repeat.)
The two worker threads each try to take an object from the queue, and for each object they take, they issue one request. Voila! The workers can't issue more requests, in total, than tokens made available (which equals the number of seconds that have passed). Even if one thread is stuck on a long-running request, the other can just be the one to repeatedly obtain a token. It's generalizable to N threads: they're all just competing to get the next allow-one-request token from the shared queue.
If many threads are stuck on long-running requests, multiple tokens collect in the queue, allowing a burst of catch-up requests – but still only reaching the overall target average-number-of-requests over a longer period. (By adjusting the maximum size of the queue, or whether it is preloaded with a small surplus of tokens, the exact enforcement of the limit can be adjusted – for example, so that it converges to the correct limit within 10 seconds, or 30, or 3600, whatever.)
The shared queue can also be the mechanism that is used to cleanly tell the worker threads to quit. That is, instead of pushing-into-the-queue whatever signalling-object means, "do one request", an external control thread can push-into-the-queue an object meaning, "finish and exit". Pushing in N such objects will cause the N worker threads to each get the command.
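A rough sketch of that token-bucket layout (the names, URLs, and the one-request-per-second rate are illustrative, not from the question):

import queue
import threading
import time

tokens = queue.Queue(maxsize=5)          # small maxsize bounds how large a catch-up burst can get

def dripper(rate_per_sec=1.0):
    while True:
        time.sleep(1.0 / rate_per_sec)
        try:
            tokens.put_nowait(object())  # one allow-one-request token per interval
        except queue.Full:
            pass                         # bucket already full: drop the token

def worker(name, urls):
    for url in urls:
        tokens.get()                     # blocks until a token is available
        print(name + " GET " + url)      # placeholder for the real HTTP request

threading.Thread(target=dripper, daemon=True).start()
a = threading.Thread(target=worker, args=("A", ["http://example.com/1"] * 3))
b = threading.Thread(target=worker, args=("B", ["http://example.com/2"] * 3))
a.start(); b.start()
a.join(); b.join()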
Seems like you need a "semaphore". From the python2.7 docs:
A semaphore manages an internal counter which is decremented by each acquire() call and incremented by each release() call. The counter can never go below zero; when acquire() finds that it is zero, it blocks, waiting until some other thread calls release().
So this semaphore of yours is basically a counter of calls that resets to the allowed rate every second, shared by all the HTTP threads. If it reaches 0, no thread can make a request until another thread releases the connection or a second passes and the counter is refilled.
You can set up your script with x HTTP request workers and one HTTP call-rate resetter worker (a rough sketch follows the list):
the resetter destroys and regenerates the semaphore every second
each worker calls acquire() before every HTTP request is made.
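Here is a rough sketch of that layout, runnable on both Python 2.7 and 3 (the helper names, URLs, and MAX_PER_SECOND are mine; the worker polls with a non-blocking acquire so it picks up the freshly regenerated semaphore):

import threading
import time

MAX_PER_SECOND = 1
rate_limit = threading.Semaphore(MAX_PER_SECOND)

def resetter():
    global rate_limit
    while True:
        time.sleep(1)
        rate_limit = threading.Semaphore(MAX_PER_SECOND)   # regenerate with a full counter

def http_worker(urls):
    for url in urls:
        while not rate_limit.acquire(False):               # this second's budget is spent
            time.sleep(0.05)                               # wait for the resetter
        print("GET " + url)                                # placeholder for the real request

reset_thread = threading.Thread(target=resetter)
reset_thread.daemon = True                                 # don't keep the process alive
reset_thread.start()

workers = [threading.Thread(target=http_worker, args=(["http://example.com/a", "http://example.com/b"],)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()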
If you are using Python2.7 and threading you can find all the docs here:
https://docs.python.org/2/library/threading.html.
And a nice tutorial here:
https://pymotw.com/2/threading/
I want to use the Python 2.7 multiprocessing package to operate on an endless stream of data. A subprocess will constantly receive data via TCP/IP or UDP packets and immediately place the data in a multiprocessing.Queue. However, at certain intervals, say, every 500ms, I only want to operate on a user specified slice of this data. Let's say, the last 200 data packets.
I know I can put() and get() on the Queue, but how can I create that slice of data (a) without backing up the queue, while (b) keeping things thread-safe?
I'm thinking I have to constantly get() from the Queue with another subprocess to prevent the Queue from getting full. Then I have to store the data in another data structure (such as a list) to build the user specified slice. But the data structure would probably not be thread safe, so it does not sound like a good solution.
Is there some programming paradigm that achieves what I am trying to do easily? I looked at the multiprocessing.Manager class, but wasn't sure it would work.
You can do this as follows:
Use an instance of the threading.Lock class. Call method acquire to claim exclusive access to your queue from a certain thread and call release to grant other threads access.
Since you want to keep gathering your input, copying the whole queue would probably be too expensive. Probably the fastest way is to first collect data in one queue, then swap it for another and let a different thread read the data from the old one into your application. Protect the swap with a Lock instance, so you can be sure that whenever the writer acquires the lock, the current 'listener' queue is ready to receive data.
If only recent data is important, use two circular buffers instead of queues, allowing old data to be overwritten.
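A minimal sketch of the swap idea, shown with threads and a deque as the circular buffer (the 200-packet size and helper names are illustrative):

import threading
from collections import deque

lock = threading.Lock()
active = deque(maxlen=200)               # circular buffer: old packets fall off the end

def on_packet(packet):                   # called by the listener for every incoming packet
    with lock:
        active.append(packet)

def take_snapshot():
    global active
    with lock:                           # swap in an empty buffer under the lock
        snapshot, active = active, deque(maxlen=200)
    return snapshot                      # the slice of recent packets to analyse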
I am studying this code from GitHub about distributed processing. I would like to thank eliben for this nice post. I have read his explanations, but there are some points that are still unclear to me. As far as I understand, the code distributes tasks to multiple machines/clients. My questions are:
The most basic of my questions is where the distribution of the work to different machines is happening?
Why there is an if else statement in the main function?
Let me start this question in a more general way. I thought that we usually start a Process on a specific chunk (an independent part of the data) rather than passing all the chunks at once, like this:
chunksize = int(math.ceil(len(HugeList) / float(nprocs)))
for i in range(nprocs):
    p = Process(
        target=myWorker,  # This is my worker
        args=(HugeList[chunksize * i:chunksize * (i + 1)],
              HUGEQ))
    processes.append(p)
    p.start()
In this simple case we have nprocs processes. Each process initiates an instance of the myWorker function that works on the specified chunk.
My question here is:
How many threads do we have for each process that work in each chunk?
Looking now into the GitHub code, I am trying to understand mp_factorizer. More specifically, in this function we do not have chunks but a huge queue (shared_job_q). This huge queue consists of sub-lists of size 43 at most. This queue is passed into factorizer_worker. There, via get, we obtain those sub-lists and pass them into the serial worker. I understand that we need this queue to share data between clients.
My questions here are:
Do we call an instance of the factorizer_worker function for each of the nprocs(=8) processes?
Which part of the data does each process work on? (Generally, we have 8 processes and 43 chunks.)
How many threads exist for each process?
Is the get function called from each process thread?
Thanks for your time.
The distribution to multiple machines only happens if you actually run the script on multiple machines. The first time you run the script (without the --client option), it starts the Manager server on a specific IP/port, which hosts the shared job/result queues. In addition to starting the Manager server, runserver will also act as a worker, by calling mp_factorizer. It is additionally responsible for collecting the results from the result queue and processing them. You could run this script by itself and get a complete result.
However, you can also distribute the factorization work to other machines, by running the script on other machines using the --client option. That will call runclient, which will connect to the existing Manager server you started with the initial run of the script. That means that the clients are accessing the same shared queues runserver is using, so they can all pull work from and put results to the same queues.
The above should cover questions 1 and 2.
I'm not exactly sure what you're asking in question 3. I think you're wondering why we don't pass a chunk of the list to each worker explicitly (as in the example you included), rather than putting all the chunks into a queue. The answer is that the runserver method doesn't know how many workers there are actually going to be. It knows that it's going to start 8 workers itself. However, it doesn't want to split the HugeList into eight chunks and send them to the 8 processes it's creating, because it wants to support remote clients connecting to the Manager and doing work, too. So instead, it picks an arbitrary size for each chunk (43) and divides the list into as many chunks of that size as it takes to consume the entire HugeList, and puts them in a Queue. Here's the code in runserver that does that:
chunksize = 43
for i in range(0, len(nums), chunksize):
    #print 'putting chunk %s:%s in job Q' % (i, i + chunksize)
    shared_job_q.put(nums[i:i + chunksize])  # adds a chunk of up to 43 items to the shared queue
That way, as many workers as you want can connect to the Manager server, grab a chunk from shared_job_q, process it, and return a result.
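The worker side then boils down to a loop like this (a simplified Python 3-style sketch, not eliben's exact code; factorize_naive is the serial factorizing function from the original script):

import queue

def factorizer_worker(job_q, result_q):
    while True:
        try:
            chunk = job_q.get_nowait()        # one sub-list of up to 43 numbers
        except queue.Empty:
            return                            # job queue drained: this worker is done
        result_q.put({n: factorize_naive(n) for n in chunk})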
Do we call an instance of the factorizer_worker function for each of the nprocs(=8) processes?
Yes
Which part of the data does each process work on? (Generally, we have 8 processes and 43 chunks.)
We don't have 43 chunks. We have X number of chunks, each of size 43. Each worker process just grabs chunks off the queue and processes them. Which part it gets is arbitrary and depends on how many workers there are and how fast each is going.
How many threads exist for each process?
One. If you mean how many worker processes exist for each instance of the script, there are 8 in the server process, and 4 in each client process.
Is the get function called from each process thread?
Not sure what you mean by this.
I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull from that line from the Queue, do their thing, then put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!
If I were you, I would separate the tasks of dividing your file into tractable chunks and the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, once for each chunk. Afterwards, combine the output back into one file. If you do it this way, no queue is needed and you will save yourself some work.
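For example, a hedged sketch of the splitting step (streamed, so the multi-gigabyte input is never loaded into memory at once; the chunk-file naming scheme is just illustrative):

import itertools

def split_file(path, lines_per_chunk):
    with open(path) as f:
        for i in itertools.count():
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            with open("%s.%d" % (path, i), "w") as out:   # e.g. input.txt.0, input.txt.1, ...
                out.writelines(chunk)

split_file("input.txt", 100000)    # then run the serial script once per chunk file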
You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/