Python multiprocessing question

Trying to think of the best way to code 2 processes that have to run in parallel. I'm not even sure if multiprocessing is the preferred module.
I am generating a lot of data over a long period of time with a dataCollector, but I would like to check the data periodically with a dataChecker while the dataCollector keeps running. In my mind there are 2 significant times to consider: one, the time at which the dataCollector dumps a file and begins writing another, which is also the time at which the dataChecker will start analyzing the dumped file; and two, the time at which the dataChecker finishes and begins waiting for the dataCollector again.
Can someone suggest a general outline with the multiprocessing module? Should I be using a different module? Thanks

Why would you use any module at all? This is simple to do by having two separate processes that start at the same time. The dataChecker would list all files in a directory, count them, and sleep for a short time (several seconds or more). Then it would do it again and if the number of files changes, it opens the new ones, reads them and processes them.
The synchronisation of the two processes would be done entirely through mailboxes, implemented as a directory with files in it. A message counts as received only once the dataCollector has finished a file and started writing a new one.
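A minimal sketch of that polling loop, assuming the dataCollector drops finished dumps into a "dumps" directory whose file names sort in creation order, and where analyze() stands in for your own checking code:

import os
import time

def data_checker(dump_dir="dumps", poll_seconds=10):
    seen = set()
    while True:
        files = sorted(os.listdir(dump_dir))
        # treat every file except the newest as complete: the collector
        # may still be writing the most recent one
        finished = [name for name in files[:-1] if name not in seen]
        for name in finished:
            analyze(os.path.join(dump_dir, name))   # your own checking routine
            seen.add(name)
        time.sleep(poll_seconds)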

Related

In Python, should I use threading for this?

I'm working on a small program that collects certain data and processes it. Currently, the program runs constantly on my server, storing the data to the disk. Every once in a while I run another program that reads the stored data, processes it, sorts it, saves it to a new location, and clears the old data files.
I've never learned about threads, but it SOUNDS like this is a good place to use them? If threading works the way I think it does, I could set up a queue to hold the data, and have a separate thread that could pull data from the queue and process it as it's ready. If the queue is full, thread1 could sleep for a bit. If it's empty, thread2 could sleep for a bit.
That would reduce disk writing, get rid of disk reading, and make the data collection run side-by-side with the data processing to save time.
Is any of this accurate? I'm a senior CS student and threads have never once come up (Surely that's a little odd?). I would appreciate any tips/knowledge/advice with using threads, as well as if this is the correct solution to my "problem".
Thanks!
That does sound like a situation where some form of parallelism might be useful. However, since this is Python, you might not want to actually use threads. Python, in the standard implementation, has something called the Global Interpreter Lock. Effectively, to allow the garbage collector to work, only one thread of a Python program can actually be running Python code at any time (modules written directly in C, or external operations such as disk IO or database queries, are not "running Python code" for this purpose, though you will have called them from Python).
Because of this, threading in Python is generally only a good idea if your Python code spends significant amounts of time waiting on responses from non-Python parts of the program, or external sources. If the data collection or processing is being done outside Python (collecting from a database or website, processing in numpy, etc.), that might be reasonable. If your code isn't in this situation often enough, your program ends up wasting more time switching between threads than it gains (because if two threads are both in Python code it still only runs one at a time).
If not, you should try the multiprocessing module instead. This is also a generally safer model, as the only things that can be shared between processes in multiprocessing are the things you explicitly share (while threads share all state, potentially allowing one thread to break another because you forgot to lock something).
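For reference, a minimal sketch of that multiprocessing setup, where collect_item() and process_item() stand in for your own collection and processing steps:

from multiprocessing import Process, Queue

def collector(queue):
    while True:
        queue.put(collect_item())      # stand-in for your data-collection step

def processor(queue):
    while True:
        item = queue.get()             # blocks until the collector puts something
        process_item(item)             # stand-in for your processing/sorting/saving step

if __name__ == "__main__":
    q = Queue()
    procs = [Process(target=collector, args=(q,)),
             Process(target=processor, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()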
Alternately, you might use subprocess. Effectively, this would be having your first program intermittently restart the second one every time it finishes a batch of data.

Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads the result to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads, in the hope of utilizing DynamoDB's write capacity more efficiently. However, the multithreaded version is slower by about 50%. Since the code is very long I have included pseudocode.
from threading import Thread

NUM_THREADS = 4
lines_per_thread = total_lines // NUM_THREADS   # "enough lines for a single thread"

threads = []
batch = []
for line in the_file:                            # the formatted records read off S3
    batch.append(line)
    if len(batch) == lines_per_thread:
        t = Thread(target=upload_batch, args=(batch,))   # upload_batch does the batch write
        t.start()
        threads.append(t)
        batch = []
# (any leftover lines would need one final thread)

for t in threads:
    t.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both threads (threading.Thread) and processes (multiprocessing.Process).
Again I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second would be different with multiple threads.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
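A rough sketch of that kind of back-off and retry around the batch write, assuming boto3 and a request_items dict already shaped for batch_write_item:

import time
import boto3

def batch_write_with_backoff(request_items, max_retries=5):
    client = boto3.client("dynamodb")
    for attempt in range(max_retries):
        response = client.batch_write_item(RequestItems=request_items)
        unprocessed = response.get("UnprocessedItems", {})
        if not unprocessed:
            return
        # throttled items come back as UnprocessedItems; wait, then resend only those
        time.sleep(0.1 * (2 ** attempt))
        request_items = unprocessed
    raise RuntimeError("items still unprocessed after retries")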
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only once the file reaches a certain size. I was originally working on a file of about 1/2 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop, I have good experience with Python and DynamoDB along with using Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about performance. If you haven't already, use multiprocessing pools, and use map or imap depending on your use case if you need to communicate any data back to the main process. Using a pool is the darn simplest way to run multiple processes in Python. If you need your application to run faster as a priority, you may want to look into golang's concurrency, and you could always build that code into a binary to use from within Python. A rough sketch of the pool approach is below. Cheers.
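Here records and upload_batch() stand in for your formatted items and your existing batch-write code (worth verifying that Pool works inside your Lambda runtime, since Lambda's multiprocessing support has historically been limited):

from multiprocessing import Pool

if __name__ == "__main__":
    # DynamoDB batch writes max out at 25 items per request
    batches = [records[i:i + 25] for i in range(0, len(records), 25)]
    with Pool(processes=4) as pool:
        # map blocks until all batches are done; imap yields results as they finish,
        # which is handy if you need data back in the parent process
        results = pool.map(upload_batch, batches)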

Python threading or multiprocessing for my 'tool'

I have created a script that :
Imports a list of IPs from a .txt file (around 5K)
Connects to a REST API and performs a query based on the IP (web logs for each IP)
Data is returned from the API and some calculations are done on the data
Results of the calculations are written to a .csv
At the moment it's really slow, as it takes one IP at a time, does everything, and then goes to the next IP.
I may be wrong, but from my understanding, with threading or multiprocessing I could have 3-4 threads each doing an IP, which would increase the speed of the tool by a huge margin. Is my understanding correct, and if it is, should I be looking at threading or multiprocessing for my task?
Any help would be amazing.
Random info: running Python 2.7.5, Win7 with plenty of resources.
multiprocessing is definitely the way to go forward. You could start a process that reads the IPs and puts them in a multiprocessing.Queue and then make a few processes (depending on available resources) that read from this queue, connect to the API and make the requests. These requests will then be made in parallel, and if the API can handle them, your program should finish faster. If the calculations are complex and time consuming, the output from the API can be put into another Queue, from which other processes you start can read it, do the calculations and store the results. You may have to start a collector process to collect the final outputs.
You can find some sample code for such problems in this stackoverflow question. If you require further explanation or sample code, let me know in comments.
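A rough sketch of the layout described above, where query_api() and calculate() stand in for your REST query and your calculations:

from multiprocessing import Process, Queue

NUM_WORKERS = 4
SENTINEL = None

def reader(path, ip_queue):
    with open(path) as f:
        for line in f:
            ip_queue.put(line.strip())
    for _ in range(NUM_WORKERS):
        ip_queue.put(SENTINEL)                 # one "no more work" marker per worker

def worker(ip_queue, result_queue):
    for ip in iter(ip_queue.get, SENTINEL):
        data = query_api(ip)                   # stand-in for your REST query
        result_queue.put(calculate(data))      # stand-in for your calculations
    result_queue.put(SENTINEL)                 # tell the collector this worker is done

if __name__ == "__main__":
    ips, results = Queue(), Queue()
    Process(target=reader, args=("ips.txt", ips)).start()
    for _ in range(NUM_WORKERS):
        Process(target=worker, args=(ips, results)).start()
    finished = 0
    with open("results.csv", "w") as out:      # the collector role, done in the main process
        while finished < NUM_WORKERS:
            item = results.get()
            if item is SENTINEL:
                finished += 1
            else:
                out.write(str(item) + "\n")    # format your CSV row however you need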
With multiprocessing, a primitive way to do this would be to chunk the file into 5 equal pieces, give each piece to one of 5 different processes that write their results to 5 different files, and merge the results when all processes are done (see the sketch at the end of this answer).
You can have the same logic with Python threads without much complication. And it probably won't make any difference, since the bottleneck is probably the API. So in the end it does not really matter which approach you choose here.
There are two things to consider though:
Using threads, you are not really using multiple CPUs, hence you have "wasted resources".
Using multiprocessing will use multiple processors, but it is heavier on start-up ... So you will benefit from never stopping the script and keeping the processes alive if the script needs to run very often.
Since the information you gave about the scenario where you use this script (or better say program) is limited, it's really hard to say which is the better approach.
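A minimal sketch of the chunk-and-merge variant mentioned at the top of this answer, where process_ips() stands in for the code that queries the API and writes one CSV per chunk:

from multiprocessing import Process

NUM_CHUNKS = 5

def run_chunk(index, ips):
    process_ips(ips, "results_%d.csv" % index)   # stand-in: queries the API, writes one CSV

if __name__ == "__main__":
    with open("ips.txt") as f:
        ips = [line.strip() for line in f]
    size = (len(ips) + NUM_CHUNKS - 1) // NUM_CHUNKS
    procs = [Process(target=run_chunk, args=(i, ips[i * size:(i + 1) * size]))
             for i in range(NUM_CHUNKS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # afterwards, concatenate results_0.csv .. results_4.csv into one file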

What's the Best Way to Schedule and Manage Multiple Processes in Python 3

I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull from that line from the Queue, do their thing, then put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!
If I were you, I would separate the tasks of dividing your file into tractable chunks and the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, once for each chunk. Afterwards, combine the output back into one file. If you do it this way, no queue is needed and you will save yourself some work.
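A rough sketch of that idea, assuming the input has already been split into chunk_*.txt files (for example with the Unix split command) and that manipulate.py is your existing serial script:

import glob
import subprocess

chunks = sorted(glob.glob("chunk_*.txt"))
# one copy of the serial script per chunk, all running at once
procs = [subprocess.Popen(["python", "manipulate.py", c, c + ".out"]) for c in chunks]
for p in procs:
    p.wait()

# combine the per-chunk outputs back into one file
with open("combined.out", "w") as out:
    for c in chunks:
        with open(c + ".out") as part:
            out.write(part.read())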
You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or, should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches but the overhead is immense when distributing the data line by line. I've settled on a lightweight pipe-filters design by using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like to have a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: Processing time per line varies. Some problems are fast and barely more than I/O bound, some are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall-clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O-bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or with a queue in multiprocessing, it is always over 100% slower.
One of the best architectures is already part of Linux OS's. No special libraries required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes and doing the minimum filtering required to deal the lines to the appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read and write from stdin.
You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.
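A rough sketch of that fan-out, where worker.py stands in for a filter that reads lines from stdin and writes its own output file:

import subprocess

NUM_WORKERS = 4

# each worker reads lines from stdin; the OS pipe is the "queue"
workers = [subprocess.Popen(["python", "worker.py", "out%d.txt" % i], stdin=subprocess.PIPE)
           for i in range(NUM_WORKERS)]

with open("big_input.txt", "rb") as f:
    for i, line in enumerate(f):
        workers[i % NUM_WORKERS].stdin.write(line)   # minimal "dealing" of lines

for w in workers:
    w.stdin.close()   # EOF tells the worker it is done
    w.wait()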
One strategy is to assign each worker an offset, so if you have eight worker processes you assign them numbers 0 to 7. Worker number 0 reads the first record, processes it, then skips 7 and goes on to process the 8th record, and so on; worker number 1 reads the second record, then skips 7 and processes the 9th record, and so on.
There are a number of advantages to this scheme. It doesn't matter how big the file is, the work is always divided evenly, processes on the same machine will process at roughly the same rate and use the same buffer areas, so you don't incur any excessive I/O overhead. As long as the file hasn't been updated, you can rerun individual workers to recover from failures.
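A minimal sketch of the offset scheme, where process_record() stands in for the per-line work and records are assumed to be lines:

from itertools import islice
from multiprocessing import Process

NUM_WORKERS = 8

def worker(path, worker_id):
    with open(path) as f:
        # take line number worker_id, then every NUM_WORKERS-th line after it
        for line in islice(f, worker_id, None, NUM_WORKERS):
            process_record(line)            # stand-in for your per-line work

if __name__ == "__main__":
    procs = [Process(target=worker, args=("data.txt", i)) for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()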
You don't mention how you are processing the lines; possibly the most important piece of info.
Is each line independent? Is the calculation dependent on one line coming before the next? Must they be processed in blocks? How long does the processing for each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing filesize by count of threads? Or does it grow as you process it?
If the lines are independent and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independently open and seek into the file, and then you must simply coordinate their results, perhaps by waiting for N results to come back into a queue.
If the lines are not independent, the answer will depend highly on the structure of the file.
I know you specifically asked about Python, but I will encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the Map and Reduce algorithm which was specifically designed to address this kind of problem.
Good luck
It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: Poster added that he wants to split on new lines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first new line. That's your starting point for each process. You don't need to split the file to do this, just seek to the right location in the large file in each process and start reading from there.
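A rough sketch of finding those starting points, so each worker can seek straight to its own region:

import os

def chunk_offsets(path, num_chunks):
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(size * i // num_chunks)
            f.readline()                 # skip the partial line; start at the next full one
            offsets.append(f.tell())
    return offsets

# each worker then seeks to its offset and reads lines until it passes the next offset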
Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)
If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).
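A minimal sketch of that batching, where process_line() stands in for the per-line work:

from itertools import islice
from multiprocessing import Queue

BATCH_SIZE = 1000

def read_batches(path, queue, num_workers):
    with open(path) as f:
        while True:
            batch = list(islice(f, BATCH_SIZE))   # hand out thousands of lines at a time
            if not batch:
                break
            queue.put(batch)
    for _ in range(num_workers):
        queue.put(None)                           # one sentinel per worker

def worker(queue):
    for batch in iter(queue.get, None):
        for line in batch:
            process_line(line)                    # stand-in for the per-line work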
