beginner question about python multiprocessing?

beginner question about python multiprocessing? - python

I have a number of records in the database I want to process. Basically, I want to run several regex substitution over tokens of the text string rows and at the end, and write them back to the database.
I wish to know whether does multiprocessing speeds up the time required to do such tasks.
I did a
multiprocessing.cpu_count
and it returns 8. I have tried something like
process = []
for i in range(4):
if i == 3:
limit = resultsSize - (3 * division)
else:
limit = division
#limit and offset indicates the subset of records the function would fetch in the db
p = Process(target=sub_table.processR,args=(limit,offset,i,))
p.start()
process.append(p)
offset += division + 1
for po in process:
po.join()
but apparently, the time taken is higher than the time required to run a single thread. Why is this so? Can someone please enlighten is this a suitable case or what am i doing wrong here?

Why is this so?
Can someone please enlighten in what cases does multiprocessing gives better performances?
Here's one trick.
Multiprocessing only helps when your bottleneck is a resource that's not shared.
A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.
To find a non-shared resource, you must have independent objects. Like a list that's already in memory.
If you want to work from a database, you need to get 8 things started which then do no more database work. So, a central query that distributes work to separate processors can sometimes be beneficial.
Or 8 different files. Note that the file system -- as a whole -- is a shared resource and some kinds of file access are involve sharing something like a disk drive or a directory.
Or a pipeline of 8 smaller steps. The standard unix pipeline trick query | process1 | process2 | process3 >file works better than almost anything else because each stage in the pipeline is completely independent.
Here's the other trick.
Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
A question like "Can someone please enlighten in what cases does multiprocessing gives better performances?" doesn't have a simple answer.
In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.

Here are a couple of questions:
In your processR function, does it slurp a large number of records from the database at one time, or is it fetching 1 row at a time? (Each row fetch will be very costly, performance wise.)
It may not work for your specific application, but since you are processing "everything", using database will likely be slower than a flat file. Databases are optimised for logical queries, not seqential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
Hope this helps.

In general, multicpu or multicore processing help most when your problem is CPU bound (i.e., spends most of its time with the CPU running as fast as it can).
From your description, you have an IO bound problem: It takes forever to get data from disk to the CPU (which is idle) and then the CPU operation is very fast (because it is so simple).
Thus, accelerating the CPU operation does not make a very big difference overall.

Related

Need help minimizing the loss of time for data collection

I am trying to come up with the best way to minimize the loss of time in a data harvesting application I am building. Here are some of the restraints/factors:
I can only query for data every 12 seconds on a specific channel
I can connect to as many channels simultaneously.
I want to keep the number of channels in use to a minimum
With these factors in mind, I have thought of a solution, but would like for more input.
I have decided to in a way load balance this collection of data. My thoughts are this:
Main Program utilizes m processes (for now I am thinking 4).
Each process uses n threads, where each thread is listening on a channel.(for now I am thinking 12).
There is a variable thread_start_time_factor = 12 seconds / n threads
There is a variable process_start_time_factor = thread_start_time_factor / m processes
Each thread query's data every 12 seconds, however threads start consecutively after one another based on the thread_start_time_factor. So if I am using 12 threads, thread 1 starts, (1 second pause), thread 2 starts, ... This way data collection is now happening every 1 second.
Each process then starts one after the other based on the
process_start_time_factor
In theory, data collection SHOULD be happening every process_start_time_factor If going with the configuration above, the process_start_time_factor should be .250 seconds. (If my logic is wrong here, please let me know).
Now here is my question. Is this a good way to do this? My thoughts for using multiple processes is to essentially capture data whenever the other processes are not. The program will be written in Python (Not that it matters). Has anyone had experience with (weird) data collection restrictions like this where they have to think outside the box? Thanks to all of those who reply in advance. I am for sure open to other solutions.

given that you're using proxies, not linked to the site, and are being somewhat obscure about the question, suggests it's bordering on illegal
that said, some numbers you've not given are how long each request takes (e.g. TTFB, total duration, total data transferred) and what it takes to process the responses.
assuming you're not doing much processing on ingress, then I'd just go with an asyncio (i.e. no process/thread parallelism) approach as it's much easier to get the coordination straight. multithreading/process coordination is much awkward to reason about
you should be able to saturate a 1GB connection with HTTP requests from a single thread, maybe just using multiple processes to do post-processing so that doesn't get in the way

How to manage program turnaround time?

I have just written a script that is intended to be run 24/7 to update some files. However, if it takes 3 minutes to update one file, then it would take 300 minutes to update 100 files.
Is it possible to run n instances of the script to manage n separate files to speed up the turnaround time?

Yes it is possible. Use the multiprocessing module to start several concurrent processes. This has the advantage that you do not run into problems because of the Global Interpreter Lock and threads as is explained in the manual page. The manual page includes all the examples you will need to make your script execute in parallel. Of course this works best if the processes do not have to interact, which your example suggests.

I suggest you first find out if there is any way to reduce the 3 minutes in a single thread.
The method I use to discover speedup opportunities is demonstrated here.
That will also tell you if you are purely I/O bound.
If you are completely I/O bound, and all files are on a single disk, parallelism won't help.
In that case, possibly storing the files on a solid-state drive would help.
On the other hand, if you are CPU bound, parallelism will help, as #hochl said.
Regardless, find the speedup opportunities and fix them.
I've never seen any good-size program that didn't have one or several of them.
That will give you one speedup factor, and parallelism will give you another, and the total speedup will be the product of those two factors.

Python, solr and massive amounts of queries: need some suggestions

i'm facing a design problem within my project.
PROBLEM
i need to query solr with all the possible combinations (more or less 20 millions) of some parameters extracted from our lists, to test wether they give at least 1 result. in the case they don't, that combination is inserted into a blacklist (used for statistical analysis and sitemap creation)
HOW I'M DOING IT NOW
nested for loops to combine parameters (extracted from python lists) and pass them to a method (the same i use in production environment to query the db within the website) that tests for 0-results. if it's 0, there's a method inserting inside the blacklist
no threading involved
HOW I'D LIKE TO TO THIS
i'd like to put all the combinations inside a queue and let a thread object pull them, query and insert, for better performances
WHAT PROBLEMS I'M EXPERIENCING
slowliness: being single threaded, it now takes a lot to complete (when and if it completes)
connection reset by peer[104] : it's an error throwed by solr after a while it's been queried (i increased the pool size, but nothing changes) this is the most recurrent (and annoying) error, at the moment.
python hanging: this i resolved with a timeout decorator (which isn't a correct solution, but at least it helps me go throu the whole processing and have a quick test output for now. i'll drop this whenever i can come to a smart solution)
queue max size: a queue object can contain up to 32k elements, so it won't fit my numbers
WHAT I'M USING
python 2.7
mysql
apache-solr
sunburnt (python interface to solr)
linux box
I don't need any code debugging, since i'd rather throw away what i did for a fresh start, instead than patching it over and over and over... "Trial by error" is not what i like.
I'd like every suggestion that can come in mind to you to design this in the correct way. Also links, websites, guides are very much welcomed, since my experience with this kind of scripts is building as i work.
Thanks all in advance for your help! If you didn't understand something, just ask, i'll answer/update the post if needed!
EDIT BASED ON SOME ANSWERS (will keep this updated)
i'll probably drop python threads for the multiprocessing lib: this could solve my performance issues
divide-and-conquer based construction method: this should add some logic in my parameters construction, without needing any bruteforce approac
what i still need to know: where can i store my combinations to feed the worker thread? maybe this is no more an issue, since the divide-and-conquer approach may let me generate runtime the combinations and split them between the working threads.
NB: i wont' accept any answer for now, since i'd like to mantain this post alive for a while, just to gather more and more ideas (not only for me, but maybe for future reference of others, since it's generic nature)
Thanks all again!

Instead of brute force, change to using a divide-and-conquer approach while keeping track of the number of hits for each search. If you subdivide into certain combinations, some of those sets will be empty so you eliminate many subtrees at once. Add missing parameters into remaining searches and repeat until you are done. It takes more bookkeeping but many fewer searches.

You can use the stdlib "multiprocessing" module in order to have several subprocesses working with your combinations - This works better than Python's threads, and allow at least each logical CPU core in your configuration to run at the same time.
Here is a minimalist example of how it works:
import random
from multiprocessing import Pool
def a(a):
if random.randint(0, 100000) == 0:
return True
return False
# the number bellow should be a equal to your number of processor cores:
p = Pool(4)
x = any(p.map(a, xrange(1000000)))
print x
So, this makes a 10 million test, divided in 4 "worker" processes, with no scaling issues.
However, given the nature of the error messages you are getting, though you don't explicitly says so, you seem to be running an application with a web interface - and you wait for all the processing to finish before rendering a result to the browser. This tipically won't work with long running calculations - you'd better perform all your calculations in a separate process than the server process serving your web interface, and update the web interface via asynchronous requests, using a little javascript. That way you will avoid any "connection reset by peer" errors.

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2gb), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or, should I map the file iterable over a pool of processes using multiprocessing? I've experimented with these approaches but the overhead is immense in distribution the data line by line. I've settled on a lightweight pipe-filters design by using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly to the second input (see this post), but I'd like to have a solution contained entirely in Python.
Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
Thanks,
Vince
Additional information: Processing time per line varies. Some problems are fast and barely not I/O bound, some are CPU-bound. The CPU bound, non-dependent tasks will gain the post from parallelization, such that even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall clock time.
A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O bound problem, but when I ran it with my cheap concurrent version with pipes, it was about 20% faster. When I run it with pool and map, or queue in multiprocessing it is always over 100% slower.

One of the best architectures is already part of Linux OS's. No special libraries required.
You want a "fan-out" design.
A "main" program creates a number of subprocesses connected by pipes.
The main program reads the file, writing lines to the pipes doing the minimum filtering required to deal the lines to appropriate subprocesses.
Each subprocess should probably be a pipeline of distinct processes that read and write from stdin.
You don't need a queue data structure, that's exactly what an in-memory pipeline is -- a queue of bytes between two concurrent processes.

One strategy is to assign each worker an offset so if you have eight worker processes you assign then numbers 0 to 7. Worker number 0 reads the first record processes it then skips 7 and goes on to process the 8th record etc., worker number 1 reads the second record then skips 7 and processes the 9th record.........
There are a number of advantages to this scheme. It doesnt matter how big the file is the work is always divided evenly, processes on the same machine will process at roughly the same rate, and use the same buffer areas so you dont incur any excessive I/O overhead. As long as the file hasnt been updated you can rerun individual threads to recover from failures.

You dont mention how you are processing the lines; possibly the most important piece of info.
Is each line independant? Is the calculation dependant on one line coming before the next? Must they be processed in blocks? How long does the processing for each line take? Is there a processing step that must incorporate "all" the data at the end? Or can intermediate results be thrown away and just a running total maintained? Can the file be initially split by dividing filesize by count of threads? Or does it grow as you process it?
If the lines are independant and the file doesn't grow, the only coordination you need is to farm out "starting addresses" and "lengths" to each of the workers; they can independantly open and seek into the file and then you must simply coordinate their results; perhaps by waiting for N results to come back into a queue.
If the lines are not independant, the answer will depend highly on the structure of the file.

I know you specifically asked about Python, but I will encourage you to look at Hadoop (http://hadoop.apache.org/): it implements the Map and Reduce algorithm which was specifically designed to address this kind of problem.
Good luck

It depends a lot on the format of your file.
Does it make sense to split it anywhere? Or do you need to split it at a new line? Or do you need to make sure that you split it at the end of an object definition?
Instead of splitting the file, you should use multiple readers on the same file, using os.lseek to jump to the appropriate part of the file.
Update: Poster added that he wants to split on new lines. Then I propose the following:
Let's say you have 4 processes. Then the simple solution is to os.lseek to 0%, 25%, 50% and 75% of the file, and read bytes until you hit the first new line. That's your starting point for each process. You don't need to split the file to do this, just seek to the right location in the large file in each process and start reading from there.

Fredrik Lundh's Some Notes on Tim Bray's Wide Finder Benchmark is an interesting read, about a very similar use case, with a lot of good advice. Various other authors also implemented the same thing, some are linked from the article, but you might want to try googling for "python wide finder" or something to find some more. (there was also a solution somewhere based on the multiprocessing module, but that doesn't seem to be available anymore)

If the run time is long, instead of having each process read its next line through a Queue, have the processes read batches of lines. This way the overhead is amortized over several lines (e.g. thousands or more).

How should I optimize this filesystem I/O bound program?

I have a python program that does something like this:
Read a row from a csv file.
Do some transformations on it.
Break it up into the actual rows as they would be written to the database.
Write those rows to individual csv files.
Go back to step 1 unless the file has been totally read.
Run SQL*Loader and load those files into the database.
Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.
There are a few ideas that I have to solve this:
Read the entire file from step one (or at least read it in very large chunks) and write the file to disk as a whole or in very large chunks. The idea being that the hard disk would spend less time going back and forth between files. Would this do anything that buffering wouldn't?
Parallelize steps 1, 2&3, and 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete.
Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with step 2 somehow.
Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

Poor man's map-reduce:
Use split to break the file up into as many pieces as you have CPUs.
Use batch to run your muncher in parallel.
Use cat to concatenate the results.

Python already does IO buffering and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long. Unless you force the OS to write them immediately, like closing the file after each write or opening the file in O_SYNC mode.
If the OS isn't doing the right thing, you can try raising the buffer size (third parameter to open()). For some guidance on appropriate values given a 100MB/s 10ms latency IO system a 1MB IO size will result in approximately 50% latency overhead, while a 10MB IO size will result in 9% overhead. If its still IO bound, you probably just need more bandwidth. Use your OS specific tools to check what kind of bandwidth you are getting to/from the disks.
Also useful is to check if step 4 is taking a lot of time executing or waiting on IO. If it's executing you'll need to spend more time checking which part is the culprit and optimize that, or split out the work to different processes.

If you are I/O bound, the best way I have found to optimize is to read or write the entire file into/out of memory at once, then operate out of RAM from there on.
With extensive testing I found that my runtime eded up bound not by the amount of data I read from/wrote to disk, but by the number of I/O operations I used to do it. That is what you need to optimize.
I don't know Python, but if there is a way to tell it to write the whole file out of RAM in one go, rather than issuing a separate I/O for each byte, that's what you need to do.
Of course the drawback to this is that files can be considerably larger than available RAM. There are lots of ways to deal with that, but that is another question for another time.

Can you use a ramdisk for step 4? Low millions sounds doable if the rows are less than a couple of kB or so.

Use buffered writes for step 4.
Write a simple function that simply appends the output onto a string, checks the string length, and only writes when you have enough which should be some multiple of 4k bytes. I would say start with 32k buffers and time it.
You would have one buffer per file, so that most "writes" won't actually hit the disk.

Isn't it possible to collect a few thousand rows in ram, then go directly to the database server and execute them?
This would remove the save to and load from the disk that step 4 entails.
If the database server is transactional, this is also a safe way to do it - just have the database begin before your first row and commit after the last.

The first thing is to be certain of what you should optimize. You seem to not know precisely where your time is going. Before spending more time wondering, use a performance profiler to see exactly where the time is going.
http://docs.python.org/library/profile.html
When you know exactly where the time is going, you'll be in a better position to know where to spend your time optimizing.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.