Improving performance of a repetitive, time-consuming program - python

I'm having trouble finding an answer to this question, and it may be due to poor phrasing.
I have a small python program that extracts data from a large log file.
It then displays the data in a particular format. Nothing fancy, just reads, parses and prints.
It takes about a minute to do this.
Now, I want to run this across 300 files. If I put my code inside a loop that iterates over the 300 files and executes the same piece of code, one by one, it will take 300 minutes to complete. I would rather it didn't take this long.
I have 8 virtual processors on this machine. It can handle extra load when this program is being run. Can I spread the workload over these vcpus to reduce the total runtime? If so - what is the ideal way to implement this?
It's not code I'm after, it's the theory behind it.
Thanks

Don't make parallelism your first priority. Your first priority should be making the single-thread performance as fast as possible. I rely on this method. From your short description, it sounds like there might be juicy opportunities for speedup in the I/O and in the parsing.
After you do that, if the program is CPU-bound (which I doubt - it should be spending most of its time in I/O) then parallelism might help.
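If it does turn out that parallelism is worth it, the natural fit for "300 independent files" is a process pool with one worker per virtual CPU. A minimal sketch, assuming a hypothetical parse_and_format(path) function that holds your existing read/parse/format logic for a single file:

from multiprocessing import Pool
import glob

def parse_and_format(path):
    # placeholder for the existing read/parse/format logic for one log file
    with open(path) as f:
        return path, sum(1 for _ in f)

if __name__ == "__main__":
    files = glob.glob("logs/*.log")              # hypothetical location of the 300 files
    with Pool(processes=8) as pool:              # one worker per virtual CPU
        for path, result in pool.imap_unordered(parse_and_format, files):
            print(path, result)                  # printing stays in the parent process

Whether this helps at all depends on where the minute per file actually goes; if it is mostly disk I/O on a single spindle, eight workers will just fight over the same disk.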

Related

How to make sure each process uses roughly same amount of time when using multiprocessing module in Python?

Currently I am working on an asynchronous gradient algorithm with the Python multiprocessing module. The main idea is that I run multiple processes that update an array of global parameters asynchronously. I have finished most of the framework, but I have a problem: some processes seem to "get stuck" sometimes while others are still running, which makes the algorithm less effective. So I am wondering if there are good ways to make sure that they use roughly the same amount of time?
Thanks!
This depends almost entirely on the problem you are trying to tackle. If you distribute a large task to several workers and one unpredictably gets a much larger chunk than the others, you will have this situation.
There are several options to avoid it:
Try to estimate the effort for each chunk more precisely. Depending on your task, this might be possible. The chunks with the most predicted effort should be split.
A very common way to approach this is to split the task into lots of very small chunks, many more than there are workers. Then feed all chunks into a queue and let your workers eat their chunks from the queue. This way, when a worker receives an easy chunk it finishes it quickly and immediately takes the next chunk from the queue, so it does not end up idle while other workers seem to be "stuck" with their harder chunk (see the sketch below).
Of course, a real deadlock will not be fixed by any of these approaches.
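For the queue-based option, a minimal sketch using the multiprocessing module; do_chunk and the chunk sizes here are hypothetical stand-ins for the real work:

from multiprocessing import Process, Queue

def do_chunk(chunk):
    return sum(chunk)                            # hypothetical work on one small chunk

def worker(tasks, results):
    # keep pulling chunks until the sentinel (None) arrives
    for chunk in iter(tasks.get, None):
        results.put(do_chunk(chunk))

if __name__ == "__main__":
    chunks = [list(range(i, i + 100)) for i in range(0, 10000, 100)]  # many small chunks
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for chunk in chunks:
        tasks.put(chunk)
    for _ in workers:
        tasks.put(None)                          # one sentinel per worker
    totals = [results.get() for _ in range(len(chunks))]
    for w in workers:
        w.join()

A multiprocessing.Pool with imap_unordered gives the same behaviour with less plumbing; the point is only that a fast worker immediately picks up the next chunk instead of sitting idle.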

Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads it to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads, in the hope of using DynamoDB's write capacity more efficiently. However, the multithreaded version is slower by about 50%. Since the code is very long, I have included pseudocode.
NUM_THREADS = 4
BATCH_SIZE = total_lines // NUM_THREADS          # enough lines for a single thread

threads, batch = [], []
for line in the_file:                            # placeholder: the records read off S3
    batch.append(line)
    if len(batch) >= BATCH_SIZE:
        t = Thread(target=upload_batch, args=(batch,))   # upload_batch: the DynamoDB batch write
        t.start()
        threads.append(t)
        batch = []                               # clear list of lines

for t in threads:
    t.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again, I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second would be different with multiple threads.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact, batch writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only when the file reaches a certain size. I was originally working on a file of about 1/2 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file; maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop, I have good experience with Python and DynamoDB, along with Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about the performance. If you haven't already, use Python multiprocessing pools, with map or imap depending on your use case, if you need to communicate any data back to the main process. Using a pool is the darn simplest way to run multiple processes in Python. If you need your application to run faster as a priority, you may want to look into using golang concurrency, and you could always build that code into a binary to call from within Python. Cheers.
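A minimal sketch of the pool suggestion, assuming a hypothetical upload_batch(lines) function that wraps the existing DynamoDB batch write (note that the standard Pool may not be usable inside Lambda itself; this is the generic pattern):

from multiprocessing import Pool

def upload_batch(lines):
    # placeholder for the existing batch_write_item call for one group of lines
    return len(lines)

def batches(lines, size=25):
    # 25 items is the BatchWriteItem limit per request
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

if __name__ == "__main__":
    with open("records.txt") as f:               # hypothetical input file
        lines = f.readlines()
    with Pool(processes=4) as pool:
        written = sum(pool.imap_unordered(upload_batch, batches(lines)))
    print(written, "items written")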

How to manage program turnaround time?

I have just written a script that is intended to be run 24/7 to update some files. However, if it takes 3 minutes to update one file, then it would take 300 minutes to update 100 files.
Is it possible to run n instances of the script to manage n separate files to speed up the turnaround time?
Yes, it is possible. Use the multiprocessing module to start several concurrent processes. This has the advantage that you do not run into problems with the Global Interpreter Lock and threads, as explained in the manual page. The manual page includes all the examples you will need to make your script execute in parallel. Of course this works best if the processes do not have to interact, which your example suggests.
I suggest you first find out if there is any way to reduce the 3 minutes in a single thread.
The method I use to discover speedup opportunities is demonstrated here.
That will also tell you if you are purely I/O bound.
If you are completely I/O bound, and all files are on a single disk, parallelism won't help.
In that case, possibly storing the files on a solid-state drive would help.
On the other hand, if you are CPU bound, parallelism will help, as @hochl said.
Regardless, find the speedup opportunities and fix them.
I've never seen any good-size program that didn't have one or several of them.
That will give you one speedup factor, and parallelism will give you another, and the total speedup will be the product of those two factors.

beginner question about python multiprocessing?

I have a number of records in the database I want to process. Basically, I want to run several regex substitutions over tokens of the text rows and, at the end, write them back to the database.
I wish to know whether multiprocessing speeds up the time required to do such tasks.
I ran
multiprocessing.cpu_count()
and it returned 8. I have tried something like:
from multiprocessing import Process

process = []
offset = 0
for i in range(4):
    if i == 3:
        limit = resultsSize - (3 * division)   # last process takes the remainder
    else:
        limit = division
    # limit and offset indicate the subset of records the function fetches from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i))
    p.start()
    process.append(p)
    offset += division + 1

for po in process:
    po.join()
but apparently, the time taken is higher than the time required to run it single-threaded. Why is this so?
Can someone please tell me whether this is a suitable case for multiprocessing, or what I am doing wrong here?
In what cases does multiprocessing give better performance?
Here's one trick.
Multiprocessing only helps when your bottleneck is a resource that's not shared.
A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.
To find a non-shared resource, you must have independent objects. Like a list that's already in memory.
If you want to work from a database, you need to get 8 things started which then do no more database work. So, a central query that distributes work to separate processors can sometimes be beneficial.
Or 8 different files. Note that the file system -- as a whole -- is a shared resource, and some kinds of file access involve sharing something like a disk drive or a directory.
Or a pipeline of 8 smaller steps. The standard unix pipeline trick query | process1 | process2 | process3 >file works better than almost anything else because each stage in the pipeline is completely independent.
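For comparison, the same pipeline shape can be driven from Python with subprocess; the script names here are hypothetical stand-ins for the stages:

import subprocess

# query | process1 | process2 > out.txt, each stage an independent process
query = subprocess.Popen(["python", "query.py"], stdout=subprocess.PIPE)
stage1 = subprocess.Popen(["python", "process1.py"],
                          stdin=query.stdout, stdout=subprocess.PIPE)
with open("out.txt", "wb") as out:
    stage2 = subprocess.Popen(["python", "process2.py"],
                              stdin=stage1.stdout, stdout=out)
query.stdout.close()                             # let query see SIGPIPE if a later stage exits early
stage1.stdout.close()
stage2.wait()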
Here's the other trick.
Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
A question like "In what cases does multiprocessing give better performance?" doesn't have a simple answer.
In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.
Here are a couple of questions:
Does your processR function slurp a large number of records from the database at one time, or does it fetch one row at a time? (Each row fetch will be very costly, performance-wise.)
It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
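If the export route is viable, the processing step itself is small; a minimal sketch in which the column index and the regex are hypothetical:

import csv
import re

pattern = re.compile(r"\bfoo\b")                 # hypothetical substitution

with open("export.csv", newline="") as src, open("processed.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[1] = pattern.sub("bar", row[1])      # assume the text lives in column 1
        writer.writerow(row)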
Hope this helps.
In general, multi-CPU or multicore processing helps most when your problem is CPU-bound (i.e., it spends most of its time with the CPU running as fast as it can).
From your description, you have an I/O-bound problem: it takes forever to get data from disk to the CPU (which is idle), and then the CPU operation is very fast (because it is so simple).
Thus, accelerating the CPU operation does not make a very big difference overall.

For my app, how many threads would be optimal?

I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth-limited, not CPU- or I/O-limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may limit how many HTTP requests you can make at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard-and-fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted, with either a deferred semaphore or a task cooperator, if you already have an easy way to feed in an arbitrarily long list of URLs.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a Twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
If you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
1 crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take one hour. Let's say that 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
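One simple way to enforce a cap like that is a semaphore shared by the crawler threads; a minimal sketch, assuming all requests go to the one server being crawled:

import threading
import urllib.request

server_slots = threading.BoundedSemaphore(5)     # at most 5 requests in flight at once

def fetch(url):
    with server_slots:                           # blocks while 5 requests are already running
        with urllib.request.urlopen(url) as resp:
            return resp.read()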
Threading isn't necessary in this case. Your program is I/O-bound rather than CPU-bound. The networking part would probably be better done using select() on the sockets. This reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would allow you to specify the URLs you wish to download and register a callback for each. When each is downloaded, the callback will be called, and the page can be processed. In order to allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.
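That split maps naturally onto a queue between the two threads; a minimal sketch, with the regex/SQLite work left as a placeholder:

import queue
import threading
import urllib.request

tasks = queue.Queue()

def crawl(urls):
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            tasks.put((url, resp.read()))        # hand the raw page to the worker
    tasks.put(None)                              # sentinel: no more pages

def process():
    for url, html in iter(tasks.get, None):
        pass                                     # placeholder: regex searches, SQLite insert

if __name__ == "__main__":
    urls = ["http://example.com/"]               # hypothetical seed list
    crawler = threading.Thread(target=crawl, args=(urls,))
    worker = threading.Thread(target=process)
    crawler.start()
    worker.start()
    crawler.join()
    worker.join()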
