I have created a script that :
Imports a list of IP's from .txt ( around 5K )
Connects to a REST API and performs a query based on the IP ( web logs for each IP)
Data is returned from the API and some calculations are done on the data
Results of calculations are written to a .csv
At the moment it's really slow as it takes one IP at a time does everything and then goes to the next IP .
I may be wrong but from my understanding with threading or multiprocessing i could have 3-4 threads each doing an IP which would increase the speed of the tool by a huge margin . Is my understanding correct and if it is should i be looking at threading or multi-processing for my task ?
Any help would amazing
Random info, running python 2.7.5 , Win7 with plenty or resources.
multiprocessing is definitely the way to go forward. You could start a process that reads the IPs and puts them in a multiprocessing.Queue and then make a few processes (depending on available resources) that read from this queue, connect to the API and make the requests. These requests shall then be made in parallel, and if the API can handle these requests, your program should finish faster. If the calculations are complex and time consuming, the output from the API can be put into another Queue, from where other processes you start, can read them and make the calculations and store results. You may have to start a collector process to collect the final outputs.
You can find some sample code for such problems in this stackoverflow question. If you require further explanation or sample code, let me know in comments.
With multiprocessing a primitive way to do this whould be chunck the file into 5 equal pieces and give it to 5 different processes write their results to 5 different files, when all processes are done you will merge the results.
You can have the same logic with Python threads without much complication. And probably won't make any difference since the bottle neck is probably the API. So in the end it does not really matter which approach you choose here.
There are two things two consider though:
Using Threads, you are not realy using multiple CPUs hence you are have "wasted resources"
Using Multiprocessing will use multiple processors but it is heavier on start up ... So you will benefit from never stoping the script and keeping the processes alive if the script needs to run very often.
Since the information you gave about the scenario where you use this script (or better say program) is limited, it really hard to say which is the better approach.
Related
I'm not a Python programmer and i rarely work with linux but im forced to use it for a project. The project is fairly straightforward, one task constantly gathers information as a single often updating numpy float32 value in a class, the other task that is also running constantly needs to occasionally grab the data in that variable asynchronously but not in very time critical way, and entirely on linux. My default for such tasks is the creation of a thread but after doing some research it appears as if python threading might not be the best solution for this from what i'm reading.
So my question is this, do I use multithreading, multiprocessing, concurrent.futures, asyncio, or (just thinking out loud here) some stdin triggering / stdout reading trickery or something similar to DDS on linux that I don't know about, on 2 seperate running scripts?
Just to append, both tasks do a lot of IO, task 1 does a lot of USB IO, the other task does a bit serial and file IO. I'm not sure if this is useful. I also think the importance of resetting the data once pulled and having as little downtime in task 1 as possible should be stated. Having 2 programs talk via a file probably won't satisfy this.
Any help would be appreciated, this has proven to be a difficult use case to google for.
Threading will probably work fine, a lot of the problems with the BKL (big kernel lock) are overhyped. You just need to make sure both threads provide active opportunities for the scheduler to switch contexts on a regular basis, typically by calling sleep(0). Most threads run in a loop and if the loop body is fairly short then calling sleep(0) at the top or bottom of it on every iteration is usually enough. If the loop body is long you might want to put a few more in along the way. It’s just a hint to the scheduler that this would be a good time to switch if other threads want to run.
Have you considered a double ended queue? You write to one end and grab data from the other end. Then with your multi-threading you could write with one thread and read with the other:
https://docs.python.org/3/library/collections.html#collections.deque
Quoting the documentation: "Deques are a generalization of stacks and queues (the name is pronounced “deck” and is short for “double-ended queue”). Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction."
Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have a AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads with the hope of more efficiently utilizing DynamoDBs write capacity. However, the multithread version is slower by about 50%. Since the code is very long I have included pseudocode.
NUM_THREADS = 4
for every line in the file:
Add line to list of lines
if we've read enough lines for a single thread:
Create thread that uploads list of lines
thread.start()
clear list of lines.
for every thread started:
thread.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something I else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the Dynamodb end? The total number of write requests may be the same, but the number per second would be different with a multi-thread.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only when the file reached a certain file size. I was originally work on a file size of about 1/2 MG. With a 10 MG file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop I have good experience with python and dynamoDB along with using python's multiprocessing library. Since your file size was fairly small it may have been the setup time of the process that confused you about performance. If you haven't already, use python multiprocessing pools and use map or imap depending on your use case if you need to communicate any data back to the main thread. Using a pool is the darn simpliest way to run multiple processes in python. If you need your application to run faster as a priority you may want to look into using golang concurrency and you could always build the code into binary to use from within python. Cheers.
I'm making a python script that needs to do 3 things simultaneously.
What is a good way to achieve this as do to what i've heard about the GIL i'm not so lean into using threads anymore.
2 of the things that the script needs to do will be heavily active, they will have lots of work to do and then i need to have the third thing reporting to the user over a socket when he asks (so it will be like a tiny server) about the status of the other 2 processes.
Now my question is what would be a good way to achieve this? I don't want to have three different script and also due to GIL using threads i think i won't get much performance and i'll make things worse.
Is there a fork() for python like in C so from my script so fork 2 processes that will do their job and from the main process to report to the user? And how can i communicate from the forked processes with the main process?
LE:: to be more precise 1thread should get email from a imap server and store them into a database, another thread should get messages from db that needs to be sent and then send them and the main thread should be a tiny http server that will just accept one url and will show the status of those two threads in json format. So are threads oK? will the work be done simultaneously or due to the gil there will be performance issues?
I think you could use the multiprocessing package that has an API similar to the threading package and will allow you to get a better performance with multiple cores on a single CPU.
To view the gain of performance using multiprocessing instead threading, check on this link about the average time comparison of the same program using multiprocessing x threading.
The GIL is really only something to care about if you want to do multiprocessing, that is spread the load over several cores/processors. If that is the case, and it kinda sounds like it from your description, use multiprocessing.
If you just need to do three things "simultaneously" in that way that you need to wait in the background for things to happen, then threads are just fine. That's what threads are for in the first place. 8-I)
I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured i would make a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should i run at once? should i stick to two? can i go higher? what would be a reasonable limit for a number of threads? Keep in mind that each thread goes out to a web page, downloads the html, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next url off the queue.
You will probably find your application is bandwidth limited not CPU or I/O limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. Like if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may impact making too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really theres no hard and fast answer. Assuming a 1-5 megabit connection I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and twisted with either a deferred semaphore or a task cooperator if you already have an easy way to feed an arbitrarily long list of URLs in.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel it necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
You you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
1 crawler does it's job in 1 minute. 100 pages done serially could take a 100 minutes. 100 crawlers concurrently might take on hour. Let's say that 25 crawlers finishes the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher that two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets. This reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I heard it has really good support for asynchronous networking. This would allow you you to specify the URLs you wish to download and register a callback for each. When each is downloaded you the callback will be called, and the page can be processed. In order to allow multiple sites to be downloaded, without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue. The "worker" thread would do the actual processing.
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.
I want to do a program and want make a the spawn like this process -> n process -> n process
can the second level spawn process with multiprocessing ? using multiprocessinf module of python 2.6
thnx
#vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question was vague.
To expand a little, you'd have your original program spawn its n processes, but they'd be slightly different than the original in that you'd want them (each, if I understand your question) to spawn n more processes. You could accomplish this by either by having them run code similar to your original process, but that spawned new sets of programs that performed the task at hand, without further processing, or you could use the same code/entry point, just providing different arguments - something like
def main(level):
if level == 0:
do_work
else:
for i in range(n):
spawn_process_that_runs_main(level-1)
and start it off with level == 2
You can structure your app as a series of process pools communicating via Queues at any nested depth. Though it can get hairy pretty quick (probably due to the required context switching).
It's not erlang though that's for sure.
The docs on multiprocessing are extremely useful.
Here(little too much to drop in a comment) is some code I use to increase throughput in a program that updates my feeds. I have one process polling for feeds that need to fetched, that stuffs it's results in a queue that a Process Pool of 4 workers picks up those results and fetches the feeds, it's results(if any) are then put in a queue for a Process Pool to parse and put into a queue to shove back in the database. Done sequentially, this process would be really slow due to some sites taking their own sweet time to respond so most of the time the process was waiting on data from the internet and would only use one core. Under this process based model, I'm actually waiting on the database the most it seems and my NIC is saturated most of the time as well as all 4 cores are actually doing something. Your mileage may vary.
Yes - but, you might run into an issue which would require the fix I committed to python trunk yesterday. See bug http://bugs.python.org/issue5313
Sure you can. Expecially if you are using fork to spawn child processes, they works as perfectly normal processes (like the father). Thread management is quite different, but you can also use "second level" sub-treading.
Pay attention to not over-complicate your program, as example program with two level threads are normally unused.