I have a script that uses a lot of headless Selenium automation and looped HTTP requests. It's very important that I implement a threading/worker queue for this script. I've done that.
My question is: should I be using multithreading or multiprocessing? A ThreadPool or a ProcessPool? I know that:
"If your program spends more time waiting on file reads or network requests or any type of I/O task, then it is an I/O bottleneck and you should be looking at using threads to speed it up."
and...
"If your program spends more time in CPU based tasks over large datasets then it is a CPU bottleneck. In this scenario you may be better off using multiple processes in order to speed up your program. I say may as it’s possible that a single-threaded Python program may be faster for CPU bound problems, it can depend on unknown factors such as the size of the problem set and so on."
Which is the case when it comes to Selenium? Am I right to think that all CPU-bound tasks related to Selenium will be executed separately via the web driver or would my script benefit from multiple processes?
Or to be more concise: When I thread Selenium in my script, is the web driver limited to 1 CPU core, the same core the script threads are running on?
A web driver is just a driver, and a driver cannot drive without a car.
For example, when you use ChromeDriver to communicate with the browser, you are launching Chrome. ChromeDriver itself does no real computation, but Chrome does.
So to clarify: the webdriver is a tool for manipulating the browser, but it is not itself the browser.
Based on this, you should definitely choose a thread pool rather than a process pool, as the work in your Python script is almost entirely I/O bound.
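For example, here is a minimal sketch of that thread-pool approach with concurrent.futures; the URLs, the worker count of 4, and the use of Chrome/ChromeDriver are assumptions for illustration, not your actual code:

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

urls = ["https://example.com", "https://example.org"]  # placeholder URLs

def fetch_title(url):
    # Each task gets its own driver: WebDriver instances are not thread-safe,
    # so never share one driver object between threads.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

with ThreadPoolExecutor(max_workers=4) as pool:
    for title in pool.map(fetch_title, urls):
        print(title)

The threads spend nearly all their time waiting on the browser and the network, which is exactly the situation where a thread pool pays off despite the GIL.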
Related
I have a task where I need Selenium automation to run multiple search inputs; for each one it has to open the browser, do some interactions, and close it. I can do that one after the other, but I thought that if I implemented multithreading in this project it would be a lot faster. I tried to implement it but it never worked as expected.
I did some searching and read about queues and thread workers, but couldn't implement that either.
So can I make a queue and have only 4 threads working at a time?
Because I guess more than 4 browsers would be a lot. And would it be thread-safe?
You can use the threading module and a function:

import threading

def main():
    # your code to execute with Selenium
    pass

for _ in range(4):
    threading.Thread(target=main).start()
More than 4 browsers doing things at the same time could indeed be a lot, but it really depends on your PC and on how heavy the page is. You can always try out the code above with fewer threads and see how it goes.
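If you specifically want the queue variant you asked about, here is a minimal sketch with a Queue feeding exactly 4 worker threads; run_search and the example search terms are placeholders for your own Selenium code:

import queue
import threading

def run_search(term):
    # Placeholder: open a browser, perform the search for `term`, close it.
    print("searching for", term)

def worker(q):
    while True:
        term = q.get()
        if term is None:          # sentinel: no more work for this thread
            q.task_done()
            break
        try:
            run_search(term)
        finally:
            q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()

for term in ["first search", "second search", "third search"]:
    q.put(term)
for _ in threads:
    q.put(None)                   # one sentinel per worker

q.join()                          # wait until every queued item is processed

Because queue.Queue is thread-safe, each search term is handed to exactly one worker, and at most 4 browsers are ever open at once.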
I have a scraper that uses multithreaded rotating proxies. The program works great, but I was wondering what I could do to make it run faster. Currently I run it on my home router and on a VPS, and I notice the program runs much faster on the VPS.
I was wondering why that is. Is it down to GPU, bandwidth, concurrent connections allowed by the ISP, or the number of open threads allowed by proxyrack?
My end goal is to be able to run 10,000+ threads a second.
I have tried purchasing more threads on proxyrack but it just maxes out around 2000/5000. The support told me my threads close quickly.
I started programming in Python just a couple of weeks ago. I have some experience with Java, so it wasn't too hard for me to set up.
Right now I have a program that uses URLLib to scrape the source code of a list of sites.
I have thousands of sites to scrape, so I'm obviously looking to make it multi-threaded or multi-processed (I don't really know the difference).
The good thing is that my multi-threading works! But it's basically pointless for me to do, because all of my threads are scraping the exact same sites and giving me nothing but duplicates.
How can I avoid this issue? Thanks for your help in advance :)
The difference between multithreading and multiprocessing matters in Python because the Global Interpreter Lock prevents threads from executing Python code simultaneously in the interpreter. For web scraping purposes it's fine to use threading as long as each thread only executes the web request (so only that thread blocks while waiting). If you also want to do some processing of the responses in parallel, it's better to use multiprocessing, so that each subprocess has its own interpreter and you can leverage your CPU cores; a sketch of that split is shown below.
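For instance, a minimal sketch of pushing the post-download processing into a process pool with the standard multiprocessing module; parse_page and the response bodies are placeholders:

from multiprocessing import Pool

def parse_page(html):
    # Placeholder for CPU-heavy processing of one downloaded response.
    return len(html)

if __name__ == "__main__":
    responses = ["<html>one</html>", "<html>two</html>"]  # placeholder bodies
    with Pool() as pool:                 # one worker process per CPU core by default
        results = pool.map(parse_page, responses)
    print(results)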
Regarding the issue with duplicates, there is probably a bug in the way you distribute the list of sites to the threads or subprocesses. multiprocessing provides a Queue which is process-safe (and thread-safe too). This means that if two workers try to get from the queue at the same time, they are given consecutive items from the queue, never the same one.
In summary, you should put each site into the Queue from the main thread and then get from it in each worker thread or subprocess.
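A minimal sketch of that pattern using only the standard library; the site list, the worker count of 4, and the use of urllib.request are assumptions, not your code:

import queue
import threading
import urllib.request

sites = ["https://example.com", "https://example.org"]  # placeholder list of sites

q = queue.Queue()
for site in sites:
    q.put(site)                     # every site enters the queue exactly once

def worker():
    while True:
        try:
            site = q.get_nowait()   # Queue.get is thread-safe, so no duplicates
        except queue.Empty:
            return                  # queue drained, this worker is done
        html = urllib.request.urlopen(site).read()
        print(site, len(html), "bytes")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Because each site is put into the queue once and every get removes it, no two workers can end up scraping the same URL.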
Was looking to write a little web crawler in python. I was starting to investigate writing it as a multithreaded script, one pool of threads downloading and one pool processing results. Due to the GIL would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, etc..?
Basically I'm asking is doing a multi-threaded crawler in python really going to buy me much performance vs single threaded?
thanks!
The GIL is released by the Python interpreter while it is blocked on network operations. If your work is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
Look at how scrapy works. It can help you a lot. It doesn't use threads, but it can run multiple "simultaneous" downloads, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.
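To show the shape of that, here is a minimal Scrapy spider sketch; the URLs are placeholders and this is not a full crawler, just enough to see where your parsing code goes:

import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"
    start_urls = ["https://example.com", "https://example.org"]  # placeholders

    def parse(self, response):
        # Scrapy schedules the start_urls without waiting for earlier
        # responses, so the downloads overlap inside a single thread.
        yield {"url": response.url, "title": response.css("title::text").get()}

You can run it without creating a project via: scrapy runspider titles_spider.py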
When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.
Asynchronous network operations can easily be and usually are single-threaded. Network I/O almost always has higher latency than that of CPU because you really have no idea how long a page is going to take to return, and this is where async shines because an async operation is much lighter weight than a thread.
Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
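A minimal sketch along those lines; it assumes an older Twisted release where twisted.web.client.getPage is still available (newer releases recommend treq or twisted.web.client.Agent instead), and the URLs are placeholders:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

urls = [b"http://example.com/", b"http://example.org/"]   # placeholder URLs

def on_page(body, url):
    # Called with the raw response body once the download finishes.
    print(url, len(body), "bytes")

def on_error(failure, url):
    print(url, "failed:", failure.getErrorMessage())

deferreds = []
for url in urls:
    d = getPage(url)              # returns a Deferred; nothing blocks here
    d.addCallback(on_page, url)
    d.addErrback(on_error, url)
    deferreds.append(d)

# Stop the reactor once every download has either succeeded or failed.
DeferredList(deferreds).addCallback(lambda _: reactor.stop())
reactor.run()

All of the requests are in flight at once, yet everything runs in a single thread driven by the reactor.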
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
Yes, multithreaded scraping increases the process speed significantly. This is not a case where the GIL is an issue: you are wasting a lot of idle CPU time and unused bandwidth while waiting for each request to finish. If the web pages you are scraping are on your local network (a rare scraping case), the difference between multithreaded and single-threaded scraping can be smaller.
You can run the benchmark yourself, varying the number of threads from one to n. I have written a simple multithreaded crawler in Discovering Web Resources, and a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.
I have a web crawler script that spawns at most 500 threads; each thread basically requests certain data served by a remote server, and each server's reply differs in content and size from the others.
I'm setting the stack_size to 756 KB per thread:
threading.stack_size(756*1024)
which enables me to have the sufficient number of threads required and complete most of the jobs and requests. But some servers' responses are bigger than others, and when a thread gets that kind of response, the script dies with SIGSEGV.
Stack sizes above 756 KB make it impossible to have the required number of threads running at the same time.
Any suggestions on how I can continue with the given stack_size without crashes?
And how can I get the currently used stack size of any given thread?
Why on earth are you spawning 500 threads? That seems like a terrible idea!
Remove threading completely, use an event loop to do the crawling. Your program will be faster, simpler, and easier to maintain.
Lots of threads waiting for network won't make your program wait faster. Instead, collect all open sockets in a list and run a loop where you check if any of them has data available.
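A minimal sketch of that loop using the standard library's select module; the host list and the bare HTTP/1.1 requests are placeholders to keep the example self-contained:

import select
import socket

hosts = ["example.com", "example.org"]   # placeholder targets

# Open one connection per host and send a bare HTTP request on each.
pending = {}
for host in hosts:
    s = socket.create_connection((host, 80))
    s.sendall(("GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % host).encode())
    s.setblocking(False)
    pending[s] = (host, bytearray())

# Single thread: wait until any socket has data, read it, and drop sockets
# whose servers have finished sending.
while pending:
    readable, _, _ = select.select(list(pending), [], [], 10)
    if not readable:
        break                              # timed out with nothing to read
    for s in readable:
        host, buf = pending[s]
        chunk = s.recv(4096)
        if chunk:
            buf.extend(chunk)
        else:                              # server closed the connection
            print(host, "received", len(buf), "bytes")
            s.close()
            del pending[s]

One thread with a small, fixed stack handles every connection, so the per-thread stack_size problem disappears entirely.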
I recommend using Twisted - it is an event-driven networking engine. It is very flexible, secure, scalable and very stable (no segfaults).
You could also take a look at Scrapy - It is a web crawling and screen scraping framework written in Python/Twisted. It is still under heavy development, but maybe you can take some ideas.