I have a task where I need Selenium automation to run multiple search inputs; each one has to open the browser, do some interactions, and close. I can do that one after the other, but I thought that implementing multithreading in this project would make it a lot faster. I tried to implement it, but it never worked as expected.
I did some searching and read about queues and thread workers, but I couldn't implement that either.
So can I make a queue and have only 4 threads working at a time? I guess more than 4 browsers would be a lot. And would it be thread-safe?
You can use the threading module and a function:
import threading

def main():
    # your code to execute with Selenium
    ...

for _ in range(4):
    threading.Thread(target=main).start()
More than 4 browsers doing things at the same time could indeed be a lot, but it really depends on your PC and on how heavy the page is. You can always try the code above with fewer threads and see how it goes.
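Since you asked specifically about a queue with only 4 workers, here is a minimal sketch of that pattern; the run_search function and the list of search terms are placeholders you would replace with your own Selenium code:

import queue
import threading

NUM_WORKERS = 4  # at most 4 browsers open at the same time

def run_search(term):
    # placeholder: open a webdriver, do the interactions, close it
    print("searching for", term)

def worker(task_queue):
    while True:
        term = task_queue.get()
        if term is None:          # sentinel: no more work for this worker
            task_queue.task_done()
            break
        try:
            run_search(term)
        finally:
            task_queue.task_done()

task_queue = queue.Queue()
for term in ["first search", "second search", "third search"]:
    task_queue.put(term)

threads = [threading.Thread(target=worker, args=(task_queue,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for _ in threads:
    task_queue.put(None)          # one sentinel per worker
for t in threads:
    t.join()

queue.Queue is thread-safe, so handing the search terms out this way is safe; just make sure each worker creates and quits its own webdriver instance rather than sharing one between threads.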
I am facing a small issue in my code. I have a main function that, when a certain condition arises, has to launch one or more different functions that deal with web scraping, in particular using Selenium. The problem is that I would simply like to launch this web scraping "task", which is just a Python function, and not wait for it to terminate; rather, it should run independently from the rest of my code, so that I might have 5 different instances of the same function running at once without waiting for them to terminate.
Some pseudo code:
while True:
    condition = SomeComputation()
    if condition:
        IndependentFunction(some_parameter)
Once IndependentFunction is called, I would like to not have to wait for it to end. I have looked at multiprocessing, but from what I understood I might not need that type of parallelisation.
Thanks!
You would need multithreading in order to do that. The basic usage of the threading module with your independent function could look like this:
import threading

while True:
    condition = SomeComputation()
    if condition:
        newThread = threading.Thread(target=IndependentFunction, args=(some_parameter,), daemon=True)
        newThread.start()
The daemon=True argument means that the thread executes independently in the background: the main program will not wait for it to finish before quitting (daemon threads are simply terminated when the main program exits).
Check this page for a more detailed tutorial.
If you're not depending on the output of that scraping, then you could use threading.
It would look like this:
mytask = threading.Thread(target=myfunction, args=(arg1, arg2, argn))
mytask.start()
More details in the documentation: https://docs.python.org/3/library/threading.html
I'm using requests and threading in Python to do some stuff. My question is: is this code truly running multithreaded, and is it safe to use? I'm experiencing some slowdown over time. Note: I'm not using this exact code, but mine is doing similar things.
import time
import threading
import requests

current_threads = 0
max_threads = 32

def doStuff():
    global current_threads
    r = requests.get('https://google.de')
    current_threads -= 1

while True:
    while current_threads >= max_threads:
        time.sleep(0.05)
    thread = threading.Thread(target=doStuff)
    thread.start()
    current_threads += 1
There could be a number of reasons for the issue you are facing. I'm not an expert in Python, but I can see a number of potential causes for the slowdown:
Depending on the size of the data you are pulling down, you could be overloading your bandwidth. This is a hard one to prove without seeing the exact code you are using, knowing what it is doing, and knowing your bandwidth.
Kind of connected to the first one, but if your files are taking some time to come down per thread, things may be getting clogged up at:
while current_threads >= max_threads:
time.sleep(0.05)
You could try reducing the maximum number of threads and see if that helps, though it may not if it's the files that are taking time to download.
The problem may not be with your code or your bandwidth but with the server you are pulling the files from; if that server is overloaded, it may be slowing down your transfers.
Firewalls, IPS, IDS, or policies on the server may be throttling your requests. If you make too many requests too quickly, all from the same IP, the server-side network equipment may mistake this for some sort of DoS attack and throttle your requests in response.
Unfortunately Python, compared to lower-level languages such as C# or C++, is not as good at multithreading. This is due to something called the GIL (Global Interpreter Lock), which comes into play when you are accessing or manipulating the same data in multiple threads. This is quite a sizeable subject in itself, but if you want to read up on it, have a look at this link.
https://medium.com/practo-engineering/threading-vs-multiprocessing-in-python-7b57f224eadb
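To illustrate the point, here is a minimal sketch (not from the original answer) showing that two threads running a CPU-bound function take roughly as long as running it twice sequentially, because the GIL lets only one thread execute Python bytecode at a time:

import threading
import time

def count_down(n):
    # pure CPU-bound loop; the GIL prevents two threads from running this in parallel
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
print("sequential: %.2fs" % (time.perf_counter() - start))

start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print("threaded:   %.2fs" % (time.perf_counter() - start))  # typically no faster on CPython

Note that I/O calls such as requests.get release the GIL while waiting, so the waiting itself does overlap across threads; the GIL mainly hurts CPU-heavy work.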
Sorry I can't be of any more assistance but this is as much as I can say on the subject given the provided information.
Sure, you're running multiple threads, and provided they're not accessing or mutating the same resources, you're probably "safe".
Whenever I'm accessing external resources (i.e., using requests), I always recommend asyncio over vanilla threading, as it allows custom context switching (everywhere you have an await you switch contexts, whereas in vanilla threading the switching between threads is determined by the OS and might not be optimal) and has less overhead (you're only using one thread).
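As a rough illustration of what that looks like (a sketch, not from the original answer, and assuming the aiohttp package is installed, since requests itself is blocking):

import asyncio
import aiohttp

async def fetch(session, url):
    # a context switch happens at each await, so many requests can be in flight on one thread
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['https://google.de'] * 32
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        print("fetched", len(pages), "pages")

asyncio.run(main())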
I have a script that uses a lot of headless Selenium automation and looped HTTP requests. It's very important that I implement a threading/worker queue for this script. I've done that.
My question is: should I be using multithreading or multiprocessing? A ThreadPool or a ProcessPool? I know that:
"If your program spends more time waiting on file reads or network requests or any type of I/O task, then it is an I/O bottleneck and you should be looking at using threads to speed it up."
and...
"If your program spends more time in CPU based tasks over large datasets then it is a CPU bottleneck. In this scenario you may be better off using multiple processes in order to speed up your program. I say may as it’s possible that a single-threaded Python program may be faster for CPU bound problems, it can depend on unknown factors such as the size of the problem set and so on."
Which is the case when it comes to Selenium? Am I right to think that all CPU-bound tasks related to Selenium will be executed separately via the web driver or would my script benefit from multiple processes?
Or to be more concise: When I thread Selenium in my script, is the web driver limited to 1 CPU core, the same core the script threads are running on?
A web driver is just a driver, and a driver cannot drive a car without a car.
For example, when you use ChromeDriver to communicate with the browser, you are launching Chrome. ChromeDriver itself does no calculation, but Chrome does.
So to clarify, the webdriver is a tool to manipulate the browser, but it is not itself a browser.
Based on this, you should definitely choose a thread pool instead of a process pool, as it is surely an I/O-bound problem in your Python script.
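A minimal sketch of what that could look like with concurrent.futures (the run_task function and the URLs are placeholders, not from the original question):

from concurrent.futures import ThreadPoolExecutor

def run_task(url):
    # placeholder: start a headless webdriver, interact with the page, quit the driver
    # the heavy lifting (rendering, JavaScript) happens in the browser process,
    # so this thread mostly just waits on I/O
    return "done: " + url

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_task, urls):
        print(result)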
I started programming in Python just a couple of weeks ago. I have some experience with Java, so it wasn't too hard for me to set up.
Right now I have a program that uses URLLib to scrape the source code of a list of sites.
I have thousands of sites to scrape, so I'm obviously looking to make it multi-threaded or multi-processed (I don't really know the difference).
The good thing is that my multi-threading works! But it's basically pointless for me to do, because all of my threads are scraping the exact same sites and giving me nothing but duplicates.
How can I avoid this issue? Thanks for your help in advance :)
The difference between multithreading and multiprocessing is important in Python because the Global Interpreter Lock prevents threads from executing code simultaneously in the interpreter. For web scraping purposes it's fine to use threading, as long as each thread only executes the web request (so that only the thread blocks while waiting). If you also want to do some processing of the responses in parallel, it's better to use multiprocessing, so that each subprocess has its own interpreter and you can leverage your CPU cores.
Regarding the issue with duplicates, there is probably a bug in the way you distribute the list of sites to the threads or subprocesses. In multiprocessing you have a Queue which is process-safe (and thread-safe too). This means that if two subprocesses try to get from the queue at the same time, they will be given sequential items from the queue instead of the same one.
In summary, you should put each site into the Queue from the main thread and then get from it in each worker thread or subprocess.
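A minimal sketch of that pattern with multiprocessing (the fetch code and the site list are placeholders, not from the original question):

import multiprocessing
import urllib.request

def worker(site_queue):
    while True:
        url = site_queue.get()
        if url is None:                       # sentinel: no more sites
            break
        html = urllib.request.urlopen(url).read()
        print(url, len(html), "bytes")

if __name__ == "__main__":
    sites = ["https://example.com", "https://example.org", "https://example.net"]
    site_queue = multiprocessing.Queue()
    for url in sites:
        site_queue.put(url)

    workers = [multiprocessing.Process(target=worker, args=(site_queue,)) for _ in range(2)]
    for _ in workers:
        site_queue.put(None)                  # one sentinel per worker
    for p in workers:
        p.start()
    for p in workers:
        p.join()

Because the queue hands each item to exactly one getter, every site is scraped once instead of once per worker.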
I've run into situations lately when writing scripts for both Maya and Houdini where I need to wait for aspects of the GUI to update before I can call the rest of my Python code. I thought calling time.sleep in both situations would fix my problem, but it turns out time.sleep just holds up the parent application as well. This means my script evaluates exactly the same regardless of whether or not the sleep is in there; it just pauses part way through.
I have a thought to run my script in a separate thread in Python to see if that will free up the application to still run during the sleep, but I haven't had time to test this yet.
Thought I would ask in the meantime if anybody knows of some other solution to this scenario.
Maya - or more precisely Maya Python - is not really multithreaded (Python itself has a dodgy kind of multithreading because all threads fight for the dread global interpreter lock, but that's not your problem here). You can run threaded code just fine in Maya using the threading module; try:
import time
import threading

def test():
    for n in range(0, 10):
        print("hello")
        time.sleep(1)

t = threading.Thread(target=test)
t.start()
That will print 'hello' to your listener 10 times at one second intervals without shutting down interactivity.
Unfortunately, many parts of Maya - including most notably ALL user-created UI and most kinds of scene manipulation - can only be run from the "main" thread, the one that owns the Maya UI. So you could not use the technique above in a script that changes the contents of a text box in a window (to make it worse, you'll get misleading error messages - code that works when you run it from the listener but errors when you call it from the thread, and politely returns completely wrong error codes). You can do things like network communication, writing to a file, or long calculations in a separate thread no problem - but UI work and many common scene tasks will fail if you try to do them from a thread.
Maya has a partial workaround for this in the maya.utils module. You can use the functions executeDeferred and executeInMainThreadWithResult. These wait for an idle moment to run (which means, for example, that they won't run if you're playing back an animation) and then fire as if you'd called them in the main thread. The example from the Maya docs gives the idea:
import maya.utils
import maya.cmds

def doSphere(radius):
    maya.cmds.sphere(radius=radius)

maya.utils.executeInMainThreadWithResult(doSphere, 5.0)
This gets you most of what you want, but you need to think carefully about how to break up your task into threading-friendly chunks. And, of course, running threaded programs is always harder than the single-threaded alternative; you need to design the code so that things won't break if another thread messes with a variable while you're working. Good parallel programming is a whole big kettle of fish, although it boils down to a couple of basic ideas:
1) establish exclusive control over objects (for short operations) using RLocks when needed (see the sketch after this list)
2) put shared data into safe containers, like Queue in #dylan's example
3) be really clear about which objects are shareable (they should be few!) and which aren't
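As a tiny illustration of point 1 (a sketch, not part of the original answer), protect any shared variable with a lock so concurrent updates don't get lost:

import threading

counter = 0
counter_lock = threading.RLock()

def do_work():
    global counter
    # the with-block guarantees only one thread updates the counter at a time
    with counter_lock:
        counter += 1

threads = [threading.Thread(target=do_work) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 8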
Here's a decent (long) overview.
As for Houdini, I don't know for sure, but this article makes it sound like similar issues arise there.
A better solution, rather than sleep, is a while loop. Set up a while loop to check a shared value (or even a thread-safe structure like a Queue). The parent processes that you're waiting on can do their work (or children, it's not important who spawns what), and when they finish their work they send a true/false/0/1/whatever to the Queue/variable, letting the other processes know that they may continue.
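A minimal sketch of that idea (not from the original answer): the background worker signals completion through a Queue, and the waiting side polls it without blocking everything else:

import queue
import threading
import time

done_signal = queue.Queue()

def background_job():
    time.sleep(2)              # stand-in for the slow work
    done_signal.put(True)      # tell the waiting side we're finished

threading.Thread(target=background_job).start()

while True:
    try:
        done_signal.get_nowait()   # returns immediately if the signal is there
        break
    except queue.Empty:
        # nothing yet; do other useful work (e.g. let the host app stay responsive)
        time.sleep(0.1)

print("worker finished, continuing")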