I am trying to run multiple Selenium instances in which I need to enter captchas, but I am a beginner at multiprocessing.
When the script is running and it's time to give input, it shows an error:
EOFError: EOF when reading a line
Here is an example of the code I am running:
import time
from selenium import webdriver
import multiprocessing

def first():
    chromedriver = r"C:\chromedriver"
    driver = webdriver.Chrome(chromedriver)
    driver.set_window_size(1000, 1000)
    driver.get('https://www.google.com/')
    time.sleep(5)
    captcha1 = input("in1: ")
    print(captcha1)

def sec():
    chromedriver = r"C:\chromedriver"
    driverr = webdriver.Chrome(chromedriver)
    driverr.set_window_size(1000, 1000)
    driverr.get('https://www.google.com/')
    captcha2 = input("in2: ")
    print(captcha2)

if __name__ == '__main__':
    p1 = multiprocessing.Process(target=first)
    p2 = multiprocessing.Process(target=sec)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Not only do I need to know how to give input, but in this instance the 'captcha2' input would be needed first, so 'captcha1' would have to wait until 'captcha2' is given...
You need to send messages requesting user input back to the main process so that it (and only it) can ask the user about them. The simplest way to do this is probably to create a multiprocessing.Queue object for the requests (so that the main process can listen to all children) and a Pipe for each process for the answers. Each request would of course be labeled with an identifier for the process sending it so that the response could be sent to the right place.
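For illustration, here is a minimal sketch of that arrangement, assuming each worker still does its own Selenium work before asking for a captcha. The worker bodies and names such as request_queue are illustrative, not from the original code:

import multiprocessing

def worker(name, request_queue, conn):
    # ... do the Selenium work here, then ask the main process for the captcha:
    request_queue.put((name, f"captcha for {name}: "))  # request labeled with the sender's name
    captcha = conn.recv()                               # wait for the main process to answer
    print(f"{name} got: {captcha}")

if __name__ == '__main__':
    request_queue = multiprocessing.Queue()
    pipes = {}
    procs = []
    for name in ('first', 'sec'):
        parent_conn, child_conn = multiprocessing.Pipe()
        pipes[name] = parent_conn
        p = multiprocessing.Process(target=worker, args=(name, request_queue, child_conn))
        procs.append(p)
        p.start()
    # Only the main process touches stdin; requests are served in arrival order,
    # so whichever child asks first (e.g. captcha2) is answered first.
    for _ in procs:
        name, prompt = request_queue.get()
        pipes[name].send(input(prompt))
    for p in procs:
        p.join()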
So this is the first time I am playing around with threading, so please bear with me here. In my main application (which I will implement this into), I need to add multithreading to my script. The script will read account info from a text file, then log in and do some tasks with that account. I need to make sure that threads aren't reading the same line from the accounts text file, since that would screw everything up, and I'm not quite sure how to prevent that.
from multiprocessing import Queue, Process
from threading import Thread
from time import sleep

urls_queue = Queue()
max_process = 10

def dostuff():
    with open('acc.txt', 'r') as accounts:
        for account in accounts:
            account.strip()
            split = account.split(":")
            a = {
                'user': split[0],
                'pass': split[1],
                'name': split[2].replace('\n', ''),
            }
            sleep(1)
            print(a)
    for i in range(max_process):
        urls_queue.put("DONE")

def doshit_processor():
    while True:
        url = urls_queue.get()
        if url == "DONE":
            break

def main():
    file_reader_thread = Thread(target=dostuff)
    file_reader_thread.start()
    procs = []
    for i in range(max_process):
        p = Process(target=doshit_processor)
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
    print('all done')
    # wait for all tasks in the queue
    file_reader_thread.join()

if __name__ == '__main__':
    main()
So at the moment I don't think the threading is even working, because it's printing one account per second even with 10 threads. It should be printing 10 accounts per second, which has me confused. Also, I am not sure how to make sure that threads won't pick the same account line. Help from a big brain is much appreciated.
The problem is that you create a single thread to generate the data for your processes but then don't post that data to the queue. You sleep in that single thread, so you see one item generated per second and then... nothing, because the items are never queued. It seems that all you are doing is creating a process pool, and the built-in multiprocessing.Pool should work for you.
I've set the pool "chunk size" low so that workers are only given 1 work item at a time. This is good for workflows where processing time can vary for each work item. By default, the pool tries to optimize for the case where processing times are roughly equivalent and instead tries to reduce interprocess communication time.
Your data looks like a colon-separated file and you can use csv to cut down the processing there too. This smaller script should do what you want.
import multiprocessing as mp
from time import sleep
import csv

max_process = 10

def doshit_processor(row):
    sleep(1)  # if you want to simulate work
    print(row)

def main():
    with open('acc.txt', newline='') as accounts:
        table = list(csv.DictReader(accounts, fieldnames=('user', 'pass', 'name'),
                                    delimiter=':'))
    with mp.Pool(max_process) as pool:
        pool.map(doshit_processor, table, chunksize=1)
    print('all done')

if __name__ == '__main__':
    main()
As can be seen in the code below, two processes run together via multiprocessing, but each has a moment where it asks for an input() in the terminal. Is there any way to pause the other process until the answer is given in the terminal?
File Code_One (a deliberately simple example to speed up the explanation):
from time import sleep

def main():
    sleep(1)
    print('run')
    sleep(1)
    print('run')
    sleep(1)
    input('Please, give the number:')
File Code_Two (a deliberately simple example to speed up the explanation):
from time import sleep

def main():
    sleep(2)
    input('Please, give the number:')
    sleep(1)
    print('run 2')
    sleep(1)
    print('run 2')
    sleep(1)
    print('run 2')
    sleep(1)
    print('run 2')
    sleep(1)
    print('run 2')
File Main_Code:
import Code_One
import Code_Two
import multiprocessing
from time import sleep

def main():
    while True:
        pression = multiprocessing.Process(target=Code_One.main)
        xgoals = multiprocessing.Process(target=Code_Two.main)
        pression.start()
        xgoals.start()
        pression.join()
        xgoals.join()
        print('Done')
        sleep(5)

if __name__ == '__main__':
    main()
How should I proceed in this situation?
In this example, since the other process is not paused, whenever one of them asks for an input this error happens:
input('Please, give the number:')
EOFError: EOF when reading a line
Sure, this is possible. To do it you will need to use some sort of interprocess communication (IPC) mechanism to allow the two processes to coordinate. time.sleep is not the best option though, and there are much more efficient ways of tackling it that are specifically made just for this problem.
Probably the most efficient way is to use a multiprocessing.Event, like this:
import multiprocessing
import sys
import os

def Code_One(event, fno):
    proc_name = multiprocessing.current_process().name
    print(f'running {proc_name}')
    sys.stdin = os.fdopen(fno)
    val = input('give proc 1 input: ')
    print(f'proc 1 got input: {val}')
    event.set()

def Code_Two(event, fno):
    proc_name = multiprocessing.current_process().name
    print(f'running {proc_name} and waiting...')
    event.wait()
    sys.stdin = os.fdopen(fno)
    val = input('give proc 2 input: ')
    print(f'proc 2 got input {val}')

if __name__ == '__main__':
    event = multiprocessing.Event()
    pression = multiprocessing.Process(name='code_one', target=Code_One, args=(event, sys.stdin.fileno()))
    xgoals = multiprocessing.Process(name='code_two', target=Code_Two, args=(event, sys.stdin.fileno()))
    xgoals.start()
    pression.start()
    xgoals.join()
    pression.join()
This creates the event object, and the two subprocesses. Event objects have an internal flag that starts out False, and can then be toggled True by any process calling event.set(). If a process calls event.wait() while the flag is False, that process will block until another process calls event.set().
The event is created in the parent process, and passed to each subprocess as an argument. Code_Two begins and calls event.wait(), which blocks until the internal flag in the event is set to True. Code_One executes immediately and then calls event.set(), which sets event's internal flag to True, and allows Code_Two to proceed. At that point both processes have returned and called join, and the program ends.
This is a little hacky because it is also passing the stdin file number from the parent to the child processes. That is necessary because when subprocesses are forked, those file descriptors are closed, so for a child process to read stdin using input it first needs to open the correct input stream (that is what sys.stdin = os.fdopen(fno) is doing). It won't work to just send sys.stdin to the child as another argument, because of the mechanics that Python uses to set up the environment for forked processes (sys.stdin is an IO wrapper object and is not pickleable).
I can't get this code to run an input whilst another block of code is running. I want to know if there are any workarounds, my code is as follows.
import multiprocessing

def test1():
    input('hello')

def test2():
    a = True
    while a == True:
        b = 5

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=test1)
    p2 = multiprocessing.Process(target=test2)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
When the code is run I get an EOF error which apparently happens when the input function is interrupted.
I would have the main process create a daemon thread responsible for doing the input, in conjunction with the greatly under-utilized full-duplex Pipe, which provides two two-way Connection instances. For simplicity, the following demo just creates one Process instance that loops making input requests and echoing the response until the user enters 'quit':
import multiprocessing
import threading

def test1(conn):
    while True:
        conn.send('Please enter a value: ')
        s = conn.recv()
        if s == 'quit':
            break
        print(f'You entered: "{s}"')

def inputter(conn):
    while True:
        # The contents of the request is the prompt to be used:
        prompt = conn.recv()
        conn.send(input(prompt))

if __name__ == "__main__":
    conn1, conn2 = multiprocessing.Pipe(duplex=True)
    t = threading.Thread(target=inputter, args=(conn1,), daemon=True)
    p = multiprocessing.Process(target=test1, args=(conn2,))
    t.start()
    p.start()
    p.join()
That's not all of your code, because it doesn't show the multiprocessing. However, the issue is that only the main process can interact with the console. The other processes do not have a stdin. You can use a Queue to communicate with the main process if you need to, but in general you want the secondary processes to be pretty much standalone.
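For illustration, here is a minimal sketch of the Queue idea, assuming the main process reads stdin on the worker's behalf; the queue and prompt names are illustrative:

import multiprocessing

def worker(answer_queue):
    # Block until the main process forwards something typed at the console.
    value = answer_queue.get()
    print(f'worker received: {value}')

if __name__ == '__main__':
    answer_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(answer_queue,))
    p.start()
    # Only the main process reads stdin, then hands the result to the child.
    answer_queue.put(input('enter a value for the worker: '))
    p.join()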
Here's the situation:
I create a child process which opens and deals with a webdriver. The child process is finicky and might error, in which case it would close immediately, and control would be returned to the main function. In this situation, however, the browser would still be open (as the child process never completely finished running). How can I close a browser that is initialized in a child process?
Approaches I've tried so far:
1) Initializing the webdriver in the main function and passing it to the child process as an argument.
2) Passing the webdriver between the child and parent process using a queue.
The code:
import multiprocessing
from selenium import webdriver

def foo(queue):
    driver = webdriver.Chrome()
    queue.put(driver)
    # Do some other stuff
    # If finicky stuff happens, this driver.close() will not run
    driver.close()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=foo, name='foo', args=(queue,))
    p.start()
    # Wait for process to finish
    p.join()
    # Try to close the browser if still open
    try:
        driver = queue.get()
        driver.close()
    except:
        pass
I found a solution:
In foo(), get the process ID of the webdriver when you open a new browser. Add the process ID to the queue. Then in the main function, add time.sleep(60) to wait for a minute, then get the process ID from the queue and use a try-except to try and close the particular process ID.
If foo() running in a separate process hangs, then the browser will be closed in the main function after one minute.
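A minimal sketch of that idea follows. The attribute path driver.service.process.pid and the one-minute wait are assumptions, and killing the driver process this way is a blunt cleanup rather than a graceful driver.quit():

import multiprocessing
import os
import signal
import time

from selenium import webdriver

def foo(queue):
    driver = webdriver.Chrome()
    # The Service object wraps the chromedriver subprocess (assumed attribute path);
    # put its PID on the queue so the parent can clean up if we crash.
    queue.put(driver.service.process.pid)
    # ... finicky work that might crash before driver.close() runs ...
    driver.close()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=foo, name='foo', args=(queue,))
    p.start()
    time.sleep(60)  # give the child a minute to finish on its own
    try:
        pid = queue.get_nowait()
        os.kill(pid, signal.SIGTERM)  # kill the driver process if it is still around
    except Exception:
        pass  # queue was empty or the process already cleaned up after itself
    p.join()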
I have a program that creates a multiprocessing pool to handle a webextraction job. Essentially, a list of product ID's is fed into a pool of 10 processes that handle the queue. The code is pretty simple:
import multiprocessing
import time

num_procs = 10

products = ['92765937', '20284759', '92302047', '20385473', ...etc]

def worker():
    for workeritem in iter(q.get, None):
        time.sleep(10)
        get_product_data(workeritem)
        q.task_done()
    q.task_done()

q = multiprocessing.JoinableQueue()
procs = []
for i in range(num_procs):
    procs.append(multiprocessing.Process(target=worker))
    procs[-1].daemon = True
    procs[-1].start()

for product in products:
    time.sleep(10)
    q.put(product)

q.join()

for p in procs:
    q.put(None)

q.join()

for p in procs:
    p.join()
The get_product_data() function takes the product, opens an instance of Selenium, navigates to a site, logs in, collects the details of the product, and outputs them to a csv file. The problem is, randomly (literally... it happens at different points of the website's navigation or extraction process) Selenium will stop doing whatever it's doing and just sit there, no longer doing its job. No exceptions are thrown or anything. I've done everything I can in the get_product_data() function to keep this from happening, but it seems to just be a problem with Selenium (I've tried using Firefox, PhantomJS, and Chrome as its driver, and still run into the same problem no matter what).
Essentially, the process should never run for longer than, say, 10 minutes. Is there any way to kill a process and restart it with the same product id if it has been running for longer than the specified time?
This is all running on a Debian Wheezy box with Python 2.7.
You could write your code using multiprocessing.Pool and the timeout() function suggested by @VooDooNOFX. Not tested, so consider it executable pseudo-code:
#!/usr/bin/env python
import signal
from contextlib import closing
from multiprocessing import Pool

class Alarm(Exception):
    pass

def alarm_handler(*args):
    raise Alarm("timeout")

def mp_get_product_data(id, timeout=10, nretries=3):
    signal.signal(signal.SIGALRM, alarm_handler)  # XXX could move it to initializer
    for i in range(nretries):
        signal.alarm(timeout)
        try:
            return id, get_product_data(id), None
        except Alarm as e:
            timeout *= 2  # retry with increased timeout
        except Exception as e:
            break
        finally:
            signal.alarm(0)  # disable alarm, no need to restore handler
    return id, None, str(e)

if __name__ == "__main__":
    with closing(Pool(num_procs)) as pool:
        for id, result, error in pool.imap_unordered(mp_get_product_data, products):
            if error is not None:  # report and/or reschedule
                print("error: {} for {}".format(error, id))
    pool.join()
You need to ask Selenium to wait an explicit amount of time, or to wait until some DOM element is available. Take a quick look at the selenium docs about that.
From the link, here's a process that waits 10 seconds for the DOM element myDynamicElement to appear.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

ff = webdriver.Firefox()
ff.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))
except TimeoutException as why:
    pass  # Do something to reject this item, possibly by re-adding it to the worker queue.
finally:
    ff.quit()
If nothing is available in the given time period, a selenium.common.exceptions.TimeoutException is raised, which you can catch in a try/except block like the one above.
EDIT
Another option is to ask multiprocessing to timeout the process after some amount of time. This is done using the built-in library signal. Here's an excellent example of doing this, however it's still up to you to add that item back into the work queue when you detect a process has been killed. You can do this in the def handler section of the code.
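A minimal sketch of that signal-based approach, assuming the handler re-queues the product it was working on. The retry_queue and current_product names are illustrative, get_product_data is the worker's original extraction call from the question, and SIGALRM is only available on Unix (the question runs on Debian):

import signal
from multiprocessing import JoinableQueue

class Timeout(Exception):
    pass

retry_queue = JoinableQueue()  # items that timed out and should be retried
current_product = None         # set by the worker before each extraction

def handler(signum, frame):
    # Runs when the alarm fires: push the unfinished item back for a retry.
    if current_product is not None:
        retry_queue.put(current_product)
    raise Timeout('extraction took too long')

def guarded_get_product_data(product, timeout=600):
    global current_product
    current_product = product
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout)                   # fire in `timeout` seconds unless cancelled
    try:
        return get_product_data(product)    # the original extraction call from the question
    finally:
        signal.alarm(0)                     # cancel the pending alarm
        current_product = None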