Python Multiprocessing Timeout problems

I am using multiprocessing in Python and trying to kill the running process after a timeout, but it doesn't work and I don't know why.
I followed an example and it seemed easy: just start the process and, after 2 seconds, terminate it. But it doesn't work for me.
Could you please help me figure it out? Thanks for your help!
from amazonproduct import API
import multiprocessing
import time
AWS_KEY = '...'
SECRET_KEY = '...'
ASSOC_TAG = '...'
def crawl():
    api = API(AWS_KEY, SECRET_KEY, 'us', ASSOC_TAG)
    for root in api.item_search('Beauty', Keywords='maybelline',
                                ResponseGroup='Large'):
        # extract paging information
        nspace = root.nsmap.get(None, '')
        products = root.xpath('//aws:Item',
                              namespaces={'aws' : nspace})
        for product in products:
            print product.ASIN,

if __name__ == '__main__':
    p = multiprocessing.Process(target = crawl())
    p.start()
    if time.sleep(2.0):
        p.terminate()

Well, this won't work:
if time.sleep(2.0):
p.terminate()
time.sleep does not return anything, so the above statement is always equivalent to if None:. None is False in a boolean context, so there you go.
If you want it to always terminate, take out that if statement. Just do a bare time.sleep.
Also, bug:
p = multiprocessing.Process(target = crawl())
This isn't doing what you think it's doing. You need to specify target=crawl, NOT target=crawl(). The latter calls the function right there in the parent process; the former passes the function itself to Process, which will then execute it in parallel.
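Putting both fixes together, a minimal sketch of the intended start-then-terminate pattern might look like this (using join() with a timeout instead of a bare sleep, so the parent also stops waiting early if crawl() finishes sooner):

if __name__ == '__main__':
    p = multiprocessing.Process(target=crawl)   # pass the function, do not call it
    p.start()
    p.join(2.0)           # wait at most 2 seconds for crawl() to finish
    if p.is_alive():      # still running after the timeout?
        p.terminate()     # kill it
        p.join()          # reap the terminated process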

Related

Python Threading issue, not working and I need to make sure threads dont read same line

So this is the first time I am playing around with threading, so please bear with me here. In my main application (which I will implement this into), I need to add multithreading to my script. The script will read account info from a text file, then log in and do some tasks with that account. I need to make sure that threads aren't reading the same line from the accounts text file, since that would screw everything up, and I'm not quite sure how to do that.
from multiprocessing import Queue, Process
from threading import Thread
from time import sleep

urls_queue = Queue()
max_process = 10

def dostuff():
    with open('acc.txt', 'r') as accounts:
        for account in accounts:
            account.strip()
            split = account.split(":")
            a = {
                'user': split[0],
                'pass': split[1],
                'name': split[2].replace('\n', ''),
            }
            sleep(1)
            print(a)
    for i in range(max_process):
        urls_queue.put("DONE")

def doshit_processor():
    while True:
        url = urls_queue.get()
        if url == "DONE":
            break

def main():
    file_reader_thread = Thread(target=dostuff)
    file_reader_thread.start()
    procs = []
    for i in range(max_process):
        p = Process(target=doshit_processor)
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
    print('all done')
    # wait for all tasks in the queue
    file_reader_thread.join()

if __name__ == '__main__':
    main()
So at the moment I don't think the threading is even working, because it's printing one account per second even with 10 threads. It should be printing 10 accounts per second, which it isn't, and that has me confused. I'm also not sure how to make sure that threads won't pick the same account line. Help from a big brain is much appreciated.
The problem is that you create a single thread to generate the data for your processes but then never post that data to the queue. You sleep in that single thread, so you see one item generated per second and then... nothing, because the items never get queued. It seems that all you are really doing is creating a process pool, and the built-in multiprocessing.Pool should work for you.
I've set the pool's "chunk size" low so that workers are only given 1 work item at a time. This is good for workflows where processing time can vary per work item. By default, the pool tries to optimize for the case where processing times are roughly equal and instead tries to reduce interprocess communication.
Your data looks like a colon-separated file, so you can use the csv module to cut down the parsing work too. This smaller script should do what you want.
import multiprocessing as mp
from time import sleep
import csv

max_process = 10

def doshit_processor(row):
    sleep(1)  # if you want to simulate work
    print(row)

def main():
    with open('acc.txt', newline='') as accounts:
        table = list(csv.DictReader(accounts, fieldnames=('user', 'pass', 'name'),
                                    delimiter=':'))
    with mp.Pool(max_process) as pool:
        pool.map(doshit_processor, table, chunksize=1)
    print('all done')

if __name__ == '__main__':
    main()

How to get a value from a function which is executed by a Thread?

I have main_script.py, which imports scripts that get data from webpages. I want to do this with multithreading. I came up with this solution, but it does not work:
main_script:
import threading
import script1

temp_path = ''
thread1 = threading.Thread(target=script1.Main,
                           name='Script1',
                           args=(temp_path, ))
thread1.start()
thread1.join()
script1:
class Main:
    def __init__(self, temp_path):
        ...
    def some_func(self):
        ...
    def some_func2(self):
        ...
    def __main__(self):
        self.some_func()
        self.some_func2()
        return callback
Right now the only way I know to get the value of callback from script1 into main_script is:
main_script:
import script1
temp_path = ''
# make instance of class with temp_path
inst_script1 = script1.Main(temp_path)
print("instance1:")
print(inst_script1.callback)
It works, but then the script instances run one by one, not concurrently.
Anybody have any idea how to handle that? :)
First off, if you are using threading in Python, make sure you read: https://docs.python.org/2/glossary.html#term-global-interpreter-lock. Unless you are using C modules or doing a lot of I/O, you won't see the scripts run concurrently. Generally speaking, multiprocessing.pool is a better approach.
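As a minimal sketch of the pool idea (assuming, as in your later code, that script1.Main(temp_path) returns an object carrying the value you want, e.g. a .callback attribute), multiprocessing.pool.ThreadPool gives you the return values back directly:

from multiprocessing.pool import ThreadPool  # thread-based pool, same API as Pool

import script1

paths = ['path1', 'path2']                   # hypothetical inputs
pool = ThreadPool(5)                         # at most 5 run at once
instances = pool.map(script1.Main, paths)    # each call returns a script1.Main instance
pool.close()
pool.join()
for inst in instances:
    print(inst.callback)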
If you are certain you want threads rather than processes, you can use a mutable variable to store the results, for example a dictionary which keeps track of the result of each thread.
result = {}

def test(val, name, target):
    target[name] = val * 4

temp_path = 'ASD'
thread1 = threading.Thread(target=test,
                           name='Script1',
                           args=(temp_path, 'A', result))
thread1.start()
thread1.join()
print(result)
Thanks for the response. Yes, I have read about the GIL, but it hasn't caused me any problems yet. I have generally solved my problem, because I found a solution on another website. The code looks like this now:
Main_script:
import queue
import threading
import script1
import script2

queue_callbacks = queue.Queue()
threads_list = list()
callbacks = []

temp_path1 = ''
thread1 = threading.Thread(target=lambda q, arg1: q.put(script1.Main(arg1)),
                           name='Script1',
                           args=(queue_callbacks, temp_path1, ))
thread1.start()
threads_list.append(thread1)

temp_path2 = ''
thread2 = threading.Thread(target=lambda q, arg1: q.put(script2.Main(arg1)),
                           name='Script2',
                           args=(queue_callbacks, temp_path2, ))
thread2.start()
threads_list.append(thread2)

for t in threads_list:
    t.join()

while not queue_callbacks.empty():
    result = queue_callbacks.get()
    callbacks.append({"service": result.service, "callback": result.callback, "error": result.error})
And this works fine. Now I have another problem: I want this to work at a larger scale, where I have hundreds of scripts and handle them with, say, 5 threads.
In general, is there any limit to the number of threads running at any one time?
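One common way to cap how many scripts run at once is a fixed-size worker pool. A minimal sketch with concurrent.futures (the script list and the 5-worker cap are placeholders, and it assumes each script exposes a Main(temp_path) entry point as above):

from concurrent.futures import ThreadPoolExecutor

import script1
import script2

script_mains = [script1.Main, script2.Main]   # ...hundreds more in practice
temp_path = ''

callbacks = []
with ThreadPoolExecutor(max_workers=5) as executor:   # only 5 run at any one time
    for inst in executor.map(lambda main: main(temp_path), script_mains):
        callbacks.append(inst.callback)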

How to run multiple asynchronous processes in Python using multiprocessing?

I need to run multiple background asynchronous functions, using multiprocessing. I have a working Popen solution, but it looks a bit unnatural. Example:
from time import sleep
from multiprocessing import Process, Value
import subprocess

def worker_email(keyword):
    subprocess.Popen(["python", "mongoworker.py", str(keyword)])
    return True

keywords_list = ['apple', 'banana', 'orange', 'strawberry']

if __name__ == '__main__':
    for keyword in keywords_list:
        # Do work
        p = Process(target=worker_email, args=(keyword,))
        p.start()
        p.join()
If I try not to use Popen, like:
def worker_email(keyword):
    print('Before:' + keyword)
    sleep(10)
    print('After:' + keyword)
    return True
The functions run one by one, not asynchronously. So, how do I run all the functions at the same time without using Popen?
UPD: I'm using multiprocessing.Value to return results from Process, like:
def worker_email(keyword, func_result):
    sleep(10)
    print('Yo:' + keyword)
    func_result.value = 1
    return True

func_result = Value('i', 0)
p = Process(target=worker_email, args=(doc['check_id'], func_result))
p.start()

# Change status
if func_result.value == 1:
    stream.update_one({'_id': doc['_id']}, {"$set": {"status": True}}, upsert=False)
But it doesn't work without .join(). Any ideas how to make it work, or another way to do something similar? :)
If you just remove the line p.join(), it should work.
You only need p.join() if you want to wait for the process to finish before executing further. At the end of the program, Python waits for all processes to finish before exiting, so you don't need to worry about that.
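Applied to the original loop, a minimal sketch would start every process first and only join them at the end, so they actually run in parallel (this just rearranges the question's code):

if __name__ == '__main__':
    procs = []
    for keyword in keywords_list:
        p = Process(target=worker_email, args=(keyword,))
        p.start()            # start them all without waiting
        procs.append(p)
    for p in procs:
        p.join()             # now wait for all of them together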
I solved the problem of getting the Process result by moving the result check and status update into the worker function. Something like:
# Update task status if work is done
def update_status(task_id, func_result):
    # Connect to DB
    client = MongoClient('mongodb://localhost:27017/')
    db = client.admetric
    stream = db.stream
    # Update task status if OK
    if func_result:
        stream.update_one({'_id': task_id}, {"$set": {"status": True}}, upsert=False)
    # Close DB connection
    client.close()

# Do work
def yo_func(keyword):
    sleep(10)
    print('Yo:' + keyword)
    return True

# Worker function
def worker_email(keyword, task_id):
    update_status(task_id, yo_func(keyword))

python threading: cannot switch thread to daemon

I would expect the following code to run all of the os.walk iterations simultaneously, and only the filenames that randomly got a timeout of 0 to end up in the result list. All threads with a longer timeout should go into daemon mode and be killed as soon as the script reaches its end. However, the script respects all the timeouts for each thread.
Why is this happening? Shouldn't it put all the threads in the background and kill them if they don't finish and return a result before the end of the script's execution? Thank you.
import threading
import os
import time
import random

def check_file(file_name, timeout):
    time.sleep(timeout)
    print file_name
    result.append(file_name)

result = []
for home, dirs, files in os.walk("."):
    for ifile in files:
        filename = '/'.join([home, ifile])
        t = threading.Thread(target=check_file(filename, random.randint(0,5)))
        t.setDaemon(True)
        t.start()
print result
Solution: I found my mistake:
t = threading.Thread(target=check_file(filename,random.randint(0,5)))
has to be
t = threading.Thread(target=check_file, args=(filename,random.randint(0,5)))
In this case, threading will spawn a thread with the function as an object and give it the arguments. In my initial example, the function with its args had to be resolved BEFORE the thread spawned. And this is fair.
However, the example above works for me on 2.7.3, but on 2.7.2 I cannot get it to work.
I'm getting an exception that
function check_file accepts exactly 1 argument (34 is given).
Solution:
On 2.7.2 I had to put a trailing comma in the args tuple, since I had only 1 variable. God knows why this doesn't affect the 2.7.3 version. It was
t = threading.Thread(target=check_file, args=(filename))
and started to work with
t = threading.Thread(target=check_file, args=(filename,))
I understand what you were trying to do, but you're not using the right format for threading. I fixed your example; look up the Queue class to see how to do this properly.
Secondly, never ever do string manipulation on file paths. Use the os.path module; there's a lot more to it than just adding separators between strings, things you and I don't think about most of the time.
Good luck!
import threading
import os
import time
import random
import Queue

def check_file():
    while True:
        item = q.get()
        time.sleep(item[1])
        print item
        q.task_done()

q = Queue.Queue()
result = []
for home, dirs, files in os.walk("."):
    for ifile in files:
        filename = os.path.join(home, ifile)
        q.put((filename, random.randint(0,5)))

number_of_threads = 25
for i in range(number_of_threads):
    t = threading.Thread(target=check_file)
    t.daemon = True
    t.start()

q.join()
print result

Execute blocking calls in parallel in Python

I need to make a blocking XML-RPC call from my Python script to several physical servers simultaneously, and perform actions based on the response from each server independently.
To explain in detail, let us assume the following pseudo code:
while True:
    response = call_to_server1()  # blocking and takes very long time
    if response == this:
        do that
I want to do this for all the servers simultaneously and independently, but from the same script.
Use the threading module.
Boilerplate threading code (I can tailor this if you give me a little more detail on what you are trying to accomplish)
import threading

def run_me(func):
    while not stop_event.isSet():
        response = func()  # blocking and takes very long time
        if response == this:
            do that

def call_to_server1():
    # code to call server 1...
    return magic_server1_call()

def call_to_server2():
    # code to call server 2...
    return magic_server2_call()

# used to stop your loop.
stop_event = threading.Event()

t = threading.Thread(target=run_me, args=(call_to_server1,))
t.start()

t2 = threading.Thread(target=run_me, args=(call_to_server2,))
t2.start()

# wait for threads to return.
t.join()
t2.join()

# we are done....
You can use the multiprocessing module:
import multiprocessing

def call_to_server(ip, port):
    ....
    ....

process = []
for i in xrange(server_count):
    process.append(multiprocessing.Process(target=call_to_server, args=(ip, port)))
    process[i].start()

# waiting for the processes to stop
for p in process:
    p.join()
You can use multiprocessing plus queues. With one single sub-process, this is an example:
import multiprocessing
import time

def processWorker(input, result):
    def remoteRequest(params):
        ## this is my remote request
        return True
    while True:
        work = input.get()
        if 'STOP' in work:
            break
        result.put(remoteRequest(work))

input = multiprocessing.Queue()
result = multiprocessing.Queue()
p = multiprocessing.Process(target=processWorker, args=(input, result))
p.start()
requestlist = ['1', '2']
for req in requestlist:
    input.put(req)
for i in xrange(len(requestlist)):
    res = result.get(block=True)
    print 'retrieved ', res
input.put('STOP')
time.sleep(1)
print 'done'
To have more than one sub-process, simply use a list object to store all the sub-processes you start.
The multiprocessing queue is a process-safe object.
Then you can keep track of which request is being executed by each sub-process simply by storing the request associated with a workid (the workid can be a counter incremented when the queue gets filled with new work). Usage of multiprocessing.Queue is robust since you do not need to rely on stdout/stderr parsing and you also avoid the related limitations.
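A minimal sketch of that workid idea (it assumes processWorker is changed to unpack (workid, payload) from the input queue and to put (workid, result) back on the result queue):

pending = {}                              # workid -> original request
workid = 0
for req in requestlist:
    pending[workid] = req
    input.put((workid, req))              # tag each request with its workid
    workid += 1

for i in xrange(len(requestlist)):
    wid, res = result.get(block=True)     # worker returns (workid, result)
    print 'request', pending[wid], '->', res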
Then, you can also set a timeout on how long you want a get call to wait at most, e.g.:
import Queue

try:
    res = result.get(block=True, timeout=10)
except Queue.Empty:
    print error
Use Twisted.
It has a lot of useful tools for working with the network, and it is also very good at working asynchronously.
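For instance, a rough sketch with Twisted's XML-RPC client, where twisted.web.xmlrpc.Proxy returns Deferreds so several calls can be in flight at once (the URLs and the method name are placeholders):

from twisted.internet import reactor
from twisted.web.xmlrpc import Proxy

def on_response(response, server):
    if response == 'this':            # placeholder for your real check
        print 'do that for', server

def on_error(failure, server):
    print server, 'failed:', failure

for server in ['http://server1:8080/RPC2', 'http://server2:8080/RPC2']:
    d = Proxy(server).callRemote('some_method')   # placeholder method name
    d.addCallback(on_response, server)
    d.addErrback(on_error, server)

reactor.run()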
