Python multiprocessing does not exit

I had code that ran successfully but took too long to run, so I decided to try to parallelize it.
Here is a simplified version of the code:
import multiprocessing as mp
import os
import sys
import time

import numpy as np

output = mp.Queue()

def calcSum(Nstart, Nstop, output):
    pid = os.getpid()
    for s in range(Nstart, Nstop):
        file_name = 'model' + str(s) + '.pdb'
        file = 'modelMap' + str(pid) + '.dat'
        # does something with the contents of the pdb file
        # creates another file by using some other library:
        someVar.someFunc(file_name=file)
        # uses a function to read the file
        density += readFile(file)
        os.remove(file)
        print pid, s
    output.put(density)

if __name__ == '__main__':
    t0 = time.time()
    snapshots = int(sys.argv[1])
    cpuNum = int(sys.argv[2])

    rangeSet = np.zeros((cpuNum)) + snapshots // cpuNum
    for i in range(snapshots % cpuNum):
        rangeSet[i] += 1

    processes = []
    for c in range(cpuNum):
        na, nb = (np.sum(rangeSet[:c]) + 1, np.sum(rangeSet[:c + 1]))
        processes.append(mp.Process(target=calcSum, args=(int(na), int(nb), output)))

    for p in processes:
        p.start()
    print 'now i''m here'

    results = [output.get() for p in processes]
    print 'now i''m there'

    for p in processes:
        p.join()
    print 'think i''l stay around'

    t1 = time.time()
    print len(results)
    print (t1 - t0)
I run this code with the command python run.py 10 4.
This code prints pid and s successfully from the loop inside calcSum, and I can also see in the terminal that two CPUs are at 100%. What happens is that eventually the entries for s = 5 and s = 10 are printed, then the CPU usage drops to zero, and nothing happens. None of the subsequent print statements execute, and the script still looks like it's running in the terminal. I'm guessing that the processes do not exit. Is that the case, and how can I fix it?
Here's the complete output:
$ python run.py 10 4
now im here
9600
9601
9602
9603
9602 7
9603 9
9601 4
9600 1
now im there
9602 8
9600 2
9601 5
9603 10
9600 3
9601 6
At that point I have to terminate the script with Ctrl+C.
A few other notes:
If I comment out os.remove(file), I can see the created files in the directory.
Unfortunately, I cannot bypass the part where a file is created and then read within calcSum.
EDIT: At first, switching the order of output.get() and p.join() worked, but after some other edits to the code this is no longer the case. I have updated the code above.
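For reference, a pattern that avoids this kind of hang is to let a Pool collect the results instead of managing a Queue by hand: Pool.map() returns one result per work item and re-raises exceptions from the workers, so the parent cannot end up blocked on a get() that will never be satisfied. The sketch below only mimics the structure of the code above; the worker body and the snapshot ranges are placeholders, not the real pdb processing.
import multiprocessing as mp

def calcSum(bounds):
    # placeholder worker: the real one reads model<N>.pdb files and
    # accumulates a density from them
    Nstart, Nstop = bounds
    density = 0
    for s in range(Nstart, Nstop):
        density += s  # stand-in for the per-snapshot work
    return density

if __name__ == '__main__':
    # same kind of split as rangeSet above (10 snapshots over 4 CPUs)
    bounds = [(1, 4), (4, 7), (7, 9), (9, 11)]
    pool = mp.Pool(processes=4)
    try:
        results = pool.map(calcSum, bounds)  # blocks until all workers return
    finally:
        pool.close()
        pool.join()
    print(len(results))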

Related

Running lots of processes modifying the same list with Python Multiprocessing results in OSError: Too many open files

I can't get the following code to execute successfully; it always ends with this error:
OSError: [Errno 24] Too many open files
raised from the most recent call, p.start().
As you can see, I already tried to execute the processes in chunks of 500, since they run fine when only 500 are executed, but after the second loop I still receive the above error.
I suspect that the processes are not properly closed after execution, but I could not figure out how to check for and properly close them.
Here is my code:
import multiprocessing

manager = multiprocessing.Manager()
manager_lists = manager.list()

def multiprocessing_lists(a_list):
    thread_lists = []
    for r in range(400):
        thread_lists.append(a_list)
    manager_lists.extend(thread_lists)

manager = multiprocessing.Manager()
manager_lists = manager.list()

processes = []
counter = 0
chunk_size = 500
all_processes = 4000

for i in range(0, all_processes, chunk_size):
    for j in range(chunk_size):
        a_list = [1,2,3,4,5,6,7,8,9]
        p = multiprocessing.Process(target=multiprocessing_lists, args=(a_list,))
        processes.append(p)
        p.start()
        counter += 1

for process in processes:
    process.join()
    process.terminate()

print(len(manager_lists))
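A common fix (a sketch, not from the original post; it assumes that 500 concurrent processes stay within your file-descriptor limit, which matches the observation that a single chunk of 500 runs fine) is to join each chunk before starting the next one, so the pipes and descriptors of finished processes are released before new ones are created:
import multiprocessing

def multiprocessing_lists(shared_list, a_list):
    # same work as the original worker: append 400 copies of a_list
    shared_list.extend([a_list] * 400)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    manager_lists = manager.list()

    a_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    chunk_size = 500
    all_processes = 4000

    for start in range(0, all_processes, chunk_size):
        chunk = [multiprocessing.Process(target=multiprocessing_lists,
                                         args=(manager_lists, a_list))
                 for _ in range(chunk_size)]
        for p in chunk:
            p.start()
        # joining here frees each process's file descriptors
        # before the next chunk is launched
        for p in chunk:
            p.join()

    print(len(manager_lists))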

How to execute commands on the Linux platform in different threads in Python?

I want to execute two separate commands on the command line and read each command's output in Python. My approach is to execute these commands at a certain time interval, i.e. every x seconds.
I have two commands, say command1 and command2. command1 takes a maximum of 30 seconds to print its output to the console; command2 takes a maximum of 10 seconds.
I want to execute command1 and command2 in a new thread each time, i.e. after every x seconds.
Program code:
import os, sys
import thread, threading
import time

def read_abc_data():
    with open("abc.txt", "a") as myfile:
        output = os.popen('command1').read()
        myfile.write(output + "\n\n")

def abc(threadName):
    while True:
        threading.Thread(target=read_abc_data).start()
        time.sleep(10)

def read_pqr_data():
    with open("pqr.txt", "a") as myfile:
        output = os.popen('command2').read()
        myfile.write(output + "\n\n")

def pqr(threadName):
    while True:
        threading.Thread(target=read_pqr_data).start()
        time.sleep(10)

if __name__ == "__main__":
    try:
        thread.start_new_thread(abc, ("Thread-1",))
        thread.start_new_thread(pqr, ("Thread-2",))
    except:
        print "Error: unable to start thread"

    while 1:
        pass
Currently I have a 10-second sleep (delay) between executions of the read_abc_data() and read_pqr_data() functions. After running this program for 1 minute, abc.txt is empty. I think the reason is that command1 didn't produce its complete output within 10 seconds, right?
I want abc.txt and pqr.txt to contain the commands' output. Am I missing something?
Since the commands produce output while they run, you're better off reading the output as a stream. This can be done with subprocess and readline:
import subprocess

def read_abc_data():
    with open("abc.txt", "a") as myfile:
        process = subprocess.Popen('command1', stdout=subprocess.PIPE)
        for line in iter(process.stdout.readline, ''):
            myfile.write(line)
Try this code. It uses locks to protect the file during updates, and a single function that is executed by each thread.
import time
import subprocess
import threading
from thread import start_new_thread

command1 = "ls"
command2 = "date"
file_name1 = "/tmp/one"
file_name2 = "/tmp/two"

def my_function(command, file_name, lock):
    process_obj = subprocess.Popen(command, stdout=subprocess.PIPE)
    command_output, command_error = process_obj.communicate()
    print command_output

    lock.acquire()
    with open(file_name, 'a+') as f:
        f.write(command_output)
    print 'Writing'
    lock.release()

if __name__ == '__main__':
    keep_running = True
    lock1 = threading.Lock()
    lock2 = threading.Lock()

    while keep_running:
        try:
            start_new_thread(my_function, (command1, file_name1, lock1))
            start_new_thread(my_function, (command2, file_name2, lock2))
            time.sleep(10)
        except KeyboardInterrupt, e:
            keep_running = False
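As an aside (not part of either answer above): the legacy thread module is deprecated, and the same scheduling loop can be written with threading.Thread alone. This is only a sketch with the same placeholder commands and file names as above:
import threading
import time
import subprocess

def run_and_append(command, file_name, lock):
    # run the command to completion, then append its output under the lock
    output = subprocess.Popen(command, stdout=subprocess.PIPE).communicate()[0]
    with lock:
        with open(file_name, 'ab') as f:  # 'ab' so byte output is written as-is
            f.write(output)

if __name__ == '__main__':
    lock1, lock2 = threading.Lock(), threading.Lock()
    try:
        while True:
            threading.Thread(target=run_and_append,
                             args=(["ls"], "/tmp/one", lock1)).start()
            threading.Thread(target=run_and_append,
                             args=(["date"], "/tmp/two", lock2)).start()
            time.sleep(10)  # launch a fresh pair of threads every 10 seconds
    except KeyboardInterrupt:
        pass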

Python Pool.map() - locally works, on server fails

I've researched many pool.map questions on SO and still can't find anything that hints at my issue.
I have if __name__ == '__main__' in every .py file, and I have freeze_support() in each .py that contains import multiprocessing, yet I am still at a loss for what is happening. I've moved the freeze_support() around in my code with the same unsuccessful results.
Script A calls Script B, and Script B calls Script C (where the multiprocessing happens). Locally this scenario works perfectly, but when I load it onto a Windows Server 2008 machine, strange things start happening.
On the server I can see the first iterable printed to the interpreter, but it then jumps back to Script B and keeps processing. There are 51 other items in the list for Script C.
Script B Code:
if not arcpy.Exists(MergedDataFC):
    ScriptC.intersect_main(input1, input2)

if not arcpy.Exists(MergedDataSHP):
    shpList = arcpy.ListFields(*.shp)  # output of multiprocess
    # Merge all shapefiles into single shapefile
    # Being executed before the multiprocess finishes all 52 items
Script C Code:
import multiprocessing as mp

def intersect_main(input1, input2):
    try:
        mp.freeze_support()
        # Create a list of states for input1 polygons
        log.log("Creating Polygon State list...")
        fldList = arcpy.ListFields(input1)
        flds = [fld.name for fld in fldList]
        idList = []
        with arcpy.da.SearchCursor(input1, flds) as cursor:
            for row in cursor:
                idSTATE = row[flds.index("STATE")]
                idList.append(idSTATE)
        idList = set(idList)
        log.log("There are " + str(len(idList)) + " States (polygons) to process.")
        log.log("Sending to pool")
        # declare number of cores to use, use 1 less than the max
        cpuNum = mp.cpu_count() - 1
        # Create the pool object
        pool = mp.Pool(processes=cpuNum)
        # Fire off list to worker function.
        # res is a list that is created with what ever the worker function is returning
        log.log("Entering intersectWork")
        res = pool.map((intersectWork(input1, input2, idSTATE)), idList)
        pool.close()
        pool.join()
        # If an error has occurred report it
        if False in res:
            log.log("A worker failed!")
            log.log(strftime('[%H:%M:%S]', localtime()))
            raise Exception
        else:
            log.log("Finished multiprocessing!")
            log.log(strftime('[%H:%M:%S]', localtime()))
    except Exception, e:
        tb = sys.exc_info()[2]
        # Geoprocessor threw an error
        log.log("An error occurred on line " + str(tb.tb_lineno))
        log.log(str(e))

def intersectWork(input1, input2, idSTATE):
    try:
        if idSTATE == None:
            query = "STATE IS NULL"
            idSTATE = 'pr'
        else:
            query = "STATE = '" + idSTATE + "'"
        DEMOlayer = arcpy.MakeFeatureLayer_management(input1, "input1_" + idSTATE)
        log.log(query)
        arcpy.SelectLayerByAttribute_management(DEMOlayer, "NEW_SELECTION", query)
        # Do the Intersect
        outFC = r'C:/EclipseWorkspace' + '/INTER_' + idSTATE.upper() + '.shp'
        strIntersect = str(DEMOlayer) + ";" + str(input2)
        arcpy.Intersect_analysis(strIntersect, outFC, "ALL", "", "LINE")
        return True
    except:
        # Some error occurred so return False
        log.log(arcpy.GetMessage(2))
        return False

if __name__ == '__main__':
    intersect_main(input1, input2)
Edit
All the data on the server is stored locally; there is no across-network processing.
The issue was that the full path to the data wasn't being properly passed into pool.map() on the server from the previous modules. I had to add all the file paths under the import statements. It's not very elegant looking, but it's working.
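Independent of the path problem, note that the call res = pool.map((intersectWork(input1, input2, idSTATE)), idList) invokes intersectWork once in the parent and hands its return value to map, rather than handing map the function itself. A common pattern when the worker needs extra fixed arguments is to bundle them with each varying value (or to use functools.partial). The sketch below uses a placeholder worker; the arcpy geoprocessing is omitted:
import multiprocessing as mp

def intersectWork(args):
    # placeholder worker: unpack the bundled arguments and report success
    input1, input2, idSTATE = args
    return idSTATE is not None

def intersect_main(input1, input2, idList):
    pool = mp.Pool(processes=max(mp.cpu_count() - 1, 1))
    try:
        # one tuple per state: the two fixed inputs plus the varying idSTATE
        res = pool.map(intersectWork, [(input1, input2, s) for s in idList])
    finally:
        pool.close()
        pool.join()
    return res

if __name__ == '__main__':
    print(intersect_main("demo1.shp", "demo2.shp", ["NY", "PA", None]))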

The multiprocessing python module is unable to put back the processed results

I'm writing a TCP SYN scanner that checks for all the opened ports. The script is able to get all the opened ports by making use of multiple cores. At the end of the script, when trying to fetch the results using the get() method, the script becomes non-functional. On a keyboard interrupt, the traceback shown below the code appears. When I use 2 cores the script runs fine, but when the loop is made to run 3 or more times (utilizing 3 or more cores), the script gets stuck. Any suggestions on how to proceed?
==============Code is below=====================================
#!/usr/bin/python
import multiprocessing as mp
from scapy.all import *
import sys
import time

results = []
output = mp.Queue()
processes = []

def portScan(ports, output):
    ip = sys.argv[1]
    for port in range(ports - 100, ports):
        response = sr1(IP(dst=ip)/TCP(dport=port, flags="S"), verbose=False, timeout=.2)
        if response:
            if response[TCP].flags == 18:
                print "port number ======> %d <====== Status: OPEN" % (port)
                output.put(port)

ports = 0
for loop in range(4):
    ports += 100
    print "Ports %d sent as the argument" % ports
    processes.append(mp.Process(target=portScan, args=(ports, output)))

for p in processes:
    p.start()

for p in processes:
    p.join()

results = [output.get() for p in processes]
===========Output======================
./tcpSynmultiprocess.py 10.0.2.1
WARNING: No route found for IPv6 destination :: (no default route?)
Ports 100 sent as the argument
Ports 200 sent as the argument
Ports 300 sent as the argument
port number ======> 23 <====== Status: OPEN
port number ======> 80 <====== Status: OPEN
^CTraceback (most recent call last):
===========TraceBack===================
^CTraceback (most recent call last):
File "./tcpSynmultiprocess.py", line 43, in <module>
results = [output.get() for p in processes]
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
KeyboardInterrupt
By default, Queue.get() blocks until it has data to return, which it never will if all the processes have already ended and the queue is empty.
You can use output.get(False) to avoid blocking when nothing is left (you'll have to handle the Queue.Empty exception).
Or, since the number of queued items can also be bigger than the number of processes, you should rather iterate over Queue.qsize() instead of processes:
results = [output.get() for x in range(output.qsize())]
#!/usr/bin/python
import multiprocessing as mp
from scapy.all import *
import sys
import time

results = []
output = mp.Queue()
processes = []

def portScan(ports, output):
    ip = sys.argv[1]
    for port in range(ports - 100, ports):
        response = sr1(IP(dst=ip)/TCP(dport=port, flags="S"), verbose=False, timeout=.2)
        if response:
            if response[TCP].flags == 18:
                print "port number ======> %d <====== Status: OPEN" % (port)
                output.put(port)

ports = 0
for loop in range(4):
    ports += 100
    print "Ports %d sent as the argument" % ports
    processes.append(mp.Process(target=portScan, args=(ports, output)))

for p in processes:
    p.start()

for p in processes:
    p.join()

for size in range(output.qsize()):
    try:
        results.append(output.get())
    except:
        print "Nothing fetched from the Queue..."

print results
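One caveat to add (not from the original answer): Queue.qsize() is not available on every platform (it raises NotImplementedError on macOS), and get() with no timeout can still block forever if a worker died before putting anything. A more defensive sketch, with a trivial placeholder worker instead of the scapy scan, drains the queue with a timeout while the workers are running and stops once they have all exited:
import multiprocessing as mp
try:
    from queue import Empty  # Python 3
except ImportError:
    from Queue import Empty  # Python 2

def worker(n, output):
    # placeholder worker: queue a few numbers instead of scanning ports
    for i in range(n):
        output.put(i)

if __name__ == '__main__':
    output = mp.Queue()
    processes = [mp.Process(target=worker, args=(100, output)) for _ in range(4)]
    for p in processes:
        p.start()

    results = []
    # keep draining while any worker is alive or items remain;
    # Empty just means "nothing available right now"
    while any(p.is_alive() for p in processes) or not output.empty():
        try:
            results.append(output.get(timeout=0.1))
        except Empty:
            pass

    for p in processes:
        p.join()
    print(len(results))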

resource.getrusage() always returns 0

At the end of a script, I would like to return the peak memory usage. After reading other questions, here is my script:
#!/usr/bin/env python
import sys, os, resource, platform
print platform.platform(), platform.python_version()
os.system("grep 'VmRSS' /proc/%s/status" % os.getpid())
print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
dat = [x for x in xrange(10000000)]
os.system("grep 'VmRSS' /proc/%s/status" % os.getpid())
print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
and here is what I get:
$ test.py
Linux-2.6.18-194.26.1.el5-x86_64-with-redhat-5.5-Final 2.7.2
VmRSS: 4472 kB
0
VmRSS: 322684 kB
0
Why does resource.getrusage always return 0?
The same thing happens interactively in a terminal. Could this be due to the way Python was installed on my machine? (It's a computer cluster that I share with others and that is managed by admins.)
Edit: the same thing happens when I use subprocess; executing this script
#!/usr/bin/env python
import sys, os, resource, platform
from subprocess import Popen, PIPE
print platform.platform(), platform.python_version()
p = Popen(["grep", "VmRSS", "/proc/%s/status" % os.getpid()], shell=False, stdout=PIPE)
print p.communicate()
print "resource:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
dat = [x for x in xrange(10000000)]
p = Popen(["grep", "VmRSS", "/proc/%s/status" % os.getpid()], shell=False, stdout=PIPE)
print p.communicate()
print "resource:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
gives this:
$ test.py
Linux-2.6.18-194.26.1.el5-x86_64-with-redhat-5.5-Final 2.7.2
('VmRSS:\t 4940 kB\n', None)
resource: 0
('VmRSS:\t 323152 kB\n', None)
resource: 0
Here's a way to replace the os.system call:
In [131]: from subprocess import Popen, PIPE
In [132]: p = Popen(["grep", "VmRSS", "/proc/%s/status" % os.getpid()], shell=False, stdout=PIPE)
In [133]: p.communicate()
Out[133]: ('VmRSS:\t 340832 kB\n', None)
I also have no issue running the line you were having trouble with:
In [134]: print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
340840
Edit
The rusage issue could well be kernel dependent and simply not available on your Red Hat distribution: http://bytes.com/topic/python/answers/22489-getrusage
You could, of course, have a separate thread in your code that checks the current usage throughout the execution and stores the highest value observed.
Edit 2
Here's a full solution that skips resource and monitors usage via Popen. The checking frequency must of course be high enough to be useful, but not so frequent that it eats all the CPU.
#!/usr/bin/env python
import threading
import time
import re
import os
from subprocess import Popen, PIPE

maxUsage = 0
keepThreadRunning = True

def memWatch(freq=20):
    global maxUsage
    global keepThreadRunning
    while keepThreadRunning:
        p = Popen(["grep", "VmRSS", "/proc/%s/status" % os.getpid()],
                  shell=False, stdout=PIPE)
        curUsage = int(re.search(r'\d+', p.communicate()[0]).group())
        if curUsage > maxUsage:
            maxUsage = curUsage
        time.sleep(1.0 / freq)

if __name__ == "__main__":
    t = threading.Thread(target=memWatch)
    t.start()
    print maxUsage
    [p for p in range(1000000)]
    print maxUsage
    [str(p) for p in range(1000000)]
    print maxUsage
    keepThreadRunning = False
    t.join()
The memWatch function can be optimized by calculating the sleep time once, not rebuilding the path to the process's status file on every iteration, and compiling the regular expression before entering the while loop. But all in all, I hope that was the functionality you sought.
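A sketch of those optimizations (the status-file path, the compiled regular expression, and the sleep interval are all computed once, outside the loop; otherwise it mirrors the code above):
import os
import re
import threading
import time
from subprocess import Popen, PIPE

maxUsage = 0
keepThreadRunning = True

def memWatch(freq=20):
    global maxUsage
    # hoist everything that does not change out of the loop
    status_path = "/proc/%d/status" % os.getpid()
    pattern = re.compile(r"\d+")
    interval = 1.0 / freq
    while keepThreadRunning:
        out = Popen(["grep", "VmRSS", status_path], stdout=PIPE).communicate()[0]
        match = pattern.search(out.decode())
        if match:
            maxUsage = max(maxUsage, int(match.group()))
        time.sleep(interval)

if __name__ == "__main__":
    t = threading.Thread(target=memWatch)
    t.start()
    data = [str(p) for p in range(1000000)]  # allocate something to watch
    keepThreadRunning = False
    t.join()
    print(maxUsage)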
