Python multiprocessing hangs in a heavy multithreading situation

I'm struggling with a very weird Python behavior I cannot get rid of.
I created an external Python library which I import from inside my main Python script. This library receives a huge (up to 50 MB) JSON text document divided into sections. I need to parse each one of those sections and extract data with regexes.
To speed up this procedure, and knowing the limits of Python in scaling over multicore CPUs, I decided to use the multiprocessing library to create as many processes as there are available cores on the physical CPU.
The library splits the main JSON into sections and initializes the different multiprocessing.Process instances, passing a specific text section to each one of the processes.
I do it this way:
p_one = multiprocessing.Process(
    name=name,
    target=functionOne,
    args=(buffer1, id1)
)
p_one.start()

p_two = multiprocessing.Process(
    name=name,
    target=functionTwo,
    args=(buffer2, id2)
)
p_two.start()

p_three = multiprocessing.Process(
    name=name,
    target=functionThree,
    args=(buffer3, id3)
)
p_three.start()

p_four = multiprocessing.Process(
    name=name,
    target=functionFour,
    args=(buffer4, id4)
)
p_four.start()

p_one.join()
p_two.join()
p_three.join()
p_four.join()
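For reference, the same pattern can be written as a loop with a join timeout, which at least shows which child is stuck instead of blocking forever. This is only a sketch, not the real dispatcher code; workers and sections are hypothetical names standing for the four functions and their (buffer, id) pairs:

import multiprocessing

def run_workers(workers, sections, timeout=300):
    procs = []
    for func, (buf, section_id) in zip(workers, sections):
        p = multiprocessing.Process(name=func.__name__, target=func, args=(buf, section_id))
        p.start()
        procs.append(p)
    for p in procs:
        p.join(timeout)  # stop waiting after `timeout` seconds
        if p.is_alive():
            # the child is still running (or the join itself is stuck);
            # log it instead of blocking the whole dispatcher
            print("%s (pid %s) did not finish within %s seconds" % (p.name, p.pid, timeout))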
This usually works; however, from time to time one of the join() calls hangs, which blocks the whole library and prevents my main script from going ahead.
The child processes are not crashing, though: they use Google's re2 as the regex library and they finish their parsing routine.
As said above, this doesn't happen every time; it's quite random. If I kill the whole process and restart it with the same buffer of strings, it works perfectly, so the regex rules are not wrong and they are not hanging anything.
I tried the multiprocessing map function first, but it ended up creating zombie processes, so I switched to multiprocessing.Process.
I also checked with Pyrasite, and the blocked thread hangs on the join call:
"/usr/local/lib/python2.7/dist-packages/xxxxx/dispatcher.py", line 401, in p_domain
p_two.join()
File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
res = self._popen.wait(timeout)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
return self.poll(0)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
pid, sts = os.waitpid(self.pid, flag)
Do you have any hint or suggestion that could help me understand why this is happening and how to get it fixed?
Many thanks!
* Addition *
This is an example of a piece of code called inside the multiprocessing.Process() child:
def check_domains(buf, id):
    warnings = []
    related_list = []
    strings, _ = parse_json_string(buf)
    for string in strings:
        _current_result = _check_single_domain(string)
        # If no result, add a warning message
        if _current_result is False:
            warnings.append("String timed out in DOMAIN section.")
        else:
            if _current_result is not None:
                related_list.append(_current_result)
    related_list = list(set(related_list))
    [...cut...]
def _check_single_domain(string):
    global DOMAIN_RESULT
    # Check string length
    if len(string) > MAX_STRING_LENGTH_FOR_DOMAIN:
        return None
    # Check for unacceptable characters inside the string
    for unacceptable in UNACCEPTABLE_DOMAIN_CHARACTERS:
        if unacceptable in string:
            return None
    # Check that the string contains all required characters
    for required in REQUIRED_DOMAIN_CHARACTERS:
        if required not in string:
            return None
    # Try to match the string against the Domain regex using a Thread with a timeout
    thread = threading.Thread(name='Thread-DOMAIN', target=_perform_regex_against_string, args=(string, ))
    thread.setDaemon(True)
    thread.start()
    thread.join(TIMEOUT_REGEX_FOR_DOMAIN_IN_SECONDS)
    # If a timeout occurred, return False, meaning no result was obtained
    if thread.isAlive():
        return False
    if DOMAIN_RESULT is None:
        return None
    # A domain cannot start or end with a dot character
    if DOMAIN_RESULT.endswith(".") or DOMAIN_RESULT.startswith("."):
        return None
    return DOMAIN_RESULT

def _perform_regex_against_string(string):
    global DOMAIN_RESULT
    # Set result to default value
    DOMAIN_RESULT = None
    # Regex for Domains
    matched = re.search(REGEX, string, re.IGNORECASE)
    if matched:
        DOMAIN_RESULT = ''.join(matched.groups())
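For comparison, the same per-string regex timeout can be written without the module-level DOMAIN_RESULT global by giving each call its own result container. This is only a sketch (plain re is used here, and the pattern and timeout are assumed to be the constants defined above):

import re
import threading

def _regex_with_timeout(pattern, string, timeout_seconds):
    # Returns the joined groups on a match, None on no match,
    # and False if the regex did not finish within timeout_seconds.
    result = {'value': None}  # per-call container instead of a global

    def worker():
        matched = re.search(pattern, string, re.IGNORECASE)
        if matched:
            result['value'] = ''.join(matched.groups())

    t = threading.Thread(name='Thread-DOMAIN', target=worker)
    t.daemon = True
    t.start()
    t.join(timeout_seconds)
    if t.is_alive():
        return False
    return result['value']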

Related

Why does a result() from concurrent.futures.as_completed() occasionally return None?

I've been writing a fairly complex series of modules to help test our poorly documented networking gear, this one focused on trying the various passwords used across the company. I've been using the concurrent.futures module to speed things along by testing many devices in parallel, but I am running into a problem where occasionally a result comes back as None and it's a mystery to me as to why. It revolves around this chunk of code:
def process_device(dev):
    check = CheckPass(creds, inv_list)
    dev = switch.Switch(**dev)
    online, net_device = check.device_online(dev)
    if online == True:
        conn, net_device = check.try_passwords(dev, return_conn=True)
        if conn != False and conn != "none":
            if conn != True:
                hostname, model, serial_num, version, date = net_device.get_id_details(conn)
                net_device.hostname = hostname
                net_device.model = model
                net_device.serial_num = serial_num
                net_device.version = version
                net_device.date = date
                conn.disconnect()
            conf_dev = check.confirm_device(net_device)
            check.dev_confirmed.append(conf_dev)
            return dev.hostname, dev.ip

with concurrent.futures.ThreadPoolExecutor(max_workers=120) as executor:
    threads = {executor.submit(process_device, dev): dev for dev in inv_list}
    for future in concurrent.futures.as_completed(threads):
        name, ip = future.result()
At unpredictable intervals future.result() will have a result of None and I can't reliably reproduce it. The error occurs with a different device every time; I've checked the logs, fed it an inventory list containing just the device that it had been processing, and running that device by itself works fine. I have tried using both fewer and more workers. Clearly I'm not understanding something about how future.result() works. dev.hostname and dev.ip always have values, so process_device() should always return them (barring unhandled exceptions, which haven't occurred), yet I always end up with TypeError: Cannot unpack non-iterable NoneType object referencing the name, ip = future.result() line.
It is not a problem with the future but with your function process_device(), which can sometimes return None.
When online is False, or when if conn != False and conn != "none": evaluates to False, process_device() falls through to the default return None:
def process_device(dev):
    # ... your code ...
    return None  # default behavior
You should filter the results:

result = future.result()
if result:
    name, ip = result
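Putting it together, the as_completed loop from the question could guard against None results like this (just a sketch; threads is the future-to-device dict already built above):

for future in concurrent.futures.as_completed(threads):
    dev = threads[future]          # the inventory entry this future came from
    result = future.result()       # may be None when process_device() fell through
    if result is None:
        print("Skipped device (offline or no working password):", dev)
        continue
    name, ip = result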

How to know how many threads / workers from a multiprocessing pool (Python module) have been completed?

I am using the Impala shell to compute some stats over a text file containing table names.
I am using the Python multiprocessing module to pool the processes.
The thing is, the task is very time consuming, so I need to keep track of how many files have been completed in order to see the job's progress.
So let me give you some idea of the functions that I am using.
job_executor is the function that takes a list of tables and performs the tasks.
main() is the function that takes the file location and the number of executors (pool_workers), converts the file containing the tables into a list of tables and does the multiprocessing.
I want to see the progress, i.e. how many files have been processed by job_executor, but I can't find a solution. Using a counter also doesn't work.
def job_executor(text):
    impala_cmd = "impala-shell -i %s -q 'compute stats %s.%s'" % (impala_node, db_name, text)
    impala_cmd_res = os.system(impala_cmd)  # runs the impala command
    # check the execution result (success or fail)
    if impala_cmd_res == 0:
        print("invalidated the metadata.")
    else:
        print("error while performing the operation.")

def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()  # list of all the tables in the file
    process_pool = Pool(NUM_OF_EXECUTORS)
    try:
        process_pool.map(job_executor, text_file_rows)
        process_pool.close()
        process_pool.join()
    except Exception:
        process_pool.terminate()
        process_pool.join()

def parse_args():
    """
    function to take scraping arguments from test_hr.sh file
    """
    parser = argparse.ArgumentParser(description='Main Process file that will start the process and session too.')
    parser.add_argument("text_file_path",
                        help='provide text file path/location to be read. ')  # text file path
    parser.add_argument("pool_executors",
                        help='please provide pool executors as an initial argument')  # pool_executor count
    return parser.parse_args()  # returns list/tuple of all arguments.

if __name__ == "__main__":
    mail_message_start()
    main(parse_args())
    mail_message_end()
If you insist on needlessly doing it via multiprocessing.pool.Pool(), the easiest way to keep track of what's going on is to use a non-blocking mapping (i.e. multiprocessing.pool.Pool.map_async()):
def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()
    total_processes = len(text_file_rows)  # keep the number of lines for reference
    process_pool = Pool(NUM_OF_EXECUTORS)
    try:
        print('Processing {} lines.'.format(total_processes))
        processing = process_pool.map_async(job_executor, text_file_rows)
        processes_left = total_processes  # number of lines left to process
        while not processing.ready():  # start a loop to wait for all to finish
            if processes_left != processing._number_left:
                processes_left = processing._number_left
                print('Processed {} out of {} lines...'.format(
                    total_processes - processes_left, total_processes))
            time.sleep(0.1)  # let it breathe a little, don't forget to `import time`
        print('All done!')
        process_pool.close()
        process_pool.join()
    except Exception:
        process_pool.terminate()
        process_pool.join()
This will check every 100 ms whether some of the processes have finished, and if something has changed since the last check it will print the number of lines processed so far. If you need more insight into what's going on in your subprocesses, you can use shared structures like multiprocessing.Queue() or a multiprocessing.Manager() to report directly from within your processes.
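If you'd rather not rely on the private _number_left attribute, a sketch of the same progress report built on Pool.imap_unordered(), reusing job_executor and the argument handling from above, could look like this:

def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)
    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()
    total = len(text_file_rows)
    process_pool = Pool(NUM_OF_EXECUTORS)
    try:
        # imap_unordered yields one item per finished task, in completion order,
        # so counting the yielded items tracks the progress directly
        for done, _ in enumerate(process_pool.imap_unordered(job_executor, text_file_rows), 1):
            print('Processed {} out of {} lines...'.format(done, total))
        process_pool.close()
        process_pool.join()
    except Exception:
        process_pool.terminate()
        process_pool.join()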

How to append to a list using the map function?

I'm trying to use map to apply the ping_all function to a list of hosts.
The problem I'm having is that inside the ping_all function I'm trying to append all failed hosts to a list. Normally I would call the ping_all function, passing in the empty list as an argument and returning the modified list, but since I'm using map here, I'm not sure how to achieve that.
import os
import json
import argparse
import requests
from subprocess import check_output
from multiprocessing import Pool

parser = argparse.ArgumentParser(description='test')
args = parser.parse_args()

dead_hosts = []

def gather_hosts():
    """ Returns all environments from opsnode and puts them in a dict """
    host_list = []
    url = 'http://test.com/hosts.json'
    opsnode = requests.get(url)
    content = json.loads(opsnode.text)
    for server in content["host"]:
        if server.startswith("ip-10-12") and server.endswith(".va.test.com"):
            host_list.append(str(server))
    return host_list

def try_ping(hostnames):
    try:
        hoststatus = check_output(["ping", "-c 1", hostnames])
        print "Success:", hostnames
    except:
        print "\033[1;31mPing Failed:\033[1;m", hostnames
        global dead_hosts
        dead_hosts.append(hostnames)

def show_dead_hosts(dead_hosts):
    print '\033[1;31m******************* Following Hosts are Unreachable ******************* \n\n\033[1;m'
    for i in dead_hosts:
        print '\033[1;31m{0} \033[1;m'.format(i)

if __name__ == '__main__':
    hostnames = gather_hosts()
    pool = Pool(processes=30)  # process per core
    pool.map(try_ping, hostnames, dead_hosts)
    show_dead_hosts(dead_hosts)
I tried passing dead_hosts as a second argument to map, but after running this script dead_hosts remains an empty list; it does not appear that the hosts are being appended to the list.
What am I doing wrong?
There are several issues with your code:
The third argument to Pool.map is the chunksize, so passing dead_hosts (a list) is definitely incorrect.
You can't access globals when using a multiprocessing Pool because the tasks in the pool run in separate processes. See Python multiprocessing global variable updates not returned to parent for more details.
Related to the previous point, Pool.map should return a result list (since global side-effects will be mostly invisible). Right now you're just calling it and throwing away the result.
Your format codes weren't properly clearing in my terminal, so everything was turning bold+red...
Here's a version that I've updated and tested—I think it does what you want:
import os
import json
import argparse
import requests
from subprocess import check_output
from multiprocessing import Pool

parser = argparse.ArgumentParser(description='test')
args = parser.parse_args()

def gather_hosts():
    """ Returns all environments from opsnode and puts them in a dict """
    host_list = []
    url = 'http://test.com/hosts.json'
    opsnode = requests.get(url)
    content = json.loads(opsnode.text)
    for server in content["host"]:
        if server.startswith("ip-10-12") and server.endswith(".va.test.com"):
            host_list.append(str(server))
    return host_list

def try_ping(host):
    try:
        hoststatus = check_output(["ping", "-c 1", "-t 1", host])
        print "Success:", host
        return None
    except:
        print "\033[1;31mPing Failed:\033[0m", host
        return host

def show_dead_hosts(dead_hosts):
    print '\033[1;31m******************* Following Hosts are Unreachable ******************* \n\n\033[0m'
    for x in dead_hosts:
        print '\033[1;31m{0} \033[0m'.format(x)

def main():
    hostnames = gather_hosts()
    pool = Pool(processes=30)  # process per core
    identity = lambda x: x
    dead_hosts = filter(identity, pool.map(try_ping, hostnames))
    show_dead_hosts(dead_hosts)

if __name__ == '__main__':
    main()
The main change that I've made is that try_ping either returns None on success, or the host's name on failure. The pings are done in parallel by your task pool, and the results are aggregated into a new list. I run a filter over the list to get rid of all of the None values (None is "falsey" in Python), leaving only the hostnames that failed the ping test.
You'll probably want to get rid of the print statements in try_ping. I'm assuming you just had those for debugging.
You could also consider using imap and ifilter if you need more asynchrony.
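For example, a lazy variant of the same idea with Pool.imap and itertools.ifilter might look like this (a sketch, not tested against the original environment):

from itertools import ifilter

def main():
    hostnames = gather_hosts()
    pool = Pool(processes=30)
    # imap yields results as the workers produce them instead of waiting for the
    # whole list, and ifilter(None, ...) lazily drops the None (successful) entries
    dead_hosts = list(ifilter(None, pool.imap(try_ping, hostnames)))
    show_dead_hosts(dead_hosts)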
Your try_ping function doesn't actually return anything. If I were you, I wouldn't bother with having dead_hosts outside of the function but inside the try_ping function. And then you should return that list.
I'm not familiar with the modules you're using so I don't know if pool.map can take lists.

How to shut down an httplib2 request when it is too long

I have a pretty annoying issue at the moment. When I make an httplib2 request for a page that is way too large, I would like to be able to stop it cleanly.
For example:
from httplib2 import Http
url = 'http://media.blubrry.com/podacademy/p/content.blubrry.com/podacademy/Neuroscience_and_Society_1.mp3'
h = Http(timeout=5)
h.request(url, 'GET')
In this example, the url is a podcast and it will keep being downloaded forever. My main process will hang indefinitely in this situation.
I have tried running it in a separate thread using this code and then deleting my object straight away.
import Queue
from threading import Thread

def http_worker(url, q):
    h = Http()
    print 'Http worker getting %s' % url
    q.put(h.request(url, 'GET'))

def process(url):
    q = Queue.Queue()
    t = Thread(target=http_worker, args=(url, q))
    t.start()
    tid = t.ident
    t.join(3)
    if t.isAlive():
        try:
            del t
            print 'deleting t'
        except:
            print 'error deleting t'
    else:
        print q.get()
    check_thread(tid)

process(url)
Unfortunately, the thread is still active and will continue to consume cpu / memory.
def check_thread(tid):
    import sys
    print 'Thread id %s is still active ? %s' % (tid, tid in sys._current_frames().keys())
Thank you.
OK, I found a hack to deal with this issue.
The best solution so far is to set a maximum amount of data to read and then stop reading from the socket. The data is read by the _safe_read method of the httplib module. In order to override this method, I used this lib: http://blog.rabidgeek.com/?tag=wraptools
And voilà:
from httplib import HTTPResponse, IncompleteRead, MAXAMOUNT
from wraptools import wraps

@wraps(HTTPResponse._safe_read)
def _safe_read(original_method, self, amt):
    """Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).

    Note that we cannot distinguish between EOF and an interrupt when zero
    bytes have been read. IncompleteRead() will be raised in this
    situation.

    This function should be used when <amt> bytes "should" be present for
    reading. If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
    # NOTE(gps): As of svn r74426 socket._fileobject.read(x) will never
    # return less than x bytes unless EOF is encountered. It now handles
    # signal interruptions (socket.error EINTR) internally. This code
    # never caught that exception anyways. It seems largely pointless.
    # self.fp.read(amt) will work fine.
    s = []
    total = 0
    MAX_FILE_SIZE = 3*10**6
    while amt > 0 and total < MAX_FILE_SIZE:
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        if not chunk:
            raise IncompleteRead(''.join(s), amt)
        total = total + len(chunk)
        s.append(chunk)
        amt -= len(chunk)
    return ''.join(s)
In this case, MAX_FILE_SIZE is set to 3 MB.
Hopefully, this will help others.
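For completeness, once the patch module is imported before httplib2 is used, the original example should no longer download forever; the body is simply cut off around MAX_FILE_SIZE. A usage sketch (safe_read_patch is a hypothetical module name holding the code above):

import safe_read_patch  # hypothetical module containing the @wraps patch above
from httplib2 import Http

url = 'http://media.blubrry.com/podacademy/p/content.blubrry.com/podacademy/Neuroscience_and_Society_1.mp3'
h = Http(timeout=5)
response, content = h.request(url, 'GET')
print len(content)  # truncated to roughly MAX_FILE_SIZE by the patched _safe_read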

Python process communication via pipes: race condition

So I have two Python 3.2 processes that need to communicate with each other. Most of the information that needs to be communicated consists of standard dictionaries. Named pipes seemed like the way to go, so I made a Pipe class that can be instantiated in both processes. This class implements a very basic protocol for getting information around.
My problem is that sometimes it works, sometimes it doesn't. There seems to be no pattern to this behavior except the place where the code fails.
Here are the bits of the Pipe class that matter. Shout if you want more code:
class Pipe:
    """
    there are a bunch of constants set up here. I dont think it would be useful
    to include them. Just think like this: Pipe.WHATEVER = 'WHATEVER'
    """
    def __init__(self, sPath):
        """
        create the fifo. if it already exists just associate with it
        """
        self.sPath = sPath
        if not os.path.exists(sPath):
            os.mkfifo(sPath)
        self.iFH = os.open(sPath, os.O_RDWR | os.O_NONBLOCK)
        self.iFHBlocking = os.open(sPath, os.O_RDWR)

    def write(self, dMessage):
        """
        write the dict to the fifo
        if dMessage is not a dictionary then there will be an exception here. There never is
        """
        self.writeln(Pipe.MESSAGE_START)
        for k in dMessage:
            self.writeln(Pipe.KEY)
            self.writeln(k)
            self.writeln(Pipe.VALUE)
            self.writeln(dMessage[k])
        self.writeln(Pipe.MESSAGE_END)

    def writeln(self, s):
        os.write(self.iFH, bytes('{0} : {1}\n'.format(Pipe.LINE_START, len(s)+1), 'utf-8'))
        os.write(self.iFH, bytes('{0}\n'.format(s), 'utf-8'))
        os.write(self.iFH, bytes(Pipe.LINE_END+'\n', 'utf-8'))

    def readln(self):
        """
        look for LINE_START, get line size
        read until LINE_END
        clean up
        return string
        """
        iLineStartBaseLength = len(self.LINE_START)+3  # '{0} : '
        try:
            s = os.read(self.iFH, iLineStartBaseLength).decode('utf-8')
        except:
            return Pipe.READLINE_FAIL
        if Pipe.LINE_START in s:
            # get the length of the line
            sLineLen = ''
            while True:
                try:
                    sCurrent = os.read(self.iFH, 1).decode('utf-8')
                except:
                    return Pipe.READLINE_FAIL
                if sCurrent == '\n':
                    break
                sLineLen += sCurrent
            try:
                iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
            except:
                raise Exception('Not a valid line length: "{0}"'.format(sLineLen))
            # read the line
            sLine = os.read(self.iFHBlocking, iLineLen).decode('utf-8')
            # read the line terminator
            sTerm = os.read(self.iFH, len(Pipe.LINE_END+'\n')).decode('utf-8')
            if sTerm == Pipe.LINE_END+'\n':
                return sLine
            return Pipe.READLINE_FAIL
        else:
            return Pipe.READLINE_FAIL

    def read(self):
        """
        read from the fifo, make a dict
        """
        dRet = {}
        sKey = ''
        sValue = ''
        sCurrent = None

        def value_flush():
            nonlocal dRet, sKey, sValue, sCurrent
            if sKey:
                dRet[sKey.strip()] = sValue.strip()
            sKey = ''
            sValue = ''
            sCurrent = ''

        if self.message_start():
            while True:
                sLine = self.readln()
                if Pipe.MESSAGE_END in sLine:
                    value_flush()
                    return dRet
                elif Pipe.KEY in sLine:
                    value_flush()
                    sCurrent = Pipe.KEY
                elif Pipe.VALUE in sLine:
                    sCurrent = Pipe.VALUE
                else:
                    if sCurrent == Pipe.VALUE:
                        sValue += sLine
                    elif sCurrent == Pipe.KEY:
                        sKey += sLine
        else:
            return Pipe.NO_MESSAGE
It sometimes fails here (in readln):
try:
    iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
except:
    raise Exception('Not a valid line length: "{0}"'.format(sLineLen))
It doesn't fail anywhere else.
An example error is:
Not a valid line length: "KE 17"
The fact that it's intermittent says to me that it's due to some kind of race condition; I'm just struggling to figure out what it might be. Any ideas?
EDIT added stuff about calling processes
The Pipe is used by instantiating it in process A and process B with the same path passed to the constructor. Process A will then intermittently write to the Pipe and process B will try to read from it. At no point do I ever try to make the thing act as a two-way channel.
Here is a more long-winded explanation of the situation. I've been trying to keep the question short but I think it's about time I give up on that. Anyhoo, I have a daemon and a Pyramid process that need to play nice. There are two Pipe instances in use: one that only Pyramid writes to, and one that only the daemon writes to. The stuff Pyramid writes is really short; I have experienced no errors on this pipe. The stuff that the daemon writes is much longer; this is the pipe that's giving me grief. Both pipes are implemented in the same way. Both processes only write dictionaries to their respective Pipes (if this were not the case then there would be an exception in Pipe.write).
The basic algorithm is: Pyramid spawns the daemon, the daemon loads a craze object hierarchy of doom and vast RAM consumption. Pyramid sends POST requests to the daemon, which then does a whole bunch of calculations and sends data to Pyramid so that a human-friendly page can be rendered. The human can then respond to what's in the hierarchy by filling in HTML forms and suchlike, thus causing Pyramid to send another dictionary to the daemon, and the daemon to send back a dictionary response.
So: only one pipe has exhibited any problems, the problem pipe has a lot more traffic than the other one, and it is guaranteed that only dictionaries are written to either.
EDIT as response to question and comment
Before you tell me to take out the try...except stuff, read on.
The fact that the exception gets raised at all is what is bothering me. iLineLen = int(stuff) looks to me like it should always be passed a string that looks like an integer. This is the case only most of the time, not all of it. So if you feel the urge to comment about how it's probably not an integer, please don't.
To paraphrase my question: Spot the race condition and you will be my hero.
EDIT a little example:
process_1.py:

oP = Pipe(some_path)
while 1:
    oP.write({'a':'foo','b':'bar','c':'erm...','d':'plop!','e':'etc'})

process_2.py:

oP = Pipe(same_path_as_before)
while 1:
    print(oP.read())
After playing around with the code, I suspect the problem is coming from how you are reading the file.
Specifically, lines like this:
os.read(self.iFH, iLineStartBaseLength)
That call doesn't necessarily return iLineStartBaseLength bytes - it might consume "LI", then return READLINE_FAIL and retry. On the second attempt, it will get the remainder of the line, and somehow end up giving the non-numeric string to the int() call.
The unpredictability likely comes from how the fifo is being flushed - if it happens to flush when the complete line is written, all is fine. If it flushes when the line is half-written, weirdness.
At least in the hacked-up version of the script I ended up with, the oP.read() call in process_2.py often got a different dict to the one sent (where the KEY might bleed into the previous VALUE and other strangeness).
I might be mistaken, as I had to make a bunch of changes to get the code running on OS X, and further while experimenting. My modified code here
Not sure exactly how to fix it, but with the json module or similar, the protocol/parsing can be greatly simplified - newline-separated JSON data is much easier to parse:
import os
import time
import json
import errno

def retry_write(*args, **kwargs):
    """Like os.write, but retries until EAGAIN stops appearing
    """
    while True:
        try:
            return os.write(*args, **kwargs)
        except OSError as e:
            if e.errno == errno.EAGAIN:
                time.sleep(0.5)
            else:
                raise

class Pipe(object):
    """FIFO based IPC based on newline-separated JSON
    """
    ENCODING = 'utf-8'

    def __init__(self, sPath):
        self.sPath = sPath
        if not os.path.exists(sPath):
            os.mkfifo(sPath)
        self.fd = os.open(sPath, os.O_RDWR | os.O_NONBLOCK)
        self.file_blocking = open(sPath, "r", encoding=self.ENCODING)

    def write(self, dmsg):
        serialised = json.dumps(dmsg) + "\n"
        dat = bytes(serialised.encode(self.ENCODING))
        # This blocks until data can be read by other process.
        # Can just use os.write and ignore EAGAIN if you want
        # to drop the data
        retry_write(self.fd, dat)

    def read(self):
        serialised = self.file_blocking.readline()
        return json.loads(serialised)
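Used the same way as in the asker's small example, the writer and reader would then become (a sketch reusing the paths from the question):

# process_1.py
oP = Pipe(some_path)
while 1:
    oP.write({'a':'foo','b':'bar','c':'erm...','d':'plop!','e':'etc'})

# process_2.py
oP = Pipe(same_path_as_before)
while 1:
    print(oP.read())  # each JSON line parses back into the original dict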
Try getting rid of the try:, except: blocks and seeing what exception is actually being thrown.
So replace your sample with just:
iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
I bet it'll now throw a ValueError, and it's because you're trying to cast "KE 17" to an int.
You'll need to strip more than string.whitespace and string.punctuation if you're going to cast the string to an int.
