Multiprocessing inside module creating loop instead of running target method - python

I'm trying to build a set of lambda calls which work in parallel to do some processing of data. These methods return their bits of data back to the parent process, which then combines it all.
I need to be able to run this locally as well for a testing script which runs several test cases and scores the accuracy of the processing. To do this, I mock the lambda calls and import the lambda as a module and execute the handler directly.
So my top-level lambda uses multiprocessing.Process to call methods which invoke the other lambdas, like so:
# Top Level Lambda
import json
import os
import multiprocessing as mp

from boto3 import client

l = client('lambda')

def process_data(data, conn):
    response = l.invoke(
        FunctionName=os.environ.get('PROCESS_FUNCTION'),
        InvocationType='RequestResponse',
        LogType='Tail',
        Payload=json.dumps({'data': data})
    )
    conn.send(response['Payload'].read())
    conn.close()

def create(datas):
    p_parent, p_child = mp.Pipe()
    process = mp.Process(target=process_data, args=(datas[0], p_child))
    process.start()
I've cut out a lot of code to give the gist here. I get an error on process.start()
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
I've tried putting freeze_support() calls in each lambda inside a __name__ == '__main__' block, I've tried putting it in the lambda handler, and I have tried putting it in the test script, but I always get the same error.
What really throws me here is that the new process doesn't call the target function at all; instead it runs the test script from the beginning, and the second call to invoke a new process inside that subprocess is what gives me the error.
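For what it's worth, under the spawn start method every child process re-imports the main module, so the test script itself has to keep any work that ends up spawning processes behind the guard. A minimal sketch of that shape (the module and function names here are hypothetical, not from the original code):
# test_runner.py -- hypothetical test harness
import multiprocessing as mp
from top_level_lambda import create  # the module shown above

def run_test_cases():
    # invoke the handler for each test case and score the results
    create([{'some': 'data'}])

if __name__ == '__main__':
    mp.freeze_support()  # only needed if you freeze to an executable
    run_test_cases()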

Related

Python multiprocessing - Subprocesses with same PID

Basically I'm trying to open a new process every time I call a function. The problem is that when I get the PID inside of the function, the PID is the same as in another functions even if the other functions haven't finished yet.
I'm wrapping my function with a decorator:
from multiprocessing import Pipe, Process

def run_in_process(function):
    """Runs a function isolated in a new process.

    Args:
        function (function): Function to execute.
    """
    def wrapper(*args):
        parent_connection, child_connection = Pipe()
        process = Process(target=function, args=(*args, child_connection))
        process.start()
        response = parent_connection.recv()
        process.join()
        return response
    return wrapper
And declaring the function like this:
import os

@run_in_process
def example(data, pipe):
    print(os.getpid())
    pipe.send("Just an example here!")
    pipe.close()
Note 1: this code is running inside an AWS Lambda.
Note 2: the lambdas don't finish before the next one starts, because each task takes at least 10 seconds.
(Screenshots of the logs for executions 1, 2, and 3 were attached here.)
You can see in the logs that each one is a different execution and that they are executed at the "same" time.
The question is: why do they have the same PID even though they are running concurrently? Shouldn't they have different PIDs?
I absolutely need to execute this function in an isolated process.
Your Lambda function could have been running in multiple containers at once in the AWS cloud. If you've been heavily testing your function with multiple concurrent requests, it is quite possible that the AWS orchestration created additional instances to handle the traffic.
With serverless, you lose some of the visibility into exactly how your code is being executed, but does it really matter?
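If the goal is just to tell concurrent executions apart in the logs, one option is to tag each container with its own identifier at cold start instead of relying on the PID. A rough sketch (CONTAINER_ID is just an illustrative name, not part of the original code):
import os
import uuid

# Evaluated once per container at cold start; two invocations that log the
# same value are sharing a container (and therefore a PID namespace).
CONTAINER_ID = uuid.uuid4().hex

def example(data, pipe):
    print("container=%s pid=%s" % (CONTAINER_ID, os.getpid()))
    pipe.send("Just an example here!")
    pipe.close()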

Python running subprocesses without waiting whilst still receiving return codes

I have found questions related to mine but cannot find one that solves my problem.
The problem
I am building a program that monitors several directories, then spawns a subprocess based on directory or particular filename.
These subprocesses can often take up to several hours to complete (for example when rendering thousands of PDFs). Because of this, I would like to know the best way for the program to keep monitoring the folders in parallel with the subprocess that is still running, and to be able to spawn additional subprocesses, as long as they are of a different type from the subprocess currently running.
Once a subprocess has completed, the program should receive its return code, and that subprocess type should become available to run again.
Code as it stands
This is the simple code that runs the program currently, calling functions when a file is found:
import os
import time

while 1:
    paths_to_watch = ['/dir1', '/dir2', '/dir3', '/dir4']
    after = {}
    for x in paths_to_watch:
        key = x
        after.update({key: [f for f in os.listdir(x)]})
    for key, files in after.items():
        if (key == '/dir1'):
            function1(files)
        elif (key == '/dir2'):
            function2(files)
        elif (key == '/dir3'):
            function3(files)
        elif (key == '/dir4'):
            function3(files)
    time.sleep(10)
Of course this means that the program waits for the process to be finished before it continues to check for files in paths_to_watch
From other questions, it looks like this is something that could be handled with process pools, however my lack of knowledge in this area means I do not know where to start.
I am assuming that you can use threads rather than processes, an assumption that will hold up if your functions function1 through function4 are predominantly I/O bound. Otherwise you should substitute ProcessPoolExecutor for ThreadPoolExecutor in the code below. Right now your program loops indefinitely, so the threads too will never terminate. I am also assuming that functions function1 through function4 have unique implementations.
import os
import time
from concurrent.futures import ThreadPoolExecutor

def function1(files):
    pass

def function2(files):
    pass

def function3(files):
    pass

def function4(files):
    pass

def process_path(path, function):
    while True:
        files = os.listdir(path)
        function(files)
        time.sleep(10)

def main():
    paths_to_watch = ['/dir1', '/dir2', '/dir3', '/dir4']
    functions = [function1, function2, function3, function4]
    with ThreadPoolExecutor(max_workers=len(paths_to_watch)) as executor:
        results = executor.map(process_path, paths_to_watch, functions)
        for result in results:
            # threads never return so we never get a result
            print(result)

if __name__ == '__main__':
    main()
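If the functions turn out to be CPU-bound after all, the swap mentioned above should only touch the import and the executor line; the if __name__ == '__main__': guard already in place is then mandatory. A sketch of the two lines that change:
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=len(paths_to_watch)) as executor:
    results = executor.map(process_path, paths_to_watch, functions)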

Python variables not defined after if __name__ == '__main__'

I'm trying to divvy up the task of looking up historical stock price data for a list of symbols by using Pool from the multiprocessing library.
This works great until I try to use the data I get back. I have my hist_price function defined and it outputs to a list-of-dicts pcl. I can print(pcl) and it has been flawless, but if I try to print(pcl) after the if __name__=='__main__': block, it blows up saying pcl is undefined. I've tried declaring global pcl in a couple places but it doesn't make a difference.
from multiprocessing import Pool

syms = ['List', 'of', 'symbols']

def hist_price(sym):
    #... lots of code looking up data, calculations, building dicts...
    stlh = {"Sym": sym, "10D Max": pcmax, "10D Min": pcmin}  # simplified
    return stlh

#global pcl
if __name__ == '__main__':
    pool = Pool(4)
    #global pcl
    pcl = pool.map(hist_price, syms)
    print(pcl)  # this works
    pool.close()
    pool.join()

print(pcl)  # says pcl is undefined
#...rest of my code, dependent on pcl...
I've also tried removing the if __name__=='__main__': block but it gives me a RuntimeError telling me specifically to put it back. Is there some other way to make variables available for use outside of the if block?
I think there are two parts to your issue. The first is "what's wrong with pcl in the current code?", and the second is "why do I need the if __name__ == "__main__" guard block at all?".
Let's address them in order. The problem with the pcl variable is that it is only defined in the if block, so if the module gets loaded without being run as a script (which is what sets __name__ == "__main__"), it will not be defined when the later code runs.
To fix this, you can change how your code is structured. The simplest fix would be to guard the other bits of the code that use pcl within an if __name__ == "__main__" block too (e.g. indent them all under the current block, perhaps). An alternative fix would be to put the code that uses pcl into functions (which can be declared outside the guard block), then call the functions from within an if __name__ == "__main__" block. That would look something like this:
def do_stuff_with_pcl(pcl):
    print(pcl)

if __name__ == "__main__":
    # multiprocessing code, etc
    pcl = ...
    do_stuff_with_pcl(pcl)
As for why the issue came up in the first place, the ultimate cause is using the multiprocessing module on Windows. You can read about the issue in the documentation.
When multiprocessing creates a new process for its Pool, it needs to initialize that process with a copy of the current module's state. Because Windows doesn't have fork (which copies the parent process's memory into a child process automatically), Python needs to set everything up from scratch. In each child process, it loads the module from its file, and if the module's top-level code tried to create a new Pool, you'd have a recursive situation where each child process would start spawning a whole new set of child processes of its own.
The multiprocessing code has some guards against that, I think (so you won't fork bomb yourself out of simple carelessness), but you still need to do some of the work yourself too, by using if __name__ == "__main__" to guard any code that shouldn't be run in the child processes.
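Putting those two points together, a minimal self-contained sketch of the recommended shape (the worker body here is a stand-in for the real price lookup):
from multiprocessing import Pool

syms = ['List', 'of', 'symbols']

def hist_price(sym):
    return {"Sym": sym}  # stand-in for the real lookups and calculations

def do_stuff_with_pcl(pcl):
    print(pcl)  # ...the rest of the pcl-dependent code goes in functions like this

if __name__ == '__main__':
    with Pool(4) as pool:
        pcl = pool.map(hist_price, syms)
    do_stuff_with_pcl(pcl)
When a child process re-imports this module, it only sees the function and syms definitions; it never reaches the Pool creation, so there is no recursive spawning.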

multiprocessing.Pool: calling helper functions when using apply_async's callback option

How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000 file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to characterize each file. This is done on a 16 CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
import os

def dirwalker(directory):
    ahlala = []

    # X() reads files and grabs lines, calls helper function to calculate
    # info, and returns stuff to the callback function
    def X(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # Y() reads other types of files and does the same thing
    def Y(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # results() is the callback function
    def results(r):
        ahlala.extend(r)  # or .append, haven't yet decided

    # helper function
    def Z(arr):
        return fileinfo  # to X() or Y()!

    for _, _, files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if (filetype(f) == filetypeX):
                pool.apply_async(X, args=(f,), callback=results)
            elif (filetype(f) == filetypeY):
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close(); pool.join()
    return ahlala
Note, the code works if I put all of Z(), the helper function, into either X(), Y(), or results(), but isn't that repetitive, and possibly slower than it needs to be? I know that the callback function is called for every function call, but when is the callback function called? Is it after pool.apply_async()...finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: are daemon processes why nothing shows up? I am also very confused about how to queue things, and whether that is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at a noticeable cost in time?
You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback will be executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to get the results from all the worker processes. After the thread pulls a result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so it's important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result will be placed into the result_queue by the worker process, then the result-handling thread will pull the result off of the result_queue, and your callback will be executed.
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
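Another way to surface those failures, if you'd rather keep apply_async, is to hold on to the AsyncResult objects and call get() on them once the pool is done; get() re-raises any exception that happened in the worker. A sketch:
async_results = []
for f in files:
    async_results.append(pool.apply_async(X, args=(f,), callback=results))
pool.close()
pool.join()
for ar in async_results:
    ar.get()  # re-raises the worker's exception here, if there was one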
The reason the workers are probably failing (at least in the example code you provided) is because X and Y are defined as functions inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to successfully unpickle them in the worker process. To fix this, define both functions at the top level of your module, rather than embedded inside the dirwalker function.
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os

# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
    fileinfo = Z(f)
    return fileinfo

# Y() reads other types of files and does the same thing
def Y(f):
    fileinfo = Z(f)
    return fileinfo

# helper function
def Z(arr):
    return arr + "zzz"

def dirwalker(directory):
    ahlala = []

    # results() is the callback function
    def results(r):
        ahlala.append(r)  # or .extend, haven't yet decided

    for _, _, files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if len(f) > 5:  # Just an arbitrary thing to split up the list with
                # In Python 3 there's an error_callback you can pass here to
                # handle errors; it's not available in Python 2.7 though :(
                pool.apply_async(X, args=(f,), callback=results)
            else:
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close()
        pool.join()
    return ahlala

if __name__ == "__main__":
    print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = mp.Manager()
helper_dict = m.dict()
for f in files:
    if len(f) > 5:
        pool.apply_async(X, args=(f, helper_dict), callback=results)
    else:
        pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
The caveat is that this works by creating a server process that holds a normal dict, with all your other processes talking to that one dict via a proxy object. So every time you read or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.
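For completeness, here is a small self-contained sketch of that pattern, with the worker taking the proxy dict as its second argument (the names reuse the example above; the inputs are made up):
import multiprocessing as mp

def Z(arr):
    return arr + "zzz"

def X(f, helper_dict):
    info = Z(f)
    helper_dict[f] = info  # every read/write on the proxy is an IPC round trip
    return info

if __name__ == "__main__":
    m = mp.Manager()
    helper_dict = m.dict()
    pool = mp.Pool(2)
    for name in ["foo", "barbaz"]:
        pool.apply_async(X, args=(name, helper_dict))
    pool.close()
    pool.join()
    print(dict(helper_dict))  # {'foo': 'foozzz', 'barbaz': 'barbazzzz'}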

Multiprocessing program - Not Executing in parallel

I am currently scraping data from various sites.
The code for the scrapers is stored in modules (x, y, z, a, b).
Where x.dump is a function which uses files for storing the scraped data.
The dump function takes a single argument 'input'.
Note: the dump functions are not all the same.
I am trying to run each of these dump functions in parallel.
The following code runs fine.
But I have noticed that it still follows serial order (x, then y, ...) for execution.
Is this the correct way of going about the problem?
Are multithreading and multiprocessing the only native ways for parallel programming?
from multiprocessing import Process

import x.x as x
import y.y as y
import z.z as z
import a.a as a
import b.b as b

input = ""
f_list = [x.dump, y.dump, z.dump, a.dump, b.dump]

processes = []
for function in f_list:
    processes.append(Process(target=function, args=(input,)))

for process in processes:
    process.run()

for process in processes:
    process.join()
That's because run() is the method that implements the task itself; you're not meant to call it from outside like that. You are supposed to call start(), which spawns a new process that then calls run() in the other process and returns control to you so you can do more work (and later join()).
You should be calling process.start() not process.run()
The start method does the work of starting the extra process and then running the run method in that process.
Python docs
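So the corrected loop from the question would look like this; all five dump calls then overlap instead of running one after another:
for function in f_list:
    processes.append(Process(target=function, args=(input,)))

for process in processes:
    process.start()  # spawns the child process; run() executes there

for process in processes:
    process.join()   # wait for all of them to finish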
