Python multiprocessing - Subprocesses with same PID - python

Basically I'm trying to open a new process every time I call a function. The problem is that when I get the PID inside of the function, the PID is the same as in another functions even if the other functions haven't finished yet.
I'm wrapping my function with a decorator:
def run_in_process(function):
"""Runs a function isolated in a new process.
Args:
function (function): Function to execute.
"""
def wrapper(*args):
parent_connection, child_connection = Pipe()
process = Process(target=function, args=(*args, child_connection))
process.start()
response = parent_connection.recv()
process.join()
return response
return wrapper
And declaring the function like this:
#run_in_process
def example(data, pipe):
print(os.getpid())
pipe.send("Just an example here!")
pipe.close()
Obs1.:This code is running inside a AWS Lambda.
Obs2.: Those lambdas didn't finish before the other one starts, because this tasks takes at least 10 seconds.
Log of execution 1
Log of execution 2
Log of execution 3
You can see at the logs that each one is a different execution and they are executed at the "same" time.
The question is: Why they have the same PID even knowing that they are running concurrently? Shouldn't they have different PIDs?
I obligatorily need to execute this function in an isolated process

Your Lambda function could have been running in multiple containers at once in the AWS cloud. If you've been heavily testing your function with multiple concurrent requests, it is quite possible that the AWS orchestration created additional instances to handle the traffic.
With serverless, you lose some of the visibility into exactly how your code is being executed, but does it really matter?

Related

AWS Lambda function not running concurrently

I have an AWS Lambda function that gets invoked from another function. The first function processes the data and invokes the other when it is finished. The second function will get n instances to run at the same time.
For example the second function takes about 5 seconds (for each invoke) to run; I want this function to run all at the time they are invoked for a total run time of about 5 seconds.
The function takes longer than that and runs each function one at a time until the one prior is finished; this process takes 5*n seconds.
I see that I can scale the function to run up to 1,000 in my region as stated by AWS. How can I make this run concurrently? Don't need a code example, just a general process I can look into to fix the problem.
The first function header looks like this: (I have other code that gets the json_file that I left out)
def lambda_handler(event=None, context=None):
for n in range(len(json_file)):
response = client.invoke(
FunctionName='docker-selenium-lambda-prod-demo',
InvocationType='RequestResponse',
Payload=json.dumps(json_file[n])
)
responseJson = json.load(response['Payload'])
where json_file[n] is being sent to the other function to run.
As you can see in boto3 docs about invoke function:
Invokes a Lambda function. You can invoke a function synchronously (and wait for the response), or asynchronously. To invoke a function asynchronously, set InvocationType to Event .
If you are using RequestResponse, your code will wait until the lambda called is terminated.
You can either change InvocationType to Event or use something like ThreadPoolExecutor and wait until all executions are finished

How to start separate process that runs a function with multiple arguments?

So I'm having a though time wrapping my head around multiprocessing library and all the functionality. Basicly what I'm trying to accomplish is to start a separate process from a background thread that receives function object and it's positional and keyword arguments.
I have a thread that is started at the beginning and it's job is to execute functions that are passed to it via dependency injection. Once the thread detects that new job is scheduled it takes the job and executes it. The problem is that I have no idea how long that job will take and I would like to terminate it if let's say 10 minutes have passed. Since this can't be accomplished via threading module I decided to take a look at multiprocessing since it's processes can be terminated.
Dependency injection is solved via decorator that encapsulates each function (that is intended to be executed by the thread) that passes function object and it's positional and keyword arguments to the thread that's gonna execute it via * and **. The thread at the end gets all arguments and the function object (this works).
The problem begins when i try to create a Pool and assing work to a single worker. Since I have no idea of function input arguments, how am I able to use apply_async functin with * and **?
def intercept(callback):
def wrapper(*args, **kwargs):
# pass callback, args and kwargs to the thread
pass
return wrapper
#intercept
def do_some_work(first, second, third=None):
time.sleep(10)
def bg_thread():
while True:
# acquire callback, args and kwargs from intercept decorator
# if new job is scheduled create a process and execute it
# if process did not finish in timeout, terminate it
p = multiprocessing.Pool()
ret = p.apply_async(callback, args, kwargs)
p.close()
try:
ret.get(5)
except:
p.terminate()
t = threading.Thread(target=bg_thread)
t.start()
do_some_work()

multiprocessing.Pool: calling helper functions when using apply_async's callback option

How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000 file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to charecterize each file. This is done on a 16 CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
def dirwalker(directory):
ahlala = []
# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
fileinfo = Z(arr_of_lines)
return fileinfo
# Y() reads other types of files and does the same thing
def Y(f):
fileinfo = Z(arr_of_lines)
return fileinfo
# results() is the callback function
def results(r):
ahlala.extend(r) # or .append, haven't yet decided
# helper function
def Z(arr):
return fileinfo # to X() or Y()!
for _,_,files in os.walk(directory):
pool = mp.Pool(mp.cpu_count()
for f in files:
if (filetype(f) == filetypeX):
pool.apply_async(X, args=(f,), callback=results)
elif (filetype(f) == filetypeY):
pool.apply_async(Y, args=(f,), callback=results)
pool.close(); pool.join()
return ahlala
Note, the code works if I put all of Z(), the helper function, into either X(), Y(), or results(), but is this either repetitive or possibly slower than possible? I know that the callback function is called for every function call, but when is the callback function called? Is it after pool.apply_async()...finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: Are daemon processes why nothing shows up? I am also very confused about how to queue things, and if this is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at a noticable time inefficiency?
You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback will be executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to get the results from all the worker processes. After the thread pulls the result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so its important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result will be placed into the result_queue by the worker process, and then the result-handling thread will pull the result off of the result_queue, and your callback will be executed.
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
The reason the workers are probably failing (at least in the example code you provided) is because X and Y are defined as function inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process, and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to successfully unpickle them in the worker process. To fix this, define both functions at the top-level of your module, rather than embedded insice the dirwalker function.
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os
# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
fileinfo = Z(f)
return fileinfo
# Y() reads other types of files and does the same thing
def Y(f):
fileinfo = Z(f)
return fileinfo
# helper function
def Z(arr):
return arr + "zzz"
def dirwalker(directory):
ahlala = []
# results() is the callback function
def results(r):
ahlala.append(r) # or .append, haven't yet decided
for _,_,files in os.walk(directory):
pool = mp.Pool(mp.cpu_count())
for f in files:
if len(f) > 5: # Just an arbitrary thing to split up the list with
pool.apply_async(X, args=(f,), callback=results) # ,error_callback=handle_error # In Python 3, there's an error_callback you can use to handle errors. It's not available in Python 2.7 though :(
else:
pool.apply_async(Y, args=(f,), callback=results)
pool.close()
pool.join()
return ahlala
if __name__ == "__main__":
print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = multiprocessing.Manager()
helper_dict = m.dict()
for f in files:
if len(f) > 5:
pool.apply_async(X, args=(f, helper_dict), callback=results)
else:
pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
The caveat is that this worked by creating a server process that contains a normal dict, and all your other processes talk to that one dict via a Proxy object. So every time you read or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.

Multiprocessing program - Not Executing in parallel

I am currently scrapping data from various sites.
The code for the scrappers is stored in modules (x,y,z,a,b)
Where x.dump is a function which uses Files for storing the scraped data.
The dump function takes a single argument 'input'.
Note : All the dump functions are not same.
I am trying to run each of these dump function in parallel.
The following code runs fine.
But i have noticed that it still follows serial order x then y ... for execution.
Is this the correct way of going about the problem?
Are multithreading and multiprocessing the only native ways for parallel programming?
from multiprocessing import Process
import x.x as x
import y.y as y
import z.z as z
import a.a as a
import b.b as b
input = ""
f_list = [x.dump, y.dump, z.dump, a.dump, b.dump]
processes = []
for function in f_list:
processes.append(Process(target=function, args=(input,)))
for process in processes:
process.run()
for process in processes:
process.join()
That's because run() is the method to implement the task itself, you're not meant to call it from outside like that. You are supposed to call start() which spawns a new process which then calls run() in the other process and returns control to you so you can do more work (and later join()).
You should be calling process.start() not process.run()
The start method does the work of starting the extra process and then running the run method in that process.
Python docs

How to use multiprocessing with class instances in Python?

I am trying to create a class than can run a separate process to go do some work that takes a long time, launch a bunch of these from a main module and then wait for them all to finish. I want to launch the processes once and then keep feeding them things to do rather than creating and destroying processes. For example, maybe I have 10 servers running the dd command, then I want them all to scp a file, etc.
My ultimate goal is to create a class for each system that keeps track of the information for the system in which it is tied to like IP address, logs, runtime, etc. But that class must be able to launch a system command and then return execution back to the caller while that system command runs, to followup with the result of the system command later.
My attempt is failing because I cannot send an instance method of a class over the pipe to the subprocess via pickle. Those are not pickleable. I therefore tried to fix it various ways but I can't figure it out. How can my code be patched to do this? What good is multiprocessing if you can't send over anything useful?
Is there any good documentation of multiprocessing being used with class instances? The only way I can get the multiprocessing module to work is on simple functions. Every attempt to use it within a class instance has failed. Maybe I should pass events instead? I don't understand how to do that yet.
import multiprocessing
import sys
import re
class ProcessWorker(multiprocessing.Process):
"""
This class runs as a separate process to execute worker's commands in parallel
Once launched, it remains running, monitoring the task queue, until "None" is sent
"""
def __init__(self, task_q, result_q):
multiprocessing.Process.__init__(self)
self.task_q = task_q
self.result_q = result_q
return
def run(self):
"""
Overloaded function provided by multiprocessing.Process. Called upon start() signal
"""
proc_name = self.name
print '%s: Launched' % (proc_name)
while True:
next_task_list = self.task_q.get()
if next_task is None:
# Poison pill means shutdown
print '%s: Exiting' % (proc_name)
self.task_q.task_done()
break
next_task = next_task_list[0]
print '%s: %s' % (proc_name, next_task)
args = next_task_list[1]
kwargs = next_task_list[2]
answer = next_task(*args, **kwargs)
self.task_q.task_done()
self.result_q.put(answer)
return
# End of ProcessWorker class
class Worker(object):
"""
Launches a child process to run commands from derived classes in separate processes,
which sit and listen for something to do
This base class is called by each derived worker
"""
def __init__(self, config, index=None):
self.config = config
self.index = index
# Launce the ProcessWorker for anything that has an index value
if self.index is not None:
self.task_q = multiprocessing.JoinableQueue()
self.result_q = multiprocessing.Queue()
self.process_worker = ProcessWorker(self.task_q, self.result_q)
self.process_worker.start()
print "Got here"
# Process should be running and listening for functions to execute
return
def enqueue_process(target): # No self, since it is a decorator
"""
Used to place an command target from this class object into the task_q
NOTE: Any function decorated with this must use fetch_results() to get the
target task's result value
"""
def wrapper(self, *args, **kwargs):
self.task_q.put([target, args, kwargs]) # FAIL: target is a class instance method and can't be pickled!
return wrapper
def fetch_results(self):
"""
After all processes have been spawned by multiple modules, this command
is called on each one to retreive the results of the call.
This blocks until the execution of the item in the queue is complete
"""
self.task_q.join() # Wait for it to to finish
return self.result_q.get() # Return the result
#enqueue_process
def run_long_command(self, command):
print "I am running number % as process "%number, self.name
# In here, I will launch a subprocess to run a long-running system command
# p = Popen(command), etc
# p.wait(), etc
return
def close(self):
self.task_q.put(None)
self.task_q.join()
if __name__ == '__main__':
config = ["some value", "something else"]
index = 7
workers = []
for i in range(5):
worker = Worker(config, index)
worker.run_long_command("ls /")
workers.append(worker)
for worker in workers:
worker.fetch_results()
# Do more work... (this would actually be done in a distributor in another class)
for worker in workers:
worker.close()
Edit: I tried to move the ProcessWorker class and the creation of the multiprocessing queues outside of the Worker class and then tried to manually pickle the worker instance. Even that doesn't work and I get an error
RuntimeError: Queue objects should only be shared between processes
through inheritance
. But I am only passing references of those queues into the worker instance?? I am missing something fundamental. Here is the modified code from the main section:
if __name__ == '__main__':
config = ["some value", "something else"]
index = 7
workers = []
for i in range(1):
task_q = multiprocessing.JoinableQueue()
result_q = multiprocessing.Queue()
process_worker = ProcessWorker(task_q, result_q)
worker = Worker(config, index, process_worker, task_q, result_q)
something_to_look_at = pickle.dumps(worker) # FAIL: Doesn't like queues??
process_worker.start()
worker.run_long_command("ls /")
So, the problem was that I was assuming that Python was doing some sort of magic that is somehow different from the way that C++/fork() works. I somehow thought that Python only copied the class, not the whole program into a separate process. I seriously wasted days trying to get this to work because all of the talk about pickle serialization made me think that it actually sent everything over the pipe. I knew that certain things could not be sent over the pipe, but I thought my problem was that I was not packaging things up properly.
This all could have been avoided if the Python docs gave me a 10,000 ft view of what happens when this module is used. Sure, it tells me what the methods of multiprocess module does and gives me some basic examples, but what I want to know is what is the "Theory of Operation" behind the scenes! Here is the kind of information I could have used. Please chime in if my answer is off. It will help me learn.
When you run start a process using this module, the whole program is copied into another process. But since it is not the "__main__" process and my code was checking for that, it doesn't fire off yet another process infinitely. It just stops and sits out there waiting for something to do, like a zombie. Everything that was initialized in the parent at the time of calling multiprocess.Process() is all set up and ready to go. Once you put something in the multiprocess.Queue or shared memory, or pipe, etc. (however you are communicating), then the separate process receives it and gets to work. It can draw upon all imported modules and setup just as if it was the parent. However, once some internal state variables change in the parent or separate process, those changes are isolated. Once the process is spawned, it now becomes your job to keep them in sync if necessary, either through a queue, pipe, shared memory, etc.
I threw out the code and started over, but now I am only putting one extra function out in the ProcessWorker, an "execute" method that runs a command line. Pretty simple. I don't have to worry about launching and then closing a bunch of processes this way, which has caused me all kinds of instability and performance issues in the past in C++. When I switched to launching processes at the beginning and then passing messages to those waiting processes, my performance improved and it was very stable.
BTW, I looked at this link to get help, which threw me off because the example made me think that methods were being transported across the queues: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
The second example of the first section used "next_task()" that appeared (to me) to be executing a task received via the queue.
Instead of attempting to send a method itself (which is impractical), try sending a name of a method to execute.
Provided that each worker runs the same code, it's a matter of a simple getattr(self, task_name).
I'd pass tuples (task_name, task_args), where task_args were a dict to be directly fed to the task method:
next_task_name, next_task_args = self.task_q.get()
if next_task_name:
task = getattr(self, next_task_name)
answer = task(**next_task_args)
...
else:
# poison pill, shut down
break
REF: https://stackoverflow.com/a/14179779
Answer on Jan 6 at 6:03 by David Lynch is not factually correct when he says that he was misled by
http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html.
The code and examples provided are correct and work as advertised. next_task() is executing a task received via the queue -- try and understand what the Task.__call__() method is doing.
In my case what, tripped me up was syntax errors in my implementation of run(). It seems that the sub-process will not report this and just fails silently -- leaving things stuck in weird loops! Make sure you have some kind of syntax checker running e.g. Flymake/Pyflakes in Emacs.
Debugging via multiprocessing.log_to_stderr()F helped me narrow down the problem.

Categories