This is my first question here, and I hope I'm not opening a question very similar to an existing one. If that's the case, please excuse me!
So, the situation I'm having a bit of trouble with, is the following:
I would like to run independent Python scripts in parallel that can access the same Python objects, in my case a pandas DataFrame. My idea is that one script is basically constantly running and subscribes to a stream of data (here, data pushed via a websocket), which it then appends to a shared DataFrame. The second script should be able to be started independently of the first one and still access the DataFrame, which is constantly updated by the first script. In the second script I want to execute different kinds of analysis at predefined time intervals, or do other relatively time-intensive operations on the live data.
I've already tried to run all operations from within one script, but I kept getting disconnects from the websocket. In the future, multiple scripts are also supposed to access the shared data in real time.
Instead of saving a CSV file or pickle after every update in script 1, I would rather have a solution where both scripts basically share the same memory. I also only need one of the scripts to write to and append to the DataFrame; the other only needs to read from it. The multiprocessing module seems to have some interesting features that might help, but I couldn't really make any sense of it so far. I also looked at global variables, but that doesn't seem to be the right thing to use in this case either.
What I imagine is something like this (I know that the code is completely wrong and this is just for illustrative purposes):
The first script should keep assigning new data from the datastream to the dataframe and share this object.
from share_data import share
shared_df = pd.DataFrame()
for data in datastream:
    shared_df = shared_df.append(data)
    share(shared_df)
The second script should then be able to do the following:
from share_data import get
df = get(shared_df)
Is this at all possible, or do you have any ideas on how to accomplish this in a simple way? Or do you have any suggestions as to which packages might be good to use?
Best regards,
Ole
You already have quite the right sense of what you can do with your data.
The best solution depends on your actual needs,
so I will try to cover the possibilities with a working example.
What you want
If I understand your need correctly, you want to:
continuously update a DataFrame (from a websocket),
while doing some computations on the same DataFrame,
keeping the DataFrame up to date for the computation workers,
where one computation is CPU-intensive
and another is not.
What you need
As you said, you will need a way to run different threads or processes in order to keep the computation running.
How about Threads
The easiest way to achieve what you want would be to use the threading library.
Since threads can share memory, and you only have one worker actually updating the DataFrame, it is quite easy to propose a way to keep the data up to date:
import time
from dataclasses import dataclass
import pandas
from threading import Thread
@dataclass
class DataFrameHolder:
"""This dataclass holds a reference to the current DF in memory.
This is necessary if you do operations without in-place modification of
the DataFrame, since you will need to replace the whole object.
"""
dataframe: pandas.DataFrame = pandas.DataFrame(columns=['A', 'B'])
def update(self, data):
self.dataframe = self.dataframe.append(data, ignore_index=True)
class StreamLoader:
"""This class is our worker communicating with the websocket"""
def __init__(self, df_holder: DataFrameHolder) -> None:
super().__init__()
self.df_holder = df_holder
def update_df(self):
# read from websocket and update your DF.
data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
}
self.df_holder.update(data)
def run(self):
# limit loop for the showcase
for _ in range(5):
self.update_df()
print("[1] Updated DF %s" % str(self.df_holder.dataframe))
time.sleep(3)
class LightComputation:
"""This class is a random computation worker"""
def __init__(self, df_holder: DataFrameHolder) -> None:
super().__init__()
self.df_holder = df_holder
def compute(self):
print("[2] Current DF %s" % str(self.df_holder.dataframe))
def run(self):
# limit loop for the showcase
for _ in range(5):
self.compute()
time.sleep(5)
def main():
# We create a DataFrameHolder to keep our DataFrame available.
df_holder = DataFrameHolder()
# We create and start our update worker
stream = StreamLoader(df_holder)
stream_process = Thread(target=stream.run)
stream_process.start()
# We create and start our computation worker
compute = LightComputation(df_holder)
compute_process = Thread(target=compute.run)
compute_process.start()
# We join our Threads, i.e. we wait for them to finish before continuing
stream_process.join()
compute_process.join()
if __name__ == "__main__":
main()
Note that we use a class to hold a reference to the current DataFrame, because some operations like append are not necessarily in place;
thus, if we passed the reference to the DataFrame object directly, the modifications would be lost on the worker.
Here the DataFrameHolder object keeps the same location in memory, so the worker can still access the updated DataFrame.
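To make this concrete, here is a tiny standalone illustration of the rebinding issue (toy data, and pandas.concat instead of append, since DataFrame.append has been removed in recent pandas versions):
import pandas

df = pandas.DataFrame({'A': [0], 'B': [0]})
alias = df  # a worker holding a direct reference to the frame
df = pandas.concat([df, pandas.DataFrame({'A': [1], 'B': [2]})],
                   ignore_index=True)
print(len(alias), len(df))  # 1 2 -- the alias still sees the old frame
Sharing the DataFrameHolder instead avoids this, because only its dataframe attribute is rebound while the holder object itself stays put.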
Processes may be more powerful
Now if you need more computation power, processes may be more useful, since they let you run your worker on a different core without being constrained by the GIL.
To start a Process instead of a Thread in Python, you can use the multiprocessing library.
The API of both objects is the same, and you only have to change the constructor as follows:
from threading import Thread
# I create a thread
compute_process = Thread(target=compute.run)
from multiprocessing import Process
# I create a process that I can use the same way
compute_process = Process(target=compute.run)
Now if you try this change in the above script, you will see that the DataFrame is no longer updated correctly.
For this you will need more work, since the two processes don't share memory, and there are multiple ways of communicating between them (https://en.wikipedia.org/wiki/Inter-process_communication).
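For instance, one of the simplest options when only one process writes is a plain multiprocessing.Queue: the streaming process puts each chunk of new rows on the queue, and the computation process drains it into its own local copy. A minimal sketch with toy data (not the manager-based solution shown below):
import multiprocessing

import pandas


def stream(q):
    for i in range(3):  # stand-in for the websocket loop
        q.put({'A': [i], 'B': [i * 10]})
    q.put(None)  # sentinel: no more data


def compute(q):
    chunks = []
    while True:
        chunk = q.get()  # blocks until data arrives
        if chunk is None:
            break
        chunks.append(pandas.DataFrame(chunk))
    print(pandas.concat(chunks, ignore_index=True))


if __name__ == "__main__":
    q = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=stream, args=(q,))
    p2 = multiprocessing.Process(target=compute, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()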
The answer referenced by @SimonCrane is quite interesting on the matter and showcases the use of shared memory between two processes using a multiprocessing manager.
Here is a version with the computation worker using a separate process and a shared, managed DataFrameHolder:
import logging
import multiprocessing
import time
from dataclasses import dataclass
from multiprocessing import Process
from multiprocessing.managers import BaseManager
from threading import Thread
import pandas
@dataclass
class DataFrameHolder:
"""This dataclass holds a reference to the current DF in memory.
This is necessary if you do operations without in-place modification of
the DataFrame, since you will need to replace the whole object.
"""
dataframe: pandas.DataFrame = pandas.DataFrame(columns=['A', 'B'])
def update(self, data):
self.dataframe = self.dataframe.append(data, ignore_index=True)
def retrieve(self):
return self.dataframe
class DataFrameManager(BaseManager):
"""This dataclass handles shared DataFrameHolder.
See https://docs.python.org/3/library/multiprocessing.html#examples
"""
# You can also use a socket file '/tmp/shared_df'
MANAGER_ADDRESS = ('localhost', 6000)
MANAGER_AUTH = b"auth"
def __init__(self) -> None:
super().__init__(self.MANAGER_ADDRESS, self.MANAGER_AUTH)
self.dataframe: pandas.DataFrame = pandas.DataFrame(columns=['A', 'B'])
@classmethod
def register_dataframe(cls):
BaseManager.register("DataFrameHolder", DataFrameHolder)
class DFWorker:
"""Abstract class initializing a worker depending on a DataFrameHolder."""
def __init__(self, df_holder: DataFrameHolder) -> None:
super().__init__()
self.df_holder = df_holder
class StreamLoader(DFWorker):
"""This class is our worker communicating with the websocket"""
def update_df(self):
# read from websocket and update your DF.
data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
}
self.df_holder.update(data)
def run(self):
# limit loop for the showcase
for _ in range(4):
self.update_df()
print("[1] Updated DF\n%s" % str(self.df_holder.retrieve()))
time.sleep(3)
class LightComputation(DFWorker):
"""This class is a random computation worker"""
def compute(self):
print("[2] Current DF\n%s" % str(self.df_holder.retrieve()))
def run(self):
# limit loop for the showcase
for _ in range(4):
self.compute()
time.sleep(5)
def main():
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
# Register our DataFrameHolder type in the DataFrameManager.
DataFrameManager.register_dataframe()
manager = DataFrameManager()
manager.start()
# We create a managed DataFrameHolder to keep our DataFrame available.
df_holder = manager.DataFrameHolder()
# We create and start our update worker
stream = StreamLoader(df_holder)
stream_process = Thread(target=stream.run)
stream_process.start()
# We create and start our computation worker
compute = LightComputation(df_holder)
compute_process = Process(target=compute.run)
compute_process.start()
# The managed dataframe is updated in every Thread/Process
time.sleep(5)
print("[0] Main process DF\n%s" % df_holder.retrieve())
# We join our Threads, i.e. we wait for them to finish before continuing
stream_process.join()
compute_process.join()
if __name__ == "__main__":
main()
As you can see, the differences between threading and multiprocessing are quite small.
With a few tweaks, you can start from there and connect to the same manager from a separate script, if you want to handle your CPU-intensive processing in a different file.
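For illustration, a hypothetical second script could look roughly like this, assuming the streaming script additionally registers a callable that returns the existing holder (for example DataFrameManager.register("get_holder", callable=lambda: df_holder) before serving); the names get_holder and DataFrameClient are mine, not part of the code above:
from multiprocessing.managers import BaseManager


class DataFrameClient(BaseManager):
    """Connects to the manager started by the streaming script."""


# Register the same typeid on the client side; no callable is needed here.
DataFrameClient.register("get_holder")

client = DataFrameClient(address=('localhost', 6000), authkey=b"auth")
client.connect()                 # attach to the already-running server
df_holder = client.get_holder()  # proxy to the shared DataFrameHolder
print(df_holder.retrieve())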
Related
I'm having an issue with instances not retaining changes to attributes, or even keeping new attributes that are created. I think I've narrowed it down to the fact that my script takes advantage of multiprocessing, and I'm thinking that changes occurring to instances in separate process threads are not 'remembered' when the script returns to the main thread.
Basically, I have several sets of data which I need to process in parallel. The data is stored as an attribute, and is altered via several methods in the class. At the conclusion of processing, I'm hoping to return to the main thread and concatenate the data from each of the object instances. However, as described above, when I try to access the instance attribute with the data after the parallel processing bit is done, there's nothing there. It's as if any changes enacted during the multiprocessing bit are 'forgotten'.
Is there an obvious solution to fix this? Or do I need to rebuild my code to instead return the processed data rather than just altering/storing it as an instance attribute? I guess an alternative solution would be to serialize the data, and then re-read it in when necessary, rather than just keeping it in memory.
Something maybe worth noting here is that I am using the pathos module rather than Python's multiprocessing module. I was getting some errors pertaining to pickling, similar to here: Python multiprocessing PicklingError: Can't pickle <type 'function'>. My code is broken across several modules and, as mentioned, the data processing methods are contained within a class.
Sorry for the wall of text.
EDIT
Here's my code:
import importlib
import pandas as pd
from pathos.helpers import mp
from provider import Provider
# list of data providers ... length is arbitrary
operating_providers = ['dataprovider1', 'dataprovider2', 'dataprovider3']
# create provider objects for each operating provider
provider_obj_list = []
for name in operating_providers:
loc = 'providers.%s' % name
module = importlib.import_module(loc)
provider_obj = Provider(module)
provider_obj_list.append(provider_obj)
processes = []
for instance in provider_obj_list:
process = mp.Process(target = instance.data_processing_func)
process.daemon = True
process.start()
processes.append(process)
for process in processes:
process.join()
# now that data_processing_func is complete for each set of data,
# stack all the data
stack = pd.concat((instance.data for instance in provider_obj_list))
I have a number of modules (their names listed in operating_providers) that contain attributes specific to their data source. These modules are iteratively imported and passed to new instances of the Provider class, which I created in a separate module (provider). I append each Provider instance to a list (provider_obj_list), and then iteratively create separate processes which call the instance method instance.data_processing_func. This function does some data processing (with each instance accessing completely different data files), and creates new instance attributes along the way, which I need to access when the parallel processing is complete.
I tried using multithreading rather than multiprocessing -- in this case, my instance attributes persisted, which is what I want. However, I am not sure why this happens -- I'll have to study the differences between threading and multiprocessing.
Thanks for any help!
Here's some sample code showing how to do what I outlined in a comment. I can't test it because I don't have provider or pathos installed, but it should give you a good idea of what I suggested.
import importlib
from pathos.helpers import mp
from provider import Provider
def process_data(loc):
module = importlib.import_module(loc)
provider_obj = Provider(module)
provider_obj.data_processing_func()
if __name__ == '__main__':
# list of data providers ... length is arbitrary
operating_providers = ['dataprovider1', 'dataprovider2', 'dataprovider3']
# create list of provider locations for each operating provider
provider_loc_list = []
for name in operating_providers:
loc = 'providers.%s' % name
provider_loc_list.append(loc)
processes = []
for loc in provider_loc_list:
process = mp.Process(target=process_data, args=(loc,))
process.daemon = True
process.start()
processes.append(process)
for process in processes:
process.join()
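If you also need the processed data back in the parent (for the pd.concat step in your original code), one option is to return it through a queue instead of storing it on the instance. Another sketch I can't test for the same reason, where provider_obj.data is assumed to be the attribute you want to collect:
import importlib

import pandas as pd
from pathos.helpers import mp
from provider import Provider


def process_data(loc, result_queue):
    module = importlib.import_module(loc)
    provider_obj = Provider(module)
    provider_obj.data_processing_func()
    result_queue.put(provider_obj.data)  # send the result back to the parent


if __name__ == '__main__':
    operating_providers = ['dataprovider1', 'dataprovider2', 'dataprovider3']
    result_queue = mp.Queue()
    processes = []
    for name in operating_providers:
        process = mp.Process(target=process_data,
                             args=('providers.%s' % name, result_queue))
        process.daemon = True
        process.start()
        processes.append(process)

    # Drain the queue before joining, so large results don't block the workers.
    results = [result_queue.get() for _ in processes]
    for process in processes:
        process.join()

    stack = pd.concat(results)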
Python 3
I would like to know what a really clean, pythonic concurrent data loader should look like. I need this approach for a project of mine that does heavy computations on data that is too big to fit entirely into memory. Hence, I implemented data loaders that run concurrently and store data in a queue, so that the main process can work while (in the meantime) the next data is being loaded and prepared. Of course, the queue should block when it is empty (main process trying to consume more items -> queue should wait for new data) or full (worker process should wait until the main process consumes data out of the queue, to prevent out-of-memory errors).
I have written a class to fulfill this need using Python's multiprocessing module (multiprocessing.Queue and multiprocessing.Process). The crucial parts of the class are implemented as follows:
import multiprocessing as mp
from itertools import cycle

import numpy as np


class ConcurrentLoader:
    def __init__(self, path_to_data, queue_size, batch_size):
        self._batch_size = batch_size
        self._path = path_to_data
        filenames = ...  # filenames for path 'path_to_data',
        # get loaded using glob
        self._files = cycle(filenames)
        self._q = mp.Queue(queue_size)
        ...
        self._worker = mp.Process(target=self._worker_func, daemon=True)
        self._worker.start()  # only started, never stopped

    def _worker_func(self):
        while True:
            buffer = list()
            for i in range(self._batch_size):
                f = next(self._files)
                ...  # load f and do some pre-processing with NumPy
                ...  # add it to buffer
            self._q.put(np.array(buffer).astype(np.float32))

    def get_batch_data(self):
        return self._q.get()
The class has some more methods, but they are all for "convenience functionality". For example, it counts in a dict how often each file was loaded, how often the whole data set was loaded and so on, but these are rather easy to implement in Python and do not waste much computation time (sets, dicts, ...).
The data part itself on the other hand, due to I/O and pre-processing, can even take seconds. That is the reason why I want this to happen concurrently.
ConcurrentLoader should:
block main process: if get_batch_data is called, but queue is empty
block worker process: if queue is full, to prevent out-of-memory errors and prevent while True from wasting resources
be "transparent" to any class that uses ConcurrentLoader: they should just supply the path to the data and use get_batch_data without noticing that this actually works concurrently ("hassle free usage")
terminate its worker when main process dies to free resources again
Considering these goals (have I forgotten anything?), what should I do to enhance the current implementation? Is it thread-/deadlock-safe? Is there a more "pythonic" way of implementing it? Can I make it cleaner? Does it waste resources somehow?
Any class that uses ConcurrentLoader would roughly follow this setup:
class Foo:
...
def do_something(self):
...
data1 = ConcurrentLoader("path/to/data1", 64, 8)
data2 = ConcurrentLoader("path/to/data2", 256, 16)
...
sample1 = data1.get_batch_data()
sample2 = data2.get_batch_data()
... # heavy computations with data contained in 'sample1' & 'sample2'
# go *here*
Please either point out mistakes of any kind in order to improve my approach, or supply your own, cleaner, more pythonic approach.
Blocking when a multiprocessing.Queue is empty/full and
get()/put() is called on it happens automatically.
This behavior is transparent to calling functions.
Set self._worker.daemon = True before self._worker.start() so the worker(s) will automatically be killed when the main process exits.
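Both points can be seen in a tiny standalone example (not your loader, just a bounded queue with a daemonized producer):
import multiprocessing as mp
import time


def producer(q):
    for i in range(10):
        q.put(i)  # blocks automatically once the queue already holds 2 items
        print("put", i)


if __name__ == "__main__":
    q = mp.Queue(maxsize=2)
    worker = mp.Process(target=producer, args=(q,), daemon=True)
    worker.start()
    time.sleep(1)  # the producer fills the queue, then blocks on put()
    for _ in range(10):
        print("got", q.get())  # get() blocks whenever the queue is empty
    # no join: the daemonized worker is killed when the main process exits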
I want to read in a stream of float numbers, do some simple calculation and append the value to a global list. Can you tell me where I'm getting it wrong? The list is not being appended to.
from random import random
from time import sleep
def process(x):
from random import random
sleep(random()*2)
t = x * 2
processed_queue.append(t)
print(processed_queue)
return t
if __name__ == "__main__":
from distributed import Client
from queue import Queue
client = Client()
processed_queue = []
input_q = Queue()
remote_q = client.scatter(input_q)
processed_q = client.map(process, remote_q)
result_q = client.gather(processed_q)
for i in [random() for x in range(100)]:
sleep(random())
input_q.put(i)
print(i)
print(processed_queue)
print(result_q.qsize())
While queue.Queue and multiprocessing.Queue can be used to send data between threads and processes, this kind of programming-by-side-effect is generally not the model encouraged by dask.
You are able to pass data to functions executed by the cluster and get their return values in real time using client.submit; what are the queues doing for you that you cannot do otherwise? In addition, there are some dask constructs, such as shared variables, that could perhaps do this, but (again) that is rarely used and I think unlikely to be the right paradigm for you.
For the specific reason that the code is not working for you: Client() creates at least one separate process for the scheduler and one for a worker with one or more threads (see your task manager, top, or other system-watching tool). The queue.Queue is process-local, so each process sees its own (empty) queue and adds to it, but that information is not seen in the main process, and actions on the input queue are not seen in the workers.
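As a rough sketch of the submit/gather style suggested above, applied to the same toy computation (it uses only the standard distributed API, but is otherwise untested here):
from random import random
from time import sleep

from distributed import Client


def process(x):
    sleep(random() * 2)
    return x * 2


if __name__ == "__main__":
    client = Client()
    futures = [client.submit(process, random()) for _ in range(100)]
    results = client.gather(futures)  # return values, in submission order
    print(results[:5])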
I'm attempting to create unittests for my application which uses multiple processes, but have been having strange issues when attempting to run all the tests together. Basically when running tests individually they pass without issue but when run sequentially, such as when running all tests in the file, some tests will fail.
What I'm seeing is that many python processes are being created but they aren't closing when the test is reported as passed. For example, if 2 tests are run that each generate 5 processes, then 10 python processes show up in the system monitor.
I've tried using terminate and join but neither work. Is there a way to force a test to correctly close all processes that it generated before running the next test?
I'm running Python 2.7 in Ubuntu 16.04.
Edit:
It's a fairly large code base so here a simplified example.
import unittest

from multiprocessing import Pipe, Process
class BaseDevice:
# Various methods
pass
class BaseInstr(BaseDevice, Process):
def __init__(self, pipe):
Process.__init__(self)
self.pipe = pipe
def run(self):
# Do stuff and wait for terminate message on pipe
# Various other higher level methods
class BaseCompountInstrument(BaseInstr):
def __init__(self, pipe):
# Create multiple instruments, usually done with config file but simplified here
BaseInstr.__init__(self, pipe)
instrlist = list()
for _ in range(5):
masterpipe, slavepipe = Pipe()
instrlist.append([BaseInstr(slavepipe), masterpipe])
def run(self):
pass
# Listen for message from pipe, send messages to sub-instruments
def shutdown(self):
# When shutdown message received, send to all sub-instruments
pass
class test(unittest.TestCase):
def setUp(self):
# Load up a configuration file from the sample configs so that they're updated
self.parentConn, self.childConn = Pipe()
self.instr = BaseCompountInstrument( self.childConn)
self.instr.start()
def tearDown(self):
self.parentConn.send("shutdown") # Propagates to all sub-instruments
def test1(self):
pass
def test2(self):
pass
After struggling with this for a while (2 days, actually), I found a solution which is not technically wrong, but removes all the parallel code you have (only in tests, only in tests...).
I use the mock package to mock functions (which I now realize has been part of the unittest module, as unittest.mock, since Python 3.3 xD); you can pretend the execution of a certain function went well, fix a certain return value, or change the function itself.
So I did the last option: Change the function itself.
In my case I used a list of Process (because Pool didn't work in my case) and Manager's list to share data between the processes.
My original code would be something like this:
import multiprocessing as mp
manager = mp.Manager()
list_data = manager.list()
list_return = manager.list()
def parallel_function(list_data, list_return):
while len(list_data) > 0:
# Do things and make sure to "pop" the data in list_data
list_return.append(return_data)
return None
# Create as many processes as images or cpus, the lesser number
processes = [mp.Process(target=parallel_function,
args=(list_data, list_return))
for num_p in range(mp.cpu_count())]
for p in processes:
p.start()
for p in processes:
p.join(10)
So in my test I mock the function Process.__init__ from the multiprocessing module to run my parallel_function instead of creating a new process.
In the test file, before any test you should define the same function you try to parallelize:
def fake_process(self, list_data, list_return):
while len(list_data) > 0:
# Do things and make sure to "pop" the data in list_data
list_return.append(return_data)
return None
And before the definition of any method which is going to execute this part of the code, you have to add decorators (from unittest.mock import patch, or from mock import patch) to override the Process.__init__ function:
@patch('multiprocessing.Process.__init__', new=fake_process)
@patch('multiprocessing.Process.start', new=lambda x: None)
@patch('multiprocessing.Process.join', new=lambda x, y: None)
def test_from_the_hell(self):
# Do things
If you use Manager data structures there is no need to use Locks or anything else to control access to the data, because those structures are thread-safe.
I hope this will help any other lost soul who is trying to test multiprocessing code.
I am using Python multiprocessing, more precisely
from multiprocessing import Pool
p = Pool(15)
args = [(df, config1), (df, config2), ...] #list of args - df is the same object in each tuple
res = p.map_async(func, args) #func is some arbitrary function
p.close()
p.join()
This approach has a huge memory consumption, eating up pretty much all my RAM (at which point it gets extremely slow, making the multiprocessing pretty useless). I assume the problem is that df is a huge object (a large pandas dataframe) and it gets copied for each process. I have tried using multiprocessing.Value to share the dataframe without copying:
shared_df = multiprocessing.Value(pandas.DataFrame, df)
args = [(shared_df, config1), (shared_df, config2), ...]
(as suggested in Python multiprocessing shared memory), but that gives me TypeError: this type has no size (same as Sharing a complex object between Python processes?, to which I unfortunately don't understand the answer).
I am using multiprocessing for the first time and maybe my understanding is not (yet) good enough. Is multiprocessing.Value actually even the right thing to use in this case? I have seen other suggestions (e.g. queue) but am by now a bit confused. What options are there to share memory, and which one would be best in this case?
The first argument to Value is typecode_or_type. That is defined as:
typecode_or_type determines the type of the returned object: it is
either a ctypes type or a one character typecode of the kind used by
the array module. *args is passed on to the constructor for the type.
Emphasis mine. So, you simply cannot put a pandas dataframe in a Value; it has to be a ctypes type.
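For reference, this is the kind of thing Value is meant for: small ctypes scalars, not arbitrary Python objects like a DataFrame.
import ctypes
from multiprocessing import Value

counter = Value('i', 0)              # one-character typecode: signed int
total = Value(ctypes.c_double, 0.0)  # or an explicit ctypes type
with counter.get_lock():             # the synchronized wrapper provides a lock
    counter.value += 1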
You could instead use a multiprocessing.Manager to serve your singleton dataframe instance to all of your processes. There are a few different ways to end up in the same place; probably the easiest is to just plop your dataframe into the manager's Namespace.
from multiprocessing import Manager
mgr = Manager()
ns = mgr.Namespace()
ns.df = my_dataframe
# now just give your processes access to ns, i.e. most simply
# p = Process(target=worker, args=(ns, work_unit))
Now your dataframe instance is accessible to any process that gets passed a reference to the Manager. Or just pass a reference to the Namespace; it's cleaner.
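A short usage sketch of the Namespace approach, with a toy frame standing in for my_dataframe (each ns.df access fetches the value through the manager proxy):
from multiprocessing import Manager, Process

import pandas as pd


def worker(ns, work_unit):
    df = ns.df  # retrieved from the manager process
    print(work_unit, df.shape)


if __name__ == '__main__':
    my_dataframe = pd.DataFrame({'A': range(5), 'B': range(5)})
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = my_dataframe
    procs = [Process(target=worker, args=(ns, i)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()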
One thing I didn't/won't cover is events and signaling - if your processes need to wait for others to finish executing, you'll need to add that in. Here is a page with some Event examples, which also covers in a bit more detail how to use the manager's Namespace.
(note that none of this addresses whether multiprocessing is going to result in tangible performance benefits, this is just giving you the tools to explore that question)
You can use Array instead of Value for storing your dataframe.
The solution below converts a pandas dataframe to an object that stores its data in shared memory:
import numpy as np
import pandas as pd
import multiprocessing as mp
import ctypes
# the original dataframe is df; store the column/dtype pairs
df_dtypes_dict = dict(list(zip(df.columns, df.dtypes)))
# declare a shared Array with data from df
mparr = mp.Array(ctypes.c_double, df.values.reshape(-1))
# create a new df based on the shared array
df_shared = pd.DataFrame(np.frombuffer(mparr.get_obj()).reshape(df.shape),
columns=df.columns).astype(df_dtypes_dict)
If you now share df_shared across processes, no additional copies will be made. For your case:
def fun(config):
    # df_shared is global to the script
    df_shared.apply(config)  # whatever compute you do with df/config

config_list = [config1, config2]

# create the pool after fun is defined, so the worker processes can find it
pool = mp.Pool(15)
res = pool.map_async(fun, config_list)
pool.close()
pool.join()
This is also particularly useful if you use pandarallel, for example:
# this will not explode in memory
from pandarallel import pandarallel
pandarallel.initialize()
df_shared.parallel_apply(your_fun, axis=1)
Note: with this solution you end up with two dataframes (df and df_shared), which consume twice the memory and take a long time to initialise. It might be possible to read the data directly into shared memory.
Python 3.6 (at least) supports storing a pandas DataFrame as a multiprocessing.Value. See below for a working example:
import ctypes
import pandas as pd
from multiprocessing import Value
df = pd.DataFrame({'a': range(0,9),
'b': range(10,19),
'c': range(100,109)})
k = Value(ctypes.py_object)
k.value = df
print(k.value)
You can share a pandas dataframe between processes without any memory overhead by creating a data_handler child process. This process receives calls from the other children with specific data requests (i.e. a row, a specific cell, a slice, etc.) against your very large dataframe object. Only the data_handler process keeps your dataframe in memory, unlike a Manager Namespace, which causes the dataframe to be copied to all child processes. See below for a working example. This can be converted to use a pool.
Need a progress bar for this? See my answer here: https://stackoverflow.com/a/55305714/11186769
import time
import queue
import numpy as np
import pandas as pd
import multiprocessing
from random import randint
#==========================================================
# DATA HANDLER
#==========================================================
def data_handler( queue_c, queue_r, queue_d, n_processes ):
# Create a big dataframe
big_df = pd.DataFrame(np.random.randint(
0,100,size=(100, 4)), columns=list('ABCD'))
# Handle data requests
finished = 0
while finished < n_processes:
try:
# Get the index we sent in
idx = queue_c.get(False)
except queue.Empty:
continue
else:
if idx == 'finished':
finished += 1
else:
try:
# Use the big_df here!
B_data = big_df.loc[ idx, 'B' ]
# Send back some data
queue_r.put(B_data)
except:
pass
# big_df may need to be deleted at the end.
#import gc; del big_df; gc.collect()
#==========================================================
# PROCESS DATA
#==========================================================
def process_data( queue_c, queue_r, queue_d):
data = []
# Save computer memory with a generator
generator = ( randint(0,x) for x in range(100) )
for g in generator:
"""
Lets make a request by sending
in the index of the data we want.
Keep in mind you may receive another
child processes return call, which is
fine if order isn't important.
"""
#print(g)
# Send an index value
queue_c.put(g)
# Handle the return call
while True:
try:
return_call = queue_r.get(False)
except queue.Empty:
continue
else:
data.append(return_call)
break
queue_c.put('finished')
queue_d.put(data)
#==========================================================
# START MULTIPROCESSING
#==========================================================
def multiprocess( n_processes ):
combined = []
processes = []
# Create queues
queue_data = multiprocessing.Queue()
queue_call = multiprocessing.Queue()
queue_receive = multiprocessing.Queue()
for process in range(n_processes):
if process == 0:
# Load your data_handler once here
p = multiprocessing.Process(target = data_handler,
args=(queue_call, queue_receive, queue_data, n_processes))
processes.append(p)
p.start()
p = multiprocessing.Process(target = process_data,
args=(queue_call, queue_receive, queue_data))
processes.append(p)
p.start()
for i in range(n_processes):
data_list = queue_data.get()
combined += data_list
for p in processes:
p.join()
# Your B values
print(combined)
if __name__ == "__main__":
multiprocess( n_processes = 4 )
I was pretty surprised that joblib's Parallel (since 1.0.1 at least) supports sharing pandas dataframes with multiprocess workers out of the box already. At least with the 'loky' backend.
One thing I figured out experimentally: parameters you pass to the function should not contain any large dict. If they do, turn the dict into a Series or Dataframe.
Some additional memory certainly gets used by each worker, but much less than the size of your supposedly 'big' dataframe residing in the main process, and the computation begins right away in all workers. Otherwise, joblib starts all your requested workers, but they hang idle while objects are copied into each one sequentially, which takes a long time. I can provide a code sample if someone needs it. I have tested dataframe processing only in read-only mode. The feature is not mentioned in the docs, but it works for pandas.
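A rough sketch of the kind of usage I mean (made-up names and data; note that the configs are passed as Series rather than large dicts, per the point above):
import pandas as pd
from joblib import Parallel, delayed


def compute(df, config):
    # read-only work on the big frame
    return (df['value'] * config['weight']).sum()


if __name__ == '__main__':
    big_df = pd.DataFrame({'value': range(1_000_000)})
    configs = [pd.Series({'weight': w}) for w in (0.5, 1.0, 2.0)]

    results = Parallel(n_jobs=3, backend='loky')(
        delayed(compute)(big_df, c) for c in configs
    )
    print(results)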