Suppose I have a large in-memory numpy array, and a function func that takes this giant array as input (together with some other parameters). Calls to func with different parameters can be run in parallel. For example:
def func(arr, param):
    # do stuff to arr, param

# build array arr
pool = Pool(processes=6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]
If I use the multiprocessing library, that giant array will be copied multiple times into the different processes.
Is there a way to let different processes share the same array? This array object is read-only and will never be modified.
What's more complicated: if arr is not an array but an arbitrary Python object, is there a way to share it?
[EDITED]
I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes with the multiprocessing library. But the following code suggests there is a huge overhead:
from multiprocessing import Pool, Manager
import numpy as np
import time

def f(arr):
    return len(arr)

t = time.time()
arr = np.arange(10000000)
print "construct array = ", time.time() - t

pool = Pool(processes=6)

t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print "multiprocessing overhead = ", time.time() - t
Output (and, by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):
construct array = 0.0178790092468
multiprocessing overhead = 0.252444982529
Why is there such a huge overhead if we didn't copy the array? And what part does shared memory save me?
If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don't alter the object).
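For illustration, a minimal sketch of relying on copy-on-write (my own example, not part of the original answer, and it assumes a fork-based start method): create the array at module level before the pool starts and reference it inside the worker instead of passing it as an argument, since arguments passed to apply_async/map are pickled and sent to the children.

import numpy as np
from multiprocessing import Pool

arr = np.arange(10_000_000)  # created before the fork; children see it via copy-on-write

def f(param):
    # reference the inherited global instead of receiving arr as a (pickled) argument
    return arr[param]

if __name__ == '__main__':
    with Pool(processes=6) as pool:
        print(pool.map(f, range(4)))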
The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
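As a rough sketch of that idea (my own illustration, not the code from the linked answer, and a slight variation in that the shared block is handed to the workers through a pool initializer rather than as a call argument, so it also works with the spawn start method):

import ctypes
import numpy as np
from multiprocessing import Pool, RawArray

def init_worker(shared, shape):
    # expose the shared block to this worker as a numpy view (no copy)
    global arr
    arr = np.frombuffer(shared, dtype=np.float64).reshape(shape)

def func(param):
    return float(arr[param].sum())

if __name__ == '__main__':
    shape = (1000, 1000)
    shared = RawArray(ctypes.c_double, shape[0] * shape[1])
    np.frombuffer(shared, dtype=np.float64).reshape(shape)[:] = 1.0
    with Pool(processes=4, initializer=init_worker, initargs=(shared, shape)) as pool:
        print(pool.map(func, range(4)))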
If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).
The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.
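A minimal Manager-based sketch for an arbitrary object (my own illustration, not from the original answer); every access goes through the manager process, which is convenient but comparatively slow:

from multiprocessing import Manager, Pool

def func(shared_dict, param):
    # each lookup is forwarded to the manager process and deserialized here
    return len(shared_dict['payload']) + param

if __name__ == '__main__':
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_dict['payload'] = list(range(1_000_000))
        with Pool(processes=4) as pool:
            results = [pool.apply_async(func, (shared_dict, p)) for p in range(4)]
            print([r.get() for r in results])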
There are a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well rounded library, but if you have special needs perhaps one of the other approaches may be better.
I ran into the same problem and wrote a little shared-memory utility class to work around it.
I'm using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all either (lock-free), so be careful not to shoot yourself in the foot.
With this solution I get speedups of roughly a factor of 3 on a quad-core i7.
Here's the code:
Feel free to use and improve it, and please report back any bugs.
'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np


class SharedNumpyMemManagerError(Exception):
    pass


'''
Singleton Pattern
'''
class SharedNumpyMemManager:

    _initSize = 1024
    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(SharedNumpyMemManager, cls).__new__(
                cls, *args, **kwargs)
        return cls._instance

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):
        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, int(np.prod(dimensions)))

        # convert to numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # do a reshape for correct dimensions
        # Returns an array containing the same data with a new shape.
        # The result is a view on the original array.
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None:  # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)


# Init Singleton on module load
SharedNumpyMemManager.getInstance()

if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")
    print("Sequential: ", t1.timeit(number=1))
    print("Parallel: ", t2.timeit(number=1))
This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.
The code would look like the following.
import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)
If you don't call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.
Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.
You can compare the performance of serialization in Ray versus pickle by running the following in IPython.
import numpy as np
import pickle
import ray
ray.init()
x = {i: np.ones(10**7) for i in range(20)}
# Time Ray.
%time x_id = ray.put(x) # 2.4s
%time new_x = ray.get(x_id) # 0.00073s
# Time pickle.
%time serialized = pickle.dumps(x) # 2.6s
%time deserialized = pickle.loads(serialized) # 1.9s
Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).
See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note I'm one of the Ray developers.
Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.
I made brain-plasma specifically for this reason - fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickle'd bytestrings generated by pickle.dumps(...).
The key difference from Ray and Plasma alone is that it keeps track of object IDs for you. Any processes, threads, or programs running locally can share the variables' values by looking up the name on any Brain object.
$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma
from brain_plasma import Brain
brain = Brain(path='/tmp/plasma/')
brain['a'] = [1]*10000
brain['a']
# >>> [1,1,1,1,...]
Related
There is a problem using ctypes structures with multiprocessing.
I can use simple ctypes variables with multiprocessing, but when I pass structures to functions there is a problem with pickling them.
Here is some code which demonstrates this problem:
import concurrent.futures
import time
from ctypes import *

def test_c_val(c_val):
    print(c_val.value)
    return c_val.value

test_int = c_int(55)
test_char = c_char(str(6).encode())
arr = [str(i).encode() for i in range(4)]
test_c_array = (c_char * len(arr))(*arr)

futures = []
with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
    futures.append(executor.submit(test_c_val, test_int))
    futures.append(executor.submit(test_c_val, test_char))
    futures.append(executor.submit(test_c_val, test_c_array))
    time.sleep(5)

print(futures[2])
print(futures)
print(futures[2].exception())
How can I solve it?
ctypes pointers can point to memory of any size or location that Python doesn't know about; they may even point to null. It's unsafe for Python to attempt to pickle them, so it doesn't.
The closest thing is to use either a shared-memory array via multiprocessing.Array or a Python array, which can be pickled.
This is how it would be done using an array: because it is a Python object that can be pickled, the array is sent, and a pointer to its buffer is used to form a ctypes object.
import concurrent.futures
import ctypes
import array
import time

def test_c_val(c_val):
    print(c_val.value)
    return c_val.value

def test_py_array(py_arr: array.array):
    c_array_size = py_arr.buffer_info()[1]
    c_array = (ctypes.c_char * c_array_size).from_address(py_arr.buffer_info()[0])
    print(c_array.value)
    return py_arr

if __name__ == "__main__":
    test_int = ctypes.c_int(55)
    test_char = ctypes.c_char(str(6).encode())
    arr = [str(i).encode() for i in range(4)]
    test_c_array = (ctypes.c_char * len(arr))(*arr)
    # test_py_c_array = array.array('b', b''.join(arr))
    test_py_c_array = array.array('b', test_c_array.value)

    futures = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        futures.append(executor.submit(test_c_val, test_int))
        futures.append(executor.submit(test_c_val, test_char))
        futures.append(executor.submit(test_py_array, test_py_c_array))
        time.sleep(1)

    print(futures[2])
    print(futures)
    print(futures[2].exception())
If you want to use shared memory, you can use multiprocessing.shared_memory, which is more flexible than multiprocessing.Array; note that you can't pickle a multiprocessing.Array, it can only be inherited, unlike a Python array.
Alternatively, there's numpy, which gives you serializable C-backed arrays that are easier to use than Python arrays.
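A minimal sketch of that shared_memory route (my own illustration, not the asker's code), wrapping the block in a numpy array on both sides:

import numpy as np
from multiprocessing import Process, shared_memory

def worker(name, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[:] = 42  # visible in the parent; only the block's name was pickled
    shm.close()

if __name__ == '__main__':
    a = np.zeros(8, dtype=np.int64)
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
    b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
    b[:] = a
    p = Process(target=worker, args=(shm.name, a.shape, a.dtype))
    p.start()
    p.join()
    print(b)  # [42 42 42 42 42 42 42 42]
    shm.close()
    shm.unlink()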
Trying to learn about the built-in multiprocessing and PyTorch's multiprocessing packages, I have observed different behavior between the two. I find this strange, since PyTorch's package is fully compatible with the built-in one.
Concretely, I'm referring to the way variables are shared between processes. In PyTorch, tensors are moved to shared memory via the in-place operation share_memory_(). On the other hand, we can get the same result with the built-in package by using the shared_memory module.
The difference between the two that I'm struggling to understand is that, with the built-in version, we have to explicitly access the shared memory block inside the launched process. However, we don't need to do that with the PyTorch version.
Here is a PyTorch toy example showing this:
import time
import torch
# the same behavior happens when importing:
# import multiprocessing as mp
import torch.multiprocessing as mp

def get_time(s):
    return round(time.time() - s, 1)

def foo(a):
    # wait ~1sec to print the value of the tensor.
    time.sleep(1.0)
    with lock:
        #-------------------------------------------------------------------
        # WITHOUT explicitly accessing the shared memory block, we can observe
        # that the tensor has changed:
        #-------------------------------------------------------------------
        print(f"{__name__}\t{get_time(s)}\t\t{a}")

# global variables.
lock = mp.Lock()
s = time.time()

if __name__ == '__main__':
    print("Module\t\tTime\t\tValue")
    print("-"*50)

    # create tensor and assign it to shared memory.
    a = torch.zeros(2).share_memory_()
    print(f"{__name__}\t{get_time(s)}\t\t{a}")

    # start child process.
    p0 = mp.Process(target=foo, args=(a,))
    p0.start()

    # modify the value of the tensor after ~0.5sec.
    time.sleep(0.5)
    with lock:
        a[0] = 1.0
    print(f"{__name__}\t{get_time(s)}\t\t{a}")

    time.sleep(1.5)
    p0.join()
which outputs (as expected):
Module Time Value
--------------------------------------------------
__main__ 0.0 tensor([0., 0.])
__main__ 0.5 tensor([1., 0.])
__mp_main__ 1.0 tensor([1., 0.])
And here is a toy example with the built-in package:
import time
import multiprocessing as mp
from multiprocessing import shared_memory
import numpy as np

def get_time(s):
    return round(time.time() - s, 1)

def foo(shm_name, shape, type_):
    #-------------------------------------------------------------------
    # WE NEED TO explicitly access the shared memory block to observe
    # that the array has changed:
    #-------------------------------------------------------------------
    existing_shm = shared_memory.SharedMemory(name=shm_name)
    a = np.ndarray(shape, type_, buffer=existing_shm.buf)

    # wait ~1sec to print the value.
    time.sleep(1.0)
    with lock:
        print(f"{__name__}\t{get_time(s)}\t\t{a}")

# global variables.
lock = mp.Lock()
s = time.time()

if __name__ == '__main__':
    print("Module\t\tTime\t\tValue")
    print("-"*35)

    # create numpy array and shared memory block.
    a = np.zeros(2,)
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
    a_shared = np.ndarray(a.shape, a.dtype, buffer=shm.buf)
    a_shared[:] = a[:]
    print(f"{__name__}\t{get_time(s)}\t\t{a_shared}")

    # start child process.
    p0 = mp.Process(target=foo, args=(shm.name, a.shape, a.dtype))
    p0.start()

    # modify the value of the array after ~0.5sec.
    time.sleep(0.5)
    with lock:
        a_shared[0] = 1.0
    print(f"{__name__}\t{get_time(s)}\t\t{a_shared}")

    time.sleep(1.5)
    p0.join()
which equivalently outputs, as expected:
Module Time Value
-----------------------------------
__main__ 0.0 [0. 0.]
__main__ 0.5 [1. 0.]
__mp_main__ 1.0 [1. 0.]
So what I'm struggling to understand is why we don't need to follow the same steps in both versions, built-in and PyTorch's, i.e. how PyTorch is able to avoid the need to explicitly access the shared memory block.
P.S. I'm using a Windows OS and Python 3.9
You are writing a love letter to the PyTorch authors. That is, you are patting them on the back, congratulating their wrapper efforts as "a job well done!" It's a lovely library.

Let's take a step back and use a very simple data structure, a dictionary d. If the parent initializes d with some values and then kicks off a pair of worker children, each child has a copy of d. How did that happen? The multiprocessing module spawned the workers, looked at the set of defined variables, which includes d, and serialized those (key, value) pairs from the parent down to the children.

So at this point we have 3 independent copies of d. If the parent or either child modifies d, the other 2 copies are completely unaffected.
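A tiny demo of that independence (my own sketch, not part of the original answer):

import multiprocessing as mp

d = {"x": 0}  # the parent's copy

def worker():
    d["x"] = 42  # modifies this child's private copy only
    print("child sees", d)

if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.join()
    print("parent sees", d)  # still {'x': 0}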
Now switch gears to the PyTorch wrapper. You offered some nice concise code that demos the little .SharedMemory() dance an app would need to do if we want 3 references to the same shared structure rather than 3 independent copies. The PyTorch wrapper serializes references to the common data structure rather than producing copies. Under the hood it's doing exactly the dance that you did, but with no repeated verbiage up at the app level, as the details have nicely been abstracted away, FTW!

Why in PyTorch we don't need to access the shared memory block?

tl;dr: We do need to access it. But the library shoulders the burden of worrying about the details, so we don't have to.
PyTorch has a simple wrapper around shared memory; Python's shared_memory module is itself only a wrapper around the underlying OS-dependent functions.
The way it can be done is that you don't serialize the array or the shared memory themselves; you only serialize what's needed to recreate them, using the __getstate__ and __setstate__ methods from the docs, so that your object acts as both a proxy and a container at the same time.
The following bar class can double as a proxy and a container this way, which is useful if the user shouldn't have to worry about the shared-memory part.
import time
import multiprocessing as mp
from multiprocessing import shared_memory
import numpy as np

class bar:
    def __init__(self):
        self._size = 10
        self._type = np.uint8
        self.shm = shared_memory.SharedMemory(create=True, size=self._size)
        self._mem_name = self.shm.name
        self.arr = np.ndarray([self._size], self._type, buffer=self.shm.buf)

    def __getstate__(self):
        """Return state values to be pickled."""
        return (self._mem_name, self._size, self._type)

    def __setstate__(self, state):
        """Restore state from the unpickled state values."""
        self._mem_name, self._size, self._type = state
        self.shm = shared_memory.SharedMemory(self._mem_name)
        self.arr = np.ndarray([self._size], self._type, buffer=self.shm.buf)

def get_time(s):
    return round(time.time() - s, 1)

def foo(shm, lock):
    # -------------------------------------------------------------------
    # without explicitly accessing the shared memory block we observe
    # that the array has changed:
    # -------------------------------------------------------------------
    a = shm

    # wait ~1sec to print the value.
    time.sleep(1.0)
    with lock:
        print(f"{__name__}\t{get_time(s)}\t\t{a.arr}")

# global variables.
s = time.time()

if __name__ == '__main__':
    lock = mp.Lock()  # to work on windows/mac.
    print("Module\t\tTime\t\tValue")
    print("-" * 35)

    # create numpy array and shared memory block.
    a = bar()
    print(f"{__name__}\t{get_time(s)}\t\t{a.arr}")

    # start child process.
    p0 = mp.Process(target=foo, args=(a, lock))
    p0.start()

    # modify the value of the array after ~0.5sec.
    time.sleep(0.5)
    with lock:
        a.arr[0] = 1.0
    print(f"{__name__}\t{get_time(s)}\t\t{a.arr}")

    time.sleep(1.5)
    p0.join()
Python just makes it easy to hide such details inside the class without bothering the user with them.
Edit: I wish they'd make locks non-inheritable so your code would raise an error on the lock; instead, you'll find out one day that it doesn't actually lock... after it crashes your application in production.
I have a class that has a cache implemented as a dict for numpy arrays, which can occupy GBs of data.
class WorkOperations(object):
    def __init__(self):
        self.data_cache: Dict[str, Dict[str, Tuple[np.ndarray, np.ndarray]]] = {}

    def get_data(self, key):
        if key not in self.data_cache:
            self.add_data(key)
        return self.data_cache[key]

    def add_data(self, key):
        result = run_heavy_calculation(key)
        self.data_cache[key] = result
I am testing the code with this function -
import gc

def perform_operations():
    work_operations = WorkOperations()
    # input_keys gives a list of keys to process
    for key in input_keys():
        data = work_operations.get_data(key)
        do_some_operation(data)
    del work_operations

perform_operations()
gc.collect()
The result of run_heavy_calculation is heavy in memory and soon data_cache grows and occupies memory in GBs (which is expected).
But memory does not get released even after perform_operations() is done. I tried adding del work_operations and invoking gc.collect() but that did not help either. I checked memory of the process after several hours, but the memory was still not freed up.
If I don't use caching (data_cache) at all (at the cost of latency), memory never goes high.
I am wondering what it is that is taking up memory. I tried running tracemalloc, but it just showed lines occupying memory in KBs. I also took a memory dump with gdb by looking at memory addresses from the process pmap and /proc/<pid>/smaps, but it is really long and even with a hex editor I couldn't figure out much.
I am measuring the memory used by the process with the top command, looking at RES. I also tried outputting the memory at the end from within the Python process with -
import psutil
import gc
import logging

GIGABYTE = 1024.0 * 1024.0 * 1024.0

perform_operations()
gc.collect()
memory_full_info = psutil.Process().memory_full_info()
logging.info(f"process memory after running the process {memory_full_info.uss / GIGABYTE}")
Could not reproduce on Ubuntu with this:
import itertools
import time
import os
from typing import Dict, Tuple
import numpy as np
import psutil  # installed with pip

process = psutil.Process(os.getpid())
SIZE = 10**7

def run_heavy_calculation(key):
    array1 = np.zeros(SIZE)
    array2 = np.zeros(SIZE)
    # Linux uses "virtual memory", which means that the memory required for the arrays was allocated, but will not
    # be actually deducted until we use them, so we write some 1s into them
    # cf: https://stackoverflow.com/q/29516888/11384184
    for i in range(0, SIZE, 1000):
        array1[i] = 1
        array2[i] = 1
    return {key: (array1, array2)}

class WorkOperations(object):
    def __init__(self):
        self.data_cache: Dict[str, Dict[str, Tuple[np.ndarray, np.ndarray]]] = {}

    def get_data(self, key):
        if key not in self.data_cache:
            self.add_data(key)
        return self.data_cache[key]

    def add_data(self, key):
        result = run_heavy_calculation(key)
        self.data_cache[key] = result

def perform_operations(input_keys):
    work_operations = WorkOperations()
    for key in input_keys():
        data = work_operations.get_data(key)
        time.sleep(0.2)
        print(key, process.memory_info().rss / 10**9)
    del work_operations

perform_operations(lambda: map(str, itertools.product("abcdefgh", "0123456789")))  # dummy keys
print("after operations", process.memory_info().rss / 10**9)
input("pause")
('a', '0') 0.113106944
('a', '1') 0.195014656
('a', '2') 0.276926464
...
('h', '7') 6.421118976
('h', '8') 6.503030784
('h', '9') 6.584942592
after operations 0.031363072
pause
It climbed up to using 6.5 GB of RAM, then returned from the function, and all of it was released.
You can add a finalizer (__del__) to the class WorkOperations:
def __del__(self):
    print("deleted")
I see it printed between the last operation's print and the one after.
Although this does not guarantee it will always be the case (cf. this question), it strongly indicates that everything is working as intended: even without the del, the function returns (hence the scope is lost), so the reference count for work_operations drops to 0, and it then gets GC'ed.
It can be checked with sys.getrefcount too:
print(sys.getrefcount(work_operations) - 1) # see https://stackoverflow.com/a/510417/11384184
del work_operations
which for me prints 1.
Please provide a Minimal Reproducible Example and info on your system.
I've spent some time researching the Python multiprocessing module, its use of os.fork, and shared memory using multiprocessing.Array for a piece of code I'm writing.
The project itself boils down to this: I have several MxN arrays (let's suppose I have 3 arrays called A, B, and C) that I need to process to calculate a new MxN array (called D), where:
Dij = f(Aij, Bij, Cij)
The function f is such that standard vector operations cannot be applied. This task is what I believe is called "embarrassingly parallel". Given the overhead involved in multiprocessing, I am going to break the calculation of D into blocks. For example, if D were 8x8 and I had 4 processes, each processor would be responsible for solving a 4x4 "chunk" of D.
Now, the size of the arrays has the potential to be very big (on the order of several GB), so I want all arrays to use shared memory (even array D, which will have sub-processes writing to it). I believe I have a solution to the shared array issue using a modified version of what is presented here.
However, from an implementation perspective it'd be nice to place arrays A, B, and C into a dictionary. What is unclear to me is if doing this will cause the arrays to be copied in memory when the reference counter for the dictionary is incremented within each sub-process.
To try and answer this, I wrote a little test script (see below) and tried running it under valgrind --tool=massif to track memory usage. However, I am not quite clear how to interpret the results from it. Specifically, whether each massif.out file (where the number of files is equal to the number of sub-processes created by my test script + 1) denotes the memory used by that process (i.e. I need to sum them all up to get the total memory usage) or if I just need to consider the massif.out associated with the parent process.
On a side note: one of my shared memory arrays has the sub-processes writing to it. I know that this should be avoided, especially since I am not using locks to limit only one sub-process writing to the array at any given time. Is this a problem? My thought is that since the order in which the array is filled is irrelevant, the calculation of any index is independent of any other index, and any given sub-process will never write to the same array index as any other process, there will not be any sort of race conditions. Is this correct?
#! /usr/bin/env python

import multiprocessing as mp
import ctypes
import numpy as np
import time
import sys
import timeit


def shared_array(shape=None, lock=False):
    """
    Form a shared memory numpy array.

    https://stackoverflow.com/questions/5549190/is-shared-readonly-data-copied-to-different-processes-for-python-multiprocessing
    """
    shared_array_base = mp.Array(ctypes.c_double, shape[0]*shape[1], lock=lock)

    # Create a locked or unlocked array
    if lock:
        shared_array = np.frombuffer(shared_array_base.get_obj())
    else:
        shared_array = np.frombuffer(shared_array_base)

    shared_array = shared_array.reshape(*shape)
    return shared_array


def worker(indices=None, queue=None, data=None):
    # Loop over each index and "crush" some data
    for i in indices:
        time.sleep(0.01)
        if data is not None:
            data['sink'][i, :] = data['source'][i, :] + i

        # Place ID for completed index into the queue
        queue.put(i)


if __name__ == '__main__':
    # Set the start time
    begin = timeit.default_timer()

    # Size of arrays (m x n)
    m = 1000
    n = 1000

    # Number of processors
    N = 2

    # Create a queue to use for tracking progress
    queue = mp.Queue()

    # Create dictionary and shared arrays
    data = dict()

    # Form a shared array without a lock.
    data['source'] = shared_array(shape=(m, n), lock=True)
    data['sink'] = shared_array(shape=(m, n), lock=False)

    # Create a list of the indices associated with the m direction
    indices = range(0, m)

    # Parse the indices list into range blocks; each process will get a block
    indices_blocks = [int(i) for i in np.linspace(0, 1000, N+1)]

    # Initialize a list for storing created sub-processes
    procs = []

    # Print initialization time-stamp
    print 'Time to initialize time: {}'.format(timeit.default_timer() - begin)

    # Create and start each sub-process
    for i in range(1, N+1):
        # Start of the block
        start = indices_blocks[i-1]

        # End of the block
        end = indices_blocks[i]

        # Create the sub-process
        procs.append(mp.Process(target=worker,
                                args=(indices[start:end], queue, data)))

        # Kill the sub-process if/when the parent is killed
        procs[-1].daemon = True

        # Start the sub-process
        procs[-1].start()

    # Initialize a list to store the indices that have been processed
    completed = []

    # Enter a loop dependent on whether any of the sub-processes are still alive
    while any(i.is_alive() for i in procs):

        # Read the queue, append completed indices, and print the progress
        while not queue.empty():
            done = queue.get()
            if done not in completed:
                completed.append(done)
                message = "\rCompleted {:.2%}".format(float(len(completed))/len(indices))
                sys.stdout.write(message)
                sys.stdout.flush()
    print ''

    # Join all the sub-processes
    for p in procs:
        p.join()

    # Print the run time and the modified sink array
    print 'Running time: {}'.format(timeit.default_timer() - begin)
    print data['sink']
Edit: It seems I've run into another issue; specifically, a value of n equal to 3 million will result in the kernel killing the process (I assume it's due to a memory issue). This appears to be related to how shared_array() works (I can create np.zeros arrays of the same size without an issue). After playing with it a bit I get the traceback shown below. I'm not entirely sure what is causing the memory allocation error, but a quick Google search turns up discussions about how mmap maps virtual address space, which I'm guessing is smaller than the amount of physical memory a machine has?
Traceback (most recent call last):
File "./shared_array.py", line 66, in <module>
data['source'] = shared_array(shape=(m, n), lock=True)
File "./shared_array.py", line 17, in shared_array
shared_array_base = mp.Array(ctypes.c_double, shape[0]*shape[1], lock=lock)
File "/usr/apps/python/lib/python2.7/multiprocessing/__init__.py", line 260, in Array
return Array(typecode_or_type, size_or_initializer, **kwds)
File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 120, in Array
obj = RawArray(typecode_or_type, size_or_initializer)
File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 88, in RawArray
obj = _new_value(type_)
File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 68, in _new_value
wrapper = heap.BufferWrapper(size)
File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 243, in __init__
block = BufferWrapper._heap.malloc(size)
File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 223, in malloc
(arena, start, stop) = self._malloc(size)
File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 120, in _malloc
arena = Arena(length)
File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 82, in __init__
self.buffer = mmap.mmap(-1, size)
mmap.error: [Errno 12] Cannot allocate memory
Is there a good way to pass a large chunk of data between two python subprocesses without using the disk? Here's a cartoon example of what I'm hoping to accomplish:
import sys, subprocess, numpy

cmdString = """
import sys, numpy

done = False
while not done:
    cmd = raw_input()
    if cmd == 'done':
        done = True
    elif cmd == 'data':
        ##Fake data. In real life, get data from hardware.
        data = numpy.zeros(1000000, dtype=numpy.uint8)
        data.dump('data.pkl')
        sys.stdout.write('data.pkl' + '\\n')
        sys.stdout.flush()"""

proc = subprocess.Popen( #python vs. pythonw on Windows?
    [sys.executable, '-c %s'%cmdString],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)

for i in range(3):
    proc.stdin.write('data\n')
    print proc.stdout.readline().rstrip()
    a = numpy.load('data.pkl')
    print a.shape

proc.stdin.write('done\n')
This creates a subprocess which generates a numpy array and saves the array to disk. The parent process then loads the array from disk. It works!
The problem is, our hardware can generate data 10x faster than the disk can read/write. Is there a way to transfer data from one python process to another purely in-memory, maybe even without making a copy of the data? Can I do something like passing-by-reference?
My first attempt at transferring data purely in-memory is pretty lousy:
import sys, subprocess, numpy

cmdString = """
import sys, numpy

done = False
while not done:
    cmd = raw_input()
    if cmd == 'done':
        done = True
    elif cmd == 'data':
        ##Fake data. In real life, get data from hardware.
        data = numpy.zeros(1000000, dtype=numpy.uint8)
        ##Note that this is NFG if there's a '10' in the array:
        sys.stdout.write(data.tostring() + '\\n')
        sys.stdout.flush()"""

proc = subprocess.Popen( #python vs. pythonw on Windows?
    [sys.executable, '-c %s'%cmdString],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)

for i in range(3):
    proc.stdin.write('data\n')
    a = numpy.fromstring(proc.stdout.readline().rstrip(), dtype=numpy.uint8)
    print a.shape

proc.stdin.write('done\n')
This is extremely slow (much slower than saving to disk) and very, very fragile. There's got to be a better way!
I'm not married to the 'subprocess' module, as long as the data-taking process doesn't block the parent application. I briefly tried 'multiprocessing', but without success so far.
Background: We have a piece of hardware that generates up to ~2 GB/s of data in a series of ctypes buffers. The python code to handle these buffers has its hands full just dealing with the flood of information. I want to coordinate this flow of information with several other pieces of hardware running simultaneously in a 'master' program, without the subprocesses blocking each other. My current approach is to boil the data down a little bit in the subprocess before saving to disk, but it'd be nice to pass the full monty to the 'master' process.
While googling around for more information about the code Joe Kington posted, I found the numpy-sharedmem package. Judging from this numpy/multiprocessing tutorial it seems to share the same intellectual heritage (maybe largely the same authors? -- I'm not sure).
Using the sharedmem module, you can create a shared-memory numpy array (awesome!), and use it with multiprocessing like this:
import sharedmem as shm
import numpy as np
import multiprocessing as mp

def worker(q, arr):
    done = False
    while not done:
        cmd = q.get()
        if cmd == 'done':
            done = True
        elif cmd == 'data':
            ##Fake data. In real life, get data from hardware.
            rnd = np.random.randint(100)
            print('rnd={0}'.format(rnd))
            arr[:] = rnd
        q.task_done()

if __name__ == '__main__':
    N = 10
    arr = shm.zeros(N, dtype=np.uint8)
    q = mp.JoinableQueue()
    proc = mp.Process(target=worker, args=[q, arr])
    proc.daemon = True
    proc.start()

    for i in range(3):
        q.put('data')
        # Wait for the computation to finish
        q.join()
        print arr.shape
        print(arr)

    q.put('done')
    proc.join()
Running yields
rnd=53
(10,)
[53 53 53 53 53 53 53 53 53 53]
rnd=15
(10,)
[15 15 15 15 15 15 15 15 15 15]
rnd=87
(10,)
[87 87 87 87 87 87 87 87 87 87]
Basically, you just want to share a block of memory between processes and view it as a numpy array, right?
In that case, have a look at this (Posted to numpy-discussion by Nadav Horesh awhile back, not my work). There are a couple of similar implementations (some more flexible), but they all essentially use this principle.
# "Using Python, multiprocessing and NumPy/SciPy for parallel numerical computing"
# Modified and corrected by Nadav Horesh, Mar 2010
# No rights reserved
import numpy as N
import ctypes
import multiprocessing as MP
_ctypes_to_numpy = {
ctypes.c_char : N.dtype(N.uint8),
ctypes.c_wchar : N.dtype(N.int16),
ctypes.c_byte : N.dtype(N.int8),
ctypes.c_ubyte : N.dtype(N.uint8),
ctypes.c_short : N.dtype(N.int16),
ctypes.c_ushort : N.dtype(N.uint16),
ctypes.c_int : N.dtype(N.int32),
ctypes.c_uint : N.dtype(N.uint32),
ctypes.c_long : N.dtype(N.int64),
ctypes.c_ulong : N.dtype(N.uint64),
ctypes.c_float : N.dtype(N.float32),
ctypes.c_double : N.dtype(N.float64)}
_numpy_to_ctypes = dict(zip(_ctypes_to_numpy.values(), _ctypes_to_numpy.keys()))
def shmem_as_ndarray(raw_array, shape=None ):
address = raw_array._obj._wrapper.get_address()
size = len(raw_array)
if (shape is None) or (N.asarray(shape).prod() != size):
shape = (size,)
elif type(shape) is int:
shape = (shape,)
else:
shape = tuple(shape)
dtype = _ctypes_to_numpy[raw_array._obj._type_]
class Dummy(object): pass
d = Dummy()
d.__array_interface__ = {
'data' : (address, False),
'typestr' : dtype.str,
'descr' : dtype.descr,
'shape' : shape,
'strides' : None,
'version' : 3}
return N.asarray(d)
def empty_shared_array(shape, dtype, lock=True):
'''
Generate an empty MP shared array given ndarray parameters
'''
if type(shape) is not int:
shape = N.asarray(shape).prod()
try:
c_type = _numpy_to_ctypes[dtype]
except KeyError:
c_type = _numpy_to_ctypes[N.dtype(dtype)]
return MP.Array(c_type, shape, lock=lock)
def emptylike_shared_array(ndarray, lock=True):
'Generate a empty shared array with size and dtype of a given array'
return empty_shared_array(ndarray.size, ndarray.dtype, lock)
From the other answers, it seems that numpy-sharedmem is the way to go.
However, if you need a pure python solution, or installing extensions, cython or the like is a (big) hassle, you might want to use the following code which is a simplified version of Nadav's code:
import numpy, ctypes, multiprocessing

_ctypes_to_numpy = {
    ctypes.c_char   : numpy.dtype(numpy.uint8),
    ctypes.c_wchar  : numpy.dtype(numpy.int16),
    ctypes.c_byte   : numpy.dtype(numpy.int8),
    ctypes.c_ubyte  : numpy.dtype(numpy.uint8),
    ctypes.c_short  : numpy.dtype(numpy.int16),
    ctypes.c_ushort : numpy.dtype(numpy.uint16),
    ctypes.c_int    : numpy.dtype(numpy.int32),
    ctypes.c_uint   : numpy.dtype(numpy.uint32),
    ctypes.c_long   : numpy.dtype(numpy.int64),
    ctypes.c_ulong  : numpy.dtype(numpy.uint64),
    ctypes.c_float  : numpy.dtype(numpy.float32),
    ctypes.c_double : numpy.dtype(numpy.float64)}

_numpy_to_ctypes = dict(zip(_ctypes_to_numpy.values(),
                            _ctypes_to_numpy.keys()))


def shm_as_ndarray(mp_array, shape=None):
    '''Given a multiprocessing.Array, returns an ndarray pointing to
    the same data.'''

    # support SynchronizedArray:
    if not hasattr(mp_array, '_type_'):
        mp_array = mp_array.get_obj()

    dtype = _ctypes_to_numpy[mp_array._type_]
    result = numpy.frombuffer(mp_array, dtype)

    if shape is not None:
        result = result.reshape(shape)

    return numpy.asarray(result)


def ndarray_to_shm(array, lock=False):
    '''Generate a 1D multiprocessing.Array containing the data from
    the passed ndarray.  The data will be *copied* into shared
    memory.'''

    array1d = array.ravel(order='A')

    try:
        c_type = _numpy_to_ctypes[array1d.dtype]
    except KeyError:
        c_type = _numpy_to_ctypes[numpy.dtype(array1d.dtype)]

    result = multiprocessing.Array(c_type, array1d.size, lock=lock)
    shm_as_ndarray(result)[:] = array1d
    return result
You would use it like this (a combined sketch follows after this list):
Use sa = ndarray_to_shm(a) to convert the ndarray a into a shared multiprocessing.Array.
Use multiprocessing.Process(target=somefunc, args=(sa,)) (and start, maybe join) to call somefunc in a separate process, passing the shared array.
In somefunc, use a = shm_as_ndarray(sa) to get an ndarray pointing to the shared data. (Actually, you may want to do the same in the original process, immediately after creating sa, in order to have two ndarrays referencing the same data.)
AFAICS, you don't need to set lock to True, since shm_as_ndarray will not use the locking anyhow. If you need locking, you would set lock to True and call acquire/release on sa.
Also, if your array is not 1-dimensional, you might want to transfer the shape along with sa (e.g. use args = (sa, a.shape)).
This solution has the advantage that it does not need additional packages or extension modules, except multiprocessing (which is in the standard library).
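Putting those steps together, a minimal usage sketch (my own, assuming the two helper functions above are defined in the same module and a fork-based start method):

import numpy as np
import multiprocessing

def somefunc(sa, shape):
    a = shm_as_ndarray(sa, shape)  # view on the shared data, no copy
    a *= 2                         # modifications are visible to the parent

if __name__ == '__main__':
    a = np.arange(12, dtype=np.float64).reshape(3, 4)
    sa = ndarray_to_shm(a)  # copies the data into shared memory once
    p = multiprocessing.Process(target=somefunc, args=(sa, a.shape))
    p.start()
    p.join()
    print(shm_as_ndarray(sa, a.shape))  # reflects the child's changes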
Use threads. But I guess you are going to get problems with the GIL.
Instead: Choose your poison.
I know from the MPI implementations I work with that they use shared memory for on-node communication. You will have to code your own synchronization in that case.
2 GB/s sounds like you will get problems with most "easy" methods, depending on your real-time constraints and available main memory.
One possibility to consider is to use a RAM drive for the temporary storage of files to be shared between processes. A RAM drive is where a portion of RAM is treated as a logical hard drive, to which files can be written/read as you would with a regular drive, but at RAM read/write speeds.
This article describes using the ImDisk software (for MS Windows) to create such a disk, and reports file read/write speeds of 6-10 gigabytes/second:
https://www.tekrevue.com/tip/create-10-gbs-ram-disk-windows/
An example in Ubuntu:
https://askubuntu.com/questions/152868/how-do-i-make-a-ram-disk#152871
Another noted benefit is that files of arbitrary formats can be passed around with such a method: e.g. pickle, JSON, XML, CSV, HDF5, etc.
Keep in mind that anything stored on the RAM disk is wiped on reboot.
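On Linux specifically, a hedged shortcut is the tmpfs mount that most distributions already provide at /dev/shm, which behaves like a RAM disk without any setup (the file name below is just illustrative):

import numpy as np

data = np.zeros(1_000_000, dtype=np.uint8)

# /dev/shm is RAM-backed on most Linux systems, so this never touches the physical disk
np.save('/dev/shm/data.npy', data)

loaded = np.load('/dev/shm/data.npy')
print(loaded.shape)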
Use threads. You probably won't have problems with the GIL.
The GIL only affects Python code, not C/Fortran/Cython backed libraries. Most numpy operations and a good chunk of the C-backed Scientific Python stack release the GIL and can operate just fine on multiple cores. This blogpost discusses the GIL and scientific Python in more depth.
Edit
Simple ways to use threads include the threading module and multiprocessing.pool.ThreadPool.
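A rough sketch of that (my own example, not from the original answer); whether it actually runs in parallel depends on the numpy routines releasing the GIL:

import numpy as np
from multiprocessing.pool import ThreadPool

arr = np.random.rand(4, 1_000_000)  # shared by all threads, no copies

def row_norm(i):
    # numpy releases the GIL inside this call, so the threads can overlap
    return np.linalg.norm(arr[i])

pool = ThreadPool(4)
print(pool.map(row_norm, range(arr.shape[0])))
pool.close()
pool.join()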