Zero simulation time - Python

I am creating a kind of RAM memory. The idea was to first create the RAM "write" functionality, as you can see in the code below. Besides the RAM memory itself, there is a RAM model driver, which is supposed to write data to the RAM (just to briefly verify that the write functionality works properly).
The RAM model driver and the RAM model are connected to each other and some transactions should occur, but the problem is that the simulation completes within zero simulation seconds.
Does anybody have an idea what the problem could be?
@gear
def ram_model(write_addr: Uint,
              write_data: Queue['dtype'], *,
              ram_mem=None,
              dtype=b'dtype',
              mem_granularity_in_bytes=1) -> (Queue['dtype']):
    if ram_mem is None and type(ram_mem) is not dict:
        ram_mem = {}

    ram_write_op(write_addr=write_addr,
                 write_data=write_data,
                 ram_memory=ram_mem)
@gear
async def ram_write_op(write_addr: Uint,
                       write_data: Queue, *,
                       ram_memory=None,
                       mem_granularity_in_bytes=1):
    if ram_memory is None and type(ram_memory) is not dict:
        raise SystemError("Ram memory is %s but it should be a dictionary"
                          % type(ram_memory))

    byte_t = Array[Uint[8], mem_granularity_in_bytes]

    async with write_addr as addr:
        async for data, _ in write_data:
            for b in code(data, byte_t):
                ram_memory[addr] = b
                addr += 1
@gear
async def ram_model_drv(*, addr_bus_width=b'asize',
                        data_type=b'dtype') -> (Uint[8], Queue['data_type']):
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        for i in range(matrix[command_id].size):
            yield (command_id, (matrix[command_id][i], i == matrix[command_id].size))
stimul = ram_model_drv(addr_bus_width=8, data_type=Fixp[8, 8])
out = ram_model(stimul[0], stimul[1])

sim()
Here is the output message:
python ram_model.py
- [INFO]: Running sim with seed: 3934280405122873233
0 [INFO]: -------------- Simulation start --------------
0 [INFO]: ----------- Simulation done ---------------
0 [INFO]: Elapsed: 0.00

Yeah, this one is a bit convoluted. The gist of the issue is that in the ram_model_drv module you are synchronously outputting data on both of its output interfaces with the yield statement. For PyGears this means that you would like the data on both of these interfaces acknowledged before continuing. The ram_write_op module is connected to both of these interfaces via the write_addr and write_data arguments. Inside that module, you acknowledge data from the write_addr interface only after you've read multiple data items from the write_data interface, hence there's a deadlock, and the PyGears simulator detects that no further simulation steps are possible and exits at the end of step 0.
There are also two additional issues in the driver:
It will never generate an eot for the output data Queue. Instead, eot should be generated when i == matrix[command_id].size - 1.
The async modules are run in an endless loop by PyGears, so your ram_model_drv will generate data endlessly unless you explicitly raise a GearDone exception.
OK, back to the main issue. There are several possibilities to circumvent it:
Use decoupling
For this to work, you first need to split the data output into two yield statements, one for write_addr and the other for write_data, since your ram_write_op will use only one address per several data items.
@gear
async def ram_model_drv(*, addr_bus_width, data_type) -> (Uint[8], Queue['data_type']):
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        yield (command_id, None)
        for i in range(matrix[command_id].size):
            yield (None, (matrix[command_id][i], i == matrix[command_id].size - 1))

    raise GearDone
You can use either dreg or decouple modules to temporarily store output data from ram_model_drv before they are consumed by ram_write_op.
out = ram_model(stimul[0] | decouple, stimul[1] | decouple)
Split driver into two modules, one driving each of the two interfaces
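For illustration, a hedged sketch of this option (my code, not from the original answer; the exact PyGears typing may need adjusting). Since each driver owns a single output interface, no yield ever waits on the other interface, so the deadlock cannot occur:

@gear
async def addr_drv(*, num_of_w_comnds) -> Uint[8]:
    # Drives only the address interface.
    for command_id in range(num_of_w_comnds):
        yield command_id
    raise GearDone

@gear
async def data_drv(*, matrix, data_type=b'dtype') -> Queue['data_type']:
    # Drives only the data Queue, with eot on the last element of each row.
    for row in matrix:
        for i in range(len(row)):
            yield (row[i], i == len(row) - 1)
    raise GearDone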
Use low level synchronization API for interfaces
Underneath the yield mechanism, there is a lower-level API for communicating via PyGears interfaces. Handles to the output interfaces can be obtained via the module().dout field. Data can be sent over an interface without waiting for it to be acknowledged using the put_nb() method. Later, in order to wait for the acknowledgment, the ready() method can be awaited. Finally, the put() method combines the two in one call, so it will both send the data and wait for the acknowledgment.
@gear
async def ram_model_drv(*,
                        addr_bus_width=b'asize',
                        data_type=b'dtype') -> (Uint[8], Queue['data_type']):
    addr, data = module().dout
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        addr.put_nb(command_id)
        for i in range(matrix[command_id].size):
            await data.put((matrix[command_id][i], i == matrix[command_id].size - 1))
        await addr.ready()

    raise GearDone

Related

Shared Memory between Julia and Python seems very slow (60 micros for round trip)

I am generating strings in Julia to use in Python. I would like to use shared memory (InterProcessCommunication.jl and multiprocessing in Python). Currently, Julia generates strings, then sends them to Python, which reads the first number (to determine the string length) before converting the rest into an encoded string.
I thought that shared memory would be much faster, but my method of timing (see below) seems to give 60-65 microseconds to:
Send the string and string length.
Detect the change in Python, read the message, and convert it to bytes.
Send back an indication for Julia to detect.
I am using Ubuntu. Comparatively, using TCP sockets takes 200 microseconds (so only a 3x speedup).
GSTiming comes from here: "How can I get millisecond and microsecond-resolution timestamps in Python?"
Julia:
using InterProcessCommunication
using Random

function copy_string_to_shared_memory(Aptr::Ptr{UInt8}, p::Vector{UInt8})
    for i = 1:length(p)
        unsafe_store!(Aptr, p[i], i + 4)
    end
end

function main()
    A = SharedMemory("myid"; readonly=false)
    Aptr = convert(Ptr{UInt8}, pointer(A))
    Bptr = convert(Ptr{UInt32}, pointer(A))
    u = []
    for i = 1:105
        # p = Vector{UInt8}(randstring(rand(50:511)))
        p = Vector{UInt8}("Hello Stack Exchange" * randstring(rand(1:430)))
        a = time_ns()
        copy_string_to_shared_memory(Aptr, p)
        unsafe_store!(Bptr, length(p), 1)
        # Make sure we can write to it
        while unsafe_load(Aptr, 1) != 1
            nanosleep(10e-7)
        end
        b = time_ns()
        println((b - a) / 1000) # How I get times
        sleep(0.01)
    end
end

main()
Python Code:
from multiprocessing import shared_memory
import array
import time
import GSTiming

def main():
    shm_a = shared_memory.SharedMemory("myid", create=True, size=512)
    shm_a.buf[0] = 1 # Modify single byte at a time
    u = []
    for i in range(105):
        while shm_a.buf[0] == 1:
            GSTiming.delayMicroseconds(1)
        s = bytes(shm_a.buf[4:shm_a.buf[0]+4])
        shm_a.buf[0] = 1
    shm_a.close()
    shm_a.unlink() # Call unlink only once to release the shared memory

main()

Reading large file with Python Multiprocessing

I am trying to read a large text file (> 20 GB) with Python.
The file contains positions of atoms for 400 frames, and each frame is independent in terms of my computations in this code. In theory I can split the job into 400 tasks without any need for communication. Each frame has 1,000,000 lines, so the file has 1,000,000 × 400 lines of text.
My initial approach is using multiprocessing with pool of workers:
import sys
import time
import mmap
import multiprocessing as mp

def main():
    """ main function
    """
    filename = sys.argv[1]
    nump = int(sys.argv[2])
    f = open(filename)
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cursor = 0
    framelocs = []
    start = time.time()
    print(mp.cpu_count())
    chunks = []
    while True:
        initial = s.find(b'ITEM: TIMESTEP', cursor)
        if initial == -1:
            break
        cursor = initial + 14
        final = s.find(b'ITEM: TIMESTEP', cursor)
        framelocs.append([initial, final])
        #readchunk(s[initial:final])
        chunks.append(s[initial:final])
        if final == -1:
            break
Here I am basically seeking through the file to find the frame beginnings and ends, opening the file with Python's mmap module to avoid reading everything into memory.
def readchunk(chunk):
    start = time.time()
    part = chunk.split(b'\n')
    timestep = int(part[1])
    print(timestep)
Now I would like to send chunks of the file to a pool of workers to process. The read part will be more complex, but those lines will be implemented later.
    print('Seeking file took %8.6f' % (time.time() - start))
    pool = mp.Pool(nump)
    start = time.time()
    results = pool.map(readchunk, chunks[0:16])
    print('Reading file took %8.6f' % (time.time() - start))
If I run this sending 8 chunks to 8 cores, it takes 0.8 s to read. However, if I run it sending 16 chunks to 16 cores, it takes 1.7 s. Parallelization does not seem to speed things up. I am running this on Oak Ridge's Summit supercomputer, if that is relevant, using this command:
jsrun -n1 -c16 -a1 python -u ~/Developer/DipoleAnalyzer/AtomMan/readlargefile.py DW_SET6_NVT.lammpstrj 16
This is supposed to create 1 MPI task and assign 16 cores to 16 threads.
Am I missing something here?
Is there a better approach?
As others have said, there is some overhead when making processes, so you could see a slowdown if testing with small samples.
Something like this might be neater. Make sure you understand what the generator function is doing.
import multiprocessing as mp
import sys
import mmap

def do_something_with_frame(frame):
    print("processing a frame:")
    return 100

def frame_supplier(filename):
    """A generator for frames"""
    f = open(filename)
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cursor = 0
    while True:
        initial = s.find(b'ITEM: TIMESTEP', cursor)
        if initial == -1:
            break
        cursor = initial + 14
        final = s.find(b'ITEM: TIMESTEP', cursor)
        yield s[initial:final]
        if final == -1:
            break

def main():
    """Process a file of atom frames
    Args:
        filename: the file to process
        processes: the size of the pool
    """
    filename = sys.argv[1]
    nump = int(sys.argv[2])
    frames = frame_supplier(filename)
    pool = mp.Pool(nump)
    # play around with the chunksize
    for result in pool.imap(do_something_with_frame, frames, chunksize=10):
        print(result)
Disclaimer: this is a suggestion. There may be some syntax errors. I haven't tested it.
EDIT:
It sounds like your script is becoming I/O limited (i.e. limited by the rate at which you can read from disk). You should be able to verify this by setting the body of do_something_with_frame to pass. If the program is I/O bound, it will still take nearly as long.
I don't think MPI is going to make any difference here. I think that file-read speed is probably a limiting factor and I don't see how MPI will help.
It's worth doing some profiling at this point to find out which function calls are taking the longest.
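For instance, a minimal profiling sketch with the standard library's cProfile (assuming this script's main() as the entry point; the .prof filename is my own choice):

import cProfile
import pstats

cProfile.run('main()', 'readlargefile.prof')        # profile one full run
stats = pstats.Stats('readlargefile.prof')
stats.sort_stats('cumulative').print_stats(10)      # ten most expensive calls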
It is also worth trying without mmap():

frame = []
with open(filename) as file:
    for line in file:
        if line.startswith('ITEM: TIMESTEP'):
            yield frame
            frame = []  # start collecting the next frame
        else:
            frame.append(line)

Implement realtime signal processing in Python - how to capture audio continuously?

I'm planning to implement a "DSP-like" signal processor in Python. It should capture small fragments of audio via ALSA, process them, then play them back via ALSA.
To get things started, I wrote the following (very simple) code.
import alsaaudio

inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NORMAL)
inp.setchannels(1)
inp.setrate(96000)
inp.setformat(alsaaudio.PCM_FORMAT_U32_LE)
inp.setperiodsize(1920)

outp = alsaaudio.PCM(alsaaudio.PCM_PLAYBACK, alsaaudio.PCM_NORMAL)
outp.setchannels(1)
outp.setrate(96000)
outp.setformat(alsaaudio.PCM_FORMAT_U32_LE)
outp.setperiodsize(1920)

while True:
    l, data = inp.read()
    # TODO: Perform some processing.
    outp.write(data)
The problem is that the audio "stutters" and is not gapless. I tried experimenting with the PCM mode, setting it to either PCM_ASYNC or PCM_NONBLOCK, but the problem remains. I think the problem is that samples "between" two subsequent calls to inp.read() are lost.
Is there a way to capture audio "continuously" in Python (preferably without the need for too "specific"/"non-standard" libraries)? I'd like the signal to always get captured "in the background" into some buffer, from which I can read some "momentary state", while audio is further being captured into the buffer even during the time, when I perform my read operations. How can I achieve this?
Even if I use a dedicated process/thread to capture the audio, this process/thread will always at least have to (1) read audio from the source, (2) then put it into some buffer (from which the "signal processing" process/thread then reads). These two operations will therefore still be sequential in time and thus samples will get lost. How do I avoid this?
Thanks a lot for your advice!
EDIT 2: Now I have it running.
import alsaaudio
from multiprocessing import Process, Queue
import numpy as np
import struct

"""
A class implementing buffered audio I/O.
"""
class Audio:

    """
    Initialize the audio buffer.
    """
    def __init__(self):
        #self.__rate = 96000
        self.__rate = 8000
        self.__stride = 4
        self.__pre_post = 4
        self.__read_queue = Queue()
        self.__write_queue = Queue()

    """
    Reads audio from an ALSA audio device into the read queue.
    Supposed to run in its own process.
    """
    def __read(self):
        inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NORMAL)
        inp.setchannels(1)
        inp.setrate(self.__rate)
        inp.setformat(alsaaudio.PCM_FORMAT_U32_BE)
        inp.setperiodsize(self.__rate / 50)

        while True:
            _, data = inp.read()
            self.__read_queue.put(data)

    """
    Writes audio to an ALSA audio device from the write queue.
    Supposed to run in its own process.
    """
    def __write(self):
        outp = alsaaudio.PCM(alsaaudio.PCM_PLAYBACK, alsaaudio.PCM_NORMAL)
        outp.setchannels(1)
        outp.setrate(self.__rate)
        outp.setformat(alsaaudio.PCM_FORMAT_U32_BE)
        outp.setperiodsize(self.__rate / 50)

        while True:
            data = self.__write_queue.get()
            outp.write(data)

    """
    Pre-post data into the output buffer to avoid buffer underrun.
    """
    def __pre_post_data(self):
        zeros = np.zeros(self.__rate / 50, dtype = np.uint32)

        for i in range(0, self.__pre_post):
            self.__write_queue.put(zeros)

    """
    Runs the read and write processes.
    """
    def run(self):
        self.__pre_post_data()
        read_process = Process(target = self.__read)
        write_process = Process(target = self.__write)
        read_process.start()
        write_process.start()

    """
    Reads audio samples from the queue captured from the reading thread.
    """
    def read(self):
        return self.__read_queue.get()

    """
    Writes audio samples to the queue to be played by the writing thread.
    """
    def write(self, data):
        self.__write_queue.put(data)

    """
    Pseudonymize the audio samples from a binary string into an array of integers.
    """
    def pseudonymize(self, s):
        return struct.unpack(">" + ("I" * (len(s) / self.__stride)), s)

    """
    Depseudonymize the audio samples from an array of integers into a binary string.
    """
    def depseudonymize(self, a):
        s = ""

        for elem in a:
            s += struct.pack(">I", elem)

        return s

    """
    Normalize the audio samples from an array of integers into an array of floats with unity level.
    """
    def normalize(self, data, max_val):
        data = np.array(data)
        bias = int(0.5 * max_val)
        fac = 1.0 / (0.5 * max_val)
        data = fac * (data - bias)
        return data

    """
    Denormalize the data from an array of floats with unity level into an array of integers.
    """
    def denormalize(self, data, max_val):
        bias = int(0.5 * max_val)
        fac = 0.5 * max_val
        data = np.array(data)
        data = (fac * data).astype(np.int64) + bias
        return data

debug = True
audio = Audio()
audio.run()

while True:
    data = audio.read()
    pdata = audio.pseudonymize(data)

    if debug:
        print "[PRE-PSEUDONYMIZED] Min: " + str(np.min(pdata)) + ", Max: " + str(np.max(pdata))

    ndata = audio.normalize(pdata, 0xffffffff)

    if debug:
        print "[PRE-NORMALIZED] Min: " + str(np.min(ndata)) + ", Max: " + str(np.max(ndata))
        print "[PRE-NORMALIZED] Level: " + str(int(10.0 * np.log10(np.max(np.absolute(ndata)))))

    #ndata += 0.01 # When I comment in this line, it wreaks complete havoc!

    if debug:
        print "[POST-NORMALIZED] Level: " + str(int(10.0 * np.log10(np.max(np.absolute(ndata)))))
        print "[POST-NORMALIZED] Min: " + str(np.min(ndata)) + ", Max: " + str(np.max(ndata))

    pdata = audio.denormalize(ndata, 0xffffffff)

    if debug:
        print "[POST-PSEUDONYMIZED] Min: " + str(np.min(pdata)) + ", Max: " + str(np.max(pdata))
        print ""

    data = audio.depseudonymize(pdata)
    audio.write(data)
However, when I perform even the slightest modification to the audio data (e.g. uncomment that line), I get a lot of noise and extreme distortion at the output. It seems like I don't handle the PCM data correctly. The strange thing is that the output of the "level meter", etc. all appears to make sense. However, the output is completely distorted (but continuous) when I offset it just slightly.
EDIT 3: I just found out that my algorithms (not included here) work when I apply them to wave files. So the problem really appears to boil down to the ALSA API.
EDIT 4: I finally found the problems. They were the following.
1st - ALSA quietly "fell back" to PCM_FORMAT_U8_LE upon requesting PCM_FORMAT_U32_LE, so I interpreted the data incorrectly by assuming that each sample was 4 bytes wide. It works when I request PCM_FORMAT_S32_LE.
2nd - The ALSA output seems to expect the period size in bytes, even though the specification explicitly states that it is expected in frames. So you have to set the period size four times as high for output if you use a 32-bit sample depth.
3rd - Even in Python (where there is a "global interpreter lock"), processes are slow compared to threads. You can get latency down a lot by switching to threads, since the I/O threads basically don't do anything computationally intensive.
When you
read one chunk of data,
write one chunk of data,
then wait for the second chunk of data to be read,
then the buffer of the output device will become empty if the second chunk is not shorter than the first chunk.
You should fill up the output device's buffer with silence before starting the actual processing. Then small delays in either the input or output processing will not matter.
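A minimal sketch of that pre-filling idea (my illustration; it assumes the outp playback PCM from the question, one channel of 32-bit samples, and a period size of 1920 frames):

import numpy as np

PERIOD = 1920                                    # frames per period, as configured above
silence = np.zeros(PERIOD, dtype=np.int32).tobytes()

for _ in range(4):                               # queue a few periods of silence up front
    outp.write(silence)                          # 'outp' as set up in the question's code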
You can do that all manually, as @CL recommends in their answer, but I'd recommend just using GNU Radio instead:
It's a framework that takes care of doing all the "getting small chunks of samples in and out your algorithm"; it scales very well, and you can write your signal processing either in Python or C++.
In fact, it comes with an Audio Source and an Audio Sink that directly talk to ALSA and just give/take continuous samples. I'd recommend reading through GNU Radio's Guided Tutorials; they explain exactly what is necessary to do your signal processing for an audio application.
A really minimal flow graph would be: Audio Source → high-pass filter → Audio Sink.
You can substitute the high-pass filter with your own signal processing block, or use any combination of the existing blocks.
There are helpful things like file and WAV-file sinks and sources, filters, resamplers, amplifiers (OK, multipliers), …
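For illustration, a hedged sketch of that minimal flow graph in Python (my code, not from the tutorials; assuming a GNU Radio 3.8-era API, so exact signatures may differ):

from gnuradio import gr, audio, filter
from gnuradio.filter import firdes

class MinimalFlowGraph(gr.top_block):
    def __init__(self, samp_rate=48000):
        gr.top_block.__init__(self)
        src = audio.source(samp_rate)                 # continuous samples from ALSA
        # High-pass FIR taps: gain 1.0, 300 Hz cutoff, 100 Hz transition width.
        taps = firdes.high_pass(1.0, samp_rate, 300, 100)
        hpf = filter.fir_filter_fff(1, taps)          # replace with your own block
        snk = audio.sink(samp_rate)                   # continuous samples to ALSA
        self.connect(src, hpf, snk)

if __name__ == '__main__':
    MinimalFlowGraph().run()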
I finally found the problems. They were the following.
1st - ALSA quietly "fell back" to PCM_FORMAT_U8_LE upon requesting PCM_FORMAT_U32_LE, so I interpreted the data incorrectly by assuming that each sample was 4 bytes wide. It works when I request PCM_FORMAT_S32_LE.
2nd - The ALSA output seems to expect the period size in bytes, even though the specification explicitly states that it is expected in frames. So you have to set the period size four times as high for output if you use a 32-bit sample depth.
3rd - Even in Python (where there is a "global interpreter lock"), processes are slow compared to threads. You can get latency down a lot by switching to threads, since the I/O threads basically don't do anything computationally intensive.
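A hedged sketch of the third point (my stand-in code, not the original): the same read/process/write pipeline built on threading and queue.Queue, with dummy payloads instead of ALSA I/O:

from threading import Thread
from queue import Queue

read_q = Queue()
write_q = Queue()

def reader():                        # would wrap inp.read() in the real code
    for _ in range(5):
        read_q.put(b'\x00' * 640)
    read_q.put(None)                 # sentinel: end of stream

def writer():                        # would wrap outp.write() in the real code
    while True:
        data = write_q.get()
        if data is None:
            break
        print('playing %d bytes' % len(data))

Thread(target=reader).start()
writer_thread = Thread(target=writer)
writer_thread.start()

while True:
    chunk = read_q.get()
    if chunk is None:
        break
    write_q.put(chunk)               # the actual signal processing would go here

write_q.put(None)
writer_thread.join()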
Audio is gapless and undistorted now, but latency is far too high.

Async ping with multiprocessing.pool

I am trying to ping a few hundred devices in Python using multiprocessing.pool on Windows as part of a larger program. The responses are parsed into success cases and failure cases (i.e. the request times out, or the host is unreachable).
The code below works fine; however, another part of the program takes the average from the response and calculates a rolling average with previous results fetched from a database.
The rolling-average function fails very occasionally on int(new_average) because the new_average passed in is None. Note that the rolling average is only calculated in success cases.
I think the error must be in the parse function (this seems unlikely), or in how I am using multiprocessing.pool.
My question: am I using multiprocessing correctly? More generally, is there a better way to implement this asynchronous ping? I have looked at using Twisted, but I didn't see any ICMP protocol implementation (there is txnettools on GitHub, but I am not sure of its correctness, and it doesn't look maintained anymore).
A node object looks like:
class Node(object):
    def __init__(self, ip):
        self.ip = ip
        self.ping_result = None
    # other attributes and methods...
Here is the idea of the async ping code:
import os
import re
from multiprocessing.pool import ThreadPool

def parse_ping_response(response):
    '''Parses a response into a list of the format:
    [ip_address, packets_lost, average, success_or_failure]
    Ex: ['10.10.10.10', '0', '90', 'success']
    Ex: [None, 5, None, 'failure']
    '''
    reply = re.compile(r"Reply\sfrom\s(.*?)\:")
    lost = re.compile(r"Lost\s=\s(\d*)\s")
    average = re.compile(r"Average\s=\s(\d+)m")
    results = [x.findall(response) for x in [reply, lost, average]]

    # Get reply, if it was found. Set [] to None.
    results = [x[0] if len(x) > 0 else None for x in results]

    # Check for host unreachable error.
    # If we cannot find an ip address, the request timed out.
    if results[0] is None:
        return results + ['failure']
    elif 'Destination host unreachable' in response:
        return results + ['failure']
    else:
        return results + ['success']

def ping_ip(node):
    ping = os.popen("ping -n 5 " + node.ip, "r")
    node.ping_result = parse_ping_response(ping.read())
    return

def run_ping_tests(nodelist):
    pool = ThreadPool(processes=100)
    pool.map(ping_ip, nodelist)
    return

if __name__ == "__main__":
    # nodelist is a list of node objects
    run_ping_tests(nodelist)
An example ping response for reference (from the Microsoft docs):
Pinging 131.107.8.1 with 1450 bytes of data:
Reply from 131.107.8.1: bytes=1450 time<10ms TTL=32
Reply from 131.107.8.1: bytes=1450 time<10ms TTL=32
Ping statistics for 131.107.8.1:
Packets: Sent = 2, Received = 2, Lost = 0 (0% loss),
Approximate roundtrip times in milliseconds:
Minimum = 0ms, Maximum = 10ms, Average = 2ms
I recommend you use gevent (http://www.gevent.org, an asynchronous I/O library based on libev and greenlet coroutines) instead of multiprocessing.
It turns out there is an implementation of ICMP for gevent:
https://github.com/mastahyeti/gping
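As a rough sketch (mine, not from gping's docs), the question's ping_ip could be driven from a greenlet pool; gping, linked above, would replace the os.popen ping with a pure-Python ICMP implementation:

from gevent import monkey
monkey.patch_all()              # patch blocking I/O so greenlets can yield where possible

from gevent.pool import Pool

def run_ping_tests(nodelist):
    pool = Pool(100)            # at most 100 concurrent greenlets
    pool.map(ping_ip, nodelist) # ping_ip as defined in the question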

Shared-memory objects in multiprocessing

Suppose I have a large in-memory numpy array, and I have a function func that takes this giant array as input (together with some other parameters). func with different parameters can be run in parallel. For example:

def func(arr, param):
    # do stuff to arr, param

# build array arr

pool = Pool(processes=6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]

If I use the multiprocessing library, that giant array will be copied multiple times into different processes.
Is there a way to let different processes share the same array? This array object is read-only and will never be modified.
What's more complicated: if arr is not an array but an arbitrary Python object, is there a way to share it?
[EDITED]
I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes in Python's multiprocessing library. But the following code suggests there is a huge overhead:
from multiprocessing import Pool, Manager
import numpy as np
import time

def f(arr):
    return len(arr)

t = time.time()
arr = np.arange(10000000)
print "construct array = ", time.time() - t

pool = Pool(processes=6)

t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print "multiprocessing overhead = ", time.time() - t
output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):
construct array = 0.0178790092468
multiprocessing overhead = 0.252444982529
Why is there such a huge overhead if we didn't copy the array? And what does shared memory actually save me?
If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don't alter the object).
The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
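A minimal sketch of that approach (my illustration, not the linked answer; it assumes a fork-based start method, e.g. the Linux default):

import ctypes
import multiprocessing as mp
import numpy as np

# Shared, lock-free double array; wrap it in a numpy view without copying.
shared = mp.Array(ctypes.c_double, 10**6, lock=False)
arr = np.frombuffer(shared, dtype=np.float64)
arr[:] = np.arange(10**6)          # initialize once in the parent

def func(param):
    # Children inherit 'arr' via fork; reading it copies nothing.
    return arr[param]

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        print(pool.map(func, [0, 1, 2, 3]))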
If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).
The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.
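For comparison, a sketch of the Manager route (again my illustration): every access goes through a proxy object, so it works for arbitrary picklable objects but pays serialization and IPC costs on each call:

from multiprocessing import Manager, Pool

def func(shared_dict, key):
    return len(shared_dict[key])   # each lookup is a round trip to the manager

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict({'a': [1, 2, 3], 'b': [4, 5]})
        with Pool(processes=2) as pool:
            print(pool.starmap(func, [(d, 'a'), (d, 'b')]))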
There are a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well rounded library, but if you have special needs perhaps one of the other approaches may be better.
I ran into the same problem and wrote a little shared-memory utility class to work around it.
I'm using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all (lock-free); be careful not to shoot yourself in the foot.
With this solution I get speedups by a factor of approximately 3 on a quad-core i7.
Here's the code. Feel free to use and improve it, and please report back any bugs.
'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np

class SharedNumpyMemManagerError(Exception):
    pass

'''
Singleton Pattern
'''
class SharedNumpyMemManager:

    _initSize = 1024
    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(SharedNumpyMemManager, cls).__new__(
                cls, *args, **kwargs)
        return cls._instance

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):
        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))

        # convert to numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # do a reshape for correct dimensions
        # Returns an array containing the same data, but with a new shape.
        # The result is a view on the original array
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None: # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)

# Init Singleton on module load
SharedNumpyMemManager.getInstance()
if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")

    print("Sequential: ", t1.timeit(number=1))
    print("Parallel: ", t2.timeit(number=1))
This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.
The code would look like the following.
import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)
If you don't call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.
Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.
You can compare the performance of serialization in Ray versus pickle by running the following in IPython.
import numpy as np
import pickle
import ray
ray.init()
x = {i: np.ones(10**7) for i in range(20)}
# Time Ray.
%time x_id = ray.put(x) # 2.4s
%time new_x = ray.get(x_id) # 0.00073s
# Time pickle.
%time serialized = pickle.dumps(x) # 2.6s
%time deserialized = pickle.loads(serialized) # 1.9s
Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).
See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note I'm one of the Ray developers.
Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.
I made brain-plasma specifically for this reason: fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickled bytestrings generated by pickle.dumps(...).
The key difference from Apache Ray and Plasma is that brain-plasma keeps track of object IDs for you. Any processes, threads, or programs running locally can share the variables' values by looking up the name through any Brain object.
$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma
from brain_plasma import Brain
brain = Brain(path='/tmp/plasma/')
brain['a'] = [1]*10000
brain['a']
# >>> [1,1,1,1,...]
