Here's a timed example of multiple image arrays of different sizes being saved in a loop as well as concurrently using threads / processes:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter
import numpy as np
import cv2
def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)
if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    t1 = perf_counter()
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir) for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
And I get these durations on my i5 mbp:
Time for 100: 0.09495482999999982 seconds
Time for 100 (ThreadPoolExecutor): 0.14151873999999998 seconds
Time for 100 (ProcessPoolExecutor): 1.5136184309999998 seconds
Time for 1000: 0.36972280300000016 seconds
Time for 1000 (ThreadPoolExecutor): 0.619205703 seconds
Time for 1000 (ProcessPoolExecutor): 2.016624468 seconds
Time for 10000: 4.232915643999999 seconds
Time for 10000 (ThreadPoolExecutor): 7.251599262 seconds
Time for 10000 (ProcessPoolExecutor): 13.963426469999998 seconds
Aren't threads / processes expected to require less time to achieve the same thing? Why is that not the case here?
The timings in the code are wrong because the timer t is not reset before testing the pools, so each pool measurement also includes the time of the plain loop (and of any pool tested before it). Nevertheless, the relative order of the timings is correct. A possible version with a timer reset is:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter
import numpy as np
import cv2
def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)
if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir) for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
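One way to make a forgotten reset less likely is to give every measurement its own start time through a small helper. The following is only a sketch using the standard library; the names timed and durations are made up for illustration and are not part of the code above.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the duration under `label`."""
    start = perf_counter()  # a fresh start time for every measurement
    try:
        yield
    finally:
        results[label] = perf_counter() - start


if __name__ == "__main__":
    durations = {}
    with timed("first", durations):
        sleep(0.05)
    with timed("second", durations):
        sleep(0.01)
    for label, seconds in durations.items():
        print(f"Time for {label}: {seconds:.3f} seconds")
```

Each with block then measures only its own body, regardless of what ran before it.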
Multithreading is faster especially for I/O-bound tasks. In this case, compressing the images is CPU-intensive, so depending on the implementation of OpenCV and of the Python wrapper, multithreading can be much slower. In many cases the culprit is CPython's GIL, but I am not sure if this is the case here (I do not know if the GIL is released during the imwrite call). In my setup (i7 8th gen), threading is as fast as the loop for 100 images and barely faster for 1000 and 10000 images. If ThreadPoolExecutor reuses threads, there is an overhead involved in assigning a new task to an existing thread. If it does not reuse threads, there is an overhead involved in launching a new thread.
Multiprocessing circumvents the GIL issue, but has some other problems. First, pickling the data to pass between processes takes some time, and in the case of images it can be very expensive. Second, on Windows, spawning a new process takes a lot of time. A simple test to see the overhead (both for processes and threads) is to replace the save_img function with one that does nothing but still needs pickling, etc.:
def save_img(idx, image, dst):
    if idx != idx:
        print("impossible!")
and by a similar one without parameters to see the overhead of spawning the processes, etc.
The timings in my setup show that 2.3 seconds are needed just to spawn the 10000 processes and 0.6 extra seconds for pickling, which is much more than the time needed for processing.
A way to improve the throughput and keep the overhead to a minimum is to break the work into chunks, and submit each chunk to a worker:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter
import numpy as np
import cv2
def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)
def multi_save_img(idx_start, images, dst):
    for idx, image in zip(range(idx_start, idx_start + len(images)), images):
        cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)
if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        chunk_size = len(ll)//workers
        ends = [chunk_size * (_+1) for _ in range(workers)]
        ends[-1] += len(ll) % workers
        starts = [chunk_size * _ for _ in range(workers)]
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(multi_save_img, start, ll[start:end], temp_dir) for (start, end) in zip(starts, ends)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
This should give you a significant boost over a simple for loop, for both the multiprocessing and the multithreading approach.
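The starts/ends computation above puts the whole remainder into the last chunk. As a possible variation (a sketch only; chunk_bounds is a hypothetical helper, not part of the answer's code), the remainder can be spread over the first chunks so that chunk sizes differ by at most one:

```python
def chunk_bounds(n_items, n_chunks):
    """Return (start, end) index pairs that split range(n_items) into
    n_chunks contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for k in range(n_chunks):
        # the first `extra` chunks each take one additional item
        size = base + (1 if k < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    print(chunk_bounds(10, 4))
```

Each pair can then be used the same way as the (start, end) pairs above, e.g. ll[start:end] together with start as the first file index.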
Related
Whenever I try to use shared memory with Python's 'multiprocessing' module to fill a huge array in parallel, I use something like:
import numpy as np
from multiprocessing import Process, RawArray
def tf(x, arr):
    arr = np.frombuffer(arr, dtype=np.float32).reshape((10, 10, 10))
    arr[x] = np.random.random((10, 10))
mpa = RawArray('f', 1000)
ncpu = 4
procs = []
for i in range(10):
    procs.append(Process(target=tf, args=(i, mpa)))
    procs[-1].start()
    if len(procs) == ncpu:
        procs[0].join()
        procs.pop(0)
for p in procs:
    p.join()
# read the buffer back with the same dtype it was written with ('f' is float32)
arr = np.frombuffer(mpa, dtype=np.float32).reshape((10, 10, 10))
to ensure that only as many processes are active as I have CPUs. If I try to use 'Pool' and 'apply_async', the array is not altered for some reason. So I wonder if it is possible to use either 'Pool' or any other intended way to manage the number of active processes.
The above code works but is not the most efficient, since I only check whether the process I added first has finished to decide whether I should add another process.
I'm refactoring stitching_detailed.py of OpenCV's Stitcher API. The user of this program can choose the number of images to be stitched together. I have historical high-resolution scans (TIF) of up to 300 MP each to be stitched, so memory management is important since I might run into a cv::OutOfMemoryError. Additionally, the program should be as efficient as possible with regard to computation time and memory usage.
I need to work with different image sizes in different parts of the workflow: first the medium resolution to make some estimations, then the low resolution for other estimations, and finally the full images. However, I have performance issues with my current approach. I measure the performance with:
import time
from functools import wraps
import tracemalloc
import cv2
import numpy as np
def performance_test(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start = time.time()
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
        print(f"Peak was {peak / 10**6} MB")
        end = time.time()
        print(f"Time was {end - start} s")
        tracemalloc.stop()
        return result
    return wrapper
Create some dummy images:
img_names = ["dummy1.png", "dummy2.png", "dummy3.png"]
image = np.zeros((10000, 10000, 3), np.uint8)
image[:] = (0, 0, 255)
for img_name in img_names:
    cv2.imwrite(img_name, image)
The most memory friendly approach is to load the images where I need them:
def process_medium_resolution_images(img_names):
    for big_image in img_names:
        high = cv2.imread(big_image)
        medium = cv2.resize(high, (5000, 5000))
        print(medium.shape)
def process_low_resolution_images(img_names):
    for big_image in img_names:
        high = cv2.imread(big_image)
        low = cv2.resize(high, (1000, 1000))
        print(low.shape)
def process_high_resolution_images(img_names):
    for big_image in img_names:
        high = cv2.imread(big_image)
        print(high.shape)
@performance_test
def main(img_names):
    process_medium_resolution_images(img_names)
    process_low_resolution_images(img_names)
    process_high_resolution_images(img_names)
main(img_names)
# Peak was 675.149574 MB
# Time was 8.497036933898926 s
I can also read the files in one place and store them in lists. This improves the computation time, but comes with the downside of more memory used, which can lead to problems in my case.
def img_reading_and_processing(img_names):
    high_imgs, medium_imgs, low_imgs = [], [], []
    for big_image in img_names:
        high = cv2.imread(big_image)
        high_imgs.append(high)
        medium = cv2.resize(high, (5000, 5000))
        medium_imgs.append(medium)
        low = cv2.resize(high, (1000, 1000))
        low_imgs.append(low)
    return high_imgs, medium_imgs, low_imgs
def process_imgs(imgs):
    for img in imgs:
        print(img.shape)
@performance_test
def main2(img_names):
    high_imgs, medium_imgs, low_imgs = img_reading_and_processing(img_names)
    process_imgs(medium_imgs)
    process_imgs(low_imgs)
    process_imgs(high_imgs)
main2(img_names)
# Peak was 1134.155195 MB
# Time was 2.8256263732910156 s
I'm looking for a way to combine the advantages of both solutions.
EDIT: Based on @balmiy's comment I tried to write the intermediate results to disk and reload them later:
def img_reading_and_processing(img_names):
    for idx, big_image in enumerate(img_names):
        high = cv2.imread(big_image)
        medium = cv2.resize(high, (5000, 5000))
        low = cv2.resize(high, (1000, 1000))
        cv2.imwrite(f"medium{idx}.png", medium)
        cv2.imwrite(f"low{idx}.png", low)
def process_medium_resolution_images2(img_names):
    for idx, img in enumerate(img_names):
        medium = cv2.imread(f"medium{idx}.png")
        print(medium.shape)
def process_low_resolution_images2(img_names):
    for idx, img in enumerate(img_names):
        low = cv2.imread(f"low{idx}.png")
        print(low.shape)
@performance_test
def main3(img_names):
    process_medium_resolution_images2(img_names)
    process_low_resolution_images2(img_names)
    process_high_resolution_images(img_names)
main3(img_names)
# Peak was 678.149341 MB
# Time was 6.7737205028533936 s
We see that the computation time dropped a bit while the memory usage remained small. However, it is not close to how fast it is to keep the images in lists.
So using the multiprocessing module it is easy to run a function in parallel with different arguments like this:
from multiprocessing import Pool
def f(x):
    return x**2
p = Pool(2)
print(p.map(f, [1, 2]))
But I'm interested in executing a list of functions on the same argument. Suppose I have the following two functions:
def f(x):
    return x**2
def g(x):
    return x**3 + 2
How can I execute them in parallel for the same argument (e.g. x=1)?
You can use Pool.apply_async() for that. You bundle up tasks in the form of (function, argument_tuple) and feed every task to apply_async().
from multiprocessing import Pool
from itertools import repeat
def f(x):
    for _ in range(int(50e6)): # dummy computation
        pass
    return x ** 2
def g(x):
    for _ in range(int(50e6)): # dummy computation
        pass
    return x ** 3
def parallelize(n_workers, functions, arguments):
    # if you need this multiple times, instantiate the pool outside and
    # pass it in as dependency to spare recreation all over again
    with Pool(n_workers) as pool:
        tasks = zip(functions, repeat(arguments))
        futures = [pool.apply_async(*t) for t in tasks]
        results = [fut.get() for fut in futures]
    return results
if __name__ == '__main__':
    N_WORKERS = 2
    functions = f, g
    results = parallelize(N_WORKERS, functions, arguments=(10,))
    print(results)
Example Output:
[100, 1000]
Process finished with exit code 0
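The same "several functions, one argument" pattern can also be written with concurrent.futures. The sketch below uses a thread pool so that it runs without a __main__ guard; for CPU-bound functions like the ones above, ProcessPoolExecutor would be the closer equivalent of Pool. The helper name run_all is hypothetical, and f and g are taken from the question.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    return x ** 2


def g(x):
    return x ** 3 + 2


def run_all(functions, *args, max_workers=None):
    """Call every function with the same arguments and return the results
    in the same order as `functions`."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # one task per function, all receiving identical arguments
        futures = [ex.submit(fn, *args) for fn in functions]
        return [fut.result() for fut in futures]


if __name__ == "__main__":
    print(run_all([f, g], 1))
```

Collecting fut.result() in submission order keeps the output aligned with the list of functions, whichever task finishes first.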
You can return a tuple instead. This can be done easily and compactly with joblib, which I recommend because it is lightweight:
from joblib import Parallel, delayed
import multiprocessing
import timeit
# Implementation 1
def f(x):
    return x**2, x**3 + 2
# Implementation 2 for a more sophisticated second or more functions
def g(x):
    return x**3 + 2
def f(x):
    return x**2, g(x)
if __name__ == "__main__":
    inputs = [i for i in range(32)]
    num_cores = multiprocessing.cpu_count()
    t1 = timeit.Timer()
    result = Parallel(n_jobs=num_cores)(delayed(f)(i) for i in inputs)
    print(t1.timeit(1))
Using multiprocessing.Pool, as you already have in the question:
from multiprocessing import Pool, cpu_count
import timeit
def g(x):
    return x**3 + 2
def f(x):
    return x**2, g(x)
if __name__ == "__main__":
    inputs = [i for i in range(32)]
    num_cores = cpu_count()
    p = Pool(num_cores)
    t1 = timeit.Timer()
    result = p.map(f, inputs)
    print(t1.timeit(1))
    print(result)
Example Output:
print(result)
[(0, 2), (1, 3), (4, 10), (9, 29), (16, 66), (25, 127), (36, 218), (49, 345),
(64, 514), (81, 731), (100, 1002), (121, 1333), (144, 1730), (169, 2199),
(196, 2746), (225, 3377), (256, 4098), (289, 4915), (324, 5834), (361, 6861),
(400, 8002), (441, 9263), (484, 10650), (529, 12169), (576, 13826), (625,
15627), (676, 17578), (729, 19685), (784, 21954), (841, 24391), (900, 27002),
(961, 29793)]
print(t1.timeit(1))
5.000001692678779e-07 #(with 16 cpus and 64 Gb RAM)
for: inputs = range(2000), it took the time:
1.100000190490391e-06
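One caveat about the timings above: timeit.Timer() constructed without arguments times its default statement, pass, so t1.timeit(1) most likely measures an empty statement rather than the Parallel(...) or p.map(...) call. A minimal sketch of timing the call itself follows; time_call and squares are hypothetical names used only for illustration.

```python
from time import perf_counter


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


def squares(n):
    # stand-in workload; replace with the call you actually want to time
    return [i ** 2 for i in range(n)]


if __name__ == "__main__":
    result, seconds = time_call(squares, 32)
    print(f"{len(result)} results in {seconds:.6f} s")
```

With the Pool snippet above, the equivalent would be something like time_call(p.map, f, inputs).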
I am working on a video editor for the Raspberry Pi, and I have a problem with the speed of placing one image over another. Currently, using ImageMagick, it takes up to 10 seconds just to place one image over another, using 1080x1920 PNG images, on a Raspberry Pi, and that's way too much. The time goes up with the number of images as well. Any ideas on how to speed it up?
Imagemagick code:
composite -blend 90 img1.png img2.png new.png
Video editor with yet slow opacity support here
--------EDIT--------
slightly faster way:
import numpy as np
from PIL import Image
size_X, size_Y = 1920, 1080  # set to the images' resolution, else the output may look weird
transparency = 0.9  # blend factor in [0, 1] (assumed value, cf. -blend 90 above)
image1 = np.resize(np.asarray(Image.open('img1.png').convert('RGB')), (size_X, size_Y, 3))
image2 = np.resize(np.asarray(Image.open('img2.png').convert('RGB')), (size_X, size_Y, 3))
output = image1*transparency+image2*(1-transparency)
Image.fromarray(np.uint8(output)).save('output.png')
My Raspberry Pi is unavailable at the moment - all I am saying is that there was some smoke involved and I do software, not hardware! As a result, I have only tested this on a Mac. It uses Numba.
First I used your Numpy code on two test images (not reproduced here).
Then I implemented the same thing using Numba. The Numba version runs 5.5x faster on my iMac. As the Raspberry Pi has 4 cores, you could try experimenting with:
@jit(nopython=True,parallel=True)
def method2(image1,image2,transparency):
    ...
Here is the code:
#!/usr/bin/env python3
import numpy as np
from PIL import Image
import numba
from numba import jit
def method1(image1,image2,transparency):
    result = image1*transparency+image2*(1-transparency)
    return result
@jit(nopython=True)
def method2(image1,image2,transparency):
    h, w, c = image1.shape
    for y in range(h):
        for x in range(w):
            for z in range(c):
                image1[y][x][z] = image1[y][x][z] * transparency + (image2[y][x][z]*(1-transparency))
    return image1
i1 = np.array(Image.open('image1.jpg').convert('RGB'))
i2 = np.array(Image.open('image2.jpg').convert('RGB'))
res = method1(i1,i2,0.4)
res = method2(i1,i2,0.4)
Image.fromarray(np.uint8(res)).save('result.png')
The result is:
Other thoughts... I did the composite in-place, overwriting the input image1 to try and save cache space. That may help or hinder - please experiment. I may not have processed the pixels in the optimal order - please experiment.
Just as another option, I tried in pyvips (full disclosure: I'm the pyvips maintainer, so I'm not very neutral):
#!/usr/bin/python3
import sys
import time
import pyvips
start = time.time()
a = pyvips.Image.new_from_file(sys.argv[1], access="sequential")
b = pyvips.Image.new_from_file(sys.argv[2], access="sequential")
out = a * 0.2 + b * 0.8
out.write_to_file(sys.argv[3])
print("pyvips took {} milliseconds".format(1000 * (time.time() - start)))
pyvips is a "pipeline" image processing library, so that code will execute the load, processing and save all in parallel.
On this two core, four thread i5 laptop using Mark's two test images I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
took 39.156198501586914 milliseconds
So 39ms for two jpg loads, processing and one jpg save.
You can time just the blend part by copying the source images and the result to memory, like this:
a = pyvips.Image.new_from_file(sys.argv[1]).copy_memory()
b = pyvips.Image.new_from_file(sys.argv[2]).copy_memory()
start = time.time()
out = (a * 0.2 + b * 0.8).copy_memory()
print("pyvips between memory buffers took {} milliseconds"
.format(1000 * (time.time() - start)))
I see:
$ ./overlay-vips.py blobs.jpg ships.jpg x.jpg
pyvips between memory buffers took 15.432596206665039 milliseconds
numpy is about 60ms on this same test.
I tried a slight variant of Mark's nice numba example:
#!/usr/bin/python3
import sys
import time
import numpy as np
from PIL import Image
import numba
from numba import jit, prange
@jit(nopython=True, parallel=True)
def method2(image1, image2, transparency):
    h, w, c = image1.shape
    for y in prange(h):
        for x in range(w):
            for z in range(c):
                image1[y][x][z] = image1[y][x][z] * transparency \
                    + (image2[y][x][z] * (1 - transparency))
    return image1
# run once to force a compile
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
res = method2(i1, i2, 0.2)
# run again and time it
i1 = np.array(Image.open(sys.argv[1]).convert('RGB'))
i2 = np.array(Image.open(sys.argv[2]).convert('RGB'))
start = time.time()
res = method2(i1, i2, 0.2)
print("numba took {} milliseconds".format(1000 * (time.time() - start)))
Image.fromarray(np.uint8(res)).save(sys.argv[3])
And I see:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba took 8.110523223876953 milliseconds
So on this laptop, numba is about 2x faster than pyvips.
If you time load and save as well, it's quite a bit slower:
$ ./overlay-numba.py blobs.jpg ships.jpg x.jpg
numba plus load and save took 272.8157043457031 milliseconds
But that seems unfair, since almost all that time is in PIL load and save.
I am using a permutation test (pulling random sub-samples) to test the difference between 2 experiments. Each experiment was carried out 100 times (=100 replicas of each). Each replica consists of 801 measurement points over time. Now I would like to perform a kind of permutation (or boot strapping) in order to test how many replicas per experiment (and how many (time) measurement points) I need to obtain a certain reliability level.
For this purpose I have written a code from which I have extracted the minimal working example (with lots of things hard-coded) (please see below). The input data is generated as random numbers. Here np.random.rand(100, 801) for 100 replicas and 801 time points.
This code works in principle; however, the produced curves are sometimes not falling smoothly, as one would expect when choosing random sub-samples 5000 times. Here is the output of the code below:
It can be seen that at 2 on the x-axis there is an upward peak which should not be there. If I change the random seed from 52389 to 324235 it is gone and the curve is smooth. It seems there is something wrong with the way the random numbers are chosen?
Why is this the case? I have the semantically similar code in Matlab and there the curves are completely smooth at already 1000 permutations (here 5000).
Do I have a coding mistake or is the numpy random number generator not good?
Does anyone see the problem here?
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import current_process, cpu_count, Process, Queue
import matplotlib.pylab as pl
def groupDiffsInParallel (queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts]) # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())
            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())
            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)
    queue.put(allResults)
def evalDifferences_parallel (d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100] # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames-1)/100 # sim time == 100 ns
    timesOfInterestFrames = [x*framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000
    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas, nrOfPermuts]) # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')
    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel, args=(queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter))
        p.start()
        jobs.append(p)
        print('Process {} started work on time \"{} ns\"'.format(timesOfInterestFramesIterIdx, timesOfInterestNs[timesOfInterestFramesIterIdx]), end='\n', flush=True)
    # collect the results
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        oneResult = queue.get()
        allResults[timesOfInterestFramesIterIdx, :, :] = oneResult
        print('Process number {} returned the results.'.format(timesOfInterestFramesIterIdx), end='\n', flush=True)
    # hold main thread and wait for the child process to complete. then join back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")
    allResultsMeanOverPermuts = allResults.mean(axis=2) # size: 17 x 100
    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1 # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0
    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5, label="nReps="+str(lineIdx+1))
        ctr+=1
    ax[axId].set_xticks(range(nrOfTimesOfInterest)) # careful: this is not the same as plt.xticks!!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values boot strapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1]+1]) # increase x max by 2
    plt.show()
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
------------- UPDATE ---------------
Changing the random number generator from numpy to "from random import randint" does not fix the problem:
from:
d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
to:
d1Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
d2Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
--- UPDATE 2 ---
timesOfInterestNs can just be set to:
timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
to speed it up on machines with fewer cores.
--- UPDATE 3 ---
Re-initialising the random seed in each child process (see "Random seed is replication across child processes") also does not fix the problem:
pid = str(current_process())
pid = int(re.split("(\W)", pid)[6])
ms = int(round(time.time() * 1000))
mySeed = np.mod(ms, 4294967295)
mySeed = mySeed + 25000 * pid + 100 * pid + pid
mySeed = np.mod(mySeed, 4294967295)
np.random.seed(seed=mySeed)
--- UPDATE 4 ---
On a windows machine you will need a:
if __name__ == '__main__':
to avoid creating subprocesses recursively (and a crash).
I guess this is the classical multiprocessing mistake. Nothing guarantees that the processes will finish in the same order as the one they started. This means that you cannot be sure that the instruction allResults[timesOfInterestFramesIterIdx, :, :] = oneResult will store the result of process timesOfInterestFramesIterIdx at the location timesOfInterestFramesIterIdx in allResults. To make it clearer, let's say timesOfInterestFramesIterIdx is 2, then you have absolutely no guarantee that oneResult is the output of process 2.
I have implemented a very quick fix below. The idea is to track the order in which the processes have been launched by adding an extra argument to groupDiffsInParallel which is then stored in the queue and thereby serves as a process identifier when the results are gathered.
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import cpu_count, Process, Queue
import matplotlib.pylab as pl
def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                         timesOfInterestFramesIter,
                         timesOfInterestFramesIterIdx):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts]) # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())
            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())
            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)
    queue.put({'allResults': allResults,
               'number': timesOfInterestFramesIterIdx})
def evalDifferences_parallel (d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100] # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames-1)/100 # sim time == 100 ns
    timesOfInterestFrames = [x*framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000
    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas,
                           nrOfPermuts]) # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')
    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel,
                    args=(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                          timesOfInterestFramesIter,
                          timesOfInterestFramesIterIdx))
        p.start()
        jobs.append(p)
        print('Process {} started work on time \"{} ns\"'.format(
            timesOfInterestFramesIterIdx,
            timesOfInterestNs[timesOfInterestFramesIterIdx]),
            end='\n', flush=True)
    # collect the results
    resultdict = {}
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        resultdict.update(queue.get())
        allResults[resultdict['number'], :, :] = resultdict['allResults']
        print('Process number {} returned the results.'.format(
            resultdict['number']), end='\n', flush=True)
    # hold main thread and wait for the child process to complete. then join
    # back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")
    allResultsMeanOverPermuts = allResults.mean(axis=2) # size: 17 x 100
    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40,
                                     50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1 # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0
    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5,
                      label="nReps="+str(lineIdx+1))
        ctr += 1
    ax[axId].set_xticks(range(nrOfTimesOfInterest))
    # careful: this is not the same as plt.xticks!!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values boot "
                        + "strapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1]+1])
    # increase x max by 2
    plt.show()
# #### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
This is the output I get, which obviously shows that the order in which the processes return is shuffled compared to the starting order.
20 cores available
Starting ...
Process 0 started work on time "0.25 ns"
Process 1 started work on time "0.5 ns"
Process 2 started work on time "1 ns"
Process 3 started work on time "2 ns"
Process 4 started work on time "3 ns"
Process 5 started work on time "4 ns"
Process 6 started work on time "5 ns"
Process 7 started work on time "10 ns"
Process 8 started work on time "20 ns"
Process 9 started work on time "30 ns"
Process 10 started work on time "40 ns"
Process 11 started work on time "50 ns"
Process 12 started work on time "60 ns"
Process 13 started work on time "70 ns"
Process 14 started work on time "80 ns"
Process 15 started work on time "90 ns"
Process 16 started work on time "100 ns"
Process number 3 returned the results.
Process number 0 returned the results.
Process number 4 returned the results.
Process number 7 returned the results.
Process number 1 returned the results.
Process number 2 returned the results.
Process number 5 returned the results.
Process number 8 returned the results.
Process number 6 returned the results.
Process number 9 returned the results.
Process number 10 returned the results.
Process number 11 returned the results.
Process number 12 returned the results.
Process number 13 returned the results.
Process number 14 returned the results.
Process number 15 returned the results.
Process number 16 returned the results.
All parallel done.
And the figure which is produced.
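The index-tagging idea can also be shown in a much smaller form. The sketch below is not the code from this answer: it uses threads and as_completed instead of Process and Queue so that it stays short, and the names work and gather_by_index are hypothetical. The principle is the same: each task returns its own index together with its result, and the collector stores the result by that index rather than by arrival order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(task_idx, delay):
    """Pretend to compute something; return the index alongside the result."""
    time.sleep(delay)
    return task_idx, task_idx * 10


def gather_by_index(delays):
    """Run one task per delay and place each result in the slot of its task."""
    results = [None] * len(delays)
    arrival_order = []
    with ThreadPoolExecutor(max_workers=len(delays)) as ex:
        futures = [ex.submit(work, i, d) for i, d in enumerate(delays)]
        for fut in as_completed(futures):
            idx, value = fut.result()
            results[idx] = value       # slot chosen by the task's own index
            arrival_order.append(idx)  # order in which results came back
    return results, arrival_order


if __name__ == "__main__":
    results, arrival_order = gather_by_index([0.2, 0.1, 0.0])
    print("arrival order:", arrival_order)
    print("results by index:", results)
```

The arrival order usually differs from the submission order here, yet the results list stays aligned with the task indices.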
Not sure if you're still hung up on this issue, but I just ran your code on my machine (MacBook Pro (15-inch, 2018)) in Jupyter 4.4.0, and my graphs are smooth with the exact same seed values you originally posted:
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
Perhaps there's nothing wrong with your code and nothing special about the 324235 seed, and you just need to double-check your module versions, since any changes made to the source code in more recent versions could affect your results. For reference, I'm using numpy 1.15.4, matplotlib 3.0.2 and multiprocessing 2.6.2.1.