If statement over dask array - python

Hi everybody, can you tell me why an if statement over a dask array is so slow, and how to fix it?
import dask.array as da
import time
x = da.random.binomial(1, 0.5, 200, 200)
s = time.time()
if da.any(x):
    pass  # empty body: only the cost of evaluating the condition is timed
e = time.time()
print('duration = ', e-s)
output: duration = 0.368

Dask array is lazy by default, so no work happens until you call .compute() on your array.
In your case you are implicitly calling .compute() when you place your dask array into an if statement, which converts things into booleans.
x = da.random.random(...) # this is free
y = x + x.T # this is free
z = y.any() # this is free
if z:  # everything above happens now, when z is converted to a boolean
    ...
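To make the mechanism concrete, here is a toy sketch in plain Python. The LazyValue class is hypothetical and is not how Dask is implemented; it only shows that an if statement calls bool() on the object, and that this is the point where deferred work runs.

```python
class LazyValue:
    """Toy stand-in for a lazy array expression (not Dask's implementation)."""

    def __init__(self, func):
        self._func = func
        self.compute_calls = 0

    def compute(self):
        # Run the deferred work and count how often that happens.
        self.compute_calls += 1
        return self._func()

    def __bool__(self):
        # An `if` statement calls bool(obj); this is where the work is triggered.
        return bool(self.compute())


z = LazyValue(lambda: any(v > 0 for v in [0, 0, 1]))  # building this is free
calls_before_if = z.compute_calls

if z:  # __bool__ runs here and forces the computation
    branch_taken = True
else:
    branch_taken = False

calls_after_if = z.compute_calls
```

With a real dask array the same thing happens implicitly, so timing the if statement really times the whole computation.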

I took a look at the dask source code. Essentially, when you call a function like any on a dask array, it performs a "reduction" of the array. Intuitively this is necessary because, behind the scenes, dask arrays are stored as separate "blocks" that can live individually in memory, on disk, etc., but you need to somehow pull pieces of them together for function calls.
So the time you are noticing is the initial overhead of performing the reduction. Note that if you increase the size of the array to 2M, it takes about the same time as for 200. At 20M it only takes about 1s.
import dask.array as da
import time
# 200 case
x = da.random.binomial(1, 0.5, 200, 200)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    pass
e = time.time()
print('duration = ', e-s)
# duration = 0.362557172775
# 2M case
x = da.random.binomial(1, 0.5, 2000000, 2000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    pass
e = time.time()
print('duration = ', e-s)
# duration = 0.132781982422
# 20M case
x = da.random.binomial(1, 0.5, 20000000, 20000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    pass
e = time.time()
print('duration = ', e-s)
# duration = 1.08430886269
# 200M case
x = da.random.binomial(1, 0.5, 200000000, 200000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    pass
e = time.time()
print('duration = ', e-s)
# duration = 8.83682179451


Parallelizing with itertools and numba

I've been working on a project for a while now that requires calculating some very large datasets, and very quickly have moved beyond anything that my meager Excel knowledge could handle. In the last few days I've started learning Python, which has helped with handling the size of data I'm dealing with, but the estimated processing time for these datasets is looking to be incredibly long (possibly a couple hundred years on my laptop).
The bottleneck here is an equation that could produce trillions or quadrillions of results, since it calculates every combination of 6 different lists and runs each one through an equation that you'll see in the code. The code works just fine as is, but it isn't feasible for datasets larger than the example I included. A real dataset would be more like Set1S, Set2S, and Set3S being 50 items each, and Set12A, Set23A, and Set13A being about 2500 items each (50x50 in this case; these sets always have a length equal to the square of the length of the first 3 lists, but I'm keeping things short and simple here).
I'm well aware that the number of results is absolutely huge, but I want to start with as large a dataset as I can, so I can see how much I can reduce the input sizes without greatly impacting the results when I plot a cumulative % histogram.
import numpy as np
Set1S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set2S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set3S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set12A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set23A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set13A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
# Define an empty list to collect results
BlockVol = []
from itertools import product
# itertools iterates through all combinations of the lists
for i,j,k,a,b,c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    # This is the bottleneck equation, with large input datasets
    # NOTE: expression assumed from the abs/sin products used in the answer's vectorized version
    BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))
arr = np.array(BlockVol)
# manipulate the result list a couple ways
BlockVol = np.cbrt(BlockVol)
BlockVol = BlockVol*12
# quick check of the size of the results list
print(len(BlockVol))
This took me about 3 minutes or so for 11.3M results, just from eyeballing the clock.
I've learned about @njit and prange in the last day or so, but am a bit stuck in trying to translate my work into this format. I do have a desktop PC with a pretty good GPU, so I think I could speed things up by a lot. I'm well aware that the code below is a big garbage fire that doesn't do anything, but I'm hoping that I'm at least getting the point across on what I'm trying to do.
It seems that the way to go is to define a function with my 6 input lists, but I'm just not sure how to fuse the itertools product and the njit together.
import numpy as np
from itertools import product
from numba import njit, prange
@njit(parallel = True)
def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
numRows =Len(Set12A)
BlockVol = np.zeros(numRows)
for i,j,k,a,b,c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
arr = np.array(BlockVol)
BlockVol = np.cbrt(BlockVol)
BlockVol = BlockVol*12
Any help is much appreciated, as this is all very new and overwhelming.
Thank you!
I solved your task using only NumPy; when it is possible, plain NumPy is nicer than bringing in heavy Numba. The NumPy-only code below should be about as fast as an equivalent solution using Numba.
My code is about 2800 times faster than your reference code; the timings are measured at the end of the code.
In the code below, BlockValCalcRef(...) is just your reference code organized as a function, and BlockVolCalc(...) is my NumPy-based function that gives most of the speedup. At the end I assert np.allclose(...) to check that both solutions give the same results.
I also simplified the creation of the sets a bit, using a single N parameter to generate them; in your real code you would just provide the actual sets.
To solve the task I did several things:
Instead of computing np.sin(...) many times for the same values, I precomputed it just once for Set12A, Set23A, Set13A. I also precomputed np.abs(...) for all sets.
To compute the cross product I index the NumPy arrays like [None, None, :, None, None, None], which lets NumPy broadcasting build all combinations.
I also have an idea for improving the code further, to make it around 6 times faster, although even at the current speed you'll fill the whole RAM of your machine in a matter of seconds. Currently the cross product multiplies 6 numbers for each result. Instead, one can compute the product of K - 1 sets and then multiply that array by the K-th set to get the product of K sets. This should give about 6 times more speedup (because there are 6 sets), since each result then needs just one multiplication instead of 6.
Update: I've implemented a second, improved version of the function, BlockVolCalc2(...), following the paragraph above. It reaches about a 2800x speedup, and for larger N the gain will probably be even larger.
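The incremental idea can be checked on a tiny case. This is only a sketch with made-up toy inputs (the three small arrays below are hypothetical, not the sets from the question); it compares the set-by-set product against a plain itertools.product loop.

```python
import numpy as np
from itertools import product

# Hypothetical toy inputs, not the sets from the question.
sets = [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0]), np.array([6.0, 7.0])]

# Reference: one product per combination, in itertools.product order.
reference = np.array([x * y * z for x, y, z in product(*sets)])

# Incremental version: start from a single 1 and fold in one set at a time.
# Going through the sets in reverse keeps the first set as the slowest-varying
# index, which matches the ordering of itertools.product.
prod = np.ones(1)
for s in reversed(sets):
    prod = (s[:, None] * prod[None, :]).ravel()
```

Each step multiplies the array built so far by one more set, so every new element costs a single multiplication.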
import numpy as np, time
N = 7
Set1S = np.arange(1, N + 1)
Set2S = np.arange(1, N + 1)
Set3S = np.arange(1, N + 1)
Set12A = np.arange(1, N + 1)
Set23A = np.arange(1, N + 1)
Set13A = np.arange(1, N + 1)
def BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    BlockVol = []
    from itertools import product
    for i,j,k,a,b,c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
        # Loop body assumed: the same expression that BlockVolCalc vectorizes below
        BlockVol.append(abs(i * j * k * np.sin(a) * np.sin(b) * np.sin(c)))
    return np.array(BlockVol)
def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    # Precompute abs and sin once per set instead of once per combination
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    # Each set gets its own axis; broadcasting builds all combinations
    return (
        Set1S[:, None, None, None, None, None] *
        Set2S[None, :, None, None, None, None] *
        Set3S[None, None, :, None, None, None] *
        Set12A[None, None, None, :, None, None] *
        Set23A[None, None, None, None, :, None] *
        Set13A[None, None, None, None, None, :]
    ).ravel()
def BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    prod = np.ones((1,), dtype = np.float32)
    # Fold in one set at a time, last set first, to keep itertools.product order
    for s in reversed([Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]):
        prod = (s[:, None] * prod[None, :]).ravel()
    return prod
# -------- Testing Correctness and Time Measuring --------
tb = time.time()
a0 = BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t0 = time.time() - tb
print(f'base time {round(t0, 4)} sec')
tb = time.time()
a1 = BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t1 = time.time() - tb
print(f'improved time {round(t1, 4)} sec, speedup {round(t0 / t1, 2)}x')
tb = time.time()
a2 = BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t2 = time.time() - tb
print(f'improved2 time {round(t2, 4)} sec, speedup {round(t0 / t2, 2)}x')
assert np.allclose(a0, a1)
assert np.allclose(a0, a2)
base time 2.7569 sec
improved time 0.0015 sec, speedup 1834.83x
improved2 time 0.001 sec, speedup 2755.09x
Embedded into your initial code, my function takes the place of the itertools loop.
I also created a TensorFlow-based variant of the code, which will use all of your CPU cores and the GPU. It needs tensorflow installed once via python -m pip install --upgrade numpy tensorflow:
import numpy as np
N = 18
Set1S, Set2S, Set3S, Set12A, Set23A, Set13A = [np.arange(1 + i, N + 1 + i) for i in range(6)]
dtype = np.float32
def Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    import numpy as np
    Set12A, Set23A, Set13A = np.sin(Set12A), np.sin(Set23A), np.sin(Set13A)
    return [np.abs(s).astype(dtype) for s in [Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]]
sets = Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
def ProcessNP(sets):
    import numpy as np
    res = np.ones((1,), dtype = dtype)
    for s in reversed(sets):
        res = (s[:, None] * res[None, :]).ravel()
    res = np.cbrt(res) * 12
    return res
def ProcessTF(sets, *, state = {}):
    # The graph and session are built once and cached in the mutable default `state`
    if 'graph' not in state:
        import os
        os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
        import numpy as np, tensorflow as tf
        cpus = tf.config.list_logical_devices('CPU')
        #print(f"CPUs: {[e.name for e in cpus]}")
        gpus = tf.config.list_logical_devices('GPU')
        #print(f"GPUs: {[e.name for e in gpus]}")
        print(f"GPU: {len(gpus) > 0}")
        state['graph'] = tf.Graph()
        state['sess'] = tf.compat.v1.Session(graph = state['graph'])
        #tf.device(cpus[0].name if len(gpus) == 0 else gpus[0].name)
        with state['sess'].as_default(), state['graph'].as_default():
            res = tf.ones((1,), dtype = dtype)
            state['inp'] = []
            for s in reversed(sets):
                sph = tf.compat.v1.placeholder(dtype, s.shape)
                state['inp'].insert(0, sph)
                res = sph[:, None] * res[None, :]
                res = tf.reshape(res, (tf.size(res),))
            res = tf.math.pow(res, 1 / 3) * 12
            state['out'] = res
        def Run(sets):
            with state['sess'].as_default(), state['graph'].as_default():
                return tf.compat.v1.get_default_session().run(
                    state['out'], {ph: s for ph, s in zip(state['inp'], sets)}
                )
        state['run'] = Run
    return state['run'](sets)
# ------------ Testing ------------
npa, tfa = ProcessNP(sets), ProcessTF(sets)
assert np.allclose(npa, tfa)
from timeit import timeit
print('Nums:', round(npa.size / 10 ** 6, 3), 'M')
timeit_num = 2
print('NP:', round(timeit(lambda: ProcessNP(sets), number = timeit_num) / timeit_num, 3), 'sec')
print('TF:', round(timeit(lambda: ProcessTF(sets), number = timeit_num) / timeit_num, 3), 'sec')
On my 2-core CPU it prints:
GPU: False
Nums: 34.012 M
NP: 3.487 sec
TF: 1.185 sec

Monte Carlo with Metropolis algorithm extremely slow in Python

I'm trying to implement a simple Monte Carlo simulation in Python (to which I'm fairly new). Coming from C, I'm probably going about it the wrong way, since my code is far too slow for what I'm asking. I have a hard-sphere-like potential (see V_pot(r) in the code) for 60 particles in 3D with periodic boundary conditions (PBC), so I defined the following functions
import timeit
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from numpy import inf
L, kb, d, eps, DIM = 100, 1, 1, 1, 3
r_c, T = L/2, eps/(.5*kb)
beta = 1/(kb*T)
def dist(A, B):
    # Minimum-image distance under periodic boundary conditions
    d = A - B
    d -= L*np.around(d/L)
    return np.sqrt(np.sum(d**2))
def V_pot(r):
    V = -eps*(d**6/r**6 - d**6/r_c**6)
    if r > r_c:
        V = 0
    elif r < d:
        V = inf
    return V
def ener(config):
    # Sum the pair potential over all pairs i < j
    V_jk_val, j = 0, N
    while (j > 0):
        j -= 1
        i = 0
        while (i < j):
            V_jk_val += V_pot(dist(config[j,:], config[i,:]))
            i += 1
    return V_jk_val
def acc(en_n, en_o):
    # Metropolis acceptance probability
    d_en = en_n-en_o
    if (d_en <= 0):
        acc_val = 1
    else:
        acc_val = np.exp(-beta*(d_en))
    return acc_val
then, starting from the following configuration (where every row of the array holds the coordinates of one 3D particle)
config = np.array([[16.24155657, 57.41672173, 94.39565792],
[76.38121764, 55.88334066, 5.72255163],
[38.41393783, 58.09432145, 6.26448054],
[86.44286438, 61.37100899, 91.97737383],
[37.7315366 , 44.52697269, 23.86320444],
[ 0.59231801, 39.20183376, 89.63974115],
[38.00998141, 3.84363202, 52.74021401],
[99.53480756, 69.97688928, 21.43528924],
[49.62030291, 93.60889503, 15.73723259],
[54.49195524, 0.6431965 , 25.37401196],
[33.82527814, 25.37776021, 67.4320553 ],
[64.61952893, 46.8407798 , 4.93960443],
[60.47322732, 16.48140136, 33.26481306],
[19.71667792, 46.56999616, 35.61044526],
[ 5.33252557, 4.44393836, 60.55759256],
[44.95897856, 7.81728046, 10.26000715],
[86.5548395 , 49.74079452, 4.80480133],
[52.47965686, 42.831448 , 22.03890639],
[ 2.88752006, 59.84605062, 22.75760029],
[ 9.49231045, 42.08653603, 40.63380097],
[13.90093641, 74.40377984, 32.62917915],
[97.44839233, 90.47695772, 91.60794836],
[51.29501624, 27.03796277, 57.09525454],
[10.30180295, 21.977336 , 69.54173272],
[59.61327648, 14.29582325, 11.70942289],
[89.52722796, 26.87758644, 76.34934637],
[82.03736088, 78.5665713 , 23.23587395],
[79.77571695, 66.140968 , 53.6784269 ],
[82.86070472, 40.82189833, 51.48739072],
[99.05647523, 98.63386809, 6.33888993],
[31.02997123, 66.99709163, 95.88332332],
[97.71654767, 59.24793618, 5.20183793],
[ 6.79964473, 45.01258652, 48.69477807],
[93.34977049, 55.20537774, 82.35693526],
[17.35577815, 20.45936211, 29.27981422],
[55.51942207, 52.22875901, 3.6616131 ],
[61.45612224, 36.50170405, 62.89796773],
[23.55822368, 7.09069623, 37.38274914],
[39.57082799, 58.95457592, 48.0304924 ],
[93.94997617, 64.34383203, 77.63346308],
[17.47989107, 90.01113402, 81.00648645],
[86.79068539, 66.35768515, 56.64402907],
[98.71924121, 38.33749023, 73.4715132 ],
[ 0.42356139, 78.32172925, 15.19883322],
[77.75572529, 2.60088767, 56.4683935 ],
[49.76486142, 3.01800153, 93.48019286],
[42.54483899, 4.27174457, 4.38942325],
[66.75777178, 41.1220603 , 19.64484167],
[19.69520773, 41.09230171, 2.51986091],
[73.20493772, 73.16590392, 99.19174281],
[94.16756184, 72.77653334, 10.32128552],
[29.95281655, 27.58596604, 85.12791195],
[ 2.44803886, 32.82333962, 41.6654683 ],
[23.9665915 , 49.94906612, 37.42701059],
[30.40282934, 39.63854309, 47.16572743],
[56.04809276, 30.19705527, 29.15729635],
[ 2.50566522, 70.37965564, 16.78016719],
[28.39713572, 4.04948368, 27.72615789],
[26.11873563, 41.49557167, 14.38703697],
[81.91731981, 12.10514972, 12.03083427]])
I run the 5000 time steps of the simulation with the following code
N = 60
TIME_MC = 5000
#d/6, d/3, d, 2*d, 3*d
DELTA_LIST = [d/6, d/3, d, 2*d, 3*d]  # assumed from the comment above; the actual definition is not shown
en_mc_delta = np.zeros((TIME_MC, len(DELTA_LIST)))
start = timeit.default_timer()
config_tmp = config
for iD, Delta in enumerate(DELTA_LIST):
    t = 0
    while (t < TIME_MC):
        for k in range(N):
            RND = np.random.rand()
            config_tmp[k,:] = config[k,:] + Delta*(np.random.random_sample((1,3))-.5)
            en_o, en_n = ener(config), ener(config_tmp)
            ACC = acc(en_n, en_o)
            if (RND < ACC):
                config[k,:] = config_tmp[k,:]
                en_o = en_n
        en_mc_delta[t][iD] = en_o
        t += 1
stop = timeit.default_timer()
print('Time: ', stop-start)
following the Metropolis rule for accepting the proposed move, which is generated by config_tmp[k,:] = config[k,:] + Delta*(np.random.random_sample((1,3))-.5).
I made some attempts to find where the code gets stuck, and I found that the function ener (partly because of the function dist) is extremely slow: it takes about 0.02s to calculate the energy of a configuration, which means around 6000s to run the complete simulation (60 particles, 5000 proposed moves).
The outer for loop just repeats the calculation for different values of Delta.
Running this code with TIME_MC=60 gives you an idea of how slow it is (~218s); the same thing takes just a few seconds when implemented in C. I read some other questions about how to speed up Python code, but I can't see how to apply them here.
I'm now almost sure that the problem is in the function dist: just calculating the PBC distance between two 3D vectors takes around 0.0012s, which adds up to crazy long times when you calculate it 5000*60 times.
Note that this is a partial answer continued from comments on the original question.
Here's an example of how replacing numpy's functions with a more direct, "unrolled" calculation of the distance can improve performance. Note that I did not verify that the two versions are equivalent, especially concerning the rounding. The principle still applies, I think.
import random
import time
import numpy as np
L = 100
inv_L = 0.01
vec_length = 10
repetitions = 100000
def dist_np(A, B):
    d = A - B
    d -= L*np.around(d/L)
    return np.sqrt(np.sum(d**2))
def dist_direct(A, B):
    sum = 0
    for i in range(0, len(A)):
        diff = (A[0,i] - B[0,i])
        diff -= L * int(diff * inv_L)
        sum += diff * diff
    return np.sqrt(sum)
vec1 = np.zeros((1,vec_length))
vec2 = np.zeros((1,vec_length))
for i in range(0, vec_length):
    vec1[0,i] = random.random()
    vec2[0,i] = random.random()
print("with numpy method:")
start = time.time()
for i in range(0, repetitions):
    dist_np(vec1, vec2)
print("done in {}".format(time.time() - start))
print("with direct method:")
start = time.time()
for i in range(0, repetitions):
    dist_direct(vec1, vec2)
print("done in {}".format(time.time() - start))
with numpy method:
done in 6.332799911499023
with direct method:
done in 1.0938000679016113
Play around with the average vector length and the repetitions to see where the sweet spot is. I expect the performance gain is not constant when varying these meta-parameters.
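The rounding caveat mentioned in this answer can be checked with a few lines of plain Python. This is only a sketch; the helper names wrap_round and wrap_trunc are made up for illustration, and Python's built-in round() stands in for np.around.

```python
L = 100.0        # box length, as in the snippets above
inv_L = 1.0 / L

def wrap_round(diff):
    # Minimum-image wrap using round-to-nearest, like np.around in dist_np.
    return diff - L * round(diff * inv_L)

def wrap_trunc(diff):
    # Wrap using truncation toward zero, like int(...) in dist_direct.
    return diff - L * int(diff * inv_L)

small = 0.3      # separation below L/2: both variants agree
large = 60.0     # separation above L/2: only rounding maps to the nearer image

small_round, small_trunc = wrap_round(small), wrap_trunc(small)
large_round, large_trunc = wrap_round(large), wrap_trunc(large)
```

For a separation of 60 the rounding version gives -40 (the nearer periodic image), while the truncating version leaves it at 60. The benchmark vectors above come from random.random(), so all their differences are below 1 and the two variants happen to agree there, but they would not agree for coordinates spread over the whole box.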

Why is the curve of my permutation test analysis not smooth?

I am using a permutation test (pulling random sub-samples) to test the difference between 2 experiments. Each experiment was carried out 100 times (=100 replicas of each). Each replica consists of 801 measurement points over time. Now I would like to perform a kind of permutation (or boot strapping) in order to test how many replicas per experiment (and how many (time) measurement points) I need to obtain a certain reliability level.
For this purpose I have written a code from which I have extracted the minimal working example (with lots of things hard-coded) (please see below). The input data is generated as random numbers. Here np.random.rand(100, 801) for 100 replicas and 801 time points.
This code works in principle; however, the curves it produces sometimes do not fall smoothly, as one would expect when drawing random sub-samples 5000 times. Here is the output of the code below:
It can be seen that at 2 on the x-axis there is an upward peak which should not be there. If I change the random seed from 52389 to 324235, it is gone and the curve is smooth. It seems there is something wrong with the way the random numbers are chosen?
Why is this the case? I have semantically similar code in Matlab, and there the curves are completely smooth already at 1000 permutations (here 5000).
Do I have a coding mistake or is the numpy random number generator not good?
Does anyone see the problem here?
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import current_process, cpu_count, Process, Queue
import matplotlib.pylab as pl
def groupDiffsInParallel (queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts]) # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())
            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())
            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)
    # hand the result back to the parent process
    queue.put(allResults)
def evalDifferences_parallel (d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100] # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames-1)/100 # sim time == 100 ns
    timesOfInterestFrames = [x*framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000
    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas, nrOfPermuts]) # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')
    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel, args=(queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter))
        p.start()
        jobs.append(p)
        print('Process {} started work on time \"{} ns\"'.format(timesOfInterestFramesIterIdx, timesOfInterestNs[timesOfInterestFramesIterIdx]), end='\n', flush=True)
    # collect the results
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        oneResult = queue.get()
        allResults[timesOfInterestFramesIterIdx, :, :] = oneResult
        print('Process number {} returned the results.'.format(timesOfInterestFramesIterIdx), end='\n', flush=True)
    # hold main thread and wait for the child process to complete. then join back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")
    allResultsMeanOverPermuts = allResults.mean(axis=2) # size: 17 x 100
    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1 # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0
    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5, label="nReps="+str(lineIdx+1))
        ctr += 1
    ax[axId].set_xticks(range(nrOfTimesOfInterest)) # careful: this is not the same as plt.xticks!!
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values boot strapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1]+1]) # increase x max by 2
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
------------- UPDATE ---------------
Changing the random number generator from numpy to "from random import randint" does not fix the problem:
# before:
d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
# after:
d1Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
d2Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
--- UPDATE 2 ---
timesOfInterestNs can just be set to:
timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
to speed it up on machines with fewer cores.
--- UPDATE 3 ---
Re-initialising the random seed in each child process (see "Random seed is replication across child processes") also does not fix the problem:
pid = str(current_process())
pid = int(re.split("(\W)", pid)[6])
ms = int(round(time.time() * 1000))
mySeed = np.mod(ms, 4294967295)
mySeed = mySeed + 25000 * pid + 100 * pid + pid
mySeed = np.mod(mySeed, 4294967295)
--- UPDATE 4 ---
On a windows machine you will need a:
if __name__ == '__main__':
to avoid creating subprocesses recursively (and a crash).
I guess this is the classical multiprocessing mistake. Nothing guarantees that the processes will finish in the same order as the one they started. This means that you cannot be sure that the instruction allResults[timesOfInterestFramesIterIdx, :, :] = oneResult will store the result of process timesOfInterestFramesIterIdx at the location timesOfInterestFramesIterIdx in allResults. To make it clearer, let's say timesOfInterestFramesIterIdx is 2, then you have absolutely no guarantee that oneResult is the output of process 2.
I have implemented a very quick fix below. The idea is to track the order in which the processes have been launched by adding an extra argument to groupDiffsInParallel which is then stored in the queue and thereby serves as a process identifier when the results are gathered.
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import cpu_count, Process, Queue
import matplotlib.pylab as pl
def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                         timesOfInterestFramesIter,
                         timesOfInterestFramesIterIdx):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts])  # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())
            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())
            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)
    # tag the result with the index of the time cut it belongs to
    queue.put({'allResults': allResults,
               'number': timesOfInterestFramesIterIdx})
def evalDifferences_parallel (d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100]  # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames-1)/100  # sim time == 100 ns
    timesOfInterestFrames = [x*framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000
    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas,
                           nrOfPermuts])  # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')
    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel,
                    args=(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                          timesOfInterestFramesIter,
                          timesOfInterestFramesIterIdx))
        p.start()
        jobs.append(p)
        print('Process {} started work on time \"{} ns\"'.format(
            timesOfInterestFramesIterIdx,
            timesOfInterestNs[timesOfInterestFramesIterIdx]),
            end='\n', flush=True)
    # collect the results
    resultdict = {}
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        resultdict = queue.get()
        # store by the tag, not by the order of arrival
        allResults[resultdict['number'], :, :] = resultdict['allResults']
        print('Process number {} returned the results.'.format(
            resultdict['number']), end='\n', flush=True)
    # hold main thread and wait for the child process to complete. then join
    # back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")
    allResultsMeanOverPermuts = allResults.mean(axis=2)  # size: 17 x 100
    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40,
                                     50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1  # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0
    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5,
                      label="nReps="+str(lineIdx+1))
        ctr += 1
    ax[axId].set_xticks(range(nrOfTimesOfInterest))
    # careful: this is not the same as plt.xticks!!
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values boot "
                        + "strapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1]+1])
    # increase x max by 2
# #### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
This is the output I get, which obviously shows that the order in which the processes return is shuffled compared to the starting order.
20 cores available
Starting ...
Process 0 started work on time "0.25 ns"
Process 1 started work on time "0.5 ns"
Process 2 started work on time "1 ns"
Process 3 started work on time "2 ns"
Process 4 started work on time "3 ns"
Process 5 started work on time "4 ns"
Process 6 started work on time "5 ns"
Process 7 started work on time "10 ns"
Process 8 started work on time "20 ns"
Process 9 started work on time "30 ns"
Process 10 started work on time "40 ns"
Process 11 started work on time "50 ns"
Process 12 started work on time "60 ns"
Process 13 started work on time "70 ns"
Process 14 started work on time "80 ns"
Process 15 started work on time "90 ns"
Process 16 started work on time "100 ns"
Process number 3 returned the results.
Process number 0 returned the results.
Process number 4 returned the results.
Process number 7 returned the results.
Process number 1 returned the results.
Process number 2 returned the results.
Process number 5 returned the results.
Process number 8 returned the results.
Process number 6 returned the results.
Process number 9 returned the results.
Process number 10 returned the results.
Process number 11 returned the results.
Process number 12 returned the results.
Process number 13 returned the results.
Process number 14 returned the results.
Process number 15 returned the results.
Process number 16 returned the results.
All parallel done.
And the figure which is produced.
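The same idea in miniature, as a rough sketch: each worker puts its job index next to its result, and the collector stores by that index instead of by arrival order. To keep it small and runnable anywhere, this uses threads and queue.Queue as a stand-in for Process and multiprocessing.Queue; the worker function and the squared values are made up for illustration.

```python
import queue
import random
import threading
import time

def worker(out_queue, job_index, value):
    # Sleep a random, short time so that jobs finish in an unpredictable order.
    time.sleep(random.uniform(0.0, 0.05))
    # Put the job index next to the result so the collector can place it.
    out_queue.put({'number': job_index, 'result': value * value})

values = [3, 1, 4, 1, 5, 9]
out_queue = queue.Queue()
threads = [threading.Thread(target=worker, args=(out_queue, i, v))
           for i, v in enumerate(values)]
for t in threads:
    t.start()

# Collect by tag, not by arrival order.
results = [None] * len(values)
for _ in values:
    item = out_queue.get()
    results[item['number']] = item['result']

for t in threads:
    t.join()
```

Whatever order the workers finish in, results ends up in the order of values, which is exactly what the 'number' key achieves in the fix above.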
Not sure if you're still hung up on this issue, but I just ran your code on my machine (MacBook Pro (15-inch, 2018)) in Jupyter 4.4.0, and my graphs are smooth with the exact same seed values you originally posted:
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
Perhaps there's nothing wrong with your code and nothing special about the 324235 seed, and you just need to double-check your module versions, since any changes made to the source code in more recent versions could affect your results. For reference I'm using numpy 1.15.4, matplotlib 3.0.2 and multiprocessing

How can I speed up this double for loop?

This code multiplies each column by its column index.
I want to reduce the time it takes to run this code.
The running time should be about half of what it is now
(currently it is about 0.04 sec, but it should be about 0.02 sec!).
However, it must still be a "double for loop".
How can I speed this up while keeping that structure?
import numpy as np
import time
x = np.ones((10000,3), dtype = np.int32)
# time.clock() was removed in Python 3.8; perf_counter() is the replacement
start_time = time.perf_counter()
y1 = np.empty((10000,3), dtype = np.int32)
for a in range(10000):
    for b in range(3):
        y1[a,b] = x[a,b]*(b)
print(time.perf_counter() - start_time)
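For reference, here is what the loop computes, written with NumPy broadcasting. This is not a double for loop, so it does not satisfy the constraint in the question; it is only a sketch for checking that any faster loop variant still produces the same result. The names y_loop and y_ref are made up for this example.

```python
import numpy as np

x = np.ones((10000, 3), dtype=np.int32)

# The double loop from the question, kept as the structure to preserve.
y_loop = np.empty_like(x)
for a in range(x.shape[0]):
    for b in range(x.shape[1]):
        y_loop[a, b] = x[a, b] * b

# Broadcasting reference: multiply column b by b in one step.
y_ref = x * np.arange(x.shape[1], dtype=x.dtype)
```

Comparing against y_ref with np.array_equal is a cheap way to make sure a restructured loop did not change the output.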

Faster performance for normalizing a numpy array?

I am currently normalizing a numpy array in Python that is created by cutting an image into windows with a stride, which produces about 20K patches. The current normalization implementation is a big pain point in my runtime, and I'm trying to replace it with the same functionality done in, say, a C extension. What advice does the community have for getting this done simply?
Current runtime is about 0.34s just for the normalization part, I'm trying to get below 0.1s or better. You can see creating patches is extremely efficient with view_as_windows, and I am looking for something similar for normalization. Note you can simply comment/uncomment the lines labeled " # ---- Normalization" to see the runtimes yourself for the different implementations.
Here is the current implementation:
import gc
import os
import cv2, time
import numpy
from libraries import GCN
from skimage.util.shape import view_as_windows
def create_imageArray(patch_list):
    returnImageArray = numpy.zeros(shape=(len(patch_list), 1, 40, 60))
    idx = 0
    for patch, name, coords in patch_list:
        imgArray = numpy.asarray(patch[:,:], dtype=numpy.float32)
        imgArray = imgArray[numpy.newaxis, ...]
        returnImageArray[idx] = imgArray
        idx += 1
    return returnImageArray
# print "normImgArray[0]:",normImgArray[0]
def NormalizeData(imageArray):
    tempImageArray = imageArray
    # Normalize the data in batches
    batchSize = 25000
    dataSize = tempImageArray.shape[0]
    imageChannels = tempImageArray.shape[1]
    imageHeight = tempImageArray.shape[2]
    imageWidth = tempImageArray.shape[3]
    for i in xrange(0, dataSize, batchSize):
        stop = i + batchSize
        print("Normalizing data [{0} to {1}]...".format(i, stop))
        dataTemp = tempImageArray[i:stop]
        dataTemp = dataTemp.reshape(dataTemp.shape[0], imageChannels * imageHeight * imageWidth)
        #print("Performing GCN [{0} to {1}]...".format(i, stop))
        dataTemp = GCN(dataTemp)
        #print("Reshaping data again [{0} to {1}]...".format(i, stop))
        dataTemp = dataTemp.reshape(dataTemp.shape[0], imageChannels, imageHeight, imageWidth)
        #print("Updating data with new values [{0} to {1}]...".format(i, stop))
        tempImageArray[i:stop] = dataTemp
        del dataTemp
    return tempImageArray
start_time = time.time()
img1_path = "777628-1032-0048.jpg"
img_list = ["images/1.jpg", "images/2.jpg", "images/3.jpg", "images/4.jpg", "images/5.jpg"]
patchWidth = 60
patchHeight = 40
channels = 1
stride = patchWidth/6
multiplier = 1.31
finalImgArray = []
vaw_time = 0
norm_time = 0
array_time = 0
for im_path in img_list:
    start = time.time()
    baseFileWithExt = os.path.basename(im_path)
    baseFile = os.path.splitext(baseFileWithExt)[0]
    img = cv2.imread(im_path, cv2.IMREAD_GRAYSCALE)
    nxtWidth = 800
    nxtHeight = 1200
    patchesList = []
    for i in xrange(7):
        img = cv2.resize(img, (nxtWidth, nxtHeight))
        nxtWidth = int(nxtWidth//multiplier)
        nxtHeight = int(nxtHeight//multiplier)
        patches = view_as_windows(img, (patchHeight, patchWidth), stride)
        cols = patches.shape[0]
        rows = patches.shape[1]
        patchCount = cols*rows
        print "patchCount:",patchCount, " patches.shape:",patches.shape
        returnImageArray = numpy.zeros(shape=(patchCount, channels, patchHeight, patchWidth))
        idx = 0
        for col in xrange(cols):
            for row in xrange(rows):
                patch = patches[col][row]
                imageName = "{0}-patch{1}-{2}.jpg".format(baseFile, i, idx)
                patchCoodrinates = (0, 1, 2, 3) # don't need these for example
                patchesList.append((patch, imageName, patchCoodrinates))
                # ---- Normalization inside 7 iterations <> Part 1
                # imgArray = numpy.asarray(patch[:,:], dtype=numpy.float32)
                # imgArray = patch.astype(numpy.float32)
                # imgArray = imgArray[numpy.newaxis, ...] # Add a new axis for channel so goes from shape [40,60] to [1,40,60]
                # returnImageArray[idx] = imgArray
                idx += 1
        # if i == 0: finalImgArray = returnImageArray
        # else: finalImgArray = numpy.concatenate((finalImgArray, returnImageArray), axis=0)
    vaw_time += time.time() - start
    # ---- Normalization inside 7 iterations <> Part 2
    # start = time.time()
    # normImageArray = NormalizeData(finalImgArray)
    # norm_time += time.time() - start
    # print "returnImageArray.shape:", finalImgArray.shape
    # ---- Normalization outside 7 iterations
    start = time.time()
    imgArray = create_imageArray(patchesList)
    array_time += time.time() - start
    start = time.time()
    normImgArray = NormalizeData( imgArray )
    norm_time += time.time() - start
    print "len(patchesList):",len(patchesList)
total_time = (time.time() - start_time)/len(img_list)
print "\npatches_time per img: {0:.3f} s".format(vaw_time/len(img_list))
print "create imgArray per img: {0:.3f} s".format(array_time/len(img_list))
print "normalization_time per img: {0:.3f} s".format(norm_time/len(img_list))
print "total time per image: {0:.3f} s \n".format(total_time)
Here is the GCN code in case you need to download it to use it: http://pastebin.com/RdVMD2P3
Details on code inside GCN
I am calling GCN using the default params.
At a high level, it takes the average of all the pixels and divides every pixel by that average. So if there's an image array that looks like [1 2 3], the average is 2; dividing each number by 2 gives [0.5, 1, 1.5]. That's what the normalization does. I forgot to highlight in the image above that mean = X.mean(axis=1).
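The description above can be written as a single vectorized NumPy expression. This is a minimal sketch of only what the paragraph describes (divide each row by its own mean); the function name `divide_by_row_mean` is hypothetical, and the linked GCN code may do more than this with its default parameters.

```python
import numpy as np

def divide_by_row_mean(X):
    """Divide each row (one flattened patch) by that row's mean."""
    X = np.asarray(X, dtype=np.float32)
    # keepdims=True gives shape (n_rows, 1), so the division broadcasts per row.
    mean = X.mean(axis=1, keepdims=True)
    return X / mean

patches = np.array([[1, 2, 3], [2, 4, 6]])
normalized = divide_by_row_mean(patches)
print(normalized)
```

A row whose mean is zero would produce inf or nan here, so an all-black patch needs separate handling.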
If you are wondering why I am re-iterating and creating a new imgArray to normalize instead of doing it during the original patch creation, it is to keep data transfer to a minimum. I am implementing this with the multiprocess library, and serializing data takes a LOOONG time, so I am trying to keep data serialization to a minimum (meaning: pass as little data back from the process as possible). I have measured the difference between doing it inside the 7 loops or outside, and the notes are below, so I can deal with that. However, if you know of a faster implementation, do let me know.
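The serialization concern can be made concrete by comparing payload sizes. The sketch below assumes results are pickled on the way back from worker processes, as the standard multiprocessing module does. The array of 1000 patches is made-up illustration data, not a measurement from the code above.

```python
import pickle
import numpy as np

# Hypothetical worker output: 1000 grayscale patches of 40x60 pixels.
raw = np.zeros((1000, 1, 40, 60), dtype=np.uint8)
# The same patches after conversion to floating point for normalization.
normalized = raw.astype(np.float64)

raw_bytes = len(pickle.dumps(raw, protocol=pickle.HIGHEST_PROTOCOL))
norm_bytes = len(pickle.dumps(normalized, protocol=pickle.HIGHEST_PROTOCOL))

print("uint8 payload:  ", raw_bytes)
print("float64 payload:", norm_bytes)
print("ratio: {:.1f}".format(norm_bytes / raw_bytes))
```

Under these assumptions, returning the raw 8-bit patches and normalizing in the parent process moves roughly one eighth of the bytes that returning float64 results would.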
Runtimes for creating imageArray inside 7 loops:
patches_time per img: 0.560 s
normalization_time per img: 0.336 s
total time per image: 0.896 s
Runtimes for creating imageArray and normalizing outside of 7 iterations:
patches_time per img: 0.040 s
create imgArray per img: 0.146 s
normalization_time per img: 0.339 s
total time per image: 0.524 s
I didn't see this before, but it seems creating the array is also taking some time.
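One possible way to cut the array-creation time is to build the whole batch with a single reshape instead of copying patch by patch. This is a sketch under stated assumptions: `sliding_window_view` (NumPy 1.20 or later) with a strided slice stands in for `view_as_windows` so the example has no scikit-image dependency, and the image is random data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

patch_h, patch_w, stride = 40, 60, 10
img = np.random.randint(0, 256, size=(120, 200), dtype=np.uint8)

# Stand-in for view_as_windows(img, (patch_h, patch_w), stride).
patches = sliding_window_view(img, (patch_h, patch_w))[::stride, ::stride]

# Loop version, in the spirit of create_imageArray: one copy per patch.
n = patches.shape[0] * patches.shape[1]
looped = np.zeros((n, 1, patch_h, patch_w), dtype=np.float32)
idx = 0
for r in range(patches.shape[0]):
    for c in range(patches.shape[1]):
        looped[idx, 0] = patches[r, c]
        idx += 1

# Batched version: one reshape (which copies the windows) and one cast.
batched = patches.reshape(-1, 1, patch_h, patch_w).astype(np.float32)

print(batched.shape, np.array_equal(looped, batched))
```

The batched form keeps the same patch order as the nested loops, but it drops the per-patch names and coordinates, so those would have to be tracked separately if they are needed.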
