There is already a question on this but the answer contains a broken link, and being over two years old, I'm hoping there's a better solution now :)
Low discrepancy quasi-random sequences, e.g. Sobol sequences, fill a space more uniformly than uniformly random sequences. Is there a good/easy way to generate them in python?
I think the best alternative for Low Discrepancy sequences in Python is Sensitivity Analysis Library (SALib):
https://github.com/SALib/SALib
I think this is an active project and you can contact the author to check if the functionalities you need are already implemented. If that doesn't solve your problem, Corrado Chisari ported a SOBOL version made in Matlab (by John Burkardt) to Python, you can access it here:
http://people.sc.fsu.edu/~jburkardt/py_src/sobol/sobol.html
Someone cleaned up the comments in these sources and put them in the format of docstrings. It's much more readable and you can access it here:
https://github.com/naught101/sobol_seq
Scipy has this option now http://scipy.github.io/devdocs/generated/scipy.stats.qmc.Sobol.html
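As a quick illustration of the SciPy route, here is a minimal sketch (assuming SciPy >= 1.7; the dimension, point count, and seed are arbitrary choices for the example) that draws a scrambled Sobol sample and compares its discrepancy with that of plain pseudo-random points:

```python
import numpy as np
from scipy.stats import qmc

dim, m = 2, 8  # 2**8 = 256 points in 2 dimensions (arbitrary example sizes)

# Scrambled Sobol points; random_base2 always draws a power-of-two count.
sobol_points = qmc.Sobol(d=dim, scramble=True, seed=0).random_base2(m=m)

# Ordinary pseudo-random points of the same shape, for comparison.
random_points = np.random.default_rng(0).random((2**m, dim))

# Lower discrepancy means the unit square is filled more evenly.
sobol_disc = qmc.discrepancy(sobol_points)
random_disc = qmc.discrepancy(random_points)
print("Sobol discrepancy: ", sobol_disc)
print("random discrepancy:", random_disc)
```

The Sobol discrepancy should come out clearly smaller, which is the "fills the space more uniformly" property the question refers to.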
PyTorch also provides an option for generating Sobol quasi-random numbers. It supports up to ~1k dimensions and has an option to switch on scrambling.
https://pytorch.org/docs/stable/generated/torch.quasirandom.SobolEngine.html
Chaospy is also a valid option. One can select among several approaches for low-discrepancy sampling (including Sobol, Latin hypercube, etc.); for more details see the documentation.
I think the easiest way to do it now (as of SciPy version >= 1.7.1) is how I'm doing it here. It's good for up to 21,201 dimensions as they implemented the Joe and Kuo algorithm, which is the highest number of dimensions you can get (opensource). https://web.maths.unsw.edu.au/~fkuo/sobol/
Here I show how to use the base2 method (with Owen Scrambling) and the random method (which generates an arbitrary number of points from the sequence), and how to skip the first point.
Note that this routine can be quite slow (due to the ndtri, or inverse normal distribution conversion of the points to shocks), especially in high dimensions + high simulation counts. Point generation from the Sobol sequence itself is quite fast, but for most Monte Carlo simulations, you convert them into shocks (you may be using another distribution other than the standard normal).
This at least lets you generate the points in Python code directly.
Also, in the QMCgenerate routine, I'm skipping the first point (which is 0s) - while this is commonly done, some papers suggest not doing so (but I haven't seen a good alternative, if you have one, feel free to comment). I transpose them just so I can paste them in Excel later and examine the generated shocks. Anyway, hope those of you who need this algorithm find it useful.
from scipy.stats import qmc # needs SciPy >= 1.7.1
from scipy.special import ndtri
import numpy as np
import timeit
time_periods = 252
factors = 12
# IF using base2 generation, need a pow(2,m)
sims = 8192
dimensions = factors*time_periods
def RQMCgenerate (dimensions, sims, seed):
    start_time = timeit.default_timer()
    m=10 # start at 1024 sims
    while pow(2,m) < sims: #m = 17 # 131,072 sims; M = 16 # 65,536 sims
        m = m+1
    RQMCgenerator = qmc.Sobol(dimensions, scramble=True, seed=seed)
    RQMCsamples = RQMCgenerator.random_base2(m)
    print('\n' + 'Time after sample generation RQMC:', (timeit.default_timer() - start_time), 'seconds')
    sobol = ndtri(RQMCsamples).T # get normsinv(points) and transpose to dimensions * sims
    del RQMCsamples
    print('\n' + 'Time after ndtri (normsinv) of', sims,'sims x dimensions', dimensions, 'Randomized Sobol points): ', (timeit.default_timer() - start_time), 'seconds')
    return sobol
def QMCgenerate(dimensions, sims):
    start_time = timeit.default_timer()
    QMCgenerator = qmc.Sobol(dimensions, scramble=False)
    QMCgenerator.fast_forward(1) #skip first point where normsinv(0) = -Inf
    QMCsamples = QMCgenerator.random(sims) #this generates points not having to be powers of 2
    print('\n' + 'Time after sample generation QMC:', (timeit.default_timer() - start_time), 'seconds')
    sobol = ndtri(QMCsamples).T # get normsinv(points) and transpose to dimensions * sims
    del QMCsamples
    print('\n' + 'Time after ndtri (normsinv) of', sims,'sims x dimensions', dimensions, 'Sobol points):', (timeit.default_timer() - start_time), 'seconds')
    return sobol
RQMCsobol = RQMCgenerate(dimensions, sims, seed=0) #note sims changed with pow(2,m) if a power of 2 was not passed
sobol = QMCgenerate(dimensions, sims)
Time after sample generation RQMC: 0.4269224999952712 seconds
Time after ndtri (normsinv) of 8092 sims x dimensions 3024 Randomized Sobol points): 1.0048970999996527 seconds
Time after sample generation QMC: 0.0630135999963386 seconds
Time after ndtri (normsinv) of 8092 sims x dimensions 3024 Sobol points): 0.5444753999981913 seconds
This gets much slower at higher sims*dimensions, although I haven't found a faster conversion of points to normally distributed shocks than ndtri in Python:
Time after sample generation RQMC: 2.1779929000040283 seconds
Time after ndtri (normsinv) of 131072 sims x dimensions 3024 Randomized Sobol points): 10.617904700004146 seconds
Time after sample generation QMC: 1.079756200000702 seconds
Time after ndtri (normsinv) of 131072 sims x dimensions 3024 Sobol points): 9.545934699999634 seconds
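To make the reason for skipping the first point concrete, here is a small sketch (assuming SciPy >= 1.7; the dimension 3 is an arbitrary choice) showing that the first unscrambled Sobol point is the origin, that ndtri maps it to -inf, and that fast_forward(1) starts the sequence at the next point instead:

```python
import numpy as np
from scipy.stats import qmc
from scipy.special import ndtri

# First point of the unscrambled sequence: all zeros.
plain = qmc.Sobol(d=3, scramble=False)
first_point = plain.random(1)
first_shock = ndtri(first_point)  # inverse normal CDF of 0 is -inf

# Same sequence, but skipping that first point.
skipped = qmc.Sobol(d=3, scramble=False)
skipped.fast_forward(1)
second_point = skipped.random(1)
second_shock = ndtri(second_point)  # inverse normal CDF of 0.5 is 0

print(first_point, first_shock)
print(second_point, second_shock)
```

With scramble=True the points are randomly shifted away from the origin, which is one reason the scrambled generator is usually called without a skip.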
I need a very fast and efficient way of doing a rolling linear regression.
I looked through these two threads :
Efficient way to do a rolling linear regression
Rolling linear regression
From them, I had inferred numpy was (computationally) the fastest. However, using my (limited) python skills, I found the time to compute the same set of rolling data was *** the same ***.
Is there a faster way to compute this than any of the 3 methods I post below? I would have thought the numpy way would be much faster, but unfortunately, it wasn't.
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]
win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.empty(0)
for cnt_ in range(len(tmp_)-win_):
    tmp1_ = tmp_[cnt_:cnt_+ win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
    roll_np = np.append(roll_np, grad_)
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
tl;dr
My answer is
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T @ xxx) @ (xxx.T) @ view.T)[0]
And it takes 1.2 ms to compute, compared to 2 seconds for your pandas and numpy version, and 3.5 seconds for your stat version.
Long version
One method is to use sliding_window_view to transform your tmp_ array into an array of windows (a fake one: it is just a view, not really a 10000x30 array of data. It is just tmp_ viewed differently, hence the _view in the function name).
No direct advantage. But then, from there, you can try to take advantage of vectorization.
I do that in two different ways: an easy one, and one that takes a minute of thinking. Since I put the best answer first, the rest of this message can appear chronologically inconsistent (I say things like "in my previous answer" when the previous answer comes later), but I tried to write both answers consistently.
New answer : matrix operations
One way to do that (since lstsq is one of the rare numpy functions that won't just do it naturally) is to go back to what lstsq(X,Y) really computes: (XᵀX)⁻¹Xᵀ Y
So let's just do that. In python, with xxx being the X array (of arange and 1 in your example) and view the array of windows onto your data (that is, view[i] is tmp_[i:i+win_]), that would be np.linalg.inv(xxx.T@xxx)@xxx.T@view[i] for i being each row. We could vectorize that operation with np.vectorize to avoid iterating over i, as I did for my first solution (see below). But the thing is, we don't need to. That is just a matrix times a vector. And computing a matrix times a vector for each vector in an array of vectors is just matrix multiplication!
Hence my 2nd (and probably final) answer
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T @ xxx) @ (xxx.T) @ view.T)[0]
roll_mat is still identical (with one extra row because your roll_np stopped one row short of the last possible one) to roll_np (see below for graphical proof with my first answer. I could provide a new image for this one, but it is indistinguishable from the one I already used). So same result (unsurprisingly I should say... but sometimes it is still a surprise when things work exactly like theory says they do)
But timing, is something else. As promised, my previous factor 4 was nothing compared to what real vectorization can do. See updated timing table:
| Method                           | Time   |
|----------------------------------|--------|
| pandas                           | 2.10 s |
| numpy roll                       | 2.03 s |
| stat                             | 3.58 s |
| numpy view/vectorize (see below) | 0.46 s |
| numpy view/matmult               | 1.2 ms |
The important part is 'ms', compared to other 's'.
So, this time factor is 1700 !
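For readers who want to check the matrix formulation on their own data, here is a small self-contained sketch. It assumes NumPy >= 1.20 (for sliding_window_view), uses random data of an arbitrary length, and swaps in np.linalg.pinv(X), which equals (XᵀX)⁻¹Xᵀ when X has full column rank, instead of forming the inverse explicitly. The names y, win, and reference are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.random(1000)  # stand-in for the price series
win = 30

# Every length-30 window of y, as a view: shape (971, 30).
windows = np.lib.stride_tricks.sliding_window_view(y, win)

# Design matrix with columns [0..win-1] and ones, as in the answer.
X = np.column_stack([np.arange(win), np.ones(win)])

# One matrix product solves all windows at once: rows are (slope, intercept).
coef = np.linalg.pinv(X) @ windows.T
slopes = coef[0]

# Slow reference: fit each window separately.
reference = np.array([np.polyfit(np.arange(win), w, 1)[0] for w in windows])
print(np.allclose(slopes, reference))
```

Because X is the same for every window, its pseudo-inverse is computed once and reused, which is where the speed-up comes from.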
Old-answer : vectorize
A lame method, once we have this view could be to use np.vectorize from there. I call it lame because vectorize is not supposed to be efficient. It is just a for loop called by another name. Official documentation clearly says "not to be used for performance". And yet, it would be an improvement from your code
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
f = np.vectorize(lambda y: np.linalg.lstsq(xxx,y,rcond=None)[0][0], signature='(n)->()')
roll_vectorize=f(view)
First let's verify the result
plt.scatter(f(view)[:-1], roll_np)
So, obviously, same results as roll_np (which, I've checked the same way, are the same results as the two others. With also the same variation on indexing since all 3 methods have not the same strategy for border)
And the interesting part, timings:
| Method               | Time   |
|----------------------|--------|
| pandas               | 2.10 s |
| numpy roll           | 2.03 s |
| stat                 | 3.58 s |
| numpy view/vectorize | 0.46 s |
So, you see, it is not supposed to be for performance, and yet I gain more than a factor of 4 with it.
I am pretty sure that a more vectorized method (alas, lstsq doesn't directly allow it, unlike most numpy functions) would be even faster.
First if you need some tips for optimizing your python code, I believe this playlist might help you.
To make it faster: appending inside a loop is never a good idea. Think of it in terms of memory: every time you call np.append, a completely new array with a bigger size is created (n+1, where n is the old size), the n existing items are copied into it, and the new item is written in the last place.
So when I changed it to be as follows
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]
win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.zeros(len(tmp_)-win_) ### Change
for cnt_ in range(len(tmp_)-win_):
    tmp1_ = tmp_[cnt_:cnt_+ win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
    roll_np[cnt_] = grad_ ### Change
    # roll_np = np.append(roll_np, grad_) ### Change
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
I initialized the array from first place with the size of how it's expected to turn to be(len(tmp_)-win_ in range), and just assigned values to it later, and it was much faster.
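The difference between the two patterns can be seen in isolation with a toy sketch (the function names and the dummy values i * 0.5 are made up for the illustration; they stand in for the per-window slope):

```python
import numpy as np

def collect_with_append(n):
    out = np.empty(0)
    for i in range(n):
        # np.append returns a NEW array each time and copies all old items.
        out = np.append(out, i * 0.5)
    return out

def collect_preallocated(n):
    out = np.empty(n)  # allocate once, at the final size
    for i in range(n):
        out[i] = i * 0.5  # write in place, no copying
    return out

# np.append never modifies its input: a keeps its size, b is a different array.
a = np.zeros(3)
b = np.append(a, 1.0)
print(a.shape, b.shape, b is a)
```

Both functions return the same values; only the amount of copying differs, and it grows quadratically with n in the append version.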
There are also some other tips. Python is an interpreted language: CPython compiles your source to bytecode and then executes it one instruction at a time, so every iteration of an explicit loop pays interpreter overhead. If you can express several steps as a single operation, it is usually faster; think of list comprehensions, for example.
I have tried the following Python median filters on time-series signals to find the fastest and most efficient function.
sig is a numpy array of size 80×188 which contains 188 samples measured by 80 sensors.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import medfilt
from scipy.signal import medfilt2d
import time
sig = np.random.rand(80,188).astype('f')
print(type(sig))
print(type(sig[0][0]))
window_length = 181
t = time.time()
sigFiltered = medfilt2d(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt2d: %g seconds' % elapsed)
t = time.time()
sigFiltered = median_filter(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.ndimage.median_filter: %g seconds' % elapsed)
t = time.time()
sigFiltered = medfilt(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt: %g seconds' % elapsed)
The code can be tried here.
The result of the filter is another time-series array of size 80×188 with smoothed time-points for each sensor.
MATLAB medfilt1(sig, 181, [], 2) performs the filtering on the same data 10 times faster compared to scipy.signal.medfilt2d, which was the fastest among other functions. On my machine, MATLAB=2ms vs Python=20 ms. I think MATLAB performs multithreading processing and python does not.
Is there any way to perform multithreaded median filtering to speed up the process and assign sensors to different threads? Is there a more efficient median filter available in Python? Can I achieve the performance of MATLAB in Python, or at least get closer to it?
With such a long filter relative to the input, most outputs using a standard medfilt are going to be the same. Were this a convolution, it would be a "full" convolution. If you instead only compute the outputs of a "valid" convolution, it will be much faster in this case:
t = time.time()
medians = []
for i in range(sig.shape[1] - window_length + 1):  # every "valid" window position
    sig2 = sig[:, i:i+window_length]
    f = np.median(sig2, axis=1)
    medians.append(f)
sigFiltered = np.stack(medians).T
elapsed = time.time() - t
print('numpy.median: %g seconds' % elapsed)
numpy.median: 0.0015518 seconds
This is in the ballpark of the requested 1 ms runtime per 188 sample size.
Even each of these unique values will change only slowly/rarely with new input samples. You could therefore speed this up considerably by using a hop larger than 1.
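A hop can be sketched as follows. This assumes NumPy >= 1.20 for sliding_window_view; the hop of 2 is an arbitrary illustrative value, and the random sig only mimics the 80×188 shape from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

sig = np.random.default_rng(0).random((80, 188)).astype("f")
window_length = 181
hop = 2  # illustrative choice: evaluate every second window

# All "valid" windows along the time axis: 188 - 181 + 1 = 8 positions,
# giving a view of shape (80, 8, 181).
windows = sliding_window_view(sig, window_length, axis=1)

# Median of every window, and of every hop-th window only.
medians_all = np.median(windows, axis=-1)            # shape (80, 8)
medians_hop = np.median(windows[:, ::hop], axis=-1)  # shape (80, 4)
print(medians_all.shape, medians_hop.shape)
```

The hopped result is simply a subsample of the full one, so the cost drops roughly in proportion to the hop.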
I'm wondering why you're using a median filter of 181 points for a data length of 188? The filter is so long that you're essentially just throwing away all the data and replacing it with the global median of the sensor's output. Typical median filter lengths would be a few samples, depending on what kind of transients you want to filter out.
The filter length also explains why it's so slow. On my machine, your median_filter example takes 46 ms. Running with a more normal filter size of 3 samples takes 0.7 ms.
My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1); plot(0:nIter,X)
Python:
import time
import numpy as np
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
    x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)
How to speed up your Python code
Your largest time sink is np.cos which performs several checks on the format of the input.
These checks are relevant and usually negligible for array inputs, but for your scalar input they become the bottleneck.
The solution to this is to use math.cos, which only accepts scalars as input and thus is faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by having one state variable which you update and only writing to x once per iteration.
With all of this, you can speed up things by a factor of roughly ten:
import numpy as np
from math import cos
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
    state += Step*state**2*cos(Step*i+state)
    x[i+1] = state
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using uFuncs (e.g., created with SymPy’s ufuncify) and using NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos
nIter = 1000000
Step = 0.001
# one Euler step as a binary ufunc: (previous state y, time t) -> next state
f = ufuncify([y,t],y+Step*y**2*cos(t+y))
times = np.arange(0,nIter*Step,Step)
times[0] = 1  # accumulate starts from the first element, so it holds the initial state
x = f.accumulate(times)
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and is a very simple dynamics.
Moreover, you are using the Euler method, which is either not very efficient or not very robust, depending on your step size.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need:
With a step size of 1, you can see what’s going on just fine.
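The efficiency problem of the Euler method can be seen on a toy problem. The sketch below does not use the ODE from the question (it has no closed-form solution to compare with) but the test equation y' = -y with y(0) = 1, chosen here purely for illustration; the helper name euler is also made up.

```python
import math

def euler(f, y0, t_end, n_steps):
    """Integrate y' = f(t, y) from t = 0 to t_end with n_steps Euler steps."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y

exact = math.exp(-1.0)  # exact solution of y' = -y, y(0) = 1 at t = 1
step_counts = (100, 200, 400)
errors = [abs(euler(lambda t, y: -y, 1.0, 1.0, n) - exact) for n in step_counts]

for n, err in zip(step_counts, errors):
    print(n, err)
```

Each doubling of the number of steps only halves the error (first-order convergence), so getting a few more digits requires orders of magnitude more steps. Higher-order adaptive methods reach the same accuracy with far fewer evaluations.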
If you want a robust integration in such cases, it’s almost always best to use a modern adaptive integrator, that can adjust its step size itself, e.g., here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
    lambda t,state: state**2*cos(t+state),
    t_span = (0,T),
    t_eval = np.arange(0,T,dt),
    y0 = [1],
    rtol = 1e-5
).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calculation (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example with a sampling step of 1.
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.
The difference is JIT compilation, which Matlab uses by default. Let's try your example with Numba (a Python JIT compiler).
Code
import numba as nb
import numpy as np
import time
nIter = 1000000
Step = .001
@nb.njit()
def integrate(nIter,Step):
    x = np.zeros(1+nIter)
    x[0] = 1
    for i in range(1,1+nIter):
        x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
    return x
#Avoid measuring the compilation time,
#this would be also recommendable for Matlab to have a fair comparison
res=integrate(nIter,Step)
start = time.time()
for i in range(100):
    res=integrate(nIter,Step)
end=time.time()
print((end - start)/100)
This results in 0.022s runtime per call.
There are several threads asking for a way to simulate time-inhomogeneous Poisson processes in Python. The NeuroTools module offers a simple way to do so via the inh_poisson_generator() function. The help text of this function is reproduced at the bottom of this post. The function was originally designed to simulate spike trains, and uses the thinning method.
I would like to simulate a spike train during 2000ms. The spike rate (in Hertz) changes every millisecond, and is comprised between 20 spikes/second and 160 spikes/second. I've tried to simulate this using the following code:
import NeuroTools
import numpy as np
from NeuroTools import stgen
import matplotlib.pyplot as plt
import random
st_gen = stgen.StGen()
time = np.arange(0, 2000)
t_rate = []
for i in range (2000):
    t_rate.append(random.randrange(20, 161, 1))
t_rate = np.array(t_rate)
Psim = st_gen.inh_poisson_generator(rate = t_rate, t = time, t_stop = 2000, array = True)
However, the code returns very few timestamps (e.g., array([ 397.55345905, 1208.79804513, 1478.03525045, 1982.63643262])), which doesn't make sense to me. I would appreciate any help on this.
inh_poisson_generator(self, rate, t, t_stop, array=False) method of NeuroTools.stgen.StGen instance
Returns a SpikeTrain whose spikes are a realization of an inhomogeneous
poisson process (dynamic rate). The implementation uses the thinning
method, as presented in the references.
Inputs:
rate - an array of the rates (Hz) where rate[i] is active on interval
[t[i],t[i+1]]
t - an array specifying the time bins (in milliseconds) at which to
specify the rate
t_stop - length of time to simulate process (in ms)
array - if True, a numpy array of sorted spikes is returned,
rather than a SpikeList object.
Note:
t_start=t[0]
References:
Eilif Muller, Lars Buesing, Johannes Schemmel, and Karlheinz Meier
Spike-Frequency Adapting Neural Ensembles: Beyond Mean Adaptation and Renewal Theories
Neural Comput. 2007 19: 2958-3010.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Examples:
>> time = arange(0,1000)
>> stgen.inh_poisson_generator(time,sin(time), 1000)
I don't really have an answer for you but because this post helped me to get started with NeuroTools, I thought I'd share my small example which is working fine.
For the inh_poisson_generator() the rate input is in unit Hz and all times are in ms. I use an average rate of 1.6 spikes/ms, so I expect to receive ~4000 events. The results confirm that just fine!
I guess it might be an issue that you are using a non-continuous rate. However I barely know anything about the algorithm implemented for this function..
I hope my example can help you somehow!
import math
import numpy as np
import NeuroTools
from NeuroTools import stgen
v0=1.6 #spikes/ms
Amp=1 # amplitude in spikes/ms
w=4.0/1000 # modulation frequency in cycles/ms (float literal avoids integer division under Python 2)
st_gen = stgen.StGen()
tstop=2500.0
intervals=np.arange(0,tstop,0.05)
rate=np.array([])
for tt in intervals:
    v_next=v0+Amp*math.sin(2*math.pi*w*tt)
    if (v_next>0.0):
        rate=np.append(rate,v_next*1000)
    else:
        rate=np.append(rate,0.0)
PSim=st_gen.inh_poisson_generator(rate=rate,t = intervals, t_stop = 2500.0, array = True) # important to have rate in Hz and all other times in ms
print(len(PSim))
print(np.mean(rate)/1000*tstop)
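Since the thinning method comes up in both posts, here is a minimal NumPy-only sketch of it for piecewise-constant rates. This is not the NeuroTools implementation: the function name thinning_spike_train and its argument names are hypothetical, and it only follows the same unit convention as the docstring (rates in Hz, times in ms), which is an assumption about how one would want to compare the two.

```python
import numpy as np

def thinning_spike_train(rate_hz, t_ms, t_stop_ms, rng):
    """Thinning: draw candidates at the maximum rate, keep each with
    probability rate(t) / rate_max. rate_hz[i] applies from t_ms[i] on."""
    rate_hz = np.asarray(rate_hz, dtype=float)
    t_ms = np.asarray(t_ms, dtype=float)
    rate_max = rate_hz.max()
    if rate_max <= 0:
        return np.empty(0)
    t_start = t_ms[0]
    duration_s = (t_stop_ms - t_start) / 1000.0
    # Homogeneous Poisson process at rate_max: Poisson count, uniform times.
    n_candidates = rng.poisson(rate_max * duration_s)
    candidates = np.sort(rng.uniform(t_start, t_stop_ms, n_candidates))
    # Index of the rate bin each candidate falls into.
    idx = np.searchsorted(t_ms, candidates, side="right") - 1
    keep = rng.uniform(0.0, 1.0, n_candidates) < rate_hz[idx] / rate_max
    return candidates[keep]

rng = np.random.default_rng(0)
t = np.arange(0, 2000)                  # 1 ms bins, as in the question
rate = rng.integers(20, 161, size=2000) # 20..160 Hz, changing every ms
spikes = thinning_spike_train(rate, t, 2000, rng)

# Expected count: sum over bins of rate [Hz] * bin width [s].
expected = rate.sum() * 1e-3
print(len(spikes), expected)
```

For rates between 20 and 160 Hz over 2 s, the expected count is on the order of 180 spikes, which gives a reference number to compare any generator's output against.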
I have a 24000 * 316 numpy matrix, each row represents a time series with 316 time points, and I am computing pearson correlation between each pair of these time series. Meaning as a result I would have a 24000 * 24000 numpy matrix having pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slow). I am wondering if it is expected to be this slow (takes more than a day!!!). And what I might be able to do about it...
If it helps this is my code... nothing special or hard..
def SimMat(mat,name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs
    SimM = numpy.zeros((mrange,mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range (mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results since they work nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute Pearson correlations row by row
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelized it, it should have linear speedup with the number of cores.
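If only the correlation coefficients are needed (not the p-values that pearsonr also returns), the pairwise loop can be replaced by a single matrix product. The sketch below is one way to do that, with a hypothetical helper name pearson_matrix and a small random matrix standing in for the real 24000 × 316 data:

```python
import numpy as np

def pearson_matrix(mat):
    """Pearson correlation between all pairs of rows of mat."""
    mat = np.asarray(mat, dtype=float)
    # Centre each row, then scale it to unit length.
    centered = mat - mat.mean(axis=1, keepdims=True)
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    # The dot product of two unit-length centred rows is their Pearson r.
    return normed @ normed.T

rng = np.random.default_rng(0)
small = rng.random((50, 316))  # small stand-in for the real data
corr = pearson_matrix(small)
print(np.allclose(corr, np.corrcoef(small)))
```

np.corrcoef computes the same matrix directly. For the full problem the float64 result is still about 4.6 GB, so if memory is tight the product can be computed one block of rows at a time (normed[a:b] @ normed.T) and each block written to disk.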