Simulation of Markov chain slower than in Matlab

I run the same test code in Python+NumPy and in Matlab, and the Matlab code is faster by an order of magnitude. I want to know what the bottleneck of the Python code is and how to speed it up.
I run the following test code using Python+Numpy (the last part is the performance sensitive one):
# Packages
import numpy as np
import time
# Number of possible outcomes
num_outcomes = 20
# Dimension of the system
dim = 50
# Number of iterations
num_iterations = int(1e7)
# Possible outcomes
outcomes = np.arange(num_outcomes)
# Possible transition matrices
matrices = [np.random.rand(dim, dim) for k in outcomes]
matrices = [mat/np.sum(mat, axis=0) for mat in matrices]
# Initial state
state = np.random.rand(dim)
state = state/np.sum(state)
# List of samples
samples = np.random.choice(outcomes, size=(num_iterations,))
samples = samples.tolist()
# === PERFORMANCE-SENSITIVE PART OF THE CODE ===
# Update the state over all iterations
start_time = time.time()
for k in range(num_iterations):
    sample = samples[k]
    matrix = matrices[sample]
    state = np.matmul(matrix, state)
end_time = time.time()
# Print the execution time
print(end_time - start_time)
I then run an equivalent code using Matlab (the last part is the performance sensitive one):
% Number of possible outcomes
num_outcomes = 20;
% Number of dimensions
dim = 50;
% Number of iterations
num_iterations = 1e7;
% Possible outcomes
outcomes = 1:num_outcomes;
% Possible transition matrices
matrices = rand(num_outcomes, dim, dim);
matrices = matrices./sum(matrices,2);
matrices = num2cell(matrices,[2,3]);
matrices = cellfun(@shiftdim, matrices, 'UniformOutput', false);
% Initial state
state = rand(dim,1);
state = state./sum(state);
% List of samples
samples = datasample(outcomes, num_iterations);
% === PERFORMANCE-SENSITIVE PART OF THE CODE ===
% Update the state over all iterations
tic;
for k = 1:num_iterations
sample = samples(k);
matrix = matrices{sample};
state = matrix * state;
end
toc;
The Python code is consistently slower than the Matlab code by an order of magnitude, and I am not sure why.
Any idea where to start?
I run the Python code with the Python 3.10 interpreter and Numpy 1.22.4. I run the Matlab code with Matlab R2022a. Both codes are run on Windows 11 Pro 64 bits on a Lenovo T14 ThinkPad with the following processors:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 2803 MHz, 4 Core(s), 8 Logical Processor(s)
EDIT 1: I ran some additional tests, and it looks like the culprit is some type of Python-specific constant overhead at small matrix sizes. [plot omitted]
As hpaulj and MSS suggest, this might mean that a JIT compiler could solve some of these issues. I will do my best to try this in the near future.
EDIT 2: I ran the code under PyPy 3.9-v7.3.11-win64, and although it does change the scaling and even beats CPython at small matrix sizes, it generally incurs a big overhead for this particular code. [plot omitted]
So a JIT compiler could help if there are ways to mitigate this overhead. Otherwise a Cython implementation is probably the remaining way to go...
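One way to see the constant per-call overhead mentioned in EDIT 1 is to time a single np.matmul call at several matrix sizes. The following is only a rough sketch: the helper name time_per_matvec, the repeat count and the list of sizes are illustrative assumptions, not part of the original tests.

```python
import time
import numpy as np

def time_per_matvec(dim, repeats=2000):
    """Average wall-clock seconds for one np.matmul(matrix, state) call."""
    rng = np.random.default_rng(0)
    matrix = rng.random((dim, dim))
    matrix /= matrix.sum(axis=0)      # columns sum to 1, as in the question
    state = rng.random(dim)
    state /= state.sum()
    start = time.perf_counter()
    for _ in range(repeats):
        state = np.matmul(matrix, state)
    return (time.perf_counter() - start) / repeats

# Hypothetical sizes; if a fixed overhead dominates, the small sizes
# will take nearly the same time per call.
for dim in (2, 10, 50, 200):
    print(f"dim={dim:4d}: {time_per_matvec(dim) * 1e6:8.2f} us per call")
```

If the time per call barely changes between the smallest sizes, most of it is call overhead rather than arithmetic.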

In the loop, the main bottleneck is np.matmul(matrix, state).
If we unroll the loop:
st[1] = m[0] @ st[0]
st[2] = m[1] @ st[1] = m[1] @ m[0] @ st[0]
st[3] = m[2] @ m[1] @ m[0] @ st[0]
There is no obvious vectorized way to replace the looped np.matmul calls, because each step depends on the previous one.
A better way is to multiply the matrices pairwise, which needs only about log_2(n) loop iterations.
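import numpy as np
outcomes = 20
dim = 50
# mat[samples] below stores num_iter matrices of dim x dim floats;
# the question's 1e7 iterations would need about 200 GB, so use fewer here
num_iter = int(1e4)
mat = np.random.rand(outcomes, dim, dim)
# Normalize every column to sum to 1, as in the question
mat = mat / mat.sum(axis=1, keepdims=True)
state = np.random.rand(dim)
state = state / np.sum(state)
samples = np.random.choice(np.arange(outcomes), size=(num_iter,))
a = mat[samples, ...]
# This while loop takes about log_2(num_iter) iterations
while len(a) > 1:
    if len(a) % 2:
        # Odd count: pad with an identity matrix so the pairs line up
        a = np.concatenate([a, np.eye(dim)[None]])
    # Later matrices go on the left: m[1] @ m[0]
    a = np.matmul(a[1::2, ...], a[::2, ...])
state = np.matmul(a[0], state)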
The time may be further reduced by using numba jit.
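As a sanity check on the pairwise idea, the sketch below compares it with the plain sequential loop on a small problem. The helper name chain_product and the tiny sizes are assumptions made for illustration. Note that each level multiplies full matrices (about dim^3 work per pair) instead of a matrix and a vector (about dim^2), so fewer Python iterations do not automatically mean less total work.

```python
import numpy as np

def chain_product(mats):
    """Return mats[-1] @ ... @ mats[1] @ mats[0] by pairwise reduction."""
    a = np.asarray(mats)
    dim = a.shape[-1]
    while len(a) > 1:
        if len(a) % 2:
            # Pad with an identity so every matrix has a partner
            a = np.concatenate([a, np.eye(dim)[None]])
        # The later matrix of each pair goes on the left
        a = np.matmul(a[1::2], a[::2])
    return a[0]

rng = np.random.default_rng(0)
outcomes, dim, num_iter = 4, 3, 37          # small illustrative sizes
mat = rng.random((outcomes, dim, dim))
mat /= mat.sum(axis=1, keepdims=True)       # columns sum to 1
state0 = rng.random(dim)
state0 /= state0.sum()
samples = rng.integers(outcomes, size=num_iter)

# Reference: the sequential update from the question
state_loop = state0.copy()
for s in samples:
    state_loop = mat[s] @ state_loop

state_tree = chain_product(mat[samples]) @ state0
print(np.allclose(state_loop, state_tree))
```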

Related

Is there any simple method to parallel np.einsum?

I would like to know whether there is any simple method to parallelize einsum in NumPy.
I found some discussions
Numpy np.einsum array multiplication using multiple cores
Any chance of making this faster? (numpy.einsum)
numpy.tensordot() only covers binary contractions over a single axis, and with Numba one has to write out the loops explicitly. Is there any simple and robust approach to parallelize einsum (possibly including opt-einsum, tf-einsum etc.) with arbitrary contractions?
Sample code follows (if necessary I can use a more complicated contraction as the example):
import numpy as np
import timeit
import time
na = nc = 1000
nb = 1000
n_iter = 10
A = np.random.random((na,nb))
B = np.random.random((nb,nc))
t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = np.einsum('ij,jk->ik', A, B)
    end = time.time()
    t_total += end - start
print('AB->C',(t_total)/n_iter)
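For the specific two-operand contraction in this sample, the result is the same as a plain matrix product, and np.einsum also accepts a documented optimize argument that lets NumPy choose a contraction path. The sketch below only checks that the three forms agree on smaller, made-up shapes. Whether any of them actually runs on several cores depends on the BLAS library NumPy was built against, which is an assumption outside this code.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 300))   # smaller than the question's 1000 x 1000
B = rng.random((300, 100))

C_plain = np.einsum('ij,jk->ik', A, B)
# optimize=True lets NumPy pick a contraction order for the expression
C_opt = np.einsum('ij,jk->ik', A, B, optimize=True)
# For this contraction the matrix product gives the same result
C_mm = A @ B

print(np.allclose(C_plain, C_opt), np.allclose(C_plain, C_mm))
```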

Parallelize three nested loops

Context:
I have 3 3D arrays ("precursor arrays") that I am upsampling with an Inverse Distance Weighting method. To do that, I calculate a 3D weights array that I use in a for loop on each point of my precursor arrays.
Each 2D slice of my weights array is used to calculate a partial array. Once I generate all 28 of them, they are summed to give one final host array.
I would like to parallelize this for loop in order to reduce my computing time. I tried, but I cannot manage to update my host arrays correctly.
Question:
How could I parallelize my main function (last section of my code) ?
EDIT: Or is there a way I could "slice" my i for loop (for example one core running between i = 0 to 5, and one core running on i = 6 to 9) ?
Summary:
3 precursor arrays (temperatures, precipitations, snow): 10x4x7 (10 is a time dimension)
1 weight array (w): 28x1101x2101
28x3 partial arrays: 1101x2101
3 host arrays (temp, prec, Eprec): 1101x2101
Here is my code (runnable as is, aside from the MAIN ALGORITHM PARALLEL section; please see the MAIN ALGORITHM NOT PARALLEL section at the end for the non-parallelized version of my code):
import numpy as np
import multiprocessing as mp
import time
#%% ------ Create data ------ ###
temperatures = np.random.rand(10,4,7)*100
precipitation = np.random.rand(10,4,7)
snow = np.random.rand(10,4,7)
# Array of altitudes to "adjust" the temperatures
alt = np.random.rand(4,7)*1000
#%% ------ Functions to run in parallel ------ ###
# This function upsamples the precursor arrays and creates the partial arrays
def interpolator(i, k, mx, my):
    T = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000)) * w[k,:,:]
    P = (precipitation[i,mx,my])*w[k,:,:]
    S = (snow[i,mx,my])*w[k,:,:]
    return(T, P, S)
# We add each partial array to each other to create the host array
def get_results(results):
    global temp, prec, Eprec
    temp += results[0]
    prec += results[1]
    Eprec += results[2]
#%% ------ IDW Interpolation ------ ###
# We create a weight matrix that we use to upsample our temperatures, precipitations and snow matrices
# This part is not that important, it works well as it is
MX,MY = np.shape(temperatures[0])
N = 300
T = np.zeros([N*MX+1, N*MY+1])
# create NxM inverse distance weight matrices based on Gaussian interpolation
x = np.arange(0,N*MX+1)
y = np.arange(0,N*MY+1)
X,Y = np.meshgrid(x,y)
k = 0
w = np.zeros([MX*MY,N*MX+1,N*MY+1])
for mx in range(MX):
    for my in range(MY):
        # Gaussian
        add_point = np.exp(-((mx*N-X.T)**2+(my*N-Y.T)**2)/N**2)
        w[k,:,:] += add_point
        k += 1
sum_weights = np.sum(w, axis=0)
for k in range(MX*MY):
    w[k,:,:] /= sum_weights
#%% ------ MAIN ALGORITHM PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    # Start a timer
    ts = time.time()
    # Iterate over the time dimension
    for i in range(temperatures.shape[0]):
        # Initialize the host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        # Create the pool based on my amount of cores
        pool = mp.Pool(mp.cpu_count())
        # Loop through every weight slice, for every cell of the temperatures, precipitations and snow arrays
        for k in range(0,w.shape[0]):
            for mx in range(MX):
                for my in range(MY):
                    # Upsample the temperatures, precipitations and snow arrays by adding the contribution of each weight slice
                    pool.apply_async(interpolator, args = (i, k, mx, my), callback = get_results)
        pool.close()
        pool.join()
    # Print the time spent on the loop
    print("Time spent: ", time.time()-ts)
#%% ------ MAIN ALGORITHM NOT PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    ts = time.time()
    for i in range(temperatures.shape[0]):
        # Create empty host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        k = 0
        for mx in range(MX):
            for my in range(MY):
                get_results(interpolator(i, k, mx, my))
                k += 1
    print("Time spent:", time.time()-ts)

The problem with multiprocessing is that it creates many new processes that each execute the code placed before the main block (i.e. before if __name__ == '__main__'). This makes initialization very slow (since every process repeats it) and wastes a huge amount of RAM. You should certainly move everything into the main block or, if possible, into functions (which generally results in faster execution and is good software engineering practice anyway, especially for parallel code). Even then, there is another big problem with multiprocessing: inter-process communication is slow. One solution is a multi-threaded approach, made possible by Numba or Cython (they let you release the GIL, unlike basic CPython threads). In fact, they are often simpler to use than multiprocessing. However, you need to be more careful, since parallel accesses are unprotected and data races can appear in buggy parallel code.
In your case, the computation is mostly memory-bound. This means multiprocessing is pretty useless. In fact, parallelism is barely useful here unless you run this code on a computing server with high memory throughput. Indeed, memory is a shared resource and using more cores does not help much, since one core can almost saturate the memory bandwidth on a regular PC (while a few cores are needed on computing servers).
The key to speeding up memory-bound code is to avoid creating temporary arrays and to use cache-friendly algorithms. In your case, T, P and S are filled just to be read later in order to update the temp, prec and Eprec arrays. This temporary step is pretty expensive and not necessary here (especially filling the arrays). Removing it increases the arithmetic intensity, resulting in code that will certainly be faster sequentially and that scales better on multiple cores. This is the case on my machine.
Here is an example that uses Numba to parallelize the code:
import numba as nb
# get_results + interpolator
@nb.njit('void(float64[:,::1], float64[:,::1], float64[:,::1], float64[:,:,::1], int_, int_, int_, int_)', parallel=True)
def interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my):
    factor1 = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000))
    factor2 = precipitation[i,mx,my]
    factor3 = snow[i,mx,my]
    for i in nb.prange(w.shape[1]):
        for j in range(w.shape[2]):
            val = w[k, i, j]
            temp[i, j] += factor1 * val
            prec[i, j] += factor2 * val
            Eprec[i, j] += factor3 * val
# Example of usage:
interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
Note that the string passed to nb.njit is called a signature; it specifies the types to the JIT so that it can compile the function eagerly.
This code is 4.6 times faster on my 6-core machine (while it was barely faster without merging get_results and interpolator). In fact, it is 3.8 times faster in sequential, so threads do not help much since the computation is still memory-bound. Indeed, the cost of the multiply-add is negligible compared to the memory reads/writes.
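For readers without Numba, the same "no temporaries" idea can be sketched in plain NumPy. This is a different technique from the Numba code above: for one time step, each host array in the question's serial loop is a weighted sum of the slices of w, which np.tensordot computes as a single contraction over k. The sizes below are small stand-ins chosen for illustration, and the random w only mimics the normalized weights.

```python
import numpy as np

rng = np.random.default_rng(0)
MX, MY, H, W = 4, 7, 30, 50               # small stand-in sizes (assumption)
temperatures = rng.random((10, MX, MY)) * 100
precipitation = rng.random((10, MX, MY))
alt = rng.random((MX, MY)) * 1000
w = rng.random((MX * MY, H, W))
w /= w.sum(axis=0)                        # weights sum to 1 at every pixel

i = 0                                     # one time step

# Reference: the serial loop from the question (k advances with mx, my)
temp_loop = np.zeros((H, W))
prec_loop = np.zeros((H, W))
k = 0
for mx in range(MX):
    for my in range(MY):
        factor = (temperatures[i, mx, my] - 272.15) + (-alt[mx, my] * -6 / 1000)
        temp_loop += factor * w[k]
        prec_loop += precipitation[i, mx, my] * w[k]
        k += 1

# One contraction over k per host array; ravel() matches k = mx * MY + my
t_factor = (temperatures[i] - 272.15 + alt * 6 / 1000).ravel()
temp_dot = np.tensordot(t_factor, w, axes=1)
prec_dot = np.tensordot(precipitation[i].ravel(), w, axes=1)

print(np.allclose(temp_loop, temp_dot), np.allclose(prec_loop, prec_dot))
```

The snow array would follow the same pattern as precipitation.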

Numeric Integration Python versus Matlab

My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1)
plot(0:nIter,X)
Python:
import time
import numpy as np
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
    x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)

How to speed up your Python code
Your largest time sink is np.cos, which performs several checks on the format of its input.
These are relevant and usually negligible for large array inputs, but for your scalar input, they become the bottleneck.
The solution is to use math.cos, which only accepts single numbers as input and is thus faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by keeping one state variable that you update, writing to x only once per iteration.
With all of this, you can speed things up by a factor of roughly ten:
import numpy as np
from math import cos
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
    state += Step*state**2*cos(Step*i+state)
    x[i+1] = state
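To check the per-call cost of np.cos against math.cos on your own machine, a small timing sketch such as the following can be used. The input value and the number of calls are arbitrary choices, and the measured numbers will vary between systems.

```python
import math
import timeit
import numpy as np

value = 0.5          # any scalar input
n_calls = 100_000    # arbitrary repeat count

# Time the same scalar call through NumPy and through the math module
t_np = timeit.timeit(lambda: np.cos(value), number=n_calls)
t_math = timeit.timeit(lambda: math.cos(value), number=n_calls)

print(f"np.cos:   {t_np / n_calls * 1e9:.0f} ns per call")
print(f"math.cos: {t_math / n_calls * 1e9:.0f} ns per call")
```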
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using ufuncs (e.g., created with SymPy’s ufuncify) and NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos
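nIter = 1000000
Step = 0.001
# Binary ufunc: maps (previous state y, time t) to the next state
f = ufuncify([y,t],y+Step*y**2*cos(t+y))
times = np.arange(0,nIter*Step,Step)
# accumulate starts from the first element, so store the initial state there
times[0] = 1
x = f.accumulate(times)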
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and is a very simple dynamics.
Moreover, you are using the Euler method, which is either not very efficient or robust, depending on your step size.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need.
With a step size of 1, you can see what’s going on just fine.
If you want a robust integration in such cases, it’s almost always best to use a modern adaptive integrator that can adjust its step size itself. For example, here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
    lambda t,state: state**2*cos(t+state),
    t_span = (0,T),
    t_eval = np.arange(0,T,dt),
    y0 = [1],
    rtol = 1e-5
).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calculation (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example with a sampling step of 1.
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.

The difference is JIT compilation, which Matlab uses by default. Let's try your example with Numba (a Python JIT compiler).
Code
import numba as nb
import numpy as np
import time
nIter = 1000000
Step = .001
@nb.njit()
def integrate(nIter,Step):
    x = np.zeros(1+nIter)
    x[0] = 1
    for i in range(1,1+nIter):
        x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
    return x
#Avoid measuring the compilation time,
#this would be also recommendable for Matlab to have a fair comparison
res=integrate(nIter,Step)
start = time.time()
for i in range(100):
    res=integrate(nIter,Step)
end=time.time()
print((end - start)/100)
This results in 0.022s runtime per call.

Gram-Schmidt orthogonalization in pure Tensorflow: performance for iterative solution is much slower than numpy

I want to do Gram-Schmidt orthogonalization to fix big matrices which start to deviate slightly from orthogonality in pure Tensorflow (to do it on the graph within larger computation, without breaking it). The solutions I've seen like the one there are used "externally" (doing multiple sess.run inside).
So I wrote a simple and I think very inefficient implementation myself:
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # add batch dimension for matmul
        v = tf.expand_dims(v,0)
        w = v - tf.matmul(tf.matmul(v, tf.transpose(basis)), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, w/tf.norm(w)],axis=0)
    return basis
But when I compare it with the same iterative external code, it is 3 times slower (on GPU !!!) (though has a bit better precision):
how much source differs from orthogonal matrix:
44.7176
tensorflow version:
0.034667
Time elapsed: 23365.9820557ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
0.057589
Time elapsed: 8540.5600071ms
(UPD 4: I had a small mistake in my example, but it didn't change timings at all, as ort_discrepancy() is a lightweight function):
Minimal example:
import tensorflow as tf
import numpy as np
import time
# found this code somewhere on stackoverflow
def np_gram_schmidt(vectors):
    basis = []
    for v in vectors:
        w = v - np.sum( np.dot(v,b)*b for b in basis )
        if (w > 1e-10).any():
            basis.append(w/np.linalg.norm(w))
        else:
            basis.append(np.zeros(w.shape))
    return np.array(basis)
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # add batch dimension for matmul
        v = tf.expand_dims(v,0)
        w = v - tf.matmul(tf.matmul(v, tf.transpose(basis)), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, w/tf.norm(w)],axis=0)
    return basis
# how much matrix differs from orthogonal
# computes ||W*W^T - I||2
def ort_discrepancy(matrix):
    wwt = tf.matmul(matrix, matrix, transpose_a=True)
    rows = tf.shape(wwt)[0]
    cols = tf.shape(wwt)[1]
    return tf.norm((wwt - tf.eye(rows,cols)),ord='euclidean')
np.random.seed(0)
# white noise matrix
np_nearly_orthogonal = np.random.normal(size=(2000,2000))
# centered rows
np_nearly_orthogonal = np.array([row/np.linalg.norm(row) for row in np_nearly_orthogonal])
tf_nearly_orthogonal = tf.Variable(np_nearly_orthogonal,dtype=tf.float32)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print("how much source differs from orthogonal matrix:")
    print(ort_discrepancy(tf_nearly_orthogonal).eval())
    print("tensorflow version:")
    start = time.time()
    print(ort_discrepancy(tf_gram_schmidt(tf_nearly_orthogonal)).eval())
    end = time.time()
    print("Time elapsed: %sms"%(1000*(end-start)))
    print("numpy version with tensorflow and variable re-assign to the result of numpy code:")
    start = time.time()
    tf_nearly_orthogonal = tf.Variable(np_gram_schmidt(tf_nearly_orthogonal.eval()),dtype=tf.float32)
    sess.run(tf.variables_initializer([tf_nearly_orthogonal]))
    # check that variable was updated
    print(ort_discrepancy(tf_nearly_orthogonal).eval())
    end = time.time()
    print("Time elapsed: %sms"%(1000*(end-start)))
Is there a way to speed it up? I couldn't figure out how to do it for G-S which requires appending to the basis (so no tf.map_fn parallelization can help).
UPD: I have achieved difference in 2x by optimizing tf.matmul:
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # add batch dimension for matmul
        v = tf.expand_dims(v,0)
        w = v - tf.matmul(tf.matmul(v, basis, transpose_b=True), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, w/tf.norm(w)],axis=0)
    return basis
how much source differs from orthogonal matrix:
44.7176
tensorflow version:
0.0335421
Time elapsed: 17004.458189ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
0.057589
Time elapsed: 8082.20791817ms
EDIT2:
Just for fun, tried to fully mimic numpy solution, and got extremely long working code:
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # like in numpy example
        multiplied = tf.reduce_sum(tf.map_fn(lambda b: tf.scalar_mul(tf.tensordot(v,b,axes=[[0],[0]]),b), basis), axis=0)
        w = v - multiplied
        ## add batch dimension for matmul
        ##v = tf.expand_dims(v,0)
        ##w = v - tf.matmul(tf.matmul(v, basis, transpose_b=True), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, tf.expand_dims(w/tf.norm(w),0)],axis=0)
    return basis
(which seems to overfill GPU memory as well):
how much source differs from orthogonal matrix:
44.7176
tensorflow version:
2018-01-05 22:12:09.854505: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 14005 get requests, put_count=5105 evicted_count=1000 eviction_rate=0.195886 and unsatisfied allocation rate=0.714031
2018-01-05 22:12:09.854530: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
2018-01-05 22:12:13.090296: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 308520 get requests, put_count=314261 evicted_count=6000 eviction_rate=0.0190924 and unsatisfied allocation rate=0.00088487
2018-01-05 22:12:22.270822: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1485113 get requests, put_count=1500399 evicted_count=16000 eviction_rate=0.0106638 and unsatisfied allocation rate=0.000490198
2018-01-05 22:12:37.833056: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 3484575 get requests, put_count=3509407 evicted_count=26000 eviction_rate=0.00740866 and unsatisfied allocation rate=0.000339209
2018-01-05 22:12:59.995184: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 6315546 get requests, put_count=6349923 evicted_count=36000 eviction_rate=0.00566936 and unsatisfied allocation rate=0.000259202
0.0290728
Time elapsed: 136108.97398ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
0.057589
Time elapsed: 10618.8428402ms
UPD3: My GPU is GTX1050, it usually has speedup 5-7 times in comparison to my CPU. So the result is very strange for me.
UPD5: Ok, I found that GPU is almost not used for this code, while training neural network with manually written backpropagation which uses a lot of tf.matmul's and other matrix arithmetics fully exploits it. Why is it so?
UPD 6:
Following the given suggestion I have measured the time in a new way:
# Akshay's suggestion to measure performance correctly
orthogonalized = ort_discrepancy(tf_gram_schmidt(tf_nearly_orthogonal))
with tf.Session() as sess:
    sess.run(init)
    print("how much source differs from orthogonal matrix:")
    print(ort_discrepancy(tf_nearly_orthogonal).eval())
    print("tensorflow version:")
    start = time.time()
    tf_result = sess.run(orthogonalized)
    end = time.time()
    print(tf_result)
    print("Time elapsed: %sms"%(1000*(end-start)))
    print("numpy version with tensorflow and variable re-assign to the result of numpy code:")
    start = time.time()
    tf_nearly_orthogonal = tf.Variable(np_gram_schmidt(tf_nearly_orthogonal.eval()),dtype=tf.float32)
    sess.run(tf.variables_initializer([tf_nearly_orthogonal]))
    # check that variable was updated
    print(ort_discrepancy(tf_nearly_orthogonal).eval())
    end = time.time()
    print("Time elapsed: %sms"%(1000*(end-start)))
Now I can see 4x speedup:
how much source differs from orthogonal matrix:
44.7176
tensorflow version:
0.018951
Time elapsed: 2594.85888481ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
0.057589
Time elapsed: 8851.86600685ms

TensorFlow appears slow because your benchmark measures both the time it takes to construct the graph and the time it takes to execute it; a fairer comparison between TensorFlow and NumPy would exclude graph construction from the benchmark. In particular, your benchmark should probably look something like this:
print("tensorflow version:")
# This line constructs the graph but does not execute it.
orthogonalized = ort_discrepancy(tf_gram_schmidt(tf_nearly_orthogonal))
start = time.time()
tf_result = sess.run(orthogonalized)
end = time.time()

Pearson correlation on big numpy matrices

I have a 24000 * 316 numpy matrix, each row represents a time series with 316 time points, and I am computing pearson correlation between each pair of these time series. Meaning as a result I would have a 24000 * 24000 numpy matrix having pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slow). I am wondering if it is expected to be this slow (takes more than a day!!!). And what I might be able to do about it...
If it helps this is my code... nothing special or hard..
def SimMat(mat,name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs
    SimM = numpy.zeros((mrange,mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range (mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks

The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results since they work nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute dot products
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelized it, it should have linear speedup with the number of cores.
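As an alternative to calling pearsonr once per pair, the coefficients alone can be computed block by block with matrix products. This is a different technique from the loop above, sketched under some assumptions: the helper name standardize_rows, the block size and the small test matrix are illustrative, and no p-values are produced. After each row is centered and scaled to unit length, the Pearson coefficient of two rows is simply their dot product.

```python
import numpy as np

def standardize_rows(A):
    """Center each row and scale it to unit Euclidean norm."""
    Z = A - A.mean(axis=1, keepdims=True)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = rng.random((200, 316))       # small stand-in for the 24000 x 316 matrix
Z = standardize_rows(A)

block = 64                       # rows handled per step (arbitrary choice)
N = len(A)
out = np.empty((N, N))           # could be an HDF5 dataset instead
for start in range(0, N, block):
    # Dot products of standardized rows are Pearson coefficients
    out[start:start + block] = Z[start:start + block] @ Z.T

print(np.allclose(out, np.corrcoef(A)))
```

Only one block of rows has to be held in memory at a time, so writing each block to disk as it is computed follows the same storage idea as the answer.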
