Pearson correlation on big numpy matrices - python

I have a 24000 * 316 numpy matrix; each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would have a 24000 * 24000 numpy matrix of Pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slowly). I am wondering if it is expected to be this slow (it takes more than a day!), and what I might be able to do about it...
If it helps, this is my code... nothing special or hard:
import numpy
import scipy.stats

def SimMat(mat, name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs
    SimM = numpy.zeros((mrange, mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range(mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks

The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5 GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr

# Create the dataset
h5 = h5py.File("data.h5", 'w')
h5["test"] = np.random.random(size=(24000, 316))
h5.close()

# Compute the correlations row by row
h5 = h5py.File("data.h5", 'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N, N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i], A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelize it, you should get a roughly linear speedup with the number of cores.
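If the z-scored data fits in memory, a much faster route (my own sketch, not part of the answer above) is to note that the Pearson correlation between two rows is just the dot product of their z-scored versions divided by the number of time points, so the whole matrix reduces to a blocked matrix product that can be streamed into the same hdf5 dataset. Note that this only produces the r values, not the p-values used for filtering in the question; CHUNK is an illustrative name for the block size.
import numpy as np
import h5py

h5 = h5py.File("data.h5", 'r+')
A = h5["test"][:]
N, T = A.shape

# z-score every row (population std), so that r(i, j) = dot(Z[i], Z[j]) / T
Z = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)

out = h5.require_dataset("pearson", shape=(N, N), dtype=float)
CHUNK = 1000  # illustrative block size, tune to the available memory
for start in range(0, N, CHUNK):
    stop = min(start + CHUNK, N)
    out[start:stop] = np.dot(Z[start:stop], Z.T) / T
h5.close()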

Parallelize three nested loops

Context:
I have 3 3D arrays ("precursor arrays") that I am upsampling with an Inverse Distance Weighting method. To do that, I calculate a 3D weights array that I use in a for loop on each point of my precursor arrays.
Each 2D slice of my weights array is used to calculate a partial array. Once I generate all 28 of them, they are summed to give one final host array.
I would like to parallelize this for loop in order to reduce my computing time. I tried doing it but I cannot manage to update my host arrays correctly.
Question:
How could I parallelize my main function (last section of my code)?
EDIT: Or is there a way I could "slice" my i for loop (for example one core running over i = 0 to 5, and another core over i = 6 to 9)?
Summary:
3 precursor arrays (temperatures, precipitations, snow): 10x4x7 (10 is a time dimension)
1 weight array (w): 28x1101x2101
28x3 partial arrays: 1101x2101
3 host arrays (temp, prec, Eprec): 1101x2101
Here is my code (runnable as it is, aside from the MAIN ALGORITHM PARALLEL section; please see the MAIN ALGORITHM NOT PARALLEL section at the end for the non-parallelized version of my code):
import numpy as np
import multiprocessing as mp
import time

#%% ------ Create data ------ ###

temperatures = np.random.rand(10,4,7)*100
precipitation = np.random.rand(10,4,7)
snow = np.random.rand(10,4,7)

# Array of altitudes to "adjust" the temperatures
alt = np.random.rand(4,7)*1000

#%% ------ Functions to run in parallel ------ ###

# This function upsamples the precursor arrays and creates the partial arrays
def interpolator(i, k, mx, my):
    T = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000)) * w[k,:,:]
    P = (precipitation[i,mx,my])*w[k,:,:]
    S = (snow[i,mx,my])*w[k,:,:]
    return(T, P, S)

# We add each partial array to each other to create the host array
def get_results(results):
    global temp, prec, Eprec
    temp += results[0]
    prec += results[1]
    Eprec += results[2]

#%% ------ IDW Interpolation ------ ###

# We create a weight matrix that we use to upsample our temperatures, precipitations and snow matrices
# This part is not that important, it works well as it is

MX,MY = np.shape(temperatures[0])
N = 300
T = np.zeros([N*MX+1, N*MY+1])

# create NxM inverse distance weight matrices based on Gaussian interpolation
x = np.arange(0,N*MX+1)
y = np.arange(0,N*MY+1)
X,Y = np.meshgrid(x,y)

k = 0
w = np.zeros([MX*MY,N*MX+1,N*MY+1])
for mx in range(MX):
    for my in range(MY):
        # Gaussian
        add_point = np.exp(-((mx*N-X.T)**2+(my*N-Y.T)**2)/N**2)
        w[k,:,:] += add_point
        k += 1

sum_weights = np.sum(w, axis=0)
for k in range(MX*MY):
    w[k,:,:] /= sum_weights

#%% ------ MAIN ALGORITHM PARALLEL ------ ###

if __name__ == '__main__':

    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))

    # Start a timer
    ts = time.time()

    # Iterate over the time dimension
    for i in range(temperatures.shape[0]):

        # Initialize the host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()

        # Create the pool based on my amount of cores
        pool = mp.Pool(mp.cpu_count())

        # Loop through every weight slice, for every cell of the temperatures, precipitations and snow arrays
        for k in range(0,w.shape[0]):
            for mx in range(MX):
                for my in range(MY):
                    # Upsample the temperatures, precipitations and snow arrays by adding the contribution of each weight slice
                    pool.apply_async(interpolator, args=(i, k, mx, my), callback=get_results)

        pool.close()
        pool.join()

    # Print the time spent on the loop
    print("Time spent: ", time.time()-ts)

#%% ------ MAIN ALGORITHM NOT PARALLEL ------ ###

if __name__ == '__main__':

    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))

    ts = time.time()
    for i in range(temperatures.shape[0]):

        # Create empty host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()

        k = 0
        for mx in range(MX):
            for my in range(MY):
                get_results(interpolator(i, k, mx, my))
                k += 1

    print("Time spent:", time.time()-ts)
The problem with multiprocessing is that it creates many new processes that execute the code before the main (i.e. before if __name__ == '__main__'). This causes a very slow initialization (since every process does it) and a huge amount of RAM being used for nothing. You should certainly move everything into the main, or if possible into functions (which generally results in faster execution and is good software engineering practice anyway, especially for parallel code). Even with this, there is another huge problem with multiprocessing: inter-process communication is slow. One solution is a multi-threaded approach made possible by Numba or Cython (you can disable the GIL with them, as opposed to basic CPython threads). In fact, they are often simpler to use than multiprocessing. However, you should be more careful, since parallel accesses are unprotected and data races can appear in buggy parallel code.
In your case, the computation is mostly memory-bound. This means multiprocessing is pretty useless. In fact, parallelism is barely useful here unless you are running this code on a computing server with high memory throughput. Indeed, memory is a shared resource, and using more computing cores does not help much since one core can almost saturate the memory bandwidth on a regular PC (while only a few cores are needed on computing servers).
The key to speeding up memory-bound code is to avoid creating temporary arrays and to use cache-friendly algorithms. In your case, T, P and S are filled just to be read later in order to update the temp, prec and Eprec arrays. This temporary step is pretty expensive and unnecessary here (especially filling the arrays). Removing it increases the arithmetic intensity, resulting in code that will certainly be faster sequentially and that can scale better on multiple cores. This is the case on my machine.
Here is an example using Numba to parallelize the code:
import numba as nb

# get_results + interpolator
@nb.njit('void(float64[:,::1], float64[:,::1], float64[:,::1], float64[:,:,::1], int_, int_, int_, int_)', parallel=True)
def interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my):
    factor1 = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000))
    factor2 = precipitation[i,mx,my]
    factor3 = snow[i,mx,my]
    for i in nb.prange(w.shape[1]):
        for j in range(w.shape[2]):
            val = w[k, i, j]
            temp[i, j] += factor1 * val
            prec[i, j] += factor2 * val
            Eprec[i, j] += factor3 * val

# Example of usage:
interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
Note that the string in nb.njit is called a signature; it specifies the types for the JIT so it can compile the function eagerly.
This code is 4.6 times faster on my 6-core machine (while it was barely faster without merging get_results and interpolator). In fact, it is 3.8 times faster sequentially, so threads do not help much since the computation is still memory-bound. Indeed, the cost of the multiply-add is negligible compared to the memory reads/writes.
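As a rough illustration (my own sketch, assuming the arrays, MX, MY and the time module from the question's script are already defined), the fused kernel would simply replace the get_results(interpolator(...)) call in the sequential main loop:
ts = time.time()
for i in range(temperatures.shape[0]):
    # Host arrays for this time step
    temp = np.zeros((w.shape[1], w.shape[2]))
    prec = np.zeros((w.shape[1], w.shape[2]))
    Eprec = np.zeros((w.shape[1], w.shape[2]))
    k = 0
    for mx in range(MX):
        for my in range(MY):
            # Fused update: no temporary T/P/S arrays are created
            interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
            k += 1
print("Time spent:", time.time() - ts)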

Scipy convolve2d with subsampling like Theano's conv2d?

I wish to perform 2D convolution on images of size 600 x 400 using a 10 x 10 filter. The filter is not separable. scipy.signal.convolve2d works well for me currently, but I am expecting much bigger images soon.
To counter that, I have two ideas:
resizing images
subsampling (or striding)?
Focusing on the subsampling part, theano has a function which does convolution the same way as scipy's convolve2d, see theano conv2d.
It also has a subsampling option. But installing theano on Windows has been painful for me. How do I get subsampling to work with scipy.signal.convolve2d? Any other alternatives (which don't require me to install some heavyweight library)?
You could implement the subsampling by hand; I'll only sketch the 1d case for simplicity. Say you want to sample s = d * f on a regular subgrid with spacing k. Then your n-th sample is s[nk] = sum_{i=0}^{10} f[i]*d[nk-i]. The thing to observe here is that the indices of f and d always sum to a multiple of k. This suggests splitting it up into sub-sums: s[nk] = sum_{j=0}^{k-1} sum_{i=0}^{10/k} f[j+ik]*d[(n-i)k-j]. So what you need to do is: subsample d and f on grids with spacing k at all offsets 0, ..., k-1, convolve all pairs of subsampled d and f whose offsets sum to 0 or k, and add the results.
Here's some code for 1d. It roughly implements the above, only the grids are placed slightly differently to make index management easier. The second function does it the brute-force way, i.e. computes the full convolution and then decimates; it is there to test the first function against.
import numpy as np
from scipy import signal

def ss_conv(d1, d2, decimate):
    n = (len(d1) + len(d2) - 1) // decimate
    out = np.zeros((n,))
    for i in range(decimate):
        d1d = d1[i::decimate]
        d2d = d2[decimate-i-1::decimate]
        cv = signal.convolve(d1d, d2d, 'full')
        out[:len(cv)] += cv
    return out

def conv_ss(d1, d2, decimate):
    return signal.convolve(d1, d2, 'full')[decimate-1::decimate]
Edit: 2d version:
import numpy as np
from scipy import signal

def ss_conv_2d(d1, d2, decy, decx):
    ny = (d1.shape[0] + d2.shape[0] - 1) // decy
    nx = (d1.shape[1] + d2.shape[1] - 1) // decx
    out = np.zeros((ny, nx))
    for i in range(decy):
        for j in range(decx):
            d1d = d1[i::decy, j::decx]
            d2d = d2[decy-i-1::decy, decx-j-1::decx]
            cv = signal.convolve2d(d1d, d2d, 'full')
            out[:cv.shape[0], :cv.shape[1]] += cv
    return out

def conv_ss_2d(d1, d2, decy, decx):
    return signal.convolve2d(d1, d2, 'full')[decy-1::decy, decx-1::decx]

SVM with python and CPLEX, load the quadratic part of the objective function

''In general, you would get better performance creating batches of linear constraints rather than creating them one at a time. I am just wondering whether that still holds with a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset, and I want to run an SVM on it. I need to produce the Gram matrix of this dataset, which is fine, but passing the coefficients to CPLEX is a mess and takes hours. Here is my code:
import numpy as np

nn = 35000
XXt = np.random.rand(nn, nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)       # the label vector of the dataset

temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg, xg])

quadratic_part = []
for ii in xrange(nn):
    for indd in indici[ii][ii:]:
        quadratic_part.append([indd[0], indd[1], temp[indd[0], indd[1]]])
The 'quadratic_part' is a list of the form [i, j, c_ij] where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
Is there a wiser way to do that?
P.S. I may also have a memory problem, so instead of storing the whole 'quadratic_part' list, it would be better to call 'objective.set_quadratic_coefficients()' several times... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library, whereas objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex

nn = 5  # a small example size here
XXt = np.random.rand(nn, nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)       # the label vector of the dataset
temp = ((yy*XXt).T)*yy

# create a symmetric matrix
tempu = np.triu(temp)          # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1]      # copy upper into lower

ind = np.array([[x for x in range(nn)] for x in range(nn)])

qmat = []
for i in range(nn):
    qmat.append([np.arange(nn), tempu[i]])

c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense so depending on the amount of memory you have, this technique may not scale. When it's possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.

Sum of difference of squares between each combination of rows of 17,000 by 300 matrix

Ok, so I have a matrix with 17000 rows (examples) and 300 columns (features). I want to compute basically the Euclidean distance between each possible combination of rows, i.e. the sum of the squared differences for each possible pair of rows.
Obviously it's a lot, and iPython, while not completely crashing my laptop, says "(busy)" for a while; then I can't run anything anymore and it certainly seems to have given up, even though I can move my mouse and everything.
Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the differences in a difference matrix for each possible combination. I'm aware that the lower triangular part of the matrix equals the upper triangular part, but that would only save half the computation time (better than nothing, but not a game changer, I think).
EDIT: I just tried using scipy.spatial.distance.pdist but it's been running for a good minute now with no end in sight. Is there a better way? I should also mention that I have NaN values in there... but that's apparently not a problem for numpy.
import numpy as np

features = np.array(dataframe)
distances = np.zeros((17000, 17000))

def sum_diff():
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v 0.18, see the doc here)?
You would use it as:
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
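Applied to the data from the question (my sketch, with a random stand-in array), and noting that pairwise_distances returns Euclidean distances while the question asks for sums of squared differences, you can square the result element-wise:
from sklearn.metrics import pairwise
import numpy as np

features = np.random.rand(17000, 300)   # stand-in for the real 17000 x 300 data
# pairwise_distances gives Euclidean distances; squaring gives the
# sum of squared differences for every pair of rows
sq_dists = pairwise.pairwise_distances(features) ** 2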
The big thing with numpy is to avoid using loops and to let it do its magic with the vectorised operations, so there are a few basic improvements that will save you some computation time:
import numpy as np
import timeit

# I reduced the problem size to 1000*300 to keep the timing in a reasonable range
n = 1000
features = np.random.rand(n, 300)
distances = np.zeros((n, n))

def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Collapsing of the statements -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorized operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features), axis=1)

# Computing only half the distances -> 1/2 computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]), axis=1)
    distances[:] = distances + distances.T

print("original :", timeit.timeit(sum_diff, number=10))
print("v0 :", timeit.timeit(sum_diff_v0, number=10))
print("v1 :", timeit.timeit(sum_diff_v1, number=10))
print("v2 :", timeit.timeit(sum_diff_v2, number=10))
print("v3 :", timeit.timeit(sum_diff_v3, number=10))
Edit: For completeness I also timed Camilleri's solution, which is much faster:
from sklearn.metrics import pairwise

def Camilleri_solution():
    distances = pairwise.pairwise_distances(features)
Timing results (in seconds, function run 10 times with 1000*300 input):
original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531
So as you can see, we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only 1/20th of the data the function runs in about one second, so I would expect the whole thing to run in the tens of minutes, as the script scales as N^2.
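For reference, a fully vectorized variant (my own sketch, not among the timed versions above) removes the Python loop entirely by using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b:
import numpy as np

def sum_diff_v4(features):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b for every pair of rows
    sq_norms = np.sum(features**2, axis=1)
    d = sq_norms[:, None] + sq_norms[None, :] - 2 * features.dot(features.T)
    return np.maximum(d, 0)  # clip tiny negative values caused by rounding

features = np.random.rand(1000, 300)
distances = sum_diff_v4(features)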

how to code vector to matrix hamming distance in Python?

I would like to develop a query system that finds the items most similar to a given one, based on a binary signature extracted from the data. I am looking for the most efficient way to do this since I have runtime constraints. I tried to use scipy's distance functions but they were too slow. Do you know any other useful library or trick to make this faster?
As an example scenario:
I have a query vector with binary values of length 68, and I have a dataset matrix of size 3000Kx68. I would like to find the item in this matrix most similar to the given query using the Hamming distance.
Thanks for any comment.
Nice problem; I liked the answers of Alex and Piotr. My first naive attempt also resulted in a solution time of around 800 ms (on my system). My second attempt, using numpy's (un)packbits, resulted in a 4x speed increase.
import numpy as np

LENGTH = 68
K = 1024
DATASIZE = 3000 * K
DATA = np.random.randint(0, 2, (DATASIZE, LENGTH)).astype(np.bool)

def RandomVect():
    return np.random.randint(0, 2, (LENGTH)).astype(np.bool)

def HammingDist(vec1, vec2):
    return np.sum(np.logical_xor(vec1, vec2))

def SmallestHamming(vec):
    XorData = np.logical_xor(DATA, vec[np.newaxis, :])
    Lengths = np.sum(XorData, axis=1)
    return DATA[np.argmin(Lengths)]  # returns first smallest

def main():
    v1 = RandomVect()
    v2 = SmallestHamming(v1)
    print(HammingDist(v1, v2))

# oke, lets try make it faster... (using numpy.(un)packbits)
DATA2 = np.packbits(DATA, axis=1)
NBYTES = DATA2.shape[-1]
BYTE2ONES = np.zeros((256), dtype=np.uint8)
for i in range(0, 256):
    BYTE2ONES[i] = np.sum(np.unpackbits(np.uint8(i)))

def RandomVect2():
    return np.packbits(RandomVect())

def HammingDist2(vec1, vec2):
    v1 = np.unpackbits(vec1)
    v2 = np.unpackbits(vec2)
    return np.sum(np.logical_xor(v1, v2))

def SmallestHamming2(vec):
    XorData = DATA2 ^ vec[np.newaxis, :]
    Lengths = np.sum(BYTE2ONES[XorData], axis=1)
    return DATA2[np.argmin(Lengths)]  # returns first smallest

def main2():
    v1 = RandomVect2()
    v2 = SmallestHamming2(v1)
    print(HammingDist2(v1, v2))
Use cdist from SciPy:
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'hamming')
Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean
Reference: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
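Applied to the setup in the question, it could look like this (a sketch with illustrative names and a smaller stand-in matrix):
from scipy.spatial.distance import cdist
import numpy as np

DATA = np.random.randint(0, 2, (10000, 68)).astype(bool)  # stand-in for the 3000K x 68 matrix
query = np.random.randint(0, 2, 68).astype(bool)          # the query signature

dists = cdist(query[np.newaxis, :], DATA, 'hamming')[0]   # normalized Hamming distance to every row
best = np.argmin(dists)                                   # index of the most similar item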
I would be surprised if there was a significantly faster way than this: put your data into a pandas DataFrame (M), one vector per column, and your target vector into a pandas Series (x),
import numpy as np
import pandas as pd
rows = 68
columns=3000
M = pd.DataFrame(np.random.rand(rows,columns)>0.5)
x = pd.Series(np.random.rand(rows)>0.5)
then do the following
%timeit M.apply(lambda y: x==y).astype(int).sum().idxmax()
1 loop, best of 3: 746 ms per loop
Edit: Actually, I am surprised this is a much faster way
%timeit M.eq(x, axis=0).astype(int).sum().idxmax()
100 loops, best of 3: 2.68 ms per loop
