Numpy matrix row comparison - python

This question is mainly about computational performance.
I have two matrices with the same number of columns but different numbers of rows. One matrix is the 'pattern': each of its rows has to be compared separately with every row of the other matrix, so that I can extract statistics such as the mean number of elements equal to the pattern, the standard deviation, and so on.
So, I have the following matrices, and the computation is as follows:
import numpy as np

numCols = 10
pattern = np.random.randint(0,2,size=(7,numCols))
matrix = np.random.randint(0,2,size=(5,numCols))
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = np.mean(np.sum(pattern[i,:] == matrix, axis=1))
print(comp_mean) # Output example: [ 1.6 1. 1.6 2.2 2. 2. 1.6]
This is clear. The problem is that the number of rows in both matrices is much bigger (~1,000,000), so this code runs very slowly. I tried to express it in pure NumPy syntax, since that sometimes surprises me by improving the calculation time. So I wrote the following code (it may look strange, but it works!):
comp_mean = np.mean( np.sum( (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel() == np.tile(matrix.ravel(),pattern.shape[0])).reshape(pattern.shape[0],matrix.shape[0],matrix.shape[1]), axis=2 ),axis=1)
print(comp_mean)
However, this code is slower than the previous one that uses the 'for' loop. So I would like to know whether there is any way to speed up the calculation.
EDIT
I have checked the runtime of the different approaches for the real matrix and the result is the following:
Me - Approach 1: 18.04 seconds
Me - Approach 2: 303.10 seconds
Divakar - Approach 1: 18.79 seconds
Divakar - Approach 2: 65.11 seconds
Divakar - Approach 3.1: 137.78 seconds
Divakar - Approach 3.2: 59.59 seconds
Divakar - Approach 4: 6.06 seconds
EDIT(2)
The previous runs were performed on a laptop. I have now run the code on a desktop, leaving out the worst performers, and the new runtimes are different:
Me - Approach 1: 6.25 seconds
Divakar - Approach 1: 4.01 seconds
Divakar - Approach 2: 3.66 seconds
Divakar - Approach 4: 3.12 seconds

A few approaches using broadcasting could be suggested here.
Approach #1
out = np.mean(np.sum(pattern[:,None,:] == matrix[None,:,:],2),1)
Approach #2
mrows = matrix.shape[0]
prows = pattern.shape[0]
out = (pattern[:,None,:] == matrix[None,:,:]).reshape(prows,-1).sum(1)/mrows
Approach #3
mrows = matrix.shape[0]
prows = pattern.shape[0]
out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:]).astype(int))/mrows
# OR out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:])+0)/mrows
Approach #4
If matrix has a huge number of rows, it could be better to stick to a for-loop to avoid the large memory requirements of the broadcasting approaches, which might also lead to slow runtimes. Instead, we could optimize the work done within each loop iteration. Here's one such optimization -
mrows = matrix.shape[0]
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i,:] == matrix).sum()
comp_mean = comp_mean/mrows
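If the entries really are only 0 and 1, as in the randint(0, 2) example from the question, the pairwise comparison can be avoided altogether. The following is a minimal sketch of that idea and is not one of the approaches timed above: the mean number of matches for a pattern row equals the sum, over columns, of the fraction of matrix rows that agree with it in that column. The function name comp_mean_binary is made up for illustration.

```python
import numpy as np

def comp_mean_binary(pattern, matrix):
    """Mean number of per-row matches, assuming all entries are 0 or 1."""
    frac_ones = matrix.mean(axis=0)  # fraction of 1s in each column of matrix
    # A 1 in the pattern matches frac_ones of the rows, a 0 matches the rest.
    return pattern @ frac_ones + (1 - pattern) @ (1 - frac_ones)

rng = np.random.default_rng(0)
pattern = rng.integers(0, 2, size=(7, 10))
matrix = rng.integers(0, 2, size=(5, 10))

# Reference: the explicit loop from the question.
loop_result = np.array([np.mean(np.sum(row == matrix, axis=1)) for row in pattern])
fast_result = comp_mean_binary(pattern, matrix)
print(np.allclose(loop_result, fast_result))
```

This needs no large temporary array, but it yields only the mean; other statistics over the per-row match counts would still need a different computation.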

Could you give this a try:
import scipy.ndimage.measurements
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    m = scipy.ndimage.measurements.histogram(matrix,0,1,2,pattern[i],[0,1])
    comp_mean[i] = m[0][0]+m[1][1]
comp_mean /= matrix.shape[0]
Regards.

Related

Python: how to speed up this function and make it more scalable?

I have the following function which accepts an indicator matrix of shape (20,000 x 20,000). And I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that the indicator_Matrix has to be in the form of a pandas dataframe when passed as parameter into the function, as my actual problem's dataframe has timeIndex and integer columns but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s,axis=0)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it by using numpy but it still takes ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time to run and gives little improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index = indicator_Matrix.index, columns = indicator_Matrix.columns)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
output = [operations(indicator_Matrix) for i in range(0,20000**2)]
Note that the reason I convert d to a dataframe again is that I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns column means, i.e.
2478 1.0
0 1.0
Update: I am still stuck in this problem. I wonder if using gpu packages like cudf and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba implementation. The idea is to use plain loops and take care to read data contiguously. Reading data contiguously is important to make the computation cache-friendly and memory-efficient. Here is an implementation:
import numba as nb
import numpy as np

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        # Row-major: sum each row in one contiguous pass
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        # Column-major: walk down columns in chunks of 64 rows
        for chunk_id in nb.prange(0, (n+63)//64):
            start = chunk_id * 64
            end = min(start+64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    # Mean of the positive values of last_column / row_sum
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count
# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0,20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate over the rows or over the columns. This is why there are two implementations of the sum, selected by is_row_major. There are also 3 Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays and one for non-contiguous arrays. Numba will compile the 3 function variants and automatically pick the best one at runtime. Numba's JIT compiler can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time but the computation is memory-bound and nearly optimal on my machine: it is bounded by the main-memory which has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to hold the values 0 and 1, and it is 4 times smaller in memory on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tweaks, though it is generally significantly slower using Numba unless you have a lot of available cores.
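To make the size argument concrete, here is a small NumPy-only sketch comparing the memory footprint of the same 0/1 data stored as 32-bit integers, as uint8, and bit-packed with np.packbits. The array shape is arbitrary and chosen only for illustration; it says nothing about the actual timings.

```python
import numpy as np

rng = np.random.default_rng(0)
m32 = rng.integers(0, 2, size=(200, 80), dtype=np.int32)  # 4 bytes per entry
m8 = m32.astype(np.uint8)                                  # 1 byte per entry
packed = np.packbits(m8, axis=1)                           # 8 columns per byte

print(m32.nbytes, m8.nbytes, packed.nbytes)

# Row sums are unchanged: NumPy accumulates uint8 sums in a wider integer type.
same_sums = np.array_equal(m32.sum(axis=1), m8.sum(axis=1))
```

Since the computation is memory-bound, the bytes read per call are what matters, which is why the narrower types can help.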
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple groups and convert each column group to a NumPy array, then call the Numba function (modified to support groups). Note that @CrazyChucky's code has a similar issue: a dataframe column with mixed datatypes converted to a NumPy array results in an object-based NumPy array, which is very inefficient (especially a row-major NumPy array).
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the overhead of the interconnect, which is generally a fairly slow PCI one). Note that GPU memory is quite limited compared to the CPU's. If the target dataframe(s) do not need to be transferred, then using cudf is relatively simple and should give a small speed-up. For faster code, one needs to write fast CUDA code, but this is clearly far from easy for dataframes with mixed datatypes. In the end, the resulting speed-up should be about gpu_ram_throughput / main_ram_throughput, assuming there is no data transfer. Note that this factor is generally 5-12. Note also that CUDA and cudf require an Nvidia GPU.
Finally, reducing the input data size or just the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the task is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
1. Summing each row
2. Turning the list of sums "sideways" and dividing each column by it
3. Taking the mean of each column, ignoring values ≤ 0
4. Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
Note: You're unlikely to see any improvements with compute-heavy work like this from threading; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful to avoid waiting around for I/O.
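As a quick sanity check of the simplification, the sketch below runs the original operations and operations_simpler on a small random 0/1 dataframe and compares them with a NumPy-only variant. The helper last_column_mean is hypothetical (it is not from any of the answers) and assumes non-negative entries, so that a quotient is positive exactly when the last-column entry is positive.

```python
import numpy as np
import pandas as pd

def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()

def last_column_mean(arr):
    """NumPy-only variant (hypothetical helper); assumes non-negative entries."""
    last = arr[:, -1]
    keep = last > 0  # only rows whose last entry is positive contribute
    return (last[keep] / arr[keep].sum(axis=1)).mean()

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(50, 40)))
results = (operations(df), operations_simpler(df), last_column_mean(df.to_numpy()))
print(results)
```

The NumPy variant only sums the rows it actually needs, which avoids dividing by zero for all-zero rows.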
As per the previous answer you can use Numba, or you can use two other alternatives. One is Dask, a distributed computing package that can parallelize your function's execution: it can divide your data into smaller chunks and distribute the computation across many CPU cores or even multiple machines.
import dask.array as da
def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
indicator_matrix_dask = da.from_array(indicator_matrix, chunks=(1000, 1000))
output_dask = indicator_matrix_dask.map_blocks(operations, dtype=float)
output = output_dask.compute()
Or you can use CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp
def operations(indicator_matrix):
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    d = pd.DataFrame(d, index = indicator_matrix.index, columns = indicator_matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
indicator_matrix_cupy = cp.asarray(indicator_matrix)
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)

Numpy: get array where index greater than value and condition is true

I have the following array:
a = np.array([6,5,4,3,4,5,6])
Now I want to get all elements which are greater than 4 and also have an index greater than 2.
The way that I have found to do that was the following:
a[2:][a[2:]>4]
Is there a better or more readable way to accomplish this?
UPDATE:
This is a simplified version. In reality the indexing is done with arithmetic operation over several variables like this:
a[len(trainPredict)+(look_back*2)+1:][a[len(trainPredict)+(look_back*2)+1:]>4]
trainPredict is a numpy array, look_back an integer.
I wanted to see if there is an established way or how others do that.
If you're worried about the complexity of the slice and/or the number of conditions, you can always separate them:
a = np.array([6,5,4,3,4,5,6])
a_slice = a[2:]
cond_1 = a_slice > 4
res = a_slice[cond_1]
Is your example very simplified? There might be better solutions for more complex manipulations.
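For the longer offset expression from the update, the same separation can be sketched as follows: compute the start index once, slice once, and mask the slice. The helper name values_above and the placeholder values for trainPredict and look_back are assumptions made only so that the example runs; they are chosen so that the start index equals 2, as in the simplified question.

```python
import numpy as np

def values_above(a, start, threshold):
    """Return (indices, values) of the elements of a[start:] above threshold."""
    tail = a[start:]  # basic slicing returns a view, not a copy
    mask = tail > threshold
    return np.flatnonzero(mask) + start, tail[mask]

a = np.array([6, 5, 4, 3, 4, 5, 6])
trainPredict = np.zeros(1)  # placeholder data, chosen so that start == 2
look_back = 0               # placeholder value

start = len(trainPredict) + (look_back * 2) + 1
idx, vals = values_above(a, start, 4)
print(idx, vals)
```

Returning the indices relative to the full array is optional; it is included because the offset is otherwise lost after slicing.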
@AlexanderCécile's answer is not only more legible than the one-liner you posted, but it also removes the redundant computation of a temp array. Despite that, it does not appear to be any faster than your original approach.
The timings below are all run with a preliminary setup of
import numpy as np
np.random.seed(0xDEADBEEF)
a = np.random.randint(8, size=N)
N varies from 1e3 to 1e8 in factors of 10. I tried four variants of the code:
CodePope: result = a[2:][a[2:] > 4]
AlexanderCécile: s = a[2:]; result = s[s > 4]
MadPhysicist1: result = a[np.flatnonzero(a[2:] > 4) + 2]
MadPhysicist2: result = a[(a > 4) & (np.arange(a.size) >= 2)]
In all cases, the timing was obtained on the command line by running
python -m timeit -s 'import numpy as np; np.random.seed(0xDEADBEEF); a = np.random.randint(8, size=N)' '<X>'
Here, N was a power of 10 between 3 and 8, and <X> one of the expressions above. Timings are as follows:
Methods #1 and #2 are virtually indistinguishable. What is surprising is that in the range between ~5e3 and ~1e6 elements, method #3 seems to be slightly, but noticeably faster. I would not normally expect that from fancy indexing. Method #4 is of course going to be the slowest.
Here is the data, for completeness:
N          CodePope  AlexanderCécile  MadPhysicist1  MadPhysicist2
1000       3.77e-06  3.69e-06         5.48e-06       6.52e-06
10000      4.6e-05   4.59e-05         3.97e-05       5.93e-05
100000     0.000484  0.000483         0.0004         0.000592
1000000    0.00513   0.00515          0.00503        0.00675
10000000   0.0529    0.0525           0.0617         0.102
100000000  0.657     0.658            0.782          1.09

Fast calculation of sum for function defined over range of integers - (0,2^52)

I was looking at the code for a particular cryptocurrency casino game (EthCrash - if you're interested). The game generates crash points using a function (I call this crash(x)) where x is an integer that is randomly drawn from the space of integers (0,2^52).
I'd like to calculate the expected value of the crash points. The code below should explain everything, but a clean picture of the function is here: https://i.imgur.com/8dPBALa.png, and what I'm trying to calculate is here: https://i.imgur.com/nllykDQ.png (apologies - can't paste pictures yet).
I wrote the following code:
import math
two52 = 2**52
def crash(x):
    crash_point = math.floor((100*two52-x)/(two52-x))
    return(crash_point/100)
crashes_sum = 0
for i in range(two52+1):
    crashes_sum += crash(i)
expected_crash = crashes_sum/two52
Unfortunately, the loop is taking too long to run - any ideas for how I can do this faster?
OK, if you cannot do it the straightforward way, it's time to get smart, right?
So the idea is to find ranges over which the whole sum can be computed quickly. I will post some pseudocode that does not even compile and could have bugs; use it as an illustration.
First, let's rewrite the term in the sum as
floor( 100 + 99*x/(2^52 - x) )
First idea - find ranges where the floor does not change, i.e. where
n <= 99*x/(2^52 - x) < n+1. Obviously, for this whole range we could add range_length*(100 + n) to the sum; no need to do it term by term
sum = 0
r_lo = 0
for n in range(1, 2**52): # LOOP OVER RANGES
    r_hi = floor(2**52/(1 + 99/n))
    sum += (100 + n -1)*(r_hi - r_lo)
    if r_hi-r_lo == 1:
        break
    r_lo = r_hi + 1
Obviously, range size will shrink till it is equal to 1, and then this method will be useless, we break out. Obviously, by that time each term would be different from previous one by 1 or more.
Ok, second idea - again ranges, where sum is arithmetic series. First we have to find range where increment is equal to 1. Then range where increment is equal to 2, etc. Looks like you have to find roots of quadratic equation for this, but code would be about the same
r_lo = pos_for_increment(1)
t_lo = ... # term at r_lo
for n in range(2, 2**52): # LOOP OVER RANGES
    r_hi = pos_for_increment(n) - 1
    t_hi = ... # term at r_hi
    sum += (t_lo + t_hi)*(r_hi - r_lo) / 2 # arith.series sum
    if r_hi > 2**52:
        break
    r_lo = r_hi + 1
    t_lo = t_hi + n
might think about something else, but those tricks are worth trying
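A closely related regrouping can be made exact with integer arithmetic. Substituting y = 2^52 - x turns each term floor((100*2^52 - x)/(2^52 - x)) into 1 + floor(99*2^52/y), and runs of consecutive y that share the same quotient can then be added in one step. This is a sketch of that substitution rather than an implementation of the pseudocode above; the function names are made up for illustration, and the sum runs over x = 0 .. T-1 only, because x = T would divide by zero.

```python
def floor_sum_direct(T):
    """Term-by-term reference: sum of floor((100*T - x)/(T - x)) for x in 0..T-1."""
    return sum((100 * T - x) // (T - x) for x in range(T))

def floor_sum_blocks(T):
    """Same sum, grouping y = T - x into runs that share the quotient N // y."""
    N = 99 * T
    total = T  # the "+1" contributed by each of the T terms
    y = 1
    while y <= T:
        q = N // y
        y_hi = min(N // q, T)  # last y (capped at T) with the same quotient
        total += q * (y_hi - y + 1)
        y = y_hi + 1
    return total

T_small = 2**12
print(floor_sum_blocks(T_small) == floor_sum_direct(T_small))

# Expected crash point for this small T: the floors are divided by 100
# and averaged over the T terms.
expected_small = floor_sum_blocks(T_small) / (100 * T_small)
```

The number of runs grows roughly like the square root of 99*T, so for T = 2**52 this is still on the order of 10^9 loop iterations: far fewer than 2^52 terms, but probably something to run in compiled code rather than pure Python.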
Using the built-in map function might help a little, since it avoids some of the interpreter overhead of an explicit loop (note that map itself does not run the computation in parallel)
import math
two52 = 2**52
def crash(x):
    crash_point = math.floor((100*two52-x)/(two52-x))
    return(crash_point/100)
crashes_sum = sum(map(crash,range(two52)))
expected_crash = crashes_sum/two52
I have been able to speed up your code by taking advantage of numpy vectorization:
import numpy as np
import time
two52 = 2**52
crash = lambda x: np.floor( ( 100 * two52 - x ) / ( two52 - x ) ) / 100
starttime = time.time()
icur = 0
ispan = 100000
crashes_sum = 0
while icur < two52-ispan:
    i = np.arange(icur, icur+ispan, 1)
    crashes_sum += np.sum(crash(i))
    icur += ispan
crashes_sum += np.sum(crash(np.arange(icur, two52, 1)))
expected_crash = crashes_sum / two52
print(time.time() - starttime)
The trick is to compute the sum over a moving window to take advantage of numpy's vectorization (written in C). I tried up to 2**30 and it takes 9 seconds on my laptop (your code takes too long for me to be able to benchmark it).
Python is probably not the most suitable language for what you want to do, you may want to try C or Fortran for that (and take advantage of threading).
You will have to use a powerful GPU if you want the result within some hours.
A possible CPU implementation
import numpy as np
import numba as nb
import time
two52 = 2**52
loop_to=2**30
@nb.njit(fastmath=True,parallel=True)
def sum_over_crash(two52,loop_to): #loop_to is only for testing performance
    crashes_sum = nb.float64(0)
    for i in nb.prange(loop_to):#nb.prange(two52+1):
        crashes_sum += np.floor((100*two52-i)/(two52-i))/100
    return crashes_sum/two52
sum_over_crash(two52,2)#don't measure static compilation overhead
t1=time.time()
sum_over_crash(two52,2**30)
print(time.time()-t1)
This takes 0.57 s on my quad-core i7, i.e. about 28 days for the whole calculation.
As the calculation can not be minimized mathematically, the only option is to calculate it step by step.
This takes a long time (as stated in other answers). Your best bet on calculating it fast is to use a lower level language than python. Since python is an interpreted language, it is rather slow to calculate this kind of thing.
Additionally, you can use multithreading (if available in the chosen language) to make it even faster.
Cloud Computing is also an option that could be suitable for this, as you are only going to calculate the number once. Amazon and Google (and many more) provide this kind of service for a relatively small fee.
But before performing any of the calculations you need to adjust your formula, as with the way it stands right now, you're going to get a ZeroDivisionError at the very last iteration of your loop.

Speeding up Numpy Masking

I'm still an amateur when it comes to thinking about how to optimize. I have this section of code that takes in a list of found peaks and finds where these peaks, +/- some value, are located in a multidimensional array. It then adds 1 at the matching indices of an array of zeros. The code works well, but it takes a long time to execute. For instance, it takes close to 45 min to run if ind has 270 values and refVals has a shape of (3050,3130,80). I understand that it's a lot of data to churn through, but is there a more efficient way of going about this?
maskData = np.zeros_like(refVals).astype(np.int16)
for peak in ind:
    tmpArr = np.ma.masked_outside(refVals,x[peak]-2,x[peak]+2).astype(np.int16)
    maskData[tmpArr.mask == False ] += 1
    tmpArr = None
maskData = np.sum(maskData,axis=2)
Approach #1 : Memory permitting, here's a vectorized approach using broadcasting -
# Create +,-2 limits using ind
r = x[ind[:,None]] + [-2,2]
# Use limits to get inside matches and sum over the iterative and last dim
mask = (refVals >= r[:,None,None,None,0]) & (refVals <= r[:,None,None,None,1])
out = mask.sum(axis=(0,3))
Approach #2 : If running out of memory with the previous one, we could use a loop and use NumPy boolean arrays and that could be more efficient than masked arrays. Also, we would perform one more level of sum-reduction, so that we would be dragging less data with us when moving across iterations. Thus, the alternative implementation would look something like this -
out = np.zeros(refVals.shape[:2]).astype(np.int16)
x_ind = x[ind]
for i in x_ind:
    out += ((refVals >= i-2) & (refVals <= i+2)).sum(-1)
Approach #3 : Alternatively, we could replace that limit based comparison with np.isclose in approach #2. Thus, the only step inside the loop would become -
    out += np.isclose(refVals,i,rtol=0,atol=2).sum(-1)
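The equivalence between the masked-array loop from the question and the boolean-array loop of approach #2 can be checked on a small example. The sketch below uses made-up data: the shapes, the x grid and the ind values are assumptions chosen only so that it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
refVals = rng.uniform(0, 100, size=(6, 5, 4))  # small stand-in for the real data
x = np.linspace(0, 100, 50)                    # hypothetical peak positions
ind = np.array([3, 17, 30])                    # hypothetical peak indices

# Loop with masked arrays, following the question.
mask_data = np.zeros(refVals.shape, dtype=np.int16)
for peak in ind:
    tmp = np.ma.masked_outside(refVals, x[peak] - 2, x[peak] + 2)
    mask_data[~np.ma.getmaskarray(tmp)] += 1
expected = mask_data.sum(axis=2)

# Loop with plain boolean arrays, reducing over the last axis right away.
out = np.zeros(refVals.shape[:2], dtype=np.int16)
for centre in x[ind]:
    out += ((refVals >= centre - 2) & (refVals <= centre + 2)).sum(-1)

print(np.array_equal(out, expected))
```

Summing over the last axis inside the loop means the accumulator is only 2-D, which is where most of the memory saving comes from.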

Finding the correlation matrix

I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each row in the matrix. I have written Python code like this:
for i in xrange(rows): # rows are the number of rows in the matrix.
    for j in xrange(i, rows):
        r = scipy.stats.pearsonr(data[i,:], data[j,:])
        print r
Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?
Thanks!
New Solution
After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.
ms = data.mean(axis=1)[(slice(None,None,None),None)]
datam = data - ms
datass = np.sqrt(scipy.stats.ss(datam,axis=1))
for i in xrange(rows):
    temp = np.dot(datam[i:],datam[i].T)
    rs = temp / (datass[i:]*datass[i])
Each loop through generates the Pearson coefficients between row i and rows i through to the last row. It is very fast. It is at least 1.5x as fast as using corrcoef() alone because it doesn't redundantly calculate the coefficients and a few other things. It will also be faster and won't give you the memory problems with a 50,000 row matrix because then you can choose to either store each set of r's or process them before generating another set. Without storing any of the r's long term, I was able to get the above code to run on 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
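The same row-at-a-time idea can be sketched in current Python and checked against np.corrcoef. As far as I know, scipy.stats.ss is no longer available in recent SciPy releases, so this sketch computes the sums of squares with np.einsum instead; the generator name iter_row_correlations is made up for illustration.

```python
import numpy as np

def iter_row_correlations(data):
    """Yield, for each row i, the Pearson r between row i and rows i..end."""
    centered = data - data.mean(axis=1, keepdims=True)
    # Root of each row's sum of squared deviations.
    norms = np.sqrt(np.einsum('ij,ij->i', centered, centered))
    for i in range(data.shape[0]):
        yield centered[i:] @ centered[i] / (norms[i:] * norms[i])

rng = np.random.default_rng(0)
data = rng.random((30, 10))
reference = np.corrcoef(data)

all_ok = all(
    np.allclose(rs, reference[i, i:])
    for i, rs in enumerate(iter_row_correlations(data))
)
print(all_ok)
```

Because each set of coefficients is yielded and can then be discarded, the full rows-by-rows matrix never has to be held in memory.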
Old Solution
First, I wouldn't recommend printing out the r's to the screen. For 100 rows (10 columns), this is a difference of 19.79 seconds with printing vs. 0.301 seconds without using your code. Just store the r's and use them later if you would like, or do some processing on them as you go along like looking for some of the largest r's.
Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using some quantities that you can precalculate rather than calculating every time that a row is used. Also, you aren't using the p-value (which is also returned by pearsonr()), so let's scratch that too. Using the below code:
r = np.zeros((rows,rows))
ms = data.mean(axis=1)
datam = np.zeros_like(data)
for i in xrange(rows):
    datam[i] = data[i] - ms[i]
datass = scipy.stats.ss(datam,axis=1)
for i in xrange(rows):
    for j in xrange(i,rows):
        r_num = np.add.reduce(datam[i]*datam[j])
        r_den = np.sqrt(datass[i]*datass[j])
        r[i,j] = min((r_num / r_den), 1.0)
I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff - 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.
Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 Pearson coefficients. With the above code, I expect that it would take about 4 1/4 hours to do your computation if you have 10 columns to your data based on my results on smaller datasets.
You can get some improvement by taking the above code into Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.
Have you tried just using numpy.corrcoef? Seeing as how you're not using the p-values, it should do exactly what you want, with as little fuss as possible. (Unless I'm mis-remembering exactly what pearson's R is, which is quite possible.)
Just quickly checking the results on random data, it returns exactly the same thing as @Justin Peel's code above and runs ~100x faster.
For example, testing things with 1000 rows and 10 columns of random data...:
import numpy as np
import scipy as sp
import scipy.stats
def main():
    data = np.random.random((1000, 10))
    x = corrcoef_test(data)
    y = justin_peel_test(data)
    print 'Maximum difference between the two results:', np.abs((x-y)).max()
    return data
def corrcoef_test(data):
    """Just using numpy's built-in function"""
    return np.corrcoef(data)
def justin_peel_test(data):
    """Justin Peel's suggestion above"""
    rows = data.shape[0]
    r = np.zeros((rows,rows))
    ms = data.mean(axis=1)
    datam = np.zeros_like(data)
    for i in xrange(rows):
        datam[i] = data[i] - ms[i]
    datass = sp.stats.ss(datam,axis=1)
    for i in xrange(rows):
        for j in xrange(i,rows):
            r_num = np.add.reduce(datam[i]*datam[j])
            r_den = np.sqrt(datass[i]*datass[j])
            r[i,j] = min((r_num / r_den), 1.0)
            r[j,i] = r[i,j]
    return r
data = main()
Yields a maximum absolute difference of ~3.3e-16 between the two results
And timings:
In [44]: %timeit corrcoef_test(data)
10 loops, best of 3: 71.7 ms per loop
In [45]: %timeit justin_peel_test(data)
1 loops, best of 3: 6.5 s per loop
numpy.corrcoef should do just what you want, and it's a lot faster.
You can use the Python multiprocessing module: chunk up your rows into 10 sets, buffer your results and then print the stuff out (this would only speed it up on a multicore machine, though)
http://docs.python.org/library/multiprocessing.html
BTW: you'd also have to turn your snippet into a function and consider how to do the data reassembly. Having each subprocess take a list like [startcord,stopcord,buff] might work nicely
def myfunc(thelist):
    for i in xrange(thelist[0], thelist[1]):
        ....
    thelist[2] = result
