I have a Pandas dataframe with 150 million rows. Within that there are about 1 million groups I'd like to do some very simple calculations on. For example, I'd like to take some existing column 'A' and make a new column, 'A_Percentile' that expresses the values of 'A' as percentile ranks, within the group. Here's a little function that does it:
from scipy.stats import percentileofscore
def rankify(column_name, data=my_data_frame):
    f = lambda x: [percentileofscore(x, y) for y in x]
    data[column_name + '_Percentile'] = data.groupby(
        ['Group_variable_1', 'Group_variable_2'])[column_name].transform(f)
    return
Then you can call it like so:
rankify('Column_to_Rank', my_data_frame)
And wait for...quite a long time.
There are some obvious things I could do to speed this up (for instance, I'm sure there's a way to vectorize [percentileofscore(x, y) for y in x]). However, I have the feeling that there are some Pandas tricks I could be doing to speed this up immensely. Is there something I could be doing with the groupby logic? I thought about breaking it apart and parallelizing it, but 1. I'm not sure of a good way to do it and 2. the communication time to write out the data and read in the results seems like it would take nearly as long (perhaps I think that because of point #1).
As you are probably aware, the speed of groupby operations can vary tremendously -- especially as the number of groups gets high. Here's a really simple alternate approach that is quite a bit faster on some test datasets I tried (anywhere from 2x to 40x faster). Usually it is faster if you can avoid user-written functions (in combination with groupby) and stick to built-in functions (which are usually cythonized):
In [163]: %timeit rankify('x',df)
1 loops, best of 3: 7.38 s per loop
In [164]: def rankify2(column_name, data):
     ...:     r1 = data.groupby('grp')[column_name].rank()
     ...:     r2 = data.groupby('grp')[column_name].transform('count')
     ...:     data[column_name+'_Percentile2'] = 100. * r1 / r2
In [165]: %timeit rankify2('x',df)
10 loops, best of 3: 178 ms per loop
Note that my method gives very slightly different results (a difference of around 10e-15) compared to percentileofscore(). So if you test the results with x == y, most will be True but some will be False; a tolerance-based comparison (or x.round() == y.round()) will pass.
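If you'd rather check that agreement with an explicit tolerance instead of rounding, something like this works (a minimal sketch, assuming the columns produced by rankify('x', df) and rankify2('x', df) above):
import numpy as np
# differences on the order of 1e-15 are well within allclose's default tolerances
assert np.allclose(df['x_Percentile'], df['x_Percentile2'])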
For results above, this was my test dataset (for other cases I tried, the difference was smaller but always 2x or better speedup):
df = pd.DataFrame({"grp": np.repeat(np.arange(1000), 100),
                   "x": np.random.randn(100000)})
I'm sure you could do better than that if you want. Really all you need to do here is sort and rank. I suspect the basic approach I took will be a good way to do it, but if you did some or all of it in numpy or numba you might be able to speed it up further. Also, you might be able to use some of pandas' indexing tricks to speed things up.
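For what it's worth, the two grouped calls in rankify2 can likely be collapsed into one, since GroupBy.rank also accepts pct=True (a sketch, assuming a reasonably recent pandas and the question's two grouping columns):
def rankify3(column_name, data):
    grp = data.groupby(['Group_variable_1', 'Group_variable_2'])[column_name]
    # rank(pct=True) returns rank/count within each group, i.e. the same ratio as r1/r2
    data[column_name + '_Percentile3'] = 100. * grp.rank(pct=True)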
Related
I have the following function which accepts an indicator matrix of shape (20,000 x 20,000), and I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that the indicator_Matrix has to be in the form of a pandas dataframe when passed as a parameter to the function, as my actual problem's dataframe has a time index and integer columns, but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
import numpy as np
import pandas as pd

indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [20000, 20000]))

def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it using NumPy, but it is still taking ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time to run, with not much improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [20000, 20000]))

def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index=indicator_Matrix.index, columns=indicator_Matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

output = [operations(indicator_Matrix) for i in range(0, 20000**2)]
Note that the reason I convert d to a dataframe again is because I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns column means, e.g.:
2478 1.0
0 1.0
Update: I am still stuck on this problem. I wonder if using GPU packages like cuDF and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba implementation. The idea is to use plain loops and to take care to read data contiguously, which keeps the computation cache-friendly/memory-efficient. Here is an implementation:
import numba as nb
import numpy as np

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        # Row-major layout: each thread sums one full (contiguous) row
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        # Column-major layout: walk column by column over chunks of 64 rows
        for chunk_id in nb.prange(0, (n + 63) // 64):
            start = chunk_id * 64
            end = min(start + 64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count
# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0,20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate over the rows or over the columns; this is why there are two implementations of the sum based on is_row_major. There are also three Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays, and one for non-contiguous arrays. Numba will compile the three function variants and automatically pick the best one at runtime. Numba's JIT compiler can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time, but the computation is memory-bound and nearly optimal on my machine: it is limited by the main memory that has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to hold the values 0 and 1, and it is 4 times smaller in memory than the default integer type on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tricks, though this is generally significantly slower with Numba unless you have a lot of available cores.
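As a sketch of what that looks like (note that the explicit signature list on compute_fastest above would need matching uint8 variants, e.g. '(uint8[:,::1],)', added for this to compile):
import numpy as np

compact = indicator_Matrix.to_numpy().astype(np.uint8)  # 0/1 values fit in one byte
result = compute_fastest(compact)                        # roughly 4x less memory traffic than int32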
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple column groups and convert each column group to a NumPy array so as to then call the Numba function (modified to support groups). Note that @CrazyChucky's code has a similar issue: a dataframe with mixed column data types converted to a NumPy array results in an object-dtype NumPy array, which is very inefficient (especially as a row-major NumPy array).
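A hypothetical illustration of that splitting idea (mixed_df and the grouping-by-dtype layout are assumptions, not part of the code above):
import numpy as np

# One contiguous NumPy array per homogeneous group of columns of mixed_df
blocks = {str(dt): np.ascontiguousarray(mixed_df.select_dtypes(include=[dt]).to_numpy())
          for dt in mixed_df.dtypes.unique()}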
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the interconnect overhead, generally a rather slow PCIe link). Note also that GPU memory is quite limited compared to CPU memory. If the target dataframe(s) do not need to be transferred, then using cuDF is relatively simple and should give a small speedup. For faster code, one needs to implement a fast CUDA kernel, but this is clearly far from easy for dataframes with mixed data types. In the end, the resulting speedup should be about main_ram_throughput / gpu_ram_throughput, assuming there is no data transfer; this factor is generally 5-12. Note also that CUDA and cuDF require an Nvidia GPU.
Finally, reducing the input data size or just the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the workload is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
Summing each column
Turning the list of sums "sideways" and dividing each column by it
Taking the mean of each column, ignoring values ≤ 0
Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
Note: You're unlikely to see any improvements with compute-heavy work like this from threading; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful to avoid waiting around for I/O.
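For reference, a minimal sketch of the process-based executor mentioned above (illustrative only: as noted, it did not actually help here, and shipping the matrix to every worker adds pickling overhead):
from concurrent.futures import ProcessPoolExecutor

def run_batch(matrix, n_runs):
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(operations_simpler, matrix) for _ in range(n_runs)]
        return [f.result() for f in futures]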
As per the previous answer you can use Numba, or you can use other alternatives such as Dask, a distributed computing package, to parallelize your function's execution: it can divide your data into smaller chunks and distribute the computation across many CPU cores or even numerous machines.
import dask.array as da

def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_dask = da.from_array(indicator_matrix, chunks=(1000, 1000))
output_dask = indicator_matrix_dask.map_blocks(operations, dtype=float)
output = output_dask.compute()
Or you can use CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp

def operations(indicator_matrix):
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    # keep only the positive entries, take the column means, then the last column
    d = cp.where(d > 0, d, cp.nan)
    res = cp.nanmean(d, axis=0)
    return res[-1]

indicator_matrix_cupy = cp.asarray(indicator_Matrix.to_numpy())
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)
numpy/pandas are famous for their underlying acceleration, i.e. vectorization.
Conditional evaluation is a common expression that occurs in code everywhere.
However, when using the pandas dataframe apply function in the intuitive way, the conditional evaluation seems very slow.
An example of my apply code looks like:
def condition_eval(df):
    x = df['x']
    a = df['a']
    b = df['b']
    if x <= a:
        d = round((x - a) / 0.01) - 1
        if d < -10:
            d = -10
    elif x >= b:
        d = round((x - b) / 0.01) + 1
        if d > 10:
            d = 10
    else:
        d = 0
    return d

df['eval_result'] = df.apply(condition_eval, axis=1)
The properties of this kind of problem are:
the result can be computed using only the row's own data, and it always uses multiple columns.
each row has the same computation algorithm.
the algorithm may contain complex conditional branches.
What's the best practice in numpy/pandas to solve such kind of problems?
Some more thoughts.
In my opinion, one of the reasons why vectorization is effective is that the underlying CPU has vector instructions (e.g. SIMD, Intel AVX), which rely on the fact that the computational instructions behave deterministically, i.e. no matter what the input data is, the result is obtained after a fixed number of CPU cycles. Thus, parallelizing such operations is easy.
Branch execution in a CPU, however, is much more complicated. First of all, different branches of the same condition evaluation have different execution paths, so they may take different numbers of CPU cycles. Modern CPUs even employ tricks like branch prediction, which introduce more uncertainty.
So I wonder if and how pandas tries to accelerate this kind of vectorized conditional evaluation, and whether there is a better practice for this kind of computational workload.
This should be equivalent:
import pandas as pd
import numpy as np

def get_eval_result(df):
    conditions = (
        df.x.le(df.a),
        df.x.gt(df.b),
    )
    choices = (
        np.where((d := df.x.sub(df.a).div(0.01).round().sub(1)).lt(-10), -10, d),
        np.where((d := df.x.sub(df.b).div(0.01).round().add(1)).gt(10), 10, d),
    )
    return np.select(conditions, choices, 0)

df = df.assign(eval_result=get_eval_result)
My answer basically calculates the results of every branch, and then uses numpy syntax to specify which of those results should be used. This could be optimized slightly, but since it uses purely vectorized functions, it should be far faster than using .apply.
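An equivalent formulation (just a sketch, reusing the column names above) replaces the nested np.where clamping with Series.clip, which some find easier to read:
import numpy as np

def get_eval_result_clip(df):
    low = df.x.sub(df.a).div(0.01).round().sub(1).clip(lower=-10)   # saturate at -10
    high = df.x.sub(df.b).div(0.01).round().add(1).clip(upper=10)   # saturate at +10
    return np.select([df.x.le(df.a), df.x.gt(df.b)], [low, high], 0)
df.assign(eval_result=get_eval_result_clip) then works exactly as above.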
np.select is best for this:
(df
 .assign(column_to_alter=lambda x: np.select([cond1, cond2, cond3],
                                             [option1, opt2, opt3],
                                             default='somevalue'))
)
I have an array with several fields, which I want to be sorted with respect to 2 of them. One of these fields is binary, e.g.:
import numpy as np

size = 100000
data = np.empty(
    shape=2 * size,
    dtype=[('class', int),
           ('value', int)],
)
data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)
np.random.shuffle(data)
I need the result to be sorted with respect to value, and for same values class=0 should go first. Doing it like so (a):
idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]
seems to be an order of magnitude slower compared to sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?
By experimenting randomly I noticed that an approach like this (b):
idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]
takes ~20% less time than (a). Changing field datatypes also seems to have some effect - floats instead of ints seem to be slightly faster.
The simplest way to do this is using the order parameter of sort
sort(data, order=['value', 'class'])
However, this takes 121 ms to run on my computer, while sorting data['class'] or data['value'] alone takes only 2.44 and 5.06 ms respectively. Interestingly, sort(data, order='class') takes 135 ms again, suggesting the problem is with sorting structured arrays.
So, the approach you've taken of sorting each field using argsort then indexing the final array seems to be on the right track. However, you need to sort each field individually,
idx=argsort(data['class'])
data_sorted = data[idx][argsort(data['value'][idx], kind='stable')]
This runs in 43.9 ms.
You can get a very slight speedup by removing one temporary array from indexing
idx = argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[argsort(tmp['value'], kind='stable')]
Which runs in 40.8 ms. Not great, but it is a workaround if performance is critical.
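A closely related alternative that the answer above does not cover is np.lexsort, which does this kind of multi-key indirect sort in one call; the last key passed is the primary one:
idx = np.lexsort((data['class'], data['value']))   # primary key: value, ties broken by class
data_sorted = data[idx]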
This seems to be a known problem:
sorting numpy structured and record arrays is very slow
Edit
The sourcecode for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src .
The numeric types are much, much simpler. Still, that large of a difference in performance is surprising.
In addition to the good (general-purpose) answer of @user2699, in your specific case you can cheat, because the two fields of the structured array are of the same integer type and the values are relatively small (they fit in 32 bits). The cheat consists of the following steps (a short sketch is given after this list):
subtract the minimum value of each field from all items of that field (to make them non-negative) using arr - np.min(arr)
convert each field to np.uint64 with np.astype
pack the two fields into one integer array, with the primary key 'value' in the high bits: (value_arr << 32) | class_arr
sort the resulting array using np.sort
unpack the array using: value_arr = sorted_arr >> 32 and class_arr = sorted_arr & ((1<<32)-1)
This strategy is significantly faster than using two np.argsort calls, which are pretty expensive. This is especially true for bigger arrays, since sorting a big array is even more expensive and np.sort is cheaper than np.argsort. Not to mention that indirect indexing is relatively slow on a big array because of the unpredictable pseudo-random memory access pattern and the high latency of the RAM. The downside of this approach is that it is a bit trickier to implement and it does not apply in all cases.
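A rough sketch of the packing trick, under the same assumptions as above (both fields are integers that fit in 32 bits once shifted to be non-negative):
import numpy as np

cls_min = data['class'].min()
val_min = data['value'].min()
cls = (data['class'] - cls_min).astype(np.uint64)
val = (data['value'] - val_min).astype(np.uint64)

packed = (val << np.uint64(32)) | cls          # primary key 'value' in the high bits
packed.sort()                                  # one cheap sort of a plain uint64 array

val_sorted = (packed >> np.uint64(32)).astype(np.int64) + val_min
cls_sorted = (packed & np.uint64(0xFFFFFFFF)).astype(np.int64) + cls_min
If you need the whole structured array reordered (rather than the two unpacked fields), you would argsort packed and index data with the result, at the cost of reintroducing the indirect indexing.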
I am trying to write a piece of nested loops in my algorithm, and have run into the problem that the whole algorithm takes too long due to these nested loops. I am quite new to Python (as you may be able to tell from my unprofessional code below :( ) and hopefully someone can show me a way to speed up my code!
The whole algorithm is for fire detection in multiple 1500*6400 arrays. A small contextual analysis is applied while going through the whole array. The contextual analysis is performed with a dynamically assigned window size: the window can grow from 11*11 up to 31*31 until there are enough valid values inside the sampling window for the next round of calculation, for example like below:
def ContextualWindows(arrb4, arrb5, pfire):
    #### arrb4, arrb5, pfire are 31*31 sampling windows from a large 1500*6400 numpy array
    i = 5
    while i in range(5, 16):
        arrb4back = arrb4[15-i:16+i, 15-i:16+i]
        ## only output the array data when it is 'large' enough
        ## to have enough good quality data to do calculation
        if np.ma.count(arrb4back) >= min(10, 0.25*i*i):
            arrb5back = arrb5[15-i:16+i, 15-i:16+i]
            pfireback = pfire[15-i:16+i, 15-i:16+i]
            canfire = 0
            i = 20
        else:
            i = i + 1
    ### unknown pixel: background condition could not be characterized
    if i != 20:
        canfire = 1
        arrb5back = arrb5
        pfireback = pfire
        arrb4back = arrb4
    return (arrb4back, arrb5back, pfireback, canfire)
Then this dynamic window will be fed into the next round of tests, for example:
b4backave = np.mean(arrb4Windows)
b4backdev = np.std(arrb4Windows)
if b4 > b4backave + 3.5 * b4backdev:
    firetest = True
Running the whole code on my multiple 1500*6400 numpy arrays takes over half an hour, or even longer. Just wondering if anyone has an idea how to deal with it? A general idea of which part I should put my effort into would be greatly helpful!
Many thanks!
Avoid while loops if speed is a concern. The loop lends itself to a for loop as start and end are fixed. Additionally, your code does a lot of copying which isn't really necessary. The rewritten function:
def ContextualWindows(arrb4, arrb5, pfire):
    ''' arrb4, arrb5, pfire are 31*31 sampling windows from
        a large 1500*6400 numpy array '''
    for i in range(5, 16):
        lo = 15 - i  # 10..0
        hi = 16 + i  # 21..31
        # only output the array data when it is 'large' enough
        # to have enough good quality data to do calculation
        if np.ma.count(arrb4[lo:hi, lo:hi]) >= min(10, 0.25*i*i):
            return (arrb4[lo:hi, lo:hi], arrb5[lo:hi, lo:hi], pfire[lo:hi, lo:hi], 0)
    else:  # unknown pixel: background condition could not be characterized
        return (arrb4, arrb5, pfire, 1)
For clarity I've used style guidelines from PEP 8 (like extended comments, the number of comment characters, spaces around operators, etc.). Copying of a windowed arrb4 occurs twice here, but only if the condition is fulfilled, and this will happen only once per function call. The else clause will be executed only if the for-loop has run to its end. We don't even need a break from the loop as we exit the function altogether.
Let us know if that speeds up the code a bit. I don't think it'll be much but then again there isn't much code anyway.
I've run some time tests with ContextualWindows and variants. One i step takes about 50us, all ten about 500.
This simple iteration takes about the same time:
[np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
The iteration mechanism, and the 'copying' arrays are minor parts of the time. Where possible numpy is making views, not copies.
I'd focus on either minimizing the number of these count steps, or speeding up the count.
Comparing times for various operations on these windows:
First time for 1 step:
In [167]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,6)]
10000 loops, best of 3: 43.9 us per loop
now for the 10 steps:
In [139]: timeit [arrb4[15-i:16+i,15-i:16+i].shape for i in range(5,16)]
10000 loops, best of 3: 33.7 us per loop
In [140]: timeit [np.sum(arrb4[15-i:16+i,15-i:16+i]>500) for i in range(5,16)]
1000 loops, best of 3: 390 us per loop
In [141]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
1000 loops, best of 3: 464 us per loop
Simply indexing does not take much time, but testing for conditions takes substantially more.
cumsum is sometimes used to speed up sums over sliding windows. Instead of taking sum (or mean) over each window, you calculate the cumsum and then use the differences between the front and end of window.
Trying something like that, but in 2d - cumsum in both dimensions, followed by differences between diagonally opposite corners:
In [164]: %%timeit
.....: cA4=np.cumsum(np.cumsum(arrb4,0),1)
.....: [cA4[15-i,15-i]-cA4[15+i,15+i] for i in range(5,16)]
.....:
10000 loops, best of 3: 43.1 us per loop
This is almost 10x faster than the (nearly) equivalent sum. Values don't quite match, but timing suggest that this may be worth refining.
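For reference, here is one way the refinement could look (a sketch only, assuming a plain unmasked 31*31 window; a full 2-D summed-area table needs zero padding and all four corners of each window, and for the valid-pixel count you would build the same table from the window's mask instead):
import numpy as np

def window_sums(arr, i_range=range(5, 16)):
    # S[i, j] holds the sum of arr[:i, :j]; the extra row/column of zeros
    # lets every window sum be expressed with four corner lookups.
    S = np.zeros((arr.shape[0] + 1, arr.shape[1] + 1), dtype=arr.dtype)
    S[1:, 1:] = arr.cumsum(axis=0).cumsum(axis=1)
    sums = []
    for i in i_range:
        lo, hi = 15 - i, 16 + i              # same bounds as arrb4[lo:hi, lo:hi]
        sums.append(S[hi, hi] - S[lo, hi] - S[hi, lo] + S[lo, lo])
    return sums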
I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between every pair of rows in the matrix. I have written Python code like this:
for i in xrange(rows):  # rows is the number of rows in the matrix
    for j in xrange(i, rows):
        r = scipy.stats.pearsonr(data[i, :], data[j, :])
        print r
Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?
Thanks!
New Solution
After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.
ms = data.mean(axis=1)[:, np.newaxis]   # column vector of row means
datam = data - ms
datass = np.sqrt(scipy.stats.ss(datam, axis=1))
for i in xrange(rows):
    temp = np.dot(datam[i:], datam[i].T)
    rs = temp / (datass[i:] * datass[i])
Each loop through generates the Pearson coefficients between row i and rows i through to the last row. It is very fast. It is at least 1.5x as fast as using corrcoef() alone because it doesn't redundantly calculate the coefficients and a few other things. It will also be faster and won't give you the memory problems with a 50,000 row matrix because then you can choose to either store each set of r's or process them before generating another set. Without storing any of the r's long term, I was able to get the above code to run on 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
Old Solution
First, I wouldn't recommend printing out the r's to the screen. For 100 rows (10 columns), your code takes 19.79 seconds with printing vs. 0.301 seconds without. Just store the r's and use them later if you would like, or do some processing on them as you go along, like looking for some of the largest r's.
Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using some quantities that you can precalculate rather than calculating them every time a row is used. Also, you aren't using the p-value (which is also returned by pearsonr()), so let's scratch that too. Using the code below:
r = np.zeros((rows, rows))
ms = data.mean(axis=1)

datam = np.zeros_like(data)
for i in xrange(rows):
    datam[i] = data[i] - ms[i]
datass = scipy.stats.ss(datam, axis=1)
for i in xrange(rows):
    for j in xrange(i, rows):
        r_num = np.add.reduce(datam[i] * datam[j])
        r_den = np.sqrt(datass[i] * datass[j])
        r[i, j] = min((r_num / r_den), 1.0)
I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff - 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.
Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 Pearson coefficients. With the above code, I expect that it would take about 4 1/4 hours to do your computation if you have 10 columns to your data based on my results on smaller datasets.
You can get some improvement by taking the above code into Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.
Have you tried just using numpy.corrcoef? Seeing as how you're not using the p-values, it should do exactly what you want, with as little fuss as possible. (Unless I'm mis-remembering exactly what pearson's R is, which is quite possible.)
Just quickly checking the results on random data, it returns exactly the same thing as @Justin Peel's code above and runs ~100x faster.
For example, testing things with 1000 rows and 10 columns of random data...:
import numpy as np
import scipy as sp
import scipy.stats

def main():
    data = np.random.random((1000, 10))
    x = corrcoef_test(data)
    y = justin_peel_test(data)
    print 'Maximum difference between the two results:', np.abs((x-y)).max()
    return data

def corrcoef_test(data):
    """Just using numpy's built-in function"""
    return np.corrcoef(data)

def justin_peel_test(data):
    """Justin Peel's suggestion above"""
    rows = data.shape[0]
    r = np.zeros((rows, rows))
    ms = data.mean(axis=1)
    datam = np.zeros_like(data)
    for i in xrange(rows):
        datam[i] = data[i] - ms[i]
    datass = sp.stats.ss(datam, axis=1)
    for i in xrange(rows):
        for j in xrange(i, rows):
            r_num = np.add.reduce(datam[i] * datam[j])
            r_den = np.sqrt(datass[i] * datass[j])
            r[i, j] = min((r_num / r_den), 1.0)
            r[j, i] = r[i, j]
    return r

data = main()
Yields a maximum absolute difference of ~3.3e-16 between the two results
And timings:
In [44]: %timeit corrcoef_test(data)
10 loops, best of 3: 71.7 ms per loop
In [45]: %timeit justin_peel_test(data)
1 loops, best of 3: 6.5 s per loop
numpy.corrcoef should do just what you want, and it's a lot faster.
You can use the Python multiprocessing module: chunk up your rows into 10 sets, buffer your results, and then print the stuff out (this will only speed it up on a multicore machine, though).
http://docs.python.org/library/multiprocessing.html
btw: you'd also have to turn your snippet into a function and also consider how to do the data reassembly. Having each subprocess work with a list like this ... [startcord, stopcord, buff] ... might work nicely:
def myfunc(thelist):
    for i in xrange(thelist[0], thelist[1]):
        ....
    thelist[2] = result
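In modern Python, a rough sketch of that chunking idea might look like the following (names and the chunk layout are illustrative; each worker reuses the mean-centred rows from the answers above and handles one slice of rows, and shipping datam to every worker does add pickling overhead):
import numpy as np
from multiprocessing import Pool

def corr_chunk(args):
    start, stop, datam, datass = args
    block = datam[start:stop] @ datam.T                # numerators for this slice of rows
    block /= np.outer(datass[start:stop], datass)      # divide by the row norms
    return start, stop, block

def parallel_corr(data, n_chunks=10):
    datam = data - data.mean(axis=1, keepdims=True)    # mean-centred rows
    datass = np.sqrt((datam ** 2).sum(axis=1))         # square roots of the row sums of squares
    bounds = np.linspace(0, data.shape[0], n_chunks + 1).astype(int)
    jobs = [(bounds[k], bounds[k + 1], datam, datass) for k in range(n_chunks)]
    with Pool() as pool:
        return pool.map(corr_chunk, jobs)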