Python: Sliding windowed mean, ignoring missing data

I am currently processing an experimental time series dataset that has missing values. I would like to compute the sliding windowed mean of this dataset along the time axis while handling NaN values. The correct way, for me, is to take the sum of the finite elements inside each window and divide it by their count. This nonlinearity forces me away from convolution-based methods, so this part of the process has become a severe time bottleneck. As a code example of what I am trying to accomplish, I present the following:
import numpy as np

# Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(50)
data[np.random.randint(0, n-1, n_miss)] = None

# Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    part_data = data[max(count - (win_size - 1) / 2, 0):
                     min(count + (win_size + 1) / 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = None

print 'Input:\t', data
print 'Output:\t', result
with output:
Input: [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788
0.87829696 nan 0.71589171 nan 0.70359557 0.76113969
0.13694387 0.32126573 0.22730891 nan 0.35057169 nan
0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
nan nan nan nan 0.63227417 0.92781944
0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597
nan 0.01410721 nan nan 0.6976317 nan
0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849
0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305
0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451
0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
0.77625334 nan nan 0.63227417 0.7800468 0.85141944
0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619
0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282
0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
0.29155301 0.2021964 ]
Can this result be produced by numpy operations, without using a for loop?

You can do that using the rolling function of Pandas:
import numpy as np
import pandas as pd
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0, n-1, n_miss)] = None
windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1).mean()
print(pd.DataFrame({'Data': data, 'Windowed mean': windowed_mean}))
Output:
Data Windowed mean
0 0.589376 0.589376
1 0.639173 0.614274
2 0.343534 0.524027
3 0.250329 0.411012
4 0.911952 0.501938
5 NaN 0.581141
6 0.224964 0.568458
7 NaN 0.224964
8 0.508419 0.366692
9 0.215418 0.361918
10 NaN 0.361918
11 0.638118 0.426768
12 0.587478 0.612798
13 0.097037 0.440878
14 0.688689 0.457735
15 0.858593 0.548107
16 0.408903 0.652062
17 0.448993 0.572163
18 NaN 0.428948
19 0.877453 0.663223
20 NaN 0.877453
21 NaN 0.877453
22 0.021798 0.021798
23 0.482054 0.251926
24 0.092387 0.198746
25 0.251766 0.275402
26 0.093854 0.146002
27 NaN 0.172810
28 NaN 0.093854
29 NaN NaN
30 0.965669 0.965669
31 0.695999 0.830834
32 NaN 0.830834
33 NaN 0.695999
34 NaN NaN
35 0.613727 0.613727
36 0.837533 0.725630
37 NaN 0.725630
38 0.782295 0.809914
39 NaN 0.782295
40 0.777429 0.779862
41 0.401355 0.589392
42 0.491709 0.556831
43 0.127813 0.340292
44 0.781625 0.467049
45 0.960466 0.623301
46 0.637618 0.793236
47 0.651264 0.749782
48 0.154911 0.481264
49 0.159145 0.321773
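Note that rolling() as called above uses a trailing window, while the loop in the question centers the window on each sample. If you want the centered behaviour, pass center=True (a small tweak, assuming an odd win_size):
windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1, center=True).mean()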

Here's a convolution-based approach using np.convolve -
mask = np.isnan(data)
K = np.ones(win_size, dtype=int)
out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
Please note that this would have one extra element at each end.
If you are working with 2D data, you can use SciPy's 2D convolution (see the sketch below).
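As an aside (not part of the answer above), the manual trim can be avoided by asking np.convolve for mode='same', and the same sum/count trick carries over to 2D with scipy.signal.convolve2d. A minimal sketch, assuming an odd win_size:
import numpy as np
from scipy import signal

win_size = 3
data = np.random.random(50)
data[np.random.randint(0, 49, 20)] = np.nan

# 1D: mode='same' keeps the output aligned with the input, so no slicing is needed
mask = np.isnan(data)
K = np.ones(win_size, dtype=int)
out = np.convolve(np.where(mask, 0, data), K, mode='same') / np.convolve(~mask, K, mode='same')

# 2D: the same sum-of-values / count-of-valid trick with a 2D kernel
data2d = np.random.random((40, 40))
data2d[np.random.randint(0, 40, 100), np.random.randint(0, 40, 100)] = np.nan
mask2d = np.isnan(data2d)
K2 = np.ones((win_size, win_size))
out2d = (signal.convolve2d(np.where(mask2d, 0, data2d), K2, mode='same')
         / signal.convolve2d((~mask2d).astype(float), K2, mode='same'))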
Approaches -
def original_app(data, win_size):
    # Compute mean
    result = np.zeros(data.size)
    for count in range(data.size):
        part_data = data[max(count - (win_size - 1) / 2, 0):
                         min(count + (win_size + 1) / 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = None
    return result

def numpy_app(data, win_size):
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    return out[1:-1]  # Slice out the one extra element on each side
Sample run -
In [118]: #Construct sample data
...: n = 50
...: n_miss = 20
...: win_size = 3
...: data= np.random.random(50)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
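As an aside, the RuntimeWarning above comes from windows that contain only NaNs, where both convolutions give 0 and the division is 0/0. If you want to silence it while keeping the NaN result, one option (not part of the original answer) is to wrap the division in np.errstate:
import numpy as np

def numpy_app_quiet(data, win_size):
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    # all-NaN windows produce 0/0; suppress the warning locally and keep the NaN result
    with np.errstate(invalid='ignore', divide='ignore'):
        out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    return out[1:-1]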
Runtime test -
In [122]: #Construct sample data
...: n = 50000
...: n_miss = 20000
...: win_size = 3
...: data= np.random.random(n)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop
In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop
In [125]: import pandas as pd
# @jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop

Related

How to compute a customized weighted average, with handling of NaN values, in pandas?

I have a data frame df_ss_g as
ent_id,WA,WB,WC,WD
123,0.045251836,0.614582906,0.225930615,0.559766482
124,0.722324239,0.057781167,,0.123603561
125,,0.361074325,0.768542766,0.080434134
126,0.085781742,0.698045853,0.763116684,0.029084545
127,0.909758657,,0.760993759,0.998406211
128,,0.32961283,,0.90038336
129,0.714585519,,0.671905291,
130,0.151888772,0.279261613,0.641133263,0.188231227
Now I have to compute a weighted average (AVG_WEIGHTAGE), i.e. (WA*0.5 + WB*1 + WC*0.5 + WD*1) / (0.5 + 1 + 0.5 + 1).
But when I compute it using the method below, i.e.
df_ss_g['AVG_WEIGHTAGE'] = df_ss_g.apply(lambda x: ((x['WA']*0.5)+(x['WB']*1)+(x['WC']*0.5)+(x['WD']*1))/(0.5+1+0.5+1), axis=1)
the output is wrong: for rows containing NaN values it gives NaN as AVG_WEIGHTAGE.
All I want is that nulls are not counted in either the numerator or the denominator,
e.g.
ent_id,WA,WB,WC,WD,AVG_WEIGHTAGE
128,,0.32961283,,0.90038336,0.614998095 i.e. (WB*1+WD*1)/(1+1)
129,0.714585519,,0.671905291,,0.693245405 i.e. (WA*0.5+WC*0.5)/(0.5+0.5)
IIUC:
import numpy as np
weights = np.array([0.5, 1, 0.5, 1])
values = df.drop('ent_id', axis=1)
df['AVG_WEIGHTAGE'] = np.dot(values.fillna(0).to_numpy(), weights)/np.dot(values.notna().to_numpy(), weights)
df['AVG_WEIGHTAGE']
0 0.436647
1 0.217019
2 0.330312
3 0.383860
4 0.916891
5 0.614998
6 0.693245
7 0.288001
Try this method using dot products -
def av(t):
    # Define weights
    wt = [0.5, 1, 0.5, 1]
    # Create a vector with 0 for null and 1 for non-null
    nulls = [int(i) for i in ~t.isna()]
    # Take elementwise products of the nulls vector with both weights and t.fillna(0)
    wt_new = np.dot(nulls, wt)
    t_new = np.dot(nulls, t.fillna(0))
    # Return the division
    return np.divide(t_new, wt_new)

df['WEIGHTED AVG'] = df.apply(av, axis=1)
df = df.reset_index()
print(df)
ent_id WA WB WC WD WEIGHTED AVG
0 123 0.045252 0.614583 0.225931 0.559766 0.481844
1 124 0.722324 0.057781 NaN 0.123604 0.361484
2 125 NaN 0.361074 0.768543 0.080434 0.484020
3 126 0.085782 0.698046 0.763117 0.029085 0.525343
4 127 0.909759 NaN 0.760994 0.998406 1.334579
5 128 NaN 0.329613 NaN 0.900383 0.614998
6 129 0.714586 NaN 0.671905 NaN 1.386491
7 130 0.151889 0.279262 0.641133 0.188231 0.420172
It boils down to masking the NaN values with 0 so they don't contribute to either the weights or the sum:
# these are the weights
weights = np.array([0.5, 1, 0.5, 1])
# the columns of interest
s = df.iloc[:, 1:]
# where the valid values are
mask = s.notnull()
# use `fillna` and then `@` for matrix multiplication
df['AVG_WEIGHTAGE'] = (s.fillna(0) @ weights) / (mask @ weights)
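A rough pandas-only variant of the same idea (a sketch, not from the answers above, assuming a frame with the WA..WD columns from the question): multiply by the weights, let sum() skip the NaNs, and divide by the weight mass of the non-null cells in each row.
import numpy as np
import pandas as pd

# two rows from the question, just to make the sketch runnable
df = pd.DataFrame({'ent_id': [128, 129],
                   'WA': [np.nan, 0.714585519], 'WB': [0.32961283, np.nan],
                   'WC': [np.nan, 0.671905291], 'WD': [0.90038336, np.nan]})

w = pd.Series({'WA': 0.5, 'WB': 1.0, 'WC': 0.5, 'WD': 1.0})
cols = df[w.index]
# NaN * weight stays NaN and is skipped by sum(); the denominator only counts
# the weights of the cells that are actually present in each row
df['AVG_WEIGHTAGE'] = cols.mul(w).sum(axis=1) / cols.notna().mul(w).sum(axis=1)
print(df[['ent_id', 'AVG_WEIGHTAGE']])   # 0.614998..., 0.693245...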

How to calculate p-values for pairwise correlation of columns in Pandas?

pandas has the very handy DataFrame.corr() function for computing the pairwise correlation of columns.
That means it is possible to compare correlations between columns of any length. For instance:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:
pearson : standard correlation coefficient.
kendall : Kendall Tau correlation coefficient.
spearman : Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float.
from scipy.stats import kendalltau, pearsonr, spearmanr
def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]
and then
corr = df.corr(method=pearsonr_pval)
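A small usage sketch (an addition, not part of the answer above, reusing the pearsonr_pval helper and the random df from the question): keep the coefficient matrix and the p-value matrix side by side. Note that pandas forces the diagonal of a callable-based corr() to 1.0, so the diagonal of the p-value matrix is not meaningful.
rho = df.corr(method='pearson')          # correlation coefficients
pvals = df.corr(method=pearsonr_pval)    # p-values (ignore the forced 1.0 diagonal)
significant = pvals < 0.05               # boolean mask of pairs with p < 0.05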
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats

df_corr = pd.DataFrame()  # Correlation matrix
df_p = pd.DataFrame()     # Matrix of p-values
for x in df.columns:
    for y in df.columns:
        corr = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = corr[0]
        df_p.loc[x, y] = corr[1]
If you want to leverage the fact that this is symmetric, so you only need to calculate this for roughly half of them, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K, K), dtype=float)
p_vals = np.empty((K, K), dtype=float)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        else:
            corr = stats.pearsonr(ac, bc)
            # corr = stats.kendalltau(ac, bc)
            correl[i, j] = corr[0]
            correl[j, i] = corr[0]
            p_vals[i, j] = corr[1]
            p_vals[j, i] = corr[1]

df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
# pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
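A small follow-up (an addition, not part of the answer above): if you want the original column labels on the resulting matrices, pass them when wrapping the arrays, assuming df, correl and p_vals from the snippet above.
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)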
This will work:
from scipy.stats import pearsonr
column_values = [column for column in df.columns.tolist()]
df['Correlation_coefficent'], df['P-value'] = zip(*df.T.apply(lambda x: pearsonr(x[column_values], x[column_values])))
df_result = df[['Correlation_coefficent','P-value']]
Does this work for you?
from scipy.stats import pearsonr
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# call the correlation function on your dataframe df_c; you could round the values if needed
rho = df_c.corr().round(1)
# get the p-values (subtract the identity so the diagonal is not flagged)
pval = df_c.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
# mark the p-values: *** for less than 0.001, ** for less than 0.01, * for less than 0.05
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))
# df_c2 below gives you the dataframe with correlation coefficients and p-value stars
df_c2 = rho.astype(str) + p
# you could also plot the correlation matrix using sns.heatmap if you want
# mask the upper triangle
matrix = np.triu(rho)
# convert to an array for the heatmap annotations
df_c3 = df_c2.to_numpy()
# plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(rho, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)

pandas: how to get the percentage for each row

When I use the pandas value_counts method, I get the data below:
new_df['mark'].value_counts()
1 1349110
2 1606640
3 175629
4 790062
5 330978
How can I get the percentage for each row like this?
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
I need to divide each row by the sum of these values.
np.random.seed([3,1415])
s = pd.Series(np.random.choice(list('ABCDEFGHIJ'), 1000, p=np.arange(1, 11) / 55.))
s.value_counts()
I 176
J 167
H 136
F 128
G 111
E 85
D 83
C 52
B 38
A 24
dtype: int64
As percent
s.value_counts(normalize=True)
I 0.176
J 0.167
H 0.136
F 0.128
G 0.111
E 0.085
D 0.083
C 0.052
B 0.038
A 0.024
dtype: float64
counts = s.value_counts()
percent = counts / counts.sum()
fmt = '{:.1%}'.format
pd.DataFrame({'counts': counts, 'per': percent.map(fmt)})
counts per
I 176 17.6%
J 167 16.7%
H 136 13.6%
F 128 12.8%
G 111 11.1%
E 85 8.5%
D 83 8.3%
C 52 5.2%
B 38 3.8%
A 24 2.4%
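Applied to the column from the question, the same pattern would look like this (a sketch, assuming new_df['mark'] exists as in the question):
import pandas as pd

counts = new_df['mark'].value_counts()
percent = counts / counts.sum()
print(pd.DataFrame({'counts': counts, 'per': percent.map('{:.1%}'.format)}))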
I think you need:
#if output is Series, convert it to DataFrame
df = df.rename('a').to_frame()
df['per'] = (df.a * 100 / df.a.sum()).round(1).astype(str) + '%'
print (df)
a per
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
Timings:
It seems it is faster to use sum than to call value_counts twice:
In [184]: %timeit (jez(s))
10 loops, best of 3: 38.9 ms per loop
In [185]: %timeit (pir(s))
10 loops, best of 3: 76 ms per loop
Code for timings:
np.random.seed([3,1415])
s = pd.Series(np.random.choice(list('ABCDEFGHIJ'), 1000, p=np.arange(1, 11) / 55.))
s = pd.concat([s]*1000)#.reset_index(drop=True)

def jez(s):
    df = s.value_counts()
    df = df.rename('a').to_frame()
    df['per'] = (df.a * 100 / df.a.sum()).round(1).astype(str) + '%'
    return df

def pir(s):
    return pd.DataFrame({'a': s.value_counts(),
                         'per': s.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})

print (jez(s))
print (pir(s))
Here's a snippet that I think is more pythonic than what is proposed above:
def aspercent(column, decimals=2):
    assert decimals >= 0
    return (round(column*100, decimals).astype(str) + "%")

aspercent(df['mark'].value_counts(normalize=True), decimals=1)
This will output:
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
This also allows you to adjust the number of decimals.
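If you also want the raw counts next to the formatted percentages, as in the desired output of the question, a small combined sketch (assuming df['mark'] holds the data and aspercent is defined as above):
import pandas as pd

counts = df['mark'].value_counts()
summary = pd.DataFrame({'count': counts,
                        'per': aspercent(df['mark'].value_counts(normalize=True), decimals=1)})
print(summary)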
Create two series, first one with absolute values and a second one with percentages, and concatenate them:
import pandas as pd
d = {'mark': ['pos', 'pos', 'pos', 'pos', 'pos',
'neg', 'neg', 'neg', 'neg',
'neutral', 'neutral' ]}
df = pd.DataFrame(d)
absolute = df['mark'].value_counts(normalize=False)
absolute.name = 'value'
percentage = df['mark'].value_counts(normalize=True)
percentage.name = '%'
percentage = (percentage*100).round(2)
pd.concat([absolute, percentage], axis=1)
Output:
value %
pos 5 45.45
neg 4 36.36
neutral 2 18.18

How can I shorten this Python code?

I am making lists based on some conditions. This is what it looks like:
def time_price_pair(a, b):
    if 32400 <= a and a < 32940:
        a_list = []
        a_list.append(b)
    elif 32940 <= a and a < 33480:
        b_list = []
        b_list.append(b)
    elif 33480 <= a and a < 34020:
        c_list = []
        c_list.append(b)
    ......
    ......
    ......
    elif 52920 <= a and a < 53460:
        some_list = []
        some_list.append(b)
Each boundary increases by 540, like [32400, 32940, 33480, 34020, 34560, 35100, 35640, 36180, 36720, 37260, 37800, 38340, 38880, 39420, ..., 53460], and the list names don't matter.
I would use a dict to store these lists of values, and use some math to work out where each number should go:
from collections import defaultdict

lists = defaultdict(list)

def time_price_pair(a, b):
    if 32400 <= a < 53460:
        i = (a - 32400) // 540
        lists[i].append(b)
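For illustration, a quick usage sketch with hypothetical time/price values:
time_price_pair(32500, 19.99)   # 100 seconds past 32400 -> bucket 0
time_price_pair(33000, 21.50)   # 600 seconds past 32400 -> bucket 1
time_price_pair(32600, 20.25)   # bucket 0 again
print(dict(lists))              # {0: [19.99, 20.25], 1: [21.5]}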
You can just use a for loop with some incrementing variable i and keep updating the requirements. Something like this:
def time_price_pair(a, b):
    min = 32400
    max = 32940
    inc = 540
    for i in range(some value):
        if min + inc*i <= a < max + inc*i:
            b = min + inc*i
            a_list = [b]
It looks like a simple high-level pandas function pd.cut would suit your purpose very well.
import pandas as pd
import numpy as np
# simulate your data
# ==================================
np.random.seed(0)
a = np.random.randint(32400, 53439, size=1000000)
b = np.random.randn(1000000)
# put them in dataframe
df = pd.DataFrame(dict(a=a, b=b))
print(df)
a b
0 35132 -0.4605
1 43199 -0.9469
2 42245 0.2580
3 52048 -0.7309
4 45523 -0.4334
5 41625 2.0155
6 53157 -1.4712
7 46516 -0.1715
8 47335 -0.6594
9 47830 -1.0391
... ... ...
999990 39754 0.8771
999991 34779 0.7030
999992 37836 0.5409
999993 44330 -0.6747
999994 41078 -1.1368
999995 38752 1.6121
999996 42155 -0.1139
999997 49018 -0.1737
999998 45848 -1.2640
999999 50669 -0.4367
# processing
# ===================================
rng = np.arange(32400, 53461, 540)
# your custom labels
labels = np.arange(1, len(rng))
# use pd.cut()
%time df['cat'] = pd.cut(df.a, bins=rng, right=False, labels=labels)
CPU times: user 52.5 ms, sys: 16 µs, total: 52.5 ms
Wall time: 51.6 ms
print(df)
a b cat
0 35132 -0.4605 6
1 43199 -0.9469 20
2 42245 0.2580 19
3 52048 -0.7309 37
4 45523 -0.4334 25
5 41625 2.0155 18
6 53157 -1.4712 39
7 46516 -0.1715 27
8 47335 -0.6594 28
9 47830 -1.0391 29
... ... ... ..
999990 39754 0.8771 14
999991 34779 0.7030 5
999992 37836 0.5409 11
999993 44330 -0.6747 23
999994 41078 -1.1368 17
999995 38752 1.6121 12
999996 42155 -0.1139 19
999997 49018 -0.1737 31
999998 45848 -1.2640 25
999999 50669 -0.4367 34
[1000000 rows x 3 columns]
# groupby
grouped = df.groupby('cat')['b']
# access to a particular group using your user_defined key
grouped.get_group(1).values
array([ 0.4525, -0.7226, -0.981 , ..., 0.0985, -1.4286, -0.2257])
A dictionary could be used to hold all of the used time range bins as follows:
import collections

time_prices = [(32401, 20), (32402, 30), (32939, 42), (32940, 10), (32941, 15), (40000, 123), (40100, 234)]
dPrices = collections.OrderedDict()
for atime, aprice in time_prices:
    abin = 32400 + ((atime - 32400) // 540) * 540  # For bins as times
    #abin = (atime - 32400) // 540 + 1             # For bins starting from 1
    dPrices.setdefault(abin, []).append(aprice)

# Display results
for atime, prices in dPrices.items():
    print atime, prices
This would give you the following output:
32400 [20, 30, 42]
32940 [10, 15]
39960 [123, 234]
Or individually as:
print dPrices[32400]
[20, 30, 42]
Tested using Python 2.7

Speed up numpy.where for extracting integer segments?

I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from lineprofiler is below, and this shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way in which the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))

    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim+6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop+6,
                                None) for dim in slice_)
        indices = seg_image_view == j+1

        obj_avg = hot_image_view[indices].mean()

        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated

        border_avg = hot_image_padded[new_slice][border == 1].mean()

        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))
    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 & 47; you could add seg_mask = seg_image == i and then reuse that result (it might also be good to separate out that piece for profiling purposes).
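For illustration, here is a minimal sketch of that caching; the arrays and the update rule are placeholders, not the code from the question:
import numpy as np

seg_image = np.random.randint(0, 6, size=(1000, 1000))   # hypothetical label image
hot_image = np.random.rand(1000, 1000)
new_hot = hot_image.copy()

for i in range(1, seg_image.max() + 1):
    seg_mask = seg_image == i            # computed once per segment...
    obj_avg = hot_image[seg_mask].mean()
    new_hot[seg_mask] -= obj_avg         # ...and reused for both the read and the write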
While there are some other minor things that you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm to accomplish the same thing, but that's the first place I'd try and look for a speedup.
