Minimize two variables with scipy optimize - python

I want to fit two learning rates (alpha), one for the first half of the data and one for the second half. I was able to do this for just one learning rate but am running into errors when attempting to fit two.
These functions work when minimizing one variable:
optimize.fminbound(sse_f,0,1)
minimize_scalar(sse_f, bounds=(0,1), method='bounded')
I'm not sure if I can use fminbound/minimize_scalar or if I should use minimize() for two learning rates (alphas).
My function looks something like this (I've removed a few lines for simplicity), but basically I want to minimize SSE for the first half of the data and for the second half separately.
def sse_f(a_1, a_2):
    data = []
    for _, row in temp.iterrows():
        alpha = a_1  # for the first 120 rows use one alpha
        if row['trial_nr'] == 120:  # for rows after 120 use the second alpha
            alpha = a_2
        # calculate diff variables, removed for simplicity
        data.append([phase, pi, abs_PE])
    col = ['phase', 'abs_PE', 'congruency']
    df_ = pd.DataFrame(data, columns=col)
    df_a = df_[df_['phase'] == 'a']
    df_b = df_[df_['phase'] == 'b']
    x = np.array(df_a[['congruency', 'abs_PE']])  # run linear regression for first half
    y = df_a['RT']
    sse_a = lin_reg(x, y)
    x_b = np.array(df_b[['congruency', 'abs_PE']])  # run linear regression for second half
    y_b = df_b['RT']
    sse_b = lin_reg(x_b, y_b)  # calculate SSE
    return sse_a, sse_b  # return SSE for first half and second half of data
The output of this function would be a tuple, e.g.:
sse_f(.43,.43)
(54487466.6875101, 17251575.11206138)
If I use minimize() I get this error:
minimize(sse_f, x0=(0,0), bounds=[(0,0),(1,1)])
TypeError: sse_f() missing 1 required positional argument: 'a_2'
And if I use minimize_scalar() I get this error:
ValueError: Optimisation bounds must be scalars or array scalars.
Any pointers regarding how to fit two alphas, or why I get these errors, would be greatly appreciated!

I just changed your function definition so that the single parameter array passed in by the optimizer is unpacked into a_1 and a_2 individually. I wanted to test it, but I don't know what the variable temp is. Note also that minimize() expects bounds as one (minimum, maximum) pair per parameter, i.e. bounds=[(0, 1), (0, 1)] rather than [(0, 0), (1, 1)].
def sse_f(a):
    a_1 = a[0]
    a_2 = a[1]
    data = []
    for _, row in temp.iterrows():
        alpha = a_1  # for the first 120 rows use one alpha
        if row['trial_nr'] == 120:  # for rows after 120 use the second alpha
            alpha = a_2
        # calculate diff variables, removed for simplicity
        data.append([phase, pi, abs_PE])
    col = ['phase', 'abs_PE', 'congruency']
    df_ = pd.DataFrame(data, columns=col)
    df_a = df_[df_['phase'] == 'a']
    df_b = df_[df_['phase'] == 'b']
    x = np.array(df_a[['congruency', 'abs_PE']])  # run linear regression for first half
    y = df_a['RT']
    sse_a = lin_reg(x, y)
    x_b = np.array(df_b[['congruency', 'abs_PE']])  # run linear regression for second half
    y_b = df_b['RT']
    sse_b = lin_reg(x_b, y_b)  # calculate SSE
    return sse_a, sse_b  # return SSE for first half and second half of data
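One more point (a minimal sketch, untested since I don't have temp): minimize() needs the objective to return a single scalar, but sse_f returns a tuple of two SSEs. If you are happy to simply add the two SSEs together (that combination is my assumption, not something from your post), the call could look like this:
from scipy.optimize import minimize

def total_sse(a):
    sse_a, sse_b = sse_f(a)  # sse_f as redefined above, taking one array-like argument
    return sse_a + sse_b     # combine the two SSEs into one scalar objective

res = minimize(total_sse, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)])  # x0 is just a starting guess
print(res.x)  # fitted values for a_1 and a_2
Alternatively, if the two halves really are independent of each other, two separate minimize_scalar calls (one per half) would work as well.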
Let me know if it helped.


Filtering the two frequencies with highest amplitudes of a signal in the frequency domain

I have tried to filter out the two frequencies which have the highest amplitudes. I am wondering whether the result is correct, because the filtered signal seems less smooth than the original.
Is it correct that the output of the FFT function contains the fundamental frequency A0/C0, and is it correct to include it in the search for the highest amplitude (it is indeed the highest!)?
My code (based on my professor's and colleagues' code, which I do not yet understand in every detail):
# signal
data = np.loadtxt("profil.txt")
t = data[:,0]
x = data[:,1]
x = x - np.mean(x) # remove the mean from the signal
n = len(t)
max_ind = int(n/2-1)
dt = (t[n-1]-t[0])/(n-1)
T = n*dt
df = 1./T
# Fast-Fourier-Transformation
c = 2.*np.absolute(fft(x))/n #get the power spectrum c from the array of complex numbers
c[0] = c[0]/2. #correction for c0 (fundamental frequency)
f = np.fft.fftfreq(n, d=dt)
a = fft(x).real
b = fft(x).imag
n_fft = len(a)
# filter
p = np.ones(len(c))
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-1)]] = 0 #setting the positions of p to 0 with
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-2)]] = 0 #the indices from the argsort function
print(c[0:int(len(c)/2-1)].argsort()[int(n_fft/2-2)]) #over the first half of the c array,
ab_filter_2 = fft(x) #because the second half contains the
ab_filter_2.real = a*p #negative frequencies.
ab_filter_2.imag = b*p
x_filter2 = ifft(ab_filter_2)*2
I do not quite get the whole deal about the FFT returning negative and positive frequencies. I know they are just mirrored, but then why can I not search over the whole array? And does the iFFT function work with an array of just the positive frequencies?
The resulting plot (blue: original, red: filtered):
This part is very wasteful:
a = fft(x).real
b = fft(x).imag
You're computing the FFT twice here for no good reason, you compute it a third time later, and you already computed it once before. You should compute it only once, not four times; the FFT is the most expensive part of your code.
Then:
ab_filter_2 = fft(x)
ab_filter_2.real = a*p
ab_filter_2.imag = b*p
x_filter2 = ifft(ab_filter_2)*2
Replace all of that with:
out = ifft(fft(x) * p)
Here you do the same thing twice:
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-1)]] = 0
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-2)]] = 0
But you set only the left half of the filter. It is important to make a symmetric filter. There are two locations where abs(f) has the same value (up to rounding errors!), there is a positive and a negative frequency that go together. Those two locations should have the same filter value (actually complex conjugate, but you have a real-valued filter so the difference doesn’t matter in this case).
I’m unsure what that indexing does anyway. I would split out the statement into shorter parts on separate lines for readability.
I would do it this way:
import numpy as np
x = ...
x -= np.mean(x)
fft_x = np.fft.fft(x)
c = np.abs(fft_x) # no point in normalizing, doesn't change the order when sorting
f = c[0:len(c)//2].argsort()
f = f[-2:] # the last two elements are the indices to the largest two frequency components
p = np.zeros(len(c))
p[f] = 1 # preserve the largest two components
p[-f] = 1 # set the same components counting from the end
out = np.fft.ifft(fft_x * p).real
# note that np.fft.ifft(fft_x * p).imag is approximately zero if the filter is created correctly
Is it correct that the output of the FFT-function contains the fundamental frequency A0/ C0 […]?
In principle yes, but you subtracted the mean from the signal, effectively setting the fundamental frequency (DC component) to 0.
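A quick sketch (with made-up data) to illustrate this: the DC bin of the FFT is just the sum of the samples, so it becomes numerically zero once the mean is removed.
import numpy as np

x = np.random.randn(256)   # any signal
x -= np.mean(x)            # remove the mean, as in the question
print(np.fft.fft(x)[0])    # ~0+0j, up to floating-point error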

Remove outliers from Pandas Dataframe (Circular Data)

I would like to remove outliers from a Pandas dataframe using a user-defined function. There are some answers to the same question on Stack Overflow, but the difference is that my data are circular. Therefore, using the Pandas built-in functions mean() and std() would not be appropriate. For example, in circular data the values 355 and 5 differ by only 10, but the linear difference is 350.
I have thousands of dataframes like the one below. We clearly see that Geophone 6 is an outlier.
Geophone azimuth incidence
0 1 194.765326 29.703151
1 2 193.143982 23.380681
2 3 199.327911 34.752212
3 4 195.641010 49.186893
4 5 193.479015 21.192982
5 6 0.745142 3.410046
6 7 192.380435 29.778807
7 8 196.700814 19.750237
It can also be confirmed when plotting the data in a polar diagram.
I have written two functions, mean_angle and variance_angle, which calculate the circular mean and variance to be applied to the data. The variance gives a value between 0 and 1: when the data are close to each other the variance gets closer to 0, and vice versa.
import numpy as np
def mean_angle(deg):
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    mu = np.arctan(S/C)
    mu = np.rad2deg(mu)
    if S > 0 and C > 0:
        mu = mu
    elif S > 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C > 0:
        mu = mu + 360
    return mu

def variance_angle(deg):
    """
    deg: angles in degrees
    """
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    length = C.size
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    R = np.sqrt(S**2 + C**2)
    R_avg = R/length
    V = 1 - R_avg
    return V
mean_azimuth = mean_angle(df.azimuth)
variance = variance_angle(df.azimuth)
print(mean_azimuth)
197.4122778774279
print(variance)
0.24614383460498535
However, when excluding row 5 from the calculation, the mean and variance become 195.06226604362286 and 0.0007544067627361928, respectively: the variance drops from 0.25 to almost 0.
Therefore, I would like to find a way to remove any circular outlier values (azimuth) that make the circular variance high, using the functions defined above.
In this example incidence is also an outlier for the same Geophone, but it actually has no relation to azimuth. There are other data where incidence is within range but azimuth is an outlier.
Any help is really appreciated.
One way to do outlier detection is to compute the mean and std of the data, then remove points that lie outside A*std of the mean (where you tune A to be whatever is reasonable for your data).
So you could use your functions to compute the circular mean and variance of your dataframe, then pass over the dataframe again to remove data points outside A*std of the mean.
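A rough sketch of what that could look like with your mean_angle and variance_angle functions; the circular standard deviation formula and the cutoff A = 2 are my assumptions, so tune them to your data:
import numpy as np

def remove_circular_outliers(df, col='azimuth', A=2.0):
    mu = mean_angle(df[col])      # circular mean (your function)
    V = variance_angle(df[col])   # circular variance (your function)
    # circular standard deviation in degrees, derived from R_avg = 1 - V
    std = np.rad2deg(np.sqrt(-2.0 * np.log(1.0 - V)))
    # signed circular difference from the mean, wrapped into [-180, 180)
    diff = (df[col] - mu + 180.0) % 360.0 - 180.0
    return df[np.abs(diff) <= A * std]

df_clean = remove_circular_outliers(df, A=2.0)
On the example dataframe above this drops Geophone 6, whose azimuth sits roughly 160 degrees away from the circular mean, and keeps the rest.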

How can I remove sharp jumps in data?

I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold- this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
e.g.
df_diff = df.diff(60) # period of about 60 makes jumps stick out
filter_index = np.nonzero((df.Temp <-1) | (df.Temp>0.5)) # when diff is less than -1 and greater than 0.5, most likely data jumps.
However, I find myself stuck here. The main problem is:
1) I don't know how to use this index list to delete the non-skin data in df. What is the best way to do this?
The more minor problem is:
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to throw away good data). Is there either a better filtering strategy, or a way to then get rid of these artefacts?
*Edit: as suggested, I've also calculated the second-order diff, but to be honest, I think the first-order diff would allow for tighter thresholds (see below):
*Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot() #keep only the low rate-of-change points (the jumps are excluded)
df[1].plot() #plot original data
plt.show()
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
Your first question I believe is answered with the .loc selection above.
Your second question will take some experimentation with your dataset. The code above only selects out high-derivative data. You'll also need your threshold selection to remove zeroes or the like. You can experiment with where to make the derivative selection. You can also plot a histogram of the derivative to give you a hint as to what to select out.
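For example, with the synthetic df above, a quick look at that distribution could be:
df[2].hist(bins=100)  # distribution of the absolute finite difference
plt.show()            # pick a cutoff where the long tail of large values begins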
Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
Edit:
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
(df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
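As a small illustration for the fourth-order expression above (assuming the 0.005 spacing of the synthetic data):
dt = 0.005          # sample spacing used in the synthetic example
deriv = df[2] / dt  # rescales the (absolute) difference into a derivative magnitude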
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10
    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0, 1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range(pd.Timestamp(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
    df = pd.DataFrame({'dates': dates, 'trend1': trend1, 'y1': y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    # Gaussian noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + np.sin(trend1)
    df['trend2'] = 1 / (np.cos(trend1) / 1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    df['Temp'][20:31] = np.nan
    # Insert spikes and missing values
    df['Temp'][19] = df['Temp'][39] / 4000
    df['Temp'][31] = df['Temp'][15] / 4000
    return df
# Dataframe with random data
df_raw = sample()
df_raw.plot()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data:
But for the second spike, you're going to get a (non-existing) difference between a very small number and a missing number, so the extreme data point you'll end up removing is the difference between your outlier and the next observation:
This is not a huge problem for one single observation. You could just fill it right back in. But for larger data sets that would not be a very viable solution. Anyway, if you can manage without that particular value, the code below should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.iloc[:, 0].to_frame()
# 8. Fill missing values
df_complete = df_out.ffill()
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
df_complete.plot()
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10
    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0, 1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range(pd.Timestamp(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
    df = pd.DataFrame({'dates': dates, 'trend1': trend1, 'y1': y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    # Gaussian noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + np.sin(trend1)
    df['trend2'] = 1 / (np.cos(trend1) / 1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    df['Temp'][20:31] = np.nan
    # Insert spikes and missing values
    df['Temp'][19] = df['Temp'][39] / 4000
    df['Temp'][31] = df['Temp'][15] / 4000
    return df
# A function for removing outliers
def noSpikes(df, level, keepFirst):
    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns
    # 2. Take the first difference
    df_diff = df.diff()
    # 3. Remove missing values
    df_clean = df_diff.dropna()
    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    # 6.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
    # 7. Keep only the first column
    df_out = df_out.iloc[:, 0].to_frame()
    # 8. Fill missing values
    df_complete = df_out.ffill()
    # 9. Reset column names
    df_complete.columns = colName
    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]
    return df_complete
# Dataframe with random data
df_raw = sample()
df_raw.plot()
# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
df_cleaned.plot()

How to get the P Value in a Variable from OLSResults in Python?

The OLSResults of
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
print(fit.summary())
shows the P values of each attribute to only 3 decimal places.
I need to extract the p value for each attribute like Distance, CarrierNum etc. and print it in scientific notation.
I can extract the coefficients using fit.params[0] or fit.params[1] etc.
Need to get it for all their P values.
Also, what does it mean if all the P values are 0?
You need to do fit.pvalues[i] to get the answer where i is the index of independent variables. i.e. fit.pvalues[0] for intercept, fit.pvalues[1] for Distance, etc.
You can also look for all the attributes of an object using dir(<object>).
Instead of using fit.summary() you could use fit.pvalues[attributeIndex] in a for loop to print the p-values of all your features/attributes as follows:
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
for attributeIndex in range(len(fit.pvalues)):
    print(fit.pvalues[attributeIndex])
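Since you specifically want scientific notation, you can also format the values when printing, for example (looping over the fitted results from above):
for name, p in fit.pvalues.items():
    print(f"{name}: {p:.3e}")  # prints each p-value in scientific notation, e.g. 1.234e-08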
==========================================================================
Also, what does it mean if all the P values are 0?
It might be a good outcome. The p-value for each term tests the null hypothesis that the corresponding coefficient (b1, b2, ..., bn) is equal to zero, i.e. that the term has no effect in the fitted equation y = b0 + b1*x1 + b2*x2 + ... A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable (y).
On the other hand, a larger (insignificant) p-value suggests that changes in the predictor are not correlated to changes in the response.
I have used this solution
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
model = sm.OLS(Y, X).fit()
# Following code snippet will generate sorted dataframe with feature name and it's p-value.
# Hence, you will see most relevant features on the top (p-values will be sorted in ascending order)
d = {}
for i in X.columns.tolist():
    d[i] = model.pvalues[i]

df_pvalue = pd.DataFrame(d.items(), columns=['Var_name', 'p-Value']).sort_values(by='p-Value').reset_index(drop=True)

Looking for a way to adjust the values of one array based on another array?

I started with a set of bivariate data. My goal is to first find points in that data set for which the y-values are outliers. Then, I wanted to create a new data set that included not only the outlier points, but also any points with an x-value of within 0.01 of any given outlier point.
Then (if possible) I want to subtract the original outlier x-values from the new x-set, so that I have a group of points with x-values of between -0.01 and 0.01, with x-value now indicating distance from an original outlier x-value.
I have this code:
import numpy as np
mean = np.mean(y)
SD = np.std(y)
x_indices = [i for i in range(len(y)) if ((y[i]) > ((2*SD)+mean))]
expanded_indices = [i for i in range(len(x)) if np.any((abs(x[i] - x[x_indices])) < 0.01)]
This worked great, and now I can call (and plot) x and y using the indices:
plt.plot(x[expanded_indices],y[expanded_indices])
However, I have no idea how to subtract the original "x_indices" values to get an x range of -0.01 to 0.01, since everything I tried failed.
I want to do something like what I have below, except I know that I can't subtract two arrays of different sizes, and I'm worried I can't use np.any in this context either.
x_values = [(x[expanded_indices] - x[indices]) if np.any((abs(x[expanded_indices] - x[indices])) < 0.01)]
Any ideas? I'm sorry this is so long -- I'm very new at this and pretty lost. I've been giving it a go for the last few hours and any assistance would be appreciated. Thanks!
sample data could be as follows:
x = [0, 0.994, 0.995, 0.996, 0.997, 0.998, 1.134, 1.245, 1.459, 1.499, 1.500, 1.501, 2.103, 2.104, 2.105, 2.106]
y = [1.5, 1.6, 1.5, 1.6, 10, 1.5, 1.5, 1.5, 1.6, 1.6, 1.5, 1.6, 1.5, 11, 1.6, 1.5]
Once you have the set with the y-outlier values and the set with the expanded values, you can go over the whole second set and subtract the corresponding first-set value using two for loops:
import numpy as np
x =np.array([0,0.994,0.995,0.996,0.997,0.998,1.134,1.245,1.459,1.499,1.500,1.501,2.103,2.104,2.105,2.106])
y = np.array([1.5,1.6,1.5,1.6,10,1.5,1.5,1.5,1.6,1.6,1.5,1.6,1.5,11,1.6,1.5])
mean = np.mean(y)
SD = np.std(y)
# elements with y-element outside defined region
indices = [i for i in range(len(y)) if ((y[i]) > ((2*SD)+mean))]
my_1st_set = x[indices]
# Set with values within 0.01 difference with 1st set points
expanded_indices = [i for i in range(len(x)) if np.any(abs(x[i] - x[indices]) < 0.01)]
my_2nd_set = x[expanded_indices]
# A final set with the subtracted values from the 2nd set
my_final_set = my_2nd_set.copy()
for i in range(my_final_set.size):
    for j in range(my_1st_set.size):
        if abs(my_2nd_set[i] - my_1st_set[j]) < 0.01:
            my_final_set[i] = my_2nd_set[i] - my_1st_set[j]
            break
my_final_set is a numpy array holding the values at expanded_indices with their corresponding first-set value subtracted.
Let's see if I understood you correctly. This code should find the outliers, and put an array into res for each outlier.
import numpy as np

x = np.array([0, 0.994, 0.995, 0.996, 0.997, 0.998, 1.134, 1.245, 1.459, 1.499, 1.500, 1.501, 2.103, 2.104, 2.105, 2.106])
y = np.array([1.5, 1.6, 1.5, 1.6, 10, 1.5, 1.5, 1.5, 1.6, 1.6, 1.5, 1.6, 1.5, 11, 1.6, 1.5])

mean = np.mean(y)
SD = np.std(y)
outlier_indices = np.abs(y - mean) > 2*SD

res = []
for x_at_outlier in x[np.flatnonzero(outlier_indices)]:
    part_res = x[np.abs(x - x_at_outlier) < 0.01]
    part_res -= np.mean(part_res)
    res.append(part_res)
res is now a list of arrays, with each array containing the values around one outlier. Perhaps it is easier to continue working with the data in this format?
If you want all of them in one numpy array:
res = np.hstack(res)
