Remove outliers from Pandas Dataframe (Circular Data) - python

I would like to remove outliers from a Pandas dataframe using a user-defined function. There are answers to similar questions on Stack Overflow, but the difference is that my data set is circular data, so using the Pandas built-in functions mean() and std() would not be appropriate. For example, in circular data the values 355 and 5 differ by only 10, whereas the linear difference is 350.
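(For reference, a minimal sketch of that wrapped difference, folding the result into the range [-180, 180):)
# circular difference between 355 and 5 degrees, folded into [-180, 180)
diff = (355 - 5 + 180) % 360 - 180   # -> -10, i.e. a magnitude of 10 degrees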
I have thousands of dataframes like the one below. We clearly see that Geophone 6 is an outlier.
   Geophone     azimuth  incidence
0         1  194.765326  29.703151
1         2  193.143982  23.380681
2         3  199.327911  34.752212
3         4  195.641010  49.186893
4         5  193.479015  21.192982
5         6    0.745142   3.410046
6         7  192.380435  29.778807
7         8  196.700814  19.750237
It can also be confirmed when plotting the data in a polar diagram.
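(For reference, a minimal matplotlib sketch of such a polar plot, using the column names of the dataframe above:)
import numpy as np
import matplotlib.pyplot as plt

# polar scatter: angle = azimuth (compass bearing), radius = incidence
ax = plt.subplot(111, projection='polar')
ax.set_theta_zero_location('N')   # 0 degrees at the top
ax.set_theta_direction(-1)        # angles increase clockwise, like a compass
ax.scatter(np.deg2rad(df['azimuth']), df['incidence'])
plt.show()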
I have written two functions, mean_angle and variance_angle, which calculate the circular mean and variance, to be applied to the data. The variance gives a value between 0 and 1: when the data points are close to each other the variance gets closer to 0, and vice versa.
import numpy as np

def mean_angle(deg):
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    mu = np.arctan(S/C)
    mu = np.rad2deg(mu)
    if S > 0 and C > 0:
        mu = mu
    elif S > 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C > 0:
        mu = mu + 360
    return mu

def variance_angle(deg):
    """
    deg: angles in degrees
    """
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    length = C.size
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    R = np.sqrt(S**2 + C**2)
    R_avg = R / length
    V = 1 - R_avg
    return V
mean_azimuth = mean_angle(df.azimuth)
variance = variance_angle(df.azimuth)
print(mean_azimuth)
197.4122778774279
print(variance)
0.24614383460498535
However, when row 5 is excluded from the calculation, the mean and variance become 195.06226604362286 and 0.0007544067627361928 respectively; the variance drops from about 0.25 to almost 0.
Therefore, I would like to find a way to remove any circular outlier value(s) in azimuth that make the circular variance high, using the functions defined above.
In this example incidence is also an outlier for the same geophone, but it actually has no relation to azimuth. There are other data where incidence is within range but azimuth is an outlier.
Any help is really appreciated.

One way to do outlier detection is to compute the mean and std of the data, then remove points that lie outside A*std of the mean (where you tune A to whatever is reasonable for your data).
So you could use your functions to compute the circular mean and variance of your dataframe, then pass over the dataframe again to remove data points outside A*std of the mean; a sketch follows below.
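A minimal sketch of that idea for the circular case, reusing the mean_angle and variance_angle functions from the question (the circular standard deviation sqrt(-2*ln(R)) and the cutoff A are assumptions to tune, not part of the original post):
import numpy as np

def drop_circular_outliers(df, col='azimuth', A=2.0):
    # circular mean (degrees) and mean resultant length R (variance V = 1 - R)
    mu = mean_angle(df[col])
    R = 1.0 - variance_angle(df[col])
    circ_std = np.rad2deg(np.sqrt(-2.0 * np.log(R)))  # circular standard deviation, degrees
    # signed angular difference to the mean, folded into [-180, 180)
    diff = (df[col] - mu + 180.0) % 360.0 - 180.0
    return df[np.abs(diff) <= A * circ_std]

cleaned = drop_circular_outliers(df, col='azimuth', A=2.0)
On the example dataframe this should drop the Geophone 6 row (its folded difference to the circular mean is about 163 degrees) while keeping the others.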

Related

Filtering the two frequencies with highest amplitudes of a signal in the frequency domain

I have tried to filter the two frequencies which have the highest amplitudes. I am wondering if the result is correct, because the filtered signal seems less smooth than the original?
Is it correct that the output of the FFT-function contains the fundamental frequency A0/ C0, and is it correct to include it in the search of the highest amplitude (it is indeed the highest!) ?
My code (based on my professor's and colleagues' code; I have not understood every detail of it so far):
import numpy as np
from numpy.fft import fft, ifft

# signal
data = np.loadtxt("profil.txt")
t = data[:, 0]
x = data[:, 1]
x = x - np.mean(x)  # reduce signal to zero mean
n = len(t)
max_ind = int(n/2 - 1)
dt = (t[n-1] - t[0]) / (n - 1)
T = n * dt
df = 1. / T
# Fast Fourier Transform
c = 2. * np.absolute(fft(x)) / n  # get the power spectrum c from the array of complex numbers
c[0] = c[0] / 2.  # correction for c0 (fundamental frequency)
f = np.fft.fftfreq(n, d=dt)
a = fft(x).real
b = fft(x).imag
n_fft = len(a)
# filter: set the positions of p to 0 with the indices from the argsort function
# over the first half of the c array, because the second half contains the
# negative frequencies
p = np.ones(len(c))
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2 - 1)]] = 0
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2 - 2)]] = 0
print(c[0:int(len(c)/2 - 1)].argsort()[int(n_fft/2 - 2)])
ab_filter_2 = fft(x)
ab_filter_2.real = a * p
ab_filter_2.imag = b * p
x_filter2 = ifft(ab_filter_2) * 2
I do not quite get the whole deal about the FFT returning negative and positive frequencies. I know they are just mirrored, but then why can I not search over the whole array? And does the iFFT work with an array of just the positive frequencies?
The resulting plot (blue: original, red: filtered):
This part is very wasteful:
a = fft(x).real
b = fft(x).imag
You’re computing the FFT twice for no good reason. You compute it a 3rd time later, and you already computed it once before. You should compute it only once, not 4 times. The FFT is the most expensive part of your code.
Then:
ab_filter_2 = fft(x)
ab_filter_2.real = a*p
ab_filter_2.imag = b*p
x_filter2 = ifft(ab_filter_2)*2
Replace all of that with:
out = ifft(fft(x) * p)
Here you do the same thing twice:
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-1)]] = 0
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-2)]] = 0
But you set only the left half of the filter. It is important to make a symmetric filter: for each positive frequency there is a negative frequency where abs(f) has the same value (up to rounding errors!), and the two go together. Those two locations should get the same filter value (actually complex conjugate values, but you have a real-valued filter so the difference doesn't matter in this case).
I’m unsure what that indexing does anyway. I would split out the statement into shorter parts on separate lines for readability.
I would do it this way:
import numpy as np
x = ...
x -= np.mean(x)
fft_x = np.fft.fft(x)
c = np.abs(fft_x) # no point in normalizing, doesn't change the order when sorting
f = c[0:len(c)//2].argsort()
f = f[-2:] # the last two elements are the indices to the largest two frequency components
p = np.zeros(len(c))
p[f] = 1 # preserve the largest two components
p[-f] = 1 # set the same components counting from the end
out = np.fft.ifft(fft_x * p).real
# note that np.fft.ifft(fft_x * p).imag is approximately zero if the filter is created correctly
Is it correct that the output of the FFT-function contains the fundamental frequency A0/ C0 […]?
In principle yes, but you subtracted the mean from the signal, effectively setting the fundamental frequency (DC component) to 0.
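A quick way to see this (a tiny check, not part of the original code):
import numpy as np

x = np.random.randn(1024) + 5.0     # any signal with a nonzero mean
print(np.fft.fft(x)[0])             # roughly n * mean(x): a large DC component
print(np.fft.fft(x - x.mean())[0])  # approximately 0 after subtracting the mean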

Apply a function across all rows in new column creation Pandas

I have the following dataset where I make my predictions, and historically I know the standard deviations on these predictions:
import pandas as pd

d = {'Name': ['Jim', 'Matt', 'Alex', 'Nathan', 'Dom'],
     'Predict': [2.901826509, 3.212149337, 2.388237651, 3.744206058, 1.944415024]}
df = pd.DataFrame(data=d)
df['Mean'] = 4
df['StDev'] = 6
df.head(5)
     Name   Predict  Mean  StDev
0     Jim  2.901827     4      6
1    Matt  3.212149     4      6
2    Alex  2.388238     4      6
3  Nathan  3.744206     4      6
4     Dom  1.944415     4      6
I have also found a function from https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f, which looks like this:
import numpy as np
from scipy.stats import norm

def MC_prob(M, mu, sigma):
    prob_larger_than3 = []
    for i in range(M):
        # Using CDF since P[Z>=3] = 1-P[Z<=3]
        p = 1 - norm.cdf(3, mu, sigma)
        # Using Survival Function P[Z>=3]
        p = norm.sf(3, mu, sigma)
        prob_larger_than3.append(p)
    MC_approximation_prob = np.array(prob_larger_than3).mean()
    return MC_approximation_prob

MC_prob(M=10000, mu=10, sigma=2)
0.9997673709209641
I would like to apply this function and create a new column in my dataframe, with the probability of my Predict column being over 3.
I tried:
df['ProbOver3'] = MC_prob(M = 10000, mu = df.Predict, sigma = df.StDev)
but it gave the same value for every row. Any ideas on how to apply this over every row? Essentially I am trying to simulate and return the probability of each row being above or below certain numbers, and I hope I am on the right track. It's a follow-up question to this one: Apply a monte carlo simulation on a pandas dataframe and return probability result in column
Any help would be much appreciated, thanks very much!
Use df.apply() with a lambda. You can apply (pun intended) this function to every row to make a new column by passing axis=1, which makes the function operate row-wise. Then use a lambda to pass each row's values to the function. Here is how you could use this:
df['ProbOver3'] = df.apply(lambda row: MC_prob(10000, row['Predict'], row['StDev']), axis=1)
Check out the docs on df.apply for more info.
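As a side note: since MC_prob evaluates the same closed-form survival function on every loop iteration, its Monte Carlo average equals a single norm.sf call, so a vectorized sketch that skips both the loop and apply would be:
from scipy.stats import norm

# norm.sf broadcasts over array inputs, so the whole column is computed at once
df['ProbOver3'] = norm.sf(3, loc=df['Predict'], scale=df['StDev'])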

Python matrix tagging

I have an algorithm I want to implement and I'm trying to figure out the best way to do it.
I have a matrix H of size mxn (m - number of last inputs - sliding window, n - number of attributes).
I have a set of attributes A, and I want to find correlations between the attributes.
My problem is: how can I tag a matrix column/row with a name?
This is the algorithm I'm trying to implement:
Attributes a_i, a_j are extracted from H and denoted H_i^T, H_j^T (where T denotes transpose).
We then apply the Pearson correlation to them, denoted ρ_ij.
for example:
If we have:
H (m×n = 4×3):
   IQ  Height  Weight
   30     180      80
   30     170      60
   40     183      85
   10     190      95
ct = 0.7
A = {IQ, Height, Weight}
Then the result we should get is:
CS = {(C,0)}
Where C = {Height, Weight}
I would also love to get any visualization tool recommendations.
Thanks for your help!
Pandas is your best friend when it comes to tabular data. I'm not an expert on linear algebra notation, but it seems like what you're trying to do is append a tuple to a set if the two items in the tuple are correlated by more than the threshold value, i.e. if Height and Weight have a correlation coefficient > 0.7, then add those two attributes to the list CS. I would do something like this:
import pandas as pd
import seaborn as sns

df = pd.DataFrame.from_dict({
    "IQ": [30, 30, 40, 10],
    "Height": [180, 170, 183, 190],
    "Weight": [80, 60, 85, 95]
})

lst = []
threshold = 0.7
p_arr = df.corr().to_dict()
for attr in p_arr:
    for sub_attr in p_arr[attr]:
        p = p_arr[attr][sub_attr]
        if attr != sub_attr and p > threshold:
            lst.append(((attr, sub_attr), p))
produces:
[(('Height', 'Weight'), 0.9956654266839726),
(('Weight', 'Height'), 0.9956654266839726)]
and for a correlation heatmap:
sns.heatmap(df.corr())
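Note that the loop above reports each correlated pair twice, once per ordering, as the output shows. A variant (sketch) that visits each unordered pair of columns exactly once:
from itertools import combinations

corr = df.corr()
CS = []
for a, b in combinations(df.columns, 2):   # each unordered pair exactly once
    p = corr.loc[a, b]
    if p > threshold:
        CS.append(({a, b}, p))
# -> [({'Height', 'Weight'}, 0.9956654266839726)]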

Gap Statistics with Standard 1 error

I have implemented k-means using scikit-learn, and I have tried the elbow method and the silhouette coefficient to find the optimal K. I am planning to use the gap statistic to further verify my results.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        for i in range(nrefs):
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            refDisp = km.inertia_
            refDisps[i] = refDisp
        km = KMeans(k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        resultsdf = resultsdf.append({'clusterCount': k, 'gap': gap}, ignore_index=True)
    return (gaps.argmax() + 1, resultsdf)
However, my gap statistic keeps increasing, so the optimal number of clusters is always the end point of the cluster range I define. If I define the cluster range to be from 1 to 10, then the optimum will be 10.
According to various websites and the original paper, the workaround is to apply the one-standard-error rule, in which
GAP(k) > GAP(k+1) - s(k+1)
Can anyone explain to me how to implement this in the above code? I do not know how to calculate s(k+1), since it involves finding the standard deviation of the reference distribution.
s(k+1) = sd(k+1) * sqrt(1 + 1/B)
B is the number of Monte Carlo reference samples. I have looked at different websites, but it seems they did not implement the gap statistic with the one-standard-error rule.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def gap_stat(data, label):
    k = len(np.unique(label))
    n = data.shape[0]
    p = data.shape[1]
    D_r = []
    C_r = []
    for label_number in range(0, k):
        this_label_index = np.where(label == label_number)[0]
        pairwise_distance_matrix = euclidean_distances(data[this_label_index], squared=True)
        D_r.append(np.sum(pairwise_distance_matrix))
        C_r.append(float(len(this_label_index)))
    W_r = np.sum(np.asarray(D_r) / (2 * np.asarray(C_r)))
    gap_stats = np.log(float(p * n) / 12) - (2 / float(p)) * np.log(k) - np.log(W_r)
    return gap_stats
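To answer the s(k+1) part: sd(k) is simply the standard deviation of log(W*_kb) over the B = nrefs reference sets that the optimalK loop already generates, and s(k) = sd(k) * sqrt(1 + 1/B). Below is a sketch of the loop modified accordingly (it follows the paper in averaging the logs of the reference dispersions, and the variable names are only meant to mirror optimalK above):
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def optimalK_1se(data, nrefs=3, maxClusters=15):
    ks = list(range(1, maxClusters))
    gaps = np.zeros(len(ks))
    sks = np.zeros(len(ks))
    for i, k in enumerate(ks):
        ref_log_disps = np.zeros(nrefs)
        for b in range(nrefs):
            randomReference = np.random.random_sample(size=data.shape)
            ref_log_disps[b] = np.log(KMeans(n_clusters=k).fit(randomReference).inertia_)
        orig_log_disp = np.log(KMeans(n_clusters=k).fit(data).inertia_)
        gaps[i] = ref_log_disps.mean() - orig_log_disp              # Gap(k)
        sks[i] = ref_log_disps.std() * np.sqrt(1.0 + 1.0 / nrefs)   # s(k) = sd(k)*sqrt(1 + 1/B)
    resultsdf = pd.DataFrame({'clusterCount': ks, 'gap': gaps, 'sk': sks})
    # one-standard-error rule: smallest k with Gap(k) >= Gap(k+1) - s(k+1)
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return ks[i], resultsdf
    return ks[-1], resultsdf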

Discretization of probability array in Python

I have a numpy array (actually imported from a GIS raster map) which contains
probability values of the occurrence of a species, like the following example:
from numpy import random
a = random.randint(1, 20, 1200).reshape(40, 30)
b = (a * 1.0) / a.sum()
Now I want to get a discrete version of that array again. For example, if I have
100 individuals located on the area of that array (1200 cells), how are they
distributed? Of course they should be distributed according to their probability,
meaning lower values indicate a lower probability of occurrence. However, as everything is statistics, there is still the chance that an individual is located in a low-probability
cell. It should also be possible for multiple individuals to occupy one cell...
It is like transforming a continuous distribution curve back into a histogram. Just as many different histograms may result in the same distribution curve, it should also work the other way round, so applying the algorithm I am looking for will produce different discrete values each time.
...is there any algorithm in Python which can do that? As I am not that familiar with discretization, maybe someone can help.
Use np.random.choice with np.bincount:
np.bincount(np.random.choice(b.size, 100, p=b.flat),
            minlength=b.size).reshape(b.shape)
If you don't have NumPy 1.7, you can replace random.choice with:
np.searchsorted(np.cumsum(b), np.random.random(100))
giving:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
            minlength=b.size).reshape(b.shape)
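For completeness, here is a self-contained run of the random.choice version above, with a randomly generated probability map standing in for the raster (the names are illustrative):
import numpy as np

a = np.random.randint(1, 20, size=(40, 30)).astype(float)
b = a / a.sum()                                  # probabilities over 1200 cells, summing to 1
counts = np.bincount(np.random.choice(b.size, 100, p=b.ravel()),
                     minlength=b.size).reshape(b.shape)
print(counts.sum())                              # 100 -- every individual was placed in a cell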
So far I think ecatmur's answer is quite reasonable and simple.
I just want to add maybe a more "applied" example. Consider a die
with 6 faces (6 numbers), where each number/result has a probability of 1/6.
Displaying the die in the form of an array could look like this:
b = np.array([[1,1,1],[1,1,1]])/6.0
Thus rolling the die 100 times (n = 100) results in the following simulation:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),minlength=b.size).reshape(b.shape)
I think that can be an appropriate approach for such an application.
Thus thank you ecatmur for your help!
/Johannes
This is similar to a question I asked earlier this month.
import random
import math
from numpy.random import multinomial

def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar * i / RandomVectorSum for i in RandomVector]
    return RandomVector

def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = the sum of the list values
    Distribution = can be 'uniform' for a uniform distribution, 'normal' for a normal distribution ~ N(0,1) with +/- 3 sigma (default), or a list of size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary) distribution. These are the probabilities of the different outcomes and should sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize - 1) == DistributionSize:
            Values = multinomial(ListSumValue, Distribution, size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
        return OutputValue
    elif Distribution.lower() == 'uniform':  # I do not recommend this!!!! It is not as random (at least on my computer) as I had hoped
        UniformDistro = [1.0 / ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue, UniformDistro, size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        # Normal distribution construction... it's very flexible and hideous.
        # Assume a +/-3 sigma range. Warning, this may or may not be a suitable range for your implementation!
        # If you wish to explore a different range, change the LowSigma and HighSigma values.
        LowSigma = -3   # -3 sigma
        HighSigma = 3   # +3 sigma
        StepSize = 1 / (float(ListSize) - 1)
        ZValues = [(LowSigma * (1 - i * StepSize) + (i * StepSize) * HighSigma) for i in range(int(ListSize))]
        # Construction parameters for N(Mean, Variance) - default is N(0,1)
        Mean = 0
        Var = 1
        # NormalDistro = [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
        NormalDistro = list()
        for i in range(len(ZValues)):
            if i == 0:
                ERFCVAL = 0.5 * math.erfc(-ZValues[i] / math.sqrt(2))
                NormalDistro.append(ERFCVAL)
            elif i == len(ZValues) - 1:
                ERFCVAL = NormalDistro[0]
                NormalDistro.append(ERFCVAL)
            else:
                ERFCVAL1 = 0.5 * math.erfc(-ZValues[i] / math.sqrt(2))
                ERFCVAL2 = 0.5 * math.erfc(-ZValues[i - 1] / math.sqrt(2))
                ERFCVAL = ERFCVAL1 - ERFCVAL2
                NormalDistro.append(ERFCVAL)
        # print("Normal distribution sum = %f" % sum(NormalDistro))
        Values = multinomial(ListSumValue, NormalDistro, size=1)
        OutputValue = Values[0]
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue

ProbabilityDistribution = RandFloats(1200)  # this is your probability distribution for your 1200-cell array
SizeDistribution = RandIntVec(1200, 100, Distribution=ProbabilityDistribution)  # for a 1200-cell array whose sum is 100, with the given probability distribution
The two most important lines are the last two lines in the code above.
