Outlier detection based on the moving mean in Python

I am trying to translate an algorithm from MATLAB to Python. The algorithm works with large datasets and needs an outlier detection and elimination technique to be applied.
In the MATLAB code, the outlier deletion technique I use is movmedian:
Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
Data_raw(find(Outlier_T),:)=[]
Which detects outliers with a rolling median, by finding disproportionate values in the centre of a three-value moving window. So if I have a column "Temperatura" with a 40 on row 3, it is detected and the entire row is deleted.
Temperatura Date
1 24.72 2.3
2 25.76 4.6
3 40 7.0
4 25.31 9.3
5 26.21 15.6
6 26.59 17.9
... ... ...
To my understanding, this is achieved with pandas.DataFrame.rolling. I have seen several posts that exemplify its use, but I am not managing to make it work with my code:
Attempt A:
Dataframe.rolling(df["t_new"]))
Attempt B:
df-df.rolling(3).median().abs()>200
#based on #Ami Tavory's answer
Am I missing something obvious here? What is the right way of doing this?
Thank you for your time.

The code below drops rows based on a threshold on the difference from the rolling median. The threshold can be adjusted as needed. I'm not sure it replicates the MATLAB code exactly, though.
# Import Libraries
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
'Date':[2.3,4.6,7.0,9.3,15.6,17.9]
})
# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()
# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']
# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)
# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)
Output
print(df)
Temperatura Date
0 24.72 2.3
1 25.76 4.6
3 25.31 9.3
4 26.21 15.6
5 26.59 17.9
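To my understanding, MATLAB's isoutlier(...,'movmedian',3) flags values that are more than three scaled MADs away from the local median. The sketch below approximates that with pandas rolling windows; the centred window, the min_periods handling and the 1.4826 scaling constant are my assumptions, so verify the result against your MATLAB output:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]
})

window = 3
# Local median and local scaled MAD, both computed on centred rolling windows.
rolling_median = df['Temperatura'].rolling(window, center=True, min_periods=1).median()
deviation = (df['Temperatura'] - rolling_median).abs()
rolling_mad = deviation.rolling(window, center=True, min_periods=1).median() * 1.4826

# Flag points more than three scaled MADs from the local median, then drop them.
outlier = deviation > 3 * rolling_mad
df_clean = df[~outlier]
print(df_clean)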

Late to the party; this is based on Nilesh Ingle's answer, modified to be more general, more verbose (graphs!), and to use a threshold on scaled values instead of the data's real values.
# Calculate rolling median
df["Temp_Rolling"] = df["Temp"].rolling(window=3).median()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df["Temp_Scaled"] = scaler.fit_transform(df["Temp"].values.reshape(-1, 1))
df["Temp_Rolling"] = scaler.fit_transform(df["Temp_Rolling"].values.reshape(-1, 1))
# Calculate difference
df["Temp_Diff"] = df["Temp_Scaled"] - df["Temp_Rolling"]
import numpy as np
import matplotlib.pyplot as plt
# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4
# Flag rows to keep as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)
# Keep flagged rows
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()
Once you're satisfied with the data cleaning:
# Satisfied, replace data
df = df[df["Temp_Keep_Flag"].values]
df.drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"], inplace=True)
df.plot()

Nilesh's answer works perfectly; to iterate on his code you could also do:
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# all in one line
df = df.drop(df[(df['Temp'] - df['rolling_temp'] > upper_threshold) | (df['Temp'] - df['rolling_temp'] < lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]

Related

Apply a function across all rows in new column creation Pandas

I have the following dataset where I make my predictions, and historically I know the standard deviations of these predictions:
d = {'Name': ['Jim', 'Matt','Alex','Nathan','Dom'], 'Predict': [2.901826509,3.212149337,2.388237651,3.744206058,1.944415024]}
df = pd.DataFrame(data=d)
df['Mean'] = 4
df['StDev'] = 6
df.head(5)
Name Predict Mean StDev
0 Jim 2.901827 4 6
1 Matt 3.212149 4 6
2 Alex 2.388238 4 6
3 Nathan 3.744206 4 6
4 Dom 1.944415 4 6
I have also found a function from https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f
That has the following:
import numpy as np
from scipy.stats import norm
def MC_prob(M, mu, sigma):
    prob_larger_than3 = []
    for i in range(M):
        # Using CDF since P[Z>=3] = 1-P[Z<=3]
        p = 1 - norm.cdf(3, mu, sigma)
        # Using Survival Function P[Z>=3]
        p = norm.sf(3, mu, sigma)
        prob_larger_than3.append(p)
    MC_approximation_prob = np.array(prob_larger_than3).mean()
    return(MC_approximation_prob)
MC_prob(M = 10000, mu = 10, sigma = 2)
0.9997673709209641
I would like to apply this function and create a new column in my dataframe, with the probability of my Predict column being over 3.
I tried:
df['ProbOver3'] = MC_prob(M = 10000, mu = df.Predict, sigma = df.StDev)
but it gave the same value for every row. Any ideas on how to apply this over every row? Essentially I am trying to simulate and return a probability of each row being above or below certain numbers, and I hope I am on the right track. It's a follow-up to this question: Apply a monte carlo simulation on a pandas dataframe and return probability result in column
Any help would be much appreciated, thanks very much!
Use df.apply() with a lambda. You can apply (pun intended) this function to every row to make a new column by passing axis=1, which makes apply work row-wise. Then use a lambda to pass each row's values to the function. Here is how you could use this:
df['ProbOver3'] = df.apply(lambda row: MC_prob(10000, row['Predict'], row['StDev']), axis=1)
Check out the docs on df.apply for more info.
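As a side note, since the loop inside MC_prob recomputes the same closed-form value on every iteration, the Monte Carlo averaging adds nothing here; if that is acceptable for your use case, a vectorized call to SciPy's survival function gives the same column in one step (a sketch, reusing the column names from your example):
from scipy.stats import norm

# P[value >= 3] per row, computed directly from the normal survival function
# instead of looping; loc and scale broadcast over the whole column.
df['ProbOver3'] = norm.sf(3, loc=df['Predict'], scale=df['StDev'])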

How can I remove sharp jumps in data?

I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold- this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
e.g.
df_diff = df.diff(60) # period of about 60 makes jumps stick out
filter_index = np.nonzero((df.Temp <-1) | (df.Temp>0.5)) # when diff is less than -1 and greater than 0.5, most likely data jumps.
However, I find myself stuck here. The main problem is that:
1) I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
The more minor problem is that
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to chuck away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts?
*Edit: as suggested, I've also calculated the second order diff, but to be honest, I think the first order diff would allow for tighter thresholds (see below):
*Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot() # keep only the regions with a low rate-of-change (the jumps are filtered out)
df[1].plot() #plot original data
plt.show()
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
Your first question I believe is answered with the .loc selection above.
Your second question will take some experimentation with your dataset. The code above only selects out high-derivative data. You'll also need your threshold selection to remove zeroes or the like. You can experiment with where to make the derivative selection. You can also plot a histogram of the derivative to give you a hint as to what to select out.
Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
Edit:
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
(df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
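To connect this back to your first question on an indexed dataframe like yours: once you have the absolute finite difference, you can keep only the rows below a threshold and let the rest fall away. A minimal sketch (the column name Temp and the 0.05 cut-off are assumptions to adapt to your data):
# Absolute second-order central difference of the temperature column.
jump = 0.5 * (df['Temp'].diff() + df['Temp'].diff(periods=-1)).abs()

# Keep only rows where the local rate of change is small; the NaNs at the
# ends (where the difference is undefined) are kept explicitly.
df_kept = df[(jump < 0.05) | jump.isna()]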
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0, 1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates': dates, 'trend1': trend1, 'y1': y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1 / (np.cos(trend1) / 1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    df['Temp'][20:31] = np.nan

    # Insert spikes and missing values
    df['Temp'][19] = df['Temp'][39] / 4000
    df['Temp'][31] = df['Temp'][15] / 4000
    return(df)
# Dataframe with random data
df_raw = sample()
df_raw.plot()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data:
But for the second spike, you're going to get the (nonexisting) difference between a very small number and a non-existing number, so that the extreme data-point you'll end up removing is the difference between your outlier and the next observation:
This is not a huge problem for one single observation. You could just fill it right back in there. But for larger datasets that would not be a very viable solution. Anyway, if you can manage without that particular value, the below code should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.iloc[:, 0].to_frame()  # .ix is removed in current pandas; use .iloc
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
df_complete.plot()
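If forward-filling the gaps (step 8) is too crude for your data, time-based interpolation is a drop-in alternative at that step (a sketch; method='time' assumes the DatetimeIndex from the example):
# Alternative to step 8: interpolate across the removed/missing points
# instead of repeating the previous value.
df_complete = df_out.interpolate(method='time')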
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0, 1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates': dates, 'trend1': trend1, 'y1': y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1 / (np.cos(trend1) / 1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    df['Temp'][20:31] = np.nan

    # Insert spikes and missing values
    df['Temp'][19] = df['Temp'][39] / 4000
    df['Temp'][31] = df['Temp'][15] / 4000
    return(df)
# A function for removing outliers
def noSpikes(df, level, keepFirst):
    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns

    # 2. Take the first difference
    df_diff = df.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    # 6.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

    # 7. Keep only the first column
    df_out = df_out.iloc[:, 0].to_frame()  # .ix is removed in current pandas; use .iloc

    # 8. Fill missing values
    df_complete = df_out.fillna(axis=0, method='ffill')

    # 9. Reset column names
    df_complete.columns = colName

    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]

    return(df_complete)
# Dataframe with random data
df_raw = sample()
df_raw.plot()
# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
df_cleaned.plot()

Empty Plot when dealing with a huge number of rows

I attempted to use the code below to plot a graph to show the Speed per hour by days.
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import glob, os
taxi_df = pd.read_csv('ChicagoTaxi.csv')
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
#For filtering away any zero values when trip_Seconds or trip_miles = 0
filterZero = taxi_df[(taxi_df.trip_seconds != 0) & (taxi_df.trip_miles != 0)]
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
filterZero['speed'] *= 60
filterZero = filterZero.reset_index(drop=True)
filterZero.groupby(filterZero['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()
plt.xlabel('Day')
plt.ylabel('Speed(Miles per Minutes)')
plt.title('Mean Miles per Hour By Days')
plt.show() #Not working
Example rows
0 2016-01-13 06:15:00 8.000000
1 2016-01-22 09:30:00 10.500000
Small Dataset : [1250219 rows x 2 columns]
Big Dataset: [15172212 rows x 2 columns]
For a smaller dataset the code works perfectly and the plot is shown. However, when I attempted to use a dataset with 15 million rows, the plot was empty because the values were "inf" despite running mean(). Am I doing something wrong here?
0 inf
1 inf
...
5 inf
6 inf
The speed is "Miles Per Hour" by day! I was trying out all time format so there is a mismatch in the picture sorry.
Image of failed plotting (larger dataset):
Image of successful plotting (smaller dataset):
I can't really be sure because you do not provide a real example of your dataset, but I'm pretty sure your problem comes from the column trip_seconds.
See these two lines:
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
If some of your values in the column trip_seconds are ≤ 30, then this line will round them to 0.0.
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
Therefore this column will contain some inf values (any nonzero value / 0.0 = inf). Taking the mean() of an array that contains inf values will return inf regardless.
Two things to consider:
if your values in the column trip_seconds are actually in seconds, then after dividing them by 60 they will be in minutes, which makes your speed miles per minute, not per hour.
You should try it without rounding the times.
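As a rough sketch of both suggestions combined (the column names are taken from the question, the rest is an assumption): compute the speed straight from the unrounded seconds and drop any remaining non-finite values before averaging, so a handful of bad rows cannot turn a whole group mean into inf:
import numpy as np

# Keep only rows with positive distance and duration, then compute mph
# directly from the seconds (no rounding step that could produce zeros).
valid = taxi_df[(taxi_df['trip_seconds'] > 0) & (taxi_df['trip_miles'] > 0)].copy()
valid['speed_mph'] = valid['trip_miles'] / (valid['trip_seconds'] / 3600.0)

# Guard against any remaining inf/NaN before taking the mean per weekday.
valid = valid[np.isfinite(valid['speed_mph'])]
valid.groupby(valid['trip_start_timestamp'].dt.strftime('%w'))['speed_mph'].mean().plot()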

Classification of continuous data

I've got a Pandas df that I use for Machine Learning in Scikit for Python.
One of the columns is a target value which is continuous data (varying from -10 to +10).
From the target-column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows I want to distribute into 5 classes with roughly 200 in each class.
So far, I have done this in Excel, separate from my Python code, but as the data has grown it's getting impractical.
In Excel I have calculated the percentiles and then used some logic to build the classes.
How to do this in Python?
#create data
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
#find quantiles
quantiles = df['target'].quantile([.2, .4, .6, .8])
#labeling of groups
df['group'] = 5
df['group'][df['target'] < quantiles[.8]] = 4
df['group'][df['target'] < quantiles[.6]] = 3
df['group'][df['target'] < quantiles[.4]] = 2
df['group'][df['target'] < quantiles[.2]] = 1
Looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?
import numpy as np
import pandas as pd
#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)
#generate the discretization in 5 classes
rows_cut = pd.qcut(rows, 5)
classes = rows_cut.factorize()[0]
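Applied to the DataFrame from the question, the same idea becomes a one-liner; labels=False makes qcut return the bin index directly (a sketch; the +1 just reproduces the 1-5 labels used above):
# Five equally populated classes from the continuous target column.
df['group'] = pd.qcut(df['target'], q=5, labels=False) + 1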

Applying pandas qcut bins to new data

I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is, how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes. Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser is the categorical series and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0 13
1 19
2 3
3 9
4 13
5 17
...
User #Karen said:
By using this logic, I am getting Na values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below (or above) the smallest (or greatest) value from the training data. Therefore, some values will fall out of range and will therefore not be assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
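An alternative that avoids modifying the training values (a sketch, under the assumption that out-of-range validation values should simply fall into the outermost bins) is to widen the outer bin edges to infinity before cutting:
import numpy as np

# Derive the bin edges from the training data as before.
s, b = pd.qcut(train['value'], 20, retbins=True, labels=False)

# Stretch the first and last edges so no validation value falls outside.
b[0], b[-1] = -np.inf, np.inf
test['bin'] = pd.cut(test['value'], bins=b, labels=False)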
