I have a pandas dataframe and I am experimenting with scikit-learn Novelty and Outlier Detection. I am trying to figure out how to save my good dataset to a new CSV file after the outlier detector flags outliers.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
df = pd.read_csv('./ILCplusDAT.csv')
df = df.ffill().bfill()  # forward-fill, then back-fill remaining missing values
npower_pid = df[['power','pid']].to_numpy()
Then, using the scikit-learn LocalOutlierFactor, the results visually look good to me using only 2 of the columns of the original df, power and pid:
ax = plt.figure(figsize=(25,8))
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.005)
good = lof.fit_predict(npower_pid) == 1
plt.scatter(npower_pid[good, 0], npower_pid[good, 1], s=2, label="Good", color="#4CAF50")
plt.scatter(npower_pid[~good, 0], npower_pid[~good, 1], s=8, label="Bad", color="#F44336")
plt.legend();
This creates an interesting plot, and I would love to save a "filtered" version of the original data frame with the "BAD" data removed. Any tips greatly appreciated... hopefully this makes sense. The original data frame has 3 columns, but the filtered data shown in the plot uses only 2 of those columns. Can I still filter the original dataframe based on the output shown in this plot?
You want to filter df using your array, good:
# you can filter df using boolean masking in .loc[...]
df.loc[good]   # rows flagged as inliers
# or...
df.loc[~good]  # rows flagged as outliers
# ***NOTE: if you've altered the index in df you may have unexpected results.
# convert `good` into a `pd.Series` with the same index as `df`
# (note: `good` is True for inliers, so name the flag accordingly)
s = pd.Series(good, index=df.index, name="is_inlier")
# ... join with df
df = df.join(s)
# then filter to the inliers (True)
df.loc[df.is_inlier]
# or the outliers (False)
df.loc[~df.is_inlier]
Thanks to @Ian Thompson
My code, for what it's worth...
s = pd.Series(good, index=df.index, name="is_inlier")
df = df.join(s)
# df2 is filtered to remove BAD data
df2 = df[df['is_inlier']]
df2 = df2[['pid','power','dat']]
df2.to_csv('./filteredILCdata.csv')
For the dataframe below, I want to write a function that:
I. extracts the outliers for each column and exports the output as a CSV file (I need help with this one)
II. visualizes the data using boxplots and exports them as a PDF file
Outlier definition: points beyond ±3 standard deviations from the mean,
OR
any data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a data set.
High = Q3 + 1.5 * IQR
Low = Q1 - 1.5 * IQR
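For concreteness, a minimal standalone sketch of what the two definitions compute on a single column (using the cost values from the dataset below):
import numpy as np
import pandas as pd

s = pd.Series([120.05, 181.90, 10.21, 133.01, 311.19, 2003.4, 112.4, 763.2, 414.8, 812.5])

# definition 1: points beyond +/- 3 standard deviations from the mean
sd_outliers = s[np.abs(s - s.mean()) > 3 * s.std()]

# definition 2: points more than 1.5 IQRs below Q1 or above Q3
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]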
See below for the dataset and my attempt:
# dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# initialise data of lists
data = {'region': ['R1', 'R1', 'R2', 'R2', 'R2', 'R1', 'R1', 'R1', 'R2', 'R2'],
        'cost': [120.05, 181.90, 10.21, 133.01, 311.19, 2003.4, 112.4, 763.2, 414.8, 812.5],
        'commission': [110.21, 191.12, 190.21, 15.31, 245.09, 63.41, 811.3, 10.34, 153.10, 311.17],
        'salary': [10022, 19910, 19113, 449999, 25519, 140.29, 291.07, 390.22, 245.09, 4122.62],
        'revenue': [14029, 29100, 39022, 24509, 412271, 110.21, 191.12, 190.21, 12.00, 245.09],
        'tax': [120.05, 181.90, 10.34, 153.10, 311.17, 52119, 32991, 52883, 69359, 57835],
        'debt': [100.22, 199.10, 191.13, 199.99, 255.19, 41218, 52991, 1021, 69152, 79355],
        'income': [43211, 7672991, 56881, 211, 77342, 100.22, 199.10, 191.13, 199.99, 255.19],
        'rebate': [31.21, 429.01, 538.18, 621.58, 6932.5, 120.05, 181.90, 10.34, 153.10, 311.17],
        'scale': ['small', 'small', 'small', 'mid', 'mid', 'large', 'large', 'mid', 'large', 'small']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
############## my attempt ####################
def outlier_extractor(data):
    # select numeric columns
    numeric_columns = data.select_dtypes(include=np.number).columns.tolist()

    # (I) Extract and export outliers as csv ..... I need help with this one

    # (II) boxplot visualization
    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    plt.savefig('graph_outliers.pdf')
    plt.show()

# driver code
outlier_extractor(df)
Please comment and share your full code. Thanks in advance
def outlier_extractor(data):
    numeric_data = data.select_dtypes(include=np.number)
    Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
    IQR = Q3 - Q1
    # replace every outlying cell with NaN
    numeric_data[:] = np.where((numeric_data > Q3 + 1.5 * IQR) | (numeric_data < Q1 - 1.5 * IQR),
                               np.nan, numeric_data)
    # save each filtered column to its own csv file
    numeric_data.apply(lambda series: series.dropna().to_csv(series.name + ".csv"))

    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_data.columns):  # was: numeric_columns (undefined here)
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    # plt.savefig('graph_outliers.pdf')
    plt.show()

outlier_extractor(df)
Note that the apply call saves each filtered column to a separate csv file. From your description I thought that this was your task.
Note also that you don't need the seaborn package.
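If the goal in (I) is to export the outliers themselves rather than the filtered data, a minimal sketch of the complementary step (assuming the df from the question) could be:
numeric = df.select_dtypes(include=np.number)
Q1, Q3 = numeric.quantile(.25), numeric.quantile(.75)
IQR = Q3 - Q1
# keep only the outlying cells, then drop rows that contain no outliers at all
outliers_only = numeric.where((numeric > Q3 + 1.5 * IQR) | (numeric < Q1 - 1.5 * IQR))
outliers_only.dropna(how="all").to_csv("outliers.csv")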
EDIT
To export the whole filtered dataframe, with missing values replacing the outliers, replace the to_csv row with:
numeric_data.to_excel("filtered_numeric_data.xlsx")
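Or, to stay with csv output (a hedged alternative; to_excel additionally requires an Excel engine such as openpyxl):
numeric_data.to_csv("filtered_numeric_data.csv")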
I have a DataFrame and I'm trying to apply some ML algorithms to it.
I'm using pandas to handle it, but I'm having several problems:
As you see in the 3rd cell, I have split Y into Ytr and Yts. After this, the dataframe loses its column names. I've tried to name the column again, but it doesn't work.
In the 4th cell, I'm trying to use a conditional statement to create a subset of Y in which the Y values are 1 (it is named ytr1), but it returns an empty dataframe.
Any suggestions on the whole code would be really appreciated, since I'm not very experienced with pandas.
Note: if you haven't worked with Jupyter notebooks, #%% just means a new cell.
#%%
from pandas import DataFrame as df
import random
import numpy as np
import pandas as pd
import re
#%%
# Preparing the DataFrame
labels = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\labels.csv', header=None)
ll = labels.loc[:, 0].tolist()
data = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\pima-indians-diabetes2.csv', names=ll)
i = data.columns.values.tolist() # i is the labels of the csv file
i[-1]
#%%
# Splitting the Dataset
X = data.drop(i[-1], axis=1)
Y = data.iloc[:, 8]
Y = Y.to_frame()
Y = pd.DataFrame(Y.values.reshape(-1, 1), columns=i[-1])
tr_idx = data.sample(frac=0.7).index
Xtr = df(X[X.index.isin(tr_idx)])
Xts = df(X[~X.index.isin(tr_idx)])
Ytr = df(Y[X.index.isin(tr_idx)], columns='result')
Yts = df(Y[~X.index.isin(tr_idx)], columns=i[-1])
#%%
# splitting the Classes
ytr1 = Ytr.drop(Ytr[Ytr.iloc[0]!=1].index)
X: all the columns except the labels/classes, which are 0 or 1
Y: the last column of the csv file, loaded as labels
Xtr: the fraction of X that I'm planning to use for training
Xts: the fraction of X that I'm planning to use for testing
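For what it's worth, a minimal sketch of how both problems might be fixed, using a toy stand-in for the Pima data (the column name "result" is hypothetical):
import numpy as np
import pandas as pd

# toy stand-in: 8 feature columns plus a 0/1 label column
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((100, 8)), columns=[f"f{k}" for k in range(8)])
data["result"] = rng.integers(0, 2, size=100)

X = data.drop(columns="result")
Y = data[["result"]]  # double brackets keep Y a DataFrame and preserve its column name

tr_idx = data.sample(frac=0.7, random_state=0).index
Xtr, Xts = X.loc[tr_idx], X.drop(index=tr_idx)
Ytr, Yts = Y.loc[tr_idx], Y.drop(index=tr_idx)  # .loc-based splits keep column names intact

# Ytr.iloc[0] selects the first ROW; mask the column instead
ytr1 = Ytr[Ytr["result"] == 1]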
I know we can use the following code to create a decile column based on a column of a given data set, allowing for ties in the data (see How to qcut with non unique bin edges?):
import numpy as np
import pandas as pd
# create a sample
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))
# sort by column C
df = df.sort_values(['C'], ascending=False)
# create decile by column C
df['decile'] = pd.qcut(df['C'].rank(method='first'), 10, labels=np.arange(10, 0, -1))
Is there an easy way to save the cut points from df, and then use the same cut points to cut a new data set? For example:
np.random.seed([1])
df_new = pd.DataFrame(np.random.rand(100, 1), columns=list('C'))
You can use .left to get all the bin edges:
s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
s2 = pd.Series([2, 3, 4, 6, 1])
a = pd.qcut(s1, 10).unique()
bins = [x.left for x in a] + [np.inf]
pd.cut(s2, bins=bins)
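Alternatively, a sketch using retbins=True, which pd.qcut supports for exactly this purpose (note these edges come from the raw values, not from the rank trick in the question):
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))

# retbins=True also returns the bin edges used for the cut
df['decile'], bins = pd.qcut(df['C'], 10, labels=False, retbins=True, duplicates='drop')
# widen the outer edges so unseen values still fall into a bin
bins[0], bins[-1] = -np.inf, np.inf

# reuse the saved edges on new data
np.random.seed([1])
df_new = pd.DataFrame(np.random.rand(100, 1), columns=list('C'))
df_new['decile'] = pd.cut(df_new['C'], bins=bins, labels=False)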
I have a pandas dataframe with a continuously moving series of speed values. Since it is sensor data, we often get errors at some points in the middle, and a moving average does not seem to help either. What methods can I use to remove these outliers or peak points from the data?
Example:
data_points = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4, 0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
In this data, the points 4, 4, 5, and 6 are complete outlier values.
Previously I used a rolling mean with a 5-minute window to smooth these values, but I still get a lot of blip points of this type, which I want to remove. Can anyone suggest a technique to get rid of these points?
I have an image that gives a clearer view of the data: you can see here how the data shows some outlier points, which I have to remove. Any idea of a possible way to get rid of these points?
I really think z-score using scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post; there, the focus is on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler, since, judging by the data provided, it would be pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember that what does and does not look like an outlier will depend entirely on your dataset, and after removing some outliers, what did not look like an outlier before may suddenly start to do so. Have a look:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')
# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe (copy the argument, not the global df1)
    df = df.copy(deep=True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return df_keep
Original data:
Test run 1 : Z-score = 4:
As you can see, no data has been removed because the level was set too high.
Test run 2 : Z-score = 2:
Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.
Test run 3 : Z-score = 1.2:
This is looking really good. The remaining data now seems a bit more evenly distributed than before. But now the data point highlighted in the plot is starting to look a bit like a potential outlier. So where to stop? That's going to be entirely up to you!
EDIT: Here's the whole thing for an easy copy&paste:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')
# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe (copy the argument, not the global df1)
    df = df.copy(deep=True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return df_keep
# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level)+')')
df_clean = outliers(df = df1, level = level)
# final plot
df_clean.plot(style = 'o')
You might cut values above a certain quantile as follows:
import numpy as np

data_points = np.array(data_points)  # convert the list first so the comparison works
clean_data = data_points[data_points <= np.percentile(data_points, 95)]
In pandas you would use df.quantile (see the pandas documentation).
Or you may use the Q3 + 1.5*IQR approach to eliminate the outliers, as a boxplot does; see the sketch below.
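A rough sketch of that IQR-based elimination on the data points from the question:
import numpy as np

data_points = np.array([0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
                        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9])
q1, q3 = np.percentile(data_points, [25, 75])
iqr = q3 - q1
# keep only the points inside the boxplot whiskers
clean_data = data_points[(data_points >= q1 - 1.5 * iqr) & (data_points <= q3 + 1.5 * iqr)]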
I have the following problem:
Given a 2D dataframe, with the first column holding values and the second giving the category of each point, I would like to compute, via k-means, a dictionary of the means of each category, and then assign to each row, as a new column in the original data frame, the centroid that its group mean is closest to.
I would like to do this using groupby.
More generally, my problem is that apply (to my knowledge) can only use functions that are defined on the individual groups (like mean()), while k-means needs information about all the groups. Is there a nicer way than transforming everything to numpy arrays and working with those?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2
k=4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means = groups.mean().unstack()
centroids, dictionary = kmeans2(means,k)
fig, ax = plt.subplots()
print(dictionary)
What I would like to get now is a new column in df that gives the value in dictionary for each entry.
You can achieve it as follows:
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means_data_frame = pd.DataFrame(groups.mean())
centroid, means_data_frame['cluster'] = kmeans2(means_data_frame['B'], k)
df = df.join(means_data_frame, rsuffix='_mean', on='A')  # assign the result back to df
This will append 2 more columns to df, B_mean and cluster, denoting the group's mean and the cluster that the group's mean is closest to, respectively.
If you really want to use apply, you can write a function that reads the cluster value from means_data_frame and assigns it to a new column in df, as sketched below.
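A minimal sketch of that assignment (reusing df and means_data_frame from the snippet above; a map on the group key does the same job as an apply):
# map each row's group key in A to the cluster label of that group's mean
df['cluster'] = df['A'].map(means_data_frame['cluster'])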