Detect and fix outliers in a pandas series - python

I have a pandas Series with some outlier values. Here's some mock data:
df = pd.DataFrame({'col1': [1200, 400, 50, 75, 8, 9, 8, 7, 6, 5, 4, 6, 6, 8, 3, 6, 6, 7, 6]})
I'd like to substitute outliers, i.e. values that are >= 3 standard deviations from the mean, with the mean value.

Let's do:
thrs = df['col1'].mean() + 3 * df['col1'].std()
df.loc[df['col1'] >= thrs, 'col1'] = df['col1'].mean()

Alternatively, with numpy:
import numpy as np
std_dev = df["col1"].std()
mean = df["col1"].mean()
df["col1"] = np.where(df.col1 >= (mean + 3*std_dev), mean, df.col1)


Alternative to for loops for calculating 15^6 combinations in Python

Today I have a nested for loop in Python to calculate the value of all different combinations on a horse racing card consisting of six different races, i.e. six different arrays (of different lengths, but up to 15 items per array). That can be up to 11,390,625 combinations (15^6).
For each horse in each race, I calculate a value (EV) which I want to multiply.
Array 1: 1A,1B,1C,1D,1E,1F
Array 2: 2A,2B,2C,2D,2E,2F
Array 3: 3A,3B,3C,3D,3E,3F
Array 4: 4A,4B,4C,4D,4E,4F
Array 5: 5A,5B,5C,5D,5E,5F
Array 6: 6A,6B,6C,6D,6E,6F
1A * 1B * 1C * 1D * 1E * 1F = X,XX
.... .... .... .... ... ...
6A * 6B * 6C * 6D * 6E * 6F = X,XX
Doing four levels is OK. It takes me about 3 minutes.
I have not yet been able to do six levels.
I need help in creating a better way of doing this, and have no idea how to proceed. Does numpy perhaps offer help here? Pandas? I've tried compiling the code with Cython, but it did not help much.
My function takes in a list containing the horses in numerical order and their EV. (Since horse starting numbers do not start with zero, I add 1 to the index). I iterate through all the different races, and save the output for the combination into a dataframe.
def calculateCombos(horses_in_race_1, horses_in_race_2, horses_in_race_3, horses_in_race_4, horses_in_race_5, horses_in_race_6, totalCombinations, df):
    totalCombinations = 0
    for idx1, hr1_ev in enumerate(horses_in_race_1):
        hr1_no = idx1 + 1
        for idx2, hr2_ev in enumerate(horses_in_race_2):
            hr2_no = idx2 + 1
            for idx3, hr3_ev in enumerate(horses_in_race_3):
                hr3_no = idx3 + 1
                for idx4, hr4_ev in enumerate(horses_in_race_4):
                    hr4_no = idx4 + 1
                    for idx5, hr5_ev in enumerate(horses_in_race_5):
                        hr5_no = idx5 + 1
                        for idx6, hr6_ev in enumerate(horses_in_race_6):
                            hr6_no = idx6 + 1
                            totalCombinations = totalCombinations + 1
                            combinationEV = hr1_ev * hr2_ev * hr3_ev * hr4_ev * hr5_ev * hr6_ev
                            new_row = {'Race1': str(hr1_no), 'Race2': str(hr2_no), 'Race3': str(hr3_no), 'Race4': str(hr4_no), 'Race5': str(hr5_no), 'Race6': str(hr6_no), 'EV': combinationEV}
                            df = appendCombinationToDF(df, new_row)
    return df
Why don't you try this and see if you can run the function without any issues? It works on my laptop (I'm using PyCharm), and I did not encounter any memory error. If you can't run it, you may simply need a machine with more memory.
Assume that we have the following:
horses_in_race_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_3 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_4 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_5 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_6 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
I have rewritten the function as follows, changing the enumeration to use start=1. I am also not using df, since I don't know what the appendCombinationToDF function does; I append to a plain list instead.
def calculateCombos(horses_in_race_1, horses_in_race_2, horses_in_race_3, horses_in_race_4, horses_in_race_5, horses_in_race_6):
    for idx1, hr1_ev in enumerate(horses_in_race_1, start=1):
        for idx2, hr2_ev in enumerate(horses_in_race_2, start=1):
            for idx3, hr3_ev in enumerate(horses_in_race_3, start=1):
                for idx4, hr4_ev in enumerate(horses_in_race_4, start=1):
                    for idx5, hr5_ev in enumerate(horses_in_race_5, start=1):
                        for idx6, hr6_ev in enumerate(horses_in_race_6, start=1):
                            combinationEV = hr1_ev * hr2_ev * hr3_ev * hr4_ev * hr5_ev * hr6_ev
                            new_row = {'Race1': str(idx1), 'Race2': str(idx2), 'Race3': str(idx3), 'Race4': str(idx4), 'Race5': str(idx5), 'Race6': str(idx6), 'EV': combinationEV}
                            l.append(new_row)
                            # df = appendCombinationToDF(df, new_row)

l = []  # df = ...
calculateCombos(horses_in_race_1, horses_in_race_2, horses_in_race_3, horses_in_race_4, horses_in_race_5, horses_in_race_6)
Executing len(l), I get:
11390625 # the maximum number of combinations possible, which means the function ran successfully and the computation completed.
If the above runs for you, replace the list l with your df and see whether the function still completes without a memory error. The list version ran for me in about 20-30 seconds.
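Since the question asks whether numpy can help: the full grid of EV products can be computed without Python-level loops by broadcasting each race's EV array along its own axis. This is only a sketch with hypothetical EVs, assuming each horses_in_race_* list holds the EVs in horse-number order:
import numpy as np
from functools import reduce
# hypothetical EVs; the real arrays can have different lengths
races = [np.arange(1, 16, dtype=float) for _ in range(6)]
# 6-D array: ev_grid[i1, i2, i3, i4, i5, i6] is the product of the six EVs
ev_grid = reduce(np.multiply.outer, races)
print(ev_grid.size)   # 11390625
# horse numbers (1-based) of the combination with the highest EV
best = np.unravel_index(ev_grid.argmax(), ev_grid.shape)
print([i + 1 for i in best], ev_grid[best])
Building an 11-million-row DataFrame is usually the slow part, so keeping the result as an array and only converting the combinations you actually need is typically much faster than appending rows one at a time.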

How to filter a pandas dataframe and then groupby and aggregate a list of values?

I'm trying to use groupby and get values as a list.
The end df should have "bid" as the index and the scores as a list in the second column (e.g. [85, 58] if they both have the same "bid").
This is my df:
When I use merged.groupby("bid")['score_y'].apply(list)
I get TypeError: 'Series' objects are mutable, thus they cannot be hashed.
Does anyone know why I'm getting this error?
Edit 1:
This is the datasource: https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i
The df "ins" yields the following where "bid" are the numbers before the "_' in "iid".
My code so far:
ins2018 = ins[ins['year'] == 2018] #.drop(["iid", 'date', 'type', 'timestamp', 'year', 'Missing Score'], axis = 1)
# new = ins2018.loc[ins2018["score"] > 0].sort_values("date").groupby("bid").count()
# new = new.loc[new["iid"] == 2]
# merge = pd.merge(new, ins2018, how = "left", on = "bid").sort_values('date_y')
# merged = merge.loc[merge['score_y'] > 0].drop(['iid_x', 'date_x', 'score_x', 'type_x', 'timestamp_x', 'year_x', 'Missing Score_x', 'iid_y', 'type_y', 'timestamp_y', 'year_y', 'Missing Score_y', "date_y"], axis = 1)
Aggregate a list onto score_y with pandas.DataFrame.aggregate
Depending on merged, the index may need to be reset.
# reset the index of of merged
merged = merged.reset_index(drop=True)
# groupby bid and aggregate a list onto score_y
merged.groupby('bid').agg({'score_y': list})
Example
import pandas as pd
import numpy as np
import random
np.random.seed(365)
random.seed(365)
rows = 100
data = {'a': np.random.randint(10, size=rows),
        'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)]}
df = pd.DataFrame(data)
# groupby and aggregate a list
dfg = df.groupby('groups').agg({'a': list})
dfg
[out]:
a
groups
1-5 [7, 8, 4, 3, 1, 7, 9, 3, 2, 7, 6, 4, 4, 6]
100-500 [4, 3, 2, 8, 6, 3, 1, 5, 7, 7, 3, 5, 4, 7, 2, 2, 4]
26-100 [4, 2, 2, 9, 5, 3, 1, 0, 7, 9, 7, 7, 9, 9, 9, 7, 0, 0, 4]
500-1000 [2, 8, 0, 7, 6, 6, 8, 4, 6, 2, 2, 5]
6-25 [5, 9, 7, 0, 6, 5, 7, 9, 9, 9, 6, 5, 6, 0, 2, 7, 4, 0, 3, 9, 0, 5, 0, 3]
>1000 [2, 1, 3, 6, 7, 6, 0, 5, 9, 9, 3, 2, 6, 0]
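A small variant (a sketch, not part of the original answer): pandas named aggregation can return the list and the group size in one call, which is handy later when filtering on the number of inspections:
# list of values and group size in a single aggregation
dfg = df.groupby('groups').agg(a_list=('a', list), a_count=('a', 'size'))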
Using data from Restaurant Scores - LIVES Standard
Attempts to follow along with the code in the OP.
import pandas as pd
# load data
ins = pd.read_csv('data/Restaurant_Scores_-_LIVES_Standard.csv')
# convert inspection_date to a datetime format
ins.inspection_date = pd.to_datetime(ins.inspection_date)
# add a year column
ins['year'] = ins.inspection_date.dt.year
# select data for 2018
ins2018 = ins[ins['year'] == 2018]
################################################################
# this is where you run into issues
# new is the counts for every column
# this is what you could have done to get the number of inspection counts
# just count the occurrences of business_id
counts = ins2018.groupby('business_id').agg({'business_id': 'count'}).rename(columns={'business_id': 'inspection_counts'}).reset_index()
# don't do this: get dataframe of counts
# new = ins2018.loc[ins2018["inspection_score"] > 0].sort_values("inspection_date").groupby("business_id").count()
# don't do this: select data
# new = new.loc[new["inspection_id"] == 2].reset_index()
# merge updated
merge = pd.merge(counts, ins2018, how = "left", on = "business_id")
################################################################
# select data again
merged = merge.loc[(merge['inspection_score_y'] > 0) & (merge.inspection_counts >= 2)]
# groupby and aggregate list
mg = merged.groupby('business_id').agg({'inspection_score_y': list})
# display(mg)
inspection_score_y
business_id
31 [96.0, 96.0]
54 [94.0, 94.0]
61 [94.0, 94.0]
66 [98.0, 98.0]
101 [92.0, 92.0]
groupby on ins updated
import pandas as pd
# load data and parse the dates
ins = pd.read_csv('data/Restaurant_Scores_-_LIVES_Standard.csv', parse_dates=['inspection_date'])
# select specific data
data = ins[(ins.inspection_date.dt.year == 2018) & (ins.inspection_score > 0)].dropna().reset_index(drop=True)
# groupby
dg = data.groupby('business_id').agg({'inspection_score': list})
# display(dg)
inspection_score
business_id
54 [94.0, 94.0]
146 [90.0, 81.0, 90.0, 81.0, 90.0, 81.0, 81.0, 81.0]
151 [81.0, 81.0, 81.0, 81.0, 81.0]
155 [90.0, 90.0, 90.0, 90.0]
184 [90.0, 90.0, 90.0, 96.0]
# if you only want results with 2 or more inspections
# get the length of the list, because each score represents an inspection
dg['inspection_count'] = dg.inspection_score.map(len)
# filter for 2 or more; this removes the 81 business_ids that had fewer than two inspections
dg = dg[dg.inspection_count >= 2]
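The same result can be produced in one chain; this is only a sketch under the same column names, filtering on the list length with map(len):
dg = (data.groupby('business_id')
          .agg({'inspection_score': list})
          .loc[lambda d: d['inspection_score'].map(len) >= 2])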

How to calculate median from 2 different lists in Python

I have two lists, note = [6,8,10,13,14,17] and Effective = [3,5,6,7,5,1]. The first one represents grades and the second one the number of students in the class that got that grade, so 3 kids got a 6 and 1 kid got a 17. I want to calculate the mean and the median. For the mean I got:
note = [6,8,10,13,14,17]
Effective = [3,5,6,7,5,1]
products = []
for num1, num2 in zip(note, Effective):
    products.append(num1 * num2)
print(sum(products) / sum(Effective))
My first question is, how do I turn both lists into a 3rd list:
(6,6,6,8,8,8,8,8,10,10,10,10,10,10,13,13,13,13,13,13,13,14,14,14,14,14,17)
in order to get the median.
Thanks,
Donka
Here's one approach: iterate over Effective in an inner loop to replicate each grade as many times as specified, and take the median using statistics.median:
from statistics import median
out = []
for i in range(len(note)):
    for _ in range(Effective[i]):
        out.append(note[i])
print(median(out))
# 10
To get your list you could do something like
total = []
for grade, freq in zip(note, Effective):
    total += freq * [grade]
You can use np.repeat to get a list with the new values.
note = [6,8,10,13,14,17]
Effective = [3,5,6,7,5,1]
import numpy as np
new_list = np.repeat(note,Effective)
np.median(new_list),np.mean(new_list)
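If you only need the numbers and not the expanded list itself, numpy can also weight the mean directly; a small sketch reusing the arrays above:
# weighted mean without expanding the list
print(np.average(note, weights=Effective))       # ~10.96
# the median still uses the expanded values
print(np.median(np.repeat(note, Effective)))     # 10.0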
To get the third list that you expect, you can do something like this:
from statistics import median
note = [6,8,10,13,14,17]
Effective = [3,5,6,7,5,1]
newList = []
for index, value in enumerate(Effective):
    for j in range(value):
        newList.append(note[index])
print(newList)
print("Median is {}".format(median(newList)))
Output:
[6, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 17]
Median is 10
For computing the median I suggest you use statistics.median:
from statistics import median
note = [6, 8, 10, 13, 14, 17]
effective = [3, 5, 6, 7, 5, 1]
total = [n for n, e in zip(note, effective) for _ in range(e)]
result = median(total)
print(result)
Output
10
If you look at total (in the code above), you have:
[6, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 17]
A functional alternative, using repeat:
from statistics import median
from itertools import repeat
note = [6, 8, 10, 13, 14, 17]
effective = [3, 5, 6, 7, 5, 1]
total = [v for vs in map(repeat, note, effective) for v in vs]
result = median(total)
print(result)
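As a side note (not part of the original answer), the same expansion can also be written with itertools.chain:
from itertools import chain, repeat
total = list(chain.from_iterable(map(repeat, note, effective)))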
note = [6,8,10,13,14,17]
effective = [3,5,6,7,5,1]
newlist=[]
for i in range(len(note)):
    for j in range(effective[i]):
        newlist.append(note[i])
print(newlist)

Cumulative Explained Variance for PCA in Python

I have a simple R script for running FactoMineR's PCA on a tiny dataframe in order to find the cumulative percentage of variance explained for each variable:
library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)
df <- data.frame(a, b, c, d)
df_pca <- PCA(df, ncp = 4, graph=F)
print(df_pca$eig$`cumulative percentage of variance`)
Which returns:
> print(df_pca$eig$`cumulative percentage of variance`)
[1] 58.55305 84.44577 99.86661 100.00000
I'm trying to do the same in Python using scikit-learn's decomposition package as follows:
import pandas as pd
from sklearn import decomposition, linear_model
a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
pca = decomposition.PCA(n_components = 4)
pca.fit(df)
transformed_pca = pca.transform(df)
# sum cumulative variance from each var
cum_explained_var = []
for i in range(0, len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i-1])
print(cum_explained_var)
But this results in:
[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]
As you can see, both correctly add up to 100%, but it seems the contributions of each variable differ between the R and Python versions. Does anyone know where these differences are coming from or how to correctly replicate the R results in Python?
EDIT: Thanks to Vlo, I now know that the differences stem from the FactoMineR PCA function scaling the data by default. By using the sklearn preprocessing package (pca_data = preprocessing.scale(df)) to scale my data before running PCA, my results match the R output.
Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that the FactoMineR one scales the data by default. By simply adding a scaling step to my Python code, I was able to reproduce the results.
import pandas as pd
from sklearn import decomposition, preprocessing
a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
e = [35, 84, 3, 54, 68]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
pca_data = preprocessing.scale(df)
pca = decomposition.PCA(n_components = 4)
pca.fit(pca_data)
transformed_pca = pca.transform(pca_data)
cum_explained_var = []
for i in range(0, len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i-1])
print(cum_explained_var)
Output:
[0.58553054049052267, 0.8444577483783724, 0.9986661265687754, 0.99999999999999978]
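As a brief aside (not in the original answer), numpy gives the same cumulative percentages in one line:
import numpy as np
print(np.cumsum(pca.explained_variance_ratio_))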

calculate histogram peaks in python

In Python, how do I calculate the peaks of a histogram?
I tried this:
import numpy as np
from scipy.signal import argrelextrema
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4,
5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9,
12,
15, 16, 17, 18, 19, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24,]
h = np.histogram(data, bins=[0, 5, 10, 15, 20, 25])
hData = h[0]
peaks = argrelextrema(hData, np.greater)
But the result was:
(array([3]),)
I'd expect it to find the peaks in bin 0 and bin 3.
Note that the peaks span more than one bin. I don't want a peak that spans several columns to be counted as more than one peak.
I'm open to another way to get the peaks.
Note:
>>> h[0]
array([19, 15, 1, 10, 5])
>>>
In computational topology, the formalism of persistent homology provides a definition of "peak" that seems to address your need. In the 1-dimensional case, the peaks are illustrated by the blue bars in a figure in the blog article linked below.
A description of the algorithm is given in this Stack Overflow answer to a peak detection question. The nice thing is that this method not only identifies the peaks but also quantifies their "significance" in a natural way.
A simple and efficient implementation (as fast as sorting the numbers) and the source material for the above answer are given in this blog article:
https://www.sthu.org/blog/13-perstopology-peakdetection/index.html
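A closely related, ready-made option (not mentioned in the original answer) is scipy.signal.find_peaks, whose prominence measure plays the same "significance" role; here is a sketch on the question's binned counts, zero-padded so peaks at the edges are detected:
import numpy as np
from scipy.signal import find_peaks
hData = np.array([19, 15, 1, 10, 5])
padded = np.concatenate(([0], hData, [0]))  # so bins at the edges can be peaks
peaks, props = find_peaks(padded, prominence=0)
print(peaks - 1)              # [0 3] -> bins 0 and 3
print(props['prominences'])   # a "significance" value for each peak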
Try the findpeaks library.
pip install findpeaks
# Your input data:
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,]
# import library
from findpeaks import findpeaks
# Find some peaks using the smoothing parameter.
fp = findpeaks(lookahead=1, interpolate=10)
# fit
results = fp.fit(data)
# Make plot
fp.plot()
# Results with respect to original input data.
results['df']
# Results based on interpolated smoothed data.
results['df_interp']
I wrote an easy function:
import numpy as np

def find_peaks(a):
    x = np.array(a)
    max_value = np.max(x)
    length = len(a)
    ret = []
    for i in range(length):
        ispeak = True
        if i - 1 >= 0:
            ispeak &= (x[i] > 1.8 * x[i - 1])
        if i + 1 < length:
            ispeak &= (x[i] > 1.8 * x[i + 1])
        ispeak &= (x[i] > 0.05 * max_value)
        if ispeak:
            ret.append(i)
    return ret
I defined a peak as a value bigger than 180% of each neighbor and bigger than 5% of the maximum value. Of course, you can adapt these thresholds to find the best setup for your problem.
