I have the following dataset where I make my predictions, and historically I know the standard deviations on these predictions:
d = {'Name': ['Jim', 'Matt','Alex','Nathan','Dom'], 'Predict': [2.901826509,3.212149337,2.388237651,3.744206058,1.944415024]}
df = pd.DataFrame(data=d)
df['Mean'] = 4
df['StDev'] = 6
df.head(5)
Name Predict Mean StDev
0 Jim 2.901827 4 6
1 Matt 3.212149 4 6
2 Alex 2.388238 4 6
3 Nathan 3.744206 4 6
4 Dom 1.944415 4 6
I have also found a function from https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f that has the following:
import numpy as np
from scipy.stats import norm
def MC_prob(M, mu, sigma):
    prob_larger_than3 = []
    for i in range(M):
        # Using CDF since P[Z>=3] = 1-P[Z<=3]
        p = 1 - norm.cdf(3, mu, sigma)
        # Using Survival Function P[Z>=3]
        p = norm.sf(3, mu, sigma)
        prob_larger_than3.append(p)
    MC_approximation_prob = np.array(prob_larger_than3).mean()
    return MC_approximation_prob
MC_prob(M = 10000, mu = 10, sigma = 2)
0.9997673709209641
I would like to apply this function and create a new column in my dataframe, with the probability of my Predict column being over 3.
I tried:
df['ProbOver3'] = MC_prob(M = 10000, mu = df.Predict, sigma = df.StDev)
but it gave the same value for every row. Any ideas on how to apply this over every row? Essentially I am trying to simulate and return a probability of each row being above or below certain numbers, and I hope I am on the right track. This is a follow-up to this question: Apply a monte carlo simulation on a pandas dataframe and return probability result in column.
Any help would be much appreciated, thanks very much!
Use df.apply() with a lambda. You can apply (pun intended) the function to every row to make a new column by passing axis=1, which applies it row-wise, and by using a lambda to pass each row's values to the function. Here is how you could use this:
df['ProbOver3'] = df.apply(lambda row: MC_prob(10000, row['Predict'], row['StDev']), axis=1)
Check out the docs on df.apply for more info.
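As a side note, MC_prob is deterministic here: the loop appends the same norm.sf value M times, so averaging does not change it. If that matches your intent, a vectorized sketch (my suggestion, not part of the answer above) is to evaluate the survival function on the whole column at once:

from scipy.stats import norm

# same result as the apply() version, computed for all rows in one call
df['ProbOver3'] = norm.sf(3, loc=df['Predict'], scale=df['StDev'])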
I have the following dataframe that shows monthly average daily volume of sales (adv) and average selling price (pkg_yld) and the 12-month percent changes for each metric:
df:
Over the period July 2021 to Dec. 2021 I need to:
(1) forecast pkg_yld using a linear regression model and then (2) calculate yoy_pkg_yld.
Both columns show NaN over the forecast horizon as shown above.
I have estimated a crude regression where Y = pkg_yld and X = adv, estimated between July 2020 and June 2021:
import statsmodels.api as sm

X = df['adv']
y = df['pkg_yld']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()
The estimated regression is:
pkg_yld = 93.3 - 0.68 * adv
I want to enter the provided monthly ‘adv’ column values into the formula to populate a predicted monthly ‘pkg_yld’ for each month from July- Dec. 2021 and then calculate the yoy_pkg_yld over the same period.
For July 2021:
pkg_yld = 93.3 - (0.68 x 9.0) = 87.18
yoy_pkg_yld = ((87.2 / 97.5) - 1) x 100 = -10.6%
And so on to Dec. 2021.
What is the simplest way to overwrite NaNs in the pkg_yld and yoy_pkg_yld columns with the estimated values in my existing dataframe?
The first solution that I could come up with is to define a custom function that takes a row (a Series) as input and calculates the missing values where they are NaN, then apply this function to every row of the DataFrame with df.apply:
import pandas as pd
import numpy as np
# setup example dataframe
data = {
    "A": [1, 2, 3, 4],
    "B": [4, 6, np.nan, 35],
    "C": [np.nan, 7, np.nan, 1],
}
df = pd.DataFrame(data)

# Function to apply on each row
# Insert your column names and coefficients
def func(row):
    if np.isnan(row["B"]):
        row["B"] = row["A"] * 0.5 + 3
    if np.isnan(row["C"]):
        row["C"] = row["A"] * 0.7 + 15
    return row

result = df.apply(func, axis=1)  # apply the function; axis=1 applies it to each row
You can also add positional or keyword arguments to func and pass them through the args parameter of apply, or directly as keyword arguments. Have a look at the documentation for further details.
Alternatively, you could compute all the regression results in advance and fill the missing values with df.fillna where necessary:
values = df.copy()  # copy df to avoid overriding existing values
values["B"] = values["A"] * 0.5 + 3
values["C"] = values["A"] * 0.7 + 15
result = df.fillna(value=values)
The second approach is more readable, in my opinion; however, you are calculating all possible results, which could be a lot less efficient than the first approach, depending on the complexity of your calculation and the size of your dataframe.
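Mapped back to the original question, a minimal sketch of the second approach could look like the following. It assumes df has a monthly DatetimeIndex with no gaps and the columns adv, pkg_yld and yoy_pkg_yld from the post, and reuses the fitted statsmodels model:

import statsmodels.api as sm

mask = df['pkg_yld'].isna()                    # rows in the forecast horizon
X_new = sm.add_constant(df.loc[mask, 'adv'])   # plug the provided adv values into the regression
df.loc[mask, 'pkg_yld'] = model.predict(X_new)
# year-over-year change against the value 12 months earlier
df.loc[mask, 'yoy_pkg_yld'] = (df['pkg_yld'] / df['pkg_yld'].shift(12) - 1) * 100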
I want to fit two learning rates (alphas), one for the first half of the data and one for the second half. I was able to do this for just one learning rate but am running into errors when attempting to fit two.
These functions work when minimizing one variable:
optimize.fminbound(sse_f,0,1)
minimize_scalar(sse_f, bounds=(0,1), method='bounded')
I'm not sure if I can use fminbound/minimize_scalar or if I should use minimize() for two learning rates (alphas).
My function looks something like this; I've removed a few lines for simplicity, but basically I want to minimize the SSE for the first half of the data and the second half separately.
def sse_f(a_1, a_2):
    data = []
    for _, row in temp.iterrows():
        alpha = a_1                  # for first 120 rows use one alpha
        if row['trial_nr'] == 120:   # for rows after 120 use second alpha
            alpha = a_2
        # calculate diff variables, removed for simplicity
        data.append([phase, pi, abs_PE])
    col = ['phase', 'abs_PE', 'congruency']
    df_ = pd.DataFrame(data, columns=col)
    df_a = df[(df['phase'] == 'a')]
    df_b = df[(df['phase'] == 'b')]
    x = np.array(df_a[['congruency', 'abs_PE']])    # run linear regression for first half
    y = df_a['RT']
    sse_a = lin_reg(x, y)
    x_b = np.array(df_b[['congruency', 'abs_PE']])  # run linear regression for second half
    y_b = df_b['RT']
    sse_b = lin_reg(x_b, y_b)                       # calculate SSE
    return sse_a, sse_b                             # return SSE for first half and second half of data
The output of this function would be a tuple, e.g.:
sse_f(.43,.43)
(54487466.6875101, 17251575.11206138)
If I use minimize() I get this error:
minimize(sse_f, x0=(0,0), bounds=[(0,0),(1,1)])
TypeError: sse_f() missing 1 required positional argument: 'a_2'
And if I use minimize_scalar() I get this error:
ValueError: Optimisation bounds must be scalars or array scalars.
Any pointers on how to fit two alphas, or why I get these errors, would be greatly appreciated!
I just changed your function definition so that it takes a single array and unpacks it into a_1 and a_2 individually, since minimize passes the whole parameter vector (here [alpha1, alpha2]) as one argument to the objective. I wanted to test it, but I don't know what the variable temp is.
def sse_f(a):
    a_1 = a[0]
    a_2 = a[1]
    data = []
    for _, row in temp.iterrows():
        alpha = a_1                  # for first 120 rows use one alpha
        if row['trial_nr'] == 120:   # for rows after 120 use second alpha
            alpha = a_2
        # calculate diff variables, removed for simplicity
        data.append([phase, pi, abs_PE])
    col = ['phase', 'abs_PE', 'congruency']
    df_ = pd.DataFrame(data, columns=col)
    df_a = df[(df['phase'] == 'a')]
    df_b = df[(df['phase'] == 'b')]
    x = np.array(df_a[['congruency', 'abs_PE']])    # run linear regression for first half
    y = df_a['RT']
    sse_a = lin_reg(x, y)
    x_b = np.array(df_b[['congruency', 'abs_PE']])  # run linear regression for second half
    y_b = df_b['RT']
    sse_b = lin_reg(x_b, y_b)                       # calculate SSE
    return sse_a, sse_b                             # return SSE for first half and second half of data
Let me know if it helped.
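One more point, as a hedged addition to the answer above: scipy.optimize.minimize expects the objective to return a single scalar, so returning the tuple (sse_a, sse_b) will still fail. A minimal sketch, assuming you are happy to fit both alphas jointly by summing the two SSEs (note also that bounds takes one (low, high) pair per parameter):

from scipy.optimize import minimize

def sse_total(a):
    sse_a, sse_b = sse_f(a)   # sse_f as defined above, taking the array [a_1, a_2]
    return sse_a + sse_b      # combine into a single scalar objective

res = minimize(sse_total, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)])
print(res.x)                  # fitted (a_1, a_2)

Alternatively, if the two halves are genuinely independent (each alpha only affects its own SSE), you could keep minimize_scalar and fit each half on its own.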
I am trying to translate an algorithm from MATLAB to Python. The algorithm works with large datasets and needs an outlier detection and elimination technique to be applied.
In the MATLAB code, the outlier deletion technique I use is movmedian:
Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
Data_raw(find(Outlier_T),:)=[]
Which detects outliers with a rolling median, by finding disproportionate values in the centre of a three-value moving window. So if I have a column "Temperatura" with a 40 on row 3, it is detected and the entire row is deleted.
Temperatura Date
1 24.72 2.3
2 25.76 4.6
3 40 7.0
4 25.31 9.3
5 26.21 15.6
6 26.59 17.9
... ... ...
To my understanding, this is achieved with pandas.DataFrame.rolling. I have seen several posts that exemplify its use, but I am not managing to make it work with my code:
Attempt A:
Dataframe.rolling(df["t_new"]))
Attempt B:
df-df.rolling(3).median().abs()>200
#based on #Ami Tavory's answer
Am I missing something obvious here? What is the right way of doing this?
Thank you for your time.
The code below drops rows based on a threshold, which can be adjusted as needed. Not sure if it exactly replicates the MATLAB code, though.
# Import Libraries
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]
})
# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()
# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']
# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)
# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)
Output
print(df)
Temperatura Date
0 24.72 2.3
1 25.76 4.6
3 25.31 9.3
4 26.21 15.6
5 26.59 17.9
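A small, hedged refinement: MATLAB's movmedian uses a window centred on each element, whereas rolling(window=3) above is a trailing window and leaves NaN in the first two rows. To get closer to the MATLAB behaviour you could centre the pandas window:

# centred 3-point window; min_periods=1 keeps the edge rows instead of producing NaN there
df['rolling_temp'] = df['Temperatura'].rolling(window=3, center=True, min_periods=1).median()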
Late to the party. This builds on Nilesh Ingle's answer, modified to be more general, more verbose (graphs!), and to threshold on min-max scaled values rather than on the raw data.
# Calculate rolling median
df["Temp_Rolling"] = df["Temp"].rolling(window=3).median()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df["Temp_Scaled"] = scaler.fit_transform(df["Temp"].values.reshape(-1, 1))
df["Temp_Rolling"] = scaler.fit_transform(df["Temp_Rolling"].values.reshape(-1, 1))
# Calculate difference
df["Temp_Diff"] = df["Temp_Scaled"] - df["Temp_Rolling"]
import numpy as np
import matplotlib.pyplot as plt
# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4
# Flag rows to keep as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)
# Keep flagged rows
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()
Once you're satisfied with the data cleaning
# Satisfied, replace data
df = df[df["Temp_Keep_Flag"].values]
df.drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"], inplace=True)
df.plot()
Nilesh's answer works perfectly; to iterate on his code you could also do:
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# all in one line
df = df.drop(df[(df['Temp'] - df['rolling_temp'] > upper_threshold) | (df['Temp'] - df['rolling_temp'] < lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]
I would like to remove outliers from a Pandas dataframe using a user-defined function. There are answers to similar questions on Stack Overflow, but the difference here is that my dataset consists of circular data, so using the Pandas built-in functions mean() and std() would not be appropriate. For example, in circular data the values 355 and 5 differ by only 10, whereas the linear difference is 350.
I have thousands of dataframes like the one below. We clearly see that Geophone 6 is an outlier.
Geophone azimuth incidence
0 1 194.765326 29.703151
1 2 193.143982 23.380681
2 3 199.327911 34.752212
3 4 195.641010 49.186893
4 5 193.479015 21.192982
5 6 0.745142 3.410046
6 7 192.380435 29.778807
7 8 196.700814 19.750237
It can also be confirmed when plotting the data in a polar diagram.
I have written two functions, mean_angle and variance_angle, which calculate the circular mean and variance to be applied to the data. The variance gives a value between 0 and 1: when the data points are close to each other the variance gets closer to 0, and vice versa.
import numpy as np
def mean_angle(deg):
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    mu = np.arctan(S/C)
    mu = np.rad2deg(mu)
    if S > 0 and C > 0:
        mu = mu
    elif S > 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C > 0:
        mu = mu + 360
    return mu
def variance_angle(deg):
    """
    deg: angles in degrees
    """
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    length = C.size
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    R = np.sqrt(S**2 + C**2)
    R_avg = R/length
    V = 1 - R_avg
    return V
mean_azimuth = mean_angle(df.azimuth)
variance = variance_angle(df.azimuth)
print(mean_azimuth)
197.4122778774279
print(variance)
0.24614383460498535
However, when row 5 is excluded from the calculation, the mean and variance become 195.06226604362286 and 0.0007544067627361928 respectively; the variance drops from 0.25 to almost 0.
Therefore, I would like to find a way to remove any circular outlier value/s (azimuth) which makes circular variance high using the defined functions shown above.
In this example incidence is also an outlier for the same Geophone, but it actually has no relation to azimuth. There are other data where incidence is within range but azimuth is an outlier.
Any help is really appreciated.
One way to do outlier detection is to compute the mean and std of the data, then remove points that lie outside A*std of the mean (where you tune A to whatever is reasonable for your data).
So you could use your functions to compute the circular mean and variance of your dataframe, then pass over the dataframe again and remove the data points that fall outside A*std of that mean.
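A minimal sketch of that idea, reusing the mean_angle and variance_angle functions above. The wrapping of the differences into [-180, 180) and the use of the circular standard deviation sqrt(-2 ln R) are my assumptions, so tune A (and the definition of spread) to your data:

import numpy as np

def remove_circular_outliers(df, col='azimuth', A=2.0):
    mu = mean_angle(df[col])
    R = 1 - variance_angle(df[col])                  # mean resultant length
    circ_std = np.rad2deg(np.sqrt(-2 * np.log(R)))   # circular std dev in degrees
    diff = (df[col] - mu + 180) % 360 - 180          # smallest signed angular difference
    return df[np.abs(diff) <= A * circ_std]

df_clean = remove_circular_outliers(df, col='azimuth', A=2.0)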
I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is: how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes? Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser is the categorical series and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0 13
1 19
2 3
3 9
4 13
5 17
...
User #Karen said:
By using this logic, I am getting NaN values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below (or above) the smallest (or greatest) value from the training data. Therefore, some values will fall out of range and will therefore not be assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
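An alternative sketch (my suggestion, not part of the answer above): leave the training values untouched and instead open up the outermost bin edges returned by retbins=True, so every validation value falls into some bin:

import numpy as np

s, b = pd.qcut(train['value'], 20, retbins=True, labels=False)
b[0], b[-1] = -np.inf, np.inf                       # make the first and last bins open-ended
test['bin'] = pd.cut(test['value'], bins=b, labels=False)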