How to replace NaNs using regression model estimates as forecasts - python

I have the following dataframe that shows monthly average daily volume of sales (adv) and average selling price (pkg_yld) and the 12-month percent changes for each metric:
df:
Over the period July 2021 to Dec. 2021 I need to:
(1) forecast pkg_yld using a linear regression model and then (2) calculate yoy_pkg_yld.
Both columns show NaN over the forecast horizon as shown above.
I have estimated a crude regression where Y = pkg_yld and X = adv, estimated over July 2020 – June 2021:
import statsmodels.api as sm

X = df['adv']
y = df['pkg_yld']
X = sm.add_constant(X)  # add an intercept term
model = sm.OLS(y, X).fit()
model.summary()
The estimated regression is:
pkg_yld = 93.3 - 0.68 * adv
I want to feed the provided monthly 'adv' column values into this formula to populate a predicted monthly 'pkg_yld' for each month from July to Dec. 2021, and then calculate yoy_pkg_yld over the same period.
For July 2021:
pkg_yld = 93.3 - (0.68 * 9.0) = 87.18
yoy_pkg_yld = ((87.2 / 97.5) - 1) * 100 = -10.6%
And so on to Dec. 2021.
What is the simplest way to overwrite NaNs in the pkg_yld and yoy_pkg_yld columns with the estimated values in my existing dataframe?

The first solution I could come up with is to define a custom function that takes a row (a Series) as input and fills in the values that are NaN, then apply it to every row of the DataFrame with df.apply.
import pandas as pd
import numpy as np

# Set up an example dataframe
data = {
    "A": [1, 2, 3, 4],
    "B": [4, 6, np.nan, 35],
    "C": [np.nan, 7, np.nan, 1]
}
df = pd.DataFrame(data)

# Function to apply to each row
# Insert your own column names and coefficients
def func(row):
    if np.isnan(row["B"]):
        row["B"] = row["A"] * 0.5 + 3
    if np.isnan(row["C"]):
        row["C"] = row["A"] * 0.7 + 15
    return row

result = df.apply(func, axis=1)  # axis=1 applies the function row by row
You can also add positional or keyword arguments to func and pass them through the args argument of apply, or directly as keyword arguments. Have a look at the documentation for further details.
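For example, a small sketch of the same idea with the coefficients supplied through args instead of hard-coded (the numbers are just the placeholder values from above):
def func(row, b_coefs, c_coefs):
    # each coefs argument is a (slope, intercept) pair
    if np.isnan(row["B"]):
        row["B"] = row["A"] * b_coefs[0] + b_coefs[1]
    if np.isnan(row["C"]):
        row["C"] = row["A"] * c_coefs[0] + c_coefs[1]
    return row

result = df.apply(func, axis=1, args=((0.5, 3), (0.7, 15)))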
Alternatively, you could compute all the regression results in advance and fill in the missing values with df.fillna where necessary:
values = df.copy()  # copy df to avoid overwriting existing values
values["B"] = values["A"] * 0.5 + 3
values["C"] = values["A"] * 0.7 + 15
result = df.fillna(value=values)
The second approach is more readable IMO; however, you are computing every possible result, which could be a lot less efficient than the first approach, depending on the complexity of your calculation and the size of your dataframe.
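Translating the second approach back to the original question, a rough sketch (assuming the fitted statsmodels results object is called model, the dataframe has one row per month, and the columns are named adv, pkg_yld and yoy_pkg_yld as shown above) could look like this:
# Predicted pkg_yld for every month, from the fitted regression
predicted = model.predict(sm.add_constant(df['adv']))
# Fill pkg_yld only where it is missing (the forecast horizon)
df['pkg_yld'] = df['pkg_yld'].fillna(predicted)
# Recompute the 12-month percent change and fill it only where missing
yoy = (df['pkg_yld'] / df['pkg_yld'].shift(12) - 1) * 100
df['yoy_pkg_yld'] = df['yoy_pkg_yld'].fillna(yoy)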

Related

How can I use a dataframe and DatetimeIndex to return rolling 12-month values?

Imagine a pandas dataframe with 2 columns ("Manager Returns" and "Benchmark Returns") and a DatetimeIndex of monthly frequency. Please write a function to calculate the rolling 12-month manager alpha and rolling 12-month tracking error (both annualized).
So far I have this, but I'm confused about the rolling 12-month part:
import pandas as pd
import numpy as np
#define dummy dataframe with monthly returns
df = pd.DataFrame(1 + np.random.rand(20), columns=['returns'])
#compute 12-month rolling returns
df_roll = df.rolling(window=12).apply(np.prod) - 1
So, you want to calculate the excess return of the 'Manager Returns' compared to the 'Benchmark Returns'. First, we create some random data for these two columns.
import pandas as pd
import numpy as np

n = 20
df = pd.DataFrame(dict(
    Manager=np.random.randint(2, 9, size=n),
    Benchmark=np.random.randint(1, 7, size=n),
    index=pd.date_range("20180101", freq='MS', periods=n)))
df.set_index('index', inplace=True)
To calculate the excess return (Alpha), the rolling mean of Alpha, and the rolling Tracking Error, we create a new column for each value.
# Create Alpha
df['Alpha'] = df['Manager'] - df['Benchmark']
# Rolling 12-month mean of Alpha
df['Alpha_rolling'] = df['Alpha'].rolling(12).mean()
# Rolling 12-month Tracking Error (standard deviation of Alpha)
df['TrackingError_rolling'] = df['Alpha'].rolling(12).std()
Edit: I see that the values should be annualized, so you would have to scale the monthly figures accordingly; my finance lingo is not currently up to date.
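A minimal sketch of that annualization, assuming 12 monthly observations per year (the usual convention is to scale a monthly mean by 12 and a monthly standard deviation by sqrt(12)):
# Annualized rolling alpha and tracking error
df['Alpha_rolling_ann'] = df['Alpha'].rolling(12).mean() * 12
df['TrackingError_rolling_ann'] = df['Alpha'].rolling(12).std() * np.sqrt(12)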

Pandas calculating the speed between two rows based on timestamp and coordinates

Given a pandas dataframe like:
timestamp      latitude  longitude
1652846403129        30         20
1652846415130        31         21
1652846427128        32         22
1652846439128        33         23
How could I calculate the "speed" between two rows and have it as a new column?
I would have to calculate the distance based on the coordinates and divide by time taken.
I have a function that calculates distance between two coordinates, but it takes two tuples of form (lat,long) as input.
rolling.apply seemed like an option so that I can work on windows of two, but I can't quite get it to work and it seemed only to work on one series at a time.
Something like creating a new column with shifting doesn't seem to work either since the function for distance that I'm using doesn't support the use of dataframes.
You can use the diff function to get the time difference between consecutive rows:
df['time'] = df['timestamp'].diff()
You can also get the distance from the previous row by combining the shift function with your own get_distance function: shift the coordinates into helper columns first, then apply the function row by row.
df[['prev_lat', 'prev_lon']] = df[['latitude', 'longitude']].shift(1)
df['distance'] = df.apply(
    lambda row: get_distance((row['latitude'], row['longitude']),
                             (row['prev_lat'], row['prev_lon'])),
    axis=1)  # the first row has no previous point, so its distance will be NaN
The code below can help get the required results:
# Import required packages
import numpy as np
import pandas as pd
from haversine import haversine

# Dummy data
df = pd.DataFrame({'timestamp': [1652846403129, 1652846415130, 1652846427128, 1652846439128],
                   'latitude': [30, 31, 32, 33],
                   'longitude': [20, 21, 22, 23]})

# Haversine distance between two (lat, lon) points, in miles
def haversine_dist(x1, x2, y1, y2):
    return haversine((x1, x2), (y1, y2), unit='mi')

# Data processing
df['distance'] = np.vectorize(haversine_dist)(df['latitude'], df['longitude'],
                                              df['latitude'].shift(1), df['longitude'].shift(1))  # distance from previous row
df['time_taken'] = df['timestamp'] - df['timestamp'].shift(1)  # time difference (milliseconds)
df['speed'] = df['distance'] / df['time_taken']  # calculate speed
df
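One thing to note: with the timestamps in milliseconds and the distance in miles, the speed column above ends up in miles per millisecond. A small sketch of converting it to a more familiar unit such as miles per hour (assuming the column names from the snippet above):
# 1 hour = 3,600,000 milliseconds
df['speed_mph'] = df['distance'] / (df['time_taken'] / 3_600_000)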

Apply a function across all rows in new column creation Pandas

I have the following dataset where I make my predictions and historically I know the standard deviations of these predictions:
import pandas as pd

d = {'Name': ['Jim', 'Matt', 'Alex', 'Nathan', 'Dom'],
     'Predict': [2.901826509, 3.212149337, 2.388237651, 3.744206058, 1.944415024]}
df = pd.DataFrame(data=d)
df['Mean'] = 4
df['StDev'] = 6
df.head(5)
     Name   Predict  Mean  StDev
0     Jim  2.901827     4      6
1    Matt  3.212149     4      6
2    Alex  2.388238     4      6
3  Nathan  3.744206     4      6
4     Dom  1.944415     4      6
I have also found a function from https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f that does the following:
import numpy as np
from scipy.stats import norm

def MC_prob(M, mu, sigma):
    prob_larger_than3 = []
    for i in range(M):
        # Using CDF since P[Z>=3] = 1 - P[Z<=3]
        p = 1 - norm.cdf(3, mu, sigma)
        # Using Survival Function P[Z>=3]
        p = norm.sf(3, mu, sigma)
        prob_larger_than3.append(p)
    MC_approximation_prob = np.array(prob_larger_than3).mean()
    return MC_approximation_prob

MC_prob(M=10000, mu=10, sigma=2)
0.9997673709209641
I would like to apply this function and create a new column in my dataframe, with the probability of my Predict column being over 3.
I tried:
df['ProbOver3'] = MC_prob(M = 10000, mu = df.Predict, sigma = df.StDev)
but it gave the same value for every row. Any ideas on how to apply this over every row? Essentially I am trying to simulate and return the probability of each row being above or below certain numbers, and I hope I am on the right track. It's a follow-up question to this one: Apply a monte carlo simulation on a pandas dataframe and return probability result in column.
Any help would be much appreciated, thanks very much!
Use df.apply() with a lambda. You can apply (pun intended) the function to every row and build a new column by passing axis=1, which tells apply to operate row by row; the lambda passes each row's values to the function. Here is how you could use it:
df['ProbOver3'] = df.apply(lambda row: MC_prob(10000, row['Predict'], row['StDev']), axis=1)
Check out the docs on df.apply for more info.
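As a side note, norm.cdf and norm.sf are deterministic, so every pass through the loop in MC_prob computes the same value and the loop only repeats it. A simpler sketch, assuming the same column names, is to call the survival function directly on the whole column:
from scipy.stats import norm

# norm.sf broadcasts over array-like mu and sigma, so no loop or apply is needed
df['ProbOver3'] = norm.sf(3, df['Predict'], df['StDev'])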

How to calculate the correlation coefficient of grouped quantities in Pandas?

I have a DataFrame in which each row represents a traffic accident. Two of the columns are Speed_limit and Number_of_casualties. I would like to compute the Pearson correlation coefficient between the speed limit and the ratio of the number of casualties to accidents for each speed limit.
My solution so far is to get the relevant quantities as arrays and use SciPy's pearsonr:
import pandas as pd
import scipy.stats

df = pd.DataFrame({'Speed_limit': [10, 10, 20, 20, 20, 30],
                   'Number_of_casualties': [1, 2, 3, 4, 1, 4]})

accidents_per_speed_limit = df['Speed_limit'].value_counts().sort_index()
number_of_casualties_per_speed_limit = df.groupby('Speed_limit').sum()['Number_of_casualties']
speed_limit = accidents_per_speed_limit.index
ratio = number_of_casualties_per_speed_limit.values / accidents_per_speed_limit.values

r, _ = scipy.stats.pearsonr(x=speed_limit, y=ratio)
print("The Pearson's correlation coefficient between the number of casualties per accident and the speed limit is {r}.".format(r=r))
However, it would seem to me that it should be possible to do this more elegantly using the pandas.DataFrame.corr method. How could I refactor this code to make it more pandas-like?
Instead of count and sum, you can directly use the mean of the grouped data and then Series.corr (the default method is pearson), i.e.
m = df.groupby('Speed_limit').mean().reset_index()
m['Speed_limit'].corr(m['Number_of_casualties'])
Output:
0.99926008128973687
I found the following way using two auxiliary DataFrames:
df_aux = df.groupby('Speed_limit').agg(['count', 'sum'])
df_aux2 = pd.DataFrame({'ratio': df_aux['Number_of_casualties', 'sum'] / df_aux['Number_of_casualties', 'count'],
                        'speed_limit': df_aux.index})
print(df_aux2.corr()['ratio']['speed_limit'])
which corroborates the result obtained with scipy.stats.pearsonr. It's still not very elegant though, and I would appreciate suggestions for improvements.

Applying pandas qcut bins to new data

I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is, how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes. Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser is the categorical series and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0    13
1    19
2     3
3     9
4    13
5    17
...
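The same bins can then be applied to a completely new dataset (for example a validation set), which is what the question asks about. A small sketch, where new_df is a hypothetical new frame with the same column:
new_df = pd.DataFrame(prng.randn(50, 1), columns=["A"])
new_df["A_BIN"] = pd.cut(new_df["A"], bins=bins, labels=False, include_lowest=True)
Any values in new_df["A"] that fall outside the range of the original bins will come back as NaN, which is exactly the situation described next.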
User @Karen said:
By using this logic, I am getting Na values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below (or above) the smallest (or greatest) value in the training data, so some values fall outside the bin range and are not assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
