I have a dataframe with this info:
I need to find a formula that calculates, for each of the 4 months of 2023, the real variation of column A against the same month of 2022. For example, in the case of 2023-04, the calculation is
x = 140 (value of 2022-04) * 1.66 (accumulated inflation from 2022-04 to 2023-04)
x = 232.27
Real variation 2023-04 = (150 (value of 2023-04) - x) / x
Real variation 2023-04 = -0.35
The value 1.66, the accumulated inflation from 2022-04 to 2023-04, comes from this calculation: starting from 1 in 2022-04, for every month until 2023-04, apply the formula = previous row value * (1 + inflation column value). In the case of 2023-04, the value 1.66 is the last one of the calculation (the accumulated inflation over the 12 months): 1, 1.06, 1.09, 1.15, 1.19, 1.28, 1.35, 1.39, 1.46, 1.58, 1.64, 1.66.
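In code, the accumulated inflation described above is just a cumulative product; a minimal sketch with made-up monthly rates (hypothetical values, not my real data):

```python
import pandas as pd

# hypothetical monthly inflation rates (as fractions) for the 12 months
# from 2022-05 through 2023-04
inflation = pd.Series([0.06, 0.03, 0.05, 0.04, 0.07, 0.06,
                       0.03, 0.05, 0.08, 0.04, 0.01, 0.02])

# previous row value * (1 + inflation), starting from 1 in 2022-04
acc = (1 + inflation).cumprod()
```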
Thanks
Your data is really bad: you have missing values, and I think columnB is in [%].
here is my suggestion
Dataframe:
Time columnA columnB
0 2022-01-31 100 0.3
1 2022-02-28 120 0.5
2 2022-03-31 150 0.4
3 2022-04-30 140 0.7
Code of your calculations (this assumes Time is already datetime64; the value v1['vals'] in your sketch was a scalar, so I pull the three scalars out explicitly):

import numpy as np

# accumulated inflation, starting at 1 in row 3 (2022-04)
df['vals'] = np.nan
df.loc[3, 'vals'] = 1
k = 1
arr = []
for i in df['columnB'].loc[4:].values:
    k = k * (1 + i / 10)   # dividing by 10 to match your example numbers
    arr.append(k)
df.loc[4:, 'vals'] = arr

df['Month'] = df['Time'].dt.month
df['Year'] = df['Time'].dt.year

year = 2023
for month in range(1, 13):
    sel_now = (df['Month'] == month) & (df['Year'] == year)
    sel_prev = (df['Month'] == month) & (df['Year'] == year - 1)
    if not sel_now.any() or not sel_prev.any():
        continue   # skip months missing from the data
    v1 = df['vals'].loc[sel_now].values[0]       # accumulated inflation
    v2 = df['columnA'].loc[sel_prev].values[0]   # value one year earlier
    x = v1 * v2
    print(f'{year}-{month}', (df['columnA'].loc[sel_now].values[0] - x) / x)
Output would be:
2023-4 -0.354193
The code could perhaps be optimized, but I am not sure if your input is correct.
cheers
Here is a completely vectorized solution using pure pandas (in other words: it is fast).
It is relatively straightforward if you have a DataFrame with the proper index. Also, your "INFLATION" values are in undefined units; to match your example, I have to divide them by 10 (so it is neither a fraction nor a percentage).
Step 1: sample data as reproducible example
import pandas as pd

df = pd.DataFrame({
'TIME': ['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06', '2022-07', '2022-08',
'2022-09', '2022-10', '2022-11', '2023-01', '2023-02', '2023-03', '2023-04'],
'A': [100, 120, 150, 140, 180, 200, 100, 120, 150, 140, 180, 200, 100, 120, 150],
'INFLATION': [0.3, 0.5, 0.4, 0.7, 0.6, 0.3, 0.5, 0.4, 0.7, 0.6, 0.3, 0.5, 0.8, 0.4, 0.1],
})
Step 2: Your calculation
Convert the time column into a PeriodIndex
df = df.assign(TIME=df['TIME'].apply(pd.Period)).set_index('TIME')
Solution
ci = (1 + df['INFLATION']/10).cumprod() # cumulative inflation
ci12 = ci / ci.shift(12, 'M') # 12-month variation
x = df['A'].shift(12, 'M') * ci12
real_var = (df['A'] - x) / x # your "real variation"
# putting it all together in a new df
res = df.assign(ci12=ci12, x=x, real_var=real_var)
>>> res
A INFLATION ci12 x real_var
TIME
2022-01 100 0.3 NaN NaN NaN
2022-02 120 0.5 NaN NaN NaN
2022-03 150 0.4 NaN NaN NaN
2022-04 140 0.7 NaN NaN NaN
2022-05 180 0.6 NaN NaN NaN
2022-06 200 0.3 NaN NaN NaN
2022-07 100 0.5 NaN NaN NaN
2022-08 120 0.4 NaN NaN NaN
2022-09 150 0.7 NaN NaN NaN
2022-10 140 0.6 NaN NaN NaN
2022-11 180 0.3 NaN NaN NaN
2023-01 200 0.5 1.708788 170.878849 0.170420
2023-02 100 0.8 1.757611 210.913323 -0.525872
2023-03 120 0.4 1.757611 263.641653 -0.544837
2023-04 150 0.1 1.659053 232.267475 -0.354193
Related
I have a pandas dataframe which looks like
Temperature_lim Factor
0 32 0.95
1 34 1.00
2 36 1.06
3 38 1.10
4 40 1.15
I need to extract the factor value for any given temperature. If my current temperature is 31, the factor is 0.95; if it is 33, the factor is 1.00; if it is 38.5, the factor is 1.15. So by giving my current temperature, I would like to know the factor for that temperature.
I can do this using multiple if/else statements, but is there a more effective way to do it by creating bins/intervals in pandas or python?
Thank you
Use cut: prepend -np.inf to the values of column Temperature_lim as the bin edges, and fill missing values (temperatures beyond the last edge) with the last Factor value:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Temp': [31, 33, 38.5, 40, 41]})
b = [-np.inf] + df['Temperature_lim'].tolist()
lab = df['Factor']
df1['new'] = pd.cut(df1['Temp'], bins=b, labels=lab, right=False).fillna(lab.iat[-1])
print(df1)
Temp new
0 31.0 0.95
1 33.0 1.00
2 38.5 1.15
3 40.0 1.15
4 41.0 1.15
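An alternative sketch with numpy.searchsorted, which finds the first Temperature_lim that is >= each temperature (same data as in the question, shown here as a self-contained example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Temperature_lim': [32, 34, 36, 38, 40],
                   'Factor': [0.95, 1.00, 1.06, 1.10, 1.15]})
temps = np.array([31, 33, 38.5, 40, 41])

# index of the first limit >= each temperature, clipped to the last row
# for temperatures beyond the final limit
idx = np.searchsorted(df['Temperature_lim'].values, temps, side='left')
idx = idx.clip(max=len(df) - 1)
factors = df['Factor'].values[idx]  # [0.95, 1.0, 1.15, 1.15, 1.15]
```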
I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression with a 250-day window for each column over the whole sample period and save the coefficients in another dataframe.
Iterating over the columns and rows with two for-loops isn't very efficient, so I tried this, but I get the following error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y, X).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t - regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
Here is my version, using df.rolling() instead and iterating over the columns.
I am not completely sure it is what you were looking for; don't hesitate to comment.
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32],
                           '1': [60, 60, 60, 60, 60],
                           '2': [np.nan, np.nan, np.nan, np.nan, np.nan]})

def regress(X, Z):
    if np.isnan(X).any() == False:
        model = sm.OLS(X, Z).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 3
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")

df_coef = pd.DataFrame()
for col in df_returns.columns:
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda w: regress(w, Z))

df_coef
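If the full 5000 x 10000 frame is too slow with one OLS fit per window, note that the slope against a fixed time index 0..window-1 has a closed form, cov(t, y) / var(t). A sketch of that shortcut (my own reformulation, which should match model.params[1] for complete windows, but I have not benchmarked it at your scale):

```python
import numpy as np
import pandas as pd

def rolling_slope(df, window):
    # centered time index; since w sums to zero, np.dot(w, y) equals the
    # covariance numerator sum((t - t_mean) * y), so the OLS slope is
    # np.dot(w, y) / sum(w**2)
    w = np.arange(window) - (window - 1) / 2
    denom = (w ** 2).sum()
    return df.rolling(window).apply(lambda y: np.dot(w, y) / denom, raw=True)

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32], '1': [60, 60, 60, 60, 60]})
slopes = rolling_slope(df_returns, 3)
```

Windows containing NaN come out as NaN automatically, because the dot product propagates them.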
I'm relatively new to python and pandas, and am trying to figure out how to write an IF statement (or any other construct) that, once it initially returns a value, continues with further IF statements within a given range.
I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas?
import pandas as pd
data = {'Yrs': [ '2018','2019', '2020', '2021', '2022'], 'Val': [1.50, 1.75, 2.0, 2.25, 2.5] }
data2 = {'F':['2015','2018', '2020'], 'L': ['2019','2022', '2024'], 'Base':['2','5','5'],
'O':[20, 40, 60], 'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()
# use this code to get the first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0, 'Yrs']), '2018'] = \
    (1 / pd.to_numeric(df2['Base'])) * pd.to_numeric(df2['S']) * \
    pd.to_numeric(df.at[0, 'Val']) + pd.to_numeric(df2['O'])
# use this code to get the rest of the values until L = Yrs
r.loc[(df2['L'] <= df.at[1, 'Yrs']) & (df2['L'] >= df.at[1, 'Yrs']), '2019'] = \
    (pd.to_numeric(r['2018']) - pd.to_numeric(df2['O'])) * \
    pd.to_numeric(df.at[1, 'Val']) / pd.to_numeric(df.at[0, 'Val']) + \
    pd.to_numeric(df2['O'])
r
I expect the output to be (the values may be different, but it's the pattern I want):
2018 2019 2020 2021 2022
0 7.75 8.375 NaN NaN NaN
1 11.0 11.5 12 12.5 13.0
2 NaN NaN 18 18.75 19.25
But I get:
2018 2019 2020 2021 2022
0 7.75 8.375 9.0 9.625 10.25
1 11.0 11.5 12 NaN NaN
2 16.50 17.25 18 NaN NaN
I want to fetch the value from the previous column in the same row, multiply it by 5, and write it to the current cell.
I have tried the pandas shift method but it's not working. After that, I wrote a separate function to get the previous column name, but I don't think that's a good approach:
def get_previous_column_name(wkName):
    v = int(wkName.strip('W'))
    newv = str(v - 1)
    if len(newv) == 1:
        newv = '0' + newv
    return 'W' + newv
dataframe:
W01,W02,W03,W04,W05
7, 8
10,20
20, 40
expected result:
W01,W02,W03,W04,W05
7, 8, 40, 200, 1000
10, 20, 100, 500, 2500
20, 40, 200, 1000, 5000
Here is one way, with ffill + cumsum:
df = df.ffill(axis=1) * 5 ** df.isnull().cumsum(axis=1)
df
Out[230]:
W01 W02 W03 W04 W05
0 7.0 8.0 40.0 200.0 1000.0
1 10.0 20.0 100.0 500.0 2500.0
2 20.0 40.0 200.0 1000.0 5000.0
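To see why the one-liner works, the pieces can be inspected separately (a sketch with the same data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'W01': [7, 10, 20], 'W02': [8, 20, 40],
                   'W03': np.nan, 'W04': np.nan, 'W05': np.nan})

count = df.isnull().cumsum(axis=1)  # how many NaN columns up to and including each cell
filled = df.ffill(axis=1)           # carry the last known value (W02) to the right
result = filled * 5 ** count        # multiply by 5 once per NaN column crossed
```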
import pandas as pd

data = pd.read_csv('C:/d1', sep=',', header=None, names=['W1', 'W2'])
df = pd.DataFrame(data)
dfNew = pd.DataFrame(columns=['W1', 'W2', 'W3', 'W4', 'W5'])
(rows, columns) = df.shape
for index in range(rows):
    tempRow = [df.iat[index, 0], df.iat[index, 1], df.iat[index, 1] * 5,
               df.iat[index, 1] * 25, df.iat[index, 1] * 125]
    dfNew.loc[len(dfNew)] = tempRow

print(dfNew)
If you indeed have only three columns to fill, just do the multiplication:
df['W03'] = df['W02'] * 5
df['W04'] = df['W03'] * 5
df['W05'] = df['W04'] * 5
df
# W01 W02 W03 W04 W05
#0 7 8 40 200 1000
#1 10 20 100 500 2500
#2 20 40 200 1000 5000
As the title suggests, I am trying to create confidence intervals based on a table with a ton of NaN values. Here is an example of what I am working with.
Attendence% 2016-10 2016-11 2017-01 2017-02 2017-03 2017-04 ...
Name
Karl nan 0.2 0.4 0.5 0.2 1.0
Alice 1.0 0.7 0.6 nan nan nan
Ryan nan nan 1.0 0.1 0.9 0.2
Don nan 0.5 nan 0.2 nan nan
Becca nan 0.2 0.6 0 nan nan
For reference, in my actual dataframe there are more NaNs than not; they represent months where people did not need to show up, so replacing the values with 0 would affect the result.
Every time I try applying a confidence interval to each name, it returns the mean as NaN, as well as both interval bounds:
Karl (nan, nan, nan)
Alice (nan, nan, nan)
Ryan (nan, nan, nan)
Don (nan, nan, nan)
Becca (nan, nan, nan)
Is there a way to filter out the NaNs so the formula is applied without taking the NaN values into account? What I have been doing so far is the following (unstacked being the table represented above):

def mean_confidence_interval(unstacked, confidence=0.9):
    a = 1.0 * np.array(unstacked)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h

answer = unstacked.apply(mean_confidence_interval)
answer
Use np.nanmean instead of np.mean: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanmean.html
And replace scipy.stats.sem(a) with scipy.stats.sem(a, nan_policy='omit'):
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html
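Putting both changes together, a NaN-aware version of the function above might look like this (a sketch; note that n must also count only the non-NaN observations, otherwise the t quantile uses the wrong degrees of freedom):

```python
import numpy as np
import scipy.stats

def mean_confidence_interval(row, confidence=0.9):
    a = 1.0 * np.array(row)
    n = np.count_nonzero(~np.isnan(a))       # only real observations
    m = np.nanmean(a)                        # mean, ignoring NaNs
    se = scipy.stats.sem(a, nan_policy='omit')
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h
```

Then apply it per name, e.g. unstacked.apply(mean_confidence_interval, axis=1) if the names are the index.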