Pandas: Apply function that references other rows & other dataframe - python

I am trying to build a function to use in a df.apply() that references 1) other rows, and 2) another DatetimeIndex.
dt_index = DatetimeIndex(['2022-09-16', '2022-12-16', '2023-03-10', '2023-06-16',
                          '2023-09-15', '2023-12-15', '2024-03-15', '2024-06-14'],
                         dtype='datetime64[ns]', freq=None)
Regarding the main df:
df.index = DatetimeIndex(['2022-08-30', '2022-08-31', '2022-09-01', '2022-09-02',
                          '2022-09-03', '2022-09-04', '2022-09-05', '2022-09-06',
                          '2022-09-07', '2022-09-08',
                          ...
                          '2024-08-20', '2024-08-21', '2024-08-22', '2024-08-23',
                          '2024-08-24', '2024-08-25', '2024-08-26', '2024-08-27',
                          '2024-08-28', '2024-08-29'],
                         dtype='datetime64[ns]', name='index', length=731, freq=None)
df = 3M 1Y 2Y
2022-08-30 1.00 1.00 1.00 1.000000
2022-08-31 2.50 2.50 2.50 2.500000
2022-09-01 3.50 3.50 3.50 3.500000
2022-09-02 5.50 5.50 5.50 5.833333
2022-09-03 5.65 5.65 5.65 5.983333
... ... ... ... ...
2024-08-25 630.75 615.75 599.75 607.750000
2024-08-26 631.75 616.75 600.75 608.750000
2024-08-27 632.75 617.75 601.75 609.750000
2024-08-28 633.75 618.75 602.75 610.750000
2024-08-29 634.75 619.75 603.75 611.750000
My goal is to use a function that:
1. For each index value x in df, finds the closest two values in dt_index (I have this below).
2. Then, in df, returns: (x - id_low) / (id_high - id_low)
def transform(x, dt_index):
    id_low = dt_index.iloc[dt_index.get_loc(x, method='ffill')]
    id_high = dt_index.iloc[dt_index.get_loc(x, method='bfill')]
It's part 2 that I don't know how to write, as it references rows in df other than the one the function is being applied to.
Any help appreciated!

After fixing the inaccuracies in your code, you can simply reference your dataframe df inside the function:
def transform(x, dt_index):
    id_low = dt_index[dt_index.get_indexer([x.name], method='ffill')][0]
    id_high = dt_index[dt_index.get_indexer([x.name], method='bfill')][0]
    return (x - df.loc[id_low]) / (df.loc[id_high] - df.loc[id_low])

df.transform(transform, dt_index=dt_index, axis=1)
Example:
df = pd.DataFrame(np.arange(24).reshape(6, 4))
dt_index = pd.Index([0,2,5])
# Result:
0 1 2 3
0 NaN NaN NaN NaN
1 0.500000 0.500000 0.500000 0.500000
2 NaN NaN NaN NaN
3 0.333333 0.333333 0.333333 0.333333
4 0.666667 0.666667 0.666667 0.666667
5 NaN NaN NaN NaN
Note: the NaN values are due to the mathematically undefined result of 0/0, which occurs when id_low == id_high == x.name.
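For reference, here is a minimal, self-contained version of the example above with the imports spelled out (a sketch assuming the same toy DataFrame and Index):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4))
dt_index = pd.Index([0, 2, 5])

def transform(x, dt_index):
    # nearest dt_index entry at or before x.name, and at or after x.name
    id_low = dt_index[dt_index.get_indexer([x.name], method='ffill')][0]
    id_high = dt_index[dt_index.get_indexer([x.name], method='bfill')][0]
    # interpolate this row between the rows at id_low and id_high
    return (x - df.loc[id_low]) / (df.loc[id_high] - df.loc[id_low])

print(df.transform(transform, dt_index=dt_index, axis=1))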

Related

Pandas: Apply a function that returns the division between two rows, for multiple columns

I have a dataframe that is grouped by the indexes 'age group' and 'churn'
age group  churn  salary  debt  investments
21-30      0      5000    3000  2500
21-30      1      4000    2000  5000
31-40      0      6000    2500  1000
31-40      1      2000    2000  1000
I'd like to create a new dataframe that returns the division column[churn==1] / column[churn==0] for each numeric column.
Desired dataframe:
age group  variation_salary  variation_debt  variation_investments
21-30      0.8               0.66            2
31-40      0.33              0.8             1
How can I resolve this problem?
Thanks in advance!
After setting age group and churn as the index, you can select subsets with .xs(), and the age group index will ensure you're dividing matching age groups by each other:
>>> df = df.set_index(['age group', 'churn'])
>>> df.xs(1, 'index', 'churn') / df.xs(0, 'index', 'churn')
salary debt investments
age group
21-30 0.800000 0.666667 2.0
31-40 0.333333 0.800000 1.0
You can then also call .reset_index() on the result.
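For example, to also match the desired column names (a small extension of the snippet above, using add_prefix as in the answers below):
>>> out = df.xs(1, 'index', 'churn') / df.xs(0, 'index', 'churn')
>>> out.add_prefix('variation_').reset_index()
  age group  variation_salary  variation_debt  variation_investments
0     21-30          0.800000        0.666667                    2.0
1     31-40          0.333333        0.800000                    1.0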
You can group by age group, select the relevant columns, then use .last() to get the entry corresponding to churn==1 within each group and .first() to get the entry corresponding to churn==0, and divide them:
g = df.groupby('age group')[['salary', 'debt', 'investments']]
g.last() / g.first()
Result:
salary debt investments
age group
21-30 0.800000 0.666667 2.0
31-40 0.333333 0.800000 1.0
Another approach: you can also filter by churn==1 and churn==0 before grouping, as follows:
g1 = df.loc[df['churn']==1].groupby('age group')[['salary', 'debt', 'investments']].first()
g0 = df.loc[df['churn']==0].groupby('age group')[['salary', 'debt', 'investments']].first()
g1 / g0
Result:
salary debt investments
age group
21-30 0.800000 0.666667 2.0
31-40 0.333333 0.800000 1.0
We can further fine-tune to your desired layout, as follows:
df_out = ((g.last() / g.first())
          .add_prefix('variation_')
          .reset_index()
          )
Or, for the alternative approach, use:
df_out = ((g1 / g0)
          .add_prefix('variation_')
          .reset_index()
          )
Result:
print(df_out)
age group variation_salary variation_debt variation_investments
0 21-30 0.800000 0.666667 2.0
1 31-40 0.333333 0.800000 1.0
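Putting it together on the sample data (a minimal runnable sketch; the values are taken from the table in the question):
import pandas as pd

df = pd.DataFrame({
    'age group':   ['21-30', '21-30', '31-40', '31-40'],
    'churn':       [0, 1, 0, 1],
    'salary':      [5000, 4000, 6000, 2000],
    'debt':        [3000, 2000, 2500, 2000],
    'investments': [2500, 5000, 1000, 1000],
})

g = df.groupby('age group')[['salary', 'debt', 'investments']]
df_out = (g.last() / g.first()).add_prefix('variation_').reset_index()
print(df_out)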

Remove group of empty or nan in pandas groupby

In a dataframe with some empty (NaN) values in some rows - example below:
s = pd.DataFrame([[39877380, 158232151, 20], [39877380, 332086469,],
                  [39877380, 39877381, 14], [39877380, 39877383, 8],
                  [73516838, 6439138, 1], [73516838, 6500551,],
                  [735571896, 203559638,], [735571896, 282186552,],
                  [736453090, 6126187,], [673117474, 12196071,],
                  [673117474, 12209800,], [673117474, 618058747, 6]],
                 columns=['start', 'end', 'total'])
When I group by the start and end columns
s.groupby(['start', 'end']).total.sum()
the output I get is
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
735571896 203559638 nan
282186552 nan
736453090 6126187 nan
I want to exclude all start groups where the total is NaN for every end value. Expected output:
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
I tried dropna(), but it removes all the NaN values, not the NaN groups.
I am a newbie in Python and pandas. Can someone help me with this? Thank you.
In newer pandas versions it is necessary to use min_count=1 so that sum keeps missing values:
s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
# solution for older pandas versions
# s1 = s.groupby(['start', 'end']).total.sum()
Then it is possible to keep groups that have at least one non-missing value in the first index level, using Series.notna with GroupBy.transform('any'); filtering is done by boolean indexing:
s2 = s1[s1.notna().groupby(level=0).transform('any')]
# solution for older pandas versions
# s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
Or it is possible to get the first-level index values with MultiIndex.get_level_values and filter with .loc:
idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
# solution for older pandas versions
# s2 = s1.loc[idx[s1.notnull()].unique()]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
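A third option, not from the original answer but a sketch of the same idea: drop the all-NaN start groups before aggregating, using GroupBy.filter:
# keep only rows whose start group has at least one non-missing total,
# then aggregate as before
s2 = (s.groupby('start')
       .filter(lambda g: g['total'].notna().any())
       .groupby(['start', 'end']).total.sum(min_count=1))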

Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression with a 250-day window for each column over the whole sample period and save the coefficients in another dataframe.
Iterating over the columns and rows using two for-loops isn't very efficient, so I tried this, but I get the following error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y, X).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
Here is my version, using df.rolling() instead and iterating over the columns. (As an aside, the error in your attempt arises because df_returns.apply(regress(t-regression_window, t), axis=1) calls regress immediately and passes its float/NaN result to apply, instead of passing a function.)
I am not completely sure it is what you were looking for, so don't hesitate to comment.
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32],
                           '1': [60, 60, 60, 60, 60],
                           '2': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})

def regress(X, Z):
    # skip windows that contain missing values
    if np.isnan(X).any() == False:
        model = sm.OLS(X, Z).fit()
        return model.params[1]  # slope coefficient
    else:
        return np.NaN

regression_window = 3

# fixed regressors: an intercept plus a 0..window-1 time trend
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")

df_coef = pd.DataFrame()
for col in df_returns.columns:
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda w: regress(w, Z))
df_coef
df_coef

How to add conditions when calculating using Python?

I have a dataframe with two numeric columns. I want to add a third column that calculates the difference. But the condition is: if the value in the first column is blank or NaN, the difference should be the value in the second column.
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
                            df['August Sales'] - df['July Sales'])
However, note that this is precisely the same as treating NaN values in df['July Sales'] as zero, so you can instead use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
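As a quick sanity check (a minimal sketch reusing three rows of the sample data shown further below), both expressions give the same result:
import numpy as np
import pandas as pd

df = pd.DataFrame({'July Sales': [459.0, np.nan, 284.0],
                   'August Sales': [477, 563, 496]})

a = np.where(df['July Sales'].isnull(), df['August Sales'],
             df['August Sales'] - df['July Sales'])
b = df['August Sales'] - df['July Sales'].fillna(0)

print(np.array_equal(a, b.to_numpy()))  # True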
This isn't really a situation that needs conditions; it is just a math operation. You can use the .sub() method with fill_value=0:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
which returns:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
I used a sample dataframe, but it shouldn't be hard to adapt:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})

def diff(row):
    return row['B'] if pd.isnull(row['A']) else (row['B'] - row['A'])

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
try this:
def diff(row):
    # pd.isnull catches NaN; a plain truth test like `not row['col1']` would not
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col2'] - row['col1']

df['col3'] = df.apply(diff, axis=1)

Applying a map across several columns in Pandas

I am trying to apply a map across several columns in pandas to reflect when data is invalid. When data is invalid in my df['Count'] column, I want to set my df['Value'], df['Lower Confidence Interval'], df['Upper Confidence Interval'] and df['Denominator'] columns to -1.
This is a sample of the dataframe:
Count Value Lower Confidence Interval Upper Confidence Interval Denominator
121743 54.15758428 53.95153779 54.36348867 224794
280 91.80327869 88.18009411 94.38654088 305
430 56.95364238 53.39535553 60.44152684 755
970 70.54545455 68.0815009 72.89492873 1375
nan
70 28.57142857 23.27957213 34.52488678 245
125 62.5 55.6143037 68.91456314 200
Currently, I am trying:
set_minus_1s = {np.nan: -1, '*': -1, -1: -1}
then:
df[['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']] = df['Count'].map(set_minus_1s)
and getting the error:
ValueError: Must have equal len keys and value when setting with an iterable
Is there any way of chaining the column references to make one call to map, rather than a separate line for each column calling the set_minus_1s dictionary as a map?
I think you can use where or mask to replace values in all rows where the mapped Series is not null, after applying map:
val = df['Count'].map(set_minus_1s)
print (val)
0 NaN
1 NaN
2 NaN
3 NaN
4 -1.0
5 NaN
6 NaN
Name: Count, dtype: float64
cols =['Value','Count','Lower Confidence Interval','Upper Confidence Interval','Denominator']
df[cols] = df[cols].where(val.isnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
cols = ['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']
df[cols] = df[cols].mask(val.notnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
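An equivalent, slightly more direct alternative (not from the original answer, just a sketch of the same idea) is to build a boolean mask from the mapped Series and assign with .loc:
cols = ['Value', 'Count', 'Lower Confidence Interval',
        'Upper Confidence Interval', 'Denominator']
# rows whose Count mapped to -1 (i.e. the invalid rows) get -1 in every column
df.loc[val.notnull(), cols] = -1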
