Populate the current row based on the previous row - Python

A question was posted on the link below where one wanted to use a previous row to populate the current row:
Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
In this case, there was only one index, the date.
Now I want to add a second index, employee ID. On the first occurrence of the first index, Index_EmpID, I would like B to be populated with the value from A. On any subsequent occurrence, I would like B to be the previous row's B multiplied by the current row's A.
I have the following data frame:
| Index_EmpID | Index_Date | A    | B   |
| ----------- | ---------- | ---- | --- |
| A123        | 2022-01-31 | 1    | NaN |
| A123        | 2022-02-28 | 1    | NaN |
| A123        | 2022-03-31 | 1.05 | NaN |
| A123        | 2022-04-30 | 1    | NaN |
| A567        | 2022-01-31 | 1    | NaN |
| A567        | 2022-02-28 | 1.05 | NaN |
| A567        | 2022-03-31 | 1    | NaN |
| A567        | 2022-04-30 | 1.05 | NaN |
I require:
| Index_EmpID | Index_Date | A    | B      |
| ----------- | ---------- | ---- | ------ |
| A123        | 2022-01-31 | 1    | 1      |
| A123        | 2022-02-28 | 1    | 1      |
| A123        | 2022-03-31 | 1.05 | 1.05   |
| A123        | 2022-04-30 | 1    | 1.05   |
| A567        | 2022-01-31 | 1    | 1      |
| A567        | 2022-02-28 | 1.05 | 1.05   |
| A567        | 2022-03-31 | 1    | 1.05   |
| A567        | 2022-04-30 | 1.05 | 1.1025 |

Something like
df["B"] = df.groupby("Index_EmpID")["A"].cumprod()
should work: within each employee, B starts at the first A and is then multiplied by each subsequent A, which is exactly a per-group cumulative product (so there is no need for agg, np.cumprod, or reset_index).

A solution that uses iterrows is not as elegant as the one that uses groupby, but it follows directly from the description and uses only the most elementary pandas facilities.
import numpy as np
import pandas as pd

empdf = pd.DataFrame({'Index_EmpID': ['A123']*4 + ['A567']*4,
                      'Index_Date': ['2022-01-31', '2022-02-28',
                                     '2022-03-31', '2022-04-30'] * 2,
                      'A': [1, 1, 1.05, 1, 1, 1.05, 1, 1.05],
                      'B': [np.nan]*8})

past_id, past_b, bs = None, 1, []
for label, row in empdf.iterrows():
    if row['Index_EmpID'] == past_id:
        # same employee as the previous row: carry the running product forward
        bs.append(past_b * row['A'])
    else:
        # first row for a new employee: restart from A
        bs.append(row['A'])
    past_b = bs[-1]
    past_id = row['Index_EmpID']
empdf['B'] = bs
This produces exactly the dataframe you requested:
Index_EmpID Index_Date A B
0 A123 2022-01-31 1.00 1.0000
1 A123 2022-02-28 1.00 1.0000
2 A123 2022-03-31 1.05 1.0500
3 A123 2022-04-30 1.00 1.0500
4 A567 2022-01-31 1.00 1.0000
5 A567 2022-02-28 1.05 1.0500
6 A567 2022-03-31 1.00 1.0500
7 A567 2022-04-30 1.05 1.1025
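As a quick cross-check (a sketch reusing empdf from above), the vectorised per-group cumulative product reproduces the loop's column:
import numpy as np

# the groupby one-liner and the iterrows loop perform the same multiplications
assert np.allclose(empdf.groupby('Index_EmpID')['A'].cumprod(), empdf['B'])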

Related

Split a column into multiple columns with condition

I have a question about splitting a column into multiple columns in pandas with conditions.
For example, I tend to do the following, but it takes a very long time with a for loop:
| Index | Value |
| ----- | ----- |
| 0 | 1 |
| 1 | 1,3 |
| 2 | 4,6,8 |
| 3 | 1,3 |
| 4 | 2,7,9 |
into
| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ----- | - | - | - | - | - | - | - | - | - |
| 0 | 1 | | | | | | | | |
| 1 | 1 | | 3 | | | | | | |
| 2 | | | | 4 | | 6 | | 8 | |
| 3 | 1 | | 3 | | | | | | |
| 4 | | 2 | | | | | 7 | | 9 |
I wonder if there are any built-in facilities that can help with this, rather than writing a for loop to map all the indexes.
Assuming the "Value" column contains strings, you can use str.split and pivot like so:
# split the comma-separated strings into one row per value, keeping the row index
value = df["Value"].str.split(",").explode().astype(int).reset_index()
# pivot: original row index down the side, the value itself as the column label
output = value.pivot(index="index", columns="Value", values="Value")
# make every integer between the min and max value appear as a column
output = output.reindex(range(value["Value"].min(), value["Value"].max()+1), axis=1)
>>> output
Value 1 2 3 4 5 6 7 8 9
index
0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN 4.0 NaN 6.0 NaN 8.0 NaN
3 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
4 NaN 2.0 NaN NaN NaN NaN 7.0 NaN 9.0
Input df:
df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})
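An alternative sketch (not from the original answer) uses str.get_dummies, which builds one indicator column per distinct value in a single step; multiplying each indicator column by its own label then recovers the values:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})

# one 0/1 column per distinct comma-separated value
dummies = df["Value"].str.get_dummies(",")
dummies.columns = dummies.columns.astype(int)
# turn each indicator into the value itself, blank out the zeros
# (assumes no literal 0 values in the data), and add any missing columns
output = (dummies.mul(dummies.columns, axis=1)
                 .replace(0, np.nan)
                 .reindex(range(dummies.columns.min(), dummies.columns.max() + 1), axis=1))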

Pandas DataFrame - Access Values That are created on the fly

I am trying to figure out something that I can easily perform in Excel, but I am having a hard time understanding how to do it on a pandas DataFrame without using loops.
Suppose that I have a data frame as follows:
+------------+-------+-------+-----+------+
| Date | Price | Proxy | Div | Days |
+------------+-------+-------+-----+------+
| 13/01/2021 | 10 | 20 | 0.5 | NaN |
| 08/01/2021 | NaN | 30 | 0.6 | 5 |
| 04/01/2021 | NaN | 40 | 0.7 | 4 |
| 03/01/2021 | NaN | 50 | 0.8 | 1 |
| 01/01/2021 | NaN | 60 | 0.9 | 2 |
+------------+-------+-------+-----+------+
The task is to fill in Price wherever it is null. In Excel, suppose that Date is column A and the first data row is row 2; then to fill the NaN in row 3 of Price I would use the formula =B2/(((C3/C2)*D3)*E3), which gives 2.22.
Now I want to use the value 2.22 on the fly to fill the NaN in row 4 of Price, because filling row 4 requires the just-filled value of row 3. Hence the Excel formula to fill row 4's price would be =B3/(((C4/C3)*D4)*E4).
One way would be to loop over all the rows of the DataFrame, which I don't want to do. What would be a vectorised approach to this problem?
Expected Output
+------------+-------+-------+-----+------+
| Date | Price | Proxy | Div | Days |
+------------+-------+-------+-----+------+
| 13/01/2021 | 10 | 20 | 0.5 | NA |
| 08/01/2021 | 2.22 | 30 | 0.6 | 5 |
| 04/01/2021 | 0.60 | 40 | 0.7 | 4 |
| 03/01/2021 | 0.60 | 50 | 0.8 | 1 |
| 01/01/2021 | 0.28 | 60 | 0.9 | 2 |
+------------+-------+-------+-----+------+
Current_Price = Prev_Price (non-NaN) / ((Current_Proxy / Prev_Proxy) * Div * Days)
Edit
Create the initial data frame using the code below:
import numpy as np
import pandas as pd

data = {'Date': ['2021-01-13', '2021-01-08', '2021-01-04', '2021-01-03', '2021-01-01'],
        'Price': [10, np.nan, np.nan, np.nan, np.nan],
        'Proxy': [20, 30, 40, 50, 60],
        'Div': [0.5, 0.6, 0.7, 0.8, 0.9],
        'Days': [np.nan, 5, 4, 1, 2]}
df = pd.DataFrame(data)
What you want to achieve is actually a cumulative product: each filled price equals the previous price times the per-row factor Prev_Proxy/(Proxy*Div*Days). combine_first keeps the known price in the first row, substitutes that factor everywhere else, and cumprod then rolls the product forward:
df['Price'] = (df['Price'].combine_first(df['Proxy'].shift()/df.eval('Proxy*Div*Days'))
                          .cumprod().round(2))
Output:
Date Price Proxy Div Days
0 2021-01-13 10.00 20 0.5 NaN
1 2021-01-08 2.22 30 0.6 5.0
2 2021-01-04 0.60 40 0.7 4.0
3 2021-01-03 0.60 50 0.8 1.0
4 2021-01-01 0.28 60 0.9 2.0
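To sanity-check the vectorised answer, here is an equivalent explicit loop (a sketch built from the formula above, not part of the original answer):
# roll the formula forward row by row, starting from the one known price
prices = [df.loc[0, 'Price']]
for i in range(1, len(df)):
    factor = df.loc[i - 1, 'Proxy'] / (df.loc[i, 'Proxy'] * df.loc[i, 'Div'] * df.loc[i, 'Days'])
    prices.append(prices[-1] * factor)
print([round(p, 2) for p in prices])  # [10.0, 2.22, 0.6, 0.6, 0.28]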

Is there a pandas function to concatenate, for example, three previous rows together (like a window of length three)?

For example, I have the DataFrame below:
df13 = pd.DataFrame(np.random.randint(1, 9, size=(5, 3)),
                    columns=['a', 'b', 'c'])
df13
a b c
0 8 5 2
1 5 7 7
2 3 7 5
3 7 7 7
4 2 2 6
and I want:
a b c a b c a b c
0 None None None None None None 8.00 5.00 2.00
1 None None None 8 5 2 5.00 7.00 7.00
2 8 5 2 5 7 7 3.00 7.00 5.00
3 5 7 7 3 7 5 7.00 7.00 7.00
4 3 7 5 7 7 7 2.00 2.00 6.00
5 7 7 7 2 2 6 nan nan nan
6 2 2 6 NaN NaN NaN nan nan nan
For example, row 2 has two previous rows.
I do that with this code:
def laa(df, previous_count):
    # one all-None row with the same columns as df
    dfNone = pd.DataFrame({col: None for col in df.columns}, index=[0])
    df_tmp = df.copy()
    for x in range(1, previous_count + 1):
        # prepend a None row, realign the index, then glue the lagged copy on
        df_tmp = pd.concat([dfNone, df_tmp])
        df_tmp = df_tmp.reset_index(drop=True)
        df = pd.concat([df_tmp, df], axis=1)
    return df
(The None rows must be removed.)
Doesn't pandas have a function to do that?
This will do the trick using the shift() and concat() functions in pandas:
df = pd.DataFrame(np.random.randint(1, 9, size=(5, 3)), columns=['a', 'b', 'c'])
# each row next to its two previous rows
df1 = pd.concat([df.shift(2), df.shift(1), df], axis=1)
# each row next to its two following rows
df2 = pd.concat([df, df.shift(-1), df.shift(-2)], axis=1)
final_df = pd.concat([df1, df2]).drop_duplicates()
Sample output:
If df is as follows:
+----+-----+-----+-----+
| | a | b | c |
|----+-----+-----+-----|
| 0 | 6 | 2 | 6 |
| 1 | 7 | 2 | 1 |
| 2 | 4 | 4 | 5 |
| 3 | 1 | 1 | 1 |
| 4 | 2 | 2 | 4 |
+----+-----+-----+-----+
Then, final_df would be :
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | a | b | c | a | b | c | a | b | c |
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | nan | nan | nan | nan | nan | nan | 6 | 2 | 6 |
| 1 | nan | nan | nan | 6 | 2 | 6 | 7 | 2 | 1 |
| 2 | 6 | 2 | 6 | 7 | 2 | 1 | 4 | 4 | 5 |
| 3 | 7 | 2 | 1 | 4 | 4 | 5 | 1 | 1 | 1 |
| 4 | 4 | 4 | 5 | 1 | 1 | 1 | 2 | 2 | 4 |
| 3 | 1 | 1 | 1 | 2 | 2 | 4 | nan | nan | nan |
| 4 | 2 | 2 | 4 | nan | nan | nan | nan | nan | nan |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
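If the goal is simply each row alongside its n previous rows, a loop over shift() keeps the window length parametric and avoids the trailing rows entirely; with_previous_rows is a hypothetical helper name, sketched here rather than a pandas built-in:
import numpy as np
import pandas as pd

def with_previous_rows(df, n_prev):
    # oldest lag on the left, the current row's columns on the right
    parts = [df.shift(k) for k in range(n_prev, 0, -1)] + [df]
    return pd.concat(parts, axis=1)

df13 = pd.DataFrame(np.random.randint(1, 9, size=(5, 3)), columns=['a', 'b', 'c'])
print(with_previous_rows(df13, 2))
Note also that drop_duplicates in the answer above can silently drop genuinely repeated data rows, so the shift-only version is safer when rows may repeat.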

Rolling quantiles over a column in pandas

I have a table as such
+------+------------+-------+
| Idx  | date       | value |
+------+------------+-------+
| A    | 20/11/2016 | 10    |
| A    | 21/11/2016 | 8     |
| A    | 22/11/2016 | 12    |
| A    | 23/11/2016 | 14    |
| B    | 20/11/2016 | 16    |
| B    | 21/11/2016 | 18    |
| B    | 22/11/2016 | 11    |
+------+------------+-------+
I'd like to create a new column 'rolling_quantile_value', based on the column 'value', that for each row calculates a quantile over that row's past values within each Idx.
For the example above, if the chosen quantile is the median, the output should look like this:
+------+------------+-------+-----------------------+
| Idx | date | value | rolling_median_value |
+------+------------+-------+-----------------------+
| A | 20/11/2016 | 10 | NaN |
| A | 21/11/2016 | 8 | 10 |
| A | 22/11/2016 | 12 | 9 |
| A | 23/11/2016 | 14 | 10 |
| B | 20/11/2016 | 16 | NaN |
| B | 21/11/2016 | 18 | 16 |
| B | 22/11/2016 | 11 | 17 |
+------+------------+-------+-----------------------+
I've done it the naive way, with a function that builds the column row by row from the preceding values and flags the jump from one Idx to the next, but I'm sure that's neither the most efficient nor the most elegant way to do it.
Looking forward to your suggestions!
I think you want expanding:
df['rolling_median_value'] = (df.groupby('Idx', sort=False)
                                .expanding(1)['value']
                                .median()
                                .groupby(level=0)
                                .shift()
                                .reset_index(drop=True))
print(df)
Idx date value rolling_median_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.0
3 A 23/11/2016 14 10.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.0
UPDATE
df['rolling_quantile_value'] = (df.groupby('Idx', sort=False)
                                  .expanding(1)['value']
                                  .quantile(0.75)
                                  .groupby(level=0)
                                  .shift()
                                  .reset_index(drop=True))
print(df)
Idx date value rolling_quantile_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.5
3 A 23/11/2016 14 11.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.5
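Note that the expected output is an expanding statistic (all past rows), even though the title says rolling. If a fixed-size window is wanted instead, the same groupby/shift pattern works with rolling; a sketch assuming a window of 3:
df['rolling_quantile_value'] = (df.groupby('Idx', sort=False)['value']
                                  .rolling(3, min_periods=1)
                                  .quantile(0.75)
                                  .groupby(level=0)
                                  .shift()
                                  .reset_index(drop=True))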

How to find average after sorting month column in python

I have a challenge in front of me in python.
| Growth_rate | Month |
| ------------ |-------|
| 0 | 1 |
| -2 | 1 |
| 1.2 | 1 |
| 0.3 | 2 |
| -0.1 | 2 |
| 7 | 2 |
| 9 | 3 |
| 4.1 | 3 |
Now I want to put the average growth rate for each month in a new column. For the first month the average would be about -0.26, and the result should look like the table below.
| Growth_rate | Month | Mean  |
| ----------- | ----- | ----- |
| 0           | 1     | -0.26 |
| -2          | 1     | -0.26 |
| 1.2         | 1     | -0.26 |
| 0.3         | 2     | 2.4   |
| -0.1        | 2     | 2.4   |
| 7           | 2     | 2.4   |
| 9           | 3     | 6.5   |
| 4.1         | 3     | 6.5   |
This should calculate the mean growth rate and put it into the Mean column.
Any help would be great.
df.groupby('Month').mean().reset_index().rename(columns={'Growth_rate': 'Mean'}).merge(df, on='Month')
Out[59]:
   Month      Mean  Growth_rate
0      1 -0.266667          0.0
1      1 -0.266667         -2.0
2      1 -0.266667          1.2
3      2  2.400000          0.3
4      2  2.400000         -0.1
5      2  2.400000          7.0
6      3  6.550000          9.0
7      3  6.550000          4.1
Assuming that you are using the pandas package, and that your table is in a DataFrame df:
In [91]: means = df.groupby('Month').mean().reset_index()
In [92]: means.columns = ['Month', 'Mean']
Then join via merge
In [93]: pd.merge(df, means, how='outer', on='Month')
Out[93]:
Growth_rate Month Mean
0 0.0 1 -0.266667
1 -2.0 1 -0.266667
2 1.2 1 -0.266667
3 0.3 2 2.400000
4 -0.1 2 2.400000
5 7.0 2 2.400000
6 9.0 3 6.550000
7 4.1 3 6.550000
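A more direct route to a per-group mean column (a sketch, not from the original answers) is groupby with transform, which broadcasts each group's mean back onto every row, so no merge or index realignment is needed:
# one mean per Month, repeated for every row of that month
df['Mean'] = df.groupby('Month')['Growth_rate'].transform('mean')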
