Filling NaN values from a list with different shape - python

I have a df
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['A', 'B', 'C', 'D', 'E'],
                  columns=['date_1', 'date_2', 'value_2', 'value_3', 'value_4'],
                  data=[['2021-06-28', '2022-05-03', 30, 40, 60],
                        ['2022-01-10', '2022-05-15', 50, 90, 70],
                        [np.nan, '2022-05-15', 40, 60, 80],
                        [np.nan, '2022-04-28', 40, 60, 90],
                        [np.nan, '2022-06-28', 50, 60, 54]])
       date_1      date_2  value_2  value_3  value_4
A  2021-06-28  2022-05-03       30       40       60
B  2022-01-10  2022-05-15       50       90       70
C         NaN  2022-05-15       40       60       80
D         NaN  2022-04-28       40       60       90
E         NaN  2022-06-28       50       60       54
I am trying to fill the NaN values in column date_1. The values I need to fill date_1 with change every week: the minimum value must be 2021-06-28 and the maximum is 2022-06-20, and each week the maximum value in date_1 becomes the most recent Monday. Every date from 2021-06-28 through 2022-06-20 has to appear in date_1 at least once; the order of the values does not matter.
I tried:
from datetime import date, timedelta
today = date.today()
last_monday = pd.to_datetime((today - timedelta(days=today.weekday()) - timedelta(days=7)).strftime('%Y-%m-%d'))
# date_mappings is a dictionary with this kind of structure:
# {1 : '2021-06-28', 2 : '2021-07-05', ... 52 : '2022-06-20'}
dates_needed = [x for x in pd.to_datetime(list(date_mappings.values())) if x >= last_monday]
So dates_needed now holds the remaining dates that need to appear at least once in the date_1 column.
The problem I am facing is that the shapes do not match when I try to fill the values, because there can be multiple rows with the same date_2.
If I try to use:
df.loc[df['date_1'].isna(), 'date_1'] = dates_needed
I get:
ValueError: Must have equal len keys and value when setting with an iterable
Because this only works if I match the shape:
df.loc[df['date_1'].isna(), 'date_1'] = [pd.to_datetime('2022-01-10 00:00:00'),
                                         pd.to_datetime('2022-01-17 00:00:00'),
                                         pd.to_datetime('2022-01-24 00:00:00')]
date_1 date_2 value_2 value_3 value_4
A 2021-06-28 2022-05-03 30 40 60
B 2022-01-10 2022-05-15 50 90 70
C 2022-01-10 2022-05-15 40 60 80
D 2022-01-17 2022-04-28 40 60 90
E 2022-01-24 2022-06-28 50 60 54
So my goal is to fill the NaN values in date_1 from a created list dates_needed, where each date from dates_needed is used at least once in the date_1 column and the order does not matter.

Here is a solution that maps integers to dates from date_mappings: build a helper Index from the number of missing values (m.sum()) and map it through the dictionary. It also works when the length of the dictionary differs from the number of missing values:
m = df['date_1'].isna()
df.loc[m, 'date_1'] = (pd.Index(range(m.sum())) + 1).map(date_mappings)
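A minimal runnable sketch of the same idea, using a small, hypothetical version of date_mappings (the real dictionary has 52 weekly entries):
import numpy as np
import pandas as pd

# hypothetical, shortened weekly mapping for illustration only
date_mappings = {1: '2021-06-28', 2: '2021-07-05', 3: '2021-07-12'}

df = pd.DataFrame(index=['A', 'B', 'C', 'D', 'E'],
                  columns=['date_1', 'date_2'],
                  data=[['2021-06-28', '2022-05-03'],
                        ['2022-01-10', '2022-05-15'],
                        [np.nan, '2022-05-15'],
                        [np.nan, '2022-04-28'],
                        [np.nan, '2022-06-28']])

m = df['date_1'].isna()                                  # three missing rows
# map 1..m.sum() through the dict; rows beyond the dict length would stay NaN
df.loc[m, 'date_1'] = (pd.Index(range(m.sum())) + 1).map(date_mappings)
print(df)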

Related

Creating a column with moving sum

I have time series data and non-continuous log data with timestamps. I want to merge the logs into the time series data and create a new column from the log's column values.
Let the time series data be:
import pandas as pd
import numpy as np

# 5 days of 5-minute samples
df = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12 * 24 * 5))
df['col'] = np.random.randint(1, 101, size=df.shape[0])   # random integers in [1, 100]
df['uid'] = 1

df2 = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12 * 24 * 5))
df2['col'] = np.random.randint(1, 51, size=df2.shape[0])  # random integers in [1, 50]
df2['uid'] = 2

df3 = pd.concat([df, df2]).reset_index()
df3 = df3.rename(columns={'index': 'timestamp'})
timestamp col uid
0 2020-10-10 00:00:00 96 1
1 2020-10-10 00:05:00 47 1
2 2020-10-10 00:10:00 78 1
3 2020-10-10 00:15:00 27 1
...
Let the log data be:
import datetime as dt
df_log = pd.DataFrame(np.array([[100, 1, 3], [40, 2, 6], [50, 1, 5],
                                [60, 2, 9], [20, 1, 2], [30, 2, 5]]),
                      columns=['duration', 'uid', 'factor'])
df_log['timestamp'] = pd.Series([dt.datetime(2020, 10, 10, 15, 21), dt.datetime(2020, 10, 10, 16, 27),
                                 dt.datetime(2020, 10, 11, 21, 25), dt.datetime(2020, 10, 11, 10, 12),
                                 dt.datetime(2020, 10, 13, 20, 56), dt.datetime(2020, 10, 13, 13, 15)])
duration uid factor timestamp
0 100 1 3 2020-10-10 15:21:00
1 40 2 6 2020-10-10 16:27:00
...
I want to merge these two into df_merged and create a new column in the time series data like this (per uid):
df_merged['new'] = df_merged['duration'] * df_merged['factor']
and forward-fill df_merged['new'] with this value until the next log for each uid, then do the same with the next log and sum the contributions within a moving 2-day window.
Can anybody point me in a direction for this problem?
Expected Output:
timestamp col uid duration factor new
0 2020-10-10 15:20:00 96 1 100 3 300
1 2020-10-10 15:25:00 47 1 100 3 300
2 2020-10-10 15:30:00 78 1 100 3 300
...
2020-10-11 21:25:00 .. 1 60 9 540+300
2020-10-11 21:30:00 .. 1 60 9 540+300
...
2020-10-13 20:55:00 .. 1 20 2 40+540
2020-10-13 21:00:00 .. 1 20 2 40+540
..
2020-10-13 21:25:00 .. 1 20 2 40
As I understand it, it's simpler to calculate the new column on df_log before merging. You'd then just use rolling to compute the 2-day window for each uid group:
df_log["new"] = df_log["duration"] * df_log["factor"]
# 2 day rolling window summing `new`
df_log = df_log.groupby("uid").rolling("2d", on="timestamp")["new"].sum().to_frame()
Then merging is straightforward:
# prepare for merge
df_log = df_log.sort_values(by="timestamp")
df3 = df3.sort_values(by="timestamp")
df_merged = (
    pd.merge_asof(df3, df_log, on="timestamp", by=["uid"])
    .dropna()
    .reset_index(drop=True)
)
This solution does deviate slightly from your expected output. The first included row from the continuous series (df3) would be at timestamp 2020-10-10 15:25:00 instead of 2020-10-10 15:20:00 since the merge method would look for the last timestamp in df_log before the timestamp in df3.
Alternatively, if you require the first row in the output to have timestamp 2020-10-10 15:20:00, you can use direction="forward" in pd.merge_asof. That would make each row match the first row in df_log with a timestamp after the one in df3, so you'd need to remove the extra rows in the beginning for each uid.
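If you go that route, a hedged sketch of the forward-direction variant (assuming df3 and df_log are prepared and sorted exactly as above) would be:
# match each df3 row to the next df_log entry at or after its timestamp
df_merged_fwd = (
    pd.merge_asof(df3, df_log, on="timestamp", by=["uid"], direction="forward")
    .dropna()
    .reset_index(drop=True)
)
# rows before a uid's first log now match that first log and may need to be trimmed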

Faster way to iterate in numpy / pandas?

I have a big portfolio of bonds and I want to create a table with days as index, the bonds as columns and the notional of the bonds as values.
I need to set the values to 0 for the rows before each bond's start date and after its maturity date.
Is there a more efficient way than this:
[[np.where( (day>=bonds.inception[i]) &
(day + relativedelta(months=+m) >= bonds.maturity[i] ) &
(day <= bonds.maturity[i]),
bonds.principal[i],
0)
for i in range(bonds.shape[0])] for day in idx_d]
input example:

id  nom  inception   maturity
38  200  22/04/2022  22/04/2032
87  100  22/04/2022  22/04/2052
output example:

day          38   87
21/04/2022    0    0
22/04/2022  100  200
The solution below still requires a loop. I don't know if it's faster, or whether you find it clear, but I'll offer it as an alternative.
Create an example dataframe (with a few extra bonds for demonstration purposes):
import pandas as pd
df = pd.DataFrame({'id': [38, 87, 49, 51, 89],
                   'nom': [200, 100, 150, 50, 250],
                   'start_date': ['22/04/2022', '22/04/2022', '01/01/2022', '01/05/2022', '23/04/2012'],
                   'end_date': ['22/04/2032', '22/04/2052', '01/01/2042', '01/05/2042', '23/04/2022']})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df = df.set_index('id')
print(df)
This then looks like:
     nom start_date   end_date
id
38   200 2022-04-22 2032-04-22
87   100 2022-04-22 2052-04-22
49   150 2022-01-01 2042-01-01
51    50 2022-01-05 2042-01-05
89   250 2012-04-23 2022-04-23
Now, create a new blank dataframe, with 0 as the default value:
new = pd.DataFrame(data=0, columns=df.index, index=pd.date_range('2022-04-20', '2062-04-22'))
new.index.rename('day', inplace=True)
Then, iterate over the columns (or index of the original dataframe), selecting the relevant interval and set the column value to the relevant 'nom' for that selected interval:
for column in new.columns:
    sel = (new.index >= df.loc[column, 'start_date']) & (new.index <= df.loc[column, 'end_date'])
    new.loc[sel, column] = df.loc[df.index == column, 'nom'].values
print(new)
which results in:
id           38   87   49  51   89
day
2022-04-20    0    0  150  50  250
2022-04-21    0    0  150  50  250
2022-04-22  200  100  150  50  250
2022-04-23  200  100  150  50  250
2022-04-24  200  100  150  50    0
...         ...  ...  ...  ..  ...
2062-04-21    0    0    0   0    0
2062-04-22    0    0    0   0    0

[14613 rows x 5 columns]
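If the per-column loop becomes a bottleneck, a hedged, loop-free sketch using NumPy broadcasting (assuming the same df and date range as above) could build the whole table at once:
import numpy as np

days = pd.date_range('2022-04-20', '2062-04-22')
# (n_days, 1) compared against (n_bonds,) broadcasts to an (n_days, n_bonds) mask
active = (days.values[:, None] >= df['start_date'].values) & \
         (days.values[:, None] <= df['end_date'].values)
new = pd.DataFrame(active * df['nom'].values,
                   index=pd.Index(days, name='day'),
                   columns=df.index)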

Pandas: Filter or Groupby then transform to select the last row

This post references one of my earlier posts on SO.
Just to reiterate, I have a dataframe df as
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-03-01 A 25 88 <-----Last row for Group A
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238 <-----Last row of Group B
Considering the last row of each Group, if the Duration value is less than 90, we omit that group. So my resultant data frame df_final should look like
Date Group Value Duration
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
There are two ways we can approach this problem.
First is filter method:
df.groupby('Group').filter(lambda x: x.Duration.max()>=90)
Second is groupby.transform method:
df = df[df.groupby('Group')['Duration'].transform('last') >= 90]
But I want to filter this by the Date column and NOT by Duration. I get the correct result with the following code:
df_interim = df.loc[(df['Date']=='2018-03-01') & (df['Duration'] >= 90)]
df_final = df.merge(df_interim[['Group','Date']], on='Group', how='right').reset_index()
In the above code, I have hard coded the Date.
My question is : How can I dynamically select the last date in the data frame? And then perform the filter or groupby.transform on Group?
Any clue?
We can select the last date by using transform as well:
lastd = df.groupby('Group')['Date'].transform('max')
df_interim = df.loc[(df['Date'] == lastd) & (df['Duration'] >= 90)]
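Then, to keep every row of the surviving groups (mirroring your merge step), a hedged follow-up would be:
df_final = df[df['Group'].isin(df_interim['Group'])]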
I think you first need to get the row with the maximum Date per Group with DataFrameGroupBy.idxmax, then select those rows (all columns) with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.loc[df.groupby('Group')['Date'].idxmax()]
print (df1)
Date Group Value Duration
2 2018-03-01 A 25 88
5 2018-03-01 B 25 238
Then, among those last-date rows, keep only the Groups whose Duration is at least 90:
g = df1.loc[df1['Duration'] >= 90, 'Group']
print (g)
5    B
Name: Group, dtype: object
And last, filter the original Group column by Series.isin with boolean indexing:
df = df[df['Group'].isin(g)]
print (df)
Date Group Value Duration
3 2018-01-01 B 15 180
4 2018-02-01 B 30 210
5 2018-03-01 B 25 238

pandas find the last row with the same value as the previous row in a df

I have a df,
acct_no code date id
100 10 01/04/2019 22
100 10 01/03/2019 22
100 10 01/05/2019 22
200 20 01/06/2019 33
200 20 01/05/2019 33
200 20 01/07/2019 33
I want to first sort the df by date in ascending order within each acct_no and code,
df.sort_values(['acct_no', 'code', 'date'], inplace=True)
and then find the last row of each acct_no/code group. The result needs to look like:
acct_no code date id
100 10 01/05/2019 22
200 20 01/07/2019 33
You can also try with groupby.last():
df.groupby(['acct_no', 'code'],as_index=False).last()
acct_no code date id
0 100 10 01/05/2019 22
1 200 20 01/07/2019 33
Use DataFrame.drop_duplicates, but first convert the column to datetimes:
# if the day comes first in the strings, use dayfirst=True
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# if the month comes first:
# df['date'] = pd.to_datetime(df['date'])
df1 = (df.sort_values(['acct_no', 'code', 'date'])
         .drop_duplicates(['acct_no', 'code'], keep='last'))
print (df1)
acct_no code date id
2 100 10 2019-05-01 22
5 200 20 2019-07-01 33

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
You can try using df.groupby('date').sum():
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
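The result above follows the original row order because the frame isn't sorted by date; a hedged variant that sums per day and accumulates in calendar order (reusing the df just shown) would be:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
daily = df.groupby('date', as_index=False)['money'].sum().sort_values('date')
daily['cumsum'] = daily['money'].cumsum()
print(list(daily[['date', 'cumsum']].itertuples(index=False, name=None)))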
