Faster way to iterate in numpy / pandas? - python

I have a big portfolio of bonds and I want to create a table with days as index, the bonds as columns and the notional of the bonds as values.
I need to set the values to 0 for the rows before the inception date and after the maturity date of each bond.
Is there a more efficient way than this:
[[np.where((day >= bonds.inception[i]) &
           (day + relativedelta(months=+m) >= bonds.maturity[i]) &
           (day <= bonds.maturity[i]),
           bonds.principal[i],
           0)
  for i in range(bonds.shape[0])] for day in idx_d]
input example:

id  nom  inception   maturity
38  200  22/04/2022  22/04/2032
87  100  22/04/2022  22/04/2052
output example:

day         38   87
21/04/2022  0    0
22/04/2022  100  200

The solution below still requires a loop. I don't know if it's faster, or whether you find it clear, but I'll offer it as an alternative.
Create an example dataframe (with a few extra bonds for demonstration purposes):
import pandas as pd

df = pd.DataFrame({'id': [38, 87, 49, 51, 89],
                   'nom': [200, 100, 150, 50, 250],
                   'start_date': ['22/04/2022', '22/04/2022', '01/01/2022', '01/05/2022', '23/04/2012'],
                   'end_date': ['22/04/2032', '22/04/2052', '01/01/2042', '01/05/2042', '23/04/2022']})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df = df.set_index('id')
print(df)
This then looks like:
    nom          start_date            end_date
id
38  200 2022-04-22 00:00:00 2032-04-22 00:00:00
87  100 2022-04-22 00:00:00 2052-04-22 00:00:00
49  150 2022-01-01 00:00:00 2042-01-01 00:00:00
51   50 2022-01-05 00:00:00 2042-01-05 00:00:00
89  250 2012-04-23 00:00:00 2022-04-23 00:00:00
Now, create a new blank dataframe, with 0 as the default value:
new = pd.DataFrame(data=0, columns=df.index, index=pd.date_range('2022-04-20', '2062-04-22'))
new.index.rename('day', inplace=True)
Then, iterate over the columns (or the index of the original dataframe), selecting the relevant interval and setting the column value to the relevant 'nom' for that interval:
for column in new.columns:
    sel = (new.index >= df.loc[column, 'start_date']) & (new.index <= df.loc[column, 'end_date'])
    new.loc[sel, column] = df.loc[df.index == column, 'nom'].values

print(new)
which results in:
                      38   87   49  51   89
day
2022-04-20 00:00:00    0    0  150  50  250
2022-04-21 00:00:00    0    0  150  50  250
2022-04-22 00:00:00  200  100  150  50  250
2022-04-23 00:00:00  200  100  150  50  250
2022-04-24 00:00:00  200  100  150  50    0
...                  ...  ...  ...  ..  ...
2062-04-21 00:00:00    0    0    0    0    0
2062-04-22 00:00:00    0    0    0    0    0

[14613 rows x 5 columns]
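For completeness, the loop can be avoided entirely with numpy broadcasting. This is only a minimal sketch, assuming the same df and date range as above and ignoring the relativedelta term from the question:
import numpy as np
import pandas as pd

days = pd.date_range('2022-04-20', '2062-04-22')

# Broadcast the (n_days, 1) column of days against the (n_bonds,) start/end rows:
# active[d, b] is True when day d lies inside bond b's lifetime (inclusive).
active = (days.values[:, None] >= df['start_date'].values) & \
         (days.values[:, None] <= df['end_date'].values)

# 0 where inactive, 'nom' where active
new = pd.DataFrame(active * df['nom'].values, index=days, columns=df.index)
new.index.rename('day', inplace=True)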

Related

Creating a column with moving sum

I have time series data and non-continuous log data with timestamps. I want to merge the latter with the time series data and create new columns from the log values.
Let the time series data be:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12 * 24 * 5))
df['col'] = np.random.randint(1, 101, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12 * 24 * 5))
df2['col'] = np.random.randint(1, 51, size=df2.shape[0])
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3 = df3.rename(columns={'index': 'timestamp'})
timestamp col uid
0 2020-10-10 00:00:00 96 1
1 2020-10-10 00:05:00 47 1
2 2020-10-10 00:10:00 78 1
3 2020-10-10 00:15:00 27 1
...
Let the log data be:
import datetime as dt

df_log = pd.DataFrame(np.array([[100, 1, 3], [40, 2, 6], [50, 1, 5], [60, 2, 9], [20, 1, 2], [30, 2, 5]]),
                      columns=['duration', 'uid', 'factor'])
df_log['timestamp'] = pd.Series([dt.datetime(2020, 10, 10, 15, 21), dt.datetime(2020, 10, 10, 16, 27),
                                 dt.datetime(2020, 10, 11, 21, 25), dt.datetime(2020, 10, 11, 10, 12),
                                 dt.datetime(2020, 10, 13, 20, 56), dt.datetime(2020, 10, 13, 13, 15)])
duration uid factor timestamp
0 100 1 3 2020-10-10 15:21:00
1 40 2 6 2020-10-10 16:27:00
...
I want to merge these two (into df_merged) and create a new column in the time series data like this (per uid):
df_merged['new'] = df_merged['duration'] * df_merged['factor']
and ffill df_merged['new'] with this value until the next log for each uid, then do the same operation on the next log and sum, so that it behaves as a moving 2-day sum.
Can anybody show me a direction for this problem?
Expected Output:
timestamp col uid duration factor new
0 2020-10-10 15:20:00 96 1 100 3 300
1 2020-10-10 15:25:00 47 1 100 3 300
2 2020-10-10 15:30:00 78 1 100 3 300
...
2020-10-11 21:25:00 .. 1 60 9 540+300
2020-10-11 21:30:00 .. 1 60 9 540+300
...
2020-10-13 20:55:00 .. 1 20 2 40+540
2020-10-13 21:00:00 .. 1 20 2 40+540
..
2020-10-13 21:25:00 .. 1 20 2 40
As I understand it, it's simpler to calculate the new column on df_log before merging. You'd just use rolling to calculate the window for each uid group:
df_log["new"] = df_log["duration"] * df_log["factor"]
# 2 day rolling window summing `new`
df_log = df_log.groupby("uid").rolling("2d", on="timestamp")["new"].sum().to_frame()
Then merging is straightforward:
# prepare for merge
df_log = df_log.sort_values(by="timestamp")
df3 = df3.sort_values(by="timestamp")

df_merged = (
    pd.merge_asof(df3, df_log, on="timestamp", by=["uid"])
    .dropna()
    .reset_index(drop=True)
)
This solution does deviate slightly from your expected output. The first included row from the continuous series (df3) would be at timestamp 2020-10-10 15:25:00 instead of 2020-10-10 15:20:00 since the merge method would look for the last timestamp in df_log before the timestamp in df3.
Alternatively, if you require the first row in the output to have timestamp 2020-10-10 15:20:00, you can use direction="forward" in pd.merge_asof. That would make each row match the first row in df_log with a timestamp after the one in df3, so you'd need to remove the extra rows in the beginning for each uid.
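If you go that route, a minimal sketch of the forward-looking variant (same df3 and df_log as above; dropping the leading unmatched rows per uid is left as a follow-up step):
# match each df3 row to the next df_log entry instead of the previous one
df_merged_fwd = (
    pd.merge_asof(df3, df_log, on="timestamp", by=["uid"], direction="forward")
    .dropna()
    .reset_index(drop=True)
)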

Sort by two columns [duplicate]

I have a data frame with two columns
import pandas as pd

df = pd.DataFrame.from_records([
    {"time": 10, "amount": 200},
    {"time": 70, "amount": 1000},
    {"time": 10, "amount": 300},
    {"time": 10, "amount": 100},
])
Given a time budget of 80 ms, I want to calculate the maximum amount that fits; in this case the output should be 1300, since the rows with times 70 and 10 give 1000 + 300.
Is it possible with Pandas? I thought about using aggregate, but I do not know how to use it.
This is a knapsack problem, you can solve it with a dedicated library (e.g., knapsack):
from knapsack import knapsack

total, idx = knapsack(df['time'], df['amount']).solve(80)
out = df.iloc[idx]
output:
time amount
1 70 1000
2 10 300
Other examples:
# with max = 75
time amount
1 70 1000
# with max = 40
time amount
0 10 200
2 10 300
3 10 100
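If pulling in a dedicated solver is overkill, a brute-force search over row subsets works for a handful of rows; this is just a sketch with illustrative names (best_amount, best_rows), not part of the original answer:
from itertools import combinations

# Enumerate every subset of rows and keep the best one that fits in the 80 ms
# budget. Exponential in the number of rows, so only suitable for small frames.
best_amount, best_rows = 0, ()
for r in range(len(df) + 1):
    for rows in combinations(df.index, r):
        subset = df.loc[list(rows)]
        if subset['time'].sum() <= 80 and subset['amount'].sum() > best_amount:
            best_amount, best_rows = subset['amount'].sum(), rows

print(best_amount)              # 1300
print(df.loc[list(best_rows)])  # rows with time 70 and 10 (amounts 1000 and 300)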
You could try to upsample your data to 10 ms and then use a rolling window.
# set a time index to the dataframe
df2 = df.set_index(pd.to_timedelta(df['time'], unit='ms').cumsum())
it gives:
time amount
time
0 days 00:00:00.010000 10 200
0 days 00:00:00.080000 70 1000
0 days 00:00:00.090000 10 300
0 days 00:00:00.100000 10 100
We can now upsample the amount, assuming a linear increase between consecutive timestamps:
amounts = (df2.amount / df2.time).resample('10ms').bfill()
giving:
time
0 days 00:00:00.010000 20.000000
0 days 00:00:00.020000 14.285714
0 days 00:00:00.030000 14.285714
0 days 00:00:00.040000 14.285714
0 days 00:00:00.050000 14.285714
0 days 00:00:00.060000 14.285714
0 days 00:00:00.070000 14.285714
0 days 00:00:00.080000 14.285714
0 days 00:00:00.090000 30.000000
0 days 00:00:00.100000 10.000000
Freq: 10L, dtype: float64
Using a rolling window, we can now find the amount per 80ms duration:
amounts.rolling('80ms').sum()
which gives:
time
2022-01-01 00:00:00.010 20.000000
2022-01-01 00:00:00.020 34.285714
2022-01-01 00:00:00.030 48.571429
2022-01-01 00:00:00.040 62.857143
2022-01-01 00:00:00.050 77.142857
2022-01-01 00:00:00.060 91.428571
2022-01-01 00:00:00.070 105.714286
2022-01-01 00:00:00.080 120.000000
2022-01-01 00:00:00.090 130.000000
2022-01-01 00:00:00.100 125.714286
Freq: 10L, dtype: float64
We can see that the maximum value is reached after 90 ms and is 130.
If you only want the max value:
amounts.rolling('80ms').sum().max()
giving directly:
130.0

Filling NaN values from a list with different shape

I have a df
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['A', 'B', 'C', 'D', 'E'],
                  columns=['date_1', 'date_2', 'value_2', 'value_3', 'value_4'],
                  data=[['2021-06-28', '2022-05-03', 30, 40, 60],
                        ['2022-01-10', '2022-05-15', 50, 90, 70],
                        [np.nan, '2022-05-15', 40, 60, 80],
                        [np.nan, '2022-04-28', 40, 60, 90],
                        [np.nan, '2022-06-28', 50, 60, 54]])
      date_1      date_2  value_2  value_3  value_4
A 2021-06-28  2022-05-03       30       40       60
B 2022-01-10  2022-05-15       50       90       70
C        NaN  2022-05-15       40       60       80
D        NaN  2022-04-28       40       60       90
E        NaN  2022-06-28       50       60       54
I am trying to fill the NaN values in column date_1. The values I need to fill date_1 with change every week: the minimum value of date_1 needs to be 2021-06-28 and the maximum value is 2022-06-20 (each week the maximum value in date_1 will be the last Monday). Each date from 2021-06-28 to 2022-06-20 needs to appear in date_1 at least once. The order of these values does not matter.
I tried:
from datetime import date, timedelta
today = date.today()
last_monday = pd.to_datetime((today - timedelta(days=today.weekday()) - timedelta(days=7)).strftime('%Y-%m-%d'))
# date_mappings is a dictionary with this kind of structure:
# {1 : '2021-06-28', 2 : '2021-07-05', ... 52 : '2022-06-20'}
dates_needed = [x for x in pd.to_datetime(list(date_mappings.values())) if x >= last_monday]
So now dates_needed holds the remaining dates that need to be added at least once to the date_1 column.
The problem I am facing is that the shapes do not match when I try to fill the values, because there can be multiple rows with the same date_2.
If I try to use:
df.loc[df['date_1'].isna(), 'date_1'] = dates_needed
I get:
ValueError: Must have equal len keys and value when setting with an iterable
Because this only works if I match the shape:
df.loc[df['date_1'].isna(), 'date_1'] = [pd.to_datetime('2022-01-10 00:00:00'),
                                         pd.to_datetime('2022-01-17 00:00:00'),
                                         pd.to_datetime('2022-01-24 00:00:00')]
date_1 date_2 value_2 value_3 value_4
A 2021-06-28 2022-05-03 30 40 60
B 2022-01-10 2022-05-15 50 90 70
C 2022-01-10 2022-05-15 40 60 80
D 2022-01-17 2022-04-28 40 60 90
E 2022-01-24 2022-06-28 50 60 54
So my goal is to fill the NaN values in date_1 from the list dates_needed so that each date from dates_needed is used at least once in the date_1 column; the order does not matter.
Here is a solution that maps integers to dates from date_mappings using a helper Index sized by the number of missing values (m.sum()). It still works when the length of the dict differs from the number of missing values:
m = df['date_1'].isna()
df.loc[m, 'date_1'] = (pd.Index(range(m.sum())) + 1).map(date_mappings)
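A related sketch, if you want to fill from the dates_needed list built above instead of the raw dict: np.resize repeats (or truncates) the list so its length matches the number of NaN rows. This assumes there are at least as many NaN rows as dates; otherwise some dates would be dropped.
import numpy as np

# Stretch dates_needed to exactly m.sum() entries so the assignment shapes
# match; extra NaN rows simply reuse dates from the start of the list.
m = df['date_1'].isna()
df.loc[m, 'date_1'] = np.resize(pd.to_datetime(dates_needed), m.sum())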

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
So far I have:
from calendar import monthrange

def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y, m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
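As a side note, the .dt accessor also exposes this directly, which avoids the row-wise apply (same sample df as above):
# same result as the apply above, but vectorized
df['daysinmonths'] = df['date'].dt.days_in_month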
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0

Pandas: calculation of the number of days when the sum of the durations on that day was more than 30 minutes

Here is a sample source:
ID Date Duration
111 2020-01-01 00:42:23
111 2020-01-01 00:23:23
111 2020-01-02 00:37:22
222 2020-01-02 00:13:08
222 2020-01-03 01:52:11
....
999 2020-01-31 00:15:21
999 2020-01-31 00:52:12
I use Pandas and I want to calculate the sum of Duration for each day (grouped by Date), and then count how many days in the month have a daily sum of Duration > 30 min, per ID.
Here is what I need to get:
ID Total days when sum of duration by day from each ID > 30 min (per month)
111 2
222 1
....
999 5
Something like this:
aggregation = {
    'num_days': pd.NamedAgg(column="duration", aggfunc=lambda x: x.sum() > dt.timedelta(minutes=30)),
}
total_active = df.groupby('Id').agg(**aggregation)
But this is not at all what I need...
Can anyone help?
Try this,
# total duration in minutes per row
df['_duration'] = pd.to_timedelta(df['Duration']).dt.total_seconds() / 60
df_g = df.groupby('ID')['_duration'].sum().reset_index()

# keep only the IDs whose total duration exceeds 30 minutes
df_g = df_g[df_g['_duration'] > 30]
print(df)
ID Date Duration
0 111 2020-01-01 00:42:23
1 111 2020-01-01 00:23:23
2 111 2020-01-02 00:37:22
3 222 2020-01-02 00:13:08
4 222 2020-01-03 01:52:11
5 999 2020-01-31 00:15:21
6 999 2020-01-31 00:52:12
Use pd.Timedelta to convert the Duration column's dtype to timedelta64[ns] (<m8[ns]):
df['Duration'] = df.Duration.apply(pd.Timedelta)
and then use groupby and sum:
result = (df.groupby(['ID', "Date"])['Duration'].sum() > "30min").groupby("ID").sum()
Output:
ID
111 2.0
222 1.0
999 1.0
Not sure whether we are meant to sum or count; however, to match your expected output:
df['Date'] = pd.to_datetime(df['Date'])           # coerce Date to datetime
df['Duration'] = pd.to_timedelta(df['Duration'])  # coerce Duration to timedelta
df.set_index(df['Date'], inplace=True)            # set time as index

# Group by date and ID, examine the condition and sum.
(df.groupby([df.index.date, df.ID])['Duration'].sum() > '30min').groupby('ID').sum()
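For reference, an equivalent spelling that avoids comparing timedeltas to strings; just a sketch on the sample above (per_day and days_over_30 are illustrative names, and Duration is assumed to still hold 'HH:MM:SS' strings):
# total duration per ID and calendar day
per_day = (
    pd.to_timedelta(df['Duration'])
      .groupby([df['ID'], df['Date']])
      .sum()
)

# count, per ID, the days whose daily total exceeds 30 minutes
days_over_30 = (per_day.dt.total_seconds() > 30 * 60).groupby('ID').sum()
print(days_over_30)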
