Sort by two columns [duplicate] - python

I have a data frame with two columns:
import pandas as pd

df = pd.DataFrame.from_records([
    {"time": 10, "amount": 200},
    {"time": 70, "amount": 1000},
    {"time": 10, "amount": 300},
    {"time": 10, "amount": 100},
])
Given a time budget of 80ms, I want to calculate the maximum total amount that fits within it; in this case the output should be 1300 (the rows with time 70 and 10, amounts 1000 and 300).
Is it possible with Pandas? I thought about using aggregate, but I do not know how to use it.

This is a knapsack problem; you can solve it with a dedicated library (e.g., knapsack):
from knapsack import knapsack

total, idx = knapsack(df['time'], df['amount']).solve(80)
df_out = df.iloc[idx]
output:
   time  amount
1    70    1000
2    10     300
Other examples:
# with max = 75
   time  amount
1    70    1000
# with max = 40
   time  amount
0    10     200
2    10     300
3    10     100
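If you would rather avoid the extra dependency, here is a minimal brute-force sketch; the helper best_subset below is my own illustration, not part of any library, and it enumerates every subset of rows, so it is only practical for small frames:
from itertools import combinations

import pandas as pd

df = pd.DataFrame.from_records([
    {"time": 10, "amount": 200},
    {"time": 70, "amount": 1000},
    {"time": 10, "amount": 300},
    {"time": 10, "amount": 100},
])

def best_subset(frame, budget):
    # Try every subset of row labels and keep the one with the largest total
    # amount whose total time stays within the budget (exponential in len(frame)).
    best_amount, best_rows = 0, ()
    for r in range(1, len(frame) + 1):
        for rows in combinations(frame.index, r):
            sub = frame.loc[list(rows)]
            if sub['time'].sum() <= budget and sub['amount'].sum() > best_amount:
                best_amount, best_rows = sub['amount'].sum(), rows
    return best_amount, best_rows

print(best_subset(df, 80))  # (1300, (1, 2))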

You could try to upsample your data to a 10ms grid and then use a rolling window.
# set a time index to the dataframe
df2 = df.set_index(pd.to_timedelta(df['time'], unit='ms').cumsum())
it gives:
time amount
time
0 days 00:00:00.010000 10 200
0 days 00:00:00.080000 70 1000
0 days 00:00:00.090000 10 300
0 days 00:00:00.100000 10 100
We can now upsample, assuming the amount accrues linearly over each row's duration; the resampled values are per-millisecond rates (amount divided by time):
amounts = (df2.amount / df2.time).resample('10ms').bfill()
giving:
time
0 days 00:00:00.010000 20.000000
0 days 00:00:00.020000 14.285714
0 days 00:00:00.030000 14.285714
0 days 00:00:00.040000 14.285714
0 days 00:00:00.050000 14.285714
0 days 00:00:00.060000 14.285714
0 days 00:00:00.070000 14.285714
0 days 00:00:00.080000 14.285714
0 days 00:00:00.090000 30.000000
0 days 00:00:00.100000 10.000000
Freq: 10L, dtype: float64
Using a rolling window, we can now find the amount per 80ms duration:
amounts.rolling('80ms').sum()
which gives:
time
0 days 00:00:00.010000     20.000000
0 days 00:00:00.020000     34.285714
0 days 00:00:00.030000     48.571429
0 days 00:00:00.040000     62.857143
0 days 00:00:00.050000     77.142857
0 days 00:00:00.060000     91.428571
0 days 00:00:00.070000    105.714286
0 days 00:00:00.080000    120.000000
0 days 00:00:00.090000    130.000000
0 days 00:00:00.100000    125.714286
Freq: 10L, dtype: float64
We can see that the maximum value is reached after 90ms and is 130. Since the series holds per-millisecond rates sampled every 10ms, multiplying by the 10ms step recovers the amount itself: 130 × 10 = 1300, which matches the expected output.
If you only want the max value:
amounts.rolling('80ms').sum().max()
giving directly:
130.0
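Putting the whole approach together in one runnable sketch (it rebuilds the df from the question and multiplies by the 10ms step so the result is expressed as an amount rather than a rate):
import pandas as pd

df = pd.DataFrame.from_records([
    {"time": 10, "amount": 200},
    {"time": 70, "amount": 1000},
    {"time": 10, "amount": 300},
    {"time": 10, "amount": 100},
])

# Index by cumulative elapsed time, spread each amount as a per-ms rate,
# resample to a 10ms grid and take the best 80ms rolling window.
df2 = df.set_index(pd.to_timedelta(df['time'], unit='ms').cumsum())
rates = (df2.amount / df2.time).resample('10ms').bfill()
print(rates.rolling('80ms').sum().max() * 10)  # 1300.0 (up to float rounding)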

Related

Convert string hours to minute pd.eval

I want to convert all rows of my DataFrame that contain hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
    df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hours; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
                  columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total in a single unit (here, minutes), change the frequency to minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
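Note that recent pandas releases may refuse the timedelta64[m] cast; a small equivalent sketch via total_seconds that should work across versions (assumes d1 from above):
import pandas as pd

d1 = pd.DataFrame({'time': ['8h30m', '14h07m', '08h30m', '8h0m']})
d1['tm'] = pd.to_timedelta(d1['time'])
# Same totals as the astype cast, expressed through total_seconds:
d1['total_minutes'] = d1['tm'].dt.total_seconds() / 60  # 510.0, 847.0, 510.0, 480.0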
Bringing all the operations together:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach using roughly the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
                  columns=["time"])
First, split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then recombine the columns, zero-padding the hours and minutes to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the now well-formed string column to a datetime column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute Timedelta:
zerotime = pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0
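For completeness, the unit-appending idea from the first answer and the minutes computation can be collapsed into a single expression (a sketch; it assumes every value follows the <hours>h<minutes> pattern shown above):
import pandas as pd

df = pd.DataFrame({'time': ['8h30', '14h07', '08h30', '7h50', '8h0 ', '8h15', '6h15']})
# Append the missing minutes unit, let pandas parse the string, then convert to minutes.
df['minutes'] = pd.to_timedelta(df['time'].str.strip() + 'm').dt.total_seconds() / 60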

Faster way to iterate in numpy / pandas?

I have a big portfolio of bonds and I want to create a table with days as index, the bonds as columns and the notional of the bonds as values.
I need to put at 0 the rows before the starting date and after the terminating date of each bond.
Is there a more efficient way than this:
[[np.where((day >= bonds.inception[i]) &
           (day + relativedelta(months=+m) >= bonds.maturity[i]) &
           (day <= bonds.maturity[i]),
           bonds.principal[i],
           0)
  for i in range(bonds.shape[0])] for day in idx_d]
input example:
id  nom  inception   maturity
38  200  22/04/2022  22/04/2032
87  100  22/04/2022  22/04/2052
output example:
day          38   87
21/04/2022    0    0
22/04/2022  100  200
The solution below still requires a loop. I don't know if it's faster, or whether you find it clear, but I'll offer it as an alternative.
Create an example dataframe (with a few extra bonds for demonstration purposes):
import pandas as pd
df = pd.DataFrame({'id': [38, 87, 49, 51, 89],
                   'nom': [200, 100, 150, 50, 250],
                   'start_date': ['22/04/2022', '22/04/2022', '01/01/2022', '01/05/2022', '23/04/2012'],
                   'end_date': ['22/04/2032', '22/04/2052', '01/01/2042', '01/05/2042', '23/04/2022']})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df = df.set_index('id')
print(df)
This then looks like:
    nom           start_date             end_date
id
38  200  2022-04-22 00:00:00  2032-04-22 00:00:00
87  100  2022-04-22 00:00:00  2052-04-22 00:00:00
49  150  2022-01-01 00:00:00  2042-01-01 00:00:00
51   50  2022-01-05 00:00:00  2042-01-05 00:00:00
89  250  2012-04-23 00:00:00  2022-04-23 00:00:00
Now, create a new blank dataframe, with 0 as the default value:
new = pd.DataFrame(data=0, columns=df.index, index=pd.date_range('2022-04-20', '2062-04-22'))
new.index.rename('day', inplace=True)
Then, iterate over the columns (i.e. the index of the original dataframe), select the relevant interval, and set the column value to the corresponding 'nom' over that interval:
for column in new.columns:
    sel = (new.index >= df.loc[column, 'start_date']) & (new.index <= df.loc[column, 'end_date'])
    new.loc[sel, column] = df.loc[df.index == column, 'nom'].values
print(new)
which results in:
day                   38   87   49   51   89
2022-04-20 00:00:00    0    0  150   50  250
2022-04-21 00:00:00    0    0  150   50  250
2022-04-22 00:00:00  200  100  150   50  250
2022-04-23 00:00:00  200  100  150   50  250
2022-04-24 00:00:00  200  100  150   50    0
...
2062-04-21 00:00:00    0    0    0    0    0
2062-04-22 00:00:00    0    0    0    0    0

[14613 rows x 5 columns]
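If you want to get rid of the loop entirely, here is a fully vectorized sketch with numpy broadcasting; it assumes the same df (indexed by id, with nom, start_date and end_date columns) and date range as built above:
import numpy as np
import pandas as pd

# Compare every day against every bond's start/end in one (days x bonds) boolean
# mask, then multiply by the 'nom' values to fill the whole table in a single step.
days = pd.date_range('2022-04-20', '2062-04-22')
mask = ((days.values[:, None] >= df['start_date'].values)
        & (days.values[:, None] <= df['end_date'].values))
new = pd.DataFrame(mask * df['nom'].values, index=days, columns=df.index)
new.index.rename('day', inplace=True)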

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:
import numpy as np
import pandas as pd

def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    # accept either strings or Timestamps
    start, end = pd.to_datetime(start), pd.to_datetime(end)
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start from day one, and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried using origin='15/01/2018' or offset='15', and neither works with the 'M' resample rule (they do work when I use 30D, but that is of no use). I've also tried to use '2SM', but it doesn't work either.
So my question is: is there a way of changing the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of the month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
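Optionally, if you would rather label each bin by the 15th that actually opens the window instead of the shifted month-end, you could relabel the index before the period conversion (a small sketch, assuming res from above):
# Map each shifted month-end label back to the 15th that starts its window,
# e.g. 2020-04-30 -> 2020-04-15 (that window covers 2020-04-15 .. 2020-05-14).
res.index = res.index.to_period('M').to_timestamp() + pd.Timedelta('14D')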
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS' - semi-month start frequency (the 1st and 15th). Instead of keeping just the mean values, keep the count and the sum, then recompute the weighted mean of each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain each period always runs from the 15th of a month to the 14th of the next.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Compute a rolling sum and a rolling count over pairs of consecutive sub-periods, then take their ratio to get the mean:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
            Integers        sum_rolling  count_rolling       mean
                 sum  count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer

Pandas: calculation of the number of days when the sum of the durations on that day was more than 30 minutes

Here is a sample source:
ID Date Duration
111 2020-01-01 00:42:23
111 2020-01-01 00:23:23
111 2020-01-02 00:37:22
222 2020-01-02 00:13:08
222 2020-01-03 01:52:11
....
999 2020-01-31 00:15:21
999 2020-01-31 00:52:12
I use Pandas and I want to calculate the sum of Duration for each day (by Date), and then count, per ID, how many days in the month have a daily total of more than 30 minutes.
Here is what I need to get:
ID   days with daily duration sum > 30 min (per month)
111  2
222  1
....
999  5
Something like this:
aggregation = {
    'num_days': pd.NamedAgg(column="duration", aggfunc=lambda x: x.sum() > dt.timedelta(minutes=30)),
}
total_active = df.groupby('Id').agg(**aggregation)
But this is not at all what I need...
Can anyone help?
Try this:
df['_duration'] = pd.to_timedelta(df['Duration']).dt.total_seconds() / 60  # minutes
df_g = df.groupby('ID')['_duration'].sum().reset_index()
# keep only the IDs whose total is greater than 30 minutes
df_g = df_g[df_g['_duration'] > 30]
print(df)
ID Date Duration
0 111 2020-01-01 00:42:23
1 111 2020-01-01 00:23:23
2 111 2020-01-02 00:37:22
3 222 2020-01-02 00:13:08
4 222 2020-01-03 01:52:11
5 999 2020-01-31 00:15:21
6 999 2020-01-31 00:52:12
use pd.Timedelta to convert the Duration column's dtype to <m8[ns]:
df['Duration'] = df.Duration.apply(pd.Timedelta)
and then use groupby and sum:
result = (df.groupby(['ID', "Date"])['Duration'].sum() > "30min").groupby("ID").sum()
Output:
ID
111 2.0
222 1.0
999 1.0
Not sure whether we are meant to sum or count; however, to match your output:
df['Date'] = pd.to_datetime(df['Date'])           # coerce Date to datetime
df['Duration'] = pd.to_timedelta(df['Duration'])  # coerce Duration to timedelta
df.set_index(df['Date'], inplace=True)            # set time as index
# Group by date and ID, examine the condition and sum.
(df.groupby([df.index.date, df.ID])['Duration'].sum() > '30min').groupby('ID').sum()
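Since the question asks for a per-month count (the sample data happens to fall entirely in January, so the answers above ignore the month), here is a hedged sketch that keeps the month in the grouping; it assumes the original df with ID, Date and Duration columns as shown in the question:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df['Duration'] = pd.to_timedelta(df['Duration'])

# Sum durations per ID per calendar day, flag days over 30 minutes,
# then count those days per ID and month.
daily = df.groupby([df['ID'],
                    df['Date'].dt.to_period('M').rename('Month'),
                    df['Date'].dt.normalize().rename('Day')])['Duration'].sum()
result = (daily > pd.Timedelta(minutes=30)).groupby(level=['ID', 'Month']).sum()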

pandas resample timed events in DataFrame to precise time-bins

Maybe I just could not find it... anyhow, with pandas '0.19.2' there is the following problem:
I have some timed events with associated groups, which can be generated by:
from numpy.random import randint, seed
import pandas as pd

seed(42)  # reproducibility
samp_N = 1000
# create times within 3 hours, and 15 random groups
df = pd.DataFrame({'time': randint(0, 3*60*60, samp_N),
                   'group': randint(0, 15, samp_N)})
# make a resample-able index from the seconds time values
df.set_index(pd.TimedeltaIndex(df.time, 's'), inplace=True)
which looks like:
group time
02:01:10 10 7270
00:14:20 13 860
01:29:50 9 5390
01:26:31 13 5191
...
When I try to resample the events, I get something undesirable
df.resample('5T').count()
group time
00:00:04 28 28
00:05:04 18 18
00:10:04 32 32
...
Unfortunately the resampling periods start at an arbitrary offset (the first value in the data).
It is even more annoying if I group this (as ultimately required)
df.groupby('group').resample('5T').count()
then I get a new offset for each group
what I want is the precise start of sampling windows:
00:00:00 5 ...
00:05:00 17 ...
00:10:00 11 ...
...
there was a suggestion in: https://stackoverflow.com/a/23966229
df.groupby(pd.TimeGrouper('5Min')).count()
but it does not work either, as it also ruins the grouping required above.
thanks for hints!
Unfortunately I didn't come up with a nice solution, but rather a workaround: I added a dummy row with time value zero and then grouped by time and group:
df = (pd.Series({'time': 0, 'group': -1})
        .to_frame().T
        .set_index(pd.TimedeltaIndex([0], 's'))
        .append(df))
df = df.groupby([pd.Grouper(freq='5Min'), 'group']).count().reset_index('group')
df = df.loc[df['group'] != -1]
df.head()
group time
0 days 0 2
0 days 1 4
0 days 2 3
0 days 3 1
0 days 4 2
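Note that DataFrame.append was removed in pandas 2.0; on newer versions the same dummy-row trick can be written with pd.concat (a sketch, otherwise identical):
# Build the dummy row and prepend it with concat instead of the removed append.
dummy = (pd.Series({'time': 0, 'group': -1})
           .to_frame().T
           .set_index(pd.to_timedelta([0], unit='s')))
df = pd.concat([dummy, df])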
I am not sure this is the result you want:
result = df.groupby(['group', pd.Grouper(freq='5Min')]).count().reset_index(level=0)
result.head()
>>> group time
00:05:00 0 2
00:10:00 0 1
00:15:00 0 3
00:20:00 0 2
00:30:00 0 1
result.sort_index().head()
>>> group time
0 days 10 1
0 days 14 3
0 days 2 1
0 days 13 1
0 days 4 3
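On more recent pandas versions, another workaround sketch is to bin each event explicitly with floor(), which anchors every 5-minute bucket at 00:00:00 regardless of the first timestamp (assumes the df built in the question):
# Assign each event to its 5-minute bucket, then count events per (group, bin).
counts = (df.assign(bin=df.index.floor('5min'))
            .groupby(['group', 'bin'])['time']
            .count())
print(counts.head())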
