date balance
2020-03-31 1000
2020-03-31 900
2020-03-31 800
2020-03-31 700
2020-03-31 200
2020-03-31 100
....
2020-03-31 20
2020-03-31 1
2020-03-31 0.3
....
2020-06-30 3420
2020-06-30 3000
2020-06-30 2000
....
2020-06-30 30
2020-06-30 3
....
2020-09-30 10000
2020-09-30 3000
..
2020-09-30 3
I want to group by date and sum the balances that fall in the top 1% (i.e. at or above the 99th percentile) for each date.
I used
book2 = book.groupby(['date'])['balance'].agg([lambda x : np.quantile(x, q=0.99), "sum"])
but this is giving me a strange value...
Any idea how to solve this?
Thanks!
Select all values at or above the 99th percentile, then sum them for each date:
df.groupby('date')['balance'].apply(lambda x: x[x >= np.quantile(x, q=0.99)].sum())
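For reference, here is a minimal self-contained sketch of the same idea (the toy values are invented for illustration; with only a handful of rows per date the 99th-percentile cutoff sits close to the maximum):
import numpy as np
import pandas as pd

# Toy frame for illustration only; the real data has many more rows per date.
df = pd.DataFrame({
    "date": ["2020-03-31"] * 5 + ["2020-06-30"] * 5,
    "balance": [1000, 900, 800, 700, 100, 3420, 3000, 2000, 30, 3],
})

# For each date, keep only balances at or above that date's 99th percentile, then sum them.
top_sum = df.groupby("date")["balance"].apply(lambda x: x[x >= np.quantile(x, q=0.99)].sum())
print(top_sum)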
Related
I have the following dataframe:
created_time shares_count
2021-07-01 250.0
2021-07-31 501.0
2021-08-02 48.0
2021-08-05 300.0
2021-08-07 200.0
2021-09-06 28.0
2021-09-08 100.0
2021-09-25 100.0
2021-09-30 200.0
I did the monthly grouping like this:
df_groupby_monthly = df.groupby(pd.Grouper(key='created_time',freq='M')).sum()
df_groupby_monthly
Now how do I get the average shares_count by dividing each monthly sum by the number of rows in that month?
For example, the 07th month has 2 rows so the average should be 751.0/2 = 375.5, the 08th month has 3 rows so the average should be 548.0/3 = 182.666, and the 09th month has 4 rows so the average should be 428.0/4 = 107.0.
How do I get a final output like this?
created_time shares_count
2021-07-31 375.5
2021-08-31 182.666
2021-09-30 107.0
I have tried the following:
df.groupby(pd.Grouper(key='created_time',freq='M')).apply(lambda x: x['shares_count'].sum()/len(x))
This works when there is only one column, but it is hard to extend to multiple columns.
df['created_time'] = pd.to_datetime(df['created_time'])
output = df.groupby(df['created_time'].dt.to_period('M')).mean().round(2).reset_index()
output
###
created_time shares_count
0 2021-07 375.50
1 2021-08 182.67
2 2021-09 107.00
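If you also want month-end dates (as in the desired output) rather than periods, one possible tweak on top of the output frame above is:
# Convert the monthly periods back to month-end dates, e.g. 2021-07 -> 2021-07-31.
output['created_time'] = output['created_time'].dt.to_timestamp(how='end').dt.normalize()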
Use this code:
df = df.groupby(pd.Grouper(key='created_time', freq='M')).agg({'shares_count': ['sum', 'count']}).reset_index()
df['ss'] = df[('shares_count', 'sum')] / df[('shares_count', 'count')]
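A possible follow-up (not part of the original answer): the result now has two-level column labels, so you may want to flatten them and keep only the month and the computed average:
# Flatten the MultiIndex columns and keep just the month and the average.
df.columns = ['created_time', 'sum', 'count', 'ss']
df = df[['created_time', 'ss']]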
My sample dataframe looks like this:
ID Date Value
2 2020-06-30 124
1 2020-09-30 265
1 2021-12-31 140
1 2020-12-31 142
2 2020-12-31 147
1 2019-12-31 677
1 2021-03-31 235
2 2021-09-30 917
2 2021-03-31 149
I want to grab the row with the max date for each year of each ID.
The final output would be:
ID Date Value
1 2019-12-31 677
1 2020-12-31 142
1 2021-12-31 140
2 2020-12-31 147
2 2021-09-30 917
I tried groupby on ID but I'm not sure how to grab the rows with the max date for each year.
Many thanks for your help!
Here is one way to accomplish it
df.assign(yr=pd.to_datetime(df['Date']).dt.year).sort_values('Date').groupby(['ID','yr']).last().reset_index().drop(columns=['yr'])
since the row with the latest date in each year is needed, a temporary year column is created via assign, the frame is sorted by date, then grouped by ID and year, taking the last (latest-dated) row of each group. Finally the yr column is dropped from the result.
ID Date Value
0 1 2019-12-31 677
1 1 2020-12-31 142
2 1 2021-12-31 140
3 2 2020-12-31 147
4 2 2021-09-30 917
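An alternative sketch (not from the answer above) that selects whole rows via the index of the latest date per ID and year:
# Work with real datetimes so comparisons are chronological.
df['Date'] = pd.to_datetime(df['Date'])

# For each (ID, year), find the index label of the latest Date, then pull those rows.
idx = df.groupby(['ID', df['Date'].dt.year])['Date'].idxmax()
result = df.loc[idx].sort_values(['ID', 'Date'])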
First you would need to extract the year from the date:
df['year'] = pd.DatetimeIndex(df['Date']).year
Then, if you want to grab the rows with the max date in each year for each ID, get the max dates:
maxDf = df.groupby(['ID', 'year'])['Date'].max()
Then you can filter your dataframe on those max dates:
maxDates = maxDf.tolist()
df.loc[df['Date'].isin(maxDates)]
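Note that if two IDs can share the same date, filtering on the dates alone may pull in extra rows. A sketch of a merge on both keys (reusing the year column created above) avoids that:
# Max date per (ID, year), then an inner join back on ID and Date keeps only the matching rows.
max_dates = df.groupby(['ID', 'year'], as_index=False)['Date'].max()
df.merge(max_dates[['ID', 'Date']], on=['ID', 'Date'])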
I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five-minute blocks, showing the number of arrivals at each station in the time block that starts at that time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
(df.groupby(['id', pd.Grouper(key='time', freq='5min')])
   .size()
   .to_frame('arrivals')
   .reset_index())
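A self-contained sketch of that approach, rebuilding the five sample rows from the question:
import pandas as pd

# Sample arrivals from the question.
df = pd.DataFrame({
    "time": pd.to_datetime(["2019-10-31 23:59:36", "2019-10-31 23:58:23",
                            "2019-10-31 23:54:55", "2019-10-31 23:54:46",
                            "2019-10-31 23:54:42"]),
    "id": [22, 260, 82, 82, 21],
})

# Count rows per (id, 5-minute block) and flatten the result back into columns.
arrivals = (df.groupby(["id", pd.Grouper(key="time", freq="5min")])
              .size()
              .to_frame("arrivals")
              .reset_index())
print(arrivals)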
I think it's a horrible solution (couldn't find a better one at the moment), but it more or less gets you where you want:
df.groupby("id").resample("5min", on="time").count()[["id"]].swaplevel(0, 1, axis=0).sort_index(axis=0).set_axis(["arrivals"], axis=1)
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1
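To get the flat shape the question asks for (id, time, arrivals), a small follow-up sketch on top of that result:
# Rename the counted column and move the group keys back into columns.
out = (df.set_index("time")
         .groupby("id")
         .resample("5min")
         .count()
         .rename(columns={"id": "arrivals"})
         .reset_index())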
I have the following dataframe:
amount
01-01-2020 100
01-02-2020 100
01-03-2020 100
01-04-2020 100
01-05-2020 100
01-06-2020 100
01-07-2020 100
01-08-2020 100
01-09-2020 100
01-10-2020 100
01-11-2020 100
01-12-2020 100
I need to add a new column which starts at 100 and increases the value by 10% every 4 months, i.e.:
amount result
01-01-2020 100 100
01-02-2020 100 100
01-03-2020 100 100
01-04-2020 100 100
01-05-2020 100 110
01-06-2020 100 110
01-07-2020 100 110
01-08-2020 100 110
01-09-2020 100 121
01-10-2020 100 121
01-11-2020 100 121
01-12-2020 100 121
I think you need a Grouper with a 4-month frequency together with GroupBy.ngroup to number the groups, then multiply the group numbers by 100, divide by 10 (for the 10% step) and finally add 100:
df.index = pd.to_datetime(df.index, dayfirst=True)
df['result'] = df.groupby(pd.Grouper(freq='4MS')).ngroup().mul(100).div(10).add(100)
print (df)
amount result
2020-01-01 100 100.0
2020-02-01 100 100.0
2020-03-01 100 100.0
2020-04-01 100 100.0
2020-05-01 100 110.0
2020-06-01 100 110.0
2020-07-01 100 110.0
2020-08-01 100 110.0
2020-09-01 100 120.0
2020-10-01 100 120.0
2020-11-01 100 120.0
2020-12-01 100 120.0
If the datetimes are consecutive and always come in groups of 4 rows, it is possible to use:
df['result'] = np.arange(len(df)) // 4 * 100 / 10 + 100
print (df)
amount result
2020-01-01 100 100.0
2020-02-01 100 100.0
2020-03-01 100 100.0
2020-04-01 100 100.0
2020-05-01 100 110.0
2020-06-01 100 110.0
2020-07-01 100 110.0
2020-08-01 100 110.0
2020-09-01 100 120.0
2020-10-01 100 120.0
2020-11-01 100 120.0
2020-12-01 100 120.0
Here is another way:
pct = .1
df['result'] = df['amount'] * (1 + pct) ** (np.arange(len(df))//4)
Note that this compounded version reproduces the expected output in the question (100, 110, 121), whereas the linear increments from the Grouper/ngroup answer above give 120 for the last four months instead of 121.
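For reference, a minimal self-contained check of the compounded version (dates and amounts mirror the question; the last group is 121 up to float rounding):
import numpy as np
import pandas as pd

# 12 monthly rows of 100, starting 2020-01-01, as in the question.
df = pd.DataFrame({'amount': [100] * 12},
                  index=pd.date_range('2020-01-01', periods=12, freq='MS'))

pct = .1
df['result'] = df['amount'] * (1 + pct) ** (np.arange(len(df)) // 4)
# result: 100.0 for the first four rows, 110.0 for the next four, ~121.0 for the last four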
I have some data which I'm trying to group by "name" first and then resample by "transaction_date":
transaction_date name revenue
01/01/2020 ADIB 30419
01/01/2020 ADIB 1119372
01/01/2020 ADIB 1272170
01/01/2020 ADIB 43822
01/01/2020 ADIB 24199
The issue I have is that writing the groupby/resample in two different ways returns two different results:
1-- df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)
2-- df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)
The first method returns the values I'm looking for.
I don't understand why the two methods return different results. Is this a bug?
Result 1
name transaction_date revenue
ADIB 2020-01-31 39170943.0
2020-02-29 48003966.0
2020-03-31 32691641.0
2020-04-30 11979337.0
2020-05-31 35510726.0
2020-06-30 25677857.0
2020-07-31 12437122.0
2020-08-31 4348936.0
2020-09-30 10547188.0
2020-10-31 5287406.0
2020-11-30 4288930.0
2020-12-31 17066105.0
Result 2
name transaction_date revenue
ADIB 2020-01-31 64128331.0
2020-02-29 54450014.0
2020-03-31 45636192.0
2020-04-30 25016777.0
2020-05-31 11941744.0
2020-06-30 15703151.0
2020-07-31 5517526.0
2020-08-31 4092618.0
2020-09-30 4333433.0
2020-10-31 3944117.0
2020-11-30 6528058.0
2020-12-31 5718196.0
Indeed, it's either a bug or an extremely strange behavior. Consider the following data:
input:
date revenue name
0 2020-10-27 0.744045 n_1
1 2020-10-29 0.074852 n_1
2 2020-11-21 0.560182 n_2
3 2020-12-29 0.208616 n_2
4 2020-05-03 0.325044 n_0
gb = df.groupby("name").resample("M", on="date")
gb.aggregate({'revenue':'sum'})
==>
revenue
name date
n_0 2020-12-31 0.325044
n_1 2020-05-31 0.744045
2020-06-30 0.000000
2020-07-31 0.000000
2020-08-31 0.000000
2020-09-30 0.000000
2020-10-31 0.074852
n_2 2020-10-31 0.560182
2020-11-30 0.208616
print(gb.sum()[['revenue']])
==>
revenue
name date
n_0 2020-05-31 0.325044
n_1 2020-10-31 0.818897
n_2 2020-11-30 0.560182
2020-12-31 0.208616
As one can see, it seems that aggregate produces the wrong results. For example, it takes data from Oct and attaches it to May.
Here's an even simpler example:
Data frame:
date revenue name
0 2020-02-24 9 n_1
1 2020-05-12 8 n_2
2 2020-03-28 9 n_2
3 2020-01-14 2 n_0
gb = df.groupby("name").resample("M", on="date")
res1 = gb.sum()[['revenue']]
==>
name date
n_0 2020-01-31 2
n_1 2020-02-29 9
n_2 2020-03-31 9
2020-04-30 0
2020-05-31 8
res2 = gb.aggregate({'revenue':'sum'})
==>
name date
n_0 2020-05-31 2
n_1 2020-01-31 9
n_2 2020-02-29 8
2020-03-31 9
I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173
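Until that issue is settled, one way to sidestep it is to group on both keys at once with pd.Grouper instead of chaining resample after groupby, which avoids the groupby-then-resample code path the issue is about (a sketch using the column names from the examples above):
monthly = df.groupby(["name", pd.Grouper(key="date", freq="M")])["revenue"].sum()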