I have the following dataframe:
print(df)
day month year quantity
6 04 2018 10
8 04 2018 8
12 04 2018 8
I would like to create a column with the sum of "quantity" over the next "n" days, as follows:
n = 2
print(df1)
day month year quantity final_quantity
6 04 2018 10 10 + 0 + 8 = 18
8 04 2018 8 8 + 0 + 0 = 8
12 04 2018 8 8 + 0 + 0 = 8
Specifically, the sum should add 0 for days on which the product was not sold within the next "n" days.
I tried rolling sums from pandas, but rolling operates on consecutive rows and does not seem to take the date columns into account:
n = 2
df.quantity[::-1].rolling(n + 1, min_periods=1).sum()[::-1]
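For reference, the sample frame used in the answers below can be reconstructed like this (a minimal sketch; the integer dtypes are an assumption):
import pandas as pd

df = pd.DataFrame({'day': [6, 8, 12],
                   'month': [4, 4, 4],
                   'year': [2018, 2018, 2018],
                   'quantity': [10, 8, 8]})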
You can use a list comprehension:
import pandas as pd
df['DateTime'] = pd.to_datetime(df[['year', 'month', 'day']])
n = 2
df['final_quantity'] = [df.loc[df['DateTime'].between(d, d + pd.Timedelta(days=n)), 'quantity'].sum()
                        for d in df['DateTime']]
print(df)
# day month year quantity DateTime final_quantity
# 0 6 4 2018 10 2018-04-06 18
# 1 8 4 2018 8 2018-04-08 8
# 2 12 4 2018 8 2018-04-12 8
You can use set_index and rolling with sum:
# build a DatetimeIndex from the three date columns (zero-padded so '%m%d%Y' parses reliably)
df_out = df.set_index(pd.to_datetime(df['month'].astype(str).str.zfill(2) +
                                     df['day'].astype(str).str.zfill(2) +
                                     df['year'].astype(str), format='%m%d%Y'))['quantity']
d1 = df_out.resample('D').asfreq(fill_value=0)   # insert the missing days as 0
d2 = d1[::-1].reset_index()                      # reverse so rolling looks "forward" in time
# window of n + 1 = 3 rows: the current day plus the next n = 2 days
df['final_quantity'] = d2['quantity'].rolling(3, min_periods=1).sum()[::-1].to_frame()\
                         .set_index(d1.index)\
                         .reindex(df_out.index).values
Output:
day month year quantity final_quantity
0 6 4 2018 10 18.0
1 8 4 2018 8 8.0
2 12 4 2018 8 8.0
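A hedged alternative sketch of the same idea without the double index round-trip; it assumes the DateTime column built in the first answer, and final_quantity2 is an illustrative name:
s = df.set_index('DateTime')['quantity'].resample('D').sum()  # missing days become 0
fwd = s[::-1].rolling(3, min_periods=1).sum()[::-1]           # day d plus the next 2 days
df['final_quantity2'] = fwd.reindex(df['DateTime']).values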
I am trying to see "how much time each employee has spent at each Rank and Level".
The dataset covers employee tenure, i.e. if an employee is active for 5 months, then 5 records are available.
Below is the dataset
Employee Month Rank Level
A 01-07-2022 10 1
A 01-09-2022 10 1
A 01-08-2022 10 1
A 01-10-2022 10 2
A 01-12-2022 10 3
A 01-01-2023 11 1
B 01-07-2022 07 1
B 01-09-2022 07 1
B 01-08-2022 09 4
B 01-10-2022 09 2
B 01-12-2022 11 3
B 01-01-2023 12 1
Code:
df = df.groupby(['Employee', 'Rank', 'Level'])['Rank']\
       .count()\
       .to_frame('Tenure_rank_grade').reset_index()
But the above code gives a record count, whereas the expected output is the time spent. How can this be achieved?
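For reference, a sketch reconstructing the sample data; parsing Month day-first is an assumption, consistent with the day counts shown in the answer below:
import pandas as pd

df = pd.DataFrame({
    'Employee': ['A'] * 6 + ['B'] * 6,
    'Month': ['01-07-2022', '01-09-2022', '01-08-2022',
              '01-10-2022', '01-12-2022', '01-01-2023'] * 2,
    'Rank': [10, 10, 10, 10, 10, 11, 7, 7, 9, 9, 11, 12],
    'Level': [1, 1, 1, 2, 3, 1, 1, 1, 4, 2, 3, 1],
})
df['Month'] = pd.to_datetime(df['Month'], dayfirst=True)  # 01-07-2022 -> 1 July 2022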
You can use a groupby.diff combined with groupby.sum:
# 'Month' must be a datetime column (e.g. pd.to_datetime(df['Month'], dayfirst=True))
delta = -df.sort_values(by='Month').groupby('Employee')['Month'].diff(-1)
out = (df.assign(total_time=delta)
         .groupby(['Employee', 'Rank', 'Level'])[['total_time']]
         .sum()
       )
print(out)
Output:
total_time
Employee Rank Level
A 10 1 92 days
2 61 days
3 31 days
11 1 0 days
B 7 1 61 days
9 2 61 days
4 31 days
11 3 31 days
12 1 0 days
If you want to count up to the current day instead of 0 for the last position:
delta = (df.sort_values(by='Month')
           .groupby('Employee', group_keys=False)['Month']
           .apply(lambda s: -s.diff(-1).fillna(s.max() - pd.Timestamp('today').floor('D')))
         )
out = (df.assign(total_time=delta)
         .groupby(['Employee', 'Rank', 'Level'])[['total_time']]
         .sum()
       )
Output:
total_time
Employee Rank Level
A 10 1 92 days
2 61 days
3 31 days
11 1 11 days
B 7 1 61 days
9 2 61 days
4 31 days
11 3 31 days
12 1 11 days
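If you additionally want the tenure expressed in approximate months, one possible sketch; dividing by 30.44 (the average month length) is an assumption about the desired rounding:
out['months'] = (out['total_time'].dt.days / 30.44).round(1)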
The objective is a running subtraction: each row (N) is subtracted from the result of the previous row (N-1), separated by groups.
Given a df
years nchar nval
0 2019 a 1
1 2019 b 1
2 2019 c 1
3 2020 a 1
4 2020 s 4
Let's separate out the group for year 2019 and denote it df_2019.
For df_2019, we assign the constant 10.
Then, only for index 0, we do the following operation and assign it to a new column 'B':
df_2019.loc[df_2019.index[0], 'B']= 10 - df_2019['nval'].values[0]
whereas, for every other index N:
df_2019.loc[df_2019.index[N], 'B'] = df_2019['B'].values[N-1] - df_2019['nval'].values[N]
This will produce the following table:
years nchar nval C D B
1 2019 a 1 9
2 2019 b 1 8
3 2019 c 1 7
For the group 2020, the same computation applies. The only difference is that the starting constant is 7, taken from the last value of column B in the 2019 group.
To meet this requirement, I produced the following code, extended to additional groups.
import pandas as pd

year = [2019, 2019, 2019, 2020, 2020, 2020, 2020, 2022, 2022, 2022]
nval = [1, 1, 1, 1, 4, 1, 4, 5, 6, 7]
nchar = ['a', 'b', 'c', 'a', 's', 'c', 'a', 'b', 'c', 'g']
df = pd.DataFrame(zip(year, nchar, nval), columns=['years', 'nchar', 'nval'])
print(df)

year_ls = [2019, 2020, 2022]
nspacing_total = 2
nspacing_between_df = 4
all_df = []
default_val = 10
for idx, dyear in enumerate(year_ls):
    df_ = df[df['years'] == dyear].reset_index(drop=True)
    t = pd.DataFrame([[''] * 3] * len(df_), columns=["C", "D", "B"])
    df_ = pd.concat([df_, t], axis=1)
    Total = df_['nval'].sum()
    # prepend one empty spacer row (DataFrame.append was removed in pandas 2.0; use pd.concat)
    spacer = pd.DataFrame([[''] * len(df_.columns)], columns=df_.columns)
    df_ = pd.concat([spacer, df_]).reset_index(drop=True)
    if idx == 0:
        df_.loc[df_.index[0], 'B'] = default_val
    else:
        # carry the closing balance of the previous group forward
        pre_df = all_df[idx - 1]
        pre_val = pre_df['B'].values[-1]
        nposi = 1
        pre_years = pre_df['years'].values[nposi]
        df_.loc[df_.index[0], 'nchar'] = f'From {pre_years}'
        df_.loc[df_.index[0], 'B'] = pre_val
    # running subtraction: B[N] = B[N-1] - nval[N]
    for ndexd in range(df_.shape[0] - 1):
        df_.loc[df_.index[ndexd + 1], 'B'] = df_['B'].values[ndexd] - df_['nval'].values[ndexd + 1]
    # append spacer rows; the last one becomes the Total row
    trailer = pd.DataFrame([[''] * len(df_.columns)] * nspacing_total, columns=df_.columns)
    df_ = pd.concat([df_, trailer]).reset_index(drop=True)
    df_.loc[df_.index[-1], 'nval'] = Total
    df_.loc[df_.index[-1], 'nchar'] = 'Total'
    df_.loc[df_.index[-1], 'B'] = df_['B'].values[0] - df_['nval'].values[-1]
    all_df.append(df_)
However, I wonder whether this proposal can be simplified further using pandas groupby or similar. I would appreciate any tips.
Ultimately, I would like to express the table as below, which will be exported to Excel:
years nchar nval C D B
0 10
1 2019 a 1 9
2 2019 b 1 8
3 2019 c 1 7
4
5 Total 3 7
6
7
8
9
10 From 2019 7
11 2020 a 1 6
12 2020 s 4 2
13 2020 c 1 1
14 2020 a 4 -3
15
16 Total 10 -3
17
18
19
20
21 From 2020 -3
22 2022 b 5 -8
23 2022 c 6 -14
24 2022 g 7 -21
25
26 Total 18 -21
27
28
29
30
The code to produce the table above:
# Optional, to reproduce the table above
all_ap_df = []
for a_df in all_df:
    spacer = pd.DataFrame([[''] * len(a_df.columns)] * nspacing_between_df, columns=a_df.columns)
    df = pd.concat([a_df, spacer]).reset_index(drop=True)
    all_ap_df.append(df)
df = pd.concat(all_ap_df, axis=0).reset_index(drop=True)
df.loc[df.index[0], 'D'] = df['B'].values[0]
df.loc[df.index[0], 'B'] = ''
df = df.fillna('')
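Since the goal is an Excel export, the last step could be a plain to_excel call (the file name is illustrative):
df.to_excel('summary.xlsx', index=False)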
I think this is actually quite simple. Use cumsum (and chain groupby when the balance should reset per group):
df['B'] = 10 - df['nval'].cumsum()
Output:
>>> df
years nchar nval B
0 2019 a 1 9
1 2019 b 1 8
2 2019 c 1 7
3 2020 a 1 6
4 2020 s 4 2
In your case, chain it with groupby:
df['new'] = df.groupby('years')['nval'].cumsum().rsub(10)
Out[8]:
0 9
1 8
2 7
3 9
4 5
Name: nval, dtype: int64
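If you also need the per-year Total rows from your target table, here is a minimal sketch building on the global cumsum over the full ten-row frame (totals is an illustrative name, not from the original code):
df['B'] = 10 - df['nval'].cumsum()                      # balance carries across years
totals = df.groupby('years').agg(nval=('nval', 'sum'),  # yearly consumption
                                 B=('B', 'last'))       # closing balance per year
print(totals)
#        nval   B
# years
# 2019      3   7
# 2020     10  -3
# 2022     18 -21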
I have a range of days in df starting on January 1, 1930 and ending on May 7, 2020. I want columns that divide the year in different ways: so far I have columns denoting the Year, Month and Week. I also want columns denoting Dekad and Semi-Month increments.
A dekad is a 10-day period where January 1-10 is dekad "1", Jan 11-20 is dekad "2", etc., and the final dekad "37" has fewer than 10 days because 365 does not divide evenly by 10.
For semi-month, I want to divide each month in half and increment over the year. This is a little trickier because months have different lengths, but basically Jan 1-15 would be "1", Jan 16-31 would be "2", Feb 1-14 would be "3", Feb 15-28 would be "4", etc. (in a non-leap year).
In other words, I want custom datetime splits or custom periods of the calendar year. This should be relatively easy to do for the dekads, so that is my priority more so than the semi-monthly split.
Is there something baked into the datetime package that can already do this, or do I need to write custom function(s)?
If the latter, a starting point for Dekad might be to take the first_day_of_year object, repeatedly add datetime.timedelta(days=10), and increment from 1 to 37 for each dekad? Suggestions welcome.
# import packages
import pandas as pd
import datetime

# create a dataframe with daily dates
df = pd.DataFrame()
df['Datetime'] = pd.date_range(start='1/1/1930', periods=33000, freq='D')

# extract the Year, Month, etc. from the Datetime via the .dt accessor
df['Year'] = df['Datetime'].dt.year
df['Month'] = df['Datetime'].dt.month
df['Week'] = df['Datetime'].dt.isocalendar().week  # Series.dt.week was removed in newer pandas
This is what I eventually want:
Datetime Year Month Week Semi_Month Dekad
0 1930-01-01 1930 1 1 1 1
1 1930-01-02 1930 1 1 1 1
2 1930-01-03 1930 1 1 1 1
3 1930-01-04 1930 1 1 1 1
4 1930-01-05 1930 1 1 1 1
... ... ... ... ...
32995 2020-05-03 2020 5 18 9 13
32996 2020-05-04 2020 5 19 9 13
32997 2020-05-05 2020 5 19 9 13
32998 2020-05-06 2020 5 19 9 13
32999 2020-05-07 2020 5 19 9 13
For Dekad, take the day-of-year integer minus 1, floor-divide by 10, and add 1 (subtracting 1 first keeps boundary days such as Jan 10 in dekad 1, matching the definition above). For Semi_Month, the idea is to check whether the day of the month is greater than (gt) half of the last day of the month obtained with MonthEnd, then add the month number times 2 minus 1.
df['Semi_Month'] = (df['Datetime'].dt.day
                      .gt((df['Datetime'] + pd.tseries.offsets.MonthEnd()).dt.day // 2)
                    + df['Month'] * 2 - 1)
df['Dekad'] = (df['Datetime'].dt.dayofyear - 1) // 10 + 1
print(df)
Datetime Year Month Week Semi_Month Dekad
0 1930-01-01 1930 1 1 1 1
1 1930-01-02 1930 1 1 1 1
2 1930-01-03 1930 1 1 1 1
3 1930-01-04 1930 1 1 1 1
4 1930-01-05 1930 1 1 1 1
... ... ... ... ... ... ...
32995 2020-05-03 2020 5 18 9 13
32996 2020-05-04 2020 5 19 9 13
32997 2020-05-05 2020 5 19 9 13
32998 2020-05-06 2020 5 19 9 13
32999 2020-05-07 2020 5 19 9 13
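A quick boundary check of the dekad formula (the dates are illustrative):
import pandas as pd

check = pd.DataFrame({'Datetime': pd.to_datetime(['1930-01-10', '1930-01-11', '1930-12-31'])})
check['Dekad'] = (check['Datetime'].dt.dayofyear - 1) // 10 + 1
print(check)  # Jan 10 -> dekad 1, Jan 11 -> dekad 2, Dec 31 (day 365) -> dekad 37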
I have the following dataframe:
df2 = pd.DataFrame({'season': [1, 1, 1, 2, 2, 2, 3, 3],
                    'value': [-2, 3, 1, 5, 8, 6, 7, 5],
                    'test': [3, 2, 6, 8, 7, 4, 25, 2],
                    'test2': [4, 5, 7, 8, 9, 10, 11, 12]},
                   index=['2020', '2020', '2020', '2020', '2020', '2021', '2021', '2021'])
df2.index = pd.to_datetime(df2.index)
df2.index = df2.index.year
print(df2)
season test test2 value
2020 1 3 4 -2
2020 1 2 5 3
2020 1 6 7 1
2020 2 8 8 5
2020 2 7 9 8
2021 2 4 10 6
2021 3 25 11 7
2021 3 2 12 5
I would like to filter it to obtain for each year and each season of that year the maximum value of the column 'value'. How can I do that efficiently?
Expected result:
print(df_result)
season value test test2
year
2020 1 3 2 5
2020 2 8 7 9
2021 2 6 4 10
2021 3 7 25 11
Thank you for your help,
Pierre
This is a groupby operation, but a little non-trivial, so posting as an answer.
(df2.set_index('season', append=True)
.groupby(level=[0, 1])
.value.max()
.reset_index(level=1)
)
season value
2020 1 3
2020 2 8
2021 2 6
2021 3 7
You can elevate your index to a series, then perform a groupby operation on a list of columns:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])['value'].max().reset_index()
print(df_result)
year season value
0 2020 1 3
1 2020 2 8
2 2021 2 6
3 2021 3 7
If you wish, you can make year your index again via df_result = df_result.set_index('year').
To keep the other columns, use:
df2['year'] = df2.index
df2['value'] = df2.groupby(['year', 'season'])['value'].transform('max')
Then drop any duplicates via pd.DataFrame.drop_duplicates.
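For example, a sketch continuing the two lines above (out is an illustrative name):
out = df2.drop_duplicates(subset=['year', 'season'])
print(out)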
Update #1
For your new requirement, you need to apply an aggregation function to two series:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])\
.agg({'value': 'max', 'test': 'last'})\
.reset_index()
print(df_result)
year season value test
0 2020 1 3 6
1 2020 2 8 7
2 2021 2 6 4
3 2021 3 7 2
Update #2
For your finalised requirement:
df2['year'] = df2.index
df2['max_value'] = df2.groupby(['year', 'season'])['value'].transform('max')
df_result = df2.loc[df2['value'] == df2['max_value']]\
               .drop_duplicates(['year', 'season'])\
               .drop(columns='max_value')
print(df_result)
season value test test2 year
2020 1 3 2 5 2020
2020 2 8 7 9 2020
2021 2 6 4 10 2021
2021 3 7 25 11 2021
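An equivalent pattern, sketched here under the assumption that you start from the original df2 (before the year column was added), uses idxmax to select the whole max row per group (tmp is an illustrative name):
tmp = df2.rename_axis('year').reset_index()
idx = tmp.groupby(['year', 'season'])['value'].idxmax()
print(tmp.loc[idx])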
You can use get_level_values to bring the index values into the groupby:
df2.groupby([df2.index.get_level_values(0),df2.season]).value.max().reset_index(level=1)
Out[38]:
season value
2020 1 3
2020 2 8
2021 2 6
2021 3 7
I have a pandas dataframe where the index is the date, from year 2007 to 2017.
I'd like to calculate the mean of each weekday for each year. I am able to group by year:
groups = df.groupby(pd.Grouper(freq='A'))  # pd.TimeGrouper was removed; pd.Grouper replaces it
years = pd.DataFrame()
for name, group in groups:
    years[name.year] = group.values
This is the way I create a new dataframe (years) in which each column holds one year of the time series.
If I want to see the statistics of each year (for example, the mean):
print(years.mean())
But now I would like to separate each day of the week for each year, in order to obtain the mean of each weekday for all of them.
The only thing I know is:
year = df[df.index.year == 2007]
day_week = df[df.index.weekday == 2]
The problem is that I have to change the day of the week 7 times and then repeat this for 11 years (my time series begins in 2007 and ends in 2017), so I must do it 77 times!
Is there a way to group time by years and weekday in order to make this faster?
It seems you need to group by DatetimeIndex.year together with DatetimeIndex.weekday:
rng = pd.date_range('2017-04-03', periods=10, freq='10M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2017-04-30 0
2018-02-28 1
2018-12-31 2
2019-10-31 3
2020-08-31 4
2021-06-30 5
2022-04-30 6
2023-02-28 7
2023-12-31 8
2024-10-31 9
df1 = df.groupby([df.index.year, df.index.weekday]).mean()
print (df1)
a
2017 6 0
2018 0 2
2 1
2019 3 3
2020 0 4
2021 2 5
2022 5 6
2023 1 7
6 8
2024 3 9
To get the years and weekdays as regular columns, chain reset_index and rename:
df1 = df.groupby([df.index.year, df.index.weekday]).mean().reset_index()
df1 = df1.rename(columns={'level_0': 'years', 'level_1': 'weekdays'})
print (df1)
years weekdays a
0 2017 6 0
1 2018 0 2
2 2018 2 1
3 2019 3 3
4 2020 0 4
5 2021 2 5
6 2022 5 6
7 2023 1 7
8 2023 6 8
9 2024 3 9
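If you prefer day names over weekday numbers, a small sketch that swaps the second grouping key (DatetimeIndex.day_name is available in modern pandas):
df1 = df.groupby([df.index.year, df.index.day_name()]).mean()
print(df1)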