Pandas Grouper by weekday? - python

I have a pandas dataframe where the index is the date, from year 2007 to 2017.
I'd like to calculate the mean of each weekday for each year. I am able to group by year:
groups = df.groupby(TimeGrouper('A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
This is how I create a new dataframe (years) with one column per year of the time series.
If I want to see the statistics for each year (for example, the mean):
print(years.mean())
But now I would like to separate each day of the week for each year, in order to obtain the mean of each weekday for all of them.
The only thing I know is:
year=df[(df.index.year==2007)]
day_week=df[(df.index.weekday==2)]
The problem with this is that I would have to change the day of the week 7 times and then repeat that for each of the 11 years (my time series runs from 2007 to 2017), so I would have to do it 77 times!
Is there a way to group by year and weekday to make this faster?

It seems you need to group by DatetimeIndex.year together with DatetimeIndex.weekday:
rng = pd.date_range('2017-04-03', periods=10, freq='10M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print(df)
            a
2017-04-30  0
2018-02-28  1
2018-12-31  2
2019-10-31  3
2020-08-31  4
2021-06-30  5
2022-04-30  6
2023-02-28  7
2023-12-31  8
2024-10-31  9
df1 = df.groupby([df.index.year, df.index.weekday]).mean()
print(df1)
        a
2017 6  0
2018 0  2
     2  1
2019 3  3
2020 0  4
2021 2  5
2022 5  6
2023 1  7
     6  8
2024 3  9
df1 = df.groupby([df.index.year, df.index.weekday]).mean().reset_index()
df1 = df1.rename(columns={'level_0': 'years', 'level_1': 'weekdays'})
print(df1)
   years  weekdays  a
0   2017         6  0
1   2018         0  2
2   2018         2  1
3   2019         3  3
4   2020         0  4
5   2021         2  5
6   2022         5  6
7   2023         1  7
8   2023         6  8
9   2024         3  9
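Applied to the original question, the same pattern gives the mean of each weekday for every year in one call. A small sketch (assuming df has a DatetimeIndex and numeric value columns, and that a year-by-weekday table is what is wanted):
means = df.groupby([df.index.year, df.index.weekday]).mean()
means.index.names = ['year', 'weekday']
# optional: one row per year, one column per weekday (0 = Monday)
table = means.unstack('weekday')
print(table)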

Related

Doing joins between 2 csv files [duplicate]

df2 only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them together while replacing df1's 2019 rows with the values from df2 for the same year.
The expected result will look like this:
type date value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False) is below, which clearly has multiple values in 2019 for the same type. How should I improve the code? Thank you.
type date value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates to get the last rows per type and date after concat.
This solution works if the type and date pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'date'], keep='last'))
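Note that the sample frames name the shared column year while the code above drops duplicates on date, so adjust the key to whatever the frames actually use. A minimal sketch assuming the column is named year as in the inputs:
import pandas as pd
# keep='last' keeps df2's row whenever the same (type, year) pair exists in both frames,
# because df2 comes after df1 in the concat order
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'year'], keep='last')
        .reset_index(drop=True))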

Sort by both index and value in Multi-indexed data of Pandas dataframe

Suppose I have a dataframe as below:
year month message
0 2018 2 txt1
1 2017 4 txt2
2 2019 5 txt3
3 2017 5 txt5
4 2017 5 txt4
5 2020 4 txt3
6 2020 6 txt3
7 2020 6 txt3
8 2020 6 txt4
I want to figure out the top three months by number of messages in each year. So, I grouped the data as below:
df.groupby(['year','month']).count()
which results in:
            message
year month
2017 4            1
     5            2
2018 2            1
2019 5            1
2020 4            1
     6            3
The data is in ascending order for both index levels. But how can I get the result shown below, where the data is sorted by year (ascending) and count (descending) and only the top n values per year are kept? The 'month' index can fall wherever it lands.
            message
year month
2017 5            2
     4            1
2018 2            1
2019 5            1
2020 6            3
     4            1
value_counts sorts by count in descending order by default:
df.groupby('year')['month'].value_counts()
Output:
year  month
2017  5        2
      4        1
2018  2        1
2019  5        1
2020  6        3
      4        1
Name: month, dtype: int64
If you want only the top 2 values for each year, do another groupby:
(df.groupby('year')['month'].value_counts()
   .groupby('year').head(2)
)
Output:
year  month
2017  5        2
      4        1
2018  2        1
2019  5        1
2020  6        3
      4        1
Name: month, dtype: int64
This will sort by year (ascending) and count (descending).
df = df.groupby(['year', 'month']).count().sort_values(['year', 'message'], ascending=[True, False])
You can use sort_index, specifying ascending=[True,False] so that only the second level is sorted in descending order:
df = df.groupby(['year','month']).count().sort_index(ascending=[True,False])
            message
year month
2017 5            2
     4            1
2018 2            1
2019 5            1
2020 6            3
     4            1
Here you go:
df.groupby(['year', 'month']).count().sort_values(axis=0, ascending=False, by='message').sort_values(axis=0, ascending=True, by='year')
You can use this code for it:
df.groupby(['year', 'month']).count().sort_index(axis=0, ascending=False).sort_values(by="year", ascending=True)
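If the goal is the sorted view plus only the top n rows per year, the sorting and head ideas above can be combined. A sketch, assuming n = 2 and that the counted column is named 'message' as in the question:
n = 2
counts = (df.groupby(['year', 'month']).count()
            .sort_values(['year', 'message'], ascending=[True, False]))
top_n = counts.groupby(level='year').head(n)
print(top_n)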

Pandas dataframe multiple groupby filtering

I have the following dataframe:
df2 = pd.DataFrame({'season': [1, 1, 1, 2, 2, 2, 3, 3],
                    'value': [-2, 3, 1, 5, 8, 6, 7, 5],
                    'test': [3, 2, 6, 8, 7, 4, 25, 2],
                    'test2': [4, 5, 7, 8, 9, 10, 11, 12]},
                   index=['2020', '2020', '2020', '2020', '2020', '2021', '2021', '2021'])
df2.index= pd.to_datetime(df2.index)
df2.index = df2.index.year
print(df2)
      season  test  test2  value
2020       1     3      4     -2
2020       1     2      5      3
2020       1     6      7      1
2020       2     8      8      5
2020       2     7      9      8
2021       2     4     10      6
2021       3    25     11      7
2021       3     2     12      5
I would like to filter it to obtain, for each year and each season of that year, the row with the maximum value in the 'value' column. How can I do that efficiently?
Expected result:
print(df_result)
      season  value  test  test2
year
2020       1      3     2      5
2020       2      8     7      9
2021       2      6     4     10
2021       3      7    25     11
Thank you for your help,
Pierre
This is a groupby operation, but a little non-trivial, so posting as an answer.
(df2.set_index('season', append=True)
    .groupby(level=[0, 1])
    .value.max()
    .reset_index(level=1)
)
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
You can elevate your index to a series, then perform a groupby operation on a list of columns:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])['value'].max().reset_index()
print(df_result)
   year  season  value
0  2020       1      3
1  2020       2      8
2  2021       2      6
3  2021       3      7
If you wish, you can make year your index again via df_result = df_result.set_index('year').
To keep other columns use:
df2['year'] = df2.index
df2['value'] = df2.groupby(['year', 'season'])['value'].transform('max')
Then drop any duplicates via pd.DataFrame.drop_duplicates.
Update #1
For your new requirement, you need to apply aggregation functions to 2 series:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])\
               .agg({'value': 'max', 'test': 'last'})\
               .reset_index()
print(df_result)
   year  season  value  test
0  2020       1      3     6
1  2020       2      8     7
2  2021       2      6     4
3  2021       3      7     2
Update #2
For your finalised requirement:
df2['year'] = df2.index
df2['max_value'] = df2.groupby(['year', 'season'])['value'].transform('max')
df_result = df2.loc[df2['value'] == df2['max_value']]\
               .drop_duplicates(['year', 'season'])\
               .drop('max_value', axis=1)
print(df_result)
      season  value  test  test2  year
2020       1      3     2      5  2020
2020       2      8     7      9  2020
2021       2      6     4     10  2021
2021       3      7    25     11  2021
You can use get_level_values to bring the index values into the groupby:
df2.groupby([df2.index.get_level_values(0),df2.season]).value.max().reset_index(level=1)
Out[38]:
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
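An alternative that keeps every column of the winning row (as in the expected result) is to look up the row label of each group's maximum with idxmax. A sketch, assuming the original df2 with the year in the index and first-match tie-breaking:
# bring the year index out as a column, then select the row of each group's max 'value'
tmp = df2.reset_index().rename(columns={'index': 'year'})
idx = tmp.groupby(['year', 'season'])['value'].idxmax()
df_result = tmp.loc[idx].set_index('year')
print(df_result)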

A more complex rolling sum over the next n rows

I have the following dataframe:
print(df)
day month year quantity
6 04 2018 10
8 04 2018 8
12 04 2018 8
I would like to create a column with the sum of "quantity" over the next "n" days, as follows:
n = 2
print(df1)
day month year quantity final_quantity
6 04 2018 10 10 + 0 + 8 = 18
8 04 2018 8 8 + 0 + 0 = 8
12 04 2018 8 8 + 0 + 0 = 8
Specifically, summing 0 if the product has not been sold in the next "n" days.
I tried rolling sums from pandas, but they do not seem to take the date columns into account:
n = 2
df.quantity[::-1].rolling(n + 1, min_periods=1).sum()[::-1]
You can use a list comprehension:
import pandas as pd
df['DateTime'] = pd.to_datetime(df[['year', 'month', 'day']])
df['final_quantity'] = [df.loc[df['DateTime'].between(d, d + pd.Timedelta(days=2)), 'quantity'].sum()
                        for d in df['DateTime']]
print(df)
# day month year quantity DateTime final_quantity
# 0 6 4 2018 10 2018-04-06 18
# 1 8 4 2018 8 2018-04-08 8
# 2 12 4 2018 8 2018-04-12 8
You can use set_index and rolling with sum:
df_out = df.set_index(pd.to_datetime(df['month'].astype(str) +
                                     df['day'].astype(str) +
                                     df['year'].astype(str), format='%m%d%Y'))['quantity']
d1 = df_out.resample('D').asfreq(fill_value=0)
d2 = d1[::-1].reset_index()
df['final_quantity'] = d2['quantity'].rolling(3, min_periods=1).sum()[::-1].to_frame()\
                                     .set_index(d1.index)\
                                     .reindex(df_out.index).values
Output:
   day  month  year  quantity  final_quantity
0    6      4  2018        10            18.0
1    8      4  2018         8             8.0
2   12      4  2018         8             8.0
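A third option is to combine the daily resample from the second answer with the reversed rolling sum the question already tried. A sketch, assuming n = 2 and the same year/month/day columns:
import pandas as pd
n = 2
df['DateTime'] = pd.to_datetime(df[['year', 'month', 'day']])
# daily series where days with no sales become 0
daily = df.set_index('DateTime')['quantity'].resample('D').sum()
# reverse, sum a window of n + 1 rows (today plus the next n days), reverse back
forward = daily[::-1].rolling(n + 1, min_periods=1).sum()[::-1]
df['final_quantity'] = df['DateTime'].map(forward)
print(df)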
