How to get a correct mean after using groupby? - python

I have a dataframe from a step counter. It has a column M_DATE (dd-mm-yy hh-mm-ss) that I set to datetime, and a column M_STEPS that contains the number of steps taken.
I split the date column into several columns, among them one named "day_of_week" that holds the name of the day.
I wanted to use a groupby on day_of_week to get the mean per Monday, Tuesday, Wednesday, etc. But I get an answer that doesn't look right.
What I have tried: to get the name of the days I did:
df['day_of_week'] = df['M_DATE'].dt.day_name()
then I did :
df.groupby('day_of_week')['M_STEPS'].mean()
I hoped that it would group, for example, all the Mondays and then give me the mean of the number of steps taken on Mondays. But the outcome is a very big number that I cannot make sense of.
The strange thing is when I use:
df.groupby('day_of_week')['M_STEPS'].sum()
it does give me a correct number.
What am I doing wrong?
Edit
Here is a copy-paste of df.head():
M_ID M_DATE M_CALORIES M_STEPS M_DISTANCE M_METS M_WEEK M_WEEKDAY M_HOUR M_MINUTE year month day day_of_week
0 27 2016-01-24 00:00:00 1 0 0.0 10 3 1 0 0 2016 1 24 Sunday
1 28 2016-01-24 00:01:00 1 0 0.0 10 3 1 0 1 2016 1 24 Sunday
2 29 2016-01-24 00:02:00 1 0 0.0 10 3 1 0 2 2016 1 24 Sunday
3 30 2016-01-24 00:03:00 1 0 0.0 10 3 1 0 3 2016 1 24 Sunday
4 31 2016-01-24 00:04:00 1 0 0.0 10 3 1 0 4 2016 1 24 Sunday

Let's say you have:
day_of_week M_steps
Monday 1
Monday 2
Tuesday 1
Tuesday 3
then df.groupby('day_of_week')['M_STEPS'].mean() gives:
Monday 1.5
Tuesday 2.0
and df.groupby('day_of_week')['M_STEPS'].sum() gives:
Monday 3
Tuesday 4
That is simply what groupby does; probably your dataframe is ordered differently from what you expect. Could you add your original dataframe to your example?
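For reference, here is a minimal runnable version of the toy example above (column names mirror the question):
import pandas as pd

toy = pd.DataFrame({'day_of_week': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
                    'M_STEPS': [1, 2, 1, 3]})
print(toy.groupby('day_of_week')['M_STEPS'].mean())  # Monday 1.5, Tuesday 2.0
print(toy.groupby('day_of_week')['M_STEPS'].sum())   # Monday 3, Tuesday 4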

Related

How to count the total sales by year, month

I have a big csv (17985 rows) with sales on different days. The csv looks like this:
Customer Date Sale
Larry 1/2/2018 20$
Mike 4/3/2020 40$
John 12/5/2017 10$
Sara 3/2/2020 90$
Charles 9/8/2022 75$
Below is how many times each exact day appears in my csv (i.e. how many sales were made that day):
occur = df.groupby(['Date']).size()
occur
2018-01-02 32
2018-01-03 31
2018-01-04 42
2018-01-05 192
2018-01-06 26
I used crosstab, groupby and several other methods, but the problem is that the numbers don't add up, or the result is NaN.
new_df['total_sales_that_month'] = df.groupby('Date')['Sale'].sum()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
17980 NaN
17981 NaN
17982 NaN
17983 NaN
17984 NaN
I want to group them by year and month in a dataframe, based on total sales. Using dt.year and dt.month I managed to get this:
month  year
1      2020
1      2020
7      2019
8      2019
2      2018
...     ...
4      2020
4      2020
4      2020
4      2020
4      2020
What I want to have is: month/year/total_sales_that_month. What method should I apply? This is the expected output:
Month Year Total_sale_that_month
1 2018 420$
2 2018 521$
3 2018 124$
4 2018 412$
5 2018 745$
You can use groupby + sum, but first you have to strip the '$' from the Sale column and convert it to numeric. (As an aside, the all-NaN column in your attempt most likely comes from assigning a Date-indexed groupby result to a dataframe with a plain RangeIndex, so nothing aligns.)
# Clean your dataframe first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Sale'] = df['Sale'].str.strip('$').astype(float)

out = (df.groupby([df['Date'].dt.month.rename('Month'),
                   df['Date'].dt.year.rename('Year')])
         ['Sale'].sum()
         .rename('Total_sale_that_month')
         # .astype(str).add('$')  # uncomment if the '$' matters
         .reset_index())
Output:
>>> out
Month Year Total_sale_that_month
0 2 2018 20.0
1 2 2020 90.0
2 3 2020 40.0
3 5 2017 10.0
4 8 2022 75.0
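If a single year-month column is acceptable instead of separate Month and Year columns, a shorter variant along the same lines (a sketch, not from the original answer) is to group by a monthly Period:
out2 = (df.groupby(df['Date'].dt.to_period('M'))['Sale'].sum()
          .rename('Total_sale_that_month')
          .reset_index())
# 'Date' is now a Period column like 2018-02; use out2['Date'].dt.month / .dt.year if needed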
I'll share my code: pivot_table, then reset_index and sorting. First convert your date column:
df["Dt_Customer_Y"] = pd.DatetimeIndex(df['Dt_Customer']).year
df["Dt_Customer_M"] = pd.DatetimeIndex(df['Dt_Customer']).month
pvtt = df.pivot_table(index=['Dt_Customer_Y', 'Dt_Customer_M'], aggfunc={'Income': 'sum'})
pvtt.reset_index().sort_values(['Dt_Customer_Y', 'Dt_Customer_M'])
Dt_Customer_Y Dt_Customer_M Income
0 2012 1 856039.0
1 2012 2 487497.0
2 2012 3 921940.0
3 2012 4 881203.0

pandas get a sum column for next 7 days

I want to get the sum of a column's values over the next 7 days as a new column.
My dataframe:
date value
0 2021-04-29 1
1 2021-05-03 2
2 2021-05-06 1
3 2021-05-15 1
4 2021-05-17 2
5 2021-05-18 1
6 2021-05-21 2
7 2021-05-22 5
8 2021-05-24 4
I tried to make a new column that contains the date 7 days ahead of the current date:
df['temp'] = df['date'] + timedelta(days=7)
then calculate the value between the date range:
df['next_7days'] = df[(df.date > df.date) & (df.date <= df.temp)].value.sum()
But this gives me 0 for every row.
intended result:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
The method I am using currently is quite tedious; are there any better methods to get the intended result?
With a list comprehension:
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]
where we first compute tomorrow's and next week's dates and store them. Then we zip them together and use pd.Series.between to get a boolean mask marking whether each date falls in the desired range, select the corresponding values with boolean indexing, and sum them. This is done for each date pair.
to get
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
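If the dataframe is large, the row-by-row list comprehension can get slow. Assuming df['date'] is a datetime64 column sorted ascending (as in the example), here is a vectorized sketch using numpy.searchsorted and a prefix sum to compute the same (date, date + 7 days] window sums:
import numpy as np

dates = df['date'].to_numpy()                # assumed sorted ascending
vals = df['value'].to_numpy()
cum = np.concatenate(([0], vals.cumsum()))   # prefix sums of the values
start = np.searchsorted(dates, dates, side='right')                         # first row strictly after each date
end = np.searchsorted(dates, dates + np.timedelta64(7, 'D'), side='right')  # last row within 7 days
df['next_7days'] = cum[end] - cum[start]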

How to do custom time (semi-monthly and 10-day) splits of calendar year data in python?

I have a range of days starting on January 1 1930 and ending on May 7 2020 in df. I want columns that divide the year in different ways: so far I have columns denoting the Year, Month and Week. I also want columns denoting Dekad and Semi-Month increments.
A dekad is a 10-day period where January 1-10 is dekad "1", Jan 11-20 is dekad "2", etc.; the final dekad "37" has a length of less than 10 because 365 does not divide evenly by 10.
For semi-month, I want to divide each month in half and increment over the year. This is a little trickier because months have different lengths, but basically Jan 1-15 would be "1", Jan 16-31 would be "2", Feb 1-14 would be "3", Feb 15-28 would be "4", etc. (in a non-leap year).
In other words, I want custom date time splits or custom periods of the calendar year. This should be relatively easy to do for the dekads, so that is my priority more so than the semi-monthly split.
Is there something baked into the datetime package that can already do this or do I need to write custom function(s)?
If the latter, a starting point for Dekad might be to take the first day of the year, repeatedly add datetime.timedelta(days=10), and increment from 1 to 37 for each dekad? Suggestions welcome.
# import packages
import pandas as pd
import datetime
from dateutil.relativedelta import *
# create dataframe with dates
df = pd.DataFrame()
df['Datetime'] = pd.date_range(start='1/1/1930', periods=33000, freq='D')
# extract the Year, Month, etc. from the Datetime
df['Year'] = [dt.year for dt in df['Datetime']]
df['Month'] = [dt.month for dt in df['Datetime']]
df['Week'] = [dt.week for dt in df['Datetime']]
This is what I eventually want:
Datetime Year Month Week Semi_Month Dekad
0 1930-01-01 1930 1 1 1 1
1 1930-01-02 1930 1 1 1 1
2 1930-01-03 1930 1 1 1 1
3 1930-01-04 1930 1 1 1 1
4 1930-01-05 1930 1 1 1 1
... ... ... ... ...
32995 2020-05-03 2020 5 18 9 13
32996 2020-05-04 2020 5 19 9 13
32997 2020-05-05 2020 5 19 9 13
32998 2020-05-06 2020 5 19 9 13
32999 2020-05-07 2020 5 19 9 13
For Dekad, take the day of year minus 1, integer-divide by 10, and add 1; subtracting 1 first keeps boundary days such as January 10 in dekad 1 rather than spilling into dekad 2. For Semi_Month, the idea is to check whether the day of the month is greater than (gt) the last day of the month, obtained with MonthEnd, divided by 2, then add the month number times 2 minus 1.
df['Semi_Month'] = (df['Datetime'].dt.day
                      .gt((df['Datetime'] + pd.tseries.offsets.MonthEnd()).dt.day // 2)
                    + df['Month'] * 2 - 1)
df['Dekad'] = (df['Datetime'].dt.dayofyear - 1) // 10 + 1
print(df)
Datetime Year Month Week Semi_Month Dekad
0 1930-01-01 1930 1 1 1 1
1 1930-01-02 1930 1 1 1 1
2 1930-01-03 1930 1 1 1 1
3 1930-01-04 1930 1 1 1 1
4 1930-01-05 1930 1 1 1 1
... ... ... ... ... ... ...
32995 2020-05-03 2020 5 18 9 13
32996 2020-05-04 2020 5 19 9 13
32997 2020-05-05 2020 5 19 9 13
32998 2020-05-06 2020 5 19 9 13
32999 2020-05-07 2020 5 19 9 13
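As a quick sanity check on the dekad boundaries (a hypothetical spot check, not part of the original answer):
check = pd.DataFrame({'Datetime': pd.to_datetime(['1930-01-10', '1930-01-11', '1930-12-31'])})
check['Dekad'] = (check['Datetime'].dt.dayofyear - 1) // 10 + 1
print(check['Dekad'].tolist())  # [1, 2, 37]: Jan 10 still belongs to dekad 1, Dec 31 to dekad 37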

Group python pandas dataframe per weeks (starting on Monday)

I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W'), but that groups the week starting on Sundays (see df_final below).
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W; check anchored offsets:
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem, described below. First, I should state that the previously accepted answer is incorrect. Here is why:
# let's create an example df of length 9; 2020-03-08 is a Sunday
s = pd.DataFrame({'dt': pd.date_range('2020-03-08', periods=9, freq='D'),
                  'counts': 0})
> s
          dt  counts
0 2020-03-08       0
1 2020-03-09       0
2 2020-03-10       0
3 2020-03-11       0
4 2020-03-12       0
5 2020-03-13       0
6 2020-03-14       0
7 2020-03-15       0
8 2020-03-16       0
These nine days span three Monday-to-Sunday weeks: the weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt', freq='W-Mon')).count()
            counts
dt
2020-03-09       2
2020-03-16       7
This is wrong because the OP wants "Monday as the first day of the week" (not the last day of the week) in the resulting dataframe. Let's see what we get when we try freq='W':
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
            counts
dt
2020-03-08       1
2020-03-15       7
2020-03-22       1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled 'dt' with the END of the week rather than the start. So, to get what we want, we can move the index back by 6 days:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt', freq='W-Mon', label='left', closed='left')).count()
A third solution, arguably the most readable one, is to convert dt to a period first, then group, and finally (if needed) convert back to a timestamp:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# note: the first `dt` above is the column name, the second is the datetime accessor
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
All of these solutions return what the OP asked for:
            counts
dt
2020-03-02       1
2020-03-09       7
2020-03-16       1
Explanation: when freq='W' is provided to pd.Grouper, both the closed and label kwargs default to 'right'. Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included; closed='right' handles this). Unfortunately, the pd.Grouper docstring does not show the default values, but you can inspect them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right

count of last n days per group

I have a DataFrame like this:
import datetime
import pandas as pd

df = pd.DataFrame({'Team': ['CHI','IND','CHI','CHI','IND','CHI','CHI','IND'],
                   'Date': [datetime.date(2015,10,27), datetime.date(2015,10,28),
                            datetime.date(2015,10,29), datetime.date(2015,10,30),
                            datetime.date(2015,11,1), datetime.date(2015,11,2),
                            datetime.date(2015,11,4), datetime.date(2015,11,4)]})
I can find the number of rest days between games using this.
df['TeamRest'] = df.groupby('Team')['Date'].diff() - datetime.timedelta(1)
I would also like to add a column to the DataFrame that keeps track of how many games each team has played in the last 5 days.
First convert Date to datetime so it can be used as a DatetimeIndex, which matters for the rolling count with daily frequency:
df.Date = pd.to_datetime(df.Date)
1) calculate the difference in days between games per team:
df['days_between'] = df.groupby('Team')['Date'].diff() - datetime.timedelta(days=1)
2) calculate the rolling count of games over the last 5 days per team, using a 5-day time-based rolling window on the DatetimeIndex (dates are already sorted within each team):
df['game_count'] = 1
rolling_games_count = (df.set_index('Date')
                         .groupby('Team')['game_count']
                         .apply(lambda x: x.rolling('5D').count())
                         .reset_index())
df = df.drop('game_count', axis=1).merge(rolling_games_count, on=['Team', 'Date'], how='left')
to get:
Date Team days_between game_count
0 2015-10-27 CHI NaT 1
1 2015-10-28 IND NaT 1
2 2015-10-29 CHI 1 days 2
3 2015-10-30 CHI 0 days 3
4 2015-11-01 IND 3 days 2
5 2015-11-02 CHI 2 days 3
6 2015-11-04 CHI 1 days 2
7 2015-11-04 IND 2 days 2
If you were to run the rolling count per team without a datetime index (note that the last IND Date is changed to about one month later):
df = pd.DataFrame({'Team': ['CHI','IND','CHI','CHI','IND','CHI','CHI','IND'],
                   'Date': [datetime.date(2015,10,27), datetime.date(2015,10,28),
                            datetime.date(2015,10,29), datetime.date(2015,10,30),
                            datetime.date(2015,11,1), datetime.date(2015,11,2),
                            datetime.date(2015,11,4), datetime.date(2015,12,10)]})
df['game'] = 1  # initialize a game to count
df['nb_games'] = df.groupby('Team')['game'].transform(lambda s: s.rolling(5, min_periods=1).count())
you get the surprising result
Date Team game nb_games
0 2015-10-27 CHI 1 1
2 2015-10-29 CHI 1 2
3 2015-10-30 CHI 1 3
5 2015-11-02 CHI 1 4
6 2015-11-04 CHI 1 5
1 2015-10-28 IND 1 1
4 2015-11-01 IND 1 2
7 2015-12-10 IND 1 3
of nb_games=3 for a date in December, even though there were no games during the previous five days. Unless you convert to datetime and count over a daily window, you only count the last five rows per team, so you'll always get five for a team with more than five games played.
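For completeness, a sketch of the datetime-aware fix (the same idea as the answer above): with a 5-day time window the December game correctly counts only itself.
df['Date'] = pd.to_datetime(df['Date'])
df['nb_games'] = (df.set_index('Date')
                    .groupby('Team')['game']
                    .transform(lambda s: s.rolling('5D').count())
                    .to_numpy())
# IND's 2015-12-10 row now gets nb_games == 1.0 instead of 3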
