I would like to group a pandas DataFrame by week, counted back from the last entry date, and sum each column per week.
(1 week: Monday -> Sunday; if the last entry is a Tuesday, that week is composed of Monday and Tuesday data only, not "today minus 7 days")
df:
a b c d e
2019-01-01 1 2 5 0 1
...
2020-01-25 2 3 6 1 0
2020-01-26 1 2 3 4 5
expected output:
week a b c d e
104 9 8 8 8 7
...
1 7 8 8 8 9
code:
df = df.rename_axis('date').reset_index()
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df.groupby(df['date'].dt.strftime('%W')).sum()
Problem: this is not the week number I want, and weeks with the same number from different years are grouped into the same row.
Try extracting the ISO calendar (year-week-day) from the DatetimeIndex, then grouping:
s = df.index.isocalendar()
df.groupby([s.year, s.week]).sum()
You would get something like this:
a b c d e
year week
2019 1 18 33 31 26 25
2 36 31 25 28 31
3 33 22 44 22 29
4 36 36 35 33 31
5 27 30 26 31 36
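A self-contained sketch of this approach with synthetic data (the date range and column values here are made up for illustration; `DatetimeIndex.isocalendar()` requires pandas >= 1.1):

```python
import pandas as pd

# Synthetic stand-in for the question's frame: daily rows indexed by date
idx = pd.date_range("2019-01-01", "2019-02-03", freq="D")
df = pd.DataFrame({"a": 1, "b": 2}, index=idx)

iso = df.index.isocalendar()           # year / week / day per date
weekly = df.groupby([iso.year, iso.week]).sum()
```

Because grouping uses both the ISO year and the ISO week, week 1 of 2019 and week 1 of 2020 land on separate rows, which fixes the original complaint.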
Related
We can apply a 30-day rolling sum as:
df.rolling("30D").sum()
However, how can I achieve a month-to-date (or even year-to-date) rolling sum in a similar fashion?
Month-to-date meaning that we only sum from the beginning of the month up to the current date (or row)?
Consider the following dataset:
Year Month week Revenue
0 2020 1 1 10
1 2020 1 2 20
2 2020 1 3 10
3 2020 1 4 20
4 2020 2 1 10
5 2020 2 2 20
6 2020 2 3 10
7 2020 2 4 20
8 2020 3 1 10
9 2020 3 2 20
10 2020 3 3 10
11 2020 3 4 20
12 2021 1 1 10
13 2021 1 2 20
14 2021 1 3 10
15 2021 1 4 20
16 2021 2 1 10
17 2021 2 2 20
18 2021 2 3 10
19 2021 2 4 20
20 2021 3 1 10
21 2021 3 2 20
22 2021 3 3 10
23 2021 3 4 20
You could use a combination of groupby + cumsum to get what you want:
df['Year_To_date'] = df.groupby('Year')['Revenue'].cumsum()
df['Month_To_date'] = df.groupby(['Year', 'Month'])['Revenue'].cumsum()
Results:
Year Month week Revenue Year_To_date Month_To_date
0 2020 1 1 10 10 10
1 2020 1 2 20 30 30
2 2020 1 3 10 40 40
3 2020 1 4 20 60 60
4 2020 2 1 10 70 10
5 2020 2 2 20 90 30
6 2020 2 3 10 100 40
7 2020 2 4 20 120 60
8 2020 3 1 10 130 10
9 2020 3 2 20 150 30
10 2020 3 3 10 160 40
11 2020 3 4 20 180 60
12 2021 1 1 10 10 10
13 2021 1 2 20 30 30
14 2021 1 3 10 40 40
15 2021 1 4 20 60 60
16 2021 2 1 10 70 10
17 2021 2 2 20 90 30
18 2021 2 3 10 100 40
19 2021 2 4 20 120 60
20 2021 3 1 10 130 10
21 2021 3 2 20 150 30
22 2021 3 3 10 160 40
23 2021 3 4 20 180 60
Note that Month-to-date makes sense only if you have a week/date column in your data model.
EXTRAS:
The goal of cumsum is to compute the cumulative sum over dates by different periods. However, if the index of the original DataFrame is not ordered in the desired sequence, cumsum is computed by the original index within each group, because pandas operates row by row in index order.
Thus, the DataFrame first needs to be sorted in the desired order ([Year, Month, Week] or [Date]), followed by resetting the index to match that order. Then the output is summed within each period group, in chronological order.
df = df.sort_values(['Year', 'Month', 'Week']).reset_index(drop=True)
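To see why the sort matters, a small sketch (my own example, not from the answer) with rows deliberately out of chronological order:

```python
import pandas as pd

df = pd.DataFrame({
    "Year":    [2020, 2020, 2020],
    "Month":   [1, 1, 1],
    "Week":    [3, 1, 2],        # deliberately out of order
    "Revenue": [10, 10, 20],
})

# Sort chronologically first; cumsum then accumulates in week order
df = df.sort_values(["Year", "Month", "Week"]).reset_index(drop=True)
df["Month_To_date"] = df.groupby(["Year", "Month"])["Revenue"].cumsum()
```

Without the sort, the running total would follow the original row order (weeks 3, 1, 2) instead of the calendar.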
I have a one_sec_flt DataFrame with 300,000+ points and a flasks DataFrame with 230 points. Both DataFrames have Hour, Minute, and Second columns. I want to append the flasks data to one_sec_flt at the time each flask sample was taken.
Flasks DataFrame
year month day hour minute second... gas1 gas2 gas3
0 2018 4 8 16 27 48... 10 25 191
1 2018 4 8 16 40 20... 45 34 257
...
229 2018 5 12 14 10 05... 3 72 108
one_sec_flt DataFrame
Year Month Day Hour Min Second... temp wind
0 2018 4 8 14 30 20... 300 10
1 2018 4 8 14 45 15... 310 8
...
305,212 2018 5 12 14 10 05... 308 24
I have this code I started with, but I don't know how to append one DataFrame to the other at that exact timestamp:
for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if (flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]):
            if (flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]):
                if (flasks.second.iloc[i] == one_sec_flt.Second.iloc[j]):
                    print('match')
My output goal would look like:
Year Month Day Hour Min Second... temp wind gas1 gas2 gas3
0 2018 4 8 14 30 20... 300 10 nan nan nan
1 2018 4 8 14 45 15... 310 8 nan nan nan
2 2018 4 8 15 15 47... ... ... nan nan nan
3 2018 4 8 16 27 48... ... ... 10 25 191
4 2018 4 8 16 30 11... ... ... nan nan nan
5 2018 4 8 16 40 20... ... ... 45 34 257
... ... ... ... ... ... ... ... ... ... ... ...
305,212 2018 5 12 14 10 05... 308 24 3 72 108
If you can concatenate both DataFrames (Flasks and one_sec_flt) and then sort by the time columns, it should achieve what you are looking for (at least, if I understood the problem statement correctly).
Flasks
Out[13]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
one_sec
Out[14]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res = pd.concat([Flasks,one_sec])
df_res
Out[16]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
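An alternative sketch that produces the NaN-padded layout from the expected output: assemble a single datetime key on each frame and left-merge on it. The tiny frames and the `ts` column name below are my own stand-ins; the time/gas column names are taken from the question.

```python
import pandas as pd

# Tiny stand-ins for the question's frames (note the different capitalization)
one_sec_flt = pd.DataFrame({
    "Year": [2018, 2018], "Month": [4, 4], "Day": [8, 8],
    "Hour": [14, 16], "Min": [30, 27], "Second": [20, 48],
    "temp": [300, 305], "wind": [10, 12],
})
flasks = pd.DataFrame({
    "year": [2018], "month": [4], "day": [8],
    "hour": [16], "minute": [27], "second": [48],
    "gas1": [10], "gas2": [25], "gas3": [191],
})

# pd.to_datetime accepts a frame whose columns are named
# year/month/day/hour/minute/second, so normalize the names first
left_key = pd.to_datetime(
    one_sec_flt[["Year", "Month", "Day", "Hour", "Min", "Second"]]
    .set_axis(["year", "month", "day", "hour", "minute", "second"], axis=1)
)
right_key = pd.to_datetime(flasks[["year", "month", "day", "hour", "minute", "second"]])

# Left-merge: every one_sec_flt row survives; gas columns are NaN
# wherever no flask sample was taken at that second
merged = one_sec_flt.assign(ts=left_key).merge(
    flasks[["gas1", "gas2", "gas3"]].assign(ts=right_key),
    on="ts", how="left",
)
```

Unlike the plain concat, this keeps temp/wind and gas values on the same row when the timestamps coincide.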
I have a data set from 2015-2018 which has month and day as the second and third columns, like below:
Year Month Day rain temp humidity snow
2015 1 1 0 20 60 0
2015 1 2 2 18 58 0
2015 1 3 0 20 62 2
2015 1 4 5 15 62 0
2015 1 5 2 18 61 1
2015 1 6 0 19 60 2
2015 1 7 3 20 59 0
2015 1 8 2 17 65 0
2015 1 9 1 17 61 0
I wanted to use pivot_table to calculate something like the mean temperature for year 2016 and months 1, 2, 3.
I was wondering if anyone could help me with this?
You can do this with pd.cut, then groupby. (Note the upper edge of 13: with right=False the bins are half-open, so an edge of 12 would drop December entirely; with this binning month 12 falls in the last bin.)
df.temp.groupby([df.Year, pd.cut(df.Month, [0, 3, 6, 9, 13], labels=['Winter', 'Spring', 'Summer', 'Autumn'], right=False)]).mean()
Out[93]:
Year Month
2015 Winter 18.222222
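Since the question mentions pivot_table specifically, here is a sketch that answers "mean temperature for 2016, months 1-3" directly (the temperatures below are made up for illustration):

```python
import pandas as pd

# Made-up temperatures; Year/Month column names follow the question
df = pd.DataFrame({
    "Year":  [2015] * 4 + [2016] * 4,
    "Month": [1, 2, 3, 4] * 2,
    "temp":  [20, 18, 21, 25, 19, 17, 22, 26],
})

# One row per year, one column per month, cell = mean temperature
pt = df.pivot_table(index="Year", columns="Month", values="temp", aggfunc="mean")
jan_mar_2016 = pt.loc[2016, [1, 2, 3]].mean()   # mean temp, 2016, months 1-3
```

The pivot keeps every year/month pair addressable, so any other year-and-months slice is just another `.loc` lookup.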
I have created a days-difference column in a pandas DataFrame, and I'm looking to add a column holding the sum of a value column over a given window of days, looking backwards.
Note that I can supply a date column for each row if needed; the diff column was created as the difference in days from the first day of the data.
Example
df = pd.DataFrame.from_dict({'diff': [0,0,1,2,2,2,2,10,11,15,18],
'value': [10,11,15,2,5,7,8,9,23,14,15]})
df
Out[12]:
diff value
0 0 10
1 0 11
2 1 15
3 2 2
4 2 5
5 2 7
6 2 8
7 10 9
8 11 23
9 15 14
10 18 15
I want to add a 5_days_back_sum column that sums the past 5 days, including the same day, so the result would look like this:
Out[15]:
5_days_back_sum diff value
0 21 0 10
1 21 0 11
2 36 1 15
3 58 2 2
4 58 2 5
5 58 2 7
6 58 2 8
7 9 10 9
8 32 11 23
9 46 15 14
10 29 18 15
How can I achieve that? I originally had a date column used to create the diff column, and it is available if that helps.
Use a custom function with boolean indexing to filter the range, then sum:
def f(x):
    return df.loc[(df['diff'] >= x - 5) & (df['diff'] <= x), 'value'].sum()

df['5_days_back_sum'] = df['diff'].apply(f)
print(df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
A similar solution with between:
def f(x):
    return df.loc[df['diff'].between(x - 5, x), 'value'].sum()

df['5_days_back_sum'] = df['diff'].apply(f)
print(df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
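The apply-based solutions scan the whole frame once per row. A vectorized alternative (my own sketch, not from the answers): total the values per day offset, run a 6-day offset rolling window (the day itself plus the 5 before it) over a synthetic date index, and map the result back. The base date is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"diff":  [0, 0, 1, 2, 2, 2, 2, 10, 11, 15, 18],
                   "value": [10, 11, 15, 2, 5, 7, 8, 9, 23, 14, 15]})

per_day = df.groupby("diff")["value"].sum()            # total value per day offset
dates = pd.Timestamp("2000-01-01") + pd.to_timedelta(per_day.index, unit="D")
rolled = per_day.set_axis(dates).rolling("6D").sum()   # day itself + 5 days back
df["5_days_back_sum"] = df["diff"].map(rolled.set_axis(per_day.index))
```

Grouping by diff first also reproduces the answers' behaviour for duplicated days: every row sharing a diff value gets the full same-day total.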
I am using python/pandas, and want to know how to get the week number in the year for a given day, with Saturday as the first day of the week.
I did search a lot, but everything takes either Monday or Sunday as the first day of the week...
Please help... thanks.
Thanks all, I really appreciated your quick answers, but I have to apologize for not making my question clear.
I want the week number in the year. For example, 2015-08-09 is week 32 with Monday as the first day of the week, but week 33 with Saturday as the first day of the week.
Thanks @Cyphase and everyone; I changed Cyphase's code a bit and it works:
from datetime import date

def week_number(start_week_on, date_=None):
    assert 1 <= start_week_on <= 7  # Monday=1, Sunday=7
    if not date_:
        date_ = date.today()
    __, normal_current_week, normal_current_day = date_.isocalendar()
    print(date_, normal_current_week, normal_current_day)
    if normal_current_day >= start_week_on:
        week = normal_current_week + 1
    else:
        week = normal_current_week
    return week
If I understand correctly the following does what you want:
In [101]:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':pd.date_range(start=dt.datetime(2015,8,9), end=dt.datetime(2015,9,1))})
df['week'] = df['date'].dt.week.shift(-2).ffill()
df['orig week'] = df['date'].dt.week
df['day of week'] = df['date'].dt.dayofweek
df
Out[101]:
date week orig week day of week
0 2015-08-09 33 32 6
1 2015-08-10 33 33 0
2 2015-08-11 33 33 1
3 2015-08-12 33 33 2
4 2015-08-13 33 33 3
5 2015-08-14 33 33 4
6 2015-08-15 34 33 5
7 2015-08-16 34 33 6
8 2015-08-17 34 34 0
9 2015-08-18 34 34 1
10 2015-08-19 34 34 2
11 2015-08-20 34 34 3
12 2015-08-21 34 34 4
13 2015-08-22 35 34 5
14 2015-08-23 35 34 6
15 2015-08-24 35 35 0
16 2015-08-25 35 35 1
17 2015-08-26 35 35 2
18 2015-08-27 35 35 3
19 2015-08-28 35 35 4
20 2015-08-29 36 35 5
21 2015-08-30 36 35 6
22 2015-08-31 36 36 0
23 2015-09-01 36 36 1
The above uses dt.week, shifts by 2 rows, and then forward-fills the NaN values. (Note that dt.week is deprecated in recent pandas in favour of dt.isocalendar().week, and the 2-row shift relies on the rows being contiguous daily data.)
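An equivalent trick that works per-date rather than per-row (my own sketch, not from the answers): shift each date forward by two days so Saturday lands on Monday, then take the ISO week. This gives the Saturday-start week number without depending on row order or contiguity:

```python
import pandas as pd

dates = pd.Series(pd.date_range("2015-08-09", "2015-08-16"))
# +2 days maps a Saturday-start week onto an ISO (Monday-start) week
sat_week = (dates + pd.Timedelta(days=2)).dt.isocalendar().week
```

For 2015-08-09 (a Sunday) this yields week 33, matching the asker's example, and the week number ticks over on Saturday 2015-08-15.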
import datetime
datetime.date(2015, 8, 9).isocalendar()[1]
You could just do this:
from datetime import date

def week_number(start_week_on, date_=None):
    assert 0 <= start_week_on <= 6
    if not date_:
        date_ = date.today()
    __, normal_current_week, normal_current_day = date_.isocalendar()
    if normal_current_day >= start_week_on:
        week = normal_current_week
    else:
        week = normal_current_week - 1
    return week

print("Week starts  We're in")
for start_week_on in range(7):
    this_week = week_number(start_week_on)
    print("    day {0}     week {1}".format(start_week_on, this_week))
Output on day 4 (Thursday):
Week starts We're in
day 0 week 33
day 1 week 33
day 2 week 33
day 3 week 33
day 4 week 33
day 5 week 32
day 6 week 32