Check if Pandas column values are not in list - python

I have a dataframe consisting of counts within 10-minute time intervals. How would I set count = 0 if a time interval doesn't exist?
DF1
import pandas as pd
import numpy as np
df = pd.DataFrame({'City': np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
print(df)
COUNT City Day Time
0 441 PHOENIX Thursday 10:20:00
1 641 ATLANTA Monday 14:30:00
2 661 PHOENIX Saturday 03:50:00
3 570 MIAMI Tuesday 21:00:00
4 222 CHICAGO Friday 15:00:00
DF2 - My approach is to create all the 10 minute time slots in a day (6*24 = 144 entries) and then use "not in"
df2 = pd.DataFrame({'TIME_BIN': np.arange(0, 86401, 600), })
df2['TIME_BIN'] = pd.to_datetime(df2['TIME_BIN'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
TIME_BIN
0 00:00:00
1 00:10:00
2 00:20:00
3 00:30:00
4 00:40:00
5 00:50:00
6 01:00:00
7 01:10:00
8 01:20:00
How do I check if the timeslots in DF2 do not exist in DF1 for each city and day and if so, set count = 0? I basically just need to fill in all the missing time slots in DF1.
Attempt:
for each_city in df.City.unique():
    for each_day in df.Day.unique():
        df['Time'] = df.apply(lambda row: df2['TIME_BIN'] if row['Time'] not in (df2['TIME_BIN'].tolist()) else None)
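For the membership test itself I can get as far as something like this (a rough sketch with isin, listing the bins in DF2 that never appear in DF1 at all), but it doesn't add the missing rows per city and day:
missing_bins = df2.loc[~df2['TIME_BIN'].isin(df['Time']), 'TIME_BIN']
print(missing_bins)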

I think you need to reindex by a MultiIndex built with from_product:
np.random.seed(123)
df = pd.DataFrame({'City': np.random.choice(['PHOENIX','ATLANTA','CHICAGO', 'MIAMI', 'DENVER'], 10000),
                   'Day': np.random.choice(['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 10000),
                   'Time': np.random.randint(1, 86400, size=10000),
                   'COUNT': np.random.randint(1, 700, size=10000)})
df['Time'] = pd.to_datetime(df['Time'], unit='s').dt.round('10min').dt.strftime('%H:%M:%S')
df = df.drop_duplicates(['City','Day','Time'])
#print(df)
times = (pd.to_datetime(pd.Series(np.arange(0, 86401, 600)), unit='s')
           .dt.round('10min')
           .dt.strftime('%H:%M:%S'))

mux = pd.MultiIndex.from_product([df['City'].unique(),
                                  df['Day'].unique(),
                                  times], names=['City','Day','Time'])

df = (df.set_index(['City','Day','Time'])
        .reindex(mux, fill_value=0)
        .reset_index())
print (df.head(20))
City Day Time COUNT
0 CHICAGO Wednesday 00:00:00 66
1 CHICAGO Wednesday 00:10:00 205
2 CHICAGO Wednesday 00:20:00 260
3 CHICAGO Wednesday 00:30:00 127
4 CHICAGO Wednesday 00:40:00 594
5 CHICAGO Wednesday 00:50:00 683
6 CHICAGO Wednesday 01:00:00 203
7 CHICAGO Wednesday 01:10:00 0
8 CHICAGO Wednesday 01:20:00 372
9 CHICAGO Wednesday 01:30:00 109
10 CHICAGO Wednesday 01:40:00 32
11 CHICAGO Wednesday 01:50:00 184
12 CHICAGO Wednesday 02:00:00 630
13 CHICAGO Wednesday 02:10:00 108
14 CHICAGO Wednesday 02:20:00 35
15 CHICAGO Wednesday 02:30:00 604
16 CHICAGO Wednesday 02:40:00 500
17 CHICAGO Wednesday 02:50:00 367
18 CHICAGO Wednesday 03:00:00 118
19 CHICAGO Wednesday 03:10:00 546
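If the duplicate (City, Day, Time) rows should be summed rather than dropped, a possible variant (a sketch, not part of the answer above) is to skip drop_duplicates and aggregate before reindexing onto mux:
df = (df.groupby(['City','Day','Time'])['COUNT'].sum()
        .reindex(mux, fill_value=0)
        .reset_index())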

One way is to convert the columns to categoricals and use groupby to compute the Cartesian product.
In fact, given that your data is largely categorical, this is a good idea and would yield memory benefits for a large number of Time-City-Day combinations.
for col in ['Time', 'City', 'Day']:
    df[col] = df[col].astype('category')

bin_cats = sorted(set(pd.Series(pd.to_datetime(np.arange(0, 86401, 600), unit='s'))
                        .dt.round('10min').dt.strftime('%H:%M:%S')))

df['Time'] = df['Time'].cat.set_categories(bin_cats, ordered=True)
res = df.groupby(['Time', 'City', 'Day'], as_index=False)['COUNT'].sum()
res['COUNT'] = res['COUNT'].fillna(0).astype(int)
# Time City Day COUNT
# 0 00:00:00 ATLANTA Friday 521
# 1 00:00:00 ATLANTA Monday 767
# 2 00:00:00 ATLANTA Saturday 474
# 3 00:00:00 ATLANTA Sunday 1126
# 4 00:00:00 ATLANTA Thursday 157
# 5 00:00:00 ATLANTA Tuesday 720
# 6 00:00:00 ATLANTA Wednesday 0
# 7 00:00:00 CHICAGO Friday 1114
# 8 00:00:00 CHICAGO Monday 813
# 9 00:00:00 CHICAGO Saturday 137
# 10 00:00:00 CHICAGO Sunday 134
# 11 00:00:00 CHICAGO Thursday 0
# 12 00:00:00 CHICAGO Tuesday 168
# ..........
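One caveat, not from the original answer: this approach relies on groupby producing the full Cartesian product of the categorical keys, which is the observed=False behaviour. Recent pandas versions warn that this default is changing, so it may be safer to pass it explicitly:
res = df.groupby(['Time', 'City', 'Day'], as_index=False, observed=False)['COUNT'].sum()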

Then you can try the following:
df.groupby(['City','Day']).apply(lambda x : x.set_index('Time').reindex(df2.TIME_BIN.unique()).fillna({'COUNT':0}).ffill())
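The same idea written out step by step, in case the one-liner is hard to read (a sketch assuming df and df2 as defined in the question):
def fill_missing(g):
    # reindex each (City, Day) group onto the full set of 10-minute bins,
    # set COUNT to 0 for the new rows and forward-fill the other columns
    return (g.set_index('Time')
             .reindex(df2['TIME_BIN'].unique())
             .fillna({'COUNT': 0})
             .ffill())

res = df.groupby(['City', 'Day']).apply(fill_missing)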

Related

Pandas Time Series: Count weekdays with min value in annual data

This is my first question on Stack Overflow and I hope I describe my problem in enough detail.
I'm starting to learn data analysis with Pandas and I've created a time series with daily data for gas prices of a certain station. I've already grouped the hourly data into daily data.
I've been successful with a simple scatter plot over the year with plotly, but in the next step I would like to analyze which weekday is the cheapest or most expensive in every week, count the day names, and then look for a pattern over the whole year.
count mean std min 25% 50% 75% max \
2022-01-01 35.0 1.685000 0.029124 1.649 1.659 1.689 1.6990 1.749
2022-01-02 27.0 1.673444 0.024547 1.649 1.649 1.669 1.6890 1.729
2022-01-03 28.0 1.664000 0.040597 1.599 1.639 1.654 1.6890 1.789
2022-01-04 31.0 1.635129 0.045069 1.599 1.599 1.619 1.6490 1.779
2022-01-05 33.0 1.658697 0.048637 1.599 1.619 1.649 1.6990 1.769
2022-01-06 35.0 1.658429 0.050756 1.599 1.619 1.639 1.6940 1.779
2022-01-07 30.0 1.637333 0.039136 1.599 1.609 1.629 1.6565 1.759
2022-01-08 41.0 1.655829 0.041740 1.619 1.619 1.639 1.6790 1.769
2022-01-09 35.0 1.647857 0.031602 1.619 1.619 1.639 1.6590 1.769
2022-01-10 31.0 1.634806 0.041374 1.599 1.609 1.619 1.6490 1.769
...
week weekday
2022-01-01 52 Saturday
2022-01-02 52 Sunday
2022-01-03 1 Monday
2022-01-04 1 Tuesday
2022-01-05 1 Wednesday
2022-01-06 1 Thursday
2022-01-07 1 Friday
2022-01-08 1 Saturday
2022-01-09 1 Sunday
2022-01-10 2 Monday
...
I tried grouping and resampling, but unfortunately I didn't get the result I was hoping for.
Can someone suggest a way to deal with this problem? Thanks!
Here's a way to do what I believe your question asks:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'count': [35,27,28,31,33,35,30,41,35,31]*40,
    'mean': [1.685,1.673444,1.664,1.635129,1.658697,1.658429,1.637333,1.655829,1.647857,1.634806]*40
    },
    index=pd.Series(pd.to_datetime(pd.date_range("2022-01-01", periods=400, freq="D"))))
print( '','input df:',df,sep='\n' )
df_date = df.reset_index()['index']
df['weekday'] = list(df_date.dt.day_name())
df['year'] = df_date.dt.year.to_numpy()
df['week'] = df_date.dt.isocalendar().week.to_numpy()
df['year_week_started'] = df.year - np.where((df.week>=52)&(df.week.shift(-7)==1),1,0)
print( '','input df with intermediate columns:',df,sep='\n' )
cols = ['year_week_started', 'week']

dfCheap = df.loc[df.groupby(cols)['mean'].idxmin(), :].set_index(cols)
dfCheap = ( dfCheap.groupby(['year_week_started', 'weekday'])['mean'].count()
            .rename('freq').to_frame().set_index('freq', append=True)
            .reset_index(level='weekday').sort_index(ascending=[True,False]) )
print( '','dfCheap:',dfCheap,sep='\n' )

dfExpensive = df.loc[df.groupby(cols)['mean'].idxmax(), :].set_index(cols)
dfExpensive = ( dfExpensive.groupby(['year_week_started', 'weekday'])['mean'].count()
                .rename('freq').to_frame().set_index('freq', append=True)
                .reset_index(level='weekday').sort_index(ascending=[True,False]) )
print( '','dfExpensive:',dfExpensive,sep='\n' )
Sample input:
input df:
count mean
2022-01-01 35 1.685000
2022-01-02 27 1.673444
2022-01-03 28 1.664000
2022-01-04 31 1.635129
2022-01-05 33 1.658697
... ... ...
2023-01-31 35 1.658429
2023-02-01 30 1.637333
2023-02-02 41 1.655829
2023-02-03 35 1.647857
2023-02-04 31 1.634806
[400 rows x 2 columns]
input df with intermediate columns:
count mean weekday year week year_week_started
2022-01-01 35 1.685000 Saturday 2022 52 2021
2022-01-02 27 1.673444 Sunday 2022 52 2021
2022-01-03 28 1.664000 Monday 2022 1 2022
2022-01-04 31 1.635129 Tuesday 2022 1 2022
2022-01-05 33 1.658697 Wednesday 2022 1 2022
... ... ... ... ... ... ...
2023-01-31 35 1.658429 Tuesday 2023 5 2023
2023-02-01 30 1.637333 Wednesday 2023 5 2023
2023-02-02 41 1.655829 Thursday 2023 5 2023
2023-02-03 35 1.647857 Friday 2023 5 2023
2023-02-04 31 1.634806 Saturday 2023 5 2023
[400 rows x 6 columns]
Sample output:
dfCheap:
weekday
year_week_started freq
2021 1 Monday
2022 11 Tuesday
10 Thursday
10 Wednesday
6 Sunday
5 Friday
5 Monday
5 Saturday
2023 2 Thursday
1 Saturday
1 Sunday
1 Wednesday
dfExpensive:
weekday
year_week_started freq
2021 1 Saturday
2022 16 Monday
10 Tuesday
6 Sunday
5 Friday
5 Saturday
5 Thursday
5 Wednesday
2023 2 Monday
1 Friday
1 Thursday
1 Tuesday
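A note on the year_week_started step, in case it looks opaque: it decrements the year for the first days of January that still belong to ISO week 52/53 of the previous year. If grouping by the ISO week-year is acceptable (it assigns each week to the year containing its Thursday, which is close to, but not exactly, "the year the week started"), a possible simplification is to take both values straight from isocalendar():
iso = df_date.dt.isocalendar()
df['week'] = iso.week.to_numpy()
df['year_week_started'] = iso.year.to_numpy()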

How to iterate two data frames in python?

Dataset1:
Date Weekday OpenPrice ClosePrice
_______________________________________________
28/07/2022 Thursday 5678 5674
04/08/2022 Thursday 5274 5674
11/08/2022. Thursday 7650 7652
Dataset2:
Date Weekday Open Price Close Price
______________________________________________
29/07/2022 Friday 4371 4387
05/08/2022 Friday 6785 6790
12/08/2022 Friday 4367 6756
I would like to iterate over these two datasets and create a new dataset which shows data as below. This is the difference between the Open Price of Week 1 (week n-1) on Friday and the Close Price of Week 2 (week n) on Thursday.
Week Difference
______________________________
Week2 543 (i.e 5674 - 4371)
Week3 867 (i.e 7652 - 6785)
Here is the real file:
https://github.com/ravindraprasad75/HRBot/blob/master/DatasetforSOF.xlsx
Don't iterate over dataframes. Merge them instead.
Reconstruction of your data (cf. How to make good reproducible pandas examples on how to share dataframes)
import pandas as pd
from io import StringIO
from datetime import datetime

cols = ['Date', 'Weekday', 'OpenPrice', 'ClosePrice']

data1 = """28/07/2022 Thursday 5674 5678
04/08/2022 Thursday 5274 5278
11/08/2022. Thursday 7652 7687"""

data2 = """29/07/2022 Friday 4371 4387
05/08/2022 Friday 6785 6790
12/08/2022 Friday 4367 6756"""

df1, df2 = (pd.read_csv(StringIO(d),
                        header=None,
                        sep=r"\s+",
                        names=cols,
                        parse_dates=["Date"],
                        dayfirst=True) for d in (data1, data2))
Add Week column
df1['Week'] = df1.Date.dt.isocalendar().week
df2['Week'] = df2.Date.dt.isocalendar().week
Resulting dataframes:
>>> df1
Date Weekday OpenPrice ClosePrice Week
0 2022-07-28 Thursday 5674 5678 30
1 2022-08-04 Thursday 5274 5278 31
2 2022-08-11 Thursday 7652 7687 32
>>> df2
Date Weekday OpenPrice ClosePrice Week
0 2022-07-29 Friday 4371 4387 30
1 2022-08-05 Friday 6785 6790 31
2 2022-08-12 Friday 4367 6756 32
Merge on Week
df3 = df1.merge(df2, on="Week", suffixes=("_Thursday", "_Friday"))
Result:
>>> df3
Date_Thursday Weekday_Thursday OpenPrice_Thursday ClosePrice_Thursday \
0 2022-07-28 Thursday 5674 5678
1 2022-08-04 Thursday 5274 5278
2 2022-08-11 Thursday 7652 7687
Week Date_Friday Weekday_Friday OpenPrice_Friday ClosePrice_Friday
0 30 2022-07-29 Friday 4371 4387
1 31 2022-08-05 Friday 6785 6790
2 32 2022-08-12 Friday 4367 6756
Now you can simply do df3.OpenPrice_Friday - df3.ClosePrice_Thursday, using shift where you need to compare different weeks.
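For the exact difference asked for in the question (Thursday close of week n minus Friday open of week n-1), the shift would look something like this (a sketch on top of the merged frame above):
df3['Difference'] = df3['ClosePrice_Thursday'] - df3['OpenPrice_Friday'].shift(1)
print(df3[['Week', 'Difference']].dropna())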

Sum of individual labels over a month of granular data

I have a dataframe which contains life logging data gathered over several years from 44 unique individuals.
Int64Index: 77171 entries, 0 to 4279
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start 77171 non-null datetime64[ns]
1 end 77171 non-null datetime64[ns]
2 labelName 77171 non-null category
3 id 77171 non-null int64
The start column contains granular datetimes of the format 2020-11-01 11:00:00, in intervals of 30 minutes. The labelName column has 14 different categories.
Categories (14, object): ['COOK', 'EAT', 'GO WALK', 'GO TO BATHROOM', ..., 'DRINK', 'WAKE UP', 'SLEEP', 'WATCH TV']
Here's a sample user's data, which is [2588 rows x 4 columns] and spans from 2020 to 2021. There are also occasional gaps in the data.
start end labelName id
0 2020-08-05 00:00:00 2020-08-05 00:30:00 GO TO BATHROOM 486
1 2020-08-05 06:00:00 2020-08-05 06:30:00 WAKE UP 486
2 2020-08-05 09:00:00 2020-08-05 09:30:00 COOK 486
3 2020-08-05 11:00:00 2020-08-05 11:30:00 EAT 486
4 2020-08-05 12:00:00 2020-08-05 12:30:00 EAT 486
.. ... ... ... ...
859 2021-03-10 12:30:00 2021-03-10 13:00:00 GO TO BATHROOM 486
861 2021-03-10 13:30:00 2021-03-10 14:00:00 GO TO BATHROOM 486
862 2021-03-10 18:30:00 2021-03-10 19:00:00 COOK 486
864 2021-03-11 08:00:00 2021-03-11 08:30:00 EAT 486
865 2021-03-11 12:30:00 2021-03-11 13:00:00 COOK 486
I want a sum of each unique labelName per user per month, but I'm not sure how to do this.
I would first split the data frame by id, which is easy. But how do you split these start datetimes when the data is recorded every 30 minutes over several years, and then create 14 new columns which record the sums?
The final data frame might look something like this (with fake values):
user  month  SLEEP  ...  WATCH TV
486   jun20  324    ...  23
486   jul20  234    ...  12
The use-case for this data frame is training a few statistical and machine-learning models.
How do I achieve something like this?
Because the data comes in 30-minute intervals, you can count the records with crosstab per month (converting the timestamps to monthly periods with Series.dt.to_period) and then multiply by 0.5 to get the output in hours.
If start is 2020-09-30 23:30:00 and end is 2020-10-01 00:00:00, then to count this record towards October use df['end'] in the crosstab; to count it towards September use df['start'].
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df1 = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName']).mul(0.5)
         .rename_axis(columns=None, index=['id','month'])
         .rename(columns=str)
         .reset_index()
         .assign(month=lambda x: x['month'].dt.strftime('%b%Y')))
print (df1)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0.0 0.0 1.0 0.5 1.0
1 650 Mar2021 0.5 1.0 0.5 0.5 0.0
For output in 30-minute counts:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName'])
        .rename_axis(columns=None, index=['id','month'])
        .reset_index()
        .assign(month=lambda x: x['month'].dt.strftime('%b%Y')))
print (df)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0 0 2 1 2
1 650 Mar2021 1 2 1 1 0
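If the month labels should match the jun20 style shown in the question's expected output, the formatting can be adjusted afterwards (a sketch, converting the Sep2020-style strings produced above):
df['month'] = pd.to_datetime(df['month'], format='%b%Y').dt.strftime('%b%y').str.lower()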
Use:
from collections import Counter
df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName'].apply(lambda x: Counter(x)).reset_index().pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0)
Demonstration:
#Preparing Data
string = """start  end  labelName  id
2020-09-21 14:30:00  2020-09-21 15:00:00  WAKE UP  650
2020-09-21 15:00:00  2020-09-21 15:30:00  GO TO BATHROOM  650
2020-09-21 15:30:00  2020-09-21 16:00:00  SLEEP  650
2020-09-29 17:00:00  2020-09-29 17:30:00  WAKE UP  650
2020-09-29 17:30:00  2020-09-29 18:00:00  GO TO BATHROOM  650
2021-03-11 13:00:00  2021-03-11 13:30:00  EAT  650
2021-03-11 14:30:00  2021-03-11 15:00:00  GO TO BATHROOM  650
2021-03-11 15:00:00  2021-03-11 15:30:00  COOK  650
2021-03-11 15:30:00  2021-03-11 16:00:00  EAT  650
2021-03-11 16:00:00  2021-03-11 16:30:00  SLEEP  650"""
data = [x.split(' ') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns = data[0])
df['start'] = pd.to_datetime(df['start'])
#Solution
from collections import Counter
df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName'].apply(lambda x: Counter(x)).reset_index().pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0)

Get delta values of each quarter from cumulative income statement reports with pandas

Get the value of each quarter from cumulative income statement reports with pandas.
Is there any way to do it with python/pandas?
I have an example dataset like below.
(please suppose that this company's fiscal year is from Jan to Dec)
qend revenue profit
2015-03-31 2,453 298
2015-06-30 5,076 520
2015-09-30 8,486 668
2015-12-31 16,724 820
2016-03-31 1,880 413
2016-06-30 3,989 568
2016-09-30 7,895 621
2016-12-31 16,621 816
I want to know how much revenue and profit this company earns in each quarter.
But the report only shows the numbers cumulatively.
Q1 is fine as it is, but for Q2-Q4 I have to take the difference from the previous quarter.
This is my expecting results.
qend revenue profit mycommment
2015-03-31 2,453 298 copy from Q1
2015-06-30 2,623 222 delta of Q1 and Q2
2015-09-30 3,410 148 delta of Q2 and Q3
2015-12-31 8,238 152 delta of Q3 and Q4
2016-03-31 1,880 413 copy from Q1
2016-06-30 2,109 155 delta of Q1 and Q2
2016-09-30 3,906 53 delta of Q2 and Q3
2016-12-31 8,726 195 delta of Q3 and Q4
The difficulty is that it is not simply a matter of taking the delta from the previous row, because each Q1 needs no delta while Q2-Q4 do.
If there is no easy way in pandas, I'll code it in Python.
I think you need the quarter number to identify the first quarter, and then assign either the original value or the diff depending on that condition:
m = df['qend'].dt.quarter == 1
df['diff_profit'] = np.where(m, df['profit'], df['profit'].diff())
#same as
#df['diff_profit'] = df['profit'].where(m, df['profit'].diff())
print (df)
qend revenue profit diff_profit
0 2015-03-31 2,453 298 298.0
1 2015-06-30 5,076 520 222.0
2 2015-09-30 8,486 668 148.0
3 2015-12-31 16,724 820 152.0
4 2016-03-31 1,880 413 413.0
5 2016-06-30 3,989 568 155.0
6 2016-09-30 7,895 621 53.0
7 2016-12-31 16,621 816 195.0
Or:
df['diff_profit'] = np.where(m, df['profit'], df['profit'].shift() - df['profit'])
print (df)
qend revenue profit diff_profit
0 2015-03-31 2,453 298 298.0
1 2015-06-30 5,076 520 -222.0
2 2015-09-30 8,486 668 -148.0
3 2015-12-31 16,724 820 -152.0
4 2016-03-31 1,880 413 413.0
5 2016-06-30 3,989 568 -155.0
6 2016-09-30 7,895 621 -53.0
7 2016-12-31 16,621 816 -195.0
Detail:
print (df['qend'].dt.quarter)
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 4
Name: qend, dtype: int64
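The question asks for revenue as well; the same pattern applies. In the sample the numbers contain thousands separators, so (assuming they are stored as strings like '2,453') they may need converting first, a sketch:
df['revenue'] = df['revenue'].str.replace(',', '').astype(int)
df['diff_revenue'] = np.where(m, df['revenue'], df['revenue'].diff())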

Sum set of values from pandas dataframe within certain time frame

I have a fairly complicated question. I need to select rows from a data frame within a certain set of start and end dates, and then sum those values and put them in a new dataframe.
So I start off with this data frame, df:
import random
dates = pd.date_range('20150101 020000',periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2,60))
                   })
and define some start and end dates:
import pandas as pd
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
Then what needs to happen is that I need to create a new data frame with a weekly_sum column, which sums the value column of df for rows that fall between start_date and end_date.
So for example, the first row of the new data frame would return the sum of the values between 2-13-16 and 2-20-16. I imagine I'd use groupby.sum() or something similar.
It might look like this:
id start_date end_date weekly_sum
65 2016-02-13 2016-02-20 100
Any direction is greatly appreciated!
P.S. I know my use of random.choice is a little wonky so if you have a better way of generating random numbers, I'd love to see it!
You can use:
def get_dates(x):
    # Select the df values between start and ending datetime.
    n = df[(df['time_stamp'] > x['start']) & (df['time_stamp'] < x['end'])]
    # Return first id and sum of values
    return n['id'].values[0], n['value'].sum()

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'id': np.random.randint(0, 1000, size=(1000,)),
                   'time_stamp': dates,
                   'value': np.random.randint(2, 60, size=(1000,))
                   })

ndf = pd.DataFrame({'start': pd.to_datetime(start_date), 'end': end_date})

# Unpack and assign values to id and value column
ndf[['id','value']] = ndf.apply(lambda x: get_dates(x), 1).apply(pd.Series)
print(df.head(5))
id time_stamp value
0 770 2015-01-01 02:00:00 59
1 781 2015-01-02 02:00:00 32
2 761 2015-01-03 02:00:00 40
3 317 2015-01-04 02:00:00 16
4 538 2015-01-05 02:00:00 20
print(ndf.head(5))
end start id value
0 2016-02-20 2016-02-13 569 221
1 2016-03-01 2016-02-23 28 216
2 2016-03-24 2016-03-17 152 258
3 2016-03-31 2016-03-24 892 265
4 2016-04-02 2016-03-26 606 244
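To line the result up with the column names sketched in the question (id, start_date, end_date, weekly_sum), a final rename can be appended (a sketch):
ndf = (ndf.rename(columns={'start': 'start_date', 'end': 'end_date', 'value': 'weekly_sum'})
          [['id', 'start_date', 'end_date', 'weekly_sum']])
print(ndf.head())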
You can calculate a weekly summary with the following code. The weeks below start on Monday.
import pandas as pd
import numpy as np
import random

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2,60))
                   })

df['day_of_week'] = df['time_stamp'].dt.day_name()
df['start'] = np.where(df["day_of_week"]=="Monday", 1, 0)
df['week'] = df["start"].cumsum()
# It is based on Monday.
df.head(20)
# Out[109]:
# _id time_stamp value day_of_week start week
# 0 396 2015-01-01 02:00:00 59 Thursday 0 0
# 1 396 2015-01-02 02:00:00 59 Friday 0 0
# 2 396 2015-01-03 02:00:00 59 Saturday 0 0
# 3 396 2015-01-04 02:00:00 59 Sunday 0 0
# 4 396 2015-01-05 02:00:00 59 Monday 1 1
# 5 396 2015-01-06 02:00:00 59 Tuesday 0 1
# 6 396 2015-01-07 02:00:00 59 Wednesday 0 1
# 7 396 2015-01-08 02:00:00 59 Thursday 0 1
# 8 396 2015-01-09 02:00:00 59 Friday 0 1
# 9 396 2015-01-10 02:00:00 59 Saturday 0 1
# 10 396 2015-01-11 02:00:00 59 Sunday 0 1
# 11 396 2015-01-12 02:00:00 59 Monday 1 2
# 12 396 2015-01-13 02:00:00 59 Tuesday 0 2
# 13 396 2015-01-14 02:00:00 59 Wednesday 0 2
# 14 396 2015-01-15 02:00:00 59 Thursday 0 2
# 15 396 2015-01-16 02:00:00 59 Friday 0 2
# 16 396 2015-01-17 02:00:00 59 Saturday 0 2
# 17 396 2015-01-18 02:00:00 59 Sunday 0 2
# 18 396 2015-01-19 02:00:00 59 Monday 1 3
# 19 396 2015-01-20 02:00:00 59 Tuesday 0 3
aggfunc = {"time_stamp": [np.min, np.max], "value": [np.sum]}
df2 = df.groupby("week", as_index=False).agg(aggfunc)
df2.columns = ["week", "start_date", "end_date", "weekly_sum"]
df2.iloc[58:61]
# Out[110]:
# week start_date end_date weekly_sum
# 58 58 2016-02-08 02:00:00 2016-02-14 02:00:00 413
# 59 59 2016-02-15 02:00:00 2016-02-21 02:00:00 413
# 60 60 2016-02-22 02:00:00 2016-02-28 02:00:00 413
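On the P.S. in the question: random.choice(range(0, 1000)) returns a single scalar, which pandas then broadcasts to every row, which is why every value above is 59. A per-row alternative (as the other answer also uses):
import numpy as np

df['_id'] = np.random.randint(0, 1000, size=len(df))
df['value'] = np.random.randint(2, 60, size=len(df))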
