How to use group-by on timestamps correctly? - python

I have timeseries data for a full year at one-minute resolution.
timestamp day hour min rainfall_rate
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to combine the timestamps so that I can get the combined rainfall_rate for every month, i.e. I want to use group-by to combine them based on the date, and also to plot them with timestamps on the axis for further analysis.
How can I perform this using pandas?
I used:
daily_groups = rainfall_df.groupby(rainfall_df.index.date)
and then
daily_groups.get_group(pd.Timestamp(2010,1,1))['rainfall_rate'].sum()
but of course I could not plot the results because they have different shapes.

Use pd.Grouper with freq="M" (this assumes timestamp is the DatetimeIndex):
print(df.groupby(pd.Grouper(freq="M"))["rainfall_rate"].count())
timestamp
2010-01-31 5
2010-02-28 0
2010-03-31 0
2010-04-30 0
2010-05-31 0
2010-06-30 0
2010-07-31 0
2010-08-31 0
2010-09-30 0
2010-10-31 0
2010-11-30 0
2010-12-31 0
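
Since the question asks for the combined rainfall rather than a count, the same grouping with sum gives the monthly totals and something directly plottable. A minimal sketch, assuming rainfall_df is indexed by the timestamp column:

import pandas as pd

# monthly totals; note that pandas >= 2.2 spells the month-end frequency "ME"
monthly = rainfall_df.groupby(pd.Grouper(freq="M"))["rainfall_rate"].sum()
monthly.plot()  # the x-axis is the month-end timestamp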

Related

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00).
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
The Hora_Retiro column is of type timedelta64[ns].
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since that column is timedelta64[ns], take the hour via .dt.components (a plain .dt.hour exists only for datetimes):
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group by that hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str), format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, as in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute/second taken from the first row.
To group by "full hours", floor each element of this column to the hour, then group by the floored value:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
(Flooring rather than rounding keeps e.g. 00:59:00 in the 00:00:00 bin, which is what your desired output shows.) Also decide what you actually want to count: rows, or the values in the count_uses column. In the second case, replace count with sum.
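
If what you want is the total uses per hour rather than the row count, the sum variant looks like this (a sketch, assuming Hora_Retiro is timedelta64[ns] as stated):

hourly_uses = hora_pico.groupby(hora_pico['Hora_Retiro'].dt.floor('H'))['count_uses'].sum()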

How can I select values in one column based on a condition in another using python?

Is there a way to filter rows if column2 has all zeroes 10 minutes ahead of the current value in column1? How can I do this while keeping the datetime index?
2020-01-01 00:01:00 60 0
2020-01-01 00:02:00 70 0
2020-01-01 00:03:00 80 0
2020-01-01 00:04:00 70 0
2020-01-01 00:05:00 60 0
2020-01-01 00:06:00 60 0
2020-01-01 00:07:00 70 0
2020-01-01 00:08:00 80 0
2020-01-01 00:09:00 80 2
2020-01-01 00:10:00 80 0
2020-01-01 00:11:00 70 0
2020-01-01 00:12:00 70 0
2020-01-01 00:13:00 50 0
2020-01-01 00:14:00 50 0
2020-01-01 00:15:00 60 0
2020-01-01 00:16:00 60 0
2020-01-01 00:17:00 70 0
2020-01-01 00:18:00 70 0
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
2020-01-01 00:21:00 80 1
2020-01-01 00:22:00 90 2
Expected output
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
I figured it out. It's actually simple.
# trailing 10-minute sum of col2 (rows are one minute apart)
input['col3'] = input['col2'].rolling(10).sum()
# keep only the rows where that sum is zero
output = input.loc[input['col3'] == 0]
(Note that input shadows the Python built-in of the same name; a different variable name is safer.)
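
If you literally need the window ahead of each row rather than behind it, pandas also supports forward-looking windows. A sketch with pd.api.indexers.FixedForwardWindowIndexer (column names col1/col2 as in the question):

import pandas as pd

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=10)
df['ahead'] = df['col2'].rolling(indexer).sum()  # sum over the current row and the next 9
out = df.loc[df['ahead'] == 0]  # rows whose next 10 minutes of col2 are all zero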
Just a guess, because I do not know pandas, but assuming it is a bit like SQL, LINQ, or linkable datasets in C#: what about joining your table (A) with itself (B) over the 12-minute window, grouping by each row of A, summing B's column2 (if there are only positive values there), and filtering (SQL HAVING) to the groups whose sum is 0?
As a result, report A.column0, A.column1 and SUM(B.column2).
Using pandas.DataFrame.query (see the documentation):
df.query(f'column_1 == {0} and column_2 == {value} or column_3 == {another_value}')

Pandas: time column addition and repeating all rows for a month

I'd like to change my dataframe by adding time intervals for every hour during a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number - 1 = month_days_number * hours_per_day * orig_rows_number - 1 = 31 * 24 * 3 - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a new DataFrame holding all hours of the month created by date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print(df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
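
On pandas 1.2 or newer the helper column can be dropped, since merge supports cross joins directly. A sketch under that assumption:

import pandas as pd

df1 = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(df1, how='cross')  # every original row paired with every hour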

How to use groupby on day and month in pandas?

I have timeseries data for a full year at one-minute resolution.
timestamp day hour min somedata
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to group the data by day, i.e. the 2010-01-01 data should be one group, 2010-01-02 another, and so on up to 2010-12-31.
I used daily_groupby = dataframe.groupby(pd.to_datetime(dataframe.index.day, unit='D', origin=pd.Timestamp('2009-12-31'))). This creates groups based on the day of the month alone, so day 1 of Jan, Feb, and every month up to Dec all end up in one group. I also want to group by month so that Jan, Feb, etc. do not get mixed up.
I am a beginner in pandas.
If timestamp is the index, use DatetimeIndex.date:
df.groupby(pd.to_datetime(df.index).date)
Otherwise, use Series.dt.date:
df.groupby(pd.to_datetime(df['timestamp']).dt.date)
If you don't want to group by year, use:
time_index = pd.to_datetime(df.index)
df.groupby([time_index.month,time_index.day])
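
For example, to reduce somedata to a daily sum while keeping the months separate (a sketch; somedata is assumed numeric):

import pandas as pd

time_index = pd.to_datetime(df.index)
daily = df.groupby([time_index.month, time_index.day])['somedata'].sum()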

Sum set of values from pandas dataframe within certain time frame

I have a fairly complicated question. I need to select rows from a data frame within a certain set of start and end dates, and then sum those values and put them in a new dataframe.
So I start off with this data frame, df:
import random
import pandas as pd

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2, 60))})
and define some start and end dates:
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
Then I need to create a new data frame with a weekly_sum column that sums the value column of df for the rows falling between each start_date and end_date.
So for example, the first row of the new data frame would hold the sum of the values between 2-13-16 and 2-20-16. I imagine I'd use groupby.sum() or something similar.
It might look like this:
id start_date end_date weekly_sum
65 2016-02-13 2016-02-20 100
Any direction is greatly appreciated!
P.S. I know my use of random.choice is a little wonky so if you have a better way of generating random numbers, I'd love to see it!
You can use:
import numpy as np

def get_dates(x):
    # Select the df values between the start and end datetimes.
    n = df[(df['time_stamp'] > x['start']) & (df['time_stamp'] < x['end'])]
    # Return the first id and the sum of the values.
    return n['id'].values[0], n['value'].sum()

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'id': np.random.randint(0, 1000, size=(1000,)),
                   'time_stamp': dates,
                   'value': np.random.randint(2, 60, size=(1000,))})

ndf = pd.DataFrame({'start': pd.to_datetime(start_date), 'end': end_date})
# Unpack and assign the results to the id and value columns.
ndf[['id', 'value']] = ndf.apply(lambda x: get_dates(x), 1).apply(pd.Series)
print(df.head(5))
id time_stamp value
0 770 2015-01-01 02:00:00 59
1 781 2015-01-02 02:00:00 32
2 761 2015-01-03 02:00:00 40
3 317 2015-01-04 02:00:00 16
4 538 2015-01-05 02:00:00 20
print(ndf.head(5))
end start id value
0 2016-02-20 2016-02-13 569 221
1 2016-03-01 2016-02-23 28 216
2 2016-03-24 2016-03-17 152 258
3 2016-03-31 2016-03-24 892 265
4 2016-04-02 2016-03-26 606 244
You can calculate a weekly summary with the following code; the weeks here start on Monday.
import numpy as np
import pandas as pd
import random

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2, 60))})

# Name each weekday, flag Mondays as week starts, and number the weeks
# with a cumulative sum of those flags.
df['day_of_week'] = df['time_stamp'].dt.day_name()  # weekday_name was removed in pandas 1.0
df['start'] = np.where(df['day_of_week'] == 'Monday', 1, 0)
df['week'] = df['start'].cumsum()
# It is based on Monday.
df.head(20)
# Out[109]:
# _id time_stamp value day_of_week start week
# 0 396 2015-01-01 02:00:00 59 Thursday 0 0
# 1 396 2015-01-02 02:00:00 59 Friday 0 0
# 2 396 2015-01-03 02:00:00 59 Saturday 0 0
# 3 396 2015-01-04 02:00:00 59 Sunday 0 0
# 4 396 2015-01-05 02:00:00 59 Monday 1 1
# 5 396 2015-01-06 02:00:00 59 Tuesday 0 1
# 6 396 2015-01-07 02:00:00 59 Wednesday 0 1
# 7 396 2015-01-08 02:00:00 59 Thursday 0 1
# 8 396 2015-01-09 02:00:00 59 Friday 0 1
# 9 396 2015-01-10 02:00:00 59 Saturday 0 1
# 10 396 2015-01-11 02:00:00 59 Sunday 0 1
# 11 396 2015-01-12 02:00:00 59 Monday 1 2
# 12 396 2015-01-13 02:00:00 59 Tuesday 0 2
# 13 396 2015-01-14 02:00:00 59 Wednesday 0 2
# 14 396 2015-01-15 02:00:00 59 Thursday 0 2
# 15 396 2015-01-16 02:00:00 59 Friday 0 2
# 16 396 2015-01-17 02:00:00 59 Saturday 0 2
# 17 396 2015-01-18 02:00:00 59 Sunday 0 2
# 18 396 2015-01-19 02:00:00 59 Monday 1 3
# 19 396 2015-01-20 02:00:00 59 Tuesday 0 3
aggfunc = {"time_stamp": ["min", "max"], "value": ["sum"]}
df2 = df.groupby("week", as_index=False).agg(aggfunc)
df2.columns = ["week", "start_date", "end_date", "weekly_sum"]
df2.iloc[58:61]
# Out[110]:
# week start_date end_date weekly_sum
# 58 58 2016-02-08 02:00:00 2016-02-14 02:00:00 413
# 59 59 2016-02-15 02:00:00 2016-02-21 02:00:00 413
# 60 60 2016-02-22 02:00:00 2016-02-28 02:00:00 413
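
The same weekly summary can also be written with resample, which avoids the helper columns. A sketch, assuming time_stamp is datetime64[ns] and weeks should start on Monday:

import pandas as pd

# one row per Monday-anchored week, labeled by the week's starting Monday
weekly = df.resample('W-MON', on='time_stamp', label='left', closed='left')['value'].sum()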
