I have the following dataframe:
X
Datetime
2017-01-01 01:00:00 3129.3460
2017-01-01 02:00:00 5433.4315
2017-01-01 03:00:00 2351.8391
2017-01-01 04:00:00 6788.3210
2017-01-01 05:00:00 1232.8655
...
2022-08-14 20:00:00 8905.5340
2022-08-14 21:00:00 8623.0765
2022-08-14 22:00:00 9054.8312
2022-08-14 23:00:00 10341.4785
2022-08-15 00:00:00 9341.1234
How can I remove the whole day of data if the first hour of that day is different from zero? In this case, I need to remove the whole 2017-01-01 day.
So I thought of using an if condition with df.drop():
first_data = df.index.min()
if first_data.hour != 0:
    df = df.drop(first_data)
But I am not sure what I should pass as the argument to df.drop. The above code will only drop the first hour of that first day, since first_data is the full timestamp, from the year down to the seconds.
Let's try groupby and filter
out = (df.groupby(df.index.date)
         .filter(lambda g: str(g.index.min().time()) == '00:00:00'))
print(out)
X
2022-08-15 00:00:00 9341.1234
2022-08-15 01:00:00 9341.1234
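If, as in the example, only the leading day can be incomplete, a simpler alternative is to slice from the first midnight. This is a minimal sketch; it assumes the index is a sorted DatetimeIndex:
import datetime
import pandas as pd

start = df.index.min()
# if the series does not start at midnight, skip ahead to the next midnight
if start.time() != datetime.time(0, 0):
    start = start.normalize() + pd.Timedelta(days=1)
out = df.loc[start:]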
Related
I have a large df with a datetime index with an hourly time step and precipitation values in several columns. My precipitation values are a cumulative total during the day (from 1:00 am to 0:00 am of the next day) and are reset after every day, for example:
datetime S1
2000-01-01 00:00:00 4.5 ...
2000-01-01 01:00:00 0 ...
2000-01-01 02:00:00 0 ...
2000-01-01 03:00:00 0 ...
2000-01-01 04:00:00 0
2000-01-01 05:00:00 0
2000-01-01 06:00:00 0
2000-01-01 07:00:00 0
2000-01-01 08:00:00 0
2000-01-01 09:00:00 0
2000-01-01 10:00:00 0
2000-01-01 11:00:00 6.5
2000-01-01 12:00:00 7.5
2000-01-01 13:00:00 8.7
2000-01-01 14:00:00 8.7
...
2000-01-01 22:00:00 8.7
2000-01-01 23:00:00 8.7
2000-01-02 00:00:00 8.7
2000-01-02 01:00:00 0
I am trying to go from this to the actual hourly values, so the value at 1:00 am of every day is fine, and for every other hour I want to subtract the value from the timestep before.
Can I somehow use an if statement inside df.apply?
I thought of something like:
df_copy = df.copy()
df = df.apply(lambda x: if df.hour !=1: era5_T[x]=era5_T[x]-era5_T_copy[x-1])
But this is not working, since I'm not calling a function? I could work with a for loop, but that doesn't seem like the most efficient way, as I'm working with a big dataset.
You can use numpy.where and pd.Series.shift to achieve the result:
import numpy as np
df['hourly_S1'] = np.where(df.index.hour == 1, df.S1, df.S1 - df.S1.shift())
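Since the question mentions several columns, the same idea generalizes with DataFrame.diff. A minimal sketch on made-up data (the column names S1 and S2 are assumptions):
import pandas as pd

# synthetic cumulative data for two stations, resetting each day at 01:00
idx = pd.date_range('2000-01-01 00:00', periods=6, freq='H')
df = pd.DataFrame({'S1': [4.5, 0.0, 0.0, 0.0, 0.0, 6.5],
                   'S2': [2.0, 0.0, 0.0, 0.0, 0.0, 3.0]}, index=idx)

mask = df.index.hour == 1          # rows to keep as-is
hourly = df.diff()                 # hour-over-hour differences everywhere
hourly.loc[mask] = df.loc[mask]    # restore the raw 01:00 values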
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since that column is of timedelta64[ns] type, the hour part lives in .dt.components:
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group on the basis of that hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, as in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours", floor each element in this column to the hour (rounding would push e.g. 00:59:47 into the 01:00:00 bucket), then group just by this floored value.
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you actually want to count: rows, or the values in the count_uses column. In the latter case, replace count with sum.
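A small self-contained example of the floor-based grouping (the sample values here are made up):
import pandas as pd

df = pd.DataFrame({
    'Hora_Retiro': pd.to_timedelta(['00:00:18', '00:59:40', '01:02:27', '23:59:56']),
    'count_uses': [1, 1, 1, 1],
})

# floor each timedelta to the hour, then sum the uses per hour
out = df.groupby(df.Hora_Retiro.dt.floor('H')).count_uses.sum()
print(out)
# 0 days 00:00:00    2
# 0 days 01:00:00    1
# 0 days 23:00:00    1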
I have a dataframe with hourly values for several years. Its index is already in datetime format, and the column containing the values is called, say, "value column".
date = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value_column = [33.24, 31.71, 34.39, 34.49]
df = pd.DataFrame({'index': date, 'value column': value_column})
df.index = pd.to_datetime(df['index'], format='%Y-%m-%d %H:%M:%S')
df.drop(['index'], axis=1, inplace=True)
print(df.head())
value column
index
2015-02-03 23:00:00 33.24
2015-02-03 23:30:00 31.71
2015-02-04 00:00:00 34.39
2015-02-04 00:30:00 34.49
I know how to get the mean of the "value column" for each year efficiently with for instance the following command:
df = df.groupby(df.index.year).mean()
Now, I would like to divide all hourly values of the column "value column" by the mean of its values for its corresponding year (for instance dividing all the 2015 hourly values by the mean of 2015 values, and same for the other years).
Is there an efficient way to do that in pandas?
Expected result:
value column Value column/mean of year
index
2015-02-03 23:00:00 33.24 0.993499
2015-02-03 23:30:00 31.71 0.947771
2015-02-04 00:00:00 34.39 1.027871
2015-02-04 00:30:00 34.49 1.030860
Many thanks,
Try the following:
df.groupby(df.index.year).transform(lambda x: x/x.mean())
Refer: Group By: split-apply-combine
Transformation is recommended as it is meant to perform some group-specific computations and return a like-indexed object.
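To reproduce the expected two-column output, you can assign the transformed values to a new column (the column name below simply mirrors the expected result):
df['value column/mean of year'] = (df.groupby(df.index.year)['value column']
                                     .transform(lambda x: x / x.mean()))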
I just found another way, which I'm not sure I understand, but it works!
df['result'] = df['value column'].groupby(df.index.year).apply(lambda x: x/x.mean())
I thought that in apply functions, x was referring to single values of the array, but it seems that it refers to the group itself.
You should be able to do:
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
So you set the index to be the year in order to divide by the groupby object, and then reset the index to keep the original timestamps.
Full example:
import pandas as pd
import numpy as np
np.random.seed(1)
dr = pd.date_range('1-1-2010','1-1-2020', freq='H')
df = pd.DataFrame({'value column':np.random.rand(len(dr))}, index=dr)
print(df, '\n')
print(df.groupby(df.index.year).mean(), '\n')
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
print(df)
Output:
#original data
value column
2010-01-01 00:00:00 0.417022
2010-01-01 01:00:00 0.720324
2010-01-01 02:00:00 0.000114
2010-01-01 03:00:00 0.302333
2010-01-01 04:00:00 0.146756
...
2019-12-31 20:00:00 0.530828
2019-12-31 21:00:00 0.224505
2019-12-31 22:00:00 0.459977
2019-12-31 23:00:00 0.931504
2020-01-01 00:00:00 0.581869
[87649 rows x 1 columns]
#grouped by year
value column
2010 0.497135
2011 0.503547
2012 0.501023
2013 0.497848
2014 0.497065
2015 0.501417
2016 0.498303
2017 0.499266
2018 0.499533
2019 0.492220
2020 0.581869
#final output
value column
2010-01-01 00:00:00 0.838851
2010-01-01 01:00:00 1.448952
2010-01-01 02:00:00 0.000230
2010-01-01 03:00:00 0.608150
2010-01-01 04:00:00 0.295203
...
2019-12-31 20:00:00 1.078436
2019-12-31 21:00:00 0.456107
2019-12-31 22:00:00 0.934494
2019-12-31 23:00:00 1.892455
2020-01-01 00:00:00 1.000000
[87649 rows x 1 columns]
I have a dataframe that looks like this (I'm using Python 3.6.5 and datetime.time objects for the index):
print(sum_by_time)
Trips
Time
00:00:00 10
01:00:00 10
02:00:00 10
03:00:00 10
04:00:00 20
05:00:00 20
06:00:00 20
07:00:00 20
08:00:00 30
09:00:00 30
10:00:00 30
11:00:00 30
How can I group this dataframe by time interval to get something like this:
Trips
Time
00:00:00 - 03:00:00 40
04:00:00 - 07:00:00 80
08:00:00 - 11:00:00 120
I think you need to convert the index values to timedeltas with to_timedelta and then resample:
df.index = pd.to_timedelta(df.index.astype(str))
df = df.resample('4H').sum()
print (df)
Trips
00:00:00 40
04:00:00 80
08:00:00 120
EDIT:
For your desired format you need:
df['d'] = pd.to_datetime(df.index.astype(str))
df = df.groupby(pd.Grouper(freq='4H', key='d')).agg({'Trips':'sum', 'd':['first','last']})
df.columns = df.columns.map('_'.join)
df = df.set_index(df['d_first'].dt.strftime('%H:%M:%S') + ' - ' + df['d_last'].dt.strftime('%H:%M:%S'))[['Trips_sum']]
print (df)
Trips_sum
00:00:00 - 03:00:00 40
04:00:00 - 07:00:00 80
08:00:00 - 11:00:00 120
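An alternative that avoids the datetime round-trip is to bin the hour component directly. This is just a sketch; it assumes the index holds datetime.time objects, as stated in the question:
import numpy as np

hours = np.array([t.hour for t in df.index])
out = df.groupby(hours // 4)['Trips'].sum()   # bin 0 = 00-03, bin 1 = 04-07, ...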
I have a dataframe whose index is a timestamp in 'YYYY-MM-DD HH:MM:SS' format.
Now I want to divide this dataframe into two parts:
one with the data before 12 pm ('YYYY-MM-DD 12:00:00') of every day,
and another with the data from 12 pm onward, for every day.
I've been stuck on this question for several days. Any suggestions?
Thank you.
If you have a DatetimeIndex (and if you don't, df.index = pd.to_datetime(df.index) should work to get one), then you can access .hour, e.g. df.index.hour, and select using that:
>>> df.head()
A
2015-01-01 00:00:00 0
2015-01-01 01:00:00 1
2015-01-01 02:00:00 2
2015-01-01 03:00:00 3
2015-01-01 04:00:00 4
>>> morning = df[df.index.hour < 12]
>>> afternoon = df[df.index.hour >= 12]
>>> morning.head()
A
2015-01-01 00:00:00 0
2015-01-01 01:00:00 1
2015-01-01 02:00:00 2
2015-01-01 03:00:00 3
2015-01-01 04:00:00 4
>>> afternoon.head()
A
2015-01-01 12:00:00 12
2015-01-01 13:00:00 13
2015-01-01 14:00:00 14
2015-01-01 15:00:00 15
2015-01-01 16:00:00 16
You could also use groupby, e.g. df.groupby(df.index.hour < 12), but that seems like overkill here. If you wanted a more complex division that might be the way to go, though.
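For completeness, the groupby variant would look something like this (a sketch; the boolean key splits the frame into exactly the two halves):
parts = dict(tuple(df.groupby(df.index.hour < 12)))
morning, afternoon = parts[True], parts[False]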