I have DataFrame in following format:
Date Open High Low Close
0 2015-06-19 20:00:00 1201.60 1202.84 1201.55 1202.13
1 2015-06-19 21:00:00 1202.13 1202.50 1200.84 1200.88
2 2015-06-19 22:00:00 1200.88 1201.55 1200.61 1201.06
3 2015-06-19 23:00:00 1201.06 1201.26 1200.02 1200.57
4 2015-06-22 01:00:00 1200.57 1201.48 1197.04 1198.94
5 2015-06-22 02:00:00 1198.94 1199.79 1198.49 1199.34
6 2015-06-22 03:00:00 1199.34 1200.05 1198.64 1199.74
7 2015-06-22 04:00:00 1199.74 1200.34 1199.14 1199.66
I am trying to split this DataFrame by date, and then split each date into 4-hour chunks. Here is how I select the rows for a single date:
i = 0
# take the date part of row i (assumes "Date" holds strings like "2015-06-19 20:00:00")
this_date = df["Date"].iloc[i].split(" ")[0]
today = df[df["Date"].apply(lambda x: x.split(" ")[0]) == this_date]
Now I need to split the today DataFrame into 4-hour chunks. The last chunk will have only 3 rows in total, as the day ends at 23:00.
How can I do this? Is there an easy way, or do I need to map over the DataFrame and do it manually?
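For reference, a minimal sketch of one possible approach using pd.Grouper (an assumption on my part, not necessarily the only way; it assumes Date holds parseable strings as above):

import pandas as pd

# Parse Date once, then let pd.Grouper cut the frame into 4-hour bins.
# Bins align to midnight, so separate days never end up in the same group.
df['Date'] = pd.to_datetime(df['Date'])
for bin_start, chunk in df.groupby(pd.Grouper(key='Date', freq='4H')):
    if not chunk.empty:  # the grouper also emits empty bins
        print(bin_start, len(chunk))

Each chunk is a sub-DataFrame covering one 4-hour window.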
Related
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a day-by-row, hour-by-column format: the hours become the columns, and the Value column fills the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but with no success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove the Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis.
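For context, a small end-to-end sketch of the answer above, with the sample values assumed from the question:

import pandas as pd

# Toy data mirroring the question (values assumed)
df = pd.DataFrame({
    'Date': ['2022-01-01 10:00:00', '2022-01-01 10:30:00',
             '2022-01-01 11:00:00', '2022-02-15 21:00:00'],
    'Value': [7, 5, 3, 8],
})
df['Date'] = pd.to_datetime(df['Date'])

# Pivot dates to rows, times to columns, then drop both axis labels
out = (df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time,
                      fill_value=0)
         .rename_axis(index=None, columns=None))
print(out)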
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column.
# Hora_Retiro is timedelta64[ns] (see above); .dt.hour only exists for
# datetime columns, so take the hours from .dt.components
df['hour'] = df['Hora_Retiro'].dt.components.hours
Then group by that hour column:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str), format='%H').dt.time  # cast to str so '%H' parsing works
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, since in that case the date part would also be printed.
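A quick way to check (hora_pico as in the question):

print(hora_pico['Hora_Retiro'].dtype)  # timedelta64[ns] here; a datetime column would show datetime64[ns]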
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding would push times past the half hour into the next bin),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
    lambda tt: tt.floor('H'))).count_uses.count()
However, I advise you to decide what you actually want to count:
rows, or the values in the count_uses column.
In the latter case, replace the count function with sum.
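As a side note, the same idea vectorized with Series.dt.floor instead of apply (toy values assumed):

import pandas as pd

# Toy frame with a timedelta column, as in the question
hora_pico = pd.DataFrame({
    'Hora_Retiro': pd.to_timedelta(['00:00:18', '00:45:12', '23:59:56']),
    'count_uses': [1, 1, 1],
})

# Floor every timedelta to the hour, then aggregate per hour
print(hora_pico.groupby(hora_pico['Hora_Retiro'].dt.floor('H'))['count_uses'].sum())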
I have a timeseries data for a full year for every minute.
timestamp day hour min somedata
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to group the data by day, i.e. the 2010-01-01 data should be one group, 2010-01-02 another, and so on up to 2010-12-31.
I used daily_groupby = dataframe.groupby(pd.to_datetime(dataframe.index.day, unit='D', origin=pd.Timestamp('2009-12-31'))). This groups only by day of month, so day 01 of Jan, Feb, ..., Dec all land in one group. I also want to group by month so that January, February, etc. do not get mixed up.
I am a beginner in pandas.
If timestamp is the index, use DatetimeIndex.date:
df.groupby(pd.to_datetime(df.index).date)
otherwise use Series.dt.date:
df.groupby(pd.to_datetime(df['timestamp']).dt.date)
If you don't want to group by year, use:
time_index = pd.to_datetime(df.index)
df.groupby([time_index.month,time_index.day])
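For illustration, a minimal sketch of the index-based version with made-up minute-level data:

import pandas as pd

# Two days of minute-level data (values assumed)
idx = pd.date_range('2010-01-01', '2010-01-02 23:59', freq='T')
df = pd.DataFrame({'somedata': range(len(idx))}, index=idx)

# One group per calendar date, so months never mix
daily_groupby = df.groupby(pd.to_datetime(df.index).date)
print(daily_groupby.size())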
I have got two dataframes (i.e. df1 and df2).
df1 contains date and time columns. The time column holds a 30-minute-interval time series:
df1:
date time
0 2015-04-01 00:00:00
1 2015-04-01 00:30:00
2 2015-04-01 01:00:00
3 2015-04-01 01:30:00
4 2015-04-01 02:00:00
df2 contains date, start-time, end-time, value:
df2
INCIDENT_DATE INTERRUPTION_TIME RESTORE_TIME WASTED_MINUTES
0 2015-04-01 00:32 01:15 1056.0
1 2015-04-01 01:20 02:30 3234.0
2 2015-04-01 01:22 03:30 3712.0
3 2015-04-01 01:30 03:15 3045.0
Now I want to copy the WASTED_MINUTES column from df2 to df1 wherever the date columns of both DataFrames match and the INTERRUPTION_TIME in df2 falls inside the corresponding 30-minute interval of df1's time column. So the output should look like:
df1:
date time Wasted_columns
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
I tried the merge command (on the date column), but it didn't produce the desired result, because I am not sure how to check whether a time falls within a 30-minute interval. Could anyone explain how to fix this?
Convert time to timedelta and assign it back to df1. Convert INTERRUPTION_TIME to timedelta, floor it to the 30-minute interval, and assign the result to s. Group df2 by INCIDENT_DATE and s, and take the sum of WASTED_MINUTES. Finally, join the result of the groupby back to df1:
df1['time'] = pd.to_timedelta(df1['time'].astype(str)) #cast to str before calling `to_timedelta`
s = pd.to_timedelta(df2.INTERRUPTION_TIME+':00').dt.floor('30Min')
df_final = df1.join(df2.groupby(['INCIDENT_DATE', s]).WASTED_MINUTES.sum(),
on=['date', 'time'])
Output:
date time WASTED_MINUTES
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
You can do this:
df1['time'] = pd.to_datetime(df1['time'])
df1['Wasted_columns'] = df1.apply(
    lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME']) >= x['time']) &
                      (pd.to_datetime(df2['INTERRUPTION_TIME']) < x['time'] + pd.Timedelta(minutes=30)),
                      'WASTED_MINUTES'].sum(), axis=1)
df1['time'] = df1['time'].dt.time
If you convert the 'time' column in the lambda function itself, then it is just one line of code as below
df1['Wasted_columns']=df1.apply(lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME'])>= pd.to_datetime(x['time'])) & (pd.to_datetime(df2['INTERRUPTION_TIME'])< pd.to_datetime(x['time'])+pd.Timedelta(minutes=30)),'WASTED_MINUTES'].sum(), axis=1)
Output
date time Wasted_columns
0 2015-04-01 00:00:00 0.0
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 0.0
The idea:
+ Convert to datetime
+ Floor to the start of the 30-minute interval (the expected output puts the 00:32 incident in the 00:30 bin)
+ Merge
from datetime import datetime, timedelta

def floor_dt(dt, delta):
    # floor dt down to the start of its `delta`-sized interval
    return dt - (dt - datetime.min) % delta
# Convert
df1['dt'] = (df1['date'] + ' ' + df1['time']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M:%S'])
df2['dt'] = (df2['INCIDENT_DATE'] + ' ' + df2['INTERRUPTION_TIME']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M'])
# Floor
df2['dt'] = df2['dt'].apply(floor_dt, args=[timedelta(minutes=30)])
# Merge
final = df1.merge(df2.loc[:, ['dt', 'WASTED_MINUTES']], on='dt', how='left')
Also, if multiple incidents happen within the same 30-minute window, you would want to group df2 by the floored dt column first to sum up the wasted minutes, then merge; see the sketch below.
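For example, a short sketch using the names above (assumes df2['dt'] has already been floored):

# Aggregate per 30-minute bin before merging, so overlapping incidents add up
wasted = df2.groupby('dt', as_index=False)['WASTED_MINUTES'].sum()
final = df1.merge(wasted, on='dt', how='left')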
I have a .csv file with some data. There is only one column in this file, which contains timestamps. I need to organize the data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp',
freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size as:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
Or as per suggestion by jpp:
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1