Before everyone downvotes: this is a tricky question to phrase in a single title. For a given timestamp, I want to round down to the previous 15-minute mark when the timestamp is more than 10 minutes past it (i.e. 11-15 minutes). If it's within 10 minutes of it, I want to round to the previous, previous 15-minute mark.
This may be easier to display:
1st timestamp = 08:12:00. More than 10 minutes past the quarter, so round to the previous 15 min = 08:00:00
2nd timestamp = 08:07:00. Within 10 minutes of the quarter, so round to the previous, previous 15 min = 07:45:00
I can round values more than 10 minutes past the quarter easily enough; the ones within 10 minutes are where I'm struggling. My idea was to convert the timestamps to total seconds and determine whether the difference is less than 600 seconds (10 minutes). If less than 600 seconds, I would take another 15 minutes off; if more, I would leave it as is. Below is my attempt.
import pandas as pd
from datetime import datetime, timedelta

d = {'Time': ['8:10:00']}
df = pd.DataFrame(data=d)
df['Time'] = pd.to_datetime(df['Time'])

def hour_rounder(t):
    # Floor to the previous 15-minute mark
    return t.replace(second=0, microsecond=0, minute=t.minute // 15 * 15, hour=t.hour)

FirstTime = df['Time'].iloc[0]
StartTime = hour_rounder(FirstTime)

# Strip the date
FirstTime = datetime.time(FirstTime)
StartTime = datetime.time(StartTime)

# Convert timestamps to total seconds
def get_sec(time_str):
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

FirstTime_secs = get_sec(str(FirstTime))
StartTime_secs = get_sec(str(StartTime))

# Determine the difference
diff = FirstTime_secs - StartTime_secs
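For reference, a minimal sketch finishing this idea (round_down_special is my own hypothetical helper; it assumes the rule above, i.e. 10 minutes or less past the quarter steps back one more quarter):
def round_down_special(t):
    # floor to the previous quarter hour
    floored = t.replace(second=0, microsecond=0, minute=t.minute // 15 * 15)
    # within 10 minutes (600 s) of the quarter: step back one more quarter
    if (t - floored).total_seconds() <= 600:
        floored -= timedelta(minutes=15)
    return floored

print(round_down_special(pd.Timestamp('2019-01-01 08:12:00')))  # 2019-01-01 08:00:00
print(round_down_special(pd.Timestamp('2019-01-01 08:07:00')))  # 2019-01-01 07:45:00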
If you can work with timedeltas, first convert with to_timedelta, then floor with Series.dt.floor; if the minutes modulo 15 are less than or equal to 10, subtract another 15 minutes:
import numpy as np

d = {'Time': ['08:00:00', '08:01:00', '08:02:00', '08:03:00', '08:04:00',
'08:05:00', '08:06:00', '08:07:00', '08:08:00', '08:09:00',
'08:10:00', '08:11:00', '08:12:00', '08:13:00', '08:14:00',
'08:15:00', '08:16:00', '08:17:00', '08:18:00', '08:19:00',
'08:20:00', '08:21:00', '08:22:00', '08:23:00', '08:24:00',
'08:25:00', '08:26:00', '08:27:00', '08:28:00', '08:29:00',
'08:30:00', '08:31:00', '08:32:00', '08:33:00', '08:34:00',
'08:35:00', '08:36:00', '08:37:00', '08:38:00', '08:39:00']}
df = pd.DataFrame(d)
df['Time'] = pd.to_timedelta(df['Time'])
s = df['Time'].dt.floor(freq='15T')
# https://stackoverflow.com/a/14190143 for converting timedeltas to minutes
df['new'] = np.where(((df['Time'].dt.total_seconds() % 3600) // 60) % 15 <= 10,
s - pd.Timedelta(15 * 60, 's'), s)
print (df)
Time new
0 08:00:00 07:45:00
1 08:01:00 07:45:00
...
9 08:09:00 07:45:00
10 08:10:00 07:45:00
11 08:11:00 08:00:00
12 08:12:00 08:00:00
...
24 08:24:00 08:00:00
25 08:25:00 08:00:00
26 08:26:00 08:15:00
27 08:27:00 08:15:00
...
38 08:38:00 08:15:00
39 08:39:00 08:15:00
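A version note (my addition, not part of the original answer): pandas 2.2+ deprecates the 'T' offset alias in favor of 'min', so on recent versions the floor is spelled:
s = df['Time'].dt.floor(freq='15min')  # same result as freq='15T' on older pandas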
If you need to work with datetimes, the solution is similar, using Series.dt.minute:
df = pd.DataFrame({'Time':pd.date_range('2015-01-01 08:00:00', freq='T', periods=40)})
s = df['Time'].dt.floor(freq='15T')
df['new'] = np.where(df['Time'].dt.minute % 15 <= 10, s - pd.Timedelta(15*60, 's'), s)
print (df)
Time new
0 2015-01-01 08:00:00 2015-01-01 07:45:00
1 2015-01-01 08:01:00 2015-01-01 07:45:00
...
9 2015-01-01 08:09:00 2015-01-01 07:45:00
10 2015-01-01 08:10:00 2015-01-01 07:45:00
11 2015-01-01 08:11:00 2015-01-01 08:00:00
12 2015-01-01 08:12:00 2015-01-01 08:00:00
13 2015-01-01 08:13:00 2015-01-01 08:00:00
...
24 2015-01-01 08:24:00 2015-01-01 08:00:00
25 2015-01-01 08:25:00 2015-01-01 08:00:00
26 2015-01-01 08:26:00 2015-01-01 08:15:00
27 2015-01-01 08:27:00 2015-01-01 08:15:00
...
38 2015-01-01 08:38:00 2015-01-01 08:15:00
39 2015-01-01 08:39:00 2015-01-01 08:15:00
Alternative solution from comment:
df['new1'] = df['Time'].sub(pd.Timedelta(11*60, 's')).dt.floor(freq='15T')
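A quick sanity check (my own throwaway snippet) of why subtracting 11 minutes works: it shifts anything 0-10 minutes past a quarter back into the previous quarter before flooring, while 11-14 minutes past stay put:
t = pd.Series(pd.to_datetime(['2015-01-01 08:07:00', '2015-01-01 08:10:00',
                              '2015-01-01 08:11:00', '2015-01-01 08:12:00']))
print(t.sub(pd.Timedelta(minutes=11)).dt.floor('15T'))
# 07:45:00, 07:45:00, 08:00:00, 08:00:00 - matching the np.where version above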
Related
Given a DataFrame having a timestamp (ts), I'd like to downsample these to the hour. Values that were previously indexed by ts should now be divided into ratios based on the number of minutes left in an hour. [note: divide data in ratios for NaN columns while doing resampling]
ts event duration
0 2020-09-09 21:01:00 a 12
1 2020-09-10 00:10:00 a 22
2 2020-09-10 01:31:00 a 130
3 2020-09-10 01:50:00 b 60
4 2020-09-10 01:51:00 b 50
5 2020-09-10 01:59:00 b 26
6 2020-09-10 02:01:00 c 72
7 2020-09-10 02:51:00 b 51
8 2020-09-10 03:01:00 b 63
9 2020-09-10 04:01:00 c 79
def create_dataframe():
    df = pd.DataFrame([{'duration': 12, 'event': 'a', 'ts': '2020-09-09 21:01:00'},
                       {'duration': 22, 'event': 'a', 'ts': '2020-09-10 00:10:00'},
                       {'duration': 130, 'event': 'a', 'ts': '2020-09-10 01:31:00'},
                       {'duration': 60, 'event': 'b', 'ts': '2020-09-10 01:50:00'},
                       {'duration': 50, 'event': 'b', 'ts': '2020-09-10 01:51:00'},
                       {'duration': 26, 'event': 'b', 'ts': '2020-09-10 01:59:00'},
                       {'duration': 72, 'event': 'c', 'ts': '2020-09-10 02:01:00'},
                       {'duration': 51, 'event': 'b', 'ts': '2020-09-10 02:51:00'},
                       {'duration': 63, 'event': 'b', 'ts': '2020-09-10 03:01:00'},
                       {'duration': 79, 'event': 'c', 'ts': '2020-09-10 04:01:00'},
                       {'duration': 179, 'event': 'c', 'ts': '2020-09-10 06:05:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
I want to estimate the amount produced per hour based on the ratio of time spent. It's comparable to asking how many lines of code were completed, or how many actual lines per hour.
For example: at "2020-09-10 00:10:00" we have 22. During the period from 21:01 to 00:10 (189 minutes), production is apportioned as:
59 min of 21:00 hours -> 7 => =ROUND(22/189*59,0)
60 min of 22:00 hours -> 7 => =ROUND(22/189*60,0)
60 min of 23:00 hours -> 7 => =ROUND(22/189*60,0)
10 min of 00:00 hours -> 1 => =ROUND(22/189*10,0)
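A quick arithmetic check of that example (my own throwaway snippet):
total_mins = 59 + 60 + 60 + 10  # 189 minutes from 21:01 to 00:10
for mins in (59, 60, 60, 10):
    print(mins, round(22 / total_mins * mins))
# prints 59 7, 60 7, 60 7, 10 1 - matching the ROUND formulas above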
The result should look something like this:
ts event duration
0 2020-09-09 20:00:00 a NaN
1 2020-09-09 21:00:00 a 7
2 2020-09-09 22:00:00 a 7
3 2020-09-09 23:00:00 a 7
4 2020-09-10 00:00:00 a 1
5 2020-09-10 01:00:00 b ..
6 2020-09-10 02:00:00 c ..
Problem with this approach:
It appears to me that we have a serious issue with this approach. If you look at rows[1] (ts 2020-09-10 10:00:00), we have a duration of 4 that needs to be divided across the 3 hours since 07:00:00. Taking a base duration value of 1 (base unit), we get:
def create_dataframe2():
    df = pd.DataFrame([{'duration': 4, 'event': 'c', 'c': 'event3.5', 'ts': '2020-09-10 07:00:00'},
                       {'duration': 4, 'event': 'c', 'c': 'event3.5', 'ts': '2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
Source
duration event c ts
0 4 c event3.5 2020-09-10 07:00:00
1 4 c event3.5 2020-09-10 10:00:00
Expected Output
ts_hourly mins duration
0 2020-09-10 07:00:00 60.0 2
1 2020-09-10 08:00:00 60.0 1
2 2020-09-10 09:00:00 60.0 1
3 2020-09-10 10:00:00 0.0 0
The first step is to add a "previous ts" column to the source DataFrame:
df['tsPrev'] = df.ts.shift()
Then set ts column as the index:
df.set_index('ts', inplace=True)
The third step is to create an auxiliary index, composed of the original
index and "full hours":
ind = df.event.resample('H').asfreq().index.union(df.index)
Then create an auxiliary DataFrame, reindexed with the just-created index, and back-fill the event column:
df2 = df.reindex(ind)
df2.event = df2.event.bfill()
Define a function to be applied to each group of rows from df2:
def parts(grp):
    lstRow = grp.iloc[-1]  # last row from the group (the one with a non-null duration)
    if pd.isna(lstRow.tsPrev):  # first group - no previous timestamp
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    # Other groups: anchor a zero at tsPrev, interpolate by time, then take
    # backward differences to get each sub-interval's share
    # (note: Series.append was removed in pandas 2.0; use pd.concat there)
    return -pd.Series([0], index=[lstRow.tsPrev]).append(grp.duration)\
        .interpolate(method='index').round().diff(-1)[:-1].astype(int)
Then generate the source data for "produced" column in 2 steps:
Generate detailed data:
prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1],
sort=False).apply(parts).reset_index(level=0, drop=True)
The source is df2, grouped so that each group ends with a row holding a non-null value in the duration column. Each group is then processed with the parts function.
The result is:
2020-09-09 21:00:00 12
2020-09-09 21:01:00 7
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 1
2020-09-10 00:10:00 80
2020-09-10 01:00:00 50
2020-09-10 01:31:00 60
2020-09-10 01:50:00 50
2020-09-10 01:51:00 26
2020-09-10 01:59:00 36
2020-09-10 02:00:00 36
2020-09-10 02:01:00 51
2020-09-10 02:51:00 57
2020-09-10 03:00:00 6
2020-09-10 03:01:00 78
2020-09-10 04:00:00 1
2020-09-10 04:01:00 85
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
dtype: int32
Generate aggregated data, for the time being also as a Series:
prod = prodDet.resample('H').sum().rename('produced')
This time prodDet is resampled (broken down by hours) and the
result is the sum of values.
The result is:
2020-09-09 21:00:00 19
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 81
2020-09-10 01:00:00 222
2020-09-10 02:00:00 144
2020-09-10 03:00:00 84
2020-09-10 04:00:00 86
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
Freq: H, Name: produced, dtype: int32
Let's describe the content of prodDet:
There is no row for 2020-09-09 20:00:00, because no source row falls in this hour (your data starts at 21:01:00).
Row 21:00:00 12 comes from the first source row (you forgot it when writing the expected result).
Rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from "partitioning" of row 00:10:00 a 22, just as in your expected result.
Rows with 80 and 50 come from the row containing 130, divided between the rows at 00:10:00 and 01:00:00.
And so on.
Now we start to assemble the final result.
Join prod (converted to a DataFrame) with event column:
result = prod.to_frame().join(df2.event)
Add a tsMin column - the minimal ts in each hour (as you asked in one of the comments):
result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
Change the index into a regular column and set its name to ts
(like in the source DataFrame):
result = result.reset_index().rename(columns={'index': 'ts'})
The final result is:
ts produced event tsMin
0 2020-09-09 21:00:00 19 a 2020-09-09 21:01:00
1 2020-09-09 22:00:00 7 a NaT
2 2020-09-09 23:00:00 7 a NaT
3 2020-09-10 00:00:00 81 a 2020-09-10 00:10:00
4 2020-09-10 01:00:00 222 a 2020-09-10 01:31:00
5 2020-09-10 02:00:00 144 c 2020-09-10 02:01:00
6 2020-09-10 03:00:00 84 b 2020-09-10 03:01:00
7 2020-09-10 04:00:00 86 c 2020-09-10 04:01:00
8 2020-09-10 05:00:00 87 c NaT
9 2020-09-10 06:00:00 7 c 2020-09-10 06:05:00
E.g. the value of 81 for 00:00:00 is the sum of 1 and 80 (the first part resulting from the row with 130); see prodDet above.
Some values in the tsMin column are empty, for hours in which there is no source row.
If you want to drop the result from the first row entirely (the one with duration == 12), change return pd.Series([lstRow.duration]... to return pd.Series([0]... in the first-group branch of the parts function.
To sum up, my solution is more pandasonic and significantly shorter than yours (about 17 lines versus about 70, excluding comments).
I was not able to find a solution in pandas alone, so I created one in plain Python.
Basically, after sorting, I iterate over all the values and send two datetimes, viz. start_time and end_time, to a function which does the processing.
def get_ratio_per_hour(start_time: pd.Timestamp, end_time: pd.Timestamp, data_: int):
    # get the total hours between start and end; used to drive the loop
    totalhrs = lambda x: [1] * int(x // 3600) + [x % 3600 / 3600
                                                 or 0.1]  # 0.1 added for loop fix afterwards
    # check if start and end are not in the same hour
    if start_time.hour != end_time.hour:
        seconds = (end_time - start_time).total_seconds()
        if seconds < 3600:
            parts_ = [1] + totalhrs(seconds)
        else:
            parts_ = totalhrs(seconds)
    else:
        # parts_ defines the loop iterations
        parts_ = totalhrs((end_time - start_time).total_seconds())
    sum_of_hrs = sum(parts_)

    # for constructing the DataFrame
    new_hours = []
    mins = []

    # clone data
    start_time_ = start_time
    end_time_ = end_time
    for e in range(len(parts_)):
        if sum_of_hrs != 0:
            if sum_of_hrs > 1:
                if end_time_.hour != start_time_.hour:
                    # floor based on the start time + 1 hour
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # hour is the same
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
            else:
                if end_time_.hour != start_time_.hour:
                    # get the rounded-off hour
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append(60 - ((floor_time - end_time_).total_seconds() // 60))
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # hour is the same
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append((end_time_ - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time

    # build the output DataFrame
    df_out = pd.DataFrame()
    df_out['hours'] = pd.Series(new_hours)
    df_out['mins'] = pd.Series(mins)
    df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])
    return df_out
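To see the function on a single interval (my own check; the expected numbers match the first rows of the intermediate DataFrame further below):
start = pd.Timestamp('2020-09-09 21:01:00')
end = pd.Timestamp('2020-09-10 00:10:00')
print(get_ratio_per_hour(start, end, 22))
#                 hours  mins  ratios
# 0 2020-09-09 21:00:00  59.0     7.0
# 1 2020-09-09 22:00:00  60.0     7.0
# 2 2020-09-09 23:00:00  60.0     7.0
# 3 2020-09-10 00:00:00  10.0     1.0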
Now, let's run the code for each iteration:
import numpy as np

time_val = []
split_f_val = []
split_field = 'duration'
time_field = 'ts'

# create DataFrames for intermediate results
df_final = pd.DataFrame()
df2 = pd.DataFrame()

for ix, row in df.iterrows():
    time_val.append(row[str(time_field)])
    split_f_val.append(int(row[str(split_field)]))
    # skip the first element, so there are always at least two data values
    if ix != 0:
        # get the last two values
        new_time_list = time_val[-2:]
        new_data_list = split_f_val[-2:]
        # get times to compare
        start_time = new_time_list[:-1][0]
        end_time = new_time_list[1:][0]
        # get the latest data value to divide
        data_ = new_data_list[1:][0]
        df2 = get_ratio_per_hour(start_time, end_time, data_)
        df_final = pd.concat([df_final, df2], ignore_index=True)
    else:
        # create an empty DataFrame for the first value
        df_final = pd.DataFrame([[np.nan, np.nan, np.nan]],
                                columns=['hours', 'mins', 'ratios'])
        df_final = pd.concat([df_final, df2], ignore_index=True)

result = df_final.groupby(['hours'])['ratios'].sum()
Intermediate DataFrame:
hours mins ratios
0 NaN NaN NaN
0 2020-09-09 21:00:00 59.0 7.0
1 2020-09-09 22:00:00 60.0 7.0
2 2020-09-09 23:00:00 60.0 7.0
3 2020-09-10 00:00:00 10.0 1.0
0 2020-09-10 00:00:00 50.0 80.0
1 2020-09-10 01:00:00 31.0 50.0
0 2020-09-10 01:00:00 19.0 60.0
0 2020-09-10 01:00:00 1.0 50.0
0 2020-09-10 01:00:00 8.0 26.0
0 2020-09-10 01:00:00 1.0 36.0
1 2020-09-10 02:00:00 1.0 36.0
0 2020-09-10 02:00:00 50.0 51.0
0 2020-09-10 02:00:00 9.0 57.0
1 2020-09-10 03:00:00 1.0 6.0
0 2020-09-10 03:00:00 59.0 78.0
1 2020-09-10 04:00:00 1.0 1.0
0 2020-09-10 04:00:00 59.0 85.0
1 2020-09-10 05:00:00 60.0 87.0
2 2020-09-10 06:00:00 5.0 7.0
Final Output:
hours ratios
2020-09-09 21:00:00 7.0
2020-09-09 22:00:00 7.0
2020-09-09 23:00:00 7.0
2020-09-10 00:00:00 81.0
2020-09-10 01:00:00 222.0
2020-09-10 02:00:00 144.0
2020-09-10 03:00:00 84.0
2020-09-10 04:00:00 86.0
2020-09-10 05:00:00 87.0
2020-09-10 06:00:00 7.0
I have the following time series data of temperature readings:
DT Temperature
01/01/2019 0:00 41
01/01/2019 1:00 42
01/01/2019 2:00 44
......
01/01/2019 23:00 41
01/02/2019 0:00 44
I am trying to write a function that compares the hourly change in temperature for a given day. Any change greater than 3 should increment a quickChange counter. Something like this:
def countChange(day):
    for dt in day:
        if dt+1 - dt > 3: quickChange = quickChange + 1
I can call the function for a day, e.g. countChange(df.loc['2018-01-01']).
Use Series.diff, compare with 3, and count the True values with sum:
import numpy as np
import pandas as pd

np.random.seed(2019)
rng = (pd.date_range('2018-01-01', periods=10, freq='H').tolist() +
       pd.date_range('2018-01-02', periods=10, freq='H').tolist())
df = pd.DataFrame({'Temperature': np.random.randint(100, size=20)}, index=rng)
print (df)
print (df)
Temperature
2018-01-01 00:00:00 72
2018-01-01 01:00:00 31
2018-01-01 02:00:00 37
2018-01-01 03:00:00 88
2018-01-01 04:00:00 62
2018-01-01 05:00:00 24
2018-01-01 06:00:00 29
2018-01-01 07:00:00 15
2018-01-01 08:00:00 12
2018-01-01 09:00:00 16
2018-01-02 00:00:00 48
2018-01-02 01:00:00 71
2018-01-02 02:00:00 83
2018-01-02 03:00:00 12
2018-01-02 04:00:00 80
2018-01-02 05:00:00 50
2018-01-02 06:00:00 95
2018-01-02 07:00:00 5
2018-01-02 08:00:00 24
2018-01-02 09:00:00 28
# if necessary, create a DatetimeIndex if DT is a column
df = df.set_index("DT")

def countChange(day):
    return (day['Temperature'].diff() > 3).sum()
print (countChange(df.loc['2018-01-01']))
4
print (countChange(df.loc['2018-01-02']))
6
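To get the counts for every day at once, one option (my sketch, reusing countChange and the DatetimeIndex built above) is grouping by date:
print(df.groupby(df.index.date).apply(countChange))
# 2018-01-01    4
# 2018-01-02    6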
Try pandas.DataFrame.diff:
df = pd.DataFrame({'dt': ["01/01/2019 0:00", "01/01/2019 1:00", "01/01/2019 2:00", "01/01/2019 23:00", "01/02/2019 0:00"],
                   'Temperature': [41, 42, 44, 41, 44]})
df["dt"] = pd.to_datetime(df["dt"])  # convert so day-based .loc slicing works
df = df.sort_values("dt")
df = df.set_index("dt")

def countChange(df):
    df["diff"] = df["Temperature"].diff()
    return df.loc[df["diff"] > 3, "diff"].count()

quickchange = countChange(df.loc["2019-01-01"])
I have raw data like this and want to find the difference between these two times in minutes. The problem is that the data is in a DataFrame.
source:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
I need output like this:
duration
540 mins
798 mins
162 mins
1140 mins
420 mins
Your expected output seems to be incorrect. That aside, we can use base R's difftime:
transform(
df,
duration = difftime(
strptime(end.time, format = "%H:%M:%S"),
strptime(start.time, format = "%H:%M:%S"),
units = "mins"))
# start.time end.time duration
#0 08:30:00 17:30:00 540 mins
#1 11:00:00 17:30:00 390 mins
#2 08:00:00 21:30:00 810 mins
#3 19:30:00 22:00:00 150 mins
#4 19:00:00 00:00:00 -1140 mins
#5 08:30:00 15:30:00 420 mins
or as a difftime vector
with(df, difftime(
strptime(end.time, format = "%H:%M:%S"),
strptime(start.time, format = "%H:%M:%S"),
units = "mins"))
#Time differences in mins
#[1] 540 390 810 150 -1140 420
Sample data
df <- read.table(text =
" 'start time' 'end time'
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00", header = T, row.names = 1)
import pandas as pd
df = pd.DataFrame({'start time':['08:30:00','11:00:00','08:00:00','19:30:00','19:00:00','08:30:00'],'end time':['17:30:00','17:30:00','21:30:00','22:00:00','00:00:00','15:30:00']},columns=['start time','end time'])
df
Out[355]:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
(pd.to_datetime(df['end time']) - pd.to_datetime(df['start time'])).dt.seconds/60
Out[356]:
0 540.0
1 390.0
2 810.0
3 150.0
4 300.0
5 420.0
dtype: float64
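Note that row 4 (19:00 to 00:00) comes out as 300 only because .dt.seconds discards the negative days component of the timedelta. A variant (my sketch) that makes the midnight-crossing assumption explicit, assuming no interval exceeds 24 hours:
mins = ((pd.to_datetime(df['end time']) - pd.to_datetime(df['start time']))
        .dt.total_seconds().div(60).mod(24 * 60))
print(mins)  # 540, 390, 810, 150, 300, 420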
Yes, definitely datetime is what you need here. Specifically, the strptime function, which parses a string into a time object.
from datetime import datetime
s1 = '10:33:26'
s2 = '11:15:49' # for example
FMT = '%H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
That gets you a timedelta object that contains the difference between the two times. You can do whatever you want with that, e.g. converting it to seconds or adding it to another datetime.
This will return a negative result if the end time is earlier than the start time, for example s1 = 12:00:00 and s2 = 05:00:00. If you want the code to assume the interval crosses midnight in this case (i.e. it should assume the end time is never earlier than the start time), you can add the following lines to the above code:
if tdelta.days < 0:
    tdelta = timedelta(days=0,
                       seconds=tdelta.seconds, microseconds=tdelta.microseconds)
(of course you need to include from datetime import timedelta somewhere). Thanks to J.F. Sebastian for pointing out this use case.
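A worked example of the midnight fix (my own snippet, with s1 later in the day than s2):
from datetime import datetime, timedelta

FMT = '%H:%M:%S'
tdelta = datetime.strptime('05:00:00', FMT) - datetime.strptime('12:00:00', FMT)
if tdelta.days < 0:
    tdelta = timedelta(days=0, seconds=tdelta.seconds, microseconds=tdelta.microseconds)
print(tdelta)  # 17:00:00, i.e. 12:00:00 to 05:00:00 across midnight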
I have a dataframe that looks like this:
I'm using Python 3.6.5 and a datetime.time object for the index.
print(sum_by_time)
Trips
Time
00:00:00 10
01:00:00 10
02:00:00 10
03:00:00 10
04:00:00 20
05:00:00 20
06:00:00 20
07:00:00 20
08:00:00 30
09:00:00 30
10:00:00 30
11:00:00 30
How can I group this dataframe by time interval to get something like this:
Trips
Time
00:00:00 - 03:00:00 40
04:00:00 - 07:00:00 80
08:00:00 - 11:00:00 120
I think you need to convert the index values to timedeltas with to_timedelta and then resample:
df.index = pd.to_timedelta(df.index.astype(str))
df = df.resample('4H').sum()
print (df)
Trips
00:00:00 40
04:00:00 80
08:00:00 120
EDIT:
For your desired format, you need:
df['d'] = pd.to_datetime(df.index.astype(str))
df = df.groupby(pd.Grouper(freq='4H', key='d')).agg({'Trips':'sum', 'd':['first','last']})
df.columns = df.columns.map('_'.join)
df = df.set_index(df['d_first'].dt.strftime('%H:%M:%S') + ' - ' + df['d_last'].dt.strftime('%H:%M:%S'))[['Trips_sum']]
print (df)
Trips_sum
00:00:00 - 03:00:00 40
04:00:00 - 07:00:00 80
08:00:00 - 11:00:00 120
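A slightly shorter variant (my sketch, starting again from the original frame as the EDIT does; it assumes fixed 4-hour buckets, so the end label can be derived as bucket start plus 3 hours instead of taking the last observed timestamp):
df['d'] = pd.to_datetime(df.index.astype(str))
out = df.groupby(pd.Grouper(freq='4H', key='d'))['Trips'].sum()
out.index = (out.index.strftime('%H:%M:%S') + ' - ' +
             (out.index + pd.Timedelta(hours=3)).strftime('%H:%M:%S'))
print(out)
# 00:00:00 - 03:00:00     40
# 04:00:00 - 07:00:00     80
# 08:00:00 - 11:00:00    120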
I have a dataframe of many days that looks like this, with consecutive rows at 30-minute intervals:
a b
2006-05-08 09:30:00 10 13
2006-05-08 10:00:00 11 12
.
.
.
2006-05-08 15:30:00 15 14
2006-05-08 16:00:00 16 15
However, I only care about certain specific times, so I want EVERY DAY of the df to look like:
2006-05-08 09:30:00 10 13
2006-05-08 11:30:00 14 15
2006-05-08 13:00:00 18 15
2006-05-08 16:00:00 16 15
Meaning, I just want to keep the rows at the times (16:00, 13:00, 11:30, 9:30) for all the different days in the dataframe.
Thanks
Update:
I made a bit of progress, using
hour = df.index.hour
selector = ((hour == 16) | (hour == 13) | (hour == 11) | (hour == 9))
df = df[selector]
However, I need to account for the minutes too, so I tried:
minute = df.index.minute
selector = ((hour == 16) & (minute == 0) | (hour == 3) & (minute == 0) | (hour == 9) & (minute == 30) | (hour == 12) & (minute == 0))
But I get an error:
ValueError: operands could not be broadcast together with shapes (96310,) (16500,)
import numpy as np
import pandas as pd
N = 100
df = pd.DataFrame(range(N), index=pd.date_range('2000-1-1', freq='30T',
periods=N))
mask = np.in1d((df.index.hour)*100+(df.index.minute), [930, 1130, 1300, 1600])
print(df.loc[mask])
yields
0
2000-01-01 09:30:00 19
2000-01-01 11:30:00 23
2000-01-01 13:00:00 26
2000-01-01 16:00:00 32
2000-01-02 09:30:00 67
2000-01-02 11:30:00 71
2000-01-02 13:00:00 74
2000-01-02 16:00:00 80
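An alternative (my sketch) that avoids the hour*100 + minute encoding by comparing datetime.time objects directly:
import datetime

times = [datetime.time(9, 30), datetime.time(11, 30),
         datetime.time(13, 0), datetime.time(16, 0)]
print(df.loc[pd.Index(df.index.time).isin(times)])  # same rows as the mask above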