I want to check whether values in a column are constant or have little variation (max - min < 1) at a location for a certain duration (referred to as the "constant phenomenon"). If so, I want to identify and remove the last row (the one with the latest timestamp) that contributes to the "constant phenomenon". Removing this row keeps the constant duration under the given threshold.
An example input/output pair is shown below:
Input:
LOCATION DATE_TIME random value
0 111 2016-01-01 00:45:00 2 1
1 111 2016-01-01 12:15:00 9 9
2 111 2016-01-01 12:30:00 6 9
3 111 2016-01-01 12:45:00 3 9
4 111 2016-01-01 13:00:00 6 9
5 222 2016-01-01 12:00:00 3 9
6 222 2016-01-01 12:15:00 3 4
7 222 2016-01-01 12:30:00 5 5
8 333 2016-01-01 12:00:00 9 5
9 333 2016-01-01 12:15:00 2 5
10 333 2016-01-01 12:30:00 6 5
11 333 2016-01-02 12:30:00 8 5
12 444 2016-01-01 12:00:00 4 5
Output (original rows 3, 4, 10, and 11 are removed):
LOCATION DATE_TIME random value
0 111 2016-01-01 00:45:00 1 1
1 111 2016-01-01 12:15:00 7 9
2 111 2016-01-01 12:30:00 1 9
3 222 2016-01-01 12:00:00 8 9
4 222 2016-01-01 12:15:00 1 4
5 222 2016-01-01 12:30:00 9 5
6 333 2016-01-01 12:00:00 1 5
7 333 2016-01-01 12:15:00 1 5
8 444 2016-01-01 12:00:00 6 5
ADDED EXPLANATION TO THE EXAMPLE:
Original row 3 is removed because the values are "constant" from 12:15:00 to 12:45:00, which reaches the 30min threshold. By removing it, the time window starting at 12:15:00 is clear. However, the next time window, starting at 12:30:00, runs into the same issue due to row 4, i.e. a constant value again lasting 30min. Therefore, row 4 is removed as well.
A few other things to note: 1) the values are not at a fixed time interval; 2) the cells under 'value' and 'random' can be integers or floats; 3) once identified, the row that contributes to the "constant phenomenon" is removed immediately, i.e. the removal is reflected in the next round of checking. A small hand check of the rule is shown below.
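For instance, the first removal for location 111 can be verified by hand; this is just a tiny illustration of the rule, not part of the cleaning code:
import pandas as pd

# location 111: the 30min window starting at 12:15 plus its critical row at 12:45
window = pd.Series([9, 9, 9], index=pd.to_datetime(
    ['2016-01-01 12:15:00', '2016-01-01 12:30:00', '2016-01-01 12:45:00']))
print(window.max() - window.min() < 1)  # True -> the 12:45 row is removed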
I have working code, shown below, that checks within each time window, but it seems very exhaustive and time-consuming.
I feel like it could be done much faster, e.g. using generators instead of iterators or a recursive function, but I am not sure how to code it.
Thank you for your time.
import pandas as pd
import random
import datetime
import sys

def check_duration(df_all, measure, maxd):
    link_group = df_all.groupby('LOCATION', as_index=False)
    df_all_cleansed = pd.DataFrame()
    df_remove = pd.DataFrame()
    for k in link_group.groups:
        df_link = link_group.get_group(k)
        df_link = df_link.sort_values(['DATE_TIME'])
        df_link.reset_index(drop=True, inplace=True)
        # Until which row is the check necessary
        num_check = len(df_link.loc[df_link['DATE_TIME'] <= max(df_link['DATE_TIME']) - maxd])
        i = 0
        while i < num_check:
            start = df_link['DATE_TIME'].iloc[i]
            end = start + maxd
            if len(df_link.loc[df_link['DATE_TIME'] >= end]) == 0:
                break
            else:
                # A time window with duration maxd as the base
                df_target = df_link.loc[(df_link['DATE_TIME'] < end) & (df_link['DATE_TIME'] >= start)]
                # The critical row that may or may not be removed depending on the check below
                df_critical = df_link.loc[[df_link.loc[df_link['DATE_TIME'] >= end, 'DATE_TIME'].idxmin()]]
                df_appended = df_critical.append(df_target)
                # Check within this window whether 1) there is more than one data point
                # and 2) the max-min spread is below 1
                if len(df_appended) > 1 and (max(df_appended[measure]) - min(df_appended[measure])) < 1:
                    that_time = df_critical['DATE_TIME'].iloc[0]
                    print('Removed timestamp: ', that_time)
                    that_index_list = df_link[df_link['DATE_TIME'] == that_time].index.tolist()
                    # Just a precautionary step in case of duplicated date_time (very unlikely)
                    if len(that_index_list) == 1:
                        that_index = that_index_list[0]
                    else:
                        print('Duplicated date_time!!!')
                        sys.exit(0)
                    df_link.drop(that_index, inplace=True)
                    df_remove = df_remove.append(df_critical)
                    df_link.reset_index(drop=True, inplace=True)
                else:
                    i += 1
        df_all_cleansed = df_all_cleansed.append(df_link)
    df_all_cleansed.reset_index(drop=True, inplace=True)
    return df_all_cleansed, df_remove
d = {'LOCATION': [111, 111, 111, 111, 111, 222, 222, 222, 333, 333, 333, 333, 444],
     'DATE_TIME': ['2016-01-01 00:45:00', '2016-01-01 12:15:00',
                   '2016-01-01 12:30:00', '2016-01-01 12:45:00', '2016-01-01 13:00:00',
                   '2016-01-01 12:00:00', '2016-01-01 12:15:00', '2016-01-01 12:30:00',
                   '2016-01-01 12:00:00', '2016-01-01 12:15:00', '2016-01-01 12:30:00',
                   '2016-01-02 12:30:00', '2016-01-01 12:00:00'],
     'random': [random.randint(0, 9) for p in range(13)],
     'value': [1, 9, 9, 9, 9, 9, 4, 5, 5, 5, 5, 5, 5]}
df = pd.DataFrame(data=d)
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
print('******************* BEFORE: *******************\n', df)
threshold_duration = datetime.timedelta(days=0, hours=0, minutes=30, seconds=0)
df, df_remove = check_duration(df, 'value', threshold_duration)
print('******************* AFTER: *******************\n', df)
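In case it helps, below is a minimal sketch of one way to speed this up (my own sketch, not a tested drop-in replacement): the same window rule, but run per location on plain numpy arrays, with np.searchsorted locating each window's critical row instead of repeated boolean DataFrame filtering. On the example data it reproduces the same four removals.
import numpy as np
import pandas as pd

def check_duration_fast(df_all, measure, maxd):
    # Same rule as check_duration, but per location on numpy arrays:
    # np.searchsorted finds the critical row instead of boolean masking.
    kept, removed = [], []
    for _, g in df_all.groupby('LOCATION', sort=False):
        g = g.sort_values('DATE_TIME')
        times = g['DATE_TIME'].to_numpy()
        vals = g[measure].to_numpy(dtype=float)
        alive = np.ones(len(g), dtype=bool)
        idx = np.arange(len(g))
        i = 0
        while True:
            live = idx[alive]              # positions of rows not yet removed
            if i >= len(live):
                break
            start = times[live[i]]
            # first live row at or after the window end = the critical row
            j = i + np.searchsorted(times[live[i:]], start + maxd)
            if j >= len(live):
                break                      # no critical row -> done with this location
            window = vals[live[i:j + 1]]   # window rows plus the critical row
            if window.max() - window.min() < 1:
                alive[live[j]] = False     # drop the critical row, re-check the same i
            else:
                i += 1
        kept.append(g[alive])
        removed.append(g[~alive])
    return pd.concat(kept, ignore_index=True), pd.concat(removed, ignore_index=True)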
Related
This is a real use case that I am trying to implement in my work.
Sample data (fake data, but with a similar structure):
Lap Starttime Endtime
1 10:00:00 10:05:00
format: hh:mm:ss
Desired output
Lap time
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
So far I am only trying to think through the logic and techniques required... the code below is not correct:
import re
import pandas as pd

df = pd.read_csv('sample.csv')

#1. to determine how many rows to generate. eg. 1000 to 1005 is 6 rows
df['time'] = df['Endtime'] - df['Startime']

#2. add one new row with 1 added minute. eg. 6 rows
for i in No_of_rows:
    if df['time'] < df['Endtime']: #if 'time' still before end time, then continue append
        df['time'] = df['Startime'] += 1 #not sure how to select Minute part only
    else:
        continue
Pardon my limited coding skills. I appreciate all the help from you experts, thanks!
Try with pd.date_range and explode:
#convert to datetime if needed
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
#create list of 1min ranges
df["Range"] = df.apply(lambda x: pd.date_range(x["Starttime"], x["Endtime"], freq="1min"), axis=1)
#explode, drop unneeded columns and keep only time
df = df.drop(["Starttime", "Endtime"], axis=1).explode("Range")
df["Range"] = df["Range"].dt.time
>>> df
Range
Lap
1 10:00:00
1 10:01:00
1 10:02:00
1 10:03:00
1 10:04:00
1 10:05:00
Input df:
df = pd.DataFrame({"Lap": [1],
"Starttime": ["10:00:00"],
"Endtime": ["10:05:00"]}).set_index("Lap")
>>> df
Starttime Endtime
Lap
1 10:00:00 10:05:00
You can convert the times to datetimes; that will arbitrarily prepend today's date (whatever day you run this on), but we can remove it later, and it allows for easier manipulation:
>>> bounds = df[['Starttime', 'Endtime']].transform(pd.to_datetime)
>>> bounds
Starttime Endtime
0 2021-09-29 10:00:00 2021-09-29 10:05:00
1 2021-09-29 10:00:00 2021-09-29 10:02:00
Then we can simply use pd.date_range with a 1 minute frequency:
>>> times = bounds.agg(lambda s: pd.date_range(*s, freq='1min'), axis='columns')
>>> times
0 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
1 DatetimeIndex(['2021-09-29 10:00:00', '2021-09...
dtype: object
Now joining that with the Lap info and using df.explode():
>>> result = df[['Lap']].join(times.rename('time')).explode('time').reset_index(drop=True)
>>> result
Lap time
0 1 2021-09-29 10:00:00
1 1 2021-09-29 10:01:00
2 1 2021-09-29 10:02:00
3 1 2021-09-29 10:03:00
4 1 2021-09-29 10:04:00
5 1 2021-09-29 10:05:00
6 2 2021-09-29 10:00:00
7 2 2021-09-29 10:01:00
8 2 2021-09-29 10:02:00
Finally, we remove the day part:
>>> result['time'] = result['time'].dt.time
>>> result
Lap time
0 1 10:00:00
1 1 10:01:00
2 1 10:02:00
3 1 10:03:00
4 1 10:04:00
5 1 10:05:00
6 2 10:00:00
7 2 10:01:00
8 2 10:02:00
The objects in your series are now datetime.time
Here is another way without using apply/agg:
Convert to datetime first:
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")
Get the difference between the end and start times, then repeat the rows with index.repeat. Then, using groupby and cumcount, build per-row minute offsets with pd.to_timedelta and add them to the existing start time:
repeats = df['Endtime'].sub(df['Starttime']).dt.total_seconds()//60
out = df.loc[df.index.repeat(repeats+1),['Lap','Starttime']]
out['Starttime'] = (out['Starttime'].add(
    pd.to_timedelta(out.groupby("Lap").cumcount(), 'min')).dt.time)
print(out)
Lap Starttime
0 1 10:00:00
0 1 10:01:00
0 1 10:02:00
0 1 10:03:00
0 1 10:04:00
0 1 10:05:00
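To see why the groupby("Lap").cumcount() matters, here is the same recipe on a hypothetical two-lap input (the second lap is made up purely for illustration); the counter restarts at 0 for each lap, so every lap gets its own minute offsets:
import pandas as pd

# hypothetical two-lap input, just to illustrate the per-lap cumcount reset
df = pd.DataFrame({"Lap": [1, 2],
                   "Starttime": ["10:00:00", "10:10:00"],
                   "Endtime": ["10:02:00", "10:11:00"]})
df["Starttime"] = pd.to_datetime(df["Starttime"], format="%H:%M:%S")
df["Endtime"] = pd.to_datetime(df["Endtime"], format="%H:%M:%S")

repeats = df['Endtime'].sub(df['Starttime']).dt.total_seconds() // 60
out = df.loc[df.index.repeat(repeats + 1), ['Lap', 'Starttime']]
# cumcount yields 0,1,2,... within each lap, so each lap's offsets restart at 0
out['Starttime'] = (out['Starttime'].add(
    pd.to_timedelta(out.groupby("Lap").cumcount(), 'min')).dt.time)
print(out)
# Lap 1 -> 10:00:00, 10:01:00, 10:02:00; Lap 2 -> 10:10:00, 10:11:00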
I want to concatenate 2 DataFrames with different PeriodIndex frequencies, sorting on the second index level, which is a position.
For example, I have following 2 DataFrames.
import pandas as pd
pr1h = pd.period_range(start='2020-01-01 08:00', end='2020-01-01 11:00', freq='1h')
pr2h = pd.period_range(start='2020-01-01 08:00', end='2020-01-01 11:00', freq='2h')
n_array_1h = [2, 2, 2, 2]
n_array_2h = [0, 1, 0, 1]
index_labels_1h = [pr1h, n_array_1h]
index_labels_2h = [[pr2h[0],pr2h[0],pr2h[1],pr2h[1]], n_array_2h]
values_1h = [[1], [2], [3], [4]]
values_2h = [[10], [20], [30], [40]]
df1h = pd.DataFrame(values_1h, index=index_labels_1h, columns=['Data'])
df1h.index.names=['Period','Position']
df2h = pd.DataFrame(values_2h, index=index_labels_2h, columns=['Data'])
df2h.index.names=['Period','Position']
df1h
Data
Period Position
2020-01-01 08:00 2 1
2020-01-01 09:00 2 2
2020-01-01 10:00 2 3
2020-01-01 11:00 2 4
df2h
Data
Period Position
2020-01-01 08:00 0 10
1 20
2020-01-01 10:00 0 30
1 40
I would like to obtain df1h_new, which:
keeps the PeriodIndex from df1h,
keeps data from the block in df2h whose period.start_time is the latest one lower than or equal to the current period.start_time in df1h,
keeps data from df1h, obviously.
So the result would be:
df1h_new
Data
Period Position
2020-01-01 08:00 0 10 # |---> data coming from df2h, block with
1 20 # | start_time =< df1h.index[0].start_time
2 1 # ----> data from df1h.index[0]
2020-01-01 09:00 0 10 # |---> data coming from df2h, block with
1 20 # | start_time =< df1h.index[1].start_time
2 2 # ----> data from df1h.index[1]
2020-01-01 10:00 0 30 # and so on...
1 40
2 3
2020-01-01 11:00 0 30
1 40
2 4
Please, what would be the recommended way to achieve that?
Thank you for your help and support!
One idea is to use concat with Series.unstack, align the frequencies with Series.asfreq, then back-fill missing values and reshape back to a MultiIndex:
df = (pd.concat([df1h['Data'].unstack(),
                 df2h['Data'].unstack().asfreq('H')], axis=1)
        .bfill()
        .stack()
        .sort_index()
        .to_frame('Data'))
print (df)
Data
Period Position
2020-01-01 08:00 0 10.0
1 20.0
2 1.0
2020-01-01 09:00 0 10.0
1 20.0
2 2.0
2020-01-01 10:00 0 30.0
1 40.0
2 3.0
2020-01-01 11:00 0 30.0
1 40.0
2 4.0
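As a side note, the same "latest block at or before each period" alignment can be written with a forward-fill reindex on the block start times. This is my own alternative sketch of the same idea, reusing the pr1h/df1h/df2h objects defined above:
import pandas as pd

# Alternative sketch: key the 2h blocks on their start_time and forward-fill
# onto the 1h grid, which directly encodes "latest start_time <= current one".
wide2h = df2h['Data'].unstack()          # one row per 2h block, columns = Position
wide2h.index = wide2h.index.start_time   # PeriodIndex -> block start timestamps
aligned = wide2h.reindex(pr1h.start_time, method='ffill')
aligned.index = pr1h                     # back to the hourly PeriodIndex
aligned.index.name = 'Period'
df1h_new = (pd.concat([aligned.stack(), df1h['Data']])
              .sort_index()
              .to_frame('Data'))
print(df1h_new)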
I want to create sessions based on Location and Timestamp. If the location is new, or the time has crossed into the next 15-minute interval, then a new session is assigned to the record in the dataframe. Example below:
Location | Time | Session
A 2016-01-01 00:00:15 1
A 2016-01-01 00:05:00 1
A 2016-01-01 00:10:08 1
A 2016-01-01 00:14:08 1
A 2016-01-01 00:15:49 2
B 2016-01-01 00:15:55 3
C 2016-01-01 00:15:58 4
C 2016-01-01 00:26:55 4
C 2016-01-01 00:29:55 4
C 2016-01-01 00:31:08 5
This is my code, which doesn't work for the given problem:
from datetime import timedelta
cond1 = df.DateTime-df.DateTime.shift(1) > pd.Timedelta(15, 'm')
#OR
#15_min = df.DateTime.diff() > pd.Timedelta(minutes=15)
cond2 = df.location != df.location.shift(1)
session_id = (cond1|cond2).cumsum()
df['session_id'] = session_id.map(pd.Series(range(0,10000)))
I want a new session if a new location is found or 15 minutes are up for the current location.
You can group by both Location and the Time binned into fixed 15-minute intervals with pd.Grouper, then use ngroup to number each group. Note that the bins are fixed clock windows (00:00-00:15, 00:15-00:30, ...), which is why 00:14:08 and 00:15:49 land in different sessions. A sketch of the equivalent explicit binning follows the output below:
df['Session'] = (df.groupby(['Location', pd.Grouper(key='Time', freq='15min')])
                   .ngroup() + 1)
>>> df
Location Time Session
0 A 2016-01-01 00:00:15 1
1 A 2016-01-01 00:05:00 1
2 A 2016-01-01 00:10:08 1
3 A 2016-01-01 00:14:08 1
4 A 2016-01-01 00:15:49 2
5 B 2016-01-01 00:15:55 3
6 C 2016-01-01 00:15:58 4
7 C 2016-01-01 00:26:55 4
8 C 2016-01-01 00:29:55 4
9 C 2016-01-01 00:31:08 5
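For the curious, an equivalent explicit form of the same binning (a sketch of my own, assuming Time is already a datetime column) floors each timestamp to its 15-minute clock bin first:
# Equivalent sketch: dt.floor('15min') makes the fixed clock bins explicit,
# then (Location, bin) pairs are numbered exactly as pd.Grouper groups them.
df['bin'] = df['Time'].dt.floor('15min')
df['Session'] = df.groupby(['Location', 'bin']).ngroup() + 1
df = df.drop(columns='bin')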
I have a dataframe with some columns, and I want to sum the column "gap" where the time falls in certain time slots.
region date time gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column. I have the time slots in a dict like this:
'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'
After summation, the above dataframe should look like this:
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots from 00:00:00 to 23:59:49. I have tried this:
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
But it doesn't work.
The idea is to convert the time column to datetimes, floor to 10Min, then convert to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Aggregate by sum, and finally map the values using the dictionary with keys and values swapped:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
If you want to display the next 10Min slot instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over flooring and converting to strings is binning with cut or searchsorted (numpy is needed here):
import numpy as np

df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
labels = labels[:-1]
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]
Just to avoid the complication of the datetime comparison (unless that is your whole point, in which case ignore my answer), and to show the essence of this group-by-slot-window problem, I here assume times are integers.
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
# for each row, pick the largest slot boundary strictly below its time
df['slot'] = df.apply(func=lambda x: slots[np.argmax(slots[x['time'] > slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output
       gap
slot
0        2
1000     1
1500     3
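The per-row apply can also be replaced by one vectorized np.searchsorted call; this is my own sketch of the same idea, with one caveat noted in the comment:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])

# side='right' minus 1 gives the last boundary <= time; note that a time exactly
# on a boundary lands in that boundary's slot, whereas the strict '>' above would
# put it in the previous slot (no such ties exist in this sample data)
df['slot'] = slots[np.searchsorted(slots, df['time'], side='right') - 1]
print(df.groupby('slot')[['gap']].sum())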
The way to think about approaching this problem is to convert your time column to the values you want first, and then do a groupby sum on the time column.
The code below shows the approach I've used. I used np.select so I can include as many conditions and choice options as I want. After converting time to the values I wanted, I did a simple groupby sum.
None of the fuss of formatting times or converting strings etc. is really needed; simply let the pandas DataFrame handle it intuitively.
import pandas as pd
import numpy as np #This is the library you require for the np.select function

#Just creating the DataFrame using a dictionary here
regdict = {
    'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
    'gap': [1,0,1,0,0,1,0,0],}
df = pd.DataFrame(regdict)
#Add in all your conditions and options here
condlist = [df['time']<'00:10:00',df['time']<'00:20:00']
choicelist = ['00:10:00/slot1','00:20:00/slot2']
#Use np.select after you have defined all your conditions and options
answerlist = np.select(condlist, choicelist)
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
             time  gap
0  00:10:00/slot1    1
1  00:10:00/slot1    0
2  00:10:00/slot1    1
3  00:10:00/slot1    0
4  00:20:00/slot2    0
5  00:20:00/slot2    1
6  00:20:00/slot2    0
7  00:20:00/slot2    0
df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
             time  gap
0  00:10:00/slot1    2
1  00:20:00/slot2    1
If you wish to keep the original time you can instead do df['timeNew'] = answerlist and then filter from there.
df['timeNew'] = answerlist
print (df)
time gap timeNew
0 00:00:08 1 00:10:00/slot1
1 00:00:48 0 00:10:00/slot1
2 00:02:50 1 00:10:00/slot1
3 00:00:52 0 00:10:00/slot1
4 00:10:01 0 00:20:00/slot2
5 00:10:03 1 00:20:00/slot2
6 00:10:05 0 00:20:00/slot2
7 00:10:08 0 00:20:00/slot2
#Use transform function here to retain all prior values
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform(sum)
print (df)
time gap timeNew aggregate sum of gap
0 00:00:08 1 00:10:00/slot1 2
1 00:00:48 0 00:10:00/slot1 2
2 00:02:50 1 00:10:00/slot1 2
3 00:00:52 0 00:10:00/slot1 2
4 00:10:01 0 00:20:00/slot2 1
5 00:10:03 1 00:20:00/slot2 1
6 00:10:05 0 00:20:00/slot2 1
7 00:10:08 0 00:20:00/slot2 1
This is a follow up to my previous question here.
Assume a dataset like this (which originally is read in from a .csv):
data = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'time': ['2017-01-01 12:00:00', '2017-01-01 12:00:00', '2017-01-01 12:00:00',
                              '2017-01-01 12:10:00', '2017-01-01 12:10:00', '2017-01-01 12:10:00',
                              '2017-01-01 12:20:00', '2017-01-01 12:20:00', '2017-01-01 12:20:00'],
                     'values': [10, 11, 12, 10, 12, 13, 10, 13, 13]})
data = data.set_index('id')
=>
id time values
0 1 2017-01-01 12:00:00 10
1 2 2017-01-01 12:00:00 11
2 3 2017-01-01 12:00:00 12
3 1 2017-01-01 12:10:00 10
4 2 2017-01-01 12:10:00 12
5 3 2017-01-01 12:10:00 13
6 1 2017-01-01 12:20:00 10
7 2 2017-01-01 12:20:00 13
8 3 2017-01-01 12:20:00 13
Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.
Previously, I learned how to get the total number of changes in values between two consecutive periods for each id:
data.groupby(data.index).values.apply(lambda x: (x != x.shift()).sum() - 1)
This works great and is really fast. Now, I am interested in adding a new column to the df. It should be a dummy indicating for each row in values if there was a change between the current and previous row. Thus, the result would be as follows:
=>
id time values change
0 1 2017-01-01 12:00:00 10 0
1 2 2017-01-01 12:00:00 11 0
2 3 2017-01-01 12:00:00 12 0
3 1 2017-01-01 12:10:00 10 0
4 2 2017-01-01 12:10:00 12 1
5 3 2017-01-01 12:10:00 13 1
6 1 2017-01-01 12:20:00 10 0
7 2 2017-01-01 12:20:00 13 1
8 3 2017-01-01 12:20:00 13 0
After fiddling around, I came up with a solution. However, it is really slow. It won't run on my actual dataset which is rather big:
def calc_change(x):
    x = (x != x.shift())
    x.iloc[0,] = False
    return x

changes = data.groupby(data.index, as_index=False).values.apply(
    calc_change).reset_index().iloc[:, 2]
data = data.sort_index().reset_index()
data.loc[changes, 'change'] = 1
data = data.fillna(0)
I'm sure there are better ways, and I appreciate any help!
You can use this solution if your id column is not set as index. diff() is NaN for the first row of each id, and filling it with 0 means the first observation counts as "no change"; ne(0) then flags any change, whether the value went up or down:
data['change'] = data.groupby(['id'])['values'].apply(lambda x: x.diff().fillna(0).ne(0)).astype(int)
You get
id time values change
0 1 2017-01-01 12:00:00 10 0
1 2 2017-01-01 12:00:00 11 0
2 3 2017-01-01 12:00:00 12 0
3 1 2017-01-01 12:10:00 10 0
4 2 2017-01-01 12:10:00 12 1
5 3 2017-01-01 12:10:00 13 1
6 1 2017-01-01 12:20:00 10 0
7 2 2017-01-01 12:20:00 13 1
8 3 2017-01-01 12:20:00 13 0
With id as index:
data = data.sort_index()
data['change'] = data.groupby(data.index)['values'].apply(lambda x: x.diff().fillna(0).ne(0)).astype(int)
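Since speed was the concern, a fully vectorized variant (my own sketch) avoids apply entirely: groupby(...).diff() is computed per id but returns a series already aligned with the original rows, so no re-sorting is needed either.
# Vectorized sketch: groupby.diff works per id group in the original row order,
# so no apply (and no sort) is required.
data['change'] = (data.groupby(level=0)['values']  # level 0 = the id index
                      .diff()                      # NaN at each id's first row
                      .fillna(0)                   # first row counts as no change
                      .ne(0)                       # True wherever the value changed
                      .astype(int))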