I want to create sessions based on Location and Timestamp. If the location is new or the 15-minute interval has been exceeded, then a new session is assigned to the record in the dataframe. Example below:
Location  Time                 Session
A         2016-01-01 00:00:15  1
A         2016-01-01 00:05:00  1
A         2016-01-01 00:10:08  1
A         2016-01-01 00:14:08  1
A         2016-01-01 00:15:49  2
B         2016-01-01 00:15:55  3
C         2016-01-01 00:15:58  4
C         2016-01-01 00:26:55  4
C         2016-01-01 00:29:55  4
C         2016-01-01 00:31:08  5
This is my code, which doesn't work for the given problem:
import pandas as pd
from datetime import timedelta

cond1 = df.DateTime - df.DateTime.shift(1) > pd.Timedelta(15, 'm')
# or equivalently:
# cond1 = df.DateTime.diff() > pd.Timedelta(minutes=15)
cond2 = df.location != df.location.shift(1)
session_id = (cond1 | cond2).cumsum()
df['session_id'] = session_id.map(pd.Series(range(0, 10000)))
I want a new session if a new location is found or 15 minutes are up for the current location.
You can group by Location and a pd.Grouper that bins Time into 15-minute intervals, then use ngroup to number each group:
df['Session'] = (df.groupby(['Location',pd.Grouper(key='Time',freq='15min')])
.ngroup()+1)
>>> df
Location Time Session
0 A 2016-01-01 00:00:15 1
1 A 2016-01-01 00:05:00 1
2 A 2016-01-01 00:10:08 1
3 A 2016-01-01 00:14:08 1
4 A 2016-01-01 00:15:49 2
5 B 2016-01-01 00:15:55 3
6 C 2016-01-01 00:15:58 4
7 C 2016-01-01 00:26:55 4
8 C 2016-01-01 00:29:55 4
9 C 2016-01-01 00:31:08 5
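Note that pd.Grouper needs the Time column to be an actual datetime dtype, which may not be the case if the data was read from a CSV. A minimal sketch of the conversion and sort, assuming the column names from the example:

import pandas as pd

df['Time'] = pd.to_datetime(df['Time'])    # pd.Grouper requires a datetime column
df = df.sort_values(['Location', 'Time'])  # keep rows in location/time order for readability
df['Session'] = (df.groupby(['Location', pd.Grouper(key='Time', freq='15min')])
                   .ngroup() + 1)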
I want to check whether the values in a column are constant or have little variation (<1) at a location over a certain duration (referred to as the "constant phenomenon"). If so, I want to identify and remove the last row (the one with the latest timestamp) that contributes to the constant phenomenon, because removing it keeps the constant duration under the given threshold.
An example set of input and output is shown below:
Input:
LOCATION DATE_TIME random value
0 111 2016-01-01 00:45:00 2 1
1 111 2016-01-01 12:15:00 9 9
2 111 2016-01-01 12:30:00 6 9
3 111 2016-01-01 12:45:00 3 9
4 111 2016-01-01 13:00:00 6 9
5 222 2016-01-01 12:00:00 3 9
6 222 2016-01-01 12:15:00 3 4
7 222 2016-01-01 12:30:00 5 5
8 333 2016-01-01 12:00:00 9 5
9 333 2016-01-01 12:15:00 2 5
10 333 2016-01-01 12:30:00 6 5
11 333 2016-01-02 12:30:00 8 5
12 444 2016-01-01 12:00:00 4 5
Output (original rows 3, 4, 10, and 11 are removed):
LOCATION DATE_TIME random value
0 111 2016-01-01 00:45:00 1 1
1 111 2016-01-01 12:15:00 7 9
2 111 2016-01-01 12:30:00 1 9
3 222 2016-01-01 12:00:00 8 9
4 222 2016-01-01 12:15:00 1 4
5 222 2016-01-01 12:30:00 9 5
6 333 2016-01-01 12:00:00 1 5
7 333 2016-01-01 12:15:00 1 5
8 444 2016-01-01 12:00:00 6 5
ADDED EXPLANATION TO THE EXAMPLE:
Original row 3 is removed because the values are "constant" from 12:15:00 to 12:45:00, which reaches the 30-minute threshold. By removing it, the time window starting at 12:15:00 is cleared. However, the next time window, starting at 12:30:00, runs into the same issue because of row 4, i.e. a constant value again lasting 30 minutes. Therefore, row 4 is also removed.
A few other things to note: 1) the values are not at a fixed time interval; 2) the cells under 'value' and 'random' can be integers or floats; 3) once identified, the row that contributes to the "constant phenomenon" is removed immediately, i.e. the removal is reflected in the next round of checking.
I have working code, shown below, that checks within each time window, but it is exhaustive and time-consuming.
I feel like it could be done much faster, e.g. using generators instead of iterators or using a recursive function, but I am not sure how to code it.
Thank you for your time.
import pandas as pd
import random
import datetime
import sys

def check_duration(df_all, measure, maxd):
    link_group = df_all.groupby('LOCATION', as_index=False)
    df_all_cleansed = pd.DataFrame()
    df_remove = pd.DataFrame()
    for k in link_group.groups:
        df_link = link_group.get_group(k)
        df_link = df_link.sort_values(['DATE_TIME'])
        df_link.reset_index(drop=True, inplace=True)
        # Until which row is the check necessary
        num_check = len(df_link.loc[df_link['DATE_TIME'] <= max(df_link['DATE_TIME']) - maxd])
        i = 0
        while i < num_check:
            start = df_link['DATE_TIME'].iloc[i]
            end = start + maxd
            if len(df_link.loc[df_link['DATE_TIME'] >= end]) == 0:
                break
            else:
                # A time window with duration maxd as the base
                df_target = df_link.loc[(df_link['DATE_TIME'] < end) & (df_link['DATE_TIME'] >= start)]
                # The critical row that may or may not be removed depending on the if statement below
                df_critical = df_link.loc[[df_link.loc[df_link['DATE_TIME'] >= end, 'DATE_TIME'].idxmin()]]
                df_appended = df_critical.append(df_target)
                # Check within this window whether 1) there is more than one row and 2) max-min is below 1
                if len(df_appended) > 1 and (max(df_appended[measure]) - min(df_appended[measure])) < 1:
                    that_time = df_critical['DATE_TIME'].iloc[0]
                    print('Removed timestamp: ', that_time)
                    that_index_list = df_link[df_link['DATE_TIME'] == that_time].index.tolist()
                    # Just a precautionary step in case of duplicated date_time (very unlikely)
                    if len(that_index_list) == 1:
                        that_index = that_index_list[0]
                    else:
                        print('Duplicated date_time!!!')
                        sys.exit(0)
                    df_link.drop(that_index, inplace=True)
                    df_remove = df_remove.append(df_critical)
                    df_link.reset_index(drop=True, inplace=True)
                else:
                    i += 1
        df_all_cleansed = df_all_cleansed.append(df_link)
    df_all_cleansed.reset_index(drop=True, inplace=True)
    return df_all_cleansed, df_remove
d = {'LOCATION':[111,111,111,111,111,222,222,222,333,333,333,333,444],
'DATE_TIME': ['2016-01-01 00:45:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00', '2016-01-01 13:00:00',
'2016-01-01 12:00:00', '2016-01-01 12:15:00', '2016-01-01 12:30:00',
'2016-01-01 12:00:00', '2016-01-01 12:15:00','2016-01-01 12:30:00',
'2016-01-02 12:30:00', '2016-01-01 12:00:00'],
'random': [random.randint(0, 9) for p in range(13)],
'value': [1,9,9,9,9,9,4,5,5,5,5,5,5]}
df = pd.DataFrame(data=d)
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
print('******************* BEFORE: *******************\n', df)
threshold_duration = datetime.timedelta(days=0, hours=0, minutes=30, seconds=0)
df, df_remove = check_duration(df,'value',threshold_duration)
print('******************* AFTER: *******************\n', df)
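As a side note, the repeated boolean masks such as df_link['DATE_TIME'] >= end are a large part of the cost of each iteration. Since each LOCATION group is already sorted by DATE_TIME, the same lookup can be done with np.searchsorted on the underlying array. A minimal, hypothetical sketch of just that lookup, reusing the per-location df_link from inside the loop above and shown for the first row of a group:

import numpy as np

times = df_link['DATE_TIME'].values           # sorted datetime64[ns] array for one LOCATION group
end = times[0] + np.timedelta64(30, 'm')      # window end for the first row, as an example
j = int(np.searchsorted(times, end))          # index of the first timestamp at or after end
df_target = df_link.iloc[0:j]                 # rows inside [start, end)
df_critical = df_link.iloc[[j]] if j < len(times) else None  # first row past the window, if any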
I have a dataframe with several columns. I want to sum the "gap" column over certain time slots.
  region        date      time  gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column. I have the time slots in a dict like this:
'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'
After summation, the above dataframe should look like this:
  region        date            time  gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots from 00:00:00 to 23:59:49. I have tried this.
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
But it doesn't work.
The idea is to convert the time column to datetimes, floor them to 10-minute bins, and then convert to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Aggregate with sum, then map the values through the dictionary with keys and values swapped:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
If you want to display the next 10-minute slot instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over flooring and converting to strings is to bin with cut or searchsorted:
import numpy as np

df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
labels = labels[:-1]
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]
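The binned labels can then feed the same aggregation as before; a short sketch, assuming the same d1 mapping as above (observed=True keeps the empty 10-minute bins out of the result, since pd.cut returns a categorical column):

regres = df.groupby(['region', 'date', 'time1'], as_index=False, observed=True)['gap'].sum()
labels_str = regres['time1'].astype(str)                 # categorical labels back to plain strings
regres['time1'] = labels_str + '/' + labels_str.map(d1)  # e.g. '00:00:00/slot1'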
Just to avoid the complication of the datetime comparison (unless that is your whole point, in which case ignore my answer), and to show the essence of this group-by-slot-window problem, I here assume times are integers.
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
# For each row, pick the largest slot boundary that is <= time
df['slot'] = df.apply(func=lambda x: slots[np.argmax(slots[x['time'] > slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output:

      gap
slot
0       2
1000    1
1500    3
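Since slots is sorted, the per-row apply can also be replaced by a single vectorized lookup; a small sketch of the same idea using np.searchsorted:

# For each time, find the largest slot boundary that is <= time
df['slot'] = slots[np.searchsorted(slots, df['time'], side='right') - 1]
df.groupby('slot')[['gap']].sum()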
The way to approach this problem is to first convert your time column to the values you want, and then do a groupby sum on it.
The code below shows the approach I've used. np.select lets you include as many conditions and corresponding choices as you need. After converting time to the values I wanted, I did a simple groupby sum.
None of the fuss of formatting times or converting strings is really needed; simply let the pandas DataFrame handle it.
import pandas as pd
import numpy as np  # required for the np.select function

# Just creating the DataFrame using a dictionary here
regdict = {
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0]}
df = pd.DataFrame(regdict)

# Add in all your conditions and choices here
condlist = [df['time'] < '00:10:00', df['time'] < '00:20:00']
choicelist = ['00:10:00/slot1', '00:20:00/slot2']

# Use np.select after you have defined all your conditions and choices
answerlist = np.select(condlist, choicelist)
print(answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
             time  gap
0  00:10:00/slot1    1
1  00:10:00/slot1    0
2  00:10:00/slot1    1
3  00:10:00/slot1    0
4  00:20:00/slot2    0
5  00:20:00/slot2    1
6  00:20:00/slot2    0
7  00:20:00/slot2    0
df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
             time  gap
0  00:10:00/slot1    2
1  00:20:00/slot2    1
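One assumption worth making explicit: comparisons such as df['time'] < '00:10:00' work here only because zero-padded HH:MM:SS strings sort lexicographically in the same order as the times they represent. A one-line sanity check of that assumption:

assert '00:02:50' < '00:10:00' < '00:10:01'  # lexicographic order matches chronological order for zero-padded times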
If you wish to keep the original time you can instead do df['timeNew'] = answerlist and then filter from there.
df['timeNew'] = answerlist
print (df)
time gap timeNew
0 00:00:08 1 00:10:00/slot1
1 00:00:48 0 00:10:00/slot1
2 00:02:50 1 00:10:00/slot1
3 00:00:52 0 00:10:00/slot1
4 00:10:01 0 00:20:00/slot2
5 00:10:03 1 00:20:00/slot2
6 00:10:05 0 00:20:00/slot2
7 00:10:08 0 00:20:00/slot2
#Use transform function here to retain all prior values
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform(sum)
print (df)
time gap timeNew aggregate sum of gap
0 00:00:08 1 00:10:00/slot1 2
1 00:00:48 0 00:10:00/slot1 2
2 00:02:50 1 00:10:00/slot1 2
3 00:00:52 0 00:10:00/slot1 2
4 00:10:01 0 00:20:00/slot2 1
5 00:10:03 1 00:20:00/slot2 1
6 00:10:05 0 00:20:00/slot2 1
7 00:10:08 0 00:20:00/slot2 1
I need to groupby and filter out duplicates in a pandas dataframe based on conditions. My dataframe looks like this:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,2,2,3,4,4],'Date':['1/1/2001','1/1/1999','1/1/2010','1/1/2004','1/1/2000','1/1/2001','1/1/2000'], 'type':['yes','yes','yes','yes','no','no','no'], 'source':[3,1,1,2,2,2,1]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('ID')
df
Date source type
ID
1 2001-01-01 3 yes
1 1999-01-01 1 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
I need to group by ID and type and, wherever type == yes, keep only the most current record if it also has the highest source. If the most current record does not have the highest source, keep both records.
Desired output:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
I have tried using transform but cannot figure out how to apply conditions:
grouped = df.groupby(['ID','type'])['Date'].transform(max)
df = df.loc[df['Date'] == grouped]
df
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
Any help is greatly appreciated.
Wen, here is the problem: if I have a dataframe with more rows (I have about 70 columns and 5000 rows), it does not take the source max into consideration.
Date source type
ID
1 2001-01-01 3 yes
1 1999-01-01 1 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
Using your code I get:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
it should be:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
This will need pd.concat
grouped = df.groupby(['type'])['Date'].transform(max)  # I changed this line; it seems you need to group by type
s = df.loc[df['Date'] == grouped].index
# Here we split the df into two parts: one where only the latest row per ID is kept,
# and one where all rows are kept.
pd.concat([df.loc[df.index.difference(s)].sort_values('Date').groupby('ID').tail(1),
           df.loc[s]]).sort_index()
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
Update
grouped = df.groupby(['type'])['source'].transform(max)
s = df.loc[df['source'] == grouped].index
pd.concat([df.loc[s].sort_values('Date').groupby('ID').tail(1),df.loc[df.index.difference(s)]]).sort_index()
Out[445]:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
This is a follow-up to my previous question here.
Assume a dataset like this (which originally is read in from a .csv):
data = pd.DataFrame({'id': [1,2,3,1,2,3,1,2,3],
'time':['2017-01-01 12:00:00','2017-01-01 12:00:00','2017-01-01 12:00:00',
'2017-01-01 12:10:00','2017-01-01 12:10:00','2017-01-01 12:10:00',
'2017-01-01 12:20:00','2017-01-01 12:20:00','2017-01-01 12:20:00'],
'values': [10,11,12,10,12,13,10,13,13]})
data = data.set_index('id')
=>
id time values
0 1 2017-01-01 12:00:00 10
1 2 2017-01-01 12:00:00 11
2 3 2017-01-01 12:00:00 12
3 1 2017-01-01 12:10:00 10
4 2 2017-01-01 12:10:00 12
5 3 2017-01-01 12:10:00 13
6 1 2017-01-01 12:20:00 10
7 2 2017-01-01 12:20:00 13
8 3 2017-01-01 12:20:00 13
Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.
Previously, I learned how to get the total number of changes in values between two consecutive periods for each id:
data.groupby(data.index).values.apply(lambda x: (x != x.shift()).sum() - 1)
This works great and is really fast. Now, I am interested in adding a new column to the df. It should be a dummy indicating for each row in values if there was a change between the current and previous row. Thus, the result would be as follows:
=>
id time values change
0 1 2017-01-01 12:00:00 10 0
1 2 2017-01-01 12:00:00 11 0
2 3 2017-01-01 12:00:00 12 0
3 1 2017-01-01 12:10:00 10 0
4 2 2017-01-01 12:10:00 12 1
5 3 2017-01-01 12:10:00 13 1
6 1 2017-01-01 12:20:00 10 0
7 2 2017-01-01 12:20:00 13 1
8 3 2017-01-01 12:20:00 13 0
After fiddling around, I came up with a solution. However, it is really slow. It won't run on my actual dataset which is rather big:
def calc_change(x):
    x = (x != x.shift())
    x.iloc[0,] = False
    return x

changes = data.groupby(data.index, as_index=False).values.apply(
    calc_change).reset_index().iloc[:, 2]
data = data.sort_index().reset_index()
data.loc[changes, 'change'] = 1
data = data.fillna(0)
I'm sure there are better ways, and I appreciate any help!
You can use this solution if your id column is not set as the index:
data['change'] = data.groupby(['id'])['values'].apply(lambda x: x.diff() > 0).astype(int)
You get
id time values change
0 1 2017-01-01 12:00:00 10 0
1 2 2017-01-01 12:00:00 11 0
2 3 2017-01-01 12:00:00 12 0
3 1 2017-01-01 12:10:00 10 0
4 2 2017-01-01 12:10:00 12 1
5 3 2017-01-01 12:10:00 13 1
6 1 2017-01-01 12:20:00 10 0
7 2 2017-01-01 12:20:00 13 1
8 3 2017-01-01 12:20:00 13 0
With id as index,
data = data.sort_index()
data['change'] = data.groupby(data.index)['values'].apply(lambda x: x.diff() > 0).astype(int)
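One caveat: x.diff() > 0 only flags increases. The example values never decrease, so the output matches, but if a drop should also count as a change, a hedged variant (same column names, id as the index, rows sorted by time within each id) could compare against the previous value directly:

prev = data.groupby(level=0)['values'].shift()                    # previous value within each id
data['change'] = (data['values'].ne(prev) & prev.notna()).astype(int)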
I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
I would like to generate the interval column: the minutes between rows, but only for the same id and the same day, just like in the example. In SQL I would partition by id and day and use LAG to get the time interval from the previous row. How can I do this in pandas?
You can convert the datetime column with to_datetime, then use groupby with diff and convert the timedelta to minutes with astype:
print df
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
df['new'] = df.groupby(['id', df['datetime'].dt.day])['datetime'].diff().astype('timedelta64[m]')
print df
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
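A related detail: grouping on dt.day uses only the day of the month, so 2016-01-01 and 2016-02-01 would fall into the same group if the data spanned several months. A hedged variant that groups on the full calendar date instead (and computes the minutes without the timedelta astype):

df['new'] = (df.groupby(['id', df['datetime'].dt.date])['datetime']
               .diff()
               .dt.total_seconds() / 60)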