Add time interval values in new column Pandas - python

I have a large pandas dataframe (40 million rows) with the following format :
ID DATETIME TIMESTAMP
81215545953683710540 2017-01-01 17:39:57 1483243205
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243353
I am trying to assign a value to each row depending on its timestamp, so that rows falling in the same time interval share the same value.
Let's say I have t0 = 1483243205 and I want a different value once TIMESTAMP = t0+10, so my time interval is 10 seconds.
I would like something like this:
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261 5
48126186377367976994 2017-01-01 17:19:29 1483243263 5
23522333658893375671 2017-01-01 12:50:46 1483243266 6
16194691060240380504 2017-01-01 15:59:23 1483243288 8
Here is my code :
df['VALUE']=''
t=1483243205
j=0
for i in range(0,len(df['TIMESTAMP'])):
    while(df.iloc[i][2])<(t+10):
        df['VALUE'][i]=j
        i+=1
    t+=10
    j+=1
I get a warning when executing my code (SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame) and I get the following result:
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
It is not the first time I have encountered this warning, and I have always worked around it, but I am confused by the fact that only the first row gets a value.
Does anyone know what I am missing?
Thanks

I would suggest using pandas' cut function for this, which avoids explicitly looping through your DataFrame.
import pandas as pd
tmin, tmax = df['TIMESTAMP'].min(), df['TIMESTAMP'].max()
# bin edges every 10 seconds, starting at the smallest timestamp
bins = list(range(tmin, tmax + 10, 10))
# one integer label per bin: 0, 1, 2, ...
labels = list(range(len(bins) - 1))
df['VALUE'] = pd.cut(df['TIMESTAMP'], bins=bins, labels=labels, include_lowest=True)
ID DATETIME TIMESTAMP VALUE
0 81215545953683710540 2017-01-01 17:39:57 1483243205 0
1 74994612102903447699 2017-01-01 19:14:12 1483243261 5
2 48126186377367976994 2017-01-01 17:19:29 1483243263 5
3 23522333658893375671 2017-01-01 12:50:46 1483243266 6
4 16194691060240380504 2017-01-01 15:59:23 1483243288 8
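Since the timestamps are integer epoch seconds and the intervals have a fixed width, the bin number can also be computed with plain integer arithmetic. The following is a minimal sketch, not part of the answer above; the name `interval` is introduced here for illustration, and `t0` is taken to be the smallest timestamp, as in the question. Note one difference from `pd.cut` with its default right-closed bins: here a timestamp exactly equal to `t0 + 10` gets the value 1 rather than 0, which matches what the question asks for.

```python
import pandas as pd

# Sample timestamps taken from the expected output in the question.
df = pd.DataFrame({
    "TIMESTAMP": [1483243205, 1483243261, 1483243263, 1483243266, 1483243288],
})

t0 = df["TIMESTAMP"].min()   # start of the first interval
interval = 10                # interval width in seconds (illustrative name)

# Floor division gives the index of the half-open interval
# [t0 + k*interval, t0 + (k+1)*interval) that each timestamp falls into.
df["VALUE"] = (df["TIMESTAMP"] - t0) // interval

print(df)
```

This is a single vectorized operation, so it should scale to tens of millions of rows without building a list of bin edges.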

Related

Pandas custom re-sample for time series data

I have time series data at 1-minute frequency. I would like to resample it every 5 minutes so that each resampled row includes the data of the first, middle and last time step of the interval.
I have tried the following, but I am not getting what I expect:
def my_fun(array):
    return array[0], array[-1]
df = pd.DataFrame(np.arange(60), index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
If I understood you correctly then you want the data for minutes 0,2,4,5,7,9,10,... in a new dataframe. A faster way than using resample may be:
df=pd.DataFrame(np.arange(60),index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'))
l = len(df)
df.loc[df.iloc[range(2,l,5)].index | df.iloc[range(4,l,5)].index | df.iloc[range(0,l,5)].index]
Output:
0
2017-01-01 00:00:00 0
2017-01-01 00:02:00 2
2017-01-01 00:04:00 4
2017-01-01 00:05:00 5
2017-01-01 00:07:00 7
2017-01-01 00:09:00 9
2017-01-01 00:10:00 10
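The same positional selection can be written as a boolean mask on the row position modulo 5. This is a minimal sketch under the assumption that the data is complete and strictly 1-minute spaced, so that position and minute coincide; the names `pos` and `picked` are illustrative. It avoids the `|` operator on Index objects, which in recent pandas versions may no longer behave as a set union.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(60),
    index=pd.date_range("2017-01-01 00:00", "2017-01-01 00:59", freq="min"),
)

# Position of each row inside its 5-row block: 0, 1, 2, 3, 4, 0, 1, ...
pos = np.arange(len(df)) % 5

# Keep the first, middle and last row of every block.
picked = df[np.isin(pos, [0, 2, 4])]

print(picked.head(7))
```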
If you just wanted a combined list of your selected data in one row then you were almost there:
def my_fun(array):
    return [array[0], array[2], array[4]]
df=pd.DataFrame({'0':np.arange(60)}, index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
Output:
0
2017-01-01 00:00:00 (0, 2, 4)
2017-01-01 00:05:00 (5, 7, 9)
2017-01-01 00:10:00 (10, 12, 14)

Time Ranking DataFrame within the constraints of noise

I have a dataframe df with three columns: Date, Time and Name (there can be additional columns). df is sorted in ascending order of Time. On any given Date there can be multiple Time values, which are either within 5 minutes of each other or more than 15 minutes apart. On any given day, values within 5 minutes of each other should be treated as the same. I want to add a column TimeRank that, for each day, clusters Time values within 5 minutes of each other and gives them the same TimeRank. For example,
Date Name Time TimeRank
0 2017-01-01 Henry 2017-01-01 09:21:01 1
1 2017-01-01 John 2017-01-01 09:23:43 1
2 2017-01-01 Svetlana 2017-01-01 10:15:01 2
3 2017-01-01 Sara 2017-01-01 11:01:01 3
4 2017-01-01 Whitney 2017-01-01 11:03:03 3
5 2017-01-02 Lara 2017-01-02 11:03:03 1
6 2017-01-02 Eugene 2017-01-02 16:46:00 2
7 2017-01-02 Richard 2017-01-02 16:46:00 2
8 2017-01-03 Andy 2017-01-03 11:01:01 1
9 2017-01-03 Paul 2017-01-03 11:03:03 1
Below I have created a sample df and my attempt so far. Unfortunately, I am constrained to an older version of pandas (0.16).
import numpy as np
import pandas as pd
from random import randint
from datetime import time
dates = pd.date_range('2017-01-01', '2017-01-04')
dates2 = [dates[i] for i in [randint(0, len(dates) -1) for i in range (0, 100)]]
timelist = [time(9,20,45), time(9,21,0), time(9,23,43), time(9,50,0), time(10,15,1), time(11,1,1), time(11,3,3), time(16,45,0), time(16,46,0)]
timelist2 = [timelist[i] for i in [randint(0, len(timelist) -1) for i in range (0, 100)]]
names = ['henry', 'tom', 'andy', 'lara', 'whitney', 'eleanor', 'paloma', 'john', 'james', 'svetlana', 'paul']
names2 = [names[i] for i in [randint(0, len(names)-1) for i in range (0, 100)]]
df = pd.DataFrame({'Date':dates2, 'Time':timelist2, 'Name':names2})
df['Time'] = df.apply(lambda r:pd.datetime.combine(r['Date'],r['Time']), axis=1)
df.sort('Time', inplace=True)
df.loc[:, 'minutes'] = df.apply(lambda x:x['Time'].minute + 60*x['Time'].hour, axis=1)
df.loc[:, 'delTime'] = df.groupby('Date')['minutes'].diff()
df.loc[(df['delTime'] <=5) & (df['delTime'] >=-5), 'delTime'] = 0
df.loc[np.isnan(df['delTime']), 'delTime'] = 1.
df.loc[(df['delTime']) == 0, 'delTime'] = np.nan
df.loc[~np.isnan(df['delTime']), 'delTime'] = df['minutes']
df = df.ffill()
df.loc[:, 'TimeRank'] = df.groupby('Date')['delTime'].rank(method='dense')
df.drop(['minutes', 'delTime'], inplace=True, axis=1)
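For comparison, here is a minimal sketch of the same idea using a per-day `diff` followed by a cumulative sum. It assumes a recent pandas version (so it may not run unchanged on 0.16), uses the first eight rows of the example table above as data, and the names `gap` and `new_cluster` are illustrative. A row starts a new cluster when it is the first row of its day or when it is more than 5 minutes after the previous row of that day.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Henry", "John", "Svetlana", "Sara", "Whitney",
             "Lara", "Eugene", "Richard"],
    "Time": pd.to_datetime([
        "2017-01-01 09:21:01", "2017-01-01 09:23:43", "2017-01-01 10:15:01",
        "2017-01-01 11:01:01", "2017-01-01 11:03:03", "2017-01-02 11:03:03",
        "2017-01-02 16:46:00", "2017-01-02 16:46:00",
    ]),
})
df["Date"] = df["Time"].dt.normalize()
df = df.sort_values("Time").reset_index(drop=True)

# Time elapsed since the previous row of the same day (NaT for the first row).
gap = df.groupby("Date")["Time"].diff()

# A new cluster starts on the first row of a day or after a gap of more than 5 minutes.
new_cluster = gap.isna() | (gap > pd.Timedelta(minutes=5))

# Counting the cluster starts within each day yields the rank.
df["TimeRank"] = new_cluster.astype(int).groupby(df["Date"]).cumsum()

print(df[["Date", "Name", "Time", "TimeRank"]])
```

Because the question states that times are either within 5 minutes or more than 15 minutes apart, comparing each row only to its predecessor is enough here; data with long chains of closely spaced times would merge into one cluster under this rule.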

how to merge group rows in dataframe based on differences between datetime?

I have a dataframe which contains one event per row, each with a Start and End datetime.
import pandas as pd
import datetime
df = pd.DataFrame({'Value': [1., 2., 3.],
                   'Start': [datetime.datetime(2017,1,1,0,0,0), datetime.datetime(2017,1,1,0,1,0), datetime.datetime(2017,1,1,0,4,0)],
                   'End': [datetime.datetime(2017,1,1,0,0,59), datetime.datetime(2017,1,1,0,5,0), datetime.datetime(2017,1,1,0,6,0)]},
                  index=[0, 1, 2])
df
Out[7]:
End Start Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0
2 2017-01-01 00:07:00 2017-01-01 00:06:00 3.0
I would like to group consecutive rows where the difference between the End of one row and the Start of the next row is smaller than a given timedelta.
E.g. here, for a timedelta of 5 seconds I would like to group the rows with index 0 and 1, and for a timedelta of 2 minutes the group should contain rows 0, 1 and 2.
A solution would be to compare consecutive rows with their shifted version using .shift(), however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.
threshold = datetime.timedelta(minutes=5)
df['delta'] = df['End'] - df['Start']
df['group'] = (df['delta'] - df['delta'].shift(-1) <= threshold).cumsum()
groups = df.groupby('group')
I assume you are trying to aggregate based on the time difference.
marker = 60
df = df.assign(diff=df.apply(lambda row: (row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
    print(g[1])
End Start Value diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0 False
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0 False
End Start Value diff
0 2017-01-01 00:00:59 2017-01-01 1.0 True
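Grouping runs of more than two rows does not require repeated comparisons: a single `shift` marks where a new group begins, and a cumulative sum turns those marks into group labels. The sketch below illustrates this under several assumptions: the data matches the displayed output in the question (third event from 00:06:00 to 00:07:00), rows are sorted by Start with non-decreasing End, `merge_close_events` is a hypothetical helper name, and summing Value is an arbitrary choice of aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [1.0, 2.0, 3.0],
    "Start": pd.to_datetime(["2017-01-01 00:00:00", "2017-01-01 00:01:00",
                             "2017-01-01 00:06:00"]),
    "End": pd.to_datetime(["2017-01-01 00:00:59", "2017-01-01 00:05:00",
                           "2017-01-01 00:07:00"]),
})

def merge_close_events(events, threshold):
    """Merge consecutive rows whose gap to the previous row's End is <= threshold."""
    # Gap between this row's Start and the previous row's End (NaT for the first row).
    gap = events["Start"] - events["End"].shift()
    # A gap above the threshold starts a new group; the NaT comparison is False,
    # so the first row falls into group 0.
    group = (gap > threshold).cumsum()
    return events.groupby(group).agg(
        Start=("Start", "min"),
        End=("End", "max"),
        Value=("Value", "sum"),  # assumption: values of merged events are summed
    )

print(merge_close_events(df, pd.Timedelta(seconds=5)))
print(merge_close_events(df, pd.Timedelta(minutes=2)))
```

With a 5-second threshold rows 0 and 1 are merged and row 2 stays separate; with a 2-minute threshold all three rows end up in one group.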

Collapsing rows with overlapping dates

I'm trying to collapse the below dataframe into rows containing continuous time periods by id. Continuous means that, within the same id, the start_date is either less than, equal or at most one day greater than the end_date of the previous row (the data is already sorted by id, start_date and end_date). All rows that are continuous should be output as a single row, with start_date being the minimum start_date (i.e. the start_date of the first row in the continuous set) and end_date being the maximum end_date from the continuous set of rows.
Please see the desired output at the bottom.
The only way I can think of approaching this is by parsing the dataframe row by row, but I was wondering if there is a more Pythonic and elegant way to do it.
id start_date end_date
1 2017-01-01 2017-01-15
1 2017-01-12 2017-01-24
1 2017-01-25 2017-02-03
1 2017-02-05 2017-02-14
1 2017-02-16 2017-02-28
2 2017-01-01 2017-01-19
2 2017-01-24 2017-02-07
2 2017-02-07 2017-02-20
Here is the code to generate the input dataframe:
import numpy as np
import pandas as pd
start_date = ['2017-01-01','2017-01-12','2017-01-25','2017-02-05','2017-02-16',
'2017-01-01','2017-01-24','2017-02-07']
end_date = ['2017-01-15','2017-01-24','2017-02-03','2017-02-14','2017-02-28',
'2017-01-19','2017-02-07','2017-02-20']
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2],
'start_date': pd.to_datetime(start_date, format='%Y-%m-%d'),
'end_date': pd.to_datetime(end_date, format='%Y-%m-%d')})
Desired output:
id start_date end_date
1 2017-01-01 2017-02-03
1 2017-02-05 2017-02-14
1 2017-02-16 2017-02-28
2 2017-01-01 2017-01-19
2 2017-01-24 2017-02-20
def f(grp):
    # define a list to collect valid start and end ranges
    d = []
    (
        # append a new row if the start date is at least 2 days greater than the end date of the previous row,
        # otherwise update the last row's end date with the current row's end date.
        grp.reset_index(drop=True)
           .apply(lambda x: d.append({x.start_date: x.end_date})
                  if x.name == 0 or (x.start_date - pd.DateOffset(1)) > grp.iloc[x.name - 1].end_date
                  else d[-1].update({list(d[-1].keys())[0]: x.end_date}),
                  axis=1)
    )
    # reconstruct a df using only valid start and end date pairs.
    return pd.DataFrame([[list(e.keys())[0], list(e.values())[0]] for e in d],
                        columns=['start_date', 'end_date'])

df.groupby('id').apply(f).reset_index().drop('level_1', axis=1)
Out[467]:
id start_date end_date
0 1 2017-01-01 2017-02-03
1 1 2017-02-05 2017-02-14
2 1 2017-02-16 2017-02-28
3 2 2017-01-01 2017-01-19
4 2 2017-01-24 2017-02-20
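A vectorized variant of the same rule is sketched below, without a row-wise `apply`. It assumes a recent pandas version and, like the function above, compares each row only with the previous row of the same id; if an earlier row could end later than the rows that follow it, a running maximum of `end_date` would be needed instead. The names `prev_end`, `new_block` and `block` are illustrative.

```python
import pandas as pd

start_date = ["2017-01-01", "2017-01-12", "2017-01-25", "2017-02-05", "2017-02-16",
              "2017-01-01", "2017-01-24", "2017-02-07"]
end_date = ["2017-01-15", "2017-01-24", "2017-02-03", "2017-02-14", "2017-02-28",
            "2017-01-19", "2017-02-07", "2017-02-20"]
df = pd.DataFrame({"id": [1, 1, 1, 1, 1, 2, 2, 2],
                   "start_date": pd.to_datetime(start_date),
                   "end_date": pd.to_datetime(end_date)})

# End date of the previous row within the same id (NaT for the first row of each id).
prev_end = df.groupby("id")["end_date"].shift()

# A new period starts on the first row of an id, or when the start date is
# more than one day after the previous end date.
new_block = prev_end.isna() | (df["start_date"] > prev_end + pd.Timedelta(days=1))

# Number the periods within each id.
block = new_block.astype(int).groupby(df["id"]).cumsum().rename("block")

out = (df.groupby(["id", block])
         .agg(start_date=("start_date", "min"), end_date=("end_date", "max"))
         .reset_index()
         .drop(columns="block"))
print(out)
```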

Boxplot Pandas data

DataFrame is as follows:
ID1 ID2
0 00:00:01.002 00:00:01.002
1 00:00:01.001 00:00:01.006
2 00:00:01.004 00:00:01.011
3 00:00:00.998 00:00:01.012
4 NaT 00:00:01.000
...
20 NaT 00:00:00.998
What I am trying to do is create a boxplot for each ID. There may or may not be multiple IDs depending on the dataset I provide. For right now I am trying to solve this for 2 datasets. If possible I would like a solution that has all the data on the same boxplot and then another with the data displayed on its own boxplot per ID.
I am very new to pandas (trying to learn it...) and am just getting frustrated at how long this is taking to figure out... Here is my code...
deltaT = pd.DataFrame()  # create a blank df
for x in range(0, len(totIDs)):
    ID = IDList[x]
    df = pd.DataFrame(data[ID]).T
    deltaT[ID] = pd.to_datetime(df[TIME_COL]).diff()
deltaT.boxplot()
It is pretty simple, but I just can't seem to get it to do what I want, which is to plot a boxplot for each ID. I should note that data is given to me by a homegrown file reader that takes several complex files and sorts them into the data dictionary, which is indexed by IDs.
I am running pandas version 0.14.0 and python version 2.7.7
I am not sure how this works in version 0.14.0, since the latest version is 0.19.2; I recommend upgrading if possible:
import numpy as np
import pandas as pd
# sample data
np.random.seed(180)
dates = pd.date_range('2017-01-01 10:11:20', periods=10, freq='T')
cols = ['ID1','ID2']
df = pd.DataFrame(np.random.choice(dates, size=(10,2)), columns=cols)
print (df)
ID1 ID2
0 2017-01-01 10:12:20 2017-01-01 10:17:20
1 2017-01-01 10:16:20 2017-01-01 10:20:20
2 2017-01-01 10:18:20 2017-01-01 10:17:20
3 2017-01-01 10:12:20 2017-01-01 10:16:20
4 2017-01-01 10:14:20 2017-01-01 10:18:20
5 2017-01-01 10:18:20 2017-01-01 10:19:20
6 2017-01-01 10:17:20 2017-01-01 10:12:20
7 2017-01-01 10:13:20 2017-01-01 10:17:20
8 2017-01-01 10:16:20 2017-01-01 10:11:20
9 2017-01-01 10:13:20 2017-01-01 10:19:20
Call DataFrame.diff and then convert timedeltas to total_seconds:
df = df.diff().apply(lambda x: x.dt.total_seconds())
print(df)
ID1 ID2
0 NaN NaN
1 240.0 180.0
2 120.0 -180.0
3 -360.0 -60.0
4 120.0 120.0
5 240.0 60.0
6 -60.0 -420.0
7 -240.0 300.0
8 180.0 -360.0
9 -180.0 480.0
Finally, use DataFrame.plot.box:
df.plot.box()
You can also check docs.
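If the columns already hold timedeltas with missing values, as in the frame shown in the question, the conversion step could look like the sketch below. The sample values are copied from the first rows of the question's table, and the name `seconds` is illustrative. `NaT` entries become `NaN` after the conversion; the plotting calls are left as comments because they need matplotlib.

```python
import pandas as pd

# Timedelta columns with a missing value, mirroring the first rows of the question's table.
delta = pd.DataFrame({
    "ID1": pd.to_timedelta(["00:00:01.002", "00:00:01.001", "00:00:01.004",
                            "00:00:00.998", None]),
    "ID2": pd.to_timedelta(["00:00:01.002", "00:00:01.006", "00:00:01.011",
                            "00:00:01.012", "00:00:01.000"]),
})

# Convert every column to float seconds; NaT becomes NaN.
seconds = delta.apply(lambda col: col.dt.total_seconds())
print(seconds)

# Plotting (requires matplotlib):
# seconds.plot.box()                # all IDs on one set of axes
# seconds.plot.box(subplots=True)   # one set of axes per ID
```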
