Timestamp String to Seconds Conversion in Python

I'm a beginner in Python data science. I'm working on clickstream data and want to find the duration of each session, so I take the start time and end time of the session. However, on subtraction I'm getting the wrong answer.
Here is the data
Sid Tstamp Itemid Category
0 1 2014-04-07T10:51:09.277Z 214536502 0
1 1 2014-04-07T10:54:09.868Z 214536500 0
2 1 2014-04-07T10:54:46.998Z 214536506 0
3 1 2014-04-07T10:57:00.306Z 214577561 0
4 2 2014-04-07T13:56:37.614Z 214662742 0
5 2 2014-04-07T13:57:19.373Z 214662742 0
6 2 2014-04-07T13:58:37.446Z 214825110 0
7 2 2014-04-07T13:59:50.710Z 214757390 0
8 2 2014-04-07T14:00:38.247Z 214757407 0
9 2 2014-04-07T14:02:36.889Z 214551617 0
10 3 2014-04-02T13:17:46.940Z 214716935 0
11 3 2014-04-02T13:26:02.515Z 214774687 0
12 3 2014-04-02T13:30:12.318Z 214832672 0
I referred to this question for the code: Timestamp Conversion
Here is my code:
import datetime
import time

k.columns = ['Sid', 'Tstamp', 'Itemid', 'Category']
k = k.loc[:, ('Sid', 'Tstamp')]

# Find max timestamp
idx = k.groupby(['Sid'])['Tstamp'].transform(max) == k['Tstamp']
ah = k[idx].reset_index()

# Find min timestamp
idy = k.groupby(['Sid'])['Tstamp'].transform(min) == k['Tstamp']
ai = k[idy].reset_index()

# Group by Sid and apply count to retain the distinct Sid values
kgrp = k.groupby('Sid').count()

i = 0
for temp1, temp2 in zip(ah['Tstamp'], ai['Tstamp']):
    sv1 = datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2 = datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    d1 = time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0)
    d2 = time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0)
    kgrp.loc[i, 'duration'] = d1 - d2
    i = i + 1
Here is the output.
kgrp
Out[5]:
Tstamp duration
Sid
1 4 359.275
2 6 745.378
3 3 1034.468
For session id 2 the duration should be close to 6 minutes, but I'm getting almost 12 minutes. I reckon I'm making some silly mistake here.
Also, I'm grouping by Sid and applying count on it only to get a frame indexed by Sid in which I can store each duration. Is there an easier way to keep only the Sid and its duration, without the 'Tstamp' count column?

You are assigning the duration value to the wrong label.
In your test data sid starts from 1 but i starts from 0:
# for sid 1, i == 0
kgrp.loc[i,'duration']= d1-d2
i=i+1
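One way to fix it (a minimal sketch, reusing the ah, ai and kgrp frames from the question) is to index kgrp by the actual Sid value instead of a separate counter; subtracting the two datetime objects directly also avoids the mktime round-trip:
# assign by the group's actual Sid label instead of a 0-based counter
for sid, temp1, temp2 in zip(ah['Sid'], ah['Tstamp'], ai['Tstamp']):
    sv1 = datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2 = datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    kgrp.loc[sid, 'duration'] = (sv1 - sv2).total_seconds()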
Update
A more pythonic way to handle this :)
def calculate_duration(dt1, dt2):
    # parse the ISO timestamps and return the difference in seconds
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    t1 = datetime.datetime.strptime(dt1, fmt)
    t2 = datetime.datetime.strptime(dt2, fmt)
    return (t1 - t2).total_seconds()

k = k.loc[:, ('Sid', 'Tstamp')]
result = k.groupby('Sid')['Tstamp'].agg(
    Duration=lambda x: calculate_duration(x.max(), x.min()),
    Count='count',
)
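If the Tstamp column is converted with pd.to_datetime up front, the whole thing collapses to a single groupby with no manual parsing at all (a minimal sketch, assuming k still holds the raw strings from the question):
import pandas as pd

k['Tstamp'] = pd.to_datetime(k['Tstamp'], format="%Y-%m-%dT%H:%M:%S.%fZ")
durations = (k.groupby('Sid')['Tstamp']
              .agg(lambda x: (x.max() - x.min()).total_seconds())
              .rename('duration'))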

Related

Time Series from different variables

I am trying to create a variable that displays how many days a bulb was functional, based on the score variables (Score_Day_0 through Score_Day_40).
The dataset I am using is like the one below, where the score at each day ranges from 1 (working very well) to 10 (stopped working).
What I want is to create the variable Days, which displays the number of days the bulb was working. E.g. for sample 2 the score at day 10 is 8 and at day 20 is 10 (stopped working), so the number of days the bulb was working is 20.
Any suggestion?
Thank you so much for your help, hope you have a terrific day!!
sample    Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1  1            3             5             8             10            40
sample 2  3            8             10            10            10            20
I've tried to solve it myself with a conditional loop, but the number of observations in Days ends up much higher than the number of rows in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10', 'Score_Day_20', 'Score_Day_30', 'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append('10')
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append('20')
for m in cols['Score_Day_30']:
    if m == 10:
        Days.append('30')
for n in cols['Score_Day_40']:
    if n == 10:
        Days.append('40')
You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a given function on each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives each row as a Series object. To find the first occurrence of the maximum in a Series, we select with a simple boolean condition and take the first entry of the resulting index, which is exactly what we are looking for: the label of the column where the row first reaches its maximal value.
lambda x: x[x == x.max()].index[0]
Example:
import pandas as pd

df = pd.DataFrame(dict(d0=[1,1,1], d10=[1,5,10], d20=[5,10,10], d30=[8,10,10]))
# d0 d10 d20 d30
# 0 1 1 5 8
# 1 1 5 10 10
# 2 1 10 10 10
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
# d0 d10 d20 d30 days
# 0 1 1 5 8 d30
# 1 1 5 10 10 d20
# 2 1 10 10 10 d10
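The question actually wants the numeric day rather than the column label. With labels like the d0/d10/d20 used in the example above (the Score_Day_N names in the original data would need their prefix stripped instead), the label column can be converted afterwards, e.g.:
# turn 'd30' -> 30, 'd20' -> 20, ... (for the example labels above)
df['days'] = df['days'].str.lstrip('d').astype(int)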

Pandas create a conditional column based on cumulative logic operations of the other columns

I have 2 columns that represent on-switch and off-switch indicators. I want to create a column called last switch that keeps track of the last direction of the switch (whether it was on or off). There is one more condition: if both the on and off values are 1 in a particular row, then the last switch output should be the opposite sign of the previous last switch. I have managed to get this almost correct, except for the rows where both on and off are 1, which my code handles wrongly.
I also attached a screenshot with the desired output. Any help is much appreciated.
df=pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],[1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on','off'])
df['last_switch']=(df['on']-df['off']).replace(0,method='ffill')
Add the following lines to your existing code:
for i in range(df.shape[0]):
    df['prev'] = df['last_switch'].shift()
    df.loc[(df['on'] == 1) & (df['off'] == 1), 'last_switch'] = df['prev'] * (-1)

df.drop('prev', axis=1, inplace=True)
df['last_switch'] = df['last_switch'].astype(int)
Output:
on off last_switch
0 1 0 1
1 1 0 1
2 0 1 -1
3 0 1 -1
4 0 0 -1
5 0 0 -1
6 1 0 1
7 1 1 -1
8 0 1 -1
9 1 0 1
10 1 1 -1
11 1 1 1
12 0 1 -1
Let me know if you need an explanation of the code.
df = pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],[1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on','off'])
df['last_switch'] = (df['on'] - df['off']).replace(0, method='ffill')

prev_row = None

def apply_logic(row):
    global prev_row
    if prev_row is not None:
        if (row["on"] == 1) and (row["off"] == 1):
            row["last_switch"] = -prev_row["last_switch"]
    prev_row = row.copy()
    return row

df = df.apply(apply_logic, axis=1)
Personally I am not a big fan of looping over a dataframe. shift won't work in this case because the "last_switch" column is dynamic and changes based on the on/off status. Using your intermediate result with apply, while carrying the value from the previous row, should do the trick. Hope it makes sense.
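For completeness, a minimal sketch of the same carry-forward idea written as a plain Python loop (assuming df and the ffill-based intermediate last_switch column from above); it avoids the global variable:
last = None
out = []
for on, off, cur in zip(df['on'], df['off'], df['last_switch']):
    if on == 1 and off == 1 and last is not None:
        cur = -last          # both switches set: flip the previous direction
    out.append(cur)
    last = cur
df['last_switch'] = out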

Python: Return rows where the same IDs are within 30 days of the next

test = pd.DataFrame({'ID': [1,2,3,3,4,4], 'ID2': [1,1,1,1,2,1],
                     'dts1': ['2016-1-25','2016-1-25','2016-1-25','2016-2-20','2016-1-25','2016-2-20'],
                     'dts2': ['2016-1-27','2016-1-27','2016-1-27','2016-2-24','2016-1-27','2016-2-24']})
I have a data frame like:
ID ID2 dts1 dts2
0 1 1 2016-1-25 2016-1-27
1 2 1 2016-1-25 2016-1-27
2 3 1 2016-1-25 2016-1-27
3 3 1 2016-2-20 2016-2-24
4 4 2 2016-1-25 2016-1-27
5 4 1 2016-2-20 2016-2-24
I want rows that 1) have the same ID, 2) have a different ID2, and 3) have a dts2 within 30 days of the dts1 of the next row with the same ID...
For this dataframe I would want the last two rows (where ID = next ID, ID2 != next ID2, and the next dts1 is within 30 days of dts2).
**EDIT**
ts_df[ts_df.groupby(['ID']).apply(
    lambda x: ((x['dts1'].shift(-1) - x['dts2'] <= pd.Timedelta('30days')) &
               (x['ID2'].shift(-1) != x['ID2'])) |
              ((x['dts1'] - x['dts2'].shift(1) <= pd.Timedelta('30days')) &
               (x['ID2'] != x['ID2'].shift(1)))
).values]
The only thing I have found to work is the above ^
It is super slow (22 min on my dataset), so any improvement would be much appreciated.
test['dts1'] = pd.to_datetime(test['dts1'])
test['dts2'] = pd.to_datetime(test['dts2'])

def get_what_you_need(df):
    mask1 = df[df.duplicated(subset='ID', keep=False)]
    mask2 = mask1.drop_duplicates(subset=['ID', 'ID2'], keep=False).reset_index(drop=True)
    idx = 0
    if len(df) >= 2:
        mask3 = (mask2.loc[idx + 1, 'dts1'] - mask2.loc[idx, 'dts2']) < pd.Timedelta(days=30)
    else:
        return None
    if mask3:
        return mask2
    else:
        return None

get_what_you_need(test)
I put idx and days as constants here. If you want, you could set the idx and days as the parameters of the function.
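Since the original groupby/apply version is reported to be slow, here is a vectorized sketch along the same lines (an assumption on my part, not part of the answer above): compare each row with the next row of the same ID via a grouped shift.
test['dts1'] = pd.to_datetime(test['dts1'])
test['dts2'] = pd.to_datetime(test['dts2'])
test = test.sort_values(['ID', 'dts1'])

g = test.groupby('ID')
# a row "matches" when the next row with the same ID has a different ID2
# and its dts1 falls within 30 days of this row's dts2
forward = ((g['ID2'].shift(-1) != test['ID2']) &
           (g['dts1'].shift(-1) - test['dts2'] <= pd.Timedelta('30days')))
# keep both rows of each matching pair
backward = forward.groupby(test['ID']).shift(1, fill_value=False)
result = test[forward | backward]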

Is there a way to speed up the following pandas for loop?

My data frame contains 10,000,000 rows! After group by, ~ 9,000,000 sub-frames remain to loop through.
The code is:
data = pd.read_csv('big.csv')

for id, new_df in data.groupby(level=0):  # look at mini df and do some analysis
    # some code for each of the small data frames
    ...
This is super inefficient, and the code has been running for 10+ hours now.
Is there a way to speed it up?
Full Code:
d = pd.DataFrame() # new df to populate
print 'Start of the loop'
for id, new_df in data.groupby(level=0):
c = [new_df.iloc[i:] for i in range(len(new_df.index))]
x = pd.concat(c, keys=new_df.index).reset_index(level=(2,3), drop=True).reset_index()
x = x.set_index(['level_0','level_1', x.groupby(['level_0','level_1']).cumcount()])
d = pd.concat([d, x])
To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])
Note:
Most ids will only have 1 date, which indicates a single visit. For ids with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd dimension out of 3. The output shape is (id, visits, features).
Here is one way to speed that up. It adds the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code takes > 100 seconds on only 10,000 rows of the same sample data. That is a couple of orders of magnitude improvement.
Code:
def make_3d(csv_filename):

    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
                prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Test Code:
start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)
df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))
Results:
1.7390995025634766
depth feature0 feature1 feature2
id date
207555809644681 20180104 1 0.03125 0.038623 0.008130
247833985674646 20180106 1 0.03125 0.004378 0.004065
252945024181083 20180107 1 0.03125 0.062836 0.065041
20180107 2 0.00000 0.001870 0.008130
20180109 1 0.00000 0.001870 0.008130
329567241731951 20180117 1 0.00000 0.041952 0.004065
20180117 2 0.03125 0.003101 0.004065
20180117 3 0.00000 0.030780 0.004065
20180118 1 0.03125 0.003101 0.004065
20180118 2 0.00000 0.030780 0.004065
I believe your approach to the feature engineering could be done better, but I will stick to answering your question.
In Python, iterating over a dictionary of DataFrames is much faster than iterating over the groups of a groupby.
Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):
# reset the Dataframe index to get level 0 back as a column in your dataset
df = data.reset_index() # the index will be (id, date)
# split the DataFrame based on id
# and store the splits as Dataframes in a dictionary using id as key
d = dict(tuple(df.groupby('id')))
# iterate over the Dictionary and process the values
for key, value in d.items():
    pass  # each value is a DataFrame
# concat the values and get the original (processed) Dataframe back
df2 = pd.concat(d.values(), ignore_index=True)
Modified @Stephen's code:
def make_3d(dataset):

    def make_3d_lines(a_df):
        a_df['depth'] = 0                        # sets all depth from (1 to n) to 0
        depth = 1                                # initiate from 1, so that the first loop is correct
        prev = None
        accum = []                               # accumulates blocks of data belonging to a given user
        for row in a_df.values.tolist():         # for each row in our dataset
            row[0] = 0                           # NOT SURE
            key = row[1]                         # this is the id of the row
            if key == prev:                      # if this row's id matches the previous row's id, append together
                depth += 1
                accum.append(row)
            else:                                # else this id is new, previous block is completed -> process it
                if depth == 0:                   # previous id appeared only once -> get that row from accum
                    yield accum[0]               # also remember that depth = 0
                else:                            # process the block and emit each row
                    depth = 0
                    to_emit = []                 # prepare to emit the list
                    for i in range(len(accum)):  # for each unique day in the accumulated list
                        date = accum[i][2]       # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j       # define the depth
                            to_emit[-1][2] = date    # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
                prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True, False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Testing:
t = pd.DataFrame(data={'id': [1,1,1,1,2,2,3,3,4,5],
                       'date': [20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20170501,20180304],
                       'feature': [10,20,45,1,14,15,20,20,13,11],
                       'result': [1,1,0,0,0,0,1,0,1,1]})
t = t.reindex(columns=['id','date','feature','result'])
print(t)
id date feature result
0 1 20180311 10 1
1 1 20180310 20 1
2 1 20180210 45 0
3 1 20170505 1 0
4 2 20180312 14 0
5 2 20180311 15 0
6 3 20180312 20 1
7 3 20180311 20 0
8 4 20170501 13 1
9 5 20180304 11 1
Output
depth feature result
id date
1 20180311 0 10 1
20180311 1 20 1
20180311 2 45 0
20180311 3 1 0
20180310 0 20 1
20180310 1 45 0
20180310 2 1 0
20180210 0 45 0
20180210 1 1 0
20170505 0 1 0
2 20180312 0 14 0
20180312 1 15 0
20180311 0 15 0
3 20180312 0 20 1
20180312 1 20 0
20180311 0 20 0
4 20170501 0 13 1

Pandas check for overlapping dates in multiple rows

I need to run a function on a large groupby query that checks whether two subGroups have any overlapping dates. Below is an example of a single group tmp:
ID num start stop subGroup
0 21 10 2006-10-10 2008-10-03 1
1 21 46 2006-10-10 2100-01-01 2
2 21 5 1997-11-25 1998-09-29 1
3 21 42 1998-09-29 2100-01-01 2
4 21 3 1997-01-07 1997-11-25 1
5 21 6 2006-10-10 2008-10-03 1
6 21 47 1998-09-29 2006-10-10 2
7 21 4 1997-01-07 1998-09-29 1
The function I wrote to do this looks like this:
def hasOverlap(tmp):
    d2_starts = tmp[tmp['subGroup'] == 2]['start']
    d2_stops = tmp[tmp['subGroup'] == 2]['stop']
    return tmp[tmp['subGroup'] == 1].apply(lambda row_d1:
        (
            # Check for partly nested D2 in D1
            ((d2_starts >= row_d1['start']) &
             (d2_starts <  row_d1['stop'])) |
            ((d2_stops  >= row_d1['start']) &
             (d2_stops  <  row_d1['stop'])) |
            # Check for fully nested D1 in D2
            ((d2_stops  >= row_d1['stop']) &
             (d2_starts <= row_d1['start']))
        ).any(),
        axis=1
    ).any()
The problem is that this code has many redundancies and when I run the query:
groups.agg(hasOverlap)
It takes an unreasonably long time to terminate.
Are there any performance fixes (such as using built-in functions or set_index) that I could do to speed this up?
Are you just looking to return True or False based on the presence of an overlap? If so, I'd just get a list of the dates for each subgroup, and then use pandas' isin method to check whether they overlap.
You could try something like this:
# split the subgroups into separate DataFrames
group1 = groups[groups.subgroup == 1]
group2 = groups[groups.subgroup == 2]

# check if any of the start dates from group 2 are in group 1
if len(group1[group1.start.isin(list(group2.start))]) > 0:
    print("Group1 overlaps group2")

# check if any of the start dates from group 1 are in group 2
if len(group2[group2.start.isin(list(group1.start))]) > 0:
    print("Group2 overlaps group1")
