How to change format of timedelta in pandas - python

import pandas as pd
import datetime
import numpy as np
from datetime import timedelta
def diff_func(row):
    return row['Timestamp'] - row['previous_end']
dfMockLog = [ (1, ("2017-01-01 09:00:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 09:01:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:02:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 09:05:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 09:30:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:33:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 09:37:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:41:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 10:00:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 11:00:00"), "htt://x.org/page2.html"),
(2, ("2017-01-01 09:41:00"), "htt://x.org/page3.html"),
(2, ("2017-01-01 09:42:00"), "htt://x.org/page1.html"),
(2, ("2017-01-01 09:43:00"), "htt://x.org/page2.html")]
dfMockLog = pd.DataFrame(dfMockLog, columns = ['user', 'Timestamp', 'url'])
dfMockLog['Timestamp'] = pd.to_datetime(dfMockLog['Timestamp'])
dfMockLog = dfMockLog.sort_values(['user','Timestamp'])
dfMockLog['previous_end'] = dfMockLog.groupby(['user'])['Timestamp'].shift(1)
dfMockLog['time_diff'] = dfMockLog.apply(diff_func, axis=1)
dfMockLog['cum_sum'] = dfMockLog['time_diff'].cumsum()
print(dfMockLog)
I need the "time_diff" column converted into seconds, and the "cum_sum" column should contain the cumulative sum partitioned by "user". It would also be great if someone could share all the possible formats for a timedelta.

You are close. My preferred way is to make a new column containing time_diff in seconds via pd.Series.dt.seconds (or dt.total_seconds(), which is safer when differences can span more than a day), then use groupby.transform to get the cumulative sum per user:
dfMockLog['time_diff_secs'] = dfMockLog['time_diff'].dt.seconds
dfMockLog['cum_sum'] = dfMockLog.groupby('user')['time_diff_secs'].transform('cumsum')
print(dfMockLog)
user Timestamp url previous_end \
0 1 2017-01-01 09:00:00 htt://x.org/page1.html NaT
1 1 2017-01-01 09:01:00 htt://x.org/page2.html 2017-01-01 09:00:00
2 1 2017-01-01 09:02:00 htt://x.org/page3.html 2017-01-01 09:01:00
3 1 2017-01-01 09:05:00 htt://x.org/page3.html 2017-01-01 09:02:00
4 1 2017-01-01 09:30:00 htt://x.org/page2.html 2017-01-01 09:05:00
5 1 2017-01-01 09:33:00 htt://x.org/page1.html 2017-01-01 09:30:00
6 1 2017-01-01 09:37:00 htt://x.org/page2.html 2017-01-01 09:33:00
7 1 2017-01-01 09:41:00 htt://x.org/page3.html 2017-01-01 09:37:00
8 1 2017-01-01 10:00:00 htt://x.org/page1.html 2017-01-01 09:41:00
9 1 2017-01-01 11:00:00 htt://x.org/page2.html 2017-01-01 10:00:00
10 2 2017-01-01 09:41:00 htt://x.org/page3.html NaT
11 2 2017-01-01 09:42:00 htt://x.org/page1.html 2017-01-01 09:41:00
12 2 2017-01-01 09:43:00 htt://x.org/page2.html 2017-01-01 09:42:00
time_diff time_diff_secs cum_sum
0 NaT NaN NaN
1 00:01:00 60.0 60.0
2 00:01:00 60.0 120.0
3 00:03:00 180.0 300.0
4 00:25:00 1500.0 1800.0
5 00:03:00 180.0 1980.0
6 00:04:00 240.0 2220.0
7 00:04:00 240.0 2460.0
8 00:19:00 1140.0 3600.0
9 01:00:00 3600.0 7200.0
10 NaT NaN NaN
11 00:01:00 60.0 60.0
12 00:01:00 60.0 120.0
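On the question of "all possible formats": the common conversions for a timedelta Series are sketched below. Note that dt.seconds returns only the seconds component within a day, so dt.total_seconds() is the safer choice when differences can exceed 24 hours.

```python
import pandas as pd

td = pd.Series([pd.Timedelta(minutes=1), pd.Timedelta(hours=1, minutes=30)])

td.dt.total_seconds()  # float seconds: 60.0, 5400.0 (safe across day boundaries)
td.dt.seconds          # integer seconds within the day component only
td.dt.components       # DataFrame with days/hours/minutes/seconds/... columns
td.astype(str)         # strings like '0 days 00:01:00'
```

For example, pd.Timedelta(days=1).total_seconds() is 86400.0 while its .seconds attribute is 0.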

Related

Combine dataframes result with a DatetimeIndex index

I have a pandas dataframe with random values at every minute.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,30,size=20), index=pd.date_range("20180101", periods=20, freq='T'))
df
0
2018-01-01 00:00:00 21
2018-01-01 00:01:00 21
2018-01-01 00:02:00 23
2018-01-01 00:03:00 18
2018-01-01 00:04:00 3
2018-01-01 00:05:00 11
2018-01-01 00:06:00 3
2018-01-01 00:07:00 4
2018-01-01 00:08:00 5
2018-01-01 00:09:00 25
2018-01-01 00:10:00 15
2018-01-01 00:11:00 11
2018-01-01 00:12:00 29
2018-01-01 00:13:00 22
2018-01-01 00:14:00 7
2018-01-01 00:15:00 13
2018-01-01 00:16:00 26
2018-01-01 00:17:00 7
2018-01-01 00:18:00 26
2018-01-01 00:19:00 15
Now I need to create a new column in the dataframe df that "reflects" the mean() of a window of 2 periods at a higher frequency (5 minutes).
df2 = df.resample('5T').sum().rolling(2).mean()
df2
0
2018-01-01 00:00:00 NaN
2018-01-01 00:05:00 67.0
2018-01-01 00:10:00 66.0
2018-01-01 00:15:00 85.5
Here comes the problem. I need to somehow "map" the values of the higher-frequency frame back onto the lower one.
I should get something like:
0 new_column
2018-01-01 00:00:00 21 NaN
2018-01-01 00:01:00 21 NaN
2018-01-01 00:02:00 23 NaN
2018-01-01 00:03:00 18 NaN
2018-01-01 00:04:00 3 NaN
2018-01-01 00:05:00 11 67.0
2018-01-01 00:06:00 3 67.0
2018-01-01 00:07:00 4 67.0
2018-01-01 00:08:00 5 67.0
2018-01-01 00:09:00 25 67.0
2018-01-01 00:10:00 15 66.0
2018-01-01 00:11:00 11 66.0
2018-01-01 00:12:00 29 66.0
2018-01-01 00:13:00 22 66.0
2018-01-01 00:14:00 7 66.0
2018-01-01 00:15:00 13 85.5
2018-01-01 00:16:00 26 85.5
2018-01-01 00:17:00 7 85.5
2018-01-01 00:18:00 26 85.5
2018-01-01 00:19:00 15 85.5
I am using pandas 0.23.4
You can just use:
df['new_column'] = df2[0].repeat(5).values
where 5 is your resampling factor.
Alternatively, you can pd.concat both dataframes and forward-fill:
df3=pd.concat([df,df2],axis=1).ffill()
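Both answers repeat each 5-minute value over its 1-minute rows; a reindex-based sketch of the same mapping (using deterministic np.arange data instead of the random values, so the result is checkable):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(20),
                  index=pd.date_range("20180101", periods=20, freq="T"))
df2 = df.resample("5T").sum().rolling(2).mean()

# reindex the 5-minute result onto the 1-minute index; ffill repeats each
# value until the next 5-minute boundary
df["new_column"] = df2[0].reindex(df.index, method="ffill")
```

Unlike repeat(5), this does not require the frame length to be an exact multiple of the resampling factor.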

Getting data for given day from pandas Dataframe

I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1st Jan 2000. The query below gives me the correct result, but is there a way to do it just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]
Use partial string indexing, but you need a DatetimeIndex first:
df = df.set_index('date1').loc['2000-01-01']
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import pandas as pd
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
from io import StringIO
df = pd.read_csv(StringIO(data), sep=r'\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]
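Yet another option, avoiding the string conversion entirely, is dt.normalize(), which zeroes out the time-of-day so each timestamp compares equal to its own midnight; a small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "date1": pd.to_datetime(["2000-01-01 10:01:00",
                             "2000-01-02 03:02:00",
                             "2000-01-01 00:05:00"]),
    "item_id": [1, 2, 5],
})

# normalize() keeps datetime64 dtype, so the comparison stays vectorized
res = df[df["date1"].dt.normalize() == "2000-01-01"]
```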

How can I organize data hour-by-hour and set the missing values to zeros?

I played games several times a day and I got a score each time. I would like to reorganize the data hour-by-hour, and set the missing values to zero.
Here is the original data:
import pandas as pd
df = pd.DataFrame({
'Time': ['2017-01-01 08:45:00', '2017-01-01 09:11:00',
'2017-01-01 11:40:00', '2017-01-01 14:05:00',
'2017-01-01 21:00:00'],
'Score': range(1, 6)})
It looks like this:
Score Time
0 1 2017-01-01 08:45:00
1 2 2017-01-01 09:11:00
2 3 2017-01-01 11:40:00
3 4 2017-01-01 14:05:00
4 5 2017-01-01 21:00:00
How can I get a new dataframe like this:
day Hour Score
2017-01-01 00:00:00 0
...
2017-01-01 08:00:00 1
2017-01-01 09:00:00 2
2017-01-01 10:00:00 0
2017-01-01 11:00:00 3
2017-01-01 12:00:00 0
2017-01-01 13:00:00 0
2017-01-01 14:00:00 4
2017-01-01 15:00:00 0
2017-01-01 16:00:00 0
...
2017-01-01 21:00:00 5
...
2017-01-01 23:00:00 0
Many thanks!
You can use resample with an aggregate function like sum, then fillna and convert to int with astype, but first add the first and last datetime values:
df.loc[-1, 'Time'] = '2017-01-01 00:00:00'
df.loc[-2, 'Time'] = '2017-01-01 23:00:00'
df['Time'] = pd.to_datetime(df['Time'])
df = df.resample('H', on='Time').sum().fillna(0).astype(int)
print (df)
Score
Time
2017-01-01 00:00:00 0
2017-01-01 01:00:00 0
2017-01-01 02:00:00 0
2017-01-01 03:00:00 0
2017-01-01 04:00:00 0
2017-01-01 05:00:00 0
2017-01-01 06:00:00 0
2017-01-01 07:00:00 0
2017-01-01 08:00:00 1
2017-01-01 09:00:00 2
2017-01-01 10:00:00 0
2017-01-01 11:00:00 3
2017-01-01 12:00:00 0
2017-01-01 13:00:00 0
2017-01-01 14:00:00 4
2017-01-01 15:00:00 0
2017-01-01 16:00:00 0
2017-01-01 17:00:00 0
2017-01-01 18:00:00 0
2017-01-01 19:00:00 0
2017-01-01 20:00:00 0
2017-01-01 21:00:00 5
2017-01-01 22:00:00 0
2017-01-01 23:00:00 0
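Writing sentinel rows with .loc[-1] mutates the original frame; an alternative sketch is to resample and then reindex over the full day (the hours range below is an assumption covering exactly one day):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["2017-01-01 08:45:00", "2017-01-01 09:11:00",
                            "2017-01-01 11:40:00", "2017-01-01 14:05:00",
                            "2017-01-01 21:00:00"]),
    "Score": range(1, 6),
})

# full set of hourly bins for the day, then reindex fills the gaps with 0
hours = pd.date_range("2017-01-01 00:00", "2017-01-01 23:00", freq="H")
out = df.resample("H", on="Time").sum().reindex(hours, fill_value=0)
```

This leaves df untouched and makes the target range explicit.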

Calculate datetime difference over multiple rows in dataframe

I have a Python-related question about datetimes in a dataframe. I imported the following df via pd.read_csv():
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
I would like to know the time difference over the rows that are labeled with A, B, C as in the following:
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0:02
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B 0:09
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C 0:02
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
So d_time should be the total time difference over the labeled rows. There are approx. 100 different labels, and a label can repeat from 1 to x times in a row. This calculation has to be done for over 1 million rows, so a loop will probably not work. Does anybody know how to do this? Thanks in advance.
Assuming the consecutive labels are all the same and separated by a single NaN, you can do something like this:
idx = pd.Series(df[pd.isnull(df['label'])].index)
idx_begin = idx.iloc[:-1] + 1
idx_end = idx.iloc[1:] - 1
d_time = df.loc[idx_end, 'datetime'].reset_index(drop=True) - df.loc[idx_begin, 'datetime'].reset_index(drop=True)
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
If your dataset looks different, you might look into different ways to get to idx_begin and idx_end, but this works for the dataset you posted
Multiple consecutive NaNs
If there are multiple consecutive NaN values, you can solve this by adding this at the end:
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None
Consecutive different labels
idx = df[(df['label'] != df['label'].shift(1)) & (pd.notnull(df['label']) | (pd.notnull(df['label'].shift(1))))].index
idx_begin = idx[:-1]
idx_end = idx[1:] -1
This marks each label change as the end of one run and the start of the next. To make this work, you will need the df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None line added at the end.
The & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1))) part is needed because None != None.
Result
datetime label d_time
0 2017-01-03 23:52:00 NaN NaN
1 2017-01-03 23:53:00 A NaN
2 2017-01-03 23:54:00 A NaN
3 2017-01-03 23:52:00 NaN NaN
4 2017-01-03 23:53:00 B NaN
5 2017-01-03 23:54:00 B NaN
6 2017-01-03 23:55:00 NaN NaN
7 2017-01-03 23:56:00 NaN NaN
8 2017-01-03 23:57:00 NaN NaN
9 2017-01-04 00:02:00 A NaN
10 2017-01-04 00:06:00 A NaN
11 2017-01-04 00:09:00 A NaN
12 2017-01-04 00:02:00 B NaN
13 2017-01-04 00:06:00 B NaN
14 2017-01-04 00:09:00 B NaN
15 2017-01-04 00:11:00 NaN NaN
yields
datetime label d_time
0 2017-01-03 23:52:00 NaN NaT
1 2017-01-03 23:53:00 A 00:01:00
2 2017-01-03 23:54:00 A NaT
3 2017-01-03 23:52:00 NaN NaT
4 2017-01-03 23:53:00 B 00:01:00
5 2017-01-03 23:54:00 B NaT
6 2017-01-03 23:55:00 NaN NaT
7 2017-01-03 23:56:00 NaN NaT
8 2017-01-03 23:57:00 NaN NaT
9 2017-01-04 00:02:00 A 00:07:00
10 2017-01-04 00:06:00 A NaT
11 2017-01-04 00:09:00 A NaT
12 2017-01-04 00:02:00 B 00:07:00
13 2017-01-04 00:06:00 B NaT
14 2017-01-04 00:09:00 B NaT
15 2017-01-04 00:11:00 NaN NaT
Last Series
If the last row's label is unchanged compared to the one before it, the last run will not register. You can prevent this by adding the following after the first line:
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]] + 1)
If the datetimes are datetime objects (or pandas.Timestamp) you can use this for-loop:
a_rows = []
for row in df.itertuples():
    if row.label == 'A':
        a_rows.append(row)
    elif a_rows:
        d_time = a_rows[-1].datetime - a_rows[0].datetime
        df.loc[a_rows[0].Index, 'd_time'] = d_time
        a_rows = []
with this result
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0 days 00:02:00
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 A 0 days 00:07:00
6 2017-01-04 00:06:00 A
7 2017-01-04 00:09:00 A
8 2017-01-04 00:11:00
You can later format the timedelta object if you want.
If the datetime column contains strings, you can easily convert them with df['datetime'] = pd.to_datetime(df['datetime']).
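For the general case (many labels, runs of varying length, consecutive different labels), a vectorized sketch using the same shift/cumsum run-id trick as above, applied via groupby instead of index arithmetic; the tiny frame here is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-01-03 23:52", "2017-01-03 23:53",
                                "2017-01-03 23:54", "2017-01-03 23:55",
                                "2017-01-04 00:02", "2017-01-04 00:06",
                                "2017-01-04 00:09"]),
    "label": [None, "A", "A", "A", "B", "B", "B"],
})

# a new run starts whenever the label changes; cumsum gives each run an id
run_id = (df["label"] != df["label"].shift()).cumsum()

# duration of each run = last timestamp minus first timestamp within the run
dur = df.groupby(run_id)["datetime"].transform(lambda s: s.iloc[-1] - s.iloc[0])

# keep the duration only on the first row of each labelled run
first_of_run = run_id != run_id.shift()
df.loc[first_of_run & df["label"].notna(), "d_time"] = dur
```

This avoids the Python-level loop entirely, which matters at the million-row scale mentioned in the question.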

Append two Pandas series to a dataframe by columns with a Loop

I have a dataframe and two pandas Series, ac and cc, and I want to append these two Series as columns with a loop. The problem is that my dataframe has a time index while the Series have integer indexes.
import numpy as np
import pandas as pd
from datetime import datetime

A = 'a'
cc = pd.Series(np.zeros(len(A)*20))
ac = pd.Series(np.random.randn(10))
index = pd.date_range(start=datetime(2017, 1, 1), end=datetime(2017, 1, 2), freq='1h')
df = pd.DataFrame(index=index)
I already got an answer to my question, but without a loop, here.
Now I need to add a loop, but I get an error with the keys:
az = [cc, ac]
for i in az:
    df.join(
        pd.concat(
            [pd.Series(s.values, index[:len(s)]) for s in [i]],
            axis=1, keys=[i]
        )
    )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.any() or a.all().
I tried with keys=[i.all()]; I get the correct answer except that instead of the column names I have True and False.
The final result should be like this :
cc ac
2017-01-01 00:00:00 1 0.247043
2017-01-01 01:00:00 1 -0.324868
2017-01-01 02:00:00 1 -0.004868
2017-01-01 03:00:00 1 0.047043
2017-01-01 04:00:00 1 -0.447043
2017-01-01 05:00:00 NaN NaN
... ... ...
Create a list of tuples where the first element is the column name and the second is the series itself.
az = [('cc', cc), ('ac', ac)]
for c, s in az:
    df[c] = pd.Series(s.values, index[:len(s)])
cc ac
2017-01-01 00:00:00 0.0 2.062265
2017-01-01 01:00:00 0.0 -0.225066
2017-01-01 02:00:00 0.0 -1.698330
2017-01-01 03:00:00 0.0 -1.068081
2017-01-01 04:00:00 0.0 0.142956
2017-01-01 05:00:00 0.0 -1.244232
2017-01-01 06:00:00 0.0 -1.072311
2017-01-01 07:00:00 0.0 0.242069
2017-01-01 08:00:00 0.0 0.120093
2017-01-01 09:00:00 0.0 -0.335500
2017-01-01 10:00:00 0.0 NaN
2017-01-01 11:00:00 0.0 NaN
2017-01-01 12:00:00 0.0 NaN
2017-01-01 13:00:00 0.0 NaN
2017-01-01 14:00:00 0.0 NaN
2017-01-01 15:00:00 0.0 NaN
2017-01-01 16:00:00 0.0 NaN
2017-01-01 17:00:00 0.0 NaN
2017-01-01 18:00:00 0.0 NaN
2017-01-01 19:00:00 0.0 NaN
2017-01-01 20:00:00 NaN NaN
2017-01-01 21:00:00 NaN NaN
2017-01-01 22:00:00 NaN NaN
2017-01-01 23:00:00 NaN NaN
2017-01-02 00:00:00 NaN NaN
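The key point in the loop above is that s.values strips the Series' integer index so the values can be re-labelled with the first len(s) timestamps; assignment then aligns on the dataframe's time index, leaving NaN beyond the Series' length. A minimal self-contained sketch with deterministic data:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2017-01-01", periods=6, freq="H")
df = pd.DataFrame(index=index)

cc = pd.Series(np.ones(4))        # integer index 0..3
ac = pd.Series(np.arange(3.0))    # integer index 0..2

for name, s in [("cc", cc), ("ac", ac)]:
    # re-label the values with the first len(s) timestamps, then align
    df[name] = pd.Series(s.values, index=index[:len(s)])
```

Assigning the Series directly (df[name] = s) would instead align integer labels against timestamps and produce all NaN.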
