Python datetime problem converting time format

I would like to convert the following time format, which is located in a pandas DataFrame column:
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
I would like to transform the previous time format into a standard HH:MM time format, as follows:
01:00
02:00
03:00
...
15:00
16:00
...
22:00
23:00
00:00
How can I do it in Python?
Thank you in advance.

This will give you a DataFrame with both a datetime64[ns] column and an object (string) column for your data:
import pandas as pd

# read the one-column file of times written as 100, 200, ..., 2400
df = pd.read_csv('hm.txt', sep=r"[ ]{2,}", engine='python', header=None, names=['pre'])
# derive the hour: drop the trailing "00" (the minutes) and wrap 24 around to 0 so 2400 parses
df['pre_1'] = (df['pre'].astype(int) // 100 % 24).astype(str)
df['datetime_dtype'] = pd.to_datetime(df['pre_1'], format='%H', exact=False)
# keep only the HH:MM part of the full timestamp as a string
df['str_dtype'] = df['datetime_dtype'].astype(str).str[11:16]
print(df.head(5))
   pre pre_1      datetime_dtype str_dtype
0  100     1 1900-01-01 01:00:00     01:00
1  200     2 1900-01-01 02:00:00     02:00
2  300     3 1900-01-01 03:00:00     03:00
3  400     4 1900-01-01 04:00:00     04:00
4  500     5 1900-01-01 05:00:00     05:00
print(df.dtypes)
pre                        int64
pre_1                     object
datetime_dtype    datetime64[ns]
str_dtype                 object
dtype: object
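If the column already holds integers like 100, 200, ..., 2400, a shorter route is to derive the hour arithmetically and format it, which also maps 2400 to 00:00. This is a sketch assuming the column is named pre; the hhmm name is just for illustration:
import pandas as pd

df = pd.DataFrame({'pre': [100, 200, 1500, 2300, 2400]})
# integer-divide by 100 to get the hour, wrap 24 around to 0, then format as HH:MM
df['hhmm'] = (df['pre'] // 100 % 24).map('{:02d}:00'.format)
print(df)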

Related

Sum of individual labels over a month of granular data

I have a dataframe which contains life logging data gathered over several years from 44 unique individuals.
Int64Index: 77171 entries, 0 to 4279
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start 77171 non-null datetime64[ns]
1 end 77171 non-null datetime64[ns]
2 labelName 77171 non-null category
3 id 77171 non-null int64
The start column contains granular datetimes of the format 2020-11-01 11:00:00, in intervals of 30 minutes. The labelName column has 14 different categories.
Categories (14, object): ['COOK', 'EAT', 'GO WALK', 'GO TO BATHROOM', ..., 'DRINK', 'WAKE UP', 'SLEEP', 'WATCH TV']
Here's the head of one sample user's data, which is [2588 rows x 4 columns] and spans from 2020 to 2021. There are also occasional gaps in the data.
start end labelName id
0 2020-08-05 00:00:00 2020-08-05 00:30:00 GO TO BATHROOM 486
1 2020-08-05 06:00:00 2020-08-05 06:30:00 WAKE UP 486
2 2020-08-05 09:00:00 2020-08-05 09:30:00 COOK 486
3 2020-08-05 11:00:00 2020-08-05 11:30:00 EAT 486
4 2020-08-05 12:00:00 2020-08-05 12:30:00 EAT 486
.. ... ... ... ...
859 2021-03-10 12:30:00 2021-03-10 13:00:00 GO TO BATHROOM 486
861 2021-03-10 13:30:00 2021-03-10 14:00:00 GO TO BATHROOM 486
862 2021-03-10 18:30:00 2021-03-10 19:00:00 COOK 486
864 2021-03-11 08:00:00 2021-03-11 08:30:00 EAT 486
865 2021-03-11 12:30:00 2021-03-11 13:00:00 COOK 486
I want a count of each unique labelName per user per month, but I'm not sure how to do this.
I would first split the data frame by id, which is easy. But how do you handle the start datetimes, which are recorded every 30 minutes over several years of data, and then create 14 new columns holding those counts?
The final data frame might look something like this (with fake values):
user  month  SLEEP  ...  WATCH TV
 486  jun20    324  ...        23
 486  jul20    234  ...        12
The use-case for this data frame is training a few statistical and machine-learning models.
How do I achieve something like this?
Because the data comes in 30-minute records, you can count them with crosstab per month (monthly periods via Series.dt.to_period) and then multiply by 0.5 to express the output in hours.
If start is 2020-09-30 23:30:00 and end is 2020-10-01 00:00:00, use df['end'] in the crosstab if that record should count toward October, or df['start'] if it should count toward September.
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df1 = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName']).mul(0.5)
         .rename_axis(columns=None, index=['id', 'month'])
         .rename(columns=str)
         .reset_index()
         .assign(month=lambda x: x['month'].dt.strftime('%b%Y')))
print (df1)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0.0 0.0 1.0 0.5 1.0
1 650 Mar2021 0.5 1.0 0.5 0.5 0.0
For output in 30-minute counts instead of hours:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df = (pd.crosstab([df['id'], df['end'].dt.to_period('m')], df['labelName'])
        .rename_axis(columns=None, index=['id', 'month'])
        .reset_index()
        .assign(month=lambda x: x['month'].dt.strftime('%b%Y')))
print (df)
id month COOK EAT GO TO BATHROOM SLEEP WAKE UP
0 650 Sep2020 0 0 2 1 2
1 650 Mar2021 1 2 1 1 0
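If you prefer groupby over crosstab, a sketch of the same monthly count, assuming the same id, start and labelName columns and that start has already been converted with pd.to_datetime as above:
# count 30-minute records per id, per month of the start time, per label
counts = (df.groupby(['id', df['start'].dt.to_period('M'), 'labelName'])
            .size()
            .unstack(fill_value=0))
print(counts)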
Use:
from collections import Counter
(df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName']
   .apply(lambda x: Counter(x))
   .reset_index()
   .pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0))
Demonstration:
#Preparing Data
string = """start end labelName id
2020-09-21 14:30:00 2020-09-21 15:00:00 WAKE UP 650
2020-09-21 15:00:00 2020-09-21 15:30:00 GO TO BATHROOM 650
2020-09-21 15:30:00 2020-09-21 16:00:00 SLEEP 650
2020-09-29 17:00:00 2020-09-29 17:30:00 WAKE UP 650
2020-09-29 17:30:00 2020-09-29 18:00:00 GO TO BATHROOM 650
2021-03-11 13:00:00 2021-03-11 13:30:00 EAT 650
2021-03-11 14:30:00 2021-03-11 15:00:00 GO TO BATHROOM 650
2021-03-11 15:00:00 2021-03-11 15:30:00 COOK 650
2021-03-11 15:30:00 2021-03-11 16:00:00 EAT 650
2021-03-11 16:00:00 2021-03-11 16:30:00 SLEEP 650"""
data = [x.split(' ') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns = data[0])
df['start'] = pd.to_datetime(df['start'])
#Solution
from collections import Counter
(df.groupby([df['start'].dt.to_period('M'), 'id'])['labelName']
   .apply(lambda x: Counter(x))
   .reset_index()
   .pivot_table('labelName', ['id', 'start'], 'level_2', fill_value=0))

Subtracting between rows of different columns in an indexed dataframe in python

I have an indexed dataframe (indexed by type, then date) and would like to compute the difference, in hours, between the end time of one row and the start time of the next row:
type date start_time end_time code
A 01/01/2018 01/01/2018 9:00 01/01/2018 14:00 525
01/02/2018 01/02/2018 5:00 01/02/2018 17:00 524
01/04/2018 01/04/2018 8:00 01/04/2018 10:00 528
B 01/01/2018 01/01/2018 5:00 01/01/2018 14:00 525
01/04/2018 01/04/2018 2:00 01/04/2018 17:00 524
01/05/2018 01/05/2018 7:00 01/05/2018 10:00 528
I would like to get the resulting table with a new 'interval' column:
type date interval
A 01/01/2018 -
01/02/2018 15
01/04/2018 39
B 01/01/2018 -
01/04/2018 60
01/05/2018 14
The interval column is in hours
You can convert start_time and end_time to datetime format, then use apply to subtract the end_time of the previous row in each group (using groupby). To convert to hours, divide by pd.Timedelta('1 hour'):
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['interval'] = (df.groupby(level=0,sort=False).apply(lambda x: x.start_time-x.end_time.shift(1)) / pd.Timedelta('1 hour')).values
>>> df
start_time end_time code interval
type date
A 01/01/2018 2018-01-01 09:00:00 2018-01-01 14:00:00 525 NaN
01/02/2018 2018-01-02 05:00:00 2018-01-02 17:00:00 524 15.0
01/04/2018 2018-01-04 08:00:00 2018-01-04 10:00:00 528 39.0
B 01/01/2018 2018-01-01 05:00:00 2018-01-01 14:00:00 525 NaN
01/04/2018 2018-01-04 02:00:00 2018-01-04 17:00:00 524 60.0
01/05/2018 2018-01-05 07:00:00 2018-01-05 10:00:00 528 14.0
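A variant of the same idea, as a sketch under the same index and column names, uses a grouped shift and total_seconds() instead of dividing by a Timedelta:
# shift end_time down by one row within each type group, subtract, and convert to hours
prev_end = df.groupby(level=0, sort=False)['end_time'].shift(1)
df['interval'] = (df['start_time'] - prev_end).dt.total_seconds() / 3600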

Subtracting rows in different files

I am selecting several CSV files in a folder. Each file has a "Time" column.
I would like to plot an additional column called time duration, which subtracts the first row's time from each row's time, for each file.
What should I add to my code?
output = pd.DataFrame()
for name in list_files_log:
    with folder.get_download_stream(name) as f:
        try:
            tmp = pd.read_csv(f)
            tmp["sn"] = get_sn(name)
            tmp["filename"] = os.path.basename(name)
            output = output.append(tmp)
        except:
            pass
If your Time column looks like this:
Time
0 2015-02-04 02:10:00
1 2016-03-05 03:30:00
2 2017-04-06 04:40:00
3 2018-05-07 05:50:00
You could create Duration column using:
df['Duration'] = df['Time'] - df['Time'][0]
And you'd get:
Time Duration
0 2015-02-04 02:10:00 0 days 00:00:00
1 2016-03-05 03:30:00 395 days 01:20:00
2 2017-04-06 04:40:00 792 days 02:30:00
3 2018-05-07 05:50:00 1188 days 03:40:00
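Folding that into the loop from the question, a sketch (it assumes each file's Time column parses with pd.to_datetime, and reuses the question's folder, get_sn and list_files_log):
import os
import pandas as pd

output = pd.DataFrame()
for name in list_files_log:
    with folder.get_download_stream(name) as f:
        try:
            tmp = pd.read_csv(f)
            tmp["sn"] = get_sn(name)
            tmp["filename"] = os.path.basename(name)
            # duration relative to the first row of this particular file
            tmp["Time"] = pd.to_datetime(tmp["Time"])
            tmp["Duration"] = tmp["Time"] - tmp["Time"].iloc[0]
            output = output.append(tmp)
        except Exception:
            pass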

Pandas Series of hour values to Series of dates

I have a time series covering January of 1979 with 6-hour time deltas. The time format is a continuous hour count:
1
7
13
18
25
31
.
.
.
739
Is it possible to convert these ints to dates? For instance:
1979/01/01 - 1:00
1979/01/01 - 7:00
1979/01/01 - 13:00
1979/01/01 - 18:00
1979/01/02 - 1:00
Thank you so much!
Setup
df = pd.DataFrame({'hour': [1,7,13,18,25,31]})
Use pd.to_datetime with the unit flag, and set the origin flag to the beginning of your desired year.
pd.to_datetime(df.hour, unit='h', origin='1979-01-01')
0 1979-01-01 01:00:00
1 1979-01-01 07:00:00
2 1979-01-01 13:00:00
3 1979-01-01 18:00:00
4 1979-01-02 01:00:00
5 1979-01-02 07:00:00
Name: hour, dtype: datetime64[ns]
Here is another way:
import pandas as pd

s = pd.Series([1, 7, 13])
# hours -> nanoseconds, added to the nanosecond value of 1979-01-01
s = pd.to_datetime(s * 1e9 * 60 * 60 + pd.Timestamp(1979, 1, 1).value)
print(s)
Returns:
0 1979-01-01 01:00:00
1 1979-01-01 07:00:00
2 1979-01-01 13:00:00
dtype: datetime64[ns]
Could also just do this:
from datetime import datetime, timedelta
s = pd.Series([1,7,13,18,25])
s = s.apply(lambda h: datetime(1979, 1, 1) + timedelta(hours=h))
print(s)
Returns:
0 1979-01-01 01:00:00
1 1979-01-01 07:00:00
2 1979-01-01 13:00:00
3 1979-01-01 18:00:00
4 1979-01-02 01:00:00
dtype: datetime64[ns]
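A vectorized variant of the same idea, as a sketch using pd.to_timedelta:
import pandas as pd

s = pd.Series([1, 7, 13, 18, 25, 31])
# interpret each value as an hour offset from the start of 1979
out = pd.Timestamp('1979-01-01') + pd.to_timedelta(s, unit='h')
print(out)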

Optimize code to find the median of values of past 30 day for each row in a DataFrame

I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days. But there are less than 5 data points, then return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = # of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) &
                              (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return(np.nan)
        if method == 'median':
            return(np.median(pastVar.values))
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return(df)
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and possibly more than one data point per day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, this is my current code's running speed:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run this on a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time-aware rolling, which can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling('30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can also try rolling_median, an O(N log window) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
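Note that the top-level pd.rolling_median function has since been removed from pandas; a sketch of the equivalent call through the .rolling() accessor (which counts the last 30 rows rather than a 30-day time window):
# modern equivalent of the removed pd.rolling_median; window counts rows, not days
df['var'].rolling(window=30, min_periods=5).median()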
