I want to parse my dt0 column, which is in the format "1/1/2020 12:00:00 PM" and is incremented every 0.1 s, with "to_datetime" in Python.
I tried the following, but it gives the error "Out of bounds nanosecond timestamp: 1-01-01 00:00:00":
df.dt0 = pd.to_datetime(df.dt0)
Is it because my intervals are too small? Can someone recommend a better/working solution?
Here are the columns that I want to convert from my table:
No.  Date       Time
1    1/15/2020  12:00:00 PM
2    1/15/2020  12:00:00 PM
3    1/15/2020  12:00:00 PM
4    1/15/2020  12:00:00 PM
5    1/15/2020  12:00:00 PM
6    1/15/2020  12:00:00 PM
7    1/15/2020  12:00:00 PM
8    1/15/2020  12:00:00 PM
9    1/15/2020  12:00:00 PM
10   1/15/2020  12:00:00 PM
11   1/15/2020  12:00:00 PM
12   1/15/2020  12:00:00 PM
First, join both columns and convert to datetimes:
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
format='%m/%d/%Y %I:%M:%S %p')
print (df)
No. Date Time datetime
0 1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
1 2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
2 3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
3 4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
4 5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
5 6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
6 7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
7 8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
8 9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
9 10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
10 11 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
11 12 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
Then add a counter of the duplicated datetimes with GroupBy.cumcount, multiply it by 100, and convert to timedeltas with to_timedelta:
df['datetime'] += pd.to_timedelta(df.groupby('datetime').cumcount().mul(100), unit='ms')
print (df)
No. Date Time datetime
0 1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.000
1 2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.100
2 3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.200
3 4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.300
4 5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.400
5 6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.500
6 7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.600
7 8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.700
8 9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.800
9 10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00.900
10 11 1/15/2020 12:00:00 PM 2020-01-15 12:00:01.000
11 12 1/15/2020 12:00:00 PM 2020-01-15 12:00:01.100
In order to get your target, use the following line:
df['dt0'] = df.apply(lambda x: pd.to_datetime(x.Date + ' ' + x.Time), axis=1)
Verification:
df
Date Time dt0
No.
1 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
2 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
3 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
4 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
5 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
6 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
7 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
8 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
9 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
10 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
11 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
12 1/15/2020 12:00:00 PM 2020-01-15 12:00:00
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 1 to 12
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 12 non-null object
1 Time 12 non-null object
2 dt0 12 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 384.0+ bytes
As seen, dt0 is datetime64[ns].
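Note this stops at whole seconds; if the 0.1 s increments from the question are also needed, one hedged sketch (assuming rows are already sorted and evenly spaced in time) adds a running 100 ms offset by row position:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['1/15/2020'] * 3,
                   'Time': ['12:00:00 PM'] * 3})
df['dt0'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                           format='%m/%d/%Y %I:%M:%S %p')

# row 0 gets +0 ms, row 1 gets +100 ms, row 2 gets +200 ms, ...
df['dt0'] += pd.to_timedelta(np.arange(len(df)) * 100, unit='ms')
```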
I'm trying to measure the difference between timestamps using certain conditions. Using the data below, for each unique ID I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust as each unique ID will have different combinations. For ex, A,B,C,D - A,B,D - A,D etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','D','A','D','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10   1 days 01:00:00
20   0 days 01:30:00
30 NaT
dtype: timedelta64[ns]
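To attach that per-ID result back onto the original frame, one possible sketch (hedged, not part of the original answer) maps it onto only the Item == 'D' rows:

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 10, 20, 20, 30],
                   'Start Time': ['2019-08-02 09:00:00', '2019-08-03 10:50:00',
                                  '2019-08-05 16:00:00', '2019-08-04 08:00:00',
                                  '2019-08-04 15:30:00', '2019-08-06 11:00:00'],
                   'End Time': ['2019-08-04 15:00:00', '2019-08-04 16:00:00',
                                '2019-08-05 16:00:00', '2019-08-04 14:00:00',
                                '2019-08-05 20:30:00', '2019-08-07 10:00:00'],
                   'Item': ['A', 'B', 'D', 'A', 'D', 'A']})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])

df2 = df.set_index('ID')
per_id = df2.query('Item == "D"')['Start Time'] - df2.query('Item == "A"')['End Time']

# write the per-ID result back, but only onto the Item == 'D' rows
df['diff'] = df['ID'].map(per_id).where(df['Item'] == 'D')
```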
Older answer:
The issue is your fillna, you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
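If a "-" placeholder is still wanted for display, one hedged workaround (a sketch; the result is then a plain string column, no longer timedeltas) is to stringify first:

```python
import pandas as pd

# a small timedelta series with a missing value, standing in for the 'diff' column
s = pd.Series(pd.to_timedelta(['1 days 01:00:00', pd.NaT]))

# stringify, then replace the missing markers -- for display only
display_col = s.astype(str).replace('NaT', '-')
```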
If I understand correctly, use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
I have a dataframe with a column which has time data in the format HH:MM:SS. Sample data is shown below for reference:
Time
09:25:03
09:28:40
09:36:12
09:36:14
09:41:10
09:51:00
09:58:48
10:00:11
10:00:17
10:21:44
10:21:53
10:32:58
11:08:59
11:45:55
11:49:14
12:18:54
12:21:22
13:05:47
13:19:37
13:19:57
13:25:22
14:21:10
I want to get the nearest time before the current time that is divisible by 5 minutes. I want the output like below:
Time Nearest_Time
09:25:03 09:25:00
09:28:40 09:25:00
09:36:12 09:35:00
09:36:14 09:35:00
09:41:10 09:40:00
09:51:00 09:50:00
09:58:48 09:55:00
10:00:11 10:00:00
10:00:17 10:00:00
10:21:44 10:20:00
10:21:53 10:20:00
10:32:58 10:30:00
11:08:59 11:05:00
11:45:55 11:45:00
11:49:14 11:45:00
12:18:54 12:15:00
12:21:22 12:20:00
13:05:47 13:05:00
13:19:37 13:15:00
13:19:57 13:15:00
13:25:22 13:25:00
14:21:10 14:20:00
You can use dt.floor, setting the frequency to 5 minutes:
pd.to_datetime(df.Time).dt.floor('5min')
0 2020-02-14 09:25:00
1 2020-02-14 09:25:00
2 2020-02-14 09:35:00
3 2020-02-14 09:35:00
4 2020-02-14 09:40:00
5 2020-02-14 09:50:00
6 2020-02-14 09:55:00
7 2020-02-14 10:00:00
8 2020-02-14 10:00:00
9 2020-02-14 10:20:00
10 2020-02-14 10:20:00
11 2020-02-14 10:30:00
12 2020-02-14 11:05:00
13 2020-02-14 11:45:00
14 2020-02-14 11:45:00
15 2020-02-14 12:15:00
16 2020-02-14 12:20:00
17 2020-02-14 13:05:00
18 2020-02-14 13:15:00
19 2020-02-14 13:15:00
20 2020-02-14 13:25:00
21 2020-02-14 14:20:00
Name: Time, dtype: datetime64[ns]
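If the result should keep the bare HH:MM:SS shape shown in the question, one hedged follow-up (a sketch; .dt.time yields Python time objects, so further datetime arithmetic is lost) is:

```python
import pandas as pd

df = pd.DataFrame({'Time': ['09:25:03', '09:58:48', '14:21:10']})

# floor to 5-minute boundaries, then keep only the time component
df['Nearest_Time'] = pd.to_datetime(df['Time']).dt.floor('5min').dt.time
```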
You could convert Time to timedelta and do normal arithmetic operations:
df['Time'] = pd.to_timedelta(df['Time'])
period = pd.to_timedelta('5min')
df['nearest_past'] = df['Time'] // period * period
# floor also works
# df['nearest_past'] = df['Time'].dt.floor('5min')
Output:
Time nearest_past
0 09:25:03 09:25:00
1 09:28:40 09:25:00
2 09:36:12 09:35:00
3 09:36:14 09:35:00
4 09:41:10 09:40:00
5 09:51:00 09:50:00
6 09:58:48 09:55:00
7 10:00:11 10:00:00
8 10:00:17 10:00:00
9 10:21:44 10:20:00
10 10:21:53 10:20:00
11 10:32:58 10:30:00
12 11:08:59 11:05:00
13 11:45:55 11:45:00
14 11:49:14 11:45:00
15 12:18:54 12:15:00
16 12:21:22 12:20:00
17 13:05:47 13:05:00
18 13:19:37 13:15:00
19 13:19:57 13:15:00
20 13:25:22 13:25:00
21 14:21:10 14:20:00
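As a quick sanity check, a small sketch (using times from the question) confirming the floor-division trick agrees with the built-in floor:

```python
import pandas as pd

t = pd.to_timedelta(['09:25:03', '09:58:48', '14:21:10'])
period = pd.to_timedelta('5min')

by_division = t // period * period   # whole number of 5-minute periods, scaled back up
by_floor = t.floor('5min')           # TimedeltaIndex supports floor directly
```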
I'm building a basic rota/schedule for staff and have a DataFrame from a MySQL cursor which gives a list of IDs, dates and classes:
id the_date class
0 195593 2017-09-12 14:00:00 3
1 193972 2017-09-13 09:15:00 2
2 195594 2017-09-13 14:00:00 3
3 195595 2017-09-15 14:00:00 3
4 193947 2017-09-16 17:30:00 3
5 195627 2017-09-17 08:00:00 2
6 193948 2017-09-19 11:30:00 2
7 195628 2017-09-21 08:00:00 2
8 193949 2017-09-21 11:30:00 2
9 195629 2017-09-24 08:00:00 2
10 193950 2017-09-24 10:00:00 2
11 193951 2017-09-27 11:30:00 2
12 195644 2017-09-28 06:00:00 1
13 194400 2017-09-28 08:00:00 1
14 195630 2017-09-28 08:00:00 2
15 193952 2017-09-29 11:30:00 2
16 195631 2017-10-01 08:00:00 2
17 194401 2017-10-06 08:00:00 1
18 195645 2017-10-06 10:00:00 1
19 195632 2017-10-07 13:30:00 3
If the class == 1, I need that instance duplicated 5 times.
first_class = df[df['class'] == 1]
non_first_class = df[df['class'] != 1]
first_class_replicated = pd.concat([first_class]*5, ignore_index=True).sort_values(['the_date'])
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-28 06:00:00 1
4 195644 2017-09-28 06:00:00 1
12 195644 2017-09-28 06:00:00 1
8 195644 2017-09-28 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-28 08:00:00 1
9 194400 2017-09-28 08:00:00 1
5 194400 2017-09-28 08:00:00 1
1 194400 2017-09-28 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-06 08:00:00 1
10 194401 2017-10-06 08:00:00 1
14 194401 2017-10-06 08:00:00 1
2 194401 2017-10-06 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-06 10:00:00 1
15 195645 2017-10-06 10:00:00 1
7 195645 2017-10-06 10:00:00 1
19 195645 2017-10-06 10:00:00 1
I then merge non_first_class and first_class_replicated. Before that though, I need the dates in first_class_replicated to increment by one day, grouped by id. Below is how I need it to look. Is there an elegant Pandas solution to this, or should I be looking at looping over a groupby series to modify the dates?
Desired:
id
0 195644 2017-09-28 6:00:00
16 195644 2017-09-29 6:00:00
4 195644 2017-09-30 6:00:00
12 195644 2017-10-01 6:00:00
8 195644 2017-10-02 6:00:00
17 194400 2017-09-28 8:00:00
13 194400 2017-09-29 8:00:00
9 194400 2017-09-30 8:00:00
5 194400 2017-10-01 8:00:00
1 194400 2017-10-02 8:00:00
6 194401 2017-10-06 8:00:00
18 194401 2017-10-07 8:00:00
10 194401 2017-10-08 8:00:00
14 194401 2017-10-09 8:00:00
2 194401 2017-10-10 8:00:00
11 195645 2017-10-06 10:00:00
3 195645 2017-10-07 10:00:00
15 195645 2017-10-08 10:00:00
7 195645 2017-10-09 10:00:00
19 195645 2017-10-10 10:00:00
You can use cumcount to number the rows within each id group, then convert with to_timedelta and add to the date column:
import numpy as np

#another solution for repeat
first_class_replicated = (first_class.loc[np.repeat(first_class.index, 5)]
                          .sort_values(['the_date']))
df1 = first_class_replicated.groupby('id').cumcount()
first_class_replicated['the_date'] += pd.to_timedelta(df1, unit='D')
print (first_class_replicated)
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-29 06:00:00 1
4 195644 2017-09-30 06:00:00 1
12 195644 2017-10-01 06:00:00 1
8 195644 2017-10-02 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-29 08:00:00 1
9 194400 2017-09-30 08:00:00 1
5 194400 2017-10-01 08:00:00 1
1 194400 2017-10-02 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-07 08:00:00 1
10 194401 2017-10-08 08:00:00 1
14 194401 2017-10-09 08:00:00 1
2 194401 2017-10-10 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-07 10:00:00 1
15 195645 2017-10-08 10:00:00 1
7 195645 2017-10-09 10:00:00 1
19 195645 2017-10-10 10:00:00 1
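To finish the rota, a hedged sketch (using a small two-row stand-in frame, since the full data is not reproduced here) of stitching the replicated and untouched rows back together:

```python
import numpy as np
import pandas as pd

# stand-in for the original frame: one class-1 row, one other row
df = pd.DataFrame({'id': [195644, 193952],
                   'the_date': pd.to_datetime(['2017-09-28 06:00:00',
                                               '2017-09-29 11:30:00']),
                   'class': [1, 2]})

first_class = df[df['class'] == 1]
non_first_class = df[df['class'] != 1]

# replicate class-1 rows 5x and shift each copy by one day per id
first_class_replicated = first_class.loc[np.repeat(first_class.index, 5)].copy()
offsets = first_class_replicated.groupby('id').cumcount()
first_class_replicated['the_date'] += pd.to_timedelta(offsets, unit='D')

# stitch the two pieces back together and sort chronologically
rota = (pd.concat([first_class_replicated, non_first_class])
          .sort_values('the_date')
          .reset_index(drop=True))
```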
Data:
ohlc_dict = {
'Open':'first',
'High':'max',
'Low':'min',
'Last': 'last',
'Volume': 'sum'}
data['hod'] = [r.hour for r in data.index]
data.head(10)
Out[61]:
Open High Low Last Volume hod dow
Timestamp
2014-05-08 08:00:00 136.230 136.290 136.190 136.290 7077 8 Thursday
2014-05-08 08:15:00 136.290 136.300 136.240 136.250 3881 8 Thursday
2014-05-08 08:30:00 136.240 136.270 136.230 136.230 2540 8 Thursday
2014-05-08 08:45:00 136.230 136.260 136.230 136.250 2293 8 Thursday
2014-05-08 09:00:00 136.250 136.360 136.240 136.360 15014 9 Thursday
2014-05-08 09:15:00 136.350 136.360 136.260 136.270 11697 9 Thursday
2014-05-08 09:30:00 136.270 136.270 136.190 136.200 15600 9 Thursday
2014-05-08 09:45:00 136.200 136.270 136.200 136.240 9025 9 Thursday
2014-05-08 10:00:00 136.240 136.270 136.240 136.260 7128 10 Thursday
2014-05-08 10:15:00 136.250 136.260 136.200 136.200 6100 10 Thursday
Question:
Both of the following attempt to change the timeframe from 15 minutes to a 1-hour interval:
Approach 1:
data['2016'].groupby('hod').Volume.mean().head()
hod
8 8452.597
9 16485.398
10 15619.626
11 14132.666
12 11470.058
Name: Volume, dtype: float64
Approach 2:
df_h1 = data.resample('1h').agg(ohlc_dict).dropna()
df_h1['hod'] = [r.hour for r in df_h1.index]
df_h1['2016'].groupby('hod')['Volume'].mean()
Timestamp
2014-05-08 08:00:00 15791.000
2014-05-08 09:00:00 51336.000
2014-05-08 10:00:00 28855.000
2014-05-08 11:00:00 56543.000
2014-05-08 12:00:00 25249.000
Name: Volume, dtype: float64
Only approach 2 gives what appears to be accurate output of the volume data.
How do I change approach 1 to give me the same Volume output as approach 2 but using groupby instead of resample? I'm not sure how to use the ohlc_dict with approach 1 and feel this is what is required.
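One possible direction (a sketch on stand-in data, since the original `data` frame isn't reproduced here): keep groupby but pass pd.Grouper(freq='1h'), which buckets the index hourly just like resample, and only then average Volume by hour of day:

```python
import numpy as np
import pandas as pd

# small stand-in for the 15-minute bars
idx = pd.date_range('2016-05-02 08:00', periods=8, freq='15min')
data = pd.DataFrame({'Open': np.arange(8.0), 'High': np.arange(8.0) + 1,
                     'Low': np.arange(8.0) - 1, 'Last': np.arange(8.0),
                     'Volume': [100, 200, 300, 400, 10, 20, 30, 40]}, index=idx)

ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min',
             'Last': 'last', 'Volume': 'sum'}

# hourly bars via groupby + Grouper instead of resample
df_h1 = data.groupby(pd.Grouper(freq='1h')).agg(ohlc_dict).dropna()

# then average the hourly Volume by hour of day
hourly_mean = df_h1.groupby(df_h1.index.hour)['Volume'].mean()
```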
I'm looking for a way to create a datetimeindex in pandas. My data looks as follows:
Date Time AAA
0 06/17/2016 03:00:00 PM 19.13
1 06/17/2016 02:00:00 PM 19.13
2 06/17/2016 01:00:00 PM 19.26
3 06/17/2016 12:00:00 AM 19.28
4 06/17/2016 11:00:00 AM 19.28
The result I want to obtain is:
AAA
Date
2016-06-17 15:00:00 19.16
2016-06-17 14:00:00 19.14
2016-06-17 13:00:00 19.18
2016-06-17 12:00:00 19.27
2016-06-17 11:00:00 19.27
I am not sure how to do this efficiently, since my Time column uses the 12-hour clock format.
You can do it using to_datetime as follows:
df
Out[38]:
Date Time AAA
0 06/17/2016 03:00:00 PM 19.13
1 06/17/2016 02:00:00 PM 19.13
2 06/17/2016 01:00:00 PM 19.26
3 06/17/2016 12:00:00 AM 19.28
4 06/17/2016 11:00:00 AM 19.28
In [39]: df['Date']=pd.to_datetime(df['Date']+ ' '+df['Time'])
In [40]: df
Out[40]:
Date Time AAA
0 2016-06-17 15:00:00 03:00:00 PM 19.13
1 2016-06-17 14:00:00 02:00:00 PM 19.13
2 2016-06-17 13:00:00 01:00:00 PM 19.26
3 2016-06-17 00:00:00 12:00:00 AM 19.28
4 2016-06-17 11:00:00 11:00:00 AM 19.28
In [40]: df=df.drop(['Time','Date'],axis=1).set_index(df['Date'])
In [41]: df
Out[41]:
AAA
Date
2016-06-17 15:00:00 19.13
2016-06-17 14:00:00 19.13
2016-06-17 13:00:00 19.26
2016-06-17 00:00:00 19.28
2016-06-17 11:00:00 19.28
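If the default parser ever guesses the 12-hour fields wrong, a hedged variant of the same idea pins the format explicitly (%I is the 12-hour clock, %p the AM/PM marker):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['06/17/2016', '06/17/2016'],
                   'Time': ['03:00:00 PM', '12:00:00 AM'],
                   'AAA': [19.13, 19.28]})

# explicit format avoids any ambiguity in the 12-hour clock fields
df['Date'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                            format='%m/%d/%Y %I:%M:%S %p')
out = df.drop('Time', axis=1).set_index('Date')
```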
Using date objects as opposed to parsing strings
df = pd.DataFrame([
['06/17/2016', '03:00:00 PM', 19.13],
['06/17/2016', '02:00:00 PM', 19.13],
['06/17/2016', '01:00:00 PM', 19.26],
['06/17/2016', '12:00:00 AM', 19.28],
['06/17/2016', '11:00:00 AM', 19.28],
],
columns=['Date', 'Time', 'AAA'],
)
df.Date = pd.to_datetime(df.Date)
df.Time = pd.to_datetime(df.Time) - pd.DatetimeIndex(df.Time).date
df.set_index(df.Date + df.Time)[['AAA']]
AAA
2016-06-17 15:00:00 19.13
2016-06-17 14:00:00 19.13
2016-06-17 13:00:00 19.26
2016-06-17 00:00:00 19.28
2016-06-17 11:00:00 19.28