Exclude time zone information in datetime object - Python

The datetime columns are available as object dtype. I convert them to the datetime type with:
df[['x','y']] = df[['x','y']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S %Z', errors='coerce')
Here I want to exclude the time zone information ('UTC') from the result; the output data type should still be datetime.

From your code sample, I see that you want to process multiple columns at once. In this case, you can exclude the time zone information by using a lambda function to apply pd.to_datetime and extract the time-zone-naive timestamps with the values property. Here is an example:
import pandas as pd  # v 1.1.3

df = pd.DataFrame(dict(x=['2020-12-30 12:00:00 UTC', '2020-12-30 13:00:00 UTC',
                          '2020-12-30 14:00:00 UTC', '2020-12-30 15:00:00 UTC',
                          '2020-12-30 16:00:00 UTC'],
                       y=['2020-12-31 01:00:00 UTC', '2020-12-31 04:00:00 UTC',
                          '2020-12-31 02:00:00 UTC', '2020-12-31 05:00:00 UTC',
                          '2020-12-31 03:00:00 UTC']))
df[['x','y']] = df[['x','y']].apply(lambda x: pd.to_datetime(x).values)
print(df[['x','y']])
# x y
# 0 2020-12-30 12:00:00 2020-12-31 01:00:00
# 1 2020-12-30 13:00:00 2020-12-31 04:00:00
# 2 2020-12-30 14:00:00 2020-12-31 02:00:00
# 3 2020-12-30 15:00:00 2020-12-31 05:00:00
# 4 2020-12-30 16:00:00 2020-12-31 03:00:00
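An alternative sketch, assuming all timestamps really are UTC: parse with utc=True and strip the zone with dt.tz_localize(None), which also leaves naive datetime64[ns] columns.

```python
import pandas as pd

df = pd.DataFrame(dict(x=['2020-12-30 12:00:00 UTC', '2020-12-30 13:00:00 UTC'],
                       y=['2020-12-31 01:00:00 UTC', '2020-12-31 04:00:00 UTC']))

# Parse as tz-aware UTC, then drop the time zone, leaving naive datetime64[ns]
for col in ['x', 'y']:
    df[col] = pd.to_datetime(df[col], utc=True).dt.tz_localize(None)

print(df.dtypes)  # both columns: datetime64[ns]
```

tz_localize(None) keeps the UTC wall-clock time; use dt.tz_convert first if you want a different zone's local time instead.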

Related

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to half-hourly data, imputing each new position with the mean of the previous and following values (or any similar interpolation), that is, for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in pandas. With it you set the frequency to down-/upsample to and then choose what to do with the bins (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work, your datetime column has to be of datetime dtype and set as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
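Put together as a self-contained sketch (using only the first few rows of the hourly data; '30min' is the alias spelling in newer pandas, where 'T' is deprecated):

```python
import pandas as pd

df = pd.DataFrame({'datetime': ['2015-01-01 00:00:00', '2015-01-01 01:00:00',
                                '2015-01-01 02:00:00'],
                   'temp': [11.22, 11.32, 11.30]})

df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)

# Upsample the hourly data to 30-minute frequency; the new rows are filled
# by linear interpolation between the neighbouring hourly values
half_hourly = df.resample('30min').interpolate()
print(half_hourly)
```

The 00:30 row comes out as 11.27, the midpoint of 11.22 and 11.32, matching the output above.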

Change time format from 1/1/1900 10:00:00 PM to 10:00 PM

I'm currently building a program to track login/logout times. The data is exported as strings like "5:00AM", and I use the following code to convert the columns from string to datetime64[ns]:
df = pd.DataFrame({'LoginTime': ["10:00PM", "5:00AM", "11:00PM", "7:00AM"],
                   'Logout Time': ["6:00AM", "2:00PM", "5:00AM", "5:00PM"]})
for c in df.columns:
    if c == 'LoginTime':
        df['LoginTime'] = pd.to_datetime(df['LoginTime'], format='%I:%M%p')
    elif c == 'Logout Time':
        df['Logout Time'] = pd.to_datetime(df['Logout Time'], format='%I:%M%p')
The output result is the following:
LoginTime Logout Time
0 1900-01-01 22:00:00 1900-01-01 06:00:00
1 1900-01-01 05:00:00 1900-01-01 14:00:00
2 1900-01-01 23:00:00 1900-01-01 05:00:00
3 1900-01-01 07:00:00 1900-01-01 17:00:00
LoginTime datetime64[ns]
Logout Time datetime64[ns]
The code works as expected and converts the strings to a time format. However, I noticed the format is 1/1/1900 10:00:00 PM; I would like to know if there's a way to get only the time, like 10:00:00 PM, without affecting the datetime64[ns] data type, since I have to create validation for the login/logout times.
Thanks in advance
Try using dt.strftime after converting it to datetime format:
df = pd.DataFrame({'LoginTime':['1/1/1900 10:00:00 PM', '1/2/2018 05:00:00 AM']})
LoginTime
0 1/1/1900 10:00:00 PM
1 1/2/2018 05:00:00 AM
df['LoginTime'] = pd.to_datetime(df['LoginTime']).dt.strftime('%I:%M %p')
LoginTime
0 10:00 PM
1 05:00 AM
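One caveat: dt.strftime returns strings, so the column becomes object dtype rather than datetime64[ns]. If the dtype matters for your login/logout validation, one option (a sketch, not the only approach) is to keep the parsed datetime64[ns] columns for comparisons and format only when displaying:

```python
import pandas as pd

df = pd.DataFrame({'LoginTime': ['10:00PM', '5:00AM'],
                   'Logout Time': ['6:00AM', '2:00PM']})

for col in ['LoginTime', 'Logout Time']:
    df[col] = pd.to_datetime(df[col], format='%I:%M%p')

# datetime64[ns] is preserved, so comparisons for validation still work,
# e.g. flagging shifts where the logout time precedes the login time
overnight = df['Logout Time'] < df['LoginTime']

# Format to the '10:00 PM' style only for display; the columns are untouched
print(df['LoginTime'].dt.strftime('%I:%M %p').tolist())
```

This keeps one canonical representation for logic and derives the human-readable one on demand.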

Compute the mean of the interval around the exact hour in a dataframe

I have a dataframe with timestamps as the index; the data has a frequency of 10 minutes.
I can't find a way to compute the mean over the interval from h - 30min to h + 30min, where h ranges over the exact (o'clock) hours.
In[1]: date_index = pd.date_range('2015-12-01 00:00:00', freq='10Min', periods=70)
df = pd.DataFrame(np.random.rand(70), index= date_index, columns=['Data'])
df.head(10)
Out[1]: Data
2015-12-01 00:00:00 0.653885
2015-12-01 00:10:00 0.605046
2015-12-01 00:20:00 0.438547
2015-12-01 00:30:00 0.062426
2015-12-01 00:40:00 0.415468
2015-12-01 00:50:00 0.458047
2015-12-01 01:00:00 0.523140
2015-12-01 01:10:00 0.736519
2015-12-01 01:20:00 0.934904
2015-12-01 01:30:00 0.799523
I was thinking of using a for loop over df.index, looking for every exact hour and then computing the mean over the interval around that hour, but I can't find an easy way of indexing the data around the hour. Is there an easy way of doing this in pandas? Thanks.
Not sure about the exact expected output here, but you can first resample the data every half hour and then take a rolling mean of three bins to get the mean over a 1.5-hour period.
df.resample('30T').mean().rolling(3, center = True).mean()
Data
2015-12-01 00:00:00 NaN
2015-12-01 00:30:00 0.419649
2015-12-01 01:00:00 0.427544
2015-12-01 01:30:00 0.414868
2015-12-01 02:00:00 0.545400
2015-12-01 02:30:00 0.643669
2015-12-01 03:00:00 0.626265
2015-12-01 03:30:00 0.581142
2015-12-01 04:00:00 0.508442
2015-12-01 04:30:00 0.511635
2015-12-01 05:00:00 0.452952
2015-12-01 05:30:00 0.473471
2015-12-01 06:00:00 0.400974
2015-12-01 06:30:00 0.358676
2015-12-01 07:00:00 0.244290
2015-12-01 07:30:00 0.343688
2015-12-01 08:00:00 0.456954
2015-12-01 08:30:00 0.548263
2015-12-01 09:00:00 0.431159
2015-12-01 09:30:00 0.378981
2015-12-01 10:00:00 0.407988
2015-12-01 10:30:00 0.496860
2015-12-01 11:00:00 0.508232
2015-12-01 11:30:00 NaN
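An alternative that yields exactly the h - 30min to h + 30min window (a sketch; the window is closed on the left, open on the right): shift the index forward by 30 minutes and resample hourly, so the bin labelled h collects the original samples from h - 30min up to, but excluding, h + 30min. Illustrated with a deterministic series so the means are easy to check:

```python
import numpy as np
import pandas as pd

date_index = pd.date_range('2015-12-01 00:00:00', freq='10min', periods=70)
df = pd.DataFrame(np.arange(70, dtype=float), index=date_index, columns=['Data'])

# Shifting timestamps by +30min makes the hourly bin labelled h cover
# the original interval [h - 30min, h + 30min)
shifted = df.copy()
shifted.index = shifted.index + pd.Timedelta('30min')
centred = shifted.resample('60min').mean()
print(centred.head())
```

For instance, the 01:00 row averages the original samples from 00:30 through 01:20.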

subselect pandas dataframe based on index

I have a dataframe and I want to remove certain specific repeating rows:
import numpy as np
import pandas as pd
nrows = 144
df = pd.DataFrame(np.random.rand(nrows,), pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'), columns=['A'])
The dataframe is continuous in time, providing data every two hours ad infinitum, but I've chosen to show only a subset for brevity. I want to remove the data every 72 hours at 8:00, starting on Mondays, to coincide with an external event that alters the data. For this snapshot of data I want to remove the rows indexed at 2016-02-08 08:00, 2016-02-11 08:00, +3D, etc.
Is there a simple way to do this?
IIUC you could do this:
In [18]:
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
start
Out[18]:
Timestamp('2016-02-08 08:00:00')
In [45]:
df.loc[df.index.difference(pd.date_range(start, end=df.index[-1], freq='3D'))]
Out[45]:
A
2016-02-08 00:00:00 0.323742
2016-02-08 02:00:00 0.962252
2016-02-08 04:00:00 0.706537
2016-02-08 06:00:00 0.561446
2016-02-08 10:00:00 0.225042
2016-02-08 12:00:00 0.746258
2016-02-08 14:00:00 0.167950
2016-02-08 16:00:00 0.199958
2016-02-08 18:00:00 0.808286
2016-02-08 20:00:00 0.288797
2016-02-08 22:00:00 0.508109
2016-02-09 00:00:00 0.980772
2016-02-09 02:00:00 0.995731
2016-02-09 04:00:00 0.742751
2016-02-09 06:00:00 0.392247
2016-02-09 08:00:00 0.460511
2016-02-09 10:00:00 0.083660
2016-02-09 12:00:00 0.273620
2016-02-09 14:00:00 0.791506
2016-02-09 16:00:00 0.440630
2016-02-09 18:00:00 0.326418
2016-02-09 20:00:00 0.790780
2016-02-09 22:00:00 0.521131
2016-02-10 00:00:00 0.219315
2016-02-10 02:00:00 0.016625
2016-02-10 04:00:00 0.958566
2016-02-10 06:00:00 0.405643
2016-02-10 08:00:00 0.958025
2016-02-10 10:00:00 0.786663
2016-02-10 12:00:00 0.589064
... ...
2016-02-17 12:00:00 0.360848
2016-02-17 14:00:00 0.757499
2016-02-17 16:00:00 0.391574
2016-02-17 18:00:00 0.062812
2016-02-17 20:00:00 0.308282
2016-02-17 22:00:00 0.251520
2016-02-18 00:00:00 0.832871
2016-02-18 02:00:00 0.387108
2016-02-18 04:00:00 0.070969
2016-02-18 06:00:00 0.298831
2016-02-18 08:00:00 0.878526
2016-02-18 10:00:00 0.979233
2016-02-18 12:00:00 0.386620
2016-02-18 14:00:00 0.420962
2016-02-18 16:00:00 0.238879
2016-02-18 18:00:00 0.124069
2016-02-18 20:00:00 0.985828
2016-02-18 22:00:00 0.585278
2016-02-19 00:00:00 0.409226
2016-02-19 02:00:00 0.093945
2016-02-19 04:00:00 0.389450
2016-02-19 06:00:00 0.378091
2016-02-19 08:00:00 0.874232
2016-02-19 10:00:00 0.527629
2016-02-19 12:00:00 0.490236
2016-02-19 14:00:00 0.509008
2016-02-19 16:00:00 0.097061
2016-02-19 18:00:00 0.111626
2016-02-19 20:00:00 0.877099
2016-02-19 22:00:00 0.796201
[140 rows x 1 columns]
So this determines the start by comparing dayofweek and hour and taking the first matching index value; we then generate the timestamps to remove with date_range, call difference on the index to drop them, and pass the result to loc.
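A compact equivalent sketch, under the same assumption that every generated 08:00 timestamp actually occurs in the index, drops the generated timestamps directly:

```python
import numpy as np
import pandas as pd

nrows = 144
df = pd.DataFrame(np.random.rand(nrows),
                  pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'),
                  columns=['A'])

# First Monday 08:00 in the index, then every 72 hours from there
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
to_drop = pd.date_range(start, end=df.index[-1], freq='72h')
trimmed = df.drop(to_drop, errors='ignore')
print(len(df), len(trimmed))  # 144 140
```

errors='ignore' makes drop tolerate generated timestamps that happen to fall outside the index, which matters if the data has gaps.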

Issue with dates in statsmodels and pandas

I have a pandas dataframe object that is indexed by a timestamp. I am trying to fit an AR model from statsmodels with the following code
df = pd.read_csv('xxx')
model=tsa.ar_model.AR(df['price'])
However I am getting the error
ValueError: dates must be of type datetime
But as far as I know the dates are of the correct format. It might also help if I show the result of printing df['price'] which is
timestamp
1976-01-01 12:00:00 96541
1976-02-01 12:00:00 90103
1976-03-01 12:00:00 96541
1976-04-01 12:00:00 108112
1976-05-01 12:00:00 115855
1976-06-01 12:00:00 119712
1976-07-01 12:00:00 115855
1976-08-01 12:00:00 114550
1976-09-01 12:00:00 118407
1976-10-01 12:00:00 128702
1976-11-01 12:00:00 115855
1976-12-01 12:00:00 102979
1977-01-01 12:00:00 111969
1977-02-01 12:00:00 106836
1977-03-01 12:00:00 115594
...
2012-05-01 12:00:00 257375
2012-06-01 12:00:00 250850
2012-07-01 12:00:00 246500
2012-08-01 12:00:00 242150
2012-09-01 12:00:00 237452
2012-10-01 12:00:00 230724
2012-11-01 12:00:00 218950
2012-12-01 12:00:00 210250
2013-01-01 12:00:00 210250
2013-02-01 12:00:00 203000
2013-03-01 12:00:00 218950
2013-04-01 12:00:00 232000
2013-05-01 12:00:00 232000
2013-06-01 12:00:00 226548
2013-07-01 12:00:00 226548
As df.index.dtype shows, although you regard your timestamps as datetimes, the index is of object dtype. You can easily convert it to datetime with
df.index = pd.to_datetime(df.index)
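A minimal sketch of the fix using pandas only (the statsmodels call is then expected to accept the series; alternatively, read_csv with parse_dates and index_col would produce a DatetimeIndex directly):

```python
import pandas as pd

# An index read from CSV often arrives as plain strings (object dtype)
df = pd.DataFrame({'price': [96541, 90103, 96541]},
                  index=['1976-01-01 12:00:00', '1976-02-01 12:00:00',
                         '1976-03-01 12:00:00'])
print(df.index.dtype)  # object

df.index = pd.to_datetime(df.index)
print(df.index.dtype)  # datetime64[ns]
# Now e.g. tsa.ar_model.AR(df['price']) should no longer raise
# "ValueError: dates must be of type datetime"
```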
