Issue with dates in statsmodels and pandas

I have a pandas dataframe object that is indexed by a timestamp. I am trying to fit an AR model from statsmodels with the following code
df = pd.read_csv('xxx')
model=tsa.ar_model.AR(df['price'])
However I am getting the error
ValueError: dates must be of type datetime
But as far as I know the dates are in the correct format. It might also help if I show the result of printing df['price'], which is:
timestamp
1976-01-01 12:00:00 96541
1976-02-01 12:00:00 90103
1976-03-01 12:00:00 96541
1976-04-01 12:00:00 108112
1976-05-01 12:00:00 115855
1976-06-01 12:00:00 119712
1976-07-01 12:00:00 115855
1976-08-01 12:00:00 114550
1976-09-01 12:00:00 118407
1976-10-01 12:00:00 128702
1976-11-01 12:00:00 115855
1976-12-01 12:00:00 102979
1977-01-01 12:00:00 111969
1977-02-01 12:00:00 106836
1977-03-01 12:00:00 115594
...
2012-05-01 12:00:00 257375
2012-06-01 12:00:00 250850
2012-07-01 12:00:00 246500
2012-08-01 12:00:00 242150
2012-09-01 12:00:00 237452
2012-10-01 12:00:00 230724
2012-11-01 12:00:00 218950
2012-12-01 12:00:00 210250
2013-01-01 12:00:00 210250
2013-02-01 12:00:00 203000
2013-03-01 12:00:00 218950
2013-04-01 12:00:00 232000
2013-05-01 12:00:00 232000
2013-06-01 12:00:00 226548
2013-07-01 12:00:00 226548

As df.index.dtype shows, even though you regard your timestamp as a datetime, it is actually an object (string) index. You can easily convert it to datetime with
df.index = pd.to_datetime(df.index)
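A minimal sketch of the fix, using made-up CSV data in place of the asker's file:

```python
import io
import pandas as pd

# Hypothetical CSV standing in for the asker's 'xxx' file
csv = io.StringIO("""timestamp,price
1976-01-01 12:00:00,96541
1976-02-01 12:00:00,90103
1976-03-01 12:00:00,96541
""")

df = pd.read_csv(csv, index_col='timestamp')
print(df.index.dtype)   # object: the dates were read as plain strings

df.index = pd.to_datetime(df.index)
print(df.index.dtype)   # datetime64[ns]
```

Alternatively, `pd.read_csv('xxx', index_col='timestamp', parse_dates=True)` parses the index up front, so no separate conversion step is needed.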

How to calculate daily average from ERA5 hourly netCDF data?

Hi dear all,
I do apologize for repeating the question. I have downloaded and merged the ERA5 hourly dew-point temperature data (d2m_wb.nc) from the Copernicus web platform. Now, I want to calculate the daily mean from the hourly d2m_wb.nc data. The timestamps are 00, 01, 02...23. The ECMWF provided an example for the calculation of daily total precipitation (https://confluence.ecmwf.int/display/CKB/ERA5%3A+How+to+calculate+daily+total+precipitation). It said to cover total precipitation for 1st January 2017, we need two days of data.
(a) 1st January 2017 time = 01 - 23 will give you total precipitation data to cover 00 - 23 UTC for 1st January 2017
(b) 2nd January 2017 time = 00 will give you total precipitation data to cover 23 - 24 UTC for 1st January 2017
That means I need to shift the timestamps by -1 hour to account for step (b). Accordingly, I did this using Climate Data Operators (CDO):
cdo daymean -shifttime,-1hour in.nc out.nc
and got the following result.
cdo sinfo d2m_wb.nc
File format : NetCDF2
-1 : Institut Source T Steptype Levels Num Points Num Dtype : Parameter ID
1 : unknown unknown v instant 1 1 475 1 F64 : -1
Grid coordinates :
1 : lonlat : points=475 (19x25)
lon : 85.5 to 90 by 0.25 degrees_east
lat : 21.5 to 27.5 by 0.25 degrees_north
Vertical coordinates :
1 : surface : levels=1
Time coordinate : 25904 steps
RefTime = 1900-01-01 00:00:00 Units = hours Calendar = gregorian Bounds = true
YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss
1949-12-31 23:00:00 1950-01-01 11:00:00 1950-01-02 11:00:00 1950-01-03 11:00:00
1950-01-04 11:00:00 1950-01-05 11:00:00 1950-01-06 11:00:00 1950-01-07 11:00:00
1950-01-08 11:00:00 1950-01-09 11:00:00 1950-01-10 11:00:00 1950-01-11 11:00:00
1950-01-12 11:00:00 1950-01-13 11:00:00 1950-01-14 11:00:00 1950-01-15 11:00:00
1950-01-16 11:00:00 1950-01-17 11:00:00 1950-01-18 11:00:00 1950-01-19 11:00:00
1950-01-20 11:00:00 1950-01-21 11:00:00 1950-01-22 11:00:00 1950-01-23 11:00:00
1950-01-24 11:00:00 1950-01-25 11:00:00 1950-01-26 11:00:00 1950-01-27 11:00:00
1950-01-28 11:00:00 1950-01-29 11:00:00 1950-01-30 11:00:00 1950-01-31 11:00:00
1950-02-01 11:00:00 1950-02-02 11:00:00 1950-02-03 11:00:00 1950-02-04 11:00:00
1950-02-05 11:00:00 1950-02-06 11:00:00 1950-02-07 11:00:00 1950-02-08 11:00:00
1950-02-09 11:00:00 1950-02-10 11:00:00 1950-02-11 11:00:00 1950-02-12 11:00:00
1950-02-13 11:00:00 1950-02-14 11:00:00 1950-02-15 11:00:00 1950-02-16 11:00:00
1950-02-17 11:00:00 1950-02-18 11:00:00 1950-02-19 11:00:00 1950-02-20 11:00:00
1950-02-21 11:00:00 1950-02-22 11:00:00 1950-02-23 11:00:00 1950-02-24 11:00:00
1950-02-25 11:00:00 1950-02-26 11:00:00 1950-02-27 11:00:00 1950-02-28 11:00:00
................................................................................
................................................................................
................................................................................
.................
2020-10-03 11:00:00 2020-10-04 11:00:00 2020-10-05 11:00:00 2020-10-06 11:00:00
2020-10-07 11:00:00 2020-10-08 11:00:00 2020-10-09 11:00:00 2020-10-10 11:00:00
2020-10-11 11:00:00 2020-10-12 11:00:00 2020-10-13 11:00:00 2020-10-14 11:00:00
2020-10-15 11:00:00 2020-10-16 11:00:00 2020-10-17 11:00:00 2020-10-18 11:00:00
2020-10-19 11:00:00 2020-10-20 11:00:00 2020-10-21 11:00:00 2020-10-22 11:00:00
2020-10-23 11:00:00 2020-10-24 11:00:00 2020-10-25 11:00:00 2020-10-26 11:00:00
2020-10-27 11:00:00 2020-10-28 11:00:00 2020-10-29 11:00:00 2020-10-30 11:00:00
2020-10-31 11:00:00 2020-11-01 11:00:00 2020-11-02 11:00:00 2020-11-03 11:00:00
2020-11-04 11:00:00 2020-11-05 11:00:00 2020-11-06 11:00:00 2020-11-07 11:00:00
2020-11-08 11:00:00 2020-11-09 11:00:00 2020-11-10 11:00:00 2020-11-11 11:00:00
2020-11-12 11:00:00 2020-11-13 11:00:00 2020-11-14 11:00:00 2020-11-15 11:00:00
2020-11-16 11:00:00 2020-11-17 11:00:00 2020-11-18 11:00:00 2020-11-19 11:00:00
2020-11-20 11:00:00 2020-11-21 11:00:00 2020-11-22 11:00:00 2020-11-23 11:00:00
2020-11-24 11:00:00 2020-11-25 11:00:00 2020-11-26 11:00:00 2020-11-27 11:00:00
2020-11-28 11:00:00 2020-11-29 11:00:00 2020-11-30 11:00:00 2020-12-31 23:00:00
cdo sinfo: Processed 1 variable over 25904 timesteps [6.03s 37MB]
In this case, the timestamps show 11:00:00 (from 1950-01-01 onwards). I guess it should be 12:00:00. What have I done wrong here? Any suggestions would be highly appreciated. Thank you.
This output appears correct. CDO has to make a decision about which timestep to use when averaging. In this case it takes the mid-point of each day, which is 11:00.
You'll notice that for the first day the time is 23:00, as there is only one timestep in that day.
However, it is not clear why you would want to shift the time back one hour. Your code is not actually calculating the daily mean; instead it is the mean of the final 23 hours of one day and the first hour of the next. Just change your CDO call to the following and everything should be fine:
cdo daymean in.nc out.nc
Robert Wilson's answer is correct; I just wanted to quickly clarify that the confusion here is due to the difference between two kinds of fields:
Instantaneous fields: clouds, water vapour, temperature, winds, etc. These fields are valid at the instant of the time stamp.
Accumulated fields: radiative fluxes, latent and sensible heat fluxes, precipitation and so on. These are accumulated over a period of time, and the time stamp is placed at the end of the window.
Thus for instantaneous fields Robert is correct that you don't want to shift, if you consider 00Z to belong to the subsequent day. You could equally validly argue that midnight should be included in the previous day (in which case you would need the shift), since it lies on the border between the two. Convention says you don't shift, and count 00...23 as one day.
Concerning the fluxes, there are also more details in this post: Calculating ERA5 Daily Total Precipitation using CDO
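The distinction can be sketched in pandas (synthetic data, not real ERA5 values): for an accumulated field whose time stamp sits at the end of each hourly window, shifting back one hour before taking daily statistics assigns each value to the day it fell in, mirroring what `cdo daymean -shifttime,-1hour` does.

```python
import pandas as pd
import numpy as np

# Hourly timestamps for two days; a made-up precipitation series where each
# value is accumulated over the hour ENDING at its timestamp (01:00 covers 00-01).
idx = pd.date_range('2017-01-01 01:00', periods=48, freq='h')
precip = pd.Series(np.ones(48), index=idx)

# Shift the index back one hour, then aggregate by calendar day.
daily_total = precip.shift(freq='-1h').resample('D').sum()
print(daily_total)   # each of the two days sums to 24.0
```

Without the shift, the 00:00 value of 2nd January would be counted towards the 2nd rather than the 1st, which is exactly the off-by-one the ECMWF recipe corrects for.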
I also had a similar issue with GLDAS 3-hourly temperature data.
Say I use data for 1948: the first file is GLDAS_NOAH025_3H.A19480101.0300.020.nc4, which holds the temperature for 1948-01-01 00:00:00 -- 1948-01-01 03:00:00, and the last file with 1948 in the filename is GLDAS_NOAH025_3H.A19481231.2100.020.nc4, which holds the temperature for 1948-12-31 18:00:00 -- 1948-12-31 21:00:00.
I added GLDAS_NOAH025_3H.A19490101.0000.020.nc4 into the 1948 folder and merged all files into a single netCDF using:
cdo mergetime *.nc4 merge_1948.nc4
Then I try to calculate the daily mean using:
cdo daymean merge_1948.nc4 tmean_1948.nc4
Unfortunately the total number of timesteps is 367; the first window is 1948-01-01 00:00:00 -- 1948-01-01 21:00:00 and the last is 1948-12-31 21:00:00 -- 1949-01-01 00:00:00.
So I used shifttime and it solved the problem:
cdo daymean -shifttime,-3hour merge_1948.nc4 temp.nc4
cdo -shifttime,3hour temp.nc4 tmean_1948.nc4
and now the first window is 1948-01-01 00:00:00 -- 1948-01-02 00:00:00 and the last is 1948-12-31 00:00:00 -- 1949-01-01 00:00:00.
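The 367-vs-366 effect described above can be reproduced with a small pandas sketch (synthetic values in place of the GLDAS temperatures):

```python
import pandas as pd
import numpy as np

# 3-hourly timestamps for 1948 plus the 00:00 reading of 1949-01-01,
# mimicking the merged GLDAS files (labels at the END of each 3-hour window).
idx = pd.date_range('1948-01-01 03:00', '1949-01-01 00:00', freq='3h')
temp = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# A naive daily mean puts the 00:00 reading into the next day: 367 days.
naive = temp.resample('D').mean()

# Shifting back 3 hours first, as with `cdo daymean -shifttime,-3hour`,
# keeps every window inside the day it describes: 366 days (1948 is a leap year).
shifted = temp.shift(freq='-3h').resample('D').mean()
print(len(naive), len(shifted))   # 367 366
```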

Exclude time zone information in datetime object

The datetime columns are available as object dtype strings. I am converting them to datetime type with:
df[['x','y']] = df[['x','y']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S %Z', errors='coerce')
Here I want to exclude the time zone information 'UTC' from the resulting columns.
From your code sample, I see that you want to process multiple variables at once. In this case, you can exclude the time zone information by using a lambda function to apply pd.to_datetime and extract the time-zone-naive timestamps with the values property. Here is an example:
import pandas as pd # v 1.1.3
df = pd.DataFrame(dict(x=['2020-12-30 12:00:00 UTC', '2020-12-30 13:00:00 UTC',
'2020-12-30 14:00:00 UTC', '2020-12-30 15:00:00 UTC',
'2020-12-30 16:00:00 UTC'],
y=['2020-12-31 01:00:00 UTC', '2020-12-31 04:00:00 UTC',
'2020-12-31 02:00:00 UTC', '2020-12-31 05:00:00 UTC',
'2020-12-31 03:00:00 UTC']))
df[['x','y']] = df[['x','y']].apply(lambda x: pd.to_datetime(x).values)
print(df[['x','y']])
# x y
# 0 2020-12-30 12:00:00 2020-12-31 01:00:00
# 1 2020-12-30 13:00:00 2020-12-31 04:00:00
# 2 2020-12-30 14:00:00 2020-12-31 02:00:00
# 3 2020-12-30 15:00:00 2020-12-31 05:00:00
# 4 2020-12-30 16:00:00 2020-12-31 03:00:00
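As an alternative to extracting the naive values with `.values`, `dt.tz_localize(None)` drops the time zone explicitly while keeping the datetime64[ns] dtype. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'x': ['2020-12-30 12:00:00 UTC', '2020-12-30 13:00:00 UTC'],
                   'y': ['2020-12-31 01:00:00 UTC', '2020-12-31 02:00:00 UTC']})

# Parse to tz-aware datetimes, then strip the UTC info column by column.
for col in ['x', 'y']:
    df[col] = pd.to_datetime(df[col], utc=True).dt.tz_localize(None)

print(df.dtypes)   # both columns are datetime64[ns], no time zone
```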

Reverse position of entries in pandas dataframe based on condition

Here I have an extract from my pandas dataframe which is survey data with two datetime fields. It appears that some of the start times and end times were filled in the wrong position in the survey. Here is an example from my dataframe. The start and end time in the 8th row, I suspect were entered the wrong way round.
Just to give context, I generated the third column like this:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
The three columns are in timedelta64 format.
Here is the top of my dataframe:
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 -1 days +22:15:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
What I am trying to do is, loop through these two columns, and for each time 'tripEnd_time' is less than 'tripStart_time' swap the positions of these two entries. So in the case of row 8 above, I would make tripStart_time = tripEnd_time and tripEnd_time = tripStart_time.
I am not quite sure of the best way to approach this. Should I use a nested for loop where I compare each entry in the two columns?
Thanks
Use Series.abs:
df_time['trip_duration'] = (df_time['tripEnd_time'] - df_time['tripStart_time']).abs()
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
This is the same as:
import numpy as np

a = df_time['tripEnd_time'] - df_time['tripStart_time']
b = df_time['tripStart_time'] - df_time['tripEnd_time']
mask = df_time['tripEnd_time'] > df_time['tripStart_time']
df_time['trip_duration'] = np.where(mask, a, b)
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
You can switch column values on selected rows:
df_time.loc[df_time['tripEnd_time'] < df_time['tripStart_time'],
['tripStart_time', 'tripEnd_time']] = df_time.loc[
df_time['tripEnd_time'] < df_time['tripStart_time'],
['tripEnd_time', 'tripStart_time']].values
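Putting the swap together with the duration recalculation, a minimal runnable sketch (with made-up timedelta data in place of the survey extract) looks like this:

```python
import pandas as pd

# Row 1 mimics the asker's row 8: end time entered before start time.
df_time = pd.DataFrame({
    'tripStart_time': pd.to_timedelta(['08:00:00', '11:00:00']),
    'tripEnd_time':   pd.to_timedelta(['08:30:00', '09:15:00'])})

# Rows where the end precedes the start were presumably entered reversed.
swapped = df_time['tripEnd_time'] < df_time['tripStart_time']
df_time.loc[swapped, ['tripStart_time', 'tripEnd_time']] = (
    df_time.loc[swapped, ['tripEnd_time', 'tripStart_time']].values)

# Recompute the duration after the swap; it is now always positive.
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
print(df_time)
```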

Change time format from 1/1/1900 10:00:00 PM to 10:00 PM

I'm currently building a program to track logins in / out. The data is exported as a string like "5:00AM". I use the following code to convert the data from string to datetime64[ns]:
df = pd.DataFrame({ 'LoginTime' : ["10:00PM", "5:00AM", "11:00PM","7:00AM"],
'Logout Time' : ["6:00AM","2:00PM", "5:00AM", "5:00PM"]})
for c in df.columns:
if c == 'LoginTime':
df['LoginTime'] = pd.to_datetime(df['LoginTime'], format='%I:%M%p')
elif c == 'Logout Time':
df['Logout Time'] = pd.to_datetime(df['Logout Time'], format='%I:%M%p')
The output result is the following:
LoginTime Logout Time
0 1900-01-01 22:00:00 1900-01-01 06:00:00
1 1900-01-01 05:00:00 1900-01-01 14:00:00
2 1900-01-01 23:00:00 1900-01-01 05:00:00
3 1900-01-01 07:00:00 1900-01-01 17:00:00
LoginTime datetime64[ns]
Logout Time datetime64[ns]
The code works as expected and changed the strings to time format. However, I noticed the format is 1/1/1900 10:00:00 PM. I would like to know if there's a way to get only the time, like 10:00:00 PM, without affecting the datetime64[ns] data type, since I have to create validation for the login in / out times.
Thanks in advance
Try using dt.strftime after converting it to datetime format:
df = pd.DataFrame({'LoginTime':['1/1/1900 10:00:00 PM', '1/2/2018 05:00:00 AM']})
LoginTime
0 1/1/1900 10:00:00 PM
1 1/2/2018 05:00:00 AM
df['LoginTime'] = pd.to_datetime(df['LoginTime']).dt.strftime('%I:%M %p')
LoginTime
0 10:00 PM
1 05:00 AM
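One caveat worth noting: `dt.strftime` returns strings (object dtype), not datetime64. If the columns must stay datetime64[ns] for the login/logout validation, you can keep the parsed columns as-is and format only when displaying. A small sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'LoginTime':   ['10:00PM', '5:00AM'],
                   'Logout Time': ['6:00AM', '2:00PM']})
for col in df.columns:
    df[col] = pd.to_datetime(df[col], format='%I:%M%p')

# Validation still works on the datetime64 columns, e.g. flagging
# overnight shifts where the logout time precedes the login time...
overnight = df['Logout Time'] < df['LoginTime']

# ...while display uses a formatted copy, leaving the originals intact.
print(df['LoginTime'].dt.strftime('%I:%M %p'))
```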

subselect pandas dataframe based on index

I have a dataframe and I want to remove certain specific repeating rows:
import numpy as np
import pandas as pd
nrows = 144
df = pd.DataFrame(np.random.rand(nrows,), pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'), columns=['A'])
The dataframe is continuous in time, providing data every two hours ad infinitum, but I've chosen to show only a subset for brevity. I want to remove the data every 72 hours at 8:00, starting on Mondays, to coincide with an external event that alters the data. For this snapshot of data I want to remove the rows indexed at 2016-02-08 08:00, 2016-02-11 08:00, +3D, etc.
Is there a simple way to do this?
IIUC you could do this:
In [18]:
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
start
Out[18]:
Timestamp('2016-02-08 08:00:00')
In [45]:
df.loc[df.index.difference(pd.date_range(start, end=df.index[-1], freq='3D'))]
Out[45]:
A
2016-02-08 00:00:00 0.323742
2016-02-08 02:00:00 0.962252
2016-02-08 04:00:00 0.706537
2016-02-08 06:00:00 0.561446
2016-02-08 10:00:00 0.225042
2016-02-08 12:00:00 0.746258
2016-02-08 14:00:00 0.167950
2016-02-08 16:00:00 0.199958
2016-02-08 18:00:00 0.808286
2016-02-08 20:00:00 0.288797
2016-02-08 22:00:00 0.508109
2016-02-09 00:00:00 0.980772
2016-02-09 02:00:00 0.995731
2016-02-09 04:00:00 0.742751
2016-02-09 06:00:00 0.392247
2016-02-09 08:00:00 0.460511
2016-02-09 10:00:00 0.083660
2016-02-09 12:00:00 0.273620
2016-02-09 14:00:00 0.791506
2016-02-09 16:00:00 0.440630
2016-02-09 18:00:00 0.326418
2016-02-09 20:00:00 0.790780
2016-02-09 22:00:00 0.521131
2016-02-10 00:00:00 0.219315
2016-02-10 02:00:00 0.016625
2016-02-10 04:00:00 0.958566
2016-02-10 06:00:00 0.405643
2016-02-10 08:00:00 0.958025
2016-02-10 10:00:00 0.786663
2016-02-10 12:00:00 0.589064
... ...
2016-02-17 12:00:00 0.360848
2016-02-17 14:00:00 0.757499
2016-02-17 16:00:00 0.391574
2016-02-17 18:00:00 0.062812
2016-02-17 20:00:00 0.308282
2016-02-17 22:00:00 0.251520
2016-02-18 00:00:00 0.832871
2016-02-18 02:00:00 0.387108
2016-02-18 04:00:00 0.070969
2016-02-18 06:00:00 0.298831
2016-02-18 08:00:00 0.878526
2016-02-18 10:00:00 0.979233
2016-02-18 12:00:00 0.386620
2016-02-18 14:00:00 0.420962
2016-02-18 16:00:00 0.238879
2016-02-18 18:00:00 0.124069
2016-02-18 20:00:00 0.985828
2016-02-18 22:00:00 0.585278
2016-02-19 00:00:00 0.409226
2016-02-19 02:00:00 0.093945
2016-02-19 04:00:00 0.389450
2016-02-19 06:00:00 0.378091
2016-02-19 08:00:00 0.874232
2016-02-19 10:00:00 0.527629
2016-02-19 12:00:00 0.490236
2016-02-19 14:00:00 0.509008
2016-02-19 16:00:00 0.097061
2016-02-19 18:00:00 0.111626
2016-02-19 20:00:00 0.877099
2016-02-19 22:00:00 0.796201
[140 rows x 1 columns]
This determines the start of the range by comparing dayofweek and hour and taking the first matching index value. We then generate the unwanted timestamps with date_range, call difference on the index to remove those rows, and pass the result to loc.
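An equivalent approach is to build the unwanted 08:00 timestamps directly and pass them to `DataFrame.drop`. A sketch using the question's random-data setup:

```python
import numpy as np
import pandas as pd

nrows = 144
df = pd.DataFrame(np.random.rand(nrows),
                  index=pd.date_range('2016-02-08 00:00:00',
                                      periods=nrows, freq='2h'),
                  columns=['A'])

# First Monday 08:00 in the index, then every 3 days after it.
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
drop = pd.date_range(start, end=df.index[-1], freq='3D')

# errors='ignore' tolerates generated timestamps missing from the index.
result = df.drop(index=drop, errors='ignore')
print(len(df), len(result))   # 144 140
```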