I have the following time series and I want to convert to datetime in DataFrame using "pd.to_datetime". I am getting the following error: "hour must be in 0..23: 2017/ 01/01 24:00:00". How can I go around this error?
DateTime
0 2017/ 01/01 01:00:00
1 2017/ 01/01 02:00:00
2 2017/ 01/01 03:00:00
3 2017/ 01/01 04:00:00
...
22 2017/ 01/01 23:00:00
23 2017/ 01/01 24:00:00
Given:
DateTime
0 2017/01/01 01:00:00
1 2017/01/01 02:00:00
2 2017/01/01 03:00:00
3 2017/01/01 04:00:00
4 2017/01/01 23:00:00
5 2017/01/01 24:00:00
As the error says, 24:00:00 isn't a valid time. Depending on what it actually means, we can salvage it like this:
# Split up your Date and Time Values into separate Columns:
df[['Date', 'Time']] = df.DateTime.str.split(expand=True)
# Convert them separately, one as datetime, the other as timedelta.
df.Date = pd.to_datetime(df.Date)
df.Time = pd.to_timedelta(df.Time)
# Fix your DateTime Column, Drop the helper Columns:
df.DateTime = df.Date + df.Time
df = df.drop(['Date', 'Time'], axis=1)
print(df)
print(df.dtypes)
Output:
DateTime
0 2017-01-01 01:00:00
1 2017-01-01 02:00:00
2 2017-01-01 03:00:00
3 2017-01-01 04:00:00
4 2017-01-01 23:00:00
5 2017-01-02 00:00:00
DateTime datetime64[ns]
dtype: object
df['DateTime'] =pd.to_datetime(df['DateTime'], format='%y-%m-%d %H:%M', errors='coerce')
Try this out!
Related
I have a df like so:
firstdate seconddate
0 2011-01-01 13:00:00 2011-01-01 13:00:00
1 2011-01-02 14:00:00 2011-01-01 11:00:00
2 2011-01-02 16:00:00 2011-01-02 13:00:00
3 2011-01-04 12:00:00 2011-01-03 15:00:00
...
Seconddate is always before firstdate. I want to compute the difference between firstdate and seconddate in number of days and make this a column, if firstdate and seconddate are the same day, difference=0, if seconddate is the day before firstdate, difference=1 and so on until a week. How would I do this?
df['first'] = pd.to_datetime(df['first'])
df['second'] = pd.to_datetime(df['second'])
df['diff'] = (df['first'] - df['second']).dt.days
This will add a column with the diff. You can delete based on it
df.drop(df[df.diff < 0].index)
# or
df = df[df.diff > 0]
I have a dataframe with hourly values for several years. My dataframe is already in datetime format, and the column containing the values is called say "value column".
date = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value_column = [33.24 , 31.71 , 34.39 , 34.49 ]
df = pd.DataFrame({'value column':value_column})
df.index = pd.to_datetime(df['index'],format='%Y-%m-%d %H:%M')
df.drop(['index'],axis=1,inplace=True)
print(df.head())
value column
index
2015-02-03 23:00:00 33.24
2015-02-03 23:30:00 31.71
2015-02-04 00:00:00 34.39
2015-02-04 00:30:00 34.49
I know how to get the mean of the "value column" for each year efficiently with for instance the following command:
df = df.groupby(df.index.year).mean()
Now, I would like to divide all hourly values of the column "value column" by the mean of its values for its corresponding year (for instance dividing all the 2015 hourly values by the mean of 2015 values, and same for the other years).
Is there an efficient way to do that in pandas?
Expected result:
value column Value column/mean of year
index
2015-02-03 23:00:00 33.24 0.993499
2015-02-03 23:30:00 31.71 0.94777
2015-02-04 00:00:00 34.39 1.027871
2015-02-04 00:30:00 34.49 1.03086
Many thanks,
Try the following:
df.groupby(df.index.year).transform(lambda x: x/x.mean())
Refer: Group By: split-apply-combine
Transformation is recommended as it is meant to perform some group-specific computations and return a like-indexed object.
I just found another way, which im not sure to understand but works!
df['result'] = df['value column'].groupby(df.index.year).apply(lambda x: x/x.mean())
I thought that in apply functions, x was refering to single values of the array but it seems that it refers to the group itself.
You should be able to do:
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
So you set the index to be the year in order to divide by the groupby object, and then reset the index to keep the original timestamps.
Full example:
import pandas as pd
import numpy as np
np.random.seed(1)
dr = pd.date_range('1-1-2010','1-1-2020', freq='H')
df = pd.DataFrame({'value column':np.random.rand(len(dr))}, index=dr)
print(df, '\n')
print(df.groupby(df.index.year).mean(), '\n')
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
print(df)
Output:
#original data
value column
2010-01-01 00:00:00 0.417022
2010-01-01 01:00:00 0.720324
2010-01-01 02:00:00 0.000114
2010-01-01 03:00:00 0.302333
2010-01-01 04:00:00 0.146756
...
2019-12-31 20:00:00 0.530828
2019-12-31 21:00:00 0.224505
2019-12-31 22:00:00 0.459977
2019-12-31 23:00:00 0.931504
2020-01-01 00:00:00 0.581869
[87649 rows x 1 columns]
#grouped by year
value column
2010 0.497135
2011 0.503547
2012 0.501023
2013 0.497848
2014 0.497065
2015 0.501417
2016 0.498303
2017 0.499266
2018 0.499533
2019 0.492220
2020 0.581869
#final output
value column
2010-01-01 00:00:00 0.838851
2010-01-01 01:00:00 1.448952
2010-01-01 02:00:00 0.000230
2010-01-01 03:00:00 0.608150
2010-01-01 04:00:00 0.295203
...
2019-12-31 20:00:00 1.078436
2019-12-31 21:00:00 0.456107
2019-12-31 22:00:00 0.934494
2019-12-31 23:00:00 1.892455
2020-01-01 00:00:00 1.000000
[87649 rows x 1 columns]
Given a dataframe like:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'Date' : pd.date_range('1/1/2011', periods=5, freq='3675S'),
'Num' : np.random.rand(5)})
Date Num
0 2011-01-01 00:00:00 0.580997
1 2011-01-01 01:01:15 0.407332
2 2011-01-01 02:02:30 0.786035
3 2011-01-01 03:03:45 0.821792
4 2011-01-01 04:05:00 0.807869
I would like to remove the 'minutes' and 'seconds' information.
The following (mostly stolen from: How to remove the 'seconds' of Pandas dataframe index?) works okay,
df = df.assign(Date = lambda x: pd.to_datetime(x['Date'].dt.strftime('%Y-%m-%d %H')))
Date Num
0 2011-01-01 00:00:00 0.580997
1 2011-01-01 01:00:00 0.407332
2 2011-01-01 02:00:00 0.786035
3 2011-01-01 03:00:00 0.821792
4 2011-01-01 04:00:00 0.807869
but it feels strange to convert a datetime to a string then back to a datetime. Is there a way to do this more directly?
dt.round
This is how it should be done... use dt.round
df.assign(Date=df.Date.dt.round('H'))
Date Num
0 2011-01-01 00:00:00 0.577957
1 2011-01-01 01:00:00 0.995748
2 2011-01-01 02:00:00 0.864013
3 2011-01-01 03:00:00 0.468762
4 2011-01-01 04:00:00 0.866827
OLD ANSWER
One approach is to set the index and use resample
df.set_index('Date').resample('H').last().reset_index()
Date Num
0 2011-01-01 00:00:00 0.577957
1 2011-01-01 01:00:00 0.995748
2 2011-01-01 02:00:00 0.864013
3 2011-01-01 03:00:00 0.468762
4 2011-01-01 04:00:00 0.866827
Another alternative is to strip the date and hour components
df.assign(
Date=pd.to_datetime(df.Date.dt.date) +
pd.to_timedelta(df.Date.dt.hour, unit='H'))
Date Num
0 2011-01-01 00:00:00 0.577957
1 2011-01-01 01:00:00 0.995748
2 2011-01-01 02:00:00 0.864013
3 2011-01-01 03:00:00 0.468762
4 2011-01-01 04:00:00 0.866827
Other solution could be this :
df.Date = pd.to_datetime(df.Date)
df.Date = df.Date.apply(lambda x: datetime(x.year, x.month, x.day, x.hour))
I am trying to add holidays column for France in a Dataframe by using workalendar package but it gives me an error of
Series' object has no attribute 'weekday'
Below is my code;
from workalendar.europe import France
df1 = pd.read_csv('C:\Users\ABC.csv')
df1['Date'] = pd.to_datetime(df1['Date'], format= '%d/%m/%Y %H:%M:%S')
df1['Date1'] = df1.Date.dt.date
dr = df1['Date1']
cal = France()
df1['Holiday'] = cal.is_working_day(df1['Date1'])
df1.head()
The original data in the file looks like this;
Date Value
17/10/2012 19:00:00 0
17/10/2012 20:00:00 0.1
17/10/2012 21:00:00 0
17/10/2012 22:00:00 0
17/10/2012 23:00:00 0
18/10/2012 00:00:00 0
18/10/2012 01:00:00 0
18/10/2012 02:00:00 0
18/10/2012 03:00:00 0.1
18/10/2012 04:00:00 0
18/10/2012 05:00:00 0
18/10/2012 06:00:00 0
18/10/2012 07:00:00 0
18/10/2012 08:00:00 0.2
18/10/2012 09:00:00 0.5
`
Try this.
df1['Holiday'] = df1.Date.apply(lambda x: cal.is_working_day(pd.to_pydatetime(x)))
You have to convert the object type to datetime.
BTW, I thought that working_day would might not be Holiday...
Converting between datetime, Timestamp and datetime64
After importing data from a HDF5 file the index for my stock data has disappeared.
One of the columns in my dataframe "Date" is a Datetime64. How do I convert this date column to a datetimeindex column but without the time parts at the end.
So that slicing the dataframe like this data.ix["2016-01-01":"2016-02-06"] works.
IIUC, starting from a sample dataframe as:
Date x
0 2016-01-01 20:01 1
1 2016-01-02 20:02 2
you can do:
df = df.set_index(pd.DatetimeIndex(df['Date']).date)
which returns your DatetimeIndex only with the date part:
Date x
2016-01-01 2016-01-01 20:01 1
2016-01-02 2016-01-02 20:02 2
Use set_index to convert the column to an index. You do not need to trim the time part for the it to work.
df = pd.DataFrame({
'ts': pd.date_range('2015-12-20', periods=10, freq='12h'),
'stuff': np.random.randn(10)
})
print(df)
stuff ts
0 0.942231 2015-12-20 00:00:00
1 1.229604 2015-12-20 12:00:00
2 -0.162319 2015-12-21 00:00:00
3 -0.142590 2015-12-21 12:00:00
4 1.057184 2015-12-22 00:00:00
5 -0.370927 2015-12-22 12:00:00
6 -0.358605 2015-12-23 00:00:00
7 -0.561857 2015-12-23 12:00:00
8 -0.020714 2015-12-24 00:00:00
9 0.552764 2015-12-24 12:00:00
print(df.set_index('ts').ix['2015-12-21':'2015-12-23'])
stuff
ts
2015-12-21 00:00:00 -0.162319
2015-12-21 12:00:00 -0.142590
2015-12-22 00:00:00 1.057184
2015-12-22 12:00:00 -0.370927
2015-12-23 00:00:00 -0.358605
2015-12-23 12:00:00 -0.561857