I have a very large DataFrame whose index stores dates as integers, for example 20171001. I want to change values like 20171001 into the datetime format '2017-10-01'.
For simplicity, I generate such a DataFrame:
>>> df = pd.DataFrame(np.random.randn(3,2), columns=list('ab'), index=
[20171001,20171002,20171003])
>>> df
a b
20171001 2.205108 0.926963
20171002 1.104884 -0.445450
20171003 0.621504 -0.584352
>>> df.index
Int64Index([20171001, 20171002, 20171003], dtype='int64')
If we apply to_datetime to df.index directly, we get a strange result:
>>> pd.to_datetime(df.index)
DatetimeIndex(['1970-01-01 00:00:00.020171001',
'1970-01-01 00:00:00.020171002',
'1970-01-01 00:00:00.020171003'],
dtype='datetime64[ns]', freq=None)
What I want is DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03'], ...)
How can I handle this? Note that the file is given as-is, so I cannot change how the dates are stored.
Use format='%Y%m%d' in pd.to_datetime. (Without a format, pandas interprets the integers as nanoseconds since the Unix epoch, which is why you got timestamps in 1970.) I.e.:
pd.to_datetime(df.index, format='%Y%m%d')
DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03'], dtype='datetime64[ns]', freq=None)
To assign it back to the index: df.index = pd.to_datetime(df.index, format='%Y%m%d')
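If your pandas version refuses to parse integers directly with a format, a minimal sketch (reusing the example frame from above) is to cast the index to strings first:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=list('ab'),
                  index=[20171001, 20171002, 20171003])

# cast the integer index to strings before parsing, so pd.to_datetime
# only ever sees '%Y%m%d'-formatted strings
df.index = pd.to_datetime(df.index.astype(str), format='%Y%m%d')
# DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03'], dtype='datetime64[ns]', freq=None)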
pd.to_datetime is the pandas way of doing it, but here are two alternatives:
import datetime
df.index = [datetime.datetime.strptime(str(i), "%Y%m%d") for i in df.index]
or
import datetime
df.index = df.index.map(lambda x: datetime.datetime.strptime(str(x),"%Y%m%d"))
Related
I have one field in a pandas DataFrame that was imported in string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date?
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
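The question also asks about filtering based on date; once the column is a real datetime, that is a plain boolean comparison. A small sketch, assuming a raw_data frame with a Mycol column as above (the second value is made up):
import pandas as pd

raw_data = pd.DataFrame({'Mycol': ['05SEP2014:00:00:00.000', '17JAN2015:00:00:00.000']})
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')

# date strings compare directly against a datetime64 column
recent = raw_data[raw_data['Mycol'] >= '2015-01-01']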
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as datetime. With infer_datetime_format=True it tries to detect the format automatically and converts the column to datetime. (Note that infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default behavior.)
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
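For context, a minimal sketch of how the warning typically arises (df_all and the filter condition are made up for illustration):
import pandas as pd

df_all = pd.DataFrame({'group': ['a', 'b', 'a'],
                       'date': ['05SEP2014:00:00:00.000'] * 3})

df = df_all[df_all['group'] == 'a']          # slicing like this, then assigning into df, triggers the warning
df = df_all[df_all['group'] == 'a'].copy()   # taking an explicit copy avoids it
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')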
errors='coerce' is useful
If some rows are not in the correct format or are not datetimes at all, the errors= parameter is very useful: you can convert the valid rows and deal with the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
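After coercion, rows that failed to parse become NaT, so you can inspect or drop them afterwards. A small sketch (the invalid value is made up):
import pandas as pd

df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000', 'not a date']})
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')

bad_rows = df[df['date'].isna()]    # rows that could not be parsed (NaT)
df = df.dropna(subset=['date'])     # or simply drop them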
Setting the correct format= is much faster than letting pandas find out¹
Long story short, passing the correct format= from the beginning, as in chrisb's post, is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking a couple of minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
¹ Code used to produce the timeit test plot.
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

mdYHMSf = range(1, 13), range(1, 29), range(2000, 2024), range(24), *[range(60)]*2, range(1000)

perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m, d, Y, H, M, S, f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just as we convert object dtype to float or int, you can use astype():
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')
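Note that astype takes no format argument, so this shortcut is only reliable when the strings are already in an unambiguous, ISO-like layout; a quick sketch:
import pandas as pd

s = pd.Series(['2014-09-05', '2015-01-17'])
s = s.astype('datetime64[ns]')    # fine for ISO-style strings
# for layouts such as '05SEP2014:00:00:00.000', prefer pd.to_datetime with format=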
I have a DataFrame with a column 'Date' containing values such as '9999-12-31 00:00:00'. I need to convert them to 'dd/mm/yyyy' format.
import pandas as pd
data = (['9999-12-31 00:00:00'])
df = pd.DataFrame(data, columns=['Date'])
Year 9999 cannot be represented as a datetime64[ns] Timestamp, so use daily Periods instead: split off the time component, convert each value to a pd.Period, and change the format with strftime:
df['Date'] = (df['Date'].str.split()
                        .str[0]
                        .apply(lambda x: pd.Period(x, freq='D'))
                        .dt.strftime('%d/%m/%Y'))
print (df)
Date
0 31/12/9999
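For reference, Timestamp.max is in the year 2262, which is why pd.to_datetime fails here while Period works. A small sketch of the difference:
import pandas as pd

# pd.to_datetime('9999-12-31 00:00:00')   # raises OutOfBoundsDatetime
pd.Period('9999-12-31', freq='D').strftime('%d/%m/%Y')   # '31/12/9999'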
I set the index of my dataframe to a time series:
new_data.index = pd.DatetimeIndex(new_data.index)
How can I convert this timeseries data back into the original string format?
Pandas Index objects often have methods equivalent to those available on Series. Here you can use pd.Index.astype:
df = pd.DataFrame(index=['2018-01-01', '2018-05-15', '2018-12-25'])
df.index = pd.DatetimeIndex(df.index)
# DatetimeIndex(['2018-01-01', '2018-05-15', '2018-12-25'],
# dtype='datetime64[ns]', freq=None)
df.index = df.index.astype(str)
# Index(['2018-01-01', '2018-05-15', '2018-12-25'], dtype='object')
Note that strings in pandas are stored in object dtype Series. If you need a specific format, this can also be accommodated:
df.index = df.index.strftime('%d-%b-%Y')
# Index(['01-Jan-2018', '15-May-2018', '25-Dec-2018'], dtype='object')
See Python's strftime directives for conventions.
I would like to change the Date index of the DataFrame from the default style into the '%m/%d/%Y' format.
In: df
Out:
Date Close
2006-01-24 48.812471
2006-01-25 47.448712
2006-01-26 53.341202
2006-01-27 58.728122
2006-01-30 59.481986
2006-01-31 55.691974
df.index
Out:
DatetimeIndex(['2006-01-04', '2006-01-05', '2006-01-06', '2006-01-09',
'2006-01-10', '2006-01-11', '2006-01-12', '2006-01-13',
'2006-01-17', '2006-01-18',
...
'2018-02-21', '2018-02-22', '2018-02-23', '2018-02-26',
'2018-02-27', '2018-02-28', '2018-03-01', '2018-03-02',
'2018-03-05', '2018-03-06'],
dtype='datetime64[ns]', name=u'Date', length=3063, freq=None)
Into:
In: df1
Out:
Date Close
01/24/2006 48.812471
01/25/2006 47.448712
01/26/2006 53.341202
01/27/2006 58.728122
01/30/2006 59.481986
01/31/2006 55.691974
I tried this method before...
df1.index = pd.to_datetime(df.index, format = '%m/%d/%Y')
df1.index = df.dt.strftime('%Y-%m-%d')
AttributeError: 'DataFrame' object has no attribute 'dt'
Use DatetimeIndex.strftime - an Index has no .dt accessor, so call the method on the index directly:
df1.index = pd.to_datetime(df1.index, format = '%m/%d/%Y').strftime('%Y-%m-%d')
Which is the same as:
df1.index = pd.to_datetime(df1.index, format = '%m/%d/%Y')
df1.index = df1.index.strftime('%Y-%m-%d')
EDIT: if you need to convert the DatetimeIndex to another string format:
print (df1.index)
DatetimeIndex(['2006-01-24', '2006-01-25', '2006-01-26', '2006-01-27',
'2006-01-30', '2006-01-31'],
dtype='datetime64[ns]', name='Date', freq=None)
df1.index = df1.index.strftime('%m/%d/%Y')
print (df1)
Close
01/24/2006 48.812471
01/25/2006 47.448712
01/26/2006 53.341202
01/27/2006 58.728122
01/30/2006 59.481986
01/31/2006 55.691974