Below is the sample data
Datetime
11/19/2020 9:48:50 AM
12/17/2020 2:41:02 PM
2020-02-11 14:44:58
2020-28-12 10:41:02
2020-05-12 06:31:39
11/19/2020 is in mm/dd/yyyy whereas 2020-28-12 is yyyy-dd-mm.
After applying pd.to_datetime below is the output that I am getting.
Date
2020-11-19 09:48:50
2020-12-17 22:41:02
2020-02-11 14:44:58
2020-28-12 10:41:02
2020-05-12 06:31:39
Input that uses slashes (/), e.g. 11/19/2020, is in mm/dd/yyyy format, and input that uses dashes (-), e.g. 2020-02-11, is in yyyy-dd-mm format. But after applying pd.to_datetime, the day and month are interchanged for the dash-formatted rows.
The first two outputs are correct. The bottom three need to be corrected to:
2020-11-02 14:44:58
2020-12-28 10:41:02
2020-12-05 06:31:39
Please suggest how to get a common format, i.e. yyyy-mm-dd.
Use to_datetime once per format with errors='coerce', so that values that don't match the format become NaT, and then fill those missing values from the other Series with Series.fillna:
d1 = pd.to_datetime(df['datetime'], format='%Y-%d-%m %H:%M:%S', errors='coerce')
d2 = pd.to_datetime(df['datetime'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
df['datetime'] = d1.fillna(d2)
print (df)
datetime
0 2020-11-19 09:48:50
1 2020-12-17 14:41:02
2 2020-11-02 14:44:58
3 2020-12-28 10:41:02
4 2020-12-05 06:31:39
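For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea using the sample strings from the question. The variable names (raw, dash, slash, parsed, unparsed) are only illustrative, and the check for leftover NaT values is an extra step assumed to be useful, not part of the answer above.

```python
import pandas as pd

# Sample strings from the question: slash dates are mm/dd/yyyy with AM/PM,
# dash dates are yyyy-dd-mm with a 24-hour clock.
raw = pd.Series([
    "11/19/2020 9:48:50 AM",
    "12/17/2020 2:41:02 PM",
    "2020-02-11 14:44:58",
    "2020-28-12 10:41:02",
    "2020-05-12 06:31:39",
])

# Each pass parses one format; rows that do not match become NaT.
dash = pd.to_datetime(raw, format="%Y-%d-%m %H:%M:%S", errors="coerce")
slash = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Take the dash result where it exists, otherwise the slash result.
parsed = dash.fillna(slash)

# Anything still NaT matched neither format and needs a closer look.
unparsed = raw[parsed.isna()]
print(parsed)
```

If unparsed is not empty, a third format is probably hiding in the data and another pass can be chained in the same way.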
When converting a pandas dataframe column from object to datetime using astype function, the behavior is different depending on if the strings have the time component or not. What is the correct way of converting the column?
df = pd.DataFrame({'Date': ['12/07/2013 21:50:00','13/07/2013 00:30:00','15/07/2013','11/07/2013']})
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y %H:%M:%S", exact=False, dayfirst=True, errors='ignore')
Output:
Date
0 12/07/2013 21:50:00
1 13/07/2013 00:30:00
2 15/07/2013
3 11/07/2013
but the dtype is still object. When doing:
df['Date'] = df['Date'].astype('datetime64')
it becomes of datetime dtype but the day and month are not parsed correctly on rows 0 and 3.
Date
0 2013-12-07 21:50:00
1 2013-07-13 00:30:00
2 2013-07-15 00:00:00
3 2013-11-07 00:00:00
The expected result is:
Date
0 2013-07-12 21:50:00
1 2013-07-13 00:30:00
2 2013-07-15 00:00:00
3 2013-07-11 00:00:00
Looking at the source code, if you pass both format= and dayfirst=, dayfirst= is never read, because passing format= calls a C function (np_datetime_strings.c) that doesn't use dayfirst= to make conversions. On the other hand, if you pass only dayfirst=, it is used to first guess the format, falling back on dateutil.parser.parse to make conversions. So use only one of them.
In most cases,
df['Date'] = pd.to_datetime(df['Date'])
does the job.
In the specific example in the OP, passing dayfirst=True does the job.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
That said, passing format= makes the conversion run ~25x faster (see this post for more info), so if your frame is anything larger than 10k rows, it's better to pass format=. Since the format here is mixed, one way is to perform the conversion in two steps (the errors='coerce' argument is useful here):
1. Convert the datetimes that have a time component.
2. Fill in the NaT values (the "coerced" rows) with a Series converted using a different format.
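# Parse the original strings in both passes; if df['Date'] is overwritten first, the second format has no strings left to parse.
with_time = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
date_only = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce')
df['Date'] = with_time.fillna(date_only)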
This method (performing two or more conversions) can be used to convert any column with "weirdly" formatted datetimes.
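Here is a self-contained sketch of the two-step conversion on the frame from the question. It keeps the original strings in a separate variable (raw, a name chosen only for this example) so that both passes read text rather than an already converted column.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["12/07/2013 21:50:00", "13/07/2013 00:30:00",
                            "15/07/2013", "11/07/2013"]})

raw = df["Date"]  # the original strings; both passes parse these

# Step 1: rows with a time component; date-only rows become NaT.
with_time = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S", errors="coerce")

# Step 2: rows without a time component; the other rows become NaT.
date_only = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Combine: keep step 1 where it parsed, otherwise use step 2.
df["Date"] = with_time.fillna(date_only)
print(df)
```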
I have a problem with pandas in Python. I have several rows with different dates, where the dates are strings:
"2016-02-28" ABC123
"2016-02-29" CDE345
"2016-03-30" FGH567
"2016-03-31" XYZ235
...
Here we see that February has two different days, 28 and 29. I am only interested in the month, so I want these rows to have the same day, like this:
"2016-02-29" ABC123
"2016-02-29" CDE345
"2016-03-31" FGH567
"2016-03-31" XYZ235
...
It does not really matter which day they get, as long as it is the same day, but preferably the last day. I cannot truncate and keep only "2016-02" because I need the day later. I can convert it to a timestamp if that makes it easier.
df["DATE"] = pandas.to_datetime(df["DATE"])
(Another question, why does this line convert the DATE column to Timestamp instead of datetime?? It says to convert to datetime, but instead it becomes Timestamp?)
I have tried to resample, but to no avail. I don't want to do this manually by cutting and pasting strings, as I have done earlier. Is there a more elegant solution?
Use MonthEnd offset:
df["DATE"] = pd.to_datetime(df["DATE"]) + pd.offsets.MonthEnd(0)
print (df)
DATE COL
0 2016-02-29 ABC123
1 2016-02-29 CDE345
2 2016-03-31 FGH567
3 2016-03-31 XYZ235
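The 0 in MonthEnd(0) matters. As a small sketch (the variable names are made up for this example): with 0, a date that is already the last day of its month stays where it is, while MonthEnd(1) pushes such a date on to the end of the following month.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2016-02-28", "2016-02-29", "2016-03-30"]))

# MonthEnd(0): roll forward to the month end, leaving month-end dates untouched.
rolled_0 = dates + pd.offsets.MonthEnd(0)

# MonthEnd(1): a date already on a month end jumps to the next month's end.
rolled_1 = dates + pd.offsets.MonthEnd(1)

print(pd.DataFrame({"date": dates, "MonthEnd(0)": rolled_0, "MonthEnd(1)": rolled_1}))
```

For the "same day for every row in a month" requirement, MonthEnd(0) is therefore the variant that keeps 2016-02-28 and 2016-02-29 together.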
If the DataFrame is really big and performance is important:
df['DATE'] = pd.to_datetime(df["DATE"]).values.astype('datetime64[M]') + \
np.array([1], dtype='timedelta64[M]') - \
np.array([1], dtype='timedelta64[D]')
print (df)
DATE COL
0 2016-02-29 ABC123
1 2016-02-29 CDE345
2 2016-03-31 FGH567
3 2016-03-31 XYZ235
(Another question, why does this line convert the DATE column to
Timestamp instead of datetime?? It says to convert to datetime, but
instead it becomes Timestamp?)
The docs here say:
Timestamp is the pandas equivalent of python’s Datetime and is
interchangeable with it in most cases. It’s the type used for the
entries that make up a DatetimeIndex, and other timeseries oriented
data structures in pandas.
Or use to_period (here the dates are in the index):
df.index=df.index.to_period('M').to_timestamp('M')
df
Out[16]:
A
2016-02-29 ABC123
2016-02-29 CDE345
2016-03-31 FGH567
2016-03-31 XYZ235
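If the dates live in a column rather than the index, the same idea can be written with the .dt accessor. This is only a sketch under that assumption: it converts to monthly periods, takes each period's end_time, and normalizes it back to midnight. The column names follow the question.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2016-02-28", "2016-02-29", "2016-03-30", "2016-03-31"]),
    "COL": ["ABC123", "CDE345", "FGH567", "XYZ235"],
})

# Monthly periods: every date in the same month maps to the same period.
periods = df["DATE"].dt.to_period("M")

# end_time is the last instant of the period; normalize() drops the time part.
df["DATE"] = periods.dt.end_time.dt.normalize()
print(df)
```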
I have a column I_DATE of type string (object) in a dataframe called train, as shown below.
I_DATE
28-03-2012 2:15:00 PM
28-03-2012 2:17:28 PM
28-03-2012 2:50:50 PM
How can I convert I_DATE from string to datetime format and specify the format of the input string?
Also, how can I filter rows based on a range of dates in pandas?
Use to_datetime. There is no need for a format string since the parser is able to handle it:
In [51]:
pd.to_datetime(df['I_DATE'])
Out[51]:
0 2012-03-28 14:15:00
1 2012-03-28 14:17:28
2 2012-03-28 14:50:50
Name: I_DATE, dtype: datetime64[ns]
To access the date/day/time component use the dt accessor:
In [54]:
df['I_DATE'].dt.date
Out[54]:
0 2012-03-28
1 2012-03-28
2 2012-03-28
dtype: object
In [56]:
df['I_DATE'].dt.time
Out[56]:
0 14:15:00
1 14:17:28
2 14:50:50
dtype: object
You can filter using date strings, for example:
In [59]:
import datetime as dt
df = pd.DataFrame({'date':pd.date_range(start = dt.datetime(2015,1,1), end = dt.datetime.now())})
df[(df['date'] > '2015-02-04') & (df['date'] < '2015-02-10')]
Out[59]:
date
35 2015-02-05
36 2015-02-06
37 2015-02-07
38 2015-02-08
39 2015-02-09
Approach: 1
Given original string format: 2019/03/04 00:08:48
you can use
updated_df = df['timestamp'].astype('datetime64[ns]')
The result will be in this datetime format: 2019-03-04 00:08:48
Approach: 2
updated_df = df.astype({'timestamp':'datetime64[ns]'})
For a datetime in AM/PM format, the time format is '%I:%M:%S %p'. See all possible format combinations at https://strftime.org/. N.B. If you have a time component, as in the OP, the conversion will be much faster if you pass format= (see here for more info).
df['I_DATE'] = pd.to_datetime(df['I_DATE'], format='%d-%m-%Y %I:%M:%S %p')
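A runnable version with the three sample rows from the question, as a small sketch. The frame is rebuilt here by hand, so only the column name and values come from the question; note that %I (12-hour clock) has to be paired with %p for the PM marker to be applied.

```python
import pandas as pd

train = pd.DataFrame({"I_DATE": [
    "28-03-2012 2:15:00 PM",
    "28-03-2012 2:17:28 PM",
    "28-03-2012 2:50:50 PM",
]})

# Day-month-year, 12-hour clock with AM/PM marker.
train["I_DATE"] = pd.to_datetime(train["I_DATE"], format="%d-%m-%Y %I:%M:%S %p")
print(train.dtypes)
print(train)
```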
To filter a datetime using a range, you can use query:
df = pd.DataFrame({'date': pd.date_range('2015-01-01', '2015-04-01')})
df.query("'2015-02-04' < date < '2015-02-10'")
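or use between to create a mask and filter. Note that between includes both endpoints by default, unlike the strict inequalities in the query above.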
df[df['date'].between('2015-02-04', '2015-02-10')]
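To make the endpoint behaviour concrete, here is a short sketch comparing a strict boolean mask with between on the same range. The names lo, hi, strict and inclusive are just for this example.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2015-01-01", "2015-04-01")})
lo, hi = pd.Timestamp("2015-02-04"), pd.Timestamp("2015-02-10")

# Strict inequalities: both endpoints are excluded.
strict = df[(df["date"] > lo) & (df["date"] < hi)]

# between: both endpoints are included by default.
inclusive = df[df["date"].between(lo, hi)]

print(len(strict), len(inclusive))
```

Which one is right depends on whether the boundary days themselves should be part of the selection.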
I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] column has to have a datetime dtype.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for day, 2M for two months, etc. for different sampling intervals, and for time series data with timestamps you can use finer-grained intervals such as 45Min for 45 minutes or 15Min for 15 minutes.
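A small sketch of what the period column looks like and how it can be used, assuming a column that is already datetime. The extra columns month_year_str and the counts variable are illustrative additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"date_column": pd.to_datetime(["2012-12-31", "2012-12-29", "2013-01-02"])})

# Period column: the day is dropped, but the values still sort and group as dates.
df["month_year"] = df["date_column"].dt.to_period("M")

# Plain strings such as '2012-12', if text is needed for output.
df["month_year_str"] = df["month_year"].astype(str)

# Periods work directly as group keys.
counts = df.groupby("month_year").size()
print(df)
print(counts)
```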
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)
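The same logic reads a little more easily as a named function. This is a sketch only: last_weekday_of_month is a hypothetical helper name, it relies on the calendar module's default of weeks starting on Monday, and it does not know about holidays.

```python
import calendar
import datetime

import pandas as pd


def last_weekday_of_month(ts):
    """Return midnight on the last Monday-to-Friday day of ts's month."""
    last_week = calendar.monthcalendar(ts.year, ts.month)[-1]
    # Slots 0-4 are Monday..Friday; days outside the month are 0.
    return datetime.datetime(ts.year, ts.month, max(last_week[:5]))


df = pd.DataFrame({"ArrivalDate": pd.to_datetime(["2012-12-29", "2014-08-05"])})
df["AdjustedDateToEndOfMonth"] = df["ArrivalDate"].map(last_weekday_of_month)
print(df)
```

December 2012 ends on a Monday, so the 31st is kept; August 2014 ends on a Sunday, so the result falls back to Friday the 29th.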
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to convert the column to datetime first; I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: Adding a column with 'year-month' pairs:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
df['Year'] creates a new column. If you want to extract the month instead, just use .month.
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
@KieranPC's solution is the correct approach for pandas, but is not easily extensible to arbitrary attributes. For this, you can use getattr within a generator expression and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
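# define list of attributes required
# (note: 'weekofyear' is not available on the .dt accessor in pandas 2.0+; .dt.isocalendar().week is the replacement there)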
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
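On recent pandas versions the weekofyear attribute is, as far as I know, no longer available on the .dt accessor, so the getattr loop above can fail on that one name. A hedged variant of the same approach takes the ISO week from isocalendar() instead; the names attrs, parts and out are only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"ArrivalDate": pd.to_datetime(["2012-12-31", "2012-12-29", "2012-12-30"])})

# Attributes that can be read straight off the .dt accessor.
attrs = ["year", "month", "day", "dayofweek", "dayofyear", "quarter"]
parts = [getattr(df["ArrivalDate"].dt, name).rename(name) for name in attrs]

# isocalendar() returns a frame with year, week and day columns; keep the week.
parts.append(df["ArrivalDate"].dt.isocalendar()["week"].rename("weekofyear"))

# Concatenate the pieces and join them to the original frame.
out = df.join(pd.concat(parts, axis=1))
print(out)
```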
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There are two steps to extract the year for the whole dataframe without using the apply method.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting string as a date, but when I plotted it, the year_month strings were ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
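I think the input here should be a string. In Python 3 the lambda is written without parentheses around its argument, and dropping the last three characters ('-31') leaves '2012-12':
df['ArrivalDate'].astype(str).apply(lambda x: x[:-3])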