Pandas, format date from dd/mm/yyyy to MMM dd/yy - python

I would like to change the date from dd/mm/yyyy to MMM dd/yy.
ie 15/04/2021 to APR 15/21
My Date column format is object
I am doing this:
df['Date'] = pd.to_datetime(df['Date'], format='%MMM %DD/%YY')
but I am getting ValueError: time data '15/04/2021' does not match format '%MMM %dd/%yy' (match)
Any help would be appreciated.

>>> import pandas as pd
>>> df = pd.DataFrame({'Date':['15/04/2021', '12/05/2021', '4/6/2021']})
>>> df
Date
0 15/04/2021
1 12/05/2021
2 4/6/2021
>>> df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')  # day first; without a format, 12/05/2021 is read as December 5
>>> df
Date
0 2021-04-15
1 2021-05-12
2 2021-06-04
>>> df['date_formated'] = df['Date'].dt.strftime('%b %d/%y').str.upper()
>>> df
Date date_formated
0 2021-04-15 APR 15/21
1 2021-05-12 MAY 12/21
2 2021-06-04 JUN 04/21
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3 non-null datetime64[ns]
1 date_formated 3 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 176.0+ bytes

You need to convert your original Date column to datetime first (pandas reads dates as strings by default). After that, just change the display format.
Try this:
# Convert to datetime (%Y is the four-digit year; %y would only match a two-digit year such as 21)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
# Solution for title: change display to 'MMM dd/yy'
df['Date'] = df['Date'].dt.strftime('%b %d/%y').str.upper()
# Solution for comment (use instead of the line above, since strftime turns the column into strings): change display to 'MMM dd yy'
df['Date'] = df['Date'].dt.strftime('%b %d %y').str.upper()

Try -
df['Date'] = pd.to_datetime(df.Date, dayfirst=True).dt.strftime('%b %d/%y').str.upper()
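As a self-contained check of the approach in these answers, here is a minimal sketch that parses the day-first strings with an explicit format and then renders them as upper-case 'MMM dd/yy'. The column name Date_fmt is only illustrative (it is not from the question), and %b is assumed to produce English month abbreviations, which depends on the locale.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['15/04/2021', '12/05/2021', '4/6/2021']})

# Parse explicitly as day/month/four-digit year so 12/05 is 12 May, not 5 December.
parsed = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# %b = abbreviated month name, %d = zero-padded day, %y = two-digit year.
# Date_fmt is a hypothetical column name used for illustration.
df['Date_fmt'] = parsed.dt.strftime('%b %d/%y').str.upper()

print(df['Date_fmt'].tolist())
```

Keeping the parsed datetimes in their own variable leaves the original column usable for sorting or filtering; the formatted column is plain strings.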

(Only) some dates in date index get interpreted wrong after import from csv

I want to analyse a dataframe in python. I loaded a csv which consists of two columns, one date/time and one mean value.
I loaded the data like this:
df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
The problem is, that some dates seem to get interpreted wrong by python. The csv only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format). But when I load the file into python it also shows dates from November and December 2010 in the plot and df.info:
PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 61628 non-null float64
dtypes: float64(1)
I searched in the csv for values from this time, but couldn't find any. Also, the number of entries in the df.info matches the rows of my csv, so I reckon that some dates must have been interpreted wrong.
The tail of my dataframe after the import looks like this:
Mean
Timestamp
2010-12-09 22:45 186
2010-12-09 23:00 206
2010-12-09 23:15 168
2010-12-09 23:30 150
2010-12-09 23:45 132
I searched for similar problems, but could not find an explanation as to why most of the data is interpreted correctly, but some incorrectly. Any idea?
The assumed need for infer_datetime_format=True gives away that you are not passing the correct format. Have a look at the strftime documentation. You are using:
format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10
But the format required is:
format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010
Apparently, infer_datetime_format isn't able to infer each string correctly, taking days as months and vice versa. Indeed, let's reproduce the error:
Create csv:
import pandas as pd
import numpy as np
data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
'Mean': np.random.randint(0,10,15385)}
df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)
# csv like:
01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3
Load csv incorrectly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M',
infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()
df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB
Load csv correctly:
df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean 15385 non-null int64
dtypes: int64(1)
memory usage: 240.4 KB
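One way to catch this kind of mix-up early is to pass only the explicit format, with no inference, and let pandas raise on a mismatch. A minimal sketch, assuming timestamps shaped like the ones in the question; the two sample values are made up for illustration:

```python
import pandas as pd

stamps = pd.Series(['01.01.2009 00:00', '04.10.2010 23:45'])

# With %y (two-digit year) the strings do not match the format, and
# without inference pandas raises instead of guessing day/month order.
try:
    pd.to_datetime(stamps, format='%d.%m.%y %H:%M')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# With %Y (four-digit year) every string parses as day.month.year.
parsed = pd.to_datetime(stamps, format='%d.%m.%Y %H:%M')

print(mismatch_raised, parsed.max())
```

Comparing parsed.min() and parsed.max() against the range you expect from the file is a cheap sanity check after any date import.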

Pandas convert datetime64 [ns] columns to datetime64 [ns, UTC] for multiple columns at once

I have a dataframe called query_df and some of the columns are in datetime[ns] datatype.
I want to convert all datetime[ns] to datetime[ns, UTC] all at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing
You can use apply() to achieve this; no lambda is needed because extra keyword arguments are passed through to the function. Try this:
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
You have already done the first part, collecting the names of the columns whose dtype is to be converted:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now all you have to do is convert all of those columns at once with pandas' apply(), passing utc=True so the result is timezone-aware:
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime, utc=True)
This converts the selected columns to datetime64[ns, UTC].
EDIT:
Without using apply()
Step 1: Create a dictionary that will map column names (the columns to be changed) to their target dtype:
convert_dict = {}
Step 2: Iterate over the column names you extracted and store each one as a key, with the target dtype as its value:
for col in dt_columns:
    convert_dict[col] = 'datetime64[ns, UTC]'
Step 3: Convert the dtypes by passing the dictionary to astype():
query_df = query_df.astype(convert_dict)
Each column named by a key is cast to the dtype given as that key's value. Note that pandas 2.0 and later refuse to cast a timezone-naive column to a timezone-aware dtype with astype(); there, localize each column with .dt.tz_localize('UTC') instead.
Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) is interpreting dt_columns as year, month, day. Below the example in the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below is a code snippet with a small example. Bear in mind that, depending on your data format or your application, treating the values as UTC might not give you the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nanoseconds, assuming the floats are Unix epoch seconds
# (a plain astype("datetime64[ns]") would read them as nanoseconds since 1970)
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].apply(pd.to_datetime, unit="s")
query_df.info()
# convert to datetime with UTC (tz_localize also works on pandas 2.0+, where astype to a tz-aware dtype is rejected)
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].apply(lambda s: s.dt.tz_localize("UTC"))
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 bytes
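Putting the answers above together, here is a hedged sketch of converting every timezone-naive datetime column in one pass. It uses select_dtypes instead of the list comprehension from the question and tz_localize instead of astype; the sample frame is made up for illustration.

```python
import pandas as pd

# Made-up frame with two timezone-naive datetime columns and one integer column.
query_df = pd.DataFrame({
    'ts1': pd.to_datetime(['2021-05-27 06:54:07', '2021-05-27 06:54:08']),
    'ts2': pd.to_datetime(['2021-05-27 06:53:47', '2021-05-27 06:53:48']),
    'a': [1, 2],
})

# 'datetime' selects timezone-naive datetime columns only.
dt_columns = query_df.select_dtypes(include=['datetime']).columns.tolist()

# Attach UTC to each naive column; the wall-clock values are unchanged.
for col in dt_columns:
    query_df[col] = query_df[col].dt.tz_localize('UTC')

print(query_df.dtypes)
```

tz_localize only labels the existing values as UTC. If the naive values were actually recorded in another zone, localize to that zone first and then call tz_convert('UTC').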

How to remove rows in pandas of type datetime64[ns] by date?

I'm pretty new to this and have started using Python for my project.
I have a dataset whose first column has type datetime64[ns]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5889 entries, 0 to 5888
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5889 non-null datetime64[ns]
1 title 5889 non-null object
2 stock 5889 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 138.1+ KB
and
type(BA['date'])
gives
pandas.core.series.Series
The date has the format 2020-06-10.
I need to delete all rows before a specific date, for example 2015-09-09.
What I tried:
convert to string. Failed
Create conditions using:
.df.year <= %y & .df.month <= %m
<= ('%y-%m-%d')
create data with datetime() method
create variable with datetime64 format
just copy with .loc() and .copy()
All of this failed. I got all kinds of errors: it's not int, it's not datetime, datetime is mutable, not this, not that.
I can't believe how counterintuitive this pandas format is. For the first time, writing a CSV parser in C++ seems easier than using a ready-made library in Python.
Thank you for understanding
Toy Example
df = pd.DataFrame({'date':['2021-1-1', '2020-12-6', '2019-02-01', '2020-02-01']})
df.date = pd.to_datetime(df.date)
df
Input df
date
0 2021-01-01
1 2020-12-06
2 2019-02-01
3 2020-02-01
Delete rows before 2020-01-01.
We select the rows dated on or after 2020-01-01 and drop the older ones.
df.loc[df.date >= '2020-01-01']
Output
date
0 2021-01-01
1 2020-12-06
3 2020-02-01
If we want the index to be reset
df = df.loc[df.date >= '2020-01-01'].reset_index(drop=True)
df
Output
date
0 2021-01-01
1 2020-12-06
2 2020-02-01
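The same idea as a self-contained sketch, using an explicit pd.Timestamp for the cutoff so that no string parsing is left to the comparison. The names cutoff and kept are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2021-01-01', '2020-12-06', '2019-02-01', '2020-02-01'])})

# Hypothetical cutoff; rows strictly before it are dropped.
cutoff = pd.Timestamp('2020-01-01')

# Keep rows on or after the cutoff, then renumber the index from 0.
kept = df[df['date'] >= cutoff].reset_index(drop=True)

print(kept)
```

The boolean mask df['date'] >= cutoff is an ordinary Series of True/False values, so it can be combined with other conditions using & and |, with each condition in parentheses.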

converting object to datetime without time

I have dataframe that looks like below:
Date Region Data
0 200201 A 8.8
1 200201 B 14.3
...
1545 202005 C 7.3
1546 202005 D 131
I wanted to convert the Date column(data type: object) to DateTime index without time. yyyymm or yyyymmdd or yyyy-mm-dd all of these don't matter as long as I can erase the time part.
I've searched stackoverflow and tried these codes
# (1)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m", errors = "coerce", utc = False)
# (2)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m")
df["Date"] = df["Date"].dt.normalize()
# (3)
df["Date"] = pd.to_datetime(df["Date"], format = "%Y%m")
df["Date"] = df["Date"].dt.date
For (1) and (2), I get ["Date"] with time like yyyy-mm-dd 00:00:00.
For (3), I do get ["Date"] as yyyymm but the dtype is object.
I can't use date range because same date is repeated for some time.
Will there be any way to convert yyyymm[object] to yyyymmdd[datetime] in python?
Thanks in advance.
It could be a display configuration issue on how your DataFrames are showing in your editor. The simplest way to get the data in the right format is:
df['Date'] = pd.to_datetime(df['Date'], format = '%Y%m')
Below are the results from repl.it with your DataFrame and this code. The date is properly formatted without the time component, and it has the proper dtype.
Date Region Data
0 2002-01-01 A 8.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1 non-null datetime64[ns]
1 Region 1 non-null object
2 Data 1 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 152.0+ bytes
You can also try a more convoluted way of going from datetime to date objects and back to datetime.
df['Date'] = pd.to_datetime(df['Date'], format = '%Y%m').dt.date
df['Date'] = df['Date'].astype('datetime64[ns]')
The final display and dtypes are the same.
Date Region Data
0 2002-01-01 A 8.8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1 non-null datetime64[ns]
1 Region 1 non-null object
2 Data 1 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 152.0+ bytes
The Date column in the question has the format YYYYMM (but no days). The function pd.to_datetime() implicitly sets the day to 1.
The function pd.Period() converts dates in the format YYYYMM to pandas periods. Note that df['Date'] can be strings or 6-digit integers.
df['Date'].apply(lambda x: pd.Period(x, freq='M'))
0 2002-01
1 2002-01
2 2020-05
3 2020-05
Name: Date, dtype: period[M]
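If you would rather avoid the per-row apply, a vectorized variant is to parse with to_datetime and then reduce the result to month resolution. This is only a sketch; the Month column name and the three sample rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['200201', '200201', '202005'],
                   'Region': ['A', 'B', 'C']})

# Parse YYYYMM (the day defaults to 1), then drop to month resolution.
# Month is a hypothetical column name.
df['Month'] = pd.to_datetime(df['Date'], format='%Y%m').dt.to_period('M')

print(df['Month'])
```

A period[M] column carries no day or time at all, which may be closer to what the question asks for than a datetime64 column that merely hides 00:00:00 when displayed.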

How do I convert strings in a Pandas data frame to a 'date' data type?

I have a Pandas data frame, one of the column contains date strings in the format YYYY-MM-DD
For e.g. '2013-10-28'
At the moment the dtype of the column is object.
How do I convert the column values to Pandas date format?
Essentially equivalent to @waitingkuo, but I would use pd.to_datetime here (it seems a little cleaner, and offers some additional functionality e.g. dayfirst):
In [11]: df
Out[11]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [12]: pd.to_datetime(df['time'])
Out[12]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]
In [13]: df['time'] = pd.to_datetime(df['time'])
In [14]: df
Out[14]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
Handling ValueErrors
If you run into a situation where doing
df['time'] = pd.to_datetime(df['time'])
Throws a
ValueError: Unknown string format
That means you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:
df['time'] = pd.to_datetime(df['time'], errors='coerce')
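Before silently coercing, it can help to look at which values failed. A minimal sketch with made-up data; the names raw, parsed and bad_rows are illustrative:

```python
import pandas as pd

raw = pd.Series(['2013-10-28', 'not a date', '2013-10-30'])

# Unparseable entries become NaT instead of raising ValueError.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Inspect what was lost before deciding to drop or repair it.
bad_rows = raw[parsed.isna()]

print(bad_rows.tolist())
```

If the original column can also contain genuine missing values, compare against raw.notna() as well so that they are not counted as parse failures.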
Use astype
In [31]: df
Out[31]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [32]: df['time'] = df['time'].astype('datetime64[ns]')
In [33]: df
Out[33]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:
dfcsv = pd.read_csv('xyz.csv', parse_dates=[0])
Here the 0 refers to the column the date is in.
You could also add index_col=0 in there if you want the date to be your index.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
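A small runnable sketch of that read, using an in-memory string in place of a file on disk. The CSV contents and column names are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up.
csv_text = "date,value\n2013-10-28,1\n2013-10-29,2\n"

# Column 0 is parsed as dates and also used as the index.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0)

print(df.index.dtype)
```

With a DatetimeIndex in place, label-based date slicing such as df.loc['2013-10-28'] becomes available.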
Now you can do df['column'].dt.date
Note that for datetime objects, if you don't see the time when every value is 00:00:00, that isn't pandas changing the data. It is the IPython notebook trying to make things look pretty.
If you want to get the DATE and not DATETIME format:
df["id_date"] = pd.to_datetime(df["id_date"]).dt.date
Another way to do this, which works well if you have multiple columns to convert to datetime:
cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)
It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.
#set an index by dates
df.set_index(['time'], drop=True, inplace=True)
After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
#Convert to monthly dates
df.index = df.index.to_period(freq='M')
#Convert to strings
df.index = df.index.strftime('%Y-%m')
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
For brevity, I don't show that I run the following code after each line above:
print(df.index)
print(df.index.dtype)
print(type(df.index))
This gives me the following output:
Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>
Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
For the sake of completeness, here is another option, which might not be the most straightforward one. It is a bit similar to the one proposed by @SSS, but uses the datetime library instead:
import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').date())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null object
1 endDay 110526 non-null object
import pandas as pd
df['startDay'] = pd.to_datetime(df.startDay)
df['endDay'] = pd.to_datetime(df.endDay)
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null datetime64[ns]
1 endDay 110526 non-null datetime64[ns]
Try converting one of the values to a timestamp with the pd.to_datetime function, and then use .map to apply the same conversion to the entire column.
