Pandas Converting date string (only month and year) to datetime - python

I am trying to convert a date string to datetime. In the original dataframe the column's data type is string, and the dataset has shape (28000000, 26). Importantly, the date format is MMYYYY only. Here's a data sample:
DATE
Out[3] 0 081972
1 051967
2 101964
3 041975
4 071976
I tried:
df['DATE'].apply(pd.to_datetime(format='%m%Y'))
and
pd.to_datetime(df['DATE'],format='%m%Y')
I got a runtime error both times.
Then
df['DATE'].apply(pd.to_datetime)
This worked for the other columns not shown here (which use the DDMMYYYY format), but it produced future dates for df['DATE'] because it reads the values as MMDDYY instead of MMYYYY:
DATE
0 1972-08-19
1 2067-05-19
2 2064-10-19
3 1975-04-19
4 1976-07-19
Expected output:
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
If this question is a duplicate please direct me to the original one, I wasn't able to find any suitable answer.
Thank you all in advance for your help

First, if an error is raised, some values evidently do not match the format. You can find them with the errors='coerce' parameter and Series.isna, because values that do not match are returned as missing values (NaT):
print (df)
DATE
0 81972
1 51967
2 101964
3 41975
4 171976 <-changed data
print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
0 1972-08-01
1 1967-05-01
2 1964-10-01
3 1975-04-01
4 NaT
Name: DATE, dtype: datetime64[ns]
print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
DATE
4 171976
Solution with output from the changed data: convert to datetimes, then to monthly periods with Series.dt.to_period:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('M')
print (df)
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 NaT
Solution with original data:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('M')
print (df)
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
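The printed sample above shows values such as 81972, which suggests the column may have lost its leading zeros (for example when read as integers). A minimal sketch, assuming that is the case, pads the values back to six characters before parsing. The sample Series below is hypothetical:

```python
import pandas as pd

# Hypothetical sample: MMYYYY values, some with the leading zero missing
raw = pd.Series(["081972", "51967", "101964", "41975", "171976"], name="DATE")

# Left-pad to six characters so the month always occupies two digits
padded = raw.astype(str).str.zfill(6)

# Values with an invalid month (such as 17) become NaT instead of raising
parsed = pd.to_datetime(padded, format="%m%Y", errors="coerce")

# Keep only year and month
periods = parsed.dt.to_period("M")
print(periods)
```

Padding first avoids relying on how the parser splits an odd-length string between month and year.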

I would have done:
df['date_formatted'] = pd.to_datetime(
dict(
year=df['DATE'].str[2:],
month=df['DATE'].str[:2],
day=1
)
)
Maybe this helps. Works for your sample data.

Related

replacing/re-assign pandas value with new value

I want to replace the current values in my column with corrected ones. My current data:
20000123
19850123
19880112
19951201
19850123
20190821
20000512
19850111
19670133
19850123
As you can see, one value is 19670133 (YYYYMMDD), a date that does not exist because no month has 33 days. So I want to reassign it to the end of the month. I managed to compute the end-of-month date, and that part works.
But replacing the old value with the new one is where I run into a problem.
What I've tried is this:
for x in df_tmp_customer['date']:
    try:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x), axis=1)
    except Exception:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x[0:6]+"01")+ pd.offsets.MonthEnd(n=0), axis=1)
This is the part that computes the end of the month:
pd.to_datetime(x[0:6]+"01")+ pd.offsets.MonthEnd(n=0)
Probably not efficient on a large dataset, but this can be done using pendulum.parse():
import pendulum
def parse_dates(x: str):
    # Decrement the YYYYMMDD integer until it parses as a valid date
    i = 0
    while True:
        try:
            return pendulum.parse(str(int(x) - i)).date()
        except ValueError:
            i += 1
df["date"] = df["date"].apply(parse_dates)
print(df)
date
0 2000-01-23
1 1985-01-23
2 1988-01-12
3 1995-12-01
4 1985-01-23
5 2019-08-21
6 2000-05-12
7 1985-01-11
8 1967-01-31
9 1985-01-23
For a vectorized solution, you can use:
# try to convert to YYYYMMDD
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
# get rows for which conversion failed
m = date1.isna()
# try to get end of month
date2 = pd.to_datetime(df.loc[m, 'date'].str[:6], format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())
# Combine both
df['date2'] = date1.fillna(date2)
NB. This assumes df['date'] is of string dtype. If it is of integer dtype instead, use df.loc[m, 'date'].floordiv(100) in place of df.loc[m, 'date'].str[:6].
Output:
date date2
0 20000123 2000-01-23
1 19850123 1985-01-23
2 19880112 1988-01-12
3 19951201 1995-12-01
4 19850123 1985-01-23
5 20190821 2019-08-21
6 20000512 2000-05-12
7 19850111 1985-01-11
8 19670133 1967-01-31 # invalid replaced by end of month
9 19850123 1985-01-23
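For the integer-dtype case mentioned in the NB, a self-contained sketch might look like this. The three sample values are taken from the question. MonthEnd(0) is used here in place of MonthEnd(); for a date on the first of a month both land on the end of that same month:

```python
import pandas as pd

# Integer YYYYMMDD values; 19670133 is not a real date
df = pd.DataFrame({"date": [20000123, 19670133, 20190821]})

# First pass: strict YYYYMMDD parse, invalid dates become NaT
date1 = pd.to_datetime(df["date"].astype(str), format="%Y%m%d", errors="coerce")

# Rows where the strict parse failed
m = date1.isna()

# Integer division by 100 drops the two day digits, leaving YYYYMM
yyyymm = df.loc[m, "date"].floordiv(100).astype(str)

# Parse as the first of the month, then roll forward to the month end
date2 = pd.to_datetime(yyyymm, format="%Y%m", errors="coerce") + pd.offsets.MonthEnd(0)

# fillna aligns on the index, so only the failed rows are filled
df["date2"] = date1.fillna(date2)
print(df)
```

A value whose month is also invalid would fail both passes and stay NaT.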

Date and Time Format Conversion in Pandas, Python

Initially, my dataframe had a Month column containing numbers representing the months.
Month
1
2
3
4
I typed df["Month"] = pd.to_datetime(df["Month"]) and I get this...
Month
1970-01-01 00:00:00.000000001
1970-01-01 00:00:00.000000002
1970-01-01 00:00:00.000000003
1970-01-01 00:00:00.000000004
I would like to retain just the dates and not the time. Any solutions?
Get the date from the column using df['Month'].dt.date.
Use format='%m' in to_datetime:
df["Month"] = pd.to_datetime(df["Month"], format='%m')
print (df)
Month
0 1900-01-01
1 1900-02-01
2 1900-03-01
3 1900-04-01
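To illustrate why the first attempt produced timestamps in 1970, here is a small sketch comparing the two calls. Without a format, integers are interpreted as nanoseconds since the Unix epoch. The last step attaches a year, and 2020 is purely an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Month": [1, 2, 3, 4]})

# No format: each integer is read as nanoseconds since 1970-01-01
as_epoch = pd.to_datetime(df["Month"])

# format="%m": each value is read as a month number; the year defaults to 1900
as_month = pd.to_datetime(df["Month"].astype(str), format="%m")

# Hypothetical: build real dates by supplying a year (2020 is an assumption)
parts = pd.DataFrame({"year": 2020, "month": df["Month"], "day": 1})
with_year = pd.to_datetime(parts)

print(as_epoch, as_month, with_year, sep="\n")
```

If the year is not known, keeping the column as plain integers may be simpler than storing dates in the year 1900.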

parse multiple date format pandas

I've got stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in the format %Y-%m.
What I tried was:
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps.
Assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
First, we parse the regular dates into a new series called s:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, the series has two NaT entries, which are null datetime values; these correspond to the dates that are missing a day.
We then apply the same method again with the other format, and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
Then we reassign the result to your dataframe.
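The same step-by-step idea can be wrapped in a small helper that tries a list of formats in order. This is only a sketch; the function name parse_with_formats is made up for illustration:

```python
import pandas as pd


def parse_with_formats(values: pd.Series, formats: list) -> pd.Series:
    """Try each format in turn; later formats only fill values still missing."""
    result = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        result = result.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return result


# Sample values from the question
dates = pd.Series(["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"])
fixed = parse_with_formats(dates, ["%Y-%m-%d", "%Y%m"])
print(fixed)
```

The order of the formats matters: a value that matches an earlier format is never reparsed with a later one.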
You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard(r"\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only the year and month were supplied.
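Since the question asks for a %Y-%m result, a sketch that builds that string directly from the regex groups, without going through datetime, could look like this. The Series name dates and the slightly simplified pattern are assumptions for illustration:

```python
import pandas as pd

# Sample values from the question
dates = pd.Series(["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"])

# Four-digit year, optional hyphen, then a one- or two-digit month
parts = dates.str.extract(r"(?P<year>\d{4})-?(?P<month>\d{1,2})")

# Zero-pad the month so every result has the form YYYY-MM
year_month = parts["year"] + "-" + parts["month"].str.zfill(2)
print(year_month)
```

The result is a column of strings, so it sorts correctly but does not support datetime arithmetic.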

Error converting data type float to datetime format

I would like to convert the data type float below to datetime format:
df
Date
0 NaN
1 NaN
2 201708.0
4 201709.0
5 201700.0
6 201600.0
Name: Cred_Act_LstPostDt_U324123, dtype: float64
pd.to_datetime(df['Date'],format='%Y%m.0')
ValueError: time data 201700.0 does not match format '%Y%m.0' (match)
How could I transform the rows without month information so that they default to yyyy01?
You can use str.replace on each value to clean up your month data:
s = [x.replace('00.0', '01.0') for x in df['Date'].astype(str)]
df['Date'] = pd.to_datetime(s, format='%Y%m.0', errors='coerce')
print(df)
Date
0 NaT
1 NaT
2 2017-08-01
4 2017-09-01
5 2017-01-01
6 2016-01-01
Create a string from the float using .astype(str), then take the first four characters (the year) and the next two (the month) and join them with a hyphen using str.cat. Then you can use format='%Y-%m'.
However, this still cannot parse an invalid month number such as 00; with errors='coerce' those rows (and the NaN rows) become NaT.
string = df['Date'].astype(str)
date = string.str[:4].str.cat(string.str[4:6], sep='-')
pd.to_datetime(date, format='%Y-%m', errors='coerce')
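As an alternative to string manipulation, the YYYYMM floats can be split arithmetically, which makes the "month 00 defaults to January" rule explicit. This is a sketch under the assumption that every non-missing value is a whole number of the form YYYYMM:

```python
import numpy as np
import pandas as pd

# Sample values from the question
df = pd.DataFrame({"Date": [np.nan, 201708.0, 201709.0, 201700.0, 201600.0]})

# Work only on the non-missing rows, as integers
valid = df["Date"].dropna().astype("int64")

year = valid // 100
month = (valid % 100).clip(lower=1)  # month 00 becomes 01

# Rebuild a clean YYYYMM string and parse it
stamp = pd.to_datetime((year * 100 + month).astype(str), format="%Y%m")

# Assignment aligns on the index, so the NaN rows become NaT
df["parsed"] = stamp
print(df)
```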

create date object from 2 columns

I have 2 columns:
dt_year, dt_month
2014 3
I need a date column.
I tried something like:
pd.to_datetime((df.dt_year + df.dt_month +1).apply(str),format='%Y%m%d')
But I get an error:
ValueError: time data '2014' does not match format '%Y%m%d' (match)
Any ideas?
First, change the column names to something more conventional, then add a 'day' column.
df.columns = df.columns.str.replace('dt_', '')
df['day'] = 1
df
year month day
0 2014 3 1
Then pd.to_datetime can assemble the dates directly from the year, month and day columns:
pd.to_datetime(df)
0 2014-03-01
dtype: datetime64[ns]
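The attempt in the question appears to add the numeric columns together rather than concatenate them as strings. A sketch of a string-based variant, which keeps the original column names, might look like this. The second row is made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({"dt_year": [2014, 2015], "dt_month": [3, 11]})

# Concatenate as strings (not integer addition); zero-pad the month to two digits
stamp = (
    df["dt_year"].astype(str)
    + df["dt_month"].astype(str).str.zfill(2)
    + "01"
)

df["date"] = pd.to_datetime(stamp, format="%Y%m%d")
print(df)
```

Zero-padding the month matters here: without it, a string such as 2014301 would be ambiguous for the %Y%m%d format.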
