Python - Pandas - Convert YYYYMM to datetime - python

Beginner python (and therefore pandas) user. I am trying to import some data into a pandas dataframe. One of the columns is the date, but in the format "YYYYMM". I have attempted to do what most forum responses suggest:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
This doesn't work though (ValueError: unconverted data remains: 3). The column actually includes an additional value for each year, with MM=13. The source used this row as an average of the past year. I am guessing to_datetime is having an issue with that.
Could anyone offer a quick solution, either to strip out all of the annual averages (those with the last two digits "13"), or to have to_datetime ignore them?

Pass errors='coerce', then drop the rows that end up as NaT:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce')
df_cons = df_cons.dropna(subset=['YYYYMM'])
The duff month values get converted to NaT:
In[36]:
pd.to_datetime('201613', format='%Y%m', errors='coerce')
Out[36]: NaT
Alternatively, you could filter them out before the conversion:
df_cons['YYYYMM'] = pd.to_datetime(df_cons.loc[df_cons['YYYYMM'].str[-2:] != '13','YYYYMM'], format='%Y%m', errors='coerce')
However, this can lead to alignment issues: the filtered Series is shorter than the column it is assigned to, so the excluded rows still end up as NaT. Just passing errors='coerce' is the simpler solution.
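A minimal sketch of the coerce-then-drop approach described above. The sample rows and the value column are made up for illustration; only the YYYYMM column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data: month 13 marks the annual-average rows.
df_cons = pd.DataFrame({
    "YYYYMM": ["201611", "201612", "201613", "201701"],
    "value": [1.0, 2.0, 1.5, 3.0],
})

# Month 13 cannot be parsed with %m, so those entries become NaT.
df_cons["YYYYMM"] = pd.to_datetime(
    df_cons["YYYYMM"], format="%Y%m", errors="coerce"
)

# Drop the rows whose date failed to parse, keeping the rest of the frame intact.
df_cons = df_cons.dropna(subset=["YYYYMM"]).reset_index(drop=True)

print(df_cons)
```

Dropping on the whole frame with dropna(subset=...) avoids the alignment problem that comes from assigning a shortened Series back to a column.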

Clean up the dataframe first.
df_cons = df_cons[~df_cons['YYYYMM'].str.endswith('13')].copy()
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
May I suggest turning the column into a period index if the YYYYMM column is unique in your dataset.
First turn YYYYMM into index, then convert it to monthly period.
df_cons = df_cons.reset_index().set_index('YYYYMM').to_period('M')
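As a rough illustration of the period-index suggestion, assuming the annual-average rows are already gone and using made-up sample values:

```python
import pandas as pd

# Hypothetical sample data with the month-13 rows already removed.
df_cons = pd.DataFrame({
    "YYYYMM": ["201611", "201612", "201701"],
    "value": [1.0, 2.0, 3.0],
})
df_cons["YYYYMM"] = pd.to_datetime(df_cons["YYYYMM"], format="%Y%m")

# Move the dates into the index, then convert that index to monthly periods.
monthly = df_cons.set_index("YYYYMM").to_period("M")

print(monthly.index)
```

A PeriodIndex represents each row as a whole month rather than as the first instant of that month, which may suit monthly data better.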

Related

Converting column names to date within a pandas dataframe - warning of deprecated date comparison

I have a pandas dataframe where only a few columns are dates.
Example of a dataframe (dates here for the sake of example are str but in my case they are an object):
df = pd.DataFrame({
    "activity": ["clean dishes", "fix porch", "sleep on couch"],
    "finished": ["NaT", "NaT", "2022-12-29"],
    "2022-12-27 00:00:00": [1,1,1],
    "2022-12-28 00:00:00": [1,1,1],
    "2022-12-29 00:00:00": [1,1,0]
})
print(df.columns)
Index(['activity', 'finished', 2022-12-27 00:00:00, 2022-12-28 00:00:00, 2022-12-29 00:00:00], dtype='object')
I want to convert the last three column names to date (don't want the timestamp included) so that I can compare the dates in the finished column with the different column names and place a zero where activity is finished before.
I tried using this approach, but it did not work (including the suggestion in the comments).
To achieve my goal I created this:
from datetime import datetime
import pandas as pd
def format_header_dates(dataframe):
    """Converting the dates in the header to date"""
    for column in dataframe.columns:
        if isinstance(column, pd.Timestamp):
            new_column = pd.Timestamp(column).date()
            dataframe = dataframe.rename(columns={column: new_column})
    return dataframe
df = format_header_dates(df)
However I get this warning:
FutureWarning: Comparison of Timestamp with datetime.date is deprecated in order to match the standard library behavior. In a future version these will be considered non-comparable. Use 'ts == pd.Timestamp(date)' or 'ts.date() == date' instead.
return key in self._engine
This leaves me with two questions:
Is there a better way to convert a subset of column names to date?
What exactly is causing this warning (isinstance?) and how can I make the necessary corrections?
Solution:
After spending two days scratching my head and googling, I could not pinpoint the root cause of the FutureWarning but got my way around it.
Step 1: Convert every date to datetime64[ns] and normalize it (to set h:m:s:ns to zero as I have no interest in such precision) with the following: pd.to_datetime(column).normalize().to_datetime64()
Step 2: Do whatever operations I wanted to, which in my case required comparing dates.
Step 3: Cosmetically adjust the dates by keeping only the date component with: pd.to_datetime(column).to_datetime64().astype('datetime64[D]')
This allowed me to do any date operations I wanted and no longer displayed the FutureWarning: Comparison of Timestamp with datetime.date is deprecated...
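The three steps could look roughly like the sketch below. It assumes the date headers are already pd.Timestamp objects, uses made-up sample data, and swaps in strftime for the cosmetic step, so it is one possible reading of the workaround rather than the exact code used.

```python
import pandas as pd

# Hypothetical data: the day columns are labelled with Timestamps.
df = pd.DataFrame({
    "activity": ["clean dishes", "fix porch", "sleep on couch"],
    "finished": pd.to_datetime([None, None, "2022-12-28"]),
    pd.Timestamp("2022-12-27 00:00:00"): [1, 1, 1],
    pd.Timestamp("2022-12-28 00:00:00"): [1, 1, 1],
    pd.Timestamp("2022-12-29 00:00:00"): [1, 1, 1],
})

# Step 1: keep the labels as Timestamps, normalised to midnight.
day_cols = [c for c in df.columns if isinstance(c, pd.Timestamp)]
df = df.rename(columns={c: c.normalize() for c in day_cols})
day_cols = [c.normalize() for c in day_cols]

# Step 2: compare Timestamp with Timestamp; rows with NaT never match.
for col in day_cols:
    df.loc[df["finished"] < col, col] = 0

# Step 3: cosmetic only, turn the labels into plain date strings for display.
display_df = df.rename(columns={c: c.strftime("%Y-%m-%d") for c in day_cols})

print(display_df)
```

Because the labels stay Timestamps until the final display step, no lookup ever compares a Timestamp with a datetime.date.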

Split date column into YYYY.MM.DD

I have a dataframe column in the format of 20180531.
I need to split this properly i.e. I can get 2018/05/31.
This is a dataframe column that I have and I need to deal with it in a datetime format.
Currently this column is identified as int64 type
I'm not sure how efficient it'll be, but you can convert it to a string and then use pd.to_datetime with a format=..., e.g.:
df['actual_datetime'] = pd.to_datetime(df['your_column'].astype(str), format='%Y%m%d')
As Emma points out - the astype(str) is redundant here and just:
df['actual_datetime'] = pd.to_datetime(df['your_column'], format='%Y%m%d')
will work fine.
Assuming the integer dates would always be fixed width at 8 digits, you may try:
df['dt'] = df['dt_int'].astype(str).str.replace(r'(\d{4})(\d{2})(\d{2})', r'\1-\2-\3', regex=True)
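Putting the two ideas together, here is a small sketch that parses the integers into real datetimes and then renders them as 2018/05/31-style text. The column names dt_int, dt and dt_text and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical int64 column holding dates as YYYYMMDD.
df = pd.DataFrame({"dt_int": [20180531, 20190101]})

# Parse into datetime64; astype(str) is kept here so the input to the
# parser is unambiguous.
df["dt"] = pd.to_datetime(df["dt_int"].astype(str), format="%Y%m%d")

# Render as text only where a formatted string is actually needed.
df["dt_text"] = df["dt"].dt.strftime("%Y/%m/%d")

print(df)
```

Keeping the datetime column and formatting on demand leaves date arithmetic and filtering available, which a string column does not offer.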

pandas to_datetime but replace with fixed value when fail/coerce, preserve 'meaningful' NaNs

When using pd.to_datetime on my data frame I get this error:
Out of bounds nanosecond timestamp: 30-04-18 00:00:00
Now from looking on StackO I know I can simply use the coerce option:
pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
But I was wondering if anyone had an idea on how I might replace these values with a fixed value? Say 1900-01-01 00:00:00 (or maybe 1955-11-12 for anyone who gets the reference!)
Reason being that this data frame is part of a process that handles thousands and thousands of JSONs per day. I want to be able to see in the dataset easily the incorrect ones by filtering for said fixed date.
It is just as invalid for the JSON to contain any date before 2010 so using an earlier date is fine and it is also perfectly acceptable to have a blank (NA) date value so I can't rely on just blanking the data.
Replace missing values by some default datetime value in Series.mask only for missing values generated by to_datetime with errors='coerce':
import numpy as np
import pandas as pd
df = pd.DataFrame({"date": [np.nan, '20180101', '20-20-0']})
t = pd.to_datetime('1900-01-01')
date = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = date.mask(date.isna() & df['date'].notna(), t)
print(df)
        date
0        NaT
1 2018-01-01
2 1900-01-01
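If the same replacement is needed for several columns, the mask logic can be wrapped in a small helper. The function name to_datetime_with_sentinel and the extra boolean flag it returns are assumptions made for this sketch, not part of the answer above.

```python
import pandas as pd


def to_datetime_with_sentinel(raw, fmt, sentinel):
    """Parse raw with fmt; replace parse failures (not real blanks) by sentinel."""
    parsed = pd.to_datetime(raw, format=fmt, errors="coerce")
    # A failure is a value that was present in the input but came out as NaT.
    failed = parsed.isna() & raw.notna()
    return parsed.mask(failed, sentinel), failed


# Hypothetical input: one genuine blank, one valid date, one malformed date.
raw = pd.Series([None, "20180101", "20-20-0"], dtype="object")
dates, failed = to_datetime_with_sentinel(
    raw, "%Y%m%d", pd.Timestamp("1900-01-01")
)

print(dates)
print(failed)
```

Returning the failed flag alongside the dates gives a second way to filter the bad records without relying only on the sentinel date.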

Converting objects from CSV into datetime

I've got an imported csv file which has multiple columns with dates in the format "5 Jan 2001 10:20". (Note not zero-padded day)
If I do df.dtypes then it shows the columns as being objects rather than strings or datetimes. I need to be able to subtract two column values to work out the difference, so I'm trying to get them into a state where I can do that.
At the moment if I try the test subtraction at the end I get the error unsupported operand type(s) for -: 'str' and 'str'.
I've tried multiple methods but have run into a problem every way I've tried.
Any help would be appreciated. If I need to give any more information then I will.
As suggested by @MaxU, you can use the pd.to_datetime() method to bring the values of the given column to the 'appropriate' format, like this:
df['datetime'] = pd.to_datetime(df.datetime)
You would have to do this on whatever columns you need transformed to the right dtype.
Alternatively, you can use parse_dates argument of pd.read_csv() method, like this:
df = pd.read_csv(path, parse_dates=[1,2,3])
where columns 1,2,3 are expected to contain data that can be interpreted as dates.
I hope this helps.
Convert a column to datetime using this approach:
df["Date"] = pd.to_datetime(df["Date"])
If the column has empty or unparseable values, pass errors='coerce' so that they become NaT instead of raising an error:
df["Date"] = pd.to_datetime(df["Date"], errors='coerce')
After which you should be able to subtract two dates.
Example:
import pandas
df = pandas.DataFrame({
    'to': [pandas.Timestamp('2014-01-24 13:03:12.050000'), pandas.Timestamp('2014-01-27 11:57:18.240000'), pandas.Timestamp('2014-01-23 10:07:47.660000')],
    'fr': [pandas.Timestamp('2014-01-26 23:41:21.870000'), pandas.Timestamp('2014-01-27 15:38:22.540000'), pandas.Timestamp('2014-01-23 18:50:41.420000')],
})
# Difference in fractional hours; astype('timedelta64[h]') is rejected by pandas 2.x
(df['fr'] - df['to']) / pandas.Timedelta(hours=1)
consult this answer for more details:
Calculate Pandas DataFrame Time Difference Between Two Columns in Hours and Minutes
If you want to directly load the column as datetime object while reading from csv, consider this example :
Pandas read csv dateint columns to datetime
I found that the problem was to do with missing values within the column. Using errors='coerce', i.e. df["Date"] = pd.to_datetime(df["Date"], errors='coerce'), solves the problem (older pandas versions spelled this coerce=True).
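For the "5 Jan 2001 10:20" layout from the question, a sketch with an explicit format might look like this. The column names start and end and the sample rows are assumptions, and %b expects English month abbreviations under the default locale.

```python
import pandas as pd

# Hypothetical CSV content after loading: dates as text, one value missing.
df = pd.DataFrame({
    "start": ["5 Jan 2001 10:20", "12 Feb 2001 08:00", ""],
    "end": ["5 Jan 2001 12:50", "13 Feb 2001 08:30", "1 Mar 2001 09:00"],
})

# Day (padded or not), abbreviated month name, 4-digit year, 24-hour time.
fmt = "%d %b %Y %H:%M"
for col in ["start", "end"]:
    df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")

# Subtracting two datetime columns gives a timedelta column.
df["elapsed"] = df["end"] - df["start"]

# Dividing by a one-hour Timedelta converts it to fractional hours.
df["elapsed_hours"] = df["elapsed"] / pd.Timedelta(hours=1)

print(df)
```

Rows where either date is missing end up with NaT in elapsed and NaN in elapsed_hours, so they can be filtered afterwards.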

Python cleaning dates for conversion to year only in Pandas

I have a large data set that some users entered in a CSV. I converted the CSV into a dataframe with pandas. The column has over 1000 entries; here is a sample:
datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013
Then I tried converting the dates into years using
df['year']=df['datestart'].astype('timedelta64[Y]')
But it gave me an error:
ValueError: Value cannot be converted into object Numpy Time delta
Using Datetime64
df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')
it gave:
"ValueError: Error parsing datetime string ""03/13/2014"" at position 2"
Since that column was filled in by users, the majority was in this format MM/DD/YYYY but some data was put in like this: Feb 10 2013 and there was one entry like this 00/00/0000. I am guessing the different formats screwed up the processing.
Is there a try loop, if statement, or something that I can skip over problems like these?
If date time fails I will be force to use a str.extract script which also works:
year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")
del df['month'], df['day']
and use concat to take the year out.
With df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]') The error message is:
Message File Name Line Position
Traceback
<module> C:\Users\0\Desktop\python\Example.py 23
astype C:\Python33\lib\site-packages\pandas\core\generic.py 2062
astype C:\Python33\lib\site-packages\pandas\core\internals.py 2491
apply C:\Python33\lib\site-packages\pandas\core\internals.py 3728
astype C:\Python33\lib\site-packages\pandas\core\internals.py 1746
_astype C:\Python33\lib\site-packages\pandas\core\internals.py 470
_astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]
You first have to convert the column with the date values to datetimes with to_datetime():
df['datestart'] = pd.to_datetime(df['datestart'], errors='coerce')
This should normally parse the different formats flexibly (errors='coerce', spelled coerce=True in older pandas versions, is important here to convert invalid dates to NaT). In pandas 2.0 and later, a column that mixes formats may also need format='mixed'.
If you then want the year part of the dates, you can do the following (seems doing astype directly on the pandas column gives an error, but with values you can get the underlying numpy array):
df['datestart'].values.astype('datetime64[Y]')
The problem with this is that it again gives an error when assigning this to a column due to the NaT value (this seems to be a bug; you can work around it with df = df.dropna()). Also, when you assign this to a column, it gets converted back to datetime64[ns], as this is the way pandas stores datetimes. So I personally think that if you want a column with the years, you are better off doing the following:
df['year'] = pd.DatetimeIndex(df['datestart']).year
This last one will return the year as an integer.
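One way to cope with the mixed user-entered formats without relying on automatic inference is to parse in two explicit passes, sketched below. The two format strings are assumptions based on the sample in the question, and the nullable Int64 dtype is used so that unparseable rows stay missing.

```python
import pandas as pd

# A few values taken from the sample in the question.
df = pd.DataFrame({
    "datestart": ["5/5/2013", "10/16/2011", "00/00/0000", "Feb 1 2013"],
})

# First pass: the majority format MM/DD/YYYY; anything else becomes NaT.
parsed = pd.to_datetime(df["datestart"], format="%m/%d/%Y", errors="coerce")

# Second pass: the "Feb 1 2013" style, used only where the first pass failed.
fallback = pd.to_datetime(df["datestart"], format="%b %d %Y", errors="coerce")
parsed = parsed.fillna(fallback)

# Year as a nullable integer so that invalid dates remain missing.
df["year"] = parsed.dt.year.astype("Int64")

print(df)
```

Entries such as 00/00/0000 match neither format and are left as missing values rather than raising an error.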
