I have been stumped for the past few hours trying to solve the following.
In a large data set I have from an automated system, there is a DATE_TIME column. For rows just after midnight, the hour is not zero-padded, for example: 12-MAY-2017 0:16:20
When I try to convert this to a date (so that it is usable for conversions) as follows:
df['DATE_TIME'].astype('datetime64[ns]')
I get the following error:
Error parsing datetime string "12-MAY-2017 0:16:20" at position 3
I tried writing some regex to pull out each piece but couldn't get anything working, given that the hour can be either one or two characters. It also doesn't seem like an ideal solution to write a regex for each piece.
Any ideas on this?
Try the pandas.to_datetime() method:
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], errors='coerce')
The errors='coerce' parameter takes care of strings that can't be converted to the datetime dtype: they become NaT instead of raising an error.
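As a minimal sketch of what errors='coerce' does: the sample values and the explicit format string below are assumptions for illustration, not taken from the question's data set, and the month name matching assumes an English locale. The unpadded midnight hour parses fine, while the broken value comes back as NaT:

```python
import pandas as pd

# Hypothetical sample: two valid timestamps and one broken value.
df = pd.DataFrame({'DATE_TIME': ['12-MAY-2017 0:16:20',
                                 '12-MAY-2017 13:05:01',
                                 'not a date']})

# %b matches the abbreviated month name and %H accepts a one- or two-digit hour.
parsed = pd.to_datetime(df['DATE_TIME'], format='%d-%b-%Y %H:%M:%S', errors='coerce')

# Rows that failed to parse are NaT, so they are easy to inspect.
bad_rows = df[parsed.isna()]
print(parsed)
print(bad_rows)
```

Inspecting bad_rows before overwriting the column is a cheap way to check that coercion is not silently hiding a formatting problem.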
I think you need pandas.to_datetime only:
df = pd.DataFrame({'DATE_TIME':['12-MAY-2017 0:16:20','12-MAY-2017 0:16:20']})
print (df)
DATE_TIME
0 12-MAY-2017 0:16:20
1 12-MAY-2017 0:16:20
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
print (df)
DATE_TIME
0 2017-05-12 00:16:20
1 2017-05-12 00:16:20
Converting with astype in NumPy seems problematic, because it needs strings in ISO 8601 date or datetime format:
df['DATE_TIME'].astype('datetime64[ns]')
ValueError: Error parsing datetime string "12-MAY-2017 0:16:20" at position 3
EDIT:
If some of the datetimes are broken (stray strings or ints), then use MaxU's answer.
Some rows in my dataframe have time as "13:2:7" and I want them to be "13:02:07".
I have tried applying pd.to_datetime to the column but it doesn't work.
Can someone please suggest a method to format the time in a standard format?
Found the solution.
I solved it with pandas to_timedelta:
pd.to_timedelta("13:2:3") (or pass the DataFrame column through it)
result:
"0 days 13:02:03"
To remove the leading "0 days ", convert to string and slice it off:
df["column_name"].astype(str).map(lambda x: x[7:])
This slices off the first 7 characters, i.e. the "0 days " prefix.
Note: the final result is a string.
If you want a time object instead, use pandas pd.to_datetime or strptime from the datetime module.
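For the pd.to_datetime route mentioned above, here is a rough sketch. It assumes every value is a plain hour:minute:second string with no day component, and the sample values are made up. Passing an explicit format lets the unpadded fields parse:

```python
import pandas as pd

# Hypothetical sample of unpadded time strings.
times = pd.Series(['13:2:7', '9:5:0', '23:59:59'])

# Parse with an explicit format; the date part defaults to 1900-01-01 and is discarded below.
parsed = pd.to_datetime(times, format='%H:%M:%S')

padded = parsed.dt.strftime('%H:%M:%S')   # zero-padded strings
as_time = parsed.dt.time                  # datetime.time objects

print(padded.tolist())
print(as_time.tolist())
```

Unlike slicing the string form of a timedelta, this does not depend on the length of the "0 days " prefix.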
When using pd.to_datetime on my data frame I get this error:
Out of bounds nanosecond timestamp: 30-04-18 00:00:00
From looking on Stack Overflow, I know I can simply use the coerce option:
pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
But I was wondering if anyone had an idea on how I might replace these values with a fixed value? Say 1900-01-01 00:00:00 (or maybe 1955-11-12 for anyone who gets the reference!)
Reason being that this data frame is part of a process that handles thousands and thousands of JSONs per day. I want to be able to see in the dataset easily the incorrect ones by filtering for said fixed date.
It is just as invalid for the JSON to contain any date before 2010 so using an earlier date is fine and it is also perfectly acceptable to have a blank (NA) date value so I can't rely on just blanking the data.
Use Series.mask to replace missing values with a default datetime, but only the missing values generated by to_datetime with errors='coerce' (values that were already missing stay NaT):
df=pd.DataFrame({"date": [np.nan,'20180101','20-20-0']})
t = pd.to_datetime('1900-01-01')
date = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = date.mask(date.isna() & df['date'].notna(), t)
print (df)
date
0 NaT
1 2018-01-01
2 1900-01-01
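The same idea can be wrapped in a small helper so the flagged rows are easy to filter later. This is only a sketch: the function name to_datetime_with_sentinel and its parameters are hypothetical, and the sample data mirrors the example above.

```python
import numpy as np
import pandas as pd

def to_datetime_with_sentinel(raw, fmt, sentinel=pd.Timestamp('1900-01-01')):
    """Parse `raw`; mark unparseable (but non-missing) values with `sentinel`."""
    parsed = pd.to_datetime(raw, format=fmt, errors='coerce')
    failed = parsed.isna() & raw.notna()   # NaT produced by coercion, not by a blank input
    return parsed.mask(failed, sentinel)

df = pd.DataFrame({'date': [np.nan, '20180101', '20-20-0']})
df['date'] = to_datetime_with_sentinel(df['date'], '%Y%m%d')

# Filter for the sentinel to list the rows that failed to parse.
flagged = df[df['date'] == pd.Timestamp('1900-01-01')]
print(flagged)
```

Comparing the column against the sentinel skips the genuinely blank rows, because NaT never compares equal to a timestamp.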
I have a dataframe with a date column:
data['Date']
0 1/1/14
1 1/8/14
2 1/15/14
3 1/22/14
4 1/29/14
...
255 11/21/18
256 11/28/18
257 12/5/18
258 12/12/18
259 12/19/18
But, when I try to get the max date out of that column, I get:
test_data.Date.max()
'9/9/15'
Any idea why this would happen?
Clearly the column is of type object. You should try using pd.to_datetime() and then performing the max() aggregator:
data['Date'] = pd.to_datetime(data['Date'],errors='coerce') #You might need to pass format
print(data['Date'].max())
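To see why the object column gives '9/9/15', here is a small sketch using three of the values from the question (the subset is chosen only for illustration). Strings are compared character by character, so anything starting with '9' sorts after anything starting with '1':

```python
import pandas as pd

# Subset of the dates from the question, stored as strings.
data = pd.DataFrame({'Date': ['1/1/14', '9/9/15', '12/19/18']})

# Strings compare character by character, so '9...' beats '1...'.
string_max = data['Date'].max()

# After parsing, max() compares actual points in time.
dates = pd.to_datetime(data['Date'], format='%m/%d/%y')
date_max = dates.max()

print(string_max)
print(date_max)
```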
.max() treats the values as dates (as you want) only if they are datetime objects. Building upon Seshadri's response, try:
type(data['Date'][1])
If it is a datetime object, this returns:
pandas._libs.tslibs.timestamps.Timestamp
If not, you can convert that column to datetime like so:
data['Date'] = pd.to_datetime(data['Date'],format='%m/%d/%y')
The format argument makes sure the strings are parsed with the right layout. See the full list of format codes in the Python docs.
Your date may be stored as a string. First convert the column from string to datetime. Then, max() should work.
test = pd.DataFrame(['1/1/2010', '2/1/2011', '3/4/2020'], columns=['Dates'])
Dates
0 1/1/2010
1 2/1/2011
2 3/4/2020
pd.to_datetime(test['Dates'], format='%m/%d/%Y').max()
Timestamp('2020-03-04 00:00:00')
That timestamp can be cleaned up using .dt.date:
pd.to_datetime(test['Dates'], format='%m/%d/%Y').dt.date.max()
datetime.date(2020, 3, 4)
References: the format codes table in the Python docs, and to_datetime in the pandas docs.
Beginner python (and therefore pandas) user. I am trying to import some data into a pandas dataframe. One of the columns is the date, but in the format "YYYYMM". I have attempted to do what most forum responses suggest:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
This doesn't work though (ValueError: unconverted data remains: 3). The column actually includes an additional value for each year, with MM=13. The source used this row as an average of the past year. I am guessing to_datetime is having an issue with that.
Could anyone offer a quick solution, either to strip out all of the annual averages (those with the last two digits "13"), or to have to_datetime ignore them?
Pass errors='coerce' and then drop the NaT rows with dropna:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce')
df_cons = df_cons.dropna(subset=['YYYYMM'])
The invalid month values will get converted to NaT:
In[36]:
pd.to_datetime('201613', format='%Y%m', errors='coerce')
Out[36]: NaT
Alternatively you could filter them out before the conversion
df_cons['YYYYMM'] = pd.to_datetime(df_cons.loc[df_cons['YYYYMM'].str[-2:] != '13','YYYYMM'], format='%Y%m', errors='coerce')
although this could lead to alignment issues, because the returned Series is shorter than the column it is assigned to (the filtered-out rows end up as NaT anyway), so just passing errors='coerce' is the simpler solution.
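A small end-to-end sketch of the coerce-then-drop approach follows. The sample frame and the value column are made up for illustration; only the YYYYMM column name comes from the question.

```python
import pandas as pd

# Hypothetical data: month 13 holds the annual average.
df_cons = pd.DataFrame({'YYYYMM': ['201611', '201612', '201613', '201701'],
                        'value': [1.0, 2.0, 1.5, 3.0]})

# Month 13 cannot be parsed, so it is coerced to NaT.
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce')

# Drop the rows whose date could not be parsed.
monthly = df_cons.dropna(subset=['YYYYMM'])
print(monthly)
```

Dropping on the whole frame (rather than on the converted Series alone) is what actually removes the annual-average rows.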
Clean up the dataframe first.
df_cons = df_cons[~df_cons['YYYYMM'].str.endswith('13')]
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
May I suggest turning the column into a period index, if the YYYYMM column is unique in your dataset?
First turn YYYYMM into the index, then convert it to a monthly period:
df_cons = df_cons.reset_index().set_index('YYYYMM').to_period('M')
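A sketch of the period-index idea on made-up data, assuming the YYYYMM column has already been cleaned of the month-13 rows and converted to datetime (to_period needs a DatetimeIndex). The reset_index() step is omitted here because the sample frame still has its default index:

```python
import pandas as pd

# Hypothetical monthly data with unique YYYYMM values.
df_cons = pd.DataFrame({'YYYYMM': ['201611', '201612', '201701'],
                        'value': [1.0, 2.0, 3.0]})
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')

# Move the dates into the index, then convert the DatetimeIndex to monthly periods.
df_period = df_cons.set_index('YYYYMM').to_period('M')
print(df_period)
```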
I am trying to do a simple test of pandas' capabilities for handling dates and formats.
For that I have created a dataframe with values like the ones below:
df = pd.DataFrame({'date1' : ['10-11-11','12-11-12','10-10-10','12-11-11',
'12-12-12','11-12-11','11-11-11']})
Here I am assuming that the values are dates, and I am converting them into a proper format using pandas' to_datetime function.
df['format_date1'] = pd.to_datetime(df['date1'])
print(df)
Out[3]:
date1 format_date1
0 10-11-11 2011-10-11
1 12-11-12 2012-12-11
2 10-10-10 2010-10-10
3 12-11-11 2011-12-11
4 12-12-12 2012-12-12
5 11-12-11 2011-11-12
6 11-11-11 2011-11-11
Here, pandas is reading the dates in the dataframe as "MM/DD/YY" and converting them into its native format (i.e. YYYY/MM/DD). I want to check whether pandas can accept my input indicating that the date format is actually "YY/MM/DD" and then convert it into its native format. This will change the value of row number 5. To do this, I ran the following code, but it gives me an error.
df3['format_date2'] = pd.to_datetime(df3['date1'], format='%Y/%m/%d')
ValueError: time data '10-10-10' does not match format '%Y/%m/%d' (match)
I have seen a solution of sorts here, but I was hoping for a simpler, crisper answer.
%Y in the format specifier takes the 4-digit year (e.g. 2016). %y takes the 2-digit year (e.g. 16, meaning 2016). Change the %Y to %y and it should work.
Also, your format specifier uses slashes, but your data uses dashes. You need to change your format to %y-%m-%d
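Putting both fixes together on the dataframe from the question gives the sketch below. The as_mdy column is an extra added here only for comparison, so that the difference on row 5 is visible:

```python
import pandas as pd

df = pd.DataFrame({'date1': ['10-11-11', '12-11-12', '10-10-10', '12-11-11',
                             '12-12-12', '11-12-11', '11-11-11']})

# Read the strings as two-digit year, month, day, separated by dashes.
df['format_date2'] = pd.to_datetime(df['date1'], format='%y-%m-%d')

# For comparison: the same strings read as month, day, two-digit year.
df['as_mdy'] = pd.to_datetime(df['date1'], format='%m-%d-%y')
print(df)
```

Row 5 ('11-12-11') becomes 2011-12-11 under the year-first format and 2011-11-12 under the month-first one.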