How to handle dates which are out of timestamp range in pandas? - python

I was working with the Crunchbase dataset. I have an entry of Harvard University which was founded in 1636. This entry is giving me an error when I am trying to convert string to DateTime.
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1636-09-08 00:00:00
I found out that pandas only supports timestamps from 1677 onwards:
>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
I checked out some solutions, such as one suggesting errors='coerce', but dropping this entry or making it null is not an option.
Can you please suggest a way to handle this issue?

As mentioned in the comments by Henry, pandas timestamps are limited because they are stored as 64-bit integer counts of nanoseconds, which only covers roughly the years 1677 to 2262. You could work around this by parsing the date-time with the datetime library when needed, and otherwise leaving it as a string or converting it to an integer.
Scenario 1: If you only need the parsed value when you display it
from datetime import datetime
datetime_object = datetime.strptime('1636-09-08 00:00:00', '%Y-%m-%d %H:%M:%S')
Scenario 2: If you want to keep it as a date column in the dataframe, you could additionally format it as a sortable string (which can also be converted to an integer)
datetime_object.strftime("%Y%m%d%H%M%S")
Applying the parsing step to a column in a pandas dataframe would yield this
import pandas as pd
df = pd.DataFrame([['1636-09-08 00:00:00'], ['1635-09-09 00:00:00']], columns=['dates'])
df['str_date'] = df['dates'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
df.head()
                 dates             str_date
0  1636-09-08 00:00:00  1636-09-08 00:00:00
1  1635-09-09 00:00:00  1635-09-09 00:00:00
pandas treats this column as an object column, but each value you access in it is a datetime.datetime object
df['str_date'][0]
>>datetime.datetime(1636, 9, 8, 0, 0)
Also, for the sake of completeness, see the pandas documentation on representing out-of-bounds spans: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob
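The linked documentation section describes representing out-of-bounds dates as periods rather than timestamps. A minimal sketch of that idea follows, assuming day resolution is enough for a founding date; the helper name to_period and the column name founded are made up for illustration and are not from the original answer.

```python
from datetime import datetime

import pandas as pd


def to_period(text):
    # Hypothetical helper: parse with the standard library, then build a
    # daily Period from the fields, which avoids the nanosecond Timestamp range.
    dt = datetime.strptime(text, '%Y-%m-%d %H:%M:%S')
    return pd.Period(year=dt.year, month=dt.month, day=dt.day, freq='D')


df = pd.DataFrame({'dates': ['1636-09-08 00:00:00', '1635-09-09 00:00:00']})
df['founded'] = df['dates'].apply(to_period)

# Periods can be compared and sorted like dates.
earliest = df['founded'].min()
print(df.sort_values('founded'))
```

Unlike the string or integer workaround, the values stay comparable as dates and still expose fields such as .year and .month.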

Related

pandas to_datetime but replace with fixed value when fail/coerce, preserve 'meaningful' NaNs

When using pd.to_datetime on my data frame I get this error:
Out of bounds nanosecond timestamp: 30-04-18 00:00:00
Now from looking on StackO I know I can simply use the coerce option:
pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
But I was wondering if anyone had an idea on how I might replace these values with a fixed value? Say 1900-01-01 00:00:00 (or maybe 1955-11-12 for anyone who gets the reference!)
Reason being that this data frame is part of a process that handles thousands and thousands of JSONs per day. I want to be able to see in the dataset easily the incorrect ones by filtering for said fixed date.
It is just as invalid for the JSON to contain any date before 2010 so using an earlier date is fine and it is also perfectly acceptable to have a blank (NA) date value so I can't rely on just blanking the data.
Use Series.mask to replace missing values with a default datetime, but only the missing values generated by to_datetime with errors='coerce' (values that were already missing stay NaT):
import numpy as np
import pandas as pd
df = pd.DataFrame({"date": [np.nan, '20180101', '20-20-0']})
t = pd.to_datetime('1900-01-01')
date = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df['date'] = date.mask(date.isna() & df['date'].notna(), t)
print(df)
        date
0        NaT
1 2018-01-01
2 1900-01-01

Pandas convert datetime string column to datetime without offset applied

I'm new to Python and Pandas, so don't be hard on me :)
I have multiple columns with values in the form of "2014-01-01 00:00:00-06:00". Now I want to convert these columns into pandas datetimes, but I am struggling with the format I need to use. I already tried
date = pd.to_datetime("2014-01-01 00:00:00-06:00", format='%Y-%m-%d %H:%M:%S%z')
But here I get an error: "ValueError: time data '2014-01-01 00:00:00-06:00' does not match format '%Y-%m-%d %H:%M:%S%Z' (match)"
I don't want the time to be converted into my timezone. I need it to stay in the -06:00 timezone.
For this Input:
2014-01-01 00:00:00-06:00
The Output should be:
2014-01-01 00:00:00
I want to use the date variable from the output so I can split my data into seasons. Something like this:
date > springBegining
Thanks for all help
You don't need a format string, pandas is man/woman enough to handle this:
In[2]:
pd.to_datetime('2014-01-01 00:00:00-06:00')
Out[2]: Timestamp('2014-01-01 06:00:00')
Besides, your format string has numerous issues:
%b is the locale's abbreviated month name; you have a numerical representation, so it should be %m
%z requires a UTC offset in the form '+HHMM'/'-HHMM' (Python 3.7 and later also accept a colon separator, as in '-06:00')
So on older versions you'd need to reformat the datetime string to:
'2014-01-01 00:00:00-0600'
If you don't want the offset to be applied and the offset is always the same you can strip this from the string:
In[25]:
pd.to_datetime('2014-01-01 00:00:00-06:00'.rsplit('-',1)[0])
Out[25]: Timestamp('2014-01-01 00:00:00')
Or you could slice the string:
In[26]:
pd.to_datetime('2014-01-01 00:00:00-06:00'[:-6])
Out[26]: Timestamp('2014-01-01 00:00:00')
So to do the above on an entire column:
pd.to_datetime(df[col].str[:-6])
Example:
In[27]:
df = pd.DataFrame({'date':['2014-01-01 00:00:00-06:00','2014-01-01 00:00:00+06:00']})
df
Out[27]:
date
0 2014-01-01 00:00:00-06:00
1 2014-01-01 00:00:00+06:00
In[28]:
pd.to_datetime(df['date'].str[:-6])
Out[28]:
0 2014-01-01
1 2014-01-01
Name: date, dtype: datetime64[ns]
Here we use the string accessor .str to slice every value in the column in the same manner and pass the result to to_datetime to convert the entire column
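As an alternative to slicing the string, the offset can be parsed and then discarded. This is only a sketch, assuming a recent pandas version in which to_datetime keeps the offset as timezone information instead of converting to UTC, and assuming every value in the column shares the same offset; the sample values are made up.

```python
import pandas as pd

# Parse with the offset, then drop the timezone while keeping the wall-clock time.
ts = pd.to_datetime('2014-01-01 00:00:00-06:00')
naive = ts.tz_localize(None)
print(naive)

# The same idea on a column whose values all share one offset.
s = pd.Series(['2014-01-01 00:00:00-06:00', '2014-06-01 12:30:00-06:00'])
naive_col = pd.to_datetime(s).dt.tz_localize(None)
print(naive_col)
```

If the column mixes different offsets, slicing the strings as shown above is the simpler route, because the parsed values would not share a single timezone.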

pd.to_datetime changes the values to 1970 which are not the correct dates

I have a column of the below type in dataframe.
MAT_DATE object
The values in this column are something like
42872
42741
...
...
How can I convert them to datetime ?
These are essentially future dates.
Using pd.to_datetime() converts them to year 1970
df['MAT_DATE1'] = pd.to_datetime(df['MAT_DATE'], errors='coerce')
If I change the cell format to Short Date in Excel, the dates are converted correctly.
However, I want to do the conversion on the dataframe directly.
Using the origin parameter of pandas.to_datetime, and treating the values as a number of days as @Wen suggested, this might work:
pd.to_datetime(df['MAT_DATE'],errors='coerce',unit='d',origin='1900-01-01')
The number is a delta in days from an offset date; the default offset date for Excel is 1900-01-01
s=pd.Series([42872,42741])
pd.TimedeltaIndex(s,unit='d')+pd.to_datetime('1900-01-01')
Out[88]: DatetimeIndex(['2017-05-19', '2017-01-08'], dtype='datetime64[ns]', freq=None)
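One caveat about the origin: Excel numbers 1900-01-01 as day 1 (not day 0) and also counts a nonexistent 1900-02-29, so an origin of 1900-01-01 lands two days later than the date Excel itself displays. A commonly used adjustment is the origin 1899-12-30. The sketch below assumes the column holds Excel serial numbers stored as strings, which is a guess based on the object dtype in the question.

```python
import pandas as pd

# Sample data standing in for the MAT_DATE column (object dtype, serials as text).
df = pd.DataFrame({'MAT_DATE': ['42872', '42741']})

# Convert the text to numbers first, then count days from the adjusted origin.
serials = pd.to_numeric(df['MAT_DATE'], errors='coerce')
df['MAT_DATE1'] = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(df)
```

With this origin, 42872 maps to 2017-05-17, which is the date Excel shows for that serial number.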

Working on dates with mm-dd-YY & YY-mm-dd format in pandas

I am trying to do a simple test of pandas' capabilities for handling dates and formats.
For that I have created a dataframe with values like below:
df = pd.DataFrame({'date1' : ['10-11-11','12-11-12','10-10-10','12-11-11',
'12-12-12','11-12-11','11-11-11']})
Here I am assuming that the values are dates. And I am converting it into proper format using pandas' to_datetime function.
df['format_date1'] = pd.to_datetime(df['date1'])
print(df)
Out[3]:
date1 format_date1
0 10-11-11 2011-10-11
1 12-11-12 2012-12-11
2 10-10-10 2010-10-10
3 12-11-11 2011-12-11
4 12-12-12 2012-12-12
5 11-12-11 2011-11-12
6 11-11-11 2011-11-11
Here, pandas is reading the dates in the dataframe as "MM/DD/YY" and converting them to its native format (i.e. YYYY/MM/DD). I want to check whether pandas can accept a hint that the date format is actually "YY/MM/DD" and then convert the dates to its native format. This will change the value of row 5. To do this, I ran the following code, but it gives me an error.
df3['format_date2'] = pd.to_datetime(df3['date1'], format='%Y/%m/%d')
ValueError: time data '10-10-10' does not match format '%Y/%m/%d' (match)
I have seen a solution of sorts here, but I was hoping for a simpler, crisper answer.
%Y in the format specifier takes the 4-digit year (i.e. 2016). %y takes the 2-digit year (i.e. 16, meaning 2016). Change the %Y to %y and it should work.
Also, your format specifier uses slashes where the data has dashes. You need to change your format to %y-%m-%d
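Putting both corrections together on the dataframe from the question, a minimal sketch looks like this (the column name format_date2 is taken from the question's own attempt):

```python
import pandas as pd

df = pd.DataFrame({'date1': ['10-11-11', '12-11-12', '10-10-10', '12-11-11',
                             '12-12-12', '11-12-11', '11-11-11']})

# %y = 2-digit year, and the separators are dashes to match the data.
df['format_date2'] = pd.to_datetime(df['date1'], format='%y-%m-%d')
print(df)
```

Row 5 ('11-12-11') is now read as 2011-12-11 rather than 2011-11-12.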

Convert Column to Date Format (Pandas Dataframe)

I have a pandas dataframe as follows:
Symbol Date
A 02/20/2015
A 01/15/2016
A 08/21/2015
I want to sort it by Date, but the column is just an object.
I tried to make the column a date object, but I ran into an issue where that format is not the format needed. The format needed is 2015-02-20, etc.
So now I'm trying to figure out how to have numpy convert the 'American' dates into the ISO standard, so that I can make them date objects, so that I can sort by them.
How would I convert these american dates into ISO standard, or is there a more straight forward method I'm missing within pandas?
You can use pd.to_datetime() to convert to a datetime object. It takes a format parameter, but in your case I don't think you need it.
>>> import pandas as pd
>>> df = pd.DataFrame( {'Symbol':['A','A','A'] ,
'Date':['02/20/2015','01/15/2016','08/21/2015']})
>>> df
Date Symbol
0 02/20/2015 A
1 01/15/2016 A
2 08/21/2015 A
>>> df['Date'] =pd.to_datetime(df.Date)
>>> df.sort('Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
For future searchers: in newer pandas versions, change the sort statement to:
>>> df.sort_values(by='Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
The sort method has been deprecated and replaced with sort_values. After converting the column to datetime with df['Date'] = pd.to_datetime(df['Date']), sort with:
df.sort_values(by=['Date'])
Note: to sort in-place and/or in a descending order (the most recent first):
df.sort_values(by=['Date'], inplace=True, ascending=False)
@JAB's answer is fast and concise. But it changes the DataFrame you are trying to sort, which you may or may not want.
(Note: You almost certainly will want it, because your date columns should be dates, not strings!)
In the unlikely event that you don't want to change the dates into dates, you can also do it a different way.
First, get the index from your sorted Date column:
In [25]: pd.to_datetime(df.Date).order().index
Out[25]: Int64Index([0, 2, 1], dtype='int64')
Then use it to index your original DataFrame, leaving it untouched:
In [26]: df.ix[pd.to_datetime(df.Date).order().index]
Out[26]:
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
Magic!
Note: for Pandas versions 0.20.0 and later, use loc instead of ix, which is now deprecated.
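Since order() and ix are no longer available in current pandas, the same index-based approach can be sketched with sort_values() and loc. The explicit format string is an assumption based on the sample data in the question.

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '08/21/2015']})

# Sort a parsed copy of the column and keep only the resulting index order.
order = pd.to_datetime(df['Date'], format='%m/%d/%Y').sort_values().index

# Reindex the original frame; the Date column itself stays as strings.
sorted_df = df.loc[order]
print(sorted_df)
```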
Since pandas 1.1.0, DataFrame.sort_values has a key argument. This way we can sort the dataframe by specifying a key, without modifying the original dataframe:
df.sort_values(by="Date", key=pd.to_datetime)
Symbol Date
0 A 02/20/2015
2 A 08/21/2015
1 A 01/15/2016
The data containing the date column can be read using the code below:
data = pd.read_csv(file_path, parse_dates=[date_column])
Once the data has been read, the date column can also be converted explicitly with pd.to_datetime(), like:
pd.to_datetime(data[date_column], format='%d/%m/%y')
Adjust the format string to match how the dates are written in your data.
data['Date'] = data['Date'].apply(pd.to_datetime) # non-null datetime64[ns]
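For the data in the question, reading and converting could be sketched as follows. An in-memory string stands in for the CSV file, and the format string is an assumption based on the sample rows.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file from the question.
csv_text = "Symbol,Date\nA,02/20/2015\nA,01/15/2016\nA,08/21/2015\n"

data = pd.read_csv(io.StringIO(csv_text))

# Month/day/4-digit-year, matching the 'American' dates in the sample.
data['Date'] = pd.to_datetime(data['Date'], format='%m/%d/%Y')
data = data.sort_values(by='Date')
print(data)
```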
