Pandas giving me the wrong max date in a date time column? - python

I have a dataframe with a date column:
data['Date']
0 1/1/14
1 1/8/14
2 1/15/14
3 1/22/14
4 1/29/14
...
255 11/21/18
256 11/28/18
257 12/5/18
258 12/12/18
259 12/19/18
But, when I try to get the max date out of that column, I get:
test_data.Date.max()
'9/9/15'
Any idea why this would happen?

Clearly the column is of type object. You should try using pd.to_datetime() and then performing the max() aggregator:
data['Date'] = pd.to_datetime(data['Date'],errors='coerce') #You might need to pass format
print(data['Date'].max())
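To make the difference concrete, here is a minimal sketch. The sample values are made up to mirror the question's M/D/YY strings; only the column name comes from the question.

```python
import pandas as pd

# Hypothetical sample in the same M/D/YY style as the question
data = pd.DataFrame({"Date": ["1/1/14", "9/9/15", "11/21/18", "12/19/18"]})

# Strings are compared character by character, so "9..." beats "1..."
string_max = data["Date"].max()

# After parsing, max() compares real points in time
parsed = pd.to_datetime(data["Date"], format="%m/%d/%y")
date_max = parsed.max()

print(string_max)
print(date_max)
```

The string comparison picks the value whose first character sorts highest, which is why a 2015 date can "win" over 2018 dates.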

max() treats the values as dates (as you want) only if they are datetime objects. Building upon Seshadri's response, try:
type(data['Date'][1])
If it is a datetime object, this returns this:
pandas._libs.tslibs.timestamps.Timestamp
If not, you can convert that column to datetime like so:
data['Date'] = pd.to_datetime(data['Date'],format='%m/%d/%y')
The format argument tells pandas exactly how the strings are laid out. The full list of format codes is in the strftime/strptime section of the Python datetime docs.
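Instead of inspecting a single element, you can also check the whole column before and after converting. A small sketch, assuming a made-up two-row frame; `pd.api.types.is_datetime64_any_dtype` is a regular pandas helper.

```python
import pandas as pd

# Hypothetical two-row frame with string dates
data = pd.DataFrame({"Date": ["1/1/14", "1/8/14"]})

# Before conversion the column holds strings, not datetimes
before = pd.api.types.is_datetime64_any_dtype(data["Date"])

data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%y")

# After conversion the column has a datetime64 dtype
after = pd.api.types.is_datetime64_any_dtype(data["Date"])

print(before, after)
```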

Your date may be stored as a string. First convert the column from string to datetime. Then, max() should work.
test = pd.DataFrame(['1/1/2010', '2/1/2011', '3/4/2020'], columns=['Dates'])
Dates
0 1/1/2010
1 2/1/2011
2 3/4/2020
pd.to_datetime(test['Dates'], format='%m/%d/%Y').max()
Timestamp('2020-03-04 00:00:00')
That timestamp can be cleaned up using .dt.date:
pd.to_datetime(test['Dates'], format='%m/%d/%Y').dt.date.max()
datetime.date(2020, 3, 4)
to_datetime format argument table python docs
pandas to_datetime pandas docs

Related

Create date from one year with string and int error - PYTHON

I have the following problem. I want to create a date from another. To do this, I extract the year from the database date and then create the chosen date (day = 30 and month = 9) being the year extracted from the database.
The code is the following
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
But error message is this
"cannot convert the series to <class 'int'>"
I think dt means datetime, so the line dt.datetime(y,m,d) creates a single datetime object.
Should bbdd20Q3['mydate'] hold an int?
If so, consider another way to store the date (an 8-digit number, maybe).
Hope this helps :)
I assume that you did import datetime as dt then by doing:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
You are passing a Series as the first argument to datetime.datetime, which expects an int or something that can be converted to an int. You need one datetime.datetime per element of the Series, not a single datetime.datetime. Consider the following example:
import datetime
import pandas as pd
df = pd.DataFrame({"year":[2001,2002,2003]})
df["day"] = df["year"].apply(lambda x:datetime.datetime(x,9,30))
print(df)
Output:
year day
0 2001 2001-09-30
1 2002 2002-09-30
2 2003 2003-09-30
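For what it's worth, here is a sketch that reproduces the failure and then builds one date per row without apply. The `years` series and the `parts` frame are made-up stand-ins for the question's data.

```python
import datetime as dt
import pandas as pd

years = pd.Series([2001, 2002, 2003])  # stand-in for bbdd20Q3['year']

# datetime.datetime wants one integer per argument, not a whole Series
try:
    dt.datetime(years, 9, 30)
    raised = False
except TypeError:
    raised = True

# One date per row: assemble from year/month/day columns
parts = pd.DataFrame({"year": years, "month": 9, "day": 30})
dates = pd.to_datetime(parts)

print(raised)
print(dates)
```

pd.to_datetime accepts a frame with year, month and day columns and returns one timestamp per row, so the scalar 9 and 30 are simply repeated for every year.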
Here's a sample code with the required logic -
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['2019-12-14', '2020-12-15']})
print(df.dtypes)
# convert the date in string format to datetime object,
# if the date column(Series) is already a datetime object then this is not required
df['date'] = pd.to_datetime(df['date'])
print(f'after conversion \n {df.dtypes}')
# logic to create a new data column
df['new_date'] = pd.to_datetime({'year':df['date'].dt.year,'month':9,'day':30})
@eollon I see that you are also new to Stack Overflow. It would be better if you could add a simple code sample that others can try out independently
(keeping the comment here since I don't have permission to comment :) )

Passing chopped down datetimes

I have been stumped for the past few hours trying to solve the following.
In a large data set I have from an automated system, there is a DATE_TIME value which, for rows just after midnight, has an hour that is not zero-padded, like: 12-MAY-2017 0:16:20
When I try to convert this to a date (so that it's usable for conversions) as follows:
df['DATE_TIME'].astype('datetime64[ns]')
I get the following error:
Error parsing datetime string "12-MAY-2017 0:16:20" at position 3
I tried writing some regex to pull out each piece but couldn't get anything working, given that the hour could be either one or two characters. It also doesn't seem like an ideal solution to write a regex for each piece.
Any ideas on this?
Try to use pandas.to_datetime() method:
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], errors='coerce')
The parameter errors='coerce' turns any strings that can't be converted into NaT instead of raising an error.
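If you would rather spell the layout out than rely on inference, an explicit format also copes with the one-digit hour. This is a sketch with made-up rows (the last one is deliberately invalid), and it assumes English month abbreviations in the current locale.

```python
import pandas as pd

# Hypothetical values: a one-digit hour, a two-digit hour, and garbage
raw = pd.Series(["12-MAY-2017 0:16:20", "13-MAY-2017 14:05:00", "not a date"])

# %b matches the abbreviated month name, %H accepts one or two digits
parsed = pd.to_datetime(raw, format="%d-%b-%Y %H:%M:%S", errors="coerce")

print(parsed)
```

Rows that do not match the format come back as NaT, so it is worth counting them with parsed.isna().sum() before trusting the result.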
I think you need pandas.to_datetime only:
df = pd.DataFrame({'DATE_TIME':['12-MAY-2017 0:16:20','12-MAY-2017 0:16:20']})
print (df)
DATE_TIME
0 12-MAY-2017 0:16:20
1 12-MAY-2017 0:16:20
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
print (df)
DATE_TIME
0 2017-05-12 00:16:20
1 2017-05-12 00:16:20
Converting with NumPy's astype is problematic because it needs strings in ISO 8601 date or datetime format:
df['DATE_TIME'].astype('datetime64[ns]')
ValueError: Error parsing datetime string "12-MAY-2017 0:16:20" at position 3
EDIT:
If some of the datetimes are broken (stray strings or ints), use MaxU's answer.

How do I convert timestamp to datetime.date in pandas dataframe?

I need to merge 2 pandas dataframes together on dates, but they currently have different date types. 1 is timestamp (imported from excel) and the other is datetime.date.
Any advice?
I've tried pd.to_datetime().date but this only works on a single item(e.g. df.ix[0,0]), it won't let me apply to the entire series (e.g. df['mydates']) or the dataframe.
I got some help from a colleague.
This appears to solve the problem posted above
pd.to_datetime(df['mydates']).apply(lambda x: x.date())
Much simpler than above:
df['mydates'].dt.date
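Since the goal was a merge, here is a sketch of both directions: bringing the keys to datetime.date, or bringing them to datetime64. The two frames and the columns a and b are invented for illustration; only mydates comes from the question.

```python
import datetime
import pandas as pd

# Hypothetical frames: one key is datetime64, the other holds datetime.date
left = pd.DataFrame({
    "mydates": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "a": [1, 2],
})
right = pd.DataFrame({
    "mydates": [datetime.date(2021, 1, 1), datetime.date(2021, 1, 2)],
    "b": [10, 20],
})

# Option 1: convert the timestamp key to datetime.date objects
left_dates = left.assign(mydates=left["mydates"].dt.date)
merged_as_date = left_dates.merge(right, on="mydates")

# Option 2: convert the datetime.date key to datetime64
right_ts = right.assign(mydates=pd.to_datetime(right["mydates"]))
merged_as_ts = left.merge(right_ts, on="mydates")

print(merged_as_date)
print(merged_as_ts)
```

Option 2 keeps the key as a proper datetime column, which tends to be more convenient for later filtering and resampling.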
For me this works:
from datetime import datetime
df[ts] = [datetime.fromtimestamp(x) for x in df[ts]]
You have to know whether the Unix timestamp is in seconds or milliseconds. Assume it is in seconds and that you have the following pandas DataFrame:
print(df.head())
And you get:
timestamp XETHZUSD
0 1609459200 730.85
1 1609545600 775.01
2 1609632000 979.86
3 1609718400 1042.52
4 1609804800 1103.41
You can convert the timestamp to datetime as follows:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
print(df.head())
And we get:
timestamp XETHZUSD
0 2021-01-01 730.85
1 2021-01-02 775.01
2 2021-01-03 979.86
3 2021-01-04 1042.52
4 2021-01-05 1103.41
If the Unix timestamp was in milliseconds, then you should have typed
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
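A quick way to sanity-check the unit is to look at the year you get back. A sketch reusing the first two timestamps from the table above; the variable names are made up.

```python
import pandas as pd

seconds = pd.Series([1609459200, 1609545600])
millis = seconds * 1000  # the same instants expressed in milliseconds

from_s = pd.to_datetime(seconds, unit="s")
from_ms = pd.to_datetime(millis, unit="ms")

# Wrong unit: second-based values read as milliseconds land in January 1970
wrong = pd.to_datetime(seconds, unit="ms")

print(from_s.iloc[0], from_ms.iloc[0], wrong.iloc[0])
```

If the converted dates cluster in 1970, the values were most likely seconds parsed as milliseconds.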
Another question was marked as dupe pointing to this, but it didn't include this answer, which seems the most straightforward (perhaps this method did not yet exist when this question was posted/answered):
The pandas doc shows a pandas.Timestamp.to_pydatetime method to "Convert a Timestamp object to a native Python datetime object".
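A short sketch of that method on a single made-up Timestamp, together with .date() for the date-only case:

```python
import datetime
import pandas as pd

ts = pd.Timestamp("2021-01-01 12:30:00")  # hypothetical value

py_dt = ts.to_pydatetime()  # plain datetime.datetime
py_date = ts.date()         # plain datetime.date

print(type(py_dt), py_date)
```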
Assume the time column holds integer timestamps in milliseconds (1 day = 86400000 ms).
Use one of the following conversions, not both:
day_divider = 86400000
# Option 1: keep millisecond resolution
df['time'] = df['time'].values.astype(dtype='datetime64[ms]')
# Option 2: truncate to whole days (integer division yields a day count)
df['time'] = (df['time'] // day_divider).values.astype(dtype='datetime64[D]')
If you need datetime.date objects, get them through the .dt.date accessor of the datetime series:
pd.to_datetime(df['mydates']).dt.date
I found the following to be the most effective when I ran into a similar issue. For instance, with a dataframe df that has a series of timestamps in column ts (pd.datetime is gone from current pandas, so import datetime directly):
from datetime import datetime
df.ts.apply(lambda x: datetime.fromtimestamp(x).date())
This makes the conversion; you can leave out the .date() suffix to get datetimes. Then, to alter the column on the dataframe:
df.loc[:, 'ts'] = df.ts.apply(lambda x: datetime.fromtimestamp(x).date())
I was trying to convert a timestamp column to date/time, here is what I came up with:
df['Timestamp'] = df['Timestamp'].apply(lambda timestamp: datetime.fromtimestamp(timestamp))
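One thing to keep in mind with datetime.fromtimestamp: without a tz argument it returns local time, while pd.to_datetime(..., unit='s') reads the value as UTC. A sketch with a single made-up timestamp shows how to get the two to agree:

```python
from datetime import datetime, timezone
import pandas as pd

ts = 1609459200  # 2021-01-01 00:00:00 UTC

# Passing tz makes the result independent of the machine's time zone
utc_dt = datetime.fromtimestamp(ts, tz=timezone.utc)

# pandas interprets the epoch value as UTC and returns a naive Timestamp
pandas_ts = pd.to_datetime(ts, unit="s")

print(utc_dt, pandas_ts)
```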

How to convert timedelta to time of day in pandas?

I have a SQL table that contains data of the mySQL time type as follows:
time_of_day
-----------
12:34:56
I then use pandas to read the table in:
df = pd.read_sql('select * from time_of_day', engine)
Looking at df.dtypes yields:
time_of_day timedelta64[ns]
My main issue is that, when writing my df to a csv file, the data comes out all messed up, instead of essentially looking like my SQL table:
time_of_day
0 days 12:34:56.000000000
I'd like to instead (obviously) store this record as a time, but I can't find anything in the pandas docs that talk about a time dtype.
Does pandas lack this functionality intentionally? Is there a way to solve my problem without requiring janky data casting?
Seems like this should be elementary, but I'm confounded.
Pandas does not support a time dtype series
Pandas (and NumPy) do not have a time dtype. Since you wish to avoid Pandas timedelta, you have 3 options: Pandas datetime, Python datetime.time, or Python str. Below they are presented in order of preference. Let's assume you start with the following dataframe:
df = pd.DataFrame({'time': pd.to_timedelta(['12:34:56', '05:12:45', '15:15:06'])})
print(df['time'].dtype) # timedelta64[ns]
Pandas datetime series
You can use Pandas datetime series and include an arbitrary date component, e.g. today's date. Underlying such a series are integers, which makes this solution the most efficient and adaptable.
The default date, if unspecified, is 1-Jan-1970:
df['time'] = pd.to_datetime(df['time'])
print(df)
# time
# 0 1970-01-01 12:34:56
# 1 1970-01-01 05:12:45
# 2 1970-01-01 15:15:06
You can also specify a date, such as today:
df['time'] = pd.Timestamp('today').normalize() + df['time']
print(df)
# time
# 0 2019-01-02 12:34:56
# 1 2019-01-02 05:12:45
# 2 2019-01-02 15:15:06
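If a particular pandas version refuses to pass a timedelta series to pd.to_datetime, adding the series to an explicit anchor date gives the same kind of result. A sketch assuming an arbitrary anchor of 1970-01-01; the names anchor, as_datetime and back are made up.

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(["12:34:56", "05:12:45"]))

# Anchor the time of day to an arbitrary, fixed date
anchor = pd.Timestamp("1970-01-01")
as_datetime = anchor + td

# Subtracting midnight of each row recovers the original timedelta
back = as_datetime - as_datetime.dt.normalize()

print(as_datetime)
print(back)
```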
Pandas object series of Python datetime.time values
The Python datetime module from the standard library supports datetime.time objects. You can convert your series to an object dtype series containing pointers to a sequence of datetime.time objects. Operations will no longer be vectorised, but each underlying value will be represented internally by a number.
df['time'] = pd.to_datetime(df['time']).dt.time
print(df)
# time
# 0 12:34:56
# 1 05:12:45
# 2 15:15:06
print(df['time'].dtype)
# object
print(type(df['time'].at[0]))
# <class 'datetime.time'>
Pandas object series of Python str values
Converting to strings is only recommended for presentation purposes that are not supported by other types, e.g. Pandas datetime or Python datetime.time. For example:
df['time'] = pd.to_datetime(df['time']).dt.strftime('%H:%M:%S')
print(df)
# time
# 0 12:34:56
# 1 05:12:45
# 2 15:15:06
print(df['time'].dtype)
# object
print(type(df['time'].at[0]))
# <class 'str'>
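For the CSV problem in the question specifically, the string route can be applied just before writing. A sketch with made-up rows that writes to an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"time_of_day": pd.to_timedelta(["12:34:56", "05:12:45"])})

# Anchor to an arbitrary date, then keep only the HH:MM:SS part as text
as_text = (pd.Timestamp("1970-01-01") + df["time_of_day"]).dt.strftime("%H:%M:%S")

# Write a copy so the original timedelta column stays usable
buffer = io.StringIO()
df.assign(time_of_day=as_text).to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Formatting only the exported copy keeps the in-memory column as timedelta, so arithmetic on it still works.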
It's a hack, but you can pull out the components to create a string and convert that string to a datetime.time object:
from datetime import datetime
def convert(td):
    # Build an 'H:M:S' string from the Timedelta components, then parse it
    time = [str(td.components.hours), str(td.components.minutes),
            str(td.components.seconds)]
    return datetime.strptime(':'.join(time), '%H:%M:%S').time()
df['time'] = df['time'].apply(convert)
I found a solution, but I feel like there has to be something more elegant than this:
def convert(x):
    return pd.to_datetime(x).strftime('%H:%M:%S')
df['time_of_day'] = df['time_of_day'].apply(convert)
df['time_of_day'] = pd.to_datetime(df['time_of_day']).apply(lambda x: x.time())
Adapted this code

Convert Column to Date Format (Pandas Dataframe)

I have a pandas dataframe as follows:
Symbol Date
A 02/20/2015
A 01/15/2016
A 08/21/2015
I want to sort it by Date, but the column is just an object.
I tried to make the column a date object, but I ran into an issue where that format is not the format needed. The format needed is 2015-02-20, etc.
So now I'm trying to figure out how to have numpy convert the 'American' dates into the ISO standard, so that I can make them date objects, so that I can sort by them.
How would I convert these american dates into ISO standard, or is there a more straight forward method I'm missing within pandas?
You can use pd.to_datetime() to convert to a datetime object. It takes a format parameter, but in your case I don't think you need it.
>>> import pandas as pd
>>> df = pd.DataFrame( {'Symbol':['A','A','A'] ,
'Date':['02/20/2015','01/15/2016','08/21/2015']})
>>> df
Date Symbol
0 02/20/2015 A
1 01/15/2016 A
2 08/21/2015 A
>>> df['Date'] =pd.to_datetime(df.Date)
>>> df.sort('Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
For future search, you can change the sort statement:
>>> df.sort_values(by='Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
The sort method has been deprecated and replaced with sort_values. After converting to datetime with df['Date']=pd.to_datetime(df['Date']), use:
df.sort_values(by=['Date'])
Note: to sort in-place and/or in a descending order (the most recent first):
df.sort_values(by=['Date'], inplace=True, ascending=False)
@JAB's answer is fast and concise. But it changes the DataFrame you are trying to sort, which you may or may not want.
(Note: You almost certainly will want it, because your date columns should be dates, not strings!)
In the unlikely event that you don't want to change the dates into dates, you can also do it a different way.
First, get the index from your sorted Date column:
In [25]: pd.to_datetime(df.Date).order().index
Out[25]: Int64Index([0, 2, 1], dtype='int64')
Then use it to index your original DataFrame, leaving it untouched:
In [26]: df.ix[pd.to_datetime(df.Date).order().index]
Out[26]:
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
Magic!
Note: for pandas versions 0.20.0 and later, use loc instead of ix, which is deprecated, and sort_values() instead of order(), which has been removed.
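On a current pandas, the same index trick might look like the sketch below. It rebuilds the question's frame and passes an explicit format for the US-style dates; the names order and sorted_df are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A", "A", "A"],
    "Date": ["02/20/2015", "01/15/2016", "08/21/2015"],
})

# Sort a parsed copy of the column and keep only the resulting index order
order = pd.to_datetime(df["Date"], format="%m/%d/%Y").sort_values().index

# Reorder the original frame by label; the Date column stays as strings
sorted_df = df.loc[order]

print(sorted_df)
```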
Since pandas 1.1.0 we have the key argument in DataFrame.sort_values. This lets us sort the dataframe by a key without modifying the original dataframe:
df.sort_values(by="Date", key=pd.to_datetime)
Symbol Date
0 A 02/20/2015
2 A 08/21/2015
1 A 01/15/2016
A file containing a date column can be read and parsed in one step:
data = pd.read_csv(file_path, parse_dates=[date_column])
If the column was not parsed on read, convert it afterwards with pd.to_datetime(), for example:
pd.to_datetime(data[date_column], format='%d/%m/%y')
Adjust the format string to match how the dates are written in your file.
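A self-contained sketch of the read-and-parse route, using an in-memory CSV in place of a real file. The CSV text mirrors the question's data; the other names are made up.

```python
import io
import pandas as pd

# In-memory stand-in for the file on disk
csv_text = "Symbol,Date\nA,02/20/2015\nA,01/15/2016\nA,08/21/2015\n"

# parse_dates asks read_csv to convert the named column while reading
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

is_datetime = pd.api.types.is_datetime64_any_dtype(data["Date"])
data_sorted = data.sort_values(by="Date")

print(is_datetime)
print(data_sorted)
```

For day-first or otherwise ambiguous layouts it is safer to read the column as text and call pd.to_datetime with an explicit format afterwards.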
data['Date'] = data['Date'].apply(pd.to_datetime) # non-null datetime64[ns]
