Convert Column to Date Format (Pandas Dataframe) - python

I have a pandas dataframe as follows:
Symbol Date
A 02/20/2015
A 01/15/2016
A 08/21/2015
I want to sort it by Date, but the column is just an object.
I tried to make the column a date object, but I ran into an issue where that format is not the format needed. The format needed is 2015-02-20, etc.
So now I'm trying to figure out how to have numpy convert the 'American' dates into the ISO standard, so that I can make them date objects, so that I can sort by them.
How would I convert these American dates into the ISO standard, or is there a more straightforward method I'm missing within pandas?

You can use pd.to_datetime() to convert to a datetime object. It takes a format parameter, but in your case I don't think you need it.
>>> import pandas as pd
>>> df = pd.DataFrame( {'Symbol':['A','A','A'] ,
'Date':['02/20/2015','01/15/2016','08/21/2015']})
>>> df
Date Symbol
0 02/20/2015 A
1 01/15/2016 A
2 08/21/2015 A
>>> df['Date'] =pd.to_datetime(df.Date)
>>> df.sort('Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
For future readers: DataFrame.sort() has since been removed from pandas, so use sort_values() instead:
>>> df.sort_values(by='Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
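If the automatic parsing ever guesses wrong (for example with day-first data), the format can be passed explicitly. The following is a minimal sketch, assuming every value is written month/day/four-digit year as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A", "A", "A"],
    "Date": ["02/20/2015", "01/15/2016", "08/21/2015"],
})

# Assumption: every value follows MM/DD/YYYY; anything else raises an error.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Sorting now compares real timestamps rather than strings.
sorted_df = df.sort_values(by="Date")
print(sorted_df)
```

Passing the format also makes the intent explicit to anyone reading the code later.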

The sort method has been deprecated and replaced with sort_values. After converting the column to datetime with df['Date'] = pd.to_datetime(df['Date']), sort it like this:
df.sort_values(by=['Date'])
Note: to sort in-place and/or in a descending order (the most recent first):
df.sort_values(by=['Date'], inplace=True, ascending=False)

@JAB's answer is fast and concise. But it changes the DataFrame you are trying to sort, which you may or may not want.
(Note: You almost certainly will want it, because your date columns should be dates, not strings!)
In the unlikely event that you don't want to change the dates into dates, you can also do it a different way.
First, get the index from your sorted Date column:
In [25]: pd.to_datetime(df.Date).order().index
Out[25]: Int64Index([0, 2, 1], dtype='int64')
Then use it to index your original DataFrame, leaving it untouched:
In [26]: df.ix[pd.to_datetime(df.Date).order().index]
Out[26]:
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
Magic!
Note: for pandas 0.20.0 and later, use loc instead of ix (deprecated in 0.20.0 and removed in 1.0) and sort_values() instead of order(), which has also been removed.
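On current pandas versions, where Series.order() and .ix are no longer available, the same idea can be sketched with sort_values() and loc. This is an illustration under the assumption that the dates are MM/DD/YYYY strings; the variable names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A", "A", "A"],
    "Date": ["02/20/2015", "01/15/2016", "08/21/2015"],
})

# Parse a temporary copy of the column and sort it; df itself is not modified.
order = pd.to_datetime(df["Date"], format="%m/%d/%Y").sort_values().index

# Reorder the original rows by label; the Date column still holds strings.
sorted_view = df.loc[order]
print(sorted_view)
```

The original frame keeps its string column, so this only makes sense when the strings really must be preserved.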

Since pandas 1.1.0 we have the key argument in DataFrame.sort_values. This way we can sort the dataframe by specifying a key, without modifying the original dataframe:
df.sort_values(by="Date", key=pd.to_datetime)
Symbol Date
0 A 02/20/2015
2 A 08/21/2015
1 A 01/15/2016

The data containing the date column can be read with the code below:
data = pd.read_csv(file_path, parse_dates=[date_column])
If the column still needs converting after the data is read, pd.to_datetime() accepts an explicit format:
data[date_column] = pd.to_datetime(data[date_column], format='%d/%m/%y')
Adjust the format string to match how the dates are written in your file.
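As a rough, self-contained illustration of parsing dates while reading a CSV, the sketch below uses an in-memory string in place of a real file. The column names and sample rows are assumptions borrowed from the question at the top of the page:

```python
import io

import pandas as pd

# Stand-in for a file on disk; a real path could be passed to read_csv instead.
csv_text = "Symbol,Date\nA,02/20/2015\nA,01/15/2016\nA,08/21/2015\n"

# parse_dates asks read_csv to convert the listed columns to datetime.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(data.dtypes)
print(data.sort_values(by="Date"))
```

If the column comes back as plain strings, the dates were probably not recognised, and an explicit pd.to_datetime call with a format string is the safer route.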

data['Date'] = data['Date'].apply(pd.to_datetime) # non-null datetime64[ns]

Related

Python cannot compare datetime and method

I tried to convert two columns to the same format, datetime in this case.
a['sale_date'] = pd.to_datetime(a['sale_date'])
a['last_date'] = pd.to_datetime(a['last'])
a[a.last_date>a.sale_date]
When I output the dtypes they both show up as the same:
sale_date datetime64[ns]
last_date datetime64[ns]
But I get an error from the comparison of sale_date with last that says:
Invalid comparison between dtype=datetime64[ns] and method
Does this mean they are different types? Why does this not show up when I use .dtypes? Visually the outputs look comparable.
last is the name of an existing pandas method. So, it is better to avoid using last as a column name. If you can't avoid it, then you have to select the column using square brackets.
a = pd.DataFrame({'sale_date': pd.date_range('2018-04-09', periods=4, freq='3D'),
'last': pd.date_range('2018-04-12', periods=4, freq='1D')})
a[a["last"] > a.sale_date]
# sale_date last
# 0 2018-04-09 2018-04-12
# 1 2018-04-12 2018-04-13
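Another option, when you control the column names, is to rename the column so that attribute access no longer collides with a DataFrame method. A small sketch follows; the new name last_date is only an example:

```python
import pandas as pd

a = pd.DataFrame({
    "sale_date": pd.date_range("2018-04-09", periods=4, freq="3D"),
    "last": pd.date_range("2018-04-12", periods=4, freq="1D"),
})

# Rename the clashing column; rename returns a new frame and leaves `a` unchanged.
renamed = a.rename(columns={"last": "last_date"})

# Attribute access is now unambiguous because no method is called last_date.
later = renamed[renamed.last_date > renamed.sale_date]
print(later)
```

Bracket access such as a["last"] always refers to the column, so it remains the safest habit for names you do not control.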

pandas: Extracting only the month-day values for rows from a column

I want to plot a line graph for my data however the x-axis becomes extremely tight together due to the long date format (Y-M-D), and I've checked the data type for 'date' and it returned:
In[200]: df['date'].dtypes
Out[200]: dtype('O')
So my 'date' values are:
date
----
2020-04-12
2020-05-13
2020-02-02
but I want to extract only the month and day to make the column look like
date
----
04-12
05-13
02-02
How should I do this? I apologise for dupes as I couldn't find anything similar due to my datatype being 'O'. Appreciate all the help!
Use Series.str.split and select the second element of each list by indexing with str[1]:
df['date'] = df['date'].str.split('-', n=1).str[1]
# if the column holds date objects rather than strings
#df['date'] = df['date'].astype(str).str.split('-', n=1).str[1]
print (df)
date
0 04-12
1 05-13
2 02-02
Or convert to datetimes by to_datetime with Series.dt.strftime:
df['date'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')
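Since the goal is a plot, it may be worth keeping the real datetime values for ordering and putting the short labels in a separate column. The sketch below assumes a hypothetical value column and a made-up month_day label column:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-04-12", "2020-05-13", "2020-02-02"],
    "value": [3, 5, 1],
})

# Keep a proper datetime column so the rows can be put in chronological order.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Separate string column holding only month and day, intended for axis labels.
df["month_day"] = df["date"].dt.strftime("%m-%d")
print(df)
```

Overwriting the date column with month-day strings loses the year, which can scramble the order if the data spans more than one year.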

Pandas giving me the wrong max date in a date time column?

I have a dataframe with a date column:
data['Date']
0 1/1/14
1 1/8/14
2 1/15/14
3 1/22/14
4 1/29/14
...
255 11/21/18
256 11/28/18
257 12/5/18
258 12/12/18
259 12/19/18
But, when I try to get the max date out of that column, I get:
test_data.Date.max()
'9/9/15'
Any idea why this would happen?
Clearly the column is of type object. You should try using pd.to_datetime() and then performing the max() aggregator:
data['Date'] = pd.to_datetime(data['Date'],errors='coerce') #You might need to pass format
print(data['Date'].max())
The .max() understands it as a date (like you want), if it is a datetime object. Building upon Seshadri's response, try:
type(data['Date'][1])
If it is a datetime object, this returns this:
pandas._libs.tslibs.timestamps.Timestamp
If not, you can make that column a datetime object like so:
data['Date'] = pd.to_datetime(data['Date'],format='%m/%d/%y')
The format argument makes sure you get the right formatting. See the full list of formatting options here in the python docs.
Your date may be stored as a string. First convert the column from string to datetime. Then, max() should work.
test = pd.DataFrame(['1/1/2010', '2/1/2011', '3/4/2020'], columns=['Dates'])
Dates
0 1/1/2010
1 2/1/2011
2 3/4/2020
pd.to_datetime(test['Dates'], format='%m/%d/%Y').max()
Timestamp('2020-03-04 00:00:00')
That timestamp can be cleaned up using .dt.date:
pd.to_datetime(test['Dates'], format='%m/%d/%Y').dt.date.max()
datetime.date(2020, 3, 4)
References: the strftime/strptime format code table in the Python docs, and pandas.to_datetime in the pandas docs.
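To see why the string column gives a surprising maximum, the comparison below contrasts the two behaviours on three values taken from the question. It is a sketch that assumes the M/D/YY layout shown there:

```python
import pandas as pd

data = pd.DataFrame({"Date": ["1/1/14", "9/9/15", "12/19/18"]})

# On strings, max() compares character by character, so "9..." beats "1...".
string_max = data["Date"].max()

# After parsing, max() compares actual points in time.
parsed = pd.to_datetime(data["Date"], format="%m/%d/%y")
date_max = parsed.max()

print(string_max)
print(date_max)
```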

Prevent Pandas to_json() from adding time component to date object

I have a dataframe of that contains some date objects. I need to convert to a json for use in JavaScript, which requires YYYY-MM-DD, but to_json() keeps adding a time component. I've seen a number of answers that convert to a string first, but this is part of a loop of about 15 queries each with many columns (simplified it for the SO question) and I don't want to hardcode each column conversion as there are a lot.
import pandas as pd
from datetime import date
df = pd.DataFrame(data=[[date(year=2018, month=1, day=1)]])
print(df.to_json(orient='records', date_format='iso', date_unit='s'))
Output:
[{"0":"2018-01-01T00:00:00Z"}]
Desired Output:
[{"0":"2018-01-01"}]
Pandas does not currently have this feature. There is an open issue about it; you may want to subscribe to that issue in case more options for the date_format argument are added in a future release (which seems like a reasonable feature request):
No way with to_json to write only date out of datetime #16492
Manually converting the relevant columns to string before dumping out json is likely the best option.
You could use strftime('%Y-%m-%d') format like so:
df = pd.DataFrame(data=[[date(year=2018, month=1, day=1).strftime('%Y-%m-%d')]])
print(df.to_json(orient='records', date_format='iso', date_unit='s'))
# [{"0":"2018-01-01"}]
I think this is the best approach for now until pandas adds a way to write only the date out of datetime.
Demo:
Source DF:
In [249]: df = pd.DataFrame({
...: 'val':np.random.rand(5),
...: 'date1':pd.date_range('2018-01-01',periods=5),
...: 'date2':pd.date_range('2017-12-15',periods=5)
...: })
In [250]: df
Out[250]:
date1 date2 val
0 2018-01-01 2017-12-15 0.539349
1 2018-01-02 2017-12-16 0.308532
2 2018-01-03 2017-12-17 0.788588
3 2018-01-04 2017-12-18 0.526541
4 2018-01-05 2017-12-19 0.887299
In [251]: df.dtypes
Out[251]:
date1 datetime64[ns]
date2 datetime64[ns]
val float64
dtype: object
You can cast datetime columns to strings in one command:
In [252]: df.update(df.loc[:, df.dtypes.astype(str).str.contains('date')].astype(str))
In [253]: df.dtypes
Out[253]:
date1 object
date2 object
val float64
dtype: object
In [254]: df.to_json(orient='records')
Out[254]: '[{"date1":"2018-01-01","date2":"2017-12-15","val":0.5393488718},{"date1":"2018-01-02","date2":"2017-12-16","val":0.3085324043},{"date1":"2018-01-03","date2":"2017-12-17","val":0.7885879674},{"date1":"2018-01-04","date2":"2017-12-18","val":0.5265407505},{"date1":"2018-01-05","date2":"2017-12-19","val":0.887298853}]'
Alternatively you can cast date columns to strings on the SQL side
I had that problem as well, but since I was looking only for the date, discarding the timezone, I was able to go around that using the following expression:
from datetime import datetime
df = pd.read_json('test.json')
df['date_hour'] = [datetime.strptime(date[0:10],'%Y-%m-%d').date() for date in df['date_hour']]
So, if you have 'iso' date_format for df[date_hour] in the json file = "2018-01-01T00:00:00Z" you may use this solution.
This way you can extract the part that really matters. Note that the list comprehension is required: datetime.strptime converts one string at a time, so calling it directly on a Series raises an error.
Generic solution would be as follows:
df.assign( **df.select_dtypes(['datetime']).astype(str).to_dict('list') ).to_json(orient="records")
Based on the dtype it selects the datetime columns and set these as str objects so the date format is kept during serialization.
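The same idea can be wrapped in a small helper so that it can be reused across many queries without naming each column. This is only a sketch: the function name dates_to_iso_strings is hypothetical, and it assumes the columns are real datetime64 columns rather than Python date objects:

```python
import json

import pandas as pd


def dates_to_iso_strings(frame):
    """Return a copy with every datetime64 column formatted as YYYY-MM-DD text."""
    out = frame.copy()
    for col in out.select_dtypes(include="datetime").columns:
        out[col] = out[col].dt.strftime("%Y-%m-%d")
    return out


df = pd.DataFrame({
    "date1": pd.date_range("2018-01-01", periods=2),
    "val": [1.0, 2.0],
})

# Parse the JSON text back only to make the result easy to inspect.
records = json.loads(dates_to_iso_strings(df).to_json(orient="records"))
print(records)
```

Because the helper returns a copy, the original frame keeps its datetime columns for any later calculations.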

Filtering Pandas DataFrames on dates

I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.
What is the best way to achieve this?
If date column is the index, then use .loc for label based indexing or .iloc for positional indexing.
For example:
df.loc['2014-01-01':'2014-02-01']
See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
If the column is not the index you have two choices:
Make it the index (either temporarily or permanently if it's time-series data), or
filter with a boolean mask: df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]
See here for the general explanation
Note: .ix is deprecated.
In my experience the previous answer is not correct: you can't pass it a plain string, it needs to be a datetime object. So:
import datetime
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]
And if your dates are already datetime objects from the datetime package, you can simply use:
df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]
To standardize your date strings using the datetime package, you can use this function:
import datetime
datetime.datetime.strptime
If you have already converted the string to a date format using pd.to_datetime you can just use:
df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
The shortest way to filter your dataframe by date:
Let's suppose your date column has dtype datetime64[ns]
# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']
# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']
# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
If your datetime column has a pandas datetime type (e.g. datetime64[ns]), then for proper filtering you need a pd.Timestamp object, for example:
from datetime import date
import pandas as pd
value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]
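Applied to the original question (keeping only rows within the next two months), the pd.Timestamp approach might look like the sketch below. The helper name within_next_months and its today parameter are made up for illustration; the fixed today value is there only to make the example reproducible:

```python
import pandas as pd


def within_next_months(frame, column, months=2, today=None):
    """Keep rows whose date lies between today and today plus `months` months."""
    start = pd.Timestamp(today) if today is not None else pd.Timestamp.today().normalize()
    end = start + pd.DateOffset(months=months)
    # between() is inclusive on both ends by default.
    return frame[frame[column].between(start, end)]


df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-02-20", "2024-04-05"]),
})

result = within_next_months(df, "date", today="2024-01-01")
print(result)
```

Using pd.DateOffset(months=2) follows calendar months, whereas a fixed number of days would drift depending on month length.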
If the dates are in the index then simply:
df['20160101':'20160301']
You can use pd.Timestamp to perform a query and a local reference
from datetime import datetime
import pandas as pd
import numpy as np
df = pd.DataFrame()
ts = pd.Timestamp
df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')
print(df)
print(df.query('date > @ts("20190515T071320")'))
with the output
date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
Have a look at the pandas documentation for DataFrame.query, specifically the mention of local variables referenced using the @ prefix. In this case we reference pd.Timestamp through the local alias ts so that we can supply a timestamp string.
So when loading the CSV data file, the date column now needs to be set as the index, as below, in order to filter data by a range of dates. This was not needed with the now-deprecated pd.DataFrame.from_csv().
If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:
import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True) # or its index number, e.g. index_col=[0]
mydata.loc['2020-01-01':'2020-02-29'] # will pull all the columns
# if you just need one column, e.g. Cost:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']
This has been tested working for Python 3.7. Hope you will find this useful.
I'm not allowed to write any comments yet, so I'll write an answer, if somebody will read all of them and reach this one.
If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:
df.loc[df.index.month == 3]
That will filter the dataset for you by March.
How about using pyjanitor
It has cool features.
After pip install pyjanitor
import janitor
df_filtered = df.filter_date(your_date_column_name, start_date, end_date)
You could just select the time range by doing: df.loc['start_date':'end_date']
In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case
df.loc['2021-08-01':'2021-08-31']
returned empty. Whereas
df.loc['2021-08-31':'2021-08-01']
returned the expected data.
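The behaviour described above appears to come from slicing a descending index in ascending order. One way to sidestep it, sketched here with made-up data, is to sort the index before slicing:

```python
import pandas as pd

# Index deliberately in descending order, as in the situation described above.
idx = pd.to_datetime(["2021-09-05", "2021-08-20", "2021-08-03", "2021-07-15"])
df = pd.DataFrame({"value": [4, 3, 2, 1]}, index=idx)

# sort_index() returns a new frame in ascending order, so the usual
# start:end slice works regardless of how the data was loaded.
august = df.sort_index().loc["2021-08-01":"2021-08-31"]
print(august)
```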
Another solution if you would like to use the .query() method.
It lets you write readable code like .query(f"'{start}' < MyDate < '{end}'"). The trade-off is that .query() parses strings, so the column values must be in pandas datetime format for .query() to understand the comparison.
import datetime
import pandas as pd
df = pd.DataFrame({
'MyValue': [1,2,3],
'MyDate': pd.to_datetime(['2021-01-01','2021-01-02','2021-01-03'])
})
start = datetime.date(2021,1,1).strftime('%Y%m%d')
end = datetime.date(2021,1,3).strftime('%Y%m%d')
# quote the values so that query() reads them as date strings, not integers
df.query(f"'{start}' < MyDate < '{end}'")
(following the comment from @Phillip Cloud, answer from @Retozi)
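A variation that avoids building the query text with string formatting is to reference local variables with the @ prefix. The sketch below assumes the bounds are held as pd.Timestamp objects named start and end:

```python
import pandas as pd

df = pd.DataFrame({
    "MyValue": [1, 2, 3],
    "MyDate": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
})

start = pd.Timestamp("2021-01-01")
end = pd.Timestamp("2021-01-03")

# @start and @end refer to the Python variables defined above.
between = df.query("@start < MyDate < @end")
print(between)
```

Both comparisons are strict, so rows falling exactly on either bound are excluded.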
import the pandas library
import pandas as pd
STEP 1: convert the date column to datetime using the pd.to_datetime() method (unit='s' assumes the values are Unix timestamps in seconds)
df['date']=pd.to_datetime(df["date"],unit='s')
STEP 2: perform the filtering in any predetermined manner ( i.e 2 months)
df = df[(df["date"] > "2022-03-01") & (df["date"] < "2022-05-03")]
STEP 3 : Check the output
print(df)
# 60 days from today, as a pandas Timestamp at midnight
after_60d = pd.Timestamp('today').normalize() + pd.Timedelta(days=60)
# keep rows whose date falls before that cutoff
df[df['date_col'] < after_60d]
