I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.
What is the best way to achieve this?
If the date column is the index, then use .loc for label-based indexing or .iloc for positional indexing.
For example:
df.loc['2014-01-01':'2014-02-01']
See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
If the column is not the index you have two choices:
Make it the index (either temporarily or permanently if it's time-series data)
df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]
See here for the general explanation
Note: .ix is deprecated.
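For the "next two months" case in the question, a minimal sketch of the boolean-mask approach could look like this. The sample data and the fixed reference date are assumptions made so the example is reproducible; in practice you would probably start from pd.Timestamp.today().normalize().

```python
import pandas as pd

# Hypothetical sample data; 'date' matches the column name in the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-02-20", "2024-03-25", "2023-12-31"]),
    "value": [1, 2, 3, 4],
})

# Fixed reference date (an assumption, for reproducibility).
start = pd.Timestamp("2024-01-01")
# DateOffset is calendar-aware, so this lands on 2024-03-01.
end = start + pd.DateOffset(months=2)

# Keep rows with start <= date < end.
mask = (df["date"] >= start) & (df["date"] < end)
next_two_months = df[mask]
print(next_two_months)
```

Using a half-open interval (>= start, < end) avoids counting the boundary day twice if you later chain several windows.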
The previous answer is not correct in my experience: you can't pass it a simple string, it needs to be a datetime object. So:
import datetime
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]
And if your dates are standardized by importing datetime package, you can simply use:
df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]
To standardize your date strings using the datetime package, you can use this function:
import datetime
datetime.datetime.strptime
If you have already converted the string to a date format using pd.to_datetime you can just use:
df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
The shortest way to filter your dataframe by date:
Let's suppose your date column is of type datetime64[ns].
# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']
# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']
# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
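As an alternative to formatting every value as a string, the same three filters can probably be written with the .dt accessor's numeric fields. This is only a sketch on made-up data, not a claim that it is the better approach for every case:

```python
import pandas as pd

# Hypothetical sample data with a datetime64 'date' column.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-01-01 08:30", "2014-01-15 00:00",
        "2014-02-01 00:00", "2015-01-01 00:00",
    ])
})

# Single day: drop the time part before comparing.
single_day = df[df["date"].dt.normalize() == pd.Timestamp("2014-01-01")]

# Single month: compare year and month as integers.
single_month = df[(df["date"].dt.year == 2014) & (df["date"].dt.month == 1)]

# Single year.
single_year = df[df["date"].dt.year == 2014]
```

Integer comparisons avoid building an intermediate column of strings, which tends to matter on large frames.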
If your datetime column has the Pandas datetime type (e.g. datetime64[ns]), you need a pd.Timestamp object for proper filtering, for example:
from datetime import date
import pandas as pd
value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]
If the dates are in the index then simply:
df['20160101':'20160301']
You can use pd.Timestamp to perform a query and a local reference
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame()
ts = pd.Timestamp
df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')
print(df)
print(df.query('date > @ts("20190515T071320")'))
with the output
date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
Have a look at the pandas documentation for DataFrame.query, specifically the mention of local variables referenced using the @ prefix. In this case we reference pd.Timestamp through the local alias ts so that we can supply a timestamp string.
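If calling a function inside the query string feels awkward, a simpler variant is to build the Timestamp first and reference that variable with @. A small sketch, with data made up to mirror the output above:

```python
import pandas as pd

# Ten consecutive seconds, mirroring the printed output above (hypothetical data).
df = pd.DataFrame({
    "date": pd.date_range("2019-05-15 07:13:16", periods=10,
                          freq=pd.Timedelta(seconds=1))
})

# Local variable referenced from the query string via the @ prefix.
cutoff = pd.Timestamp("2019-05-15 07:13:20")
later = df.query("date > @cutoff")
print(later)
```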
So when loading the csv data file, we'll need to set the date column as index now as below, in order to filter data based on a range of dates. This was not needed for the now deprecated method: pd.DataFrame.from_csv().
If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:
import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True) # or its index number, e.g. index_col=[0]
mydata.loc['2020-01-01':'2020-02-29'] # will pull all the columns
# if you just need one column, e.g. Cost:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']
This has been tested working for Python 3.7. Hope you will find this useful.
I'm not allowed to write any comments yet, so I'll write an answer, if somebody will read all of them and reach this one.
If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:
df.loc[df.index.month == 3]
That will filter the dataset for you by March.
How about using pyjanitor
It has cool features.
After pip install pyjanitor
import janitor
df_filtered = df.filter_date(your_date_column_name, start_date, end_date)
You could just select the time range by doing: df.loc['start_date':'end_date']
In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case
df.loc['2021-08-01':'2021-08-31']
returned empty. Whereas
df.loc['2021-08-31':'2021-08-01']
returned the expected data.
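One way to sidestep the ordering question, assuming you don't need to keep the descending order, is to sort the index before slicing. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame whose DatetimeIndex is in descending order.
idx = pd.date_range("2021-07-01", "2021-09-30", freq="D")[::-1]
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Sort ascending first, then slice with start <= end as usual.
df_sorted = df.sort_index()
august = df_sorted.loc["2021-08-01":"2021-08-31"]
print(len(august))
```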
Another solution if you would like to use the .query() method.
It lets you write readable code like .query(f"{start} < MyDate < {end}"), with the trade-off that .query() parses strings and the column values must be in pandas datetime format (so that .query() can interpret them).
df = pd.DataFrame({
'MyValue': [1,2,3],
'MyDate': pd.to_datetime(['2021-01-01','2021-01-02','2021-01-03'])
})
start = datetime.date(2021,1,1).strftime('%Y%m%d')
end = datetime.date(2021,1,3).strftime('%Y%m%d')
df.query(f"{start} < MyDate < {end}")
(following the comment from @Phillip Cloud, answer from @Retozi)
import the pandas library
import pandas as pd
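STEP 1: convert the date column to datetime using the pd.to_datetime() method (unit='s' applies only when the column holds Unix epoch seconds; drop it for date strings)
df['date']=pd.to_datetime(df["date"],unit='s')
STEP 2: perform the filtering over whatever range you need (e.g. 2 months); each comparison needs its own parentheses because & binds more tightly than > and <
df = df[(df["date"] > "2022-03-01") & (df["date"] < "2022-05-03")]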
STEP 3 : Check the output
print(df)
# 60 days from today, as a pd.Timestamp so it compares cleanly with a datetime64 column
after_60d = pd.Timestamp.today().normalize() + pd.Timedelta(days=60)
# keep rows whose date is earlier than 60 days from now
df[df['date_col'] < after_60d]
Related
I have the following problem. I want to create a date from another. To do this, I extract the year from the database date and then create the chosen date (day = 30 and month = 9) being the year extracted from the database.
The code is the following
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
But error message is this
"cannot convert the series to <class 'int'>"
I think dt means datetime, so the line 'dt.datetime(y,m,d)' creates a datetime object.
Should bbdd20Q3['mydate'] hold an int?
If so, try to think of another way to store the date (8 digits, maybe).
Hope I helped :)
I assume that you did import datetime as dt then by doing:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
You are passing a Series as the first argument to datetime.datetime, when it expects an int or something that can be converted to an int. You should create one datetime.datetime for each element of the Series, not a single datetime.datetime. Consider the following example:
import datetime
import pandas as pd
df = pd.DataFrame({"year":[2001,2002,2003]})
df["day"] = df["year"].apply(lambda x:datetime.datetime(x,9,30))
print(df)
Output:
year day
0 2001 2001-09-30
1 2002 2002-09-30
2 2003 2003-09-30
Here's a sample code with the required logic -
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['2019-12-14', '2020-12-15']})
print(df.dtypes)
# convert the date in string format to datetime object,
# if the date column(Series) is already a datetime object then this is not required
df['date'] = pd.to_datetime(df['date'])
print(f'after conversion \n {df.dtypes}')
# logic to create a new data column
df['new_date'] = pd.to_datetime({'year':df['date'].dt.year,'month':9,'day':30})
@eollon I see that you are also new to Stack Overflow. It would be better if you could add a simple code sample that others can try out independently
(keeping the comment here since I don't have permission to comment :) )
I am working on a code that takes hourly data for a month and groups it into 24 hour sums. My problem is that I would like the index to read the date/year and I am just getting an index of 1-30.
The code I am using is
df = df.iloc[:,16:27].groupby([lambda x: x.day]).sum()
example of output I am getting
DateTime data
1 1772.031568
2 19884.42243
3 28696.72159
4 24906.20355
5 9059.120325
example of output I would like
DateTime data
1/1/2017 1772.031568
1/2/2017 19884.42243
1/3/2017 28696.72159
1/4/2017 24906.20355
1/5/2017 9059.120325
This is an old question, but I don't think the accepted solution is the best in this particular case. What you want to accomplish is to down sample time series data, and Pandas has built-in functionality for this called resample(). For your example you will do:
df = df.iloc[:,16:27].resample('D').sum()
or if the datetime column is not the index
df = df.iloc[:,16:27].resample('D', on='datetime_column_name').sum()
There are (at least) 2 benefits from doing it this way as opposed to accepted answer:
Resample can up sample and down sample, groupby() can only down sample
No lambdas, list comprehensions or date formatting functions required.
For more information and examples, see documentation here: resample()
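To make the resample() suggestion concrete, here is a small self-contained sketch. The hourly data is made up (three days of ones), so the daily sums are easy to check by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data: 72 hours starting 2017-01-01, every value 1.0.
idx = pd.date_range("2017-01-01", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"data": np.ones(72)}, index=idx)

# Downsample to one row per calendar day; the index keeps the full date.
daily = df.resample("D").sum()
print(daily)
```

The resulting index holds real dates (2017-01-01, 2017-01-02, ...) rather than the bare day numbers that groupby on x.day produces.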
If your index is a datetime, you can build a combined groupby clause:
df = df.iloc[:,16:27].groupby([lambda x: "{}/{}/{}".format(x.day, x.month, x.year)]).sum()
or even better:
df = df.iloc[:,16:27].groupby([lambda x: x.strftime("%d%m%Y")]).sum()
If your index is not a datetime object:
import pandas as pd
df = pd.DataFrame({'data': [1772.031568, 19884.42243,28696.72159, 24906.20355,9059.120325]},index=[1,2,3,4,5])
print(df.head())
rng = pd.date_range('1/1/2017',periods =len(df.index), freq='D')
df.set_index(rng,inplace=True)
print(df.head())
will result in
data
1 1772.031568
2 19884.422430
3 28696.721590
4 24906.203550
5 9059.120325
data
2017-01-01 1772.031568
2017-01-02 19884.422430
2017-01-03 28696.721590
2017-01-04 24906.203550
2017-01-05 9059.120325
First you need to create an index on your datetime column to expose functions that break the datetime into smaller pieces efficiently (like the year and month of the datetime).
Next, you need to group by the year, month and day of the index if you want to apply an aggregate method (like sum()) to each day of the year, and retain separate aggregations for each day.
The reset_index() and rename() functions allow us to rename our group_by categories to their original names. This "flattens" out our data, making the category an actual column on the resulting dataframe.
import pandas as pd
date_index = pd.DatetimeIndex(df.created_at)
# 'df.created_at' is the datetime column in your dataframe
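counted = df.groupby([date_index.year.to_numpy(), date_index.month.to_numpy(), date_index.day.to_numpy()])\
.agg({'column_to_sum': 'sum'})\
.reset_index()\
.rename(columns={'level_0': 'year',
'level_1': 'month',
'level_2': 'day'})
# Passing plain arrays leaves the group levels unnamed, so reset_index() labels them level_0, level_1, level_2
# Resulting dataframe has columns "year", "month", "day", "column_to_sum" available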
You can exploit pandas' DatetimeIndex:
working_df=df.iloc[:, 16:27]
result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date).sum()
This works if your DateTime column actually holds datetimes (and be careful with the timezone).
In this way you will have valid datetime in the index, so that you can easily do other manipulations.
I'm trying to delete rows of a dataframe based on one date column; [Delivery Date]
I need to delete rows which are older than 6 months old but not equal to the year '1970'.
I've created 2 variables:
from datetime import date, timedelta
sixmonthago = date.today() - timedelta(188)
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
but I don't know how to delete rows based on these two variables, using the [Delivery Date] column.
Could anyone provide the correct solution?
You can just filter them out:
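df[(df['Delivery Date'].dt.year == 1970) | (df['Delivery Date'] >= pd.Timestamp(sixmonthago))]
This returns all rows where the year is 1970 or the date is less than 6 months old. (Wrapping the date in pd.Timestamp keeps the comparison with a datetime64 column valid.)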
You can use boolean indexing and pass multiple conditions to filter the df, for multiple conditions you need to use the array operators so | instead of or, and parentheses around the conditions due to operator precedence.
Check the docs for an explanation of boolean indexing
Be sure the calculation itself is accurate for "6 months" prior. You may not want to be hardcoding in 188 days. Not all months are made equally.
from datetime import date
from dateutil.relativedelta import relativedelta
#http://stackoverflow.com/questions/546321/how-do-i-calculate-the-date-six-months-from-the-current-date-using-the-datetime
six_months = date.today() - relativedelta( months = +6 )
Then you can apply the following logic.
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
df = df[(df['Delivery Date'].dt.year == nineteen_seventy.tm_year) | (df['Delivery Date'] >= pd.Timestamp(six_months))]
If you would rather drop the unwanted rows explicitly (those older than six months and not from 1970), you can do the following:
df = df.drop(df[(df['Delivery Date'].dt.year != nineteen_seventy.tm_year) & (df['Delivery Date'] < pd.Timestamp(six_months))].index)
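If you prefer to stay inside pandas, pd.DateOffset(months=6) gives a calendar-aware "six months ago" without dateutil. The following is a sketch on made-up data with a fixed "today" (an assumption, so the result is reproducible):

```python
import pandas as pd

# Hypothetical sample; 'Delivery Date' matches the column in the question.
df = pd.DataFrame({
    "Delivery Date": pd.to_datetime(
        ["1970-01-01", "2023-01-15", "2024-05-01", "2024-06-10"]),
    "order": ["a", "b", "c", "d"],
})

# Fixed reference date instead of pd.Timestamp.today(), for reproducibility.
today = pd.Timestamp("2024-06-30")
cutoff = today - pd.DateOffset(months=6)  # 2023-12-30

# Keep 1970 placeholder rows and anything on or after the cutoff.
is_placeholder = df["Delivery Date"].dt.year == 1970
is_recent = df["Delivery Date"] >= cutoff
kept = df[is_placeholder | is_recent]
print(kept)
```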
I have a dataframe full of dates and I would like to select all dates where the month==12 and the day==25 and replace the zero in the xmas column with a 1.
Anyway to do this? the second line of my code errors out.
df = DataFrame({'date':[datetime(2013,1,1).date() + timedelta(days=i) for i in range(0,365*2)], 'xmas':np.zeros(365*2)})
df[df['date'].month==12 and df['date'].day==25] = 1
Pandas Series with datetime now behaves differently. See .dt accessor.
This is how it should be done now:
df.loc[(df['date'].dt.day==25) & (df['date'].dt.month==12), 'xmas'] = 1
Basically what you tried won't work because you need to use & to compare arrays, and you need parentheses due to operator precedence. On top of this you should use loc to perform the indexing. On current pandas the month and day also have to be read through the .dt accessor of a datetime Series:
dates = pd.to_datetime(df['date'])
df.loc[(dates.dt.month==12) & (dates.dt.day==25), 'xmas'] = 1
An update was needed in reply to this question. As of today, there's a slight difference in how you extract months from datetime objects in a pd.Series.
So from the very start, in case you have a raw date column, first convert it to datetime objects by using a simple function:
import datetime as dt
def read_as_datetime(str_date):
# replace %Y-%m-%d with your own date format
return dt.datetime.strptime(str_date,'%Y-%m-%d')
then apply this function to your dates column and save results in a new column namely datetime:
df['datetime'] = df.dates.apply(read_as_datetime)
finally, in order to extract dates by day and month, use the same piece of code that @Shayan RC explained, with this slight change; notice the .dt accessor after the datetime column:
df.loc[(df['datetime'].dt.month==12) & (df['datetime'].dt.day==25),'xmas'] = 1
I have a pandas dataframe as follows:
Symbol Date
A 02/20/2015
A 01/15/2016
A 08/21/2015
I want to sort it by Date, but the column is just an object.
I tried to make the column a date object, but I ran into an issue where that format is not the format needed. The format needed is 2015-02-20, etc.
So now I'm trying to figure out how to have numpy convert the 'American' dates into the ISO standard, so that I can make them date objects, so that I can sort by them.
How would I convert these american dates into ISO standard, or is there a more straight forward method I'm missing within pandas?
You can use pd.to_datetime() to convert to a datetime object. It takes a format parameter, but in your case I don't think you need it.
>>> import pandas as pd
>>> df = pd.DataFrame( {'Symbol':['A','A','A'] ,
'Date':['02/20/2015','01/15/2016','08/21/2015']})
>>> df
Date Symbol
0 02/20/2015 A
1 01/15/2016 A
2 08/21/2015 A
>>> df['Date'] =pd.to_datetime(df.Date)
>>> df.sort('Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
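If you would rather not rely on format inference, you can spell the American month/day/year layout out explicitly. A minimal sketch using the data from the question and the current sort_values method:

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A", "A", "A"],
    "Date": ["02/20/2015", "01/15/2016", "08/21/2015"],
})

# %m/%d/%Y states month-first explicitly, so 02/03/2015 can never be read as 2 March.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df_sorted = df.sort_values(by="Date")
print(df_sorted)
```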
For future search, you can change the sort statement:
>>> df.sort_values(by='Date') # This now sorts in date order
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
The sort method has been deprecated and replaced with sort_values. After converting to datetime objects using df['Date']=pd.to_datetime(df['Date']):
df.sort_values(by=['Date'])
Note: to sort in-place and/or in a descending order (the most recent first):
df.sort_values(by=['Date'], inplace=True, ascending=False)
@JAB's answer is fast and concise. But it changes the DataFrame you are trying to sort, which you may or may not want.
(Note: You almost certainly will want it, because your date columns should be dates, not strings!)
In the unlikely event that you don't want to change the dates into dates, you can also do it a different way.
First, get the index from your sorted Date column:
In [25]: pd.to_datetime(df.Date).sort_values().index
Out[25]: Int64Index([0, 2, 1], dtype='int64')
Then use it to index your original DataFrame, leaving it untouched:
In [26]: df.loc[pd.to_datetime(df.Date).sort_values().index]
Out[26]:
Date Symbol
0 2015-02-20 A
2 2015-08-21 A
1 2016-01-15 A
Magic!
Note: older versions of this answer used Series.order() and .ix. Both have since been removed from pandas (ix was deprecated in 0.20.0), so use sort_values() and loc as shown.
Since pandas >= 1.0.0 we have the key argument in DataFrame.sort_values. This way we can sort the dataframe by specifying a key and without adjusting the original dataframe:
df.sort_values(by="Date", key=pd.to_datetime)
Symbol Date
0 A 02/20/2015
2 A 08/21/2015
1 A 01/15/2016
Data containing a date column can be read with the code below:
data = pd.read_csv(file_path, parse_dates=[date_column])
Once the data has been read, the date column can also be converted explicitly with pd.to_datetime(), like:
pd.to_datetime(data[date_column], format='%d/%m/%y')
to parse the dates in whatever format is required.
data['Date'] = data['Date'].apply(pd.to_datetime) # non-null datetime64[ns]
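As a runnable illustration of the read-then-convert steps, here is a sketch that reads an in-memory CSV. The file contents and the day-first two-digit-year format are assumptions made up for the example:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "Date,Cost\n20/02/15,10\n15/01/16,20\n21/08/15,30\n"

data = pd.read_csv(io.StringIO(csv_text))

# Day-first dates with a two-digit year, parsed with an explicit format.
data["Date"] = pd.to_datetime(data["Date"], format="%d/%m/%y")
print(data.sort_values(by="Date"))
```

Calling pd.to_datetime on the whole column at once is generally preferable to .apply(pd.to_datetime), which converts one value at a time.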