I have a pandas DataFrame with cumulative daily returns for a series:
Date CumReturn
3/31/2017 1
4/3/2017 .99
4/4/2017 .992
... ...
4/28/2017 1.012
5/1/2017 1.011
... ...
5/31/2017 1.022
... ...
6/30/2017 1.033
... ...
I want only the month-end values.
Date CumReturn
4/28/2017 1.012
5/31/2017 1.022
6/30/2017 1.033
Because I want only the month-end values, resampling doesn't work as it aggregates the interim values.
What is the easiest way to get only the month end values as they appear in the original dataframe?
Use the is_month_end component of the .dt date accessor:
# Ensure the date column is a Timestamp
df['Date'] = pd.to_datetime(df['Date'])
# Filter to end of the month only
df = df[df['Date'].dt.is_month_end]
Applying this to the data you provided:
Date CumReturn
0 2017-03-31 1.000
5 2017-05-31 1.022
6 2017-06-30 1.033
EDIT
Note that 2017-04-28 is dropped above because April's calendar month end is the 30th. To treat the last business day of each month as the month end instead, compare against BMonthEnd(0):
from pandas.tseries.offsets import BMonthEnd
# Ensure the date column is a Timestamp
df['Date'] = pd.to_datetime(df['Date'])
# Filter to end of the month only
df = df[df['Date'] == df['Date'] + BMonthEnd(0)]
Applying this to the data you provided:
Date CumReturn
0 2017-03-31 1.000
3 2017-04-28 1.012
5 2017-05-31 1.022
6 2017-06-30 1.033
# Sort by date, then keep the last row within each (year, month) group
df.sort_values('Date').groupby([df.Date.dt.year, df.Date.dt.month]).last()
Out[197]:
Date CumReturn
Date Date
2017 3 2017-03-31 1.000
4 2017-04-28 1.012
5 2017-05-31 1.022
6 2017-06-30 1.033
Assuming that the dataframe is already sorted by 'Date' and that the values in that column are Pandas timestamps, you can convert them to YYYY-mm string values for grouping and take the last value:
df.groupby(df['Date'].dt.strftime('%Y-%m'))['CumReturn'].last()
# Example output:
# 2017-01 0.127002
# 2017-02 0.046894
# 2017-03 0.005560
# 2017-04 0.150368
I created a pandas df with columns named start_date and current_date. Both columns have a dtype of datetime64[ns]. What's the best way to find the number of business days between the start_date and current_date columns?
I've tried:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = len(pd.date_range(start=projects_df['start_date'], end=projects_df['current_date'], freq=us_bd))
I get the following error message:
Cannot convert input....start_date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
I'm using Python version 3.10.4.
pd.date_range's parameters need to be datetimes, not series.
For this reason, we can use df.apply to apply the function to each row.
In addition, pandas has bdate_range which is just date_range with freq defaulting to business days, which is exactly what you need.
Using apply and a lambda function, we can create a new Series calculating business days between each start and current date for each row.
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = projects_df.apply(lambda row: len(pd.bdate_range(row['start_date'], row['current_date'])), axis=1)
Using a random sample of 10 date pairs, my output is the following:
start_date current_date days_count
0 2022-01-03 17:08:04 2022-05-20 00:53:46 100
1 2022-04-18 09:43:02 2022-06-10 16:56:16 40
2 2022-09-01 12:02:34 2022-09-25 14:59:29 17
3 2022-04-02 14:24:12 2022-04-24 21:05:55 15
4 2022-01-31 02:15:46 2022-07-02 16:16:02 110
5 2022-08-02 22:05:15 2022-08-17 17:25:10 12
6 2022-03-06 05:30:20 2022-07-04 08:43:00 86
7 2022-01-15 17:01:33 2022-08-09 21:48:41 147
8 2022-06-04 14:47:53 2022-12-12 18:05:58 136
9 2022-02-16 11:52:03 2022-10-18 01:30:58 175
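If you also need to skip US federal holidays, as in the original attempt, the same row-wise pattern should work with the custom business-day offset passed as the frequency. A minimal sketch, reusing the question's own setup:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
# Count business days per row, excluding weekends and US federal holidays
projects_df['days_count'] = projects_df.apply(
    lambda row: len(pd.date_range(row['start_date'], row['current_date'], freq=us_bd)),
    axis=1)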
I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains a value recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and mean of value for each hour of each date, and create two new columns for them. I first tried separating the hour for each value like:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that doesn't work. Could I have some advice on the problem?
We can split the time column on the delimiter :, take the hour component with .str[0], then group the dataframe on date along with the hour component and aggregate the value column with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
If you want to broadcast the aggregated values back to the original dataframe, use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151
I have synthesized data. Start by generating a true datetime column, groupby() the hour, use describe() to get the mean & std, then merge() back to the original data frame.
d = pd.date_range("1-Jan-2019", "31-Dec-2019", freq="10S")
df = pd.DataFrame({"datetime":d, "value":np.random.uniform(70,90,len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
.apply(lambda dfa: dfa.describe().T.loc[:,["mean","std"]].reset_index(drop=True))
.droplevel(1)
)
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007
I have a df column with dates and hours / minutes:
0 2019-09-13 06:00:00
1 2019-09-13 06:05:00
2 2019-09-13 06:10:00
3 2019-09-13 06:15:00
4 2019-09-13 06:20:00
Name: Date, dtype: datetime64[ns]
I need to count how many days the dataframe contains.
I tried it like this:
sample_length = len(df.groupby(df['Date'].dt.date).first())
and
sample_length = len(df.groupby(df['Date'].dt.date))
But the number I get seems wrong. Do you know another method of counting the days?
Consider the sample dates:
sample = pd.date_range('2019-09-12 06:00:00', periods=50, freq='4h')
df = pd.DataFrame({'date': sample})
date
0 2019-09-12 06:00:00
1 2019-09-12 10:00:00
2 2019-09-12 14:00:00
3 2019-09-12 18:00:00
4 2019-09-12 22:00:00
5 2019-09-13 02:00:00
6 2019-09-13 06:00:00
...
47 2019-09-20 02:00:00
48 2019-09-20 06:00:00
49 2019-09-20 10:00:00
Use DataFrame.groupby to group the dataframe on df['date'].dt.date and aggregate with GroupBy.size:
count = df.groupby(df['date'].dt.date).size()
# print(count)
date
2019-09-12 5
2019-09-13 6
2019-09-14 6
2019-09-15 6
2019-09-16 6
2019-09-17 6
2019-09-18 6
2019-09-19 6
2019-09-20 3
dtype: int64
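If what you actually need is the total number of days rather than the counts per day, the length of that result gives it directly; Series.nunique is equivalent:
sample_length = len(count)  # 9 for the sample above
# or: df['date'].dt.date.nunique()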
I'm not completely sure what you want to do here. Do you want to count the number of unique days (Monday/Tuesday/...), monthly dates (1-31 ish), yearly dates (1-365), or unique dates (unique days since the dawn of time)?
From a pandas series, you can use {series}.value_counts() to get the number of entries for each unique value, or simply get all unique values with {series}.unique()
import pandas as pd
df = pd.DataFrame(pd.DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
'2015-01-24 10:11:04', '2015-03-26 16:23:53',
'2017-04-01 00:38:21', '2015-03-19 03:47:54',
'2015-12-30 07:32:32', '2015-11-10 20:39:36',
'2015-06-24 05:48:09', '2015-03-19 16:05:19'],
dtype='datetime64[ns]', freq=None), columns = ["date"])
days (Monday/Tuesday/...):
df.date.dt.dayofweek.value_counts()
monthly dates (1-31 ish)
df.date.dt.day.value_counts()
yearly dates (1-365)
df.date.dt.dayofyear.value_counts()
unique dates (unique days since the dawn of time)
df.date.dt.date.value_counts()
To get the number of unique entries from any of the above, simply add .shape[0]
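For instance, counting the unique calendar dates in the example frame above:
df.date.dt.date.value_counts().shape[0]  # 9, since 2015-03-19 appears twice; same as df.date.dt.date.nunique()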
In order to calculate the total number of unique dates in the given time-series example, we can use:
print(len(pd.to_datetime(df['Date']).dt.date.unique()))
import pandas as pd
df = pd.DataFrame({'Date': ['2019-09-13 06:00:00',
'2019-09-13 06:05:00',
'2019-09-13 06:10:00',
'2019-09-13 06:15:00',
'2019-09-13 06:20:00']
},
dtype = 'datetime64[ns]'
)
df = df.set_index('Date')
_count_of_days = df.resample('D').first().shape[0]
print(_count_of_days)
I'm curious as to what last() and first() does in this specific instance (when chained to a resampling). Correct me if I'm wrong, but I understand if you pass arguments into first and last, e.g. 3; it returns the first 3 months or first 3 years.
In this circumstance, since I'm not passing any arguments into first() and last(), what is it actually doing when I'm resampling it like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what is happening when I'm using last()?
More importantly, why does first() and last() give me different answers in this context? I see that numerically they are not equal.
i.e. post2008.resample().first() != post2008.resample().last()
TLDR:
What does .first() and .last() do?
What does .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly is like when it's post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly is like when it's post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6
Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
'2015-04-01', '2015-07-01', '2015-10-01',
'2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-10-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above consists of just two days:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.
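For instance, running first on the same monthly resample reproduces the values above:
print(df.resample('M').first().head())
VALUE
2014-07-31 1000.0
2014-08-31 NaN
2014-09-30 NaN
2014-10-31 2000.0
2014-11-30 NaN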
I have a dataset containing monthly observations of a time-series.
What I want to do is transform the datetime to a year/quarter format and then relabel the first value, DATE[0], as the previous quarter. For example, 2006-10-31 belongs to Q4 of 2006, but I want to change it to 2006Q3.
For the extraction of the subsequent values I will just use the last value from each quarter.
So, for 2006Q4 I will keep BBGN, SSD, and QQ4567 values only from DATE[2]. Similarly, for 2007Q1 I will keep only DATE[5] values, and so forth.
Original dataset:
DATE BBGN SSD QQ4567
0 2006-10-31 00:00:00 1.210 22.022 9726.550
1 2006-11-30 00:00:00 1.270 22.060 9891.008
2 2006-12-31 00:00:00 1.300 22.080 10055.466
3 2007-01-31 00:00:00 1.330 22.099 10219.924
4 2007-02-28 00:00:00 1.393 22.110 10350.406
5 2007-03-31 00:00:00 1.440 22.125 10480.888
After processing the DATE
DATE BBGN SSD QQ4567
0 2006Q3 1.210 22.022 9726.550
2 2006Q4 1.300 22.080 10055.466
5 2007Q1 1.440 22.125 10480.888
The steps I have taken so far are:
Turn the values from the yyyy-mm-dd hh format to yyyyQQ format
DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q')
and I get this
DATE BBGN SSD QQ4567
0 2006Q4 1.210 22.022 9726.550
1 2006Q4 1.270 22.060 9891.008
2 2006Q4 1.300 22.080 10055.466
3 2007Q1 1.330 22.099 10219.924
4 2007Q1 1.393 22.110 10350.406
5 2007Q1 1.440 22.125 10480.888
The next step is to extract the last values from each quarter. But because I always want to keep the first row I will exclude DATE[0] from the function.
quarterDF = DF.iloc[1:,].drop_duplicates(subset='DATE', keep='last')
Now, my question is: how can I change the value in DATE[0] to always be the previous quarter, so from 2006Q4 to 2006Q3? Also, how will this work if DATE[0] is 2007Q1? Can I change it to 2006Q4?
My suggestion would be to create a new column with the date shifted 3 months into the past, like this:
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.to_datetime(['2006-10-31', '2007-01-31'])
one_quarter = pd.tseries.offsets.DateOffset(months=3)
df['Last_quarter'] = df.Date - one_quarter
This will give you
Date Last_quarter
0 2006-10-31 2006-07-31
1 2007-01-31 2006-10-31
Then you can do the same process as you described above on Last_quarter.
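For instance, one way to read "the same process" is to shift the raw dates back a quarter, convert to quarterly periods, and keep the last row per period. A sketch against the question's DF, assuming DATE is still a raw datetime at this point:
DF['DATE'] = pd.to_datetime(DF['DATE'])
# Shift back one quarter, then label each row with its (shifted) quarter
DF['Last_quarter'] = (DF['DATE'] - pd.tseries.offsets.DateOffset(months=3)).dt.to_period('Q')
quarterDF = DF.drop_duplicates(subset='Last_quarter', keep='last')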
Here is a pivot_table approach
# Subtract a quarter from the date and save it in a column
df['Q'] = df['DATE'] - pd.tseries.offsets.QuarterEnd()
#0 2006-09-30
#1 2006-09-30
#2 2006-09-30
#3 2006-12-31
#4 2006-12-31
#5 2006-12-31
#Name: Q, dtype: datetime64[ns]
# Drop DATE and Q, then pivot with the shifted quarter (as a period) as the index, keeping the last row per quarter
ndf = df.drop(columns=['DATE', 'Q']).pivot_table(index=pd.to_datetime(df['Q']).dt.to_period('Q'), aggfunc='last')
BBGN QQ4567 SSD
Q
2006Q3 1.30 10055.466 22.080
2006Q4 1.44 10480.888 22.125