Difference between pandas aggregators .first() and .last() - python

I'm curious as to what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but my understanding is that if you pass an argument to first or last, e.g. 3, it returns the first 3 months or the first 3 years.
In this circumstance, since I'm not passing any arguments into first() and last(), what are they actually doing when I resample like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what is happening when I use last()?
More importantly, why do first() and last() give me different answers in this context? I see that numerically they are not equal.
i.e: post2008.resample().first() != post2008.resample().last()
TLDR:
What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly looks like after post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly looks like after post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6

Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-07-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above consists of just two rows:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.
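To see the three aggregators side by side on the same yearly bins, you can also pass several of them to agg at once (a small sketch, reusing the example df built above):
# first, last and mean for each year of the example df
print(df.resample('Y').agg(['first', 'last', 'mean']))
# 2014: first=1000, last=2000, mean=1500.0
# 2015: first=3000, last=6000, mean=4500.0
# 2016: first=7000, last=8000, mean=7500.0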

Related

How to Calculate Year over Year Percentage Change in Dataframe with Datetime Index based on Date and not number of Periods

I have multiple DataFrames for macroeconomic time series. In each of these DataFrames I want to add a column showing the year-over-year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not have the same frequency. For example, GDP is quarterly, PCE is monthly and S&P returns are daily, so I cannot specify a single number of periods. Since my DataFrames already have a DatetimeIndex, I would like to specify that I want the percentage change to be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop to do it for all Dataframes without specifying the periods but specifying that I want the percentage change from Year to Year.
Given:
import pandas as pd
d = {'Date': ['2021-02-01',
              '2021-03-01',
              '2021-04-01',
              '2021-05-01',
              '2021-06-01',
              '2022-02-01',
              '2022-03-01',
              '2022-04-01',
              '2022-05-01',
              '2022-06-01'],
     'PCE': [1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
             6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample that new dataframe with annual frequency ('A'), and count the values in the Date column to get the number of rows per year.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can take the second-to-last of those yearly counts and use it as the periods value in the pct_change method.
The second-to-last, because if there is an incomplete year at the end, the last count would probably give you a wrong periods value. This assumes that you have more than one year of data in every dataframe; otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very elegant; maybe someone will come up with a less complicated idea.
To build your for-loop over every dataframe, you'd probably be better off using the same column name in each dataframe for the column you want to apply pct_change to; a sketch of such a loop follows.
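A minimal sketch of that loop, assuming each dataframe (gdp, pce, spx from the question) has a DatetimeIndex and a single value column; the column name YoY is just an illustration:
# Sketch: derive the periods-per-year from the second-to-last annual row count,
# then apply pct_change with that period to each frame's (only) value column.
for frame in (gdp, pce, spx):
    counts = frame.index.to_frame().resample('A').count()
    periods_per_year = counts.iloc[-2, 0]      # second-to-last = last complete year
    frame['YoY'] = frame.iloc[:, 0].pct_change(periods_per_year)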

Pandas: compute average and standard deviation by clock time

I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains a value recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and mean of value for each hour of each date, and create two new columns for them. I have tried first separating the hour for each value like this:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that won't work; could I have some advice on the problem?
We can split the time column on the delimiter :, take the hour component with str[0], then group the dataframe on date along with the hour component and aggregate the value column with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
If you want to broadcast the aggregated values to original dataframe, then we need to use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151
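For reference, here is a self-contained version of the transform snippet above, built on a small hypothetical frame laid out like the question's (date and time stored as strings):
import pandas as pd
# Hypothetical sample matching the question's layout
df = pd.DataFrame({'date': ['2019-04-18'] * 4,
                   'time': ['07:00:10', '07:00:20', '07:00:30', '07:00:40'],
                   'value': [100.8, 95.6, 87.6, 94.2]})
# Group on date plus the hour part of the time string, broadcast mean/std back
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
print(df)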
This answer synthesizes its own data. The approach: generate a true datetime column, groupby() the hour, use describe() to get the mean & std, then merge() back onto the original data frame.
import numpy as np
import pandas as pd
d = pd.date_range("1-Jan-2019", "28-Feb-2019", freq="10S")
df = pd.DataFrame({"datetime": d, "value": np.random.uniform(70, 90, len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
               time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
       .apply(lambda dfa: dfa.describe().T.loc[:, ["mean", "std"]].reset_index(drop=True))
       .droplevel(1)
)
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007

How to retrieve the 3 months from each quarter and hence increase the df row count by 3 times (Pandas, Python)

I have a quite silly task but haven't found a way to do it. I have a huge df; here is the head:
Deal Date Period Name Price Quarter Start Quarter End
0 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999
1 2011-11-01 2012-Q1 30.95 2012-01-01 2012-03-31 23:59:59.999999999
2 2011-11-01 2012-Q2 30.67 2012-04-01 2012-06-30 23:59:59.999999999
3 2011-11-01 2012-Q3 29.87 2012-07-01 2012-09-30 23:59:59.999999999
4 2011-11-01 2012-Q4 29.49 2012-10-01 2012-12-31 23:59:59.999999999
I wish to have an additional column which shows the month; the above 5 rows will then become 15 rows. For example, the initial row 0 will be repeated twice more:
Deal Date Period Name Price Quarter Start Quarter End Month
0 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999 10
1 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999 11
2 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999 12
as these 3 months are included in Q4, and similarly for the rest of the rows.
Is there an easy way to achieve this? Thanks
You can extract the quarter value from the period, then perform pandas.merge with a dataframe of only 12 rows containing the quarter -> month mapping.
Simplified example code:
import pandas as pd
df_test = pd.DataFrame({'quart': [1, 2, 3, 4, 1, 2], 'val': ['a', 'b', 'c', 'd', 'e', 'f']})
df_quart_to_month = pd.DataFrame({'quart': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                                  'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
df_with_months = df_test.merge(df_quart_to_month, on='quart', how='outer')
If you want to keep the original order:
df_with_months = df_test.reset_index().merge(df_quart_to_month, on='quart', how='outer').set_index('index')
df_sorted = df_with_months.sort_values(['index', 'month'], ascending=[True, True])
Alternatively you could split your dataset into 4 DataFrames based on their quarter, copy each sub-dataframe twice and add the corresponding month. Then concatenate the resulting 12 sub-dataframes together.
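Applied to the question's frame, the merge approach might look like this sketch (assuming 'Period Name' always has the form '2011-Q4' and the original frame is called df):
# Extract the quarter number from 'Period Name', then merge against a
# quarter -> month mapping so each row is repeated once per month of its quarter.
df['quart'] = df['Period Name'].str[-1].astype(int)        # e.g. 4 from '2011-Q4'
quart_to_month = pd.DataFrame({'quart': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                               'Month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
result = (df.reset_index()
            .merge(quart_to_month, on='quart', how='left')
            .sort_values(['index', 'Month'])
            .drop(columns=['quart', 'index']))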

Sorting values for every level 1 in pandas multiindex

I have a dataframe with a MultiIndex: the first level is a company_ID and the second level is a timestamp. How can I get a rank of all companies based on their scores, for every month?
Score
company_idx timestamp
10006 2010-01-31 69.875394
2010-11-30 73.640693
2010-12-31 73.286248
2011-01-31 73.660052
2011-02-28 74.615564
2011-03-31 73.535187
2011-04-30 72.491390
2012-01-31 72.162768
2012-02-29 61.637952
2012-03-31 59.445419
2012-04-30 25.685615
2012-05-31 8.047693
2012-06-30 58.341200
...
9981 2016-12-31 51.011261
2018-05-31 54.462832
2018-06-30 57.126250
2018-07-31 54.695835
2018-08-31 63.758145
2018-09-30 63.255583
2018-10-31 62.069697
2018-11-30 62.795650
2018-12-31 63.045329
2019-01-31 60.276990
2019-02-28 56.666379
2019-03-31 57.903213
2019-04-30 57.558973
2019-05-31 52.260287
I've tried to do:
df2 = df.sort_index(by='Score', ascending=False)
But it's not giving me what I want.
Would you be able to help? I'm quite new to multilevel dataframes.
Many thanks!
You should swap the index levels to have the month first, then sort by timestamp ascending and Score descending:
df.index = df.index.swaplevel()
df.sort_values(['timestamp', 'Score'], ascending=[True, False], inplace=True)
It does not give an interesting result with your sample data, because only one company has a Score value for any given month.
To extract the values for one month, you can use df.xs(month_value, level=0), which drops one level of the MultiIndex, or df.xs(month_value, level=0, drop_level=False), which keeps it.
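A minimal, self-contained sketch of those steps, using made-up scores for two hypothetical companies over two months:
import pandas as pd
# Two companies, two month-end timestamps, made-up scores
idx = pd.MultiIndex.from_product(
    [[10006, 9981], pd.to_datetime(['2019-01-31', '2019-02-28'])],
    names=['company_idx', 'timestamp'])
df = pd.DataFrame({'Score': [60.3, 56.7, 63.0, 57.9]}, index=idx)
df.index = df.index.swaplevel()
df.sort_values(['timestamp', 'Score'], ascending=[True, False], inplace=True)
# All companies for one month, highest score first
print(df.xs(pd.Timestamp('2019-01-31'), level=0))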

Pandas day for day

I have a lot of data in a Pandas dataframe:
Timestamp Value
2015-07-15 07:16:39.034 49.960
2015-07-15 07:16:39.036 49.940
......
2015-08-12 23:16:39.235 42.958
I have about 50 000 entries per day, and I would like to perform different operations on this data, day by day.
For example, if I would like to find the rolling mean, I would enter this:
df['rm5000'] = pd.rolling_mean(df['Value'], window=5000)
But that would give me the rolling mean across dates: the first rolling-mean datapoint on August 12th would contain 4999 datapoints from August 11th. However, I would like to start over each day, so that the first 4999 datapoints of each day do not get a full 5000-point rolling mean, since there might be a large difference between the last data of one day and the first data of the next day.
Do I have to slice the data into separate dataframes for each date for Pandas to do certain operations on the data for each separate date?
If you set the timestamps as the index, you can group by a TimeGrouper with a frequency code to partition the data by day, like below:
In [2]: df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
                           'Value': np.linspace(49, 51, 433)})
In [3]: df = df.set_index('Timestamp')
In [4]: df.groupby(pd.TimeGrouper('D'))['Value'].apply(lambda x: pd.rolling_mean(x, window=15))
Out[4]:
Timestamp
2015-07-15 00:00:00 NaN
2015-07-15 00:10:00 NaN
.....
2015-07-15 23:30:00 49.620370
2015-07-15 23:40:00 49.625000
2015-07-15 23:50:00 49.629630
2015-07-16 00:00:00 NaN
2015-07-16 00:10:00 NaN
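Note that pd.TimeGrouper and pd.rolling_mean have since been removed from pandas; in current versions the same per-day restart of the rolling mean can be written with pd.Grouper and the .rolling() method (a sketch under that assumption, reusing the synthetic data above):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
                   'Value': np.linspace(49, 51, 433)})
df = df.set_index('Timestamp')
# Restart the 15-point rolling mean at the start of each day
out = (df.groupby(pd.Grouper(freq='D'))['Value']
         .rolling(window=15).mean()
         .droplevel(0))   # drop the grouping level to get back a plain Timestamp index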
