Groupby in pandas timeseries dataframe choosing most recent event - python

I have a pandas timeseries dataframe that has date set as index and a number of columns (one is cusip).
I want to iterate through the dataframe and create a new dataframe where, for each cusip, I take the most recent data available.
I tried to use groupby:
newData = []
for group in df.groupby(df['CUSIP']):
    newData.append(group[group.index == max(group.index)])
but this fails with:
'builtin_function_or_method' object is not iterable
In [374]: df.head()
Out[374]:
CUSIP COLA COLB COLC
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.24
1995-12-12 BBB 254 5150 3.25
1997-12-12 BBB 245 6150 3.25
1998-12-12 CCC 234 5140 3.24145
1999-12-12 CCC 223 5120 3.65145
I want:
CUSIP COLA COLB COLC
date
1992-07-13 AAA 234 4677 3.485577
1997-12-12 BBB 245 6150 3.25
1999-12-12 CCC 223 5120 3.65145
Should I approach this another way? Thank you.

In [17]: df
Out[17]:
cusip a b c
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.240000
1995-12-12 BBB 254 5150 3.250000
1997-12-12 BBB 245 6150 3.250000
1998-12-12 CCC 234 5140 3.241450
1999-12-12 CCC 223 5120 3.651450
[7 rows x 4 columns]
Sort it
In [18]: df = df.sort_index()
In [19]: df
Out[19]:
cusip a b c
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.240000
1995-12-12 BBB 254 5150 3.250000
1997-12-12 BBB 245 6150 3.250000
1998-12-12 CCC 234 5140 3.241450
1999-12-12 CCC 223 5120 3.651450
[7 rows x 4 columns]
Take the last element from each group
In [20]: df.groupby('cusip').last()
Out[20]:
a b c
cusip
AAA 234 4677 3.485577
BBB 245 6150 3.250000
CCC 223 5120 3.651450
[3 rows x 3 columns]
If you want to keep the date index, reset first, group, then set the index back
In [9]: df.reset_index().groupby('cusip').last().reset_index().set_index('date')
Out[9]:
cusip a b c
date
1992-07-13 AAA 234 4677 3.485577
1997-12-12 BBB 245 6150 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
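A shorter alternative that keeps the date index is groupby.tail(1), which returns the whole last row of each group and preserves the original index (a sketch, assuming the sorted df above):
# one row per cusip, with the 1992-07-13 / 1997-12-12 / 1999-12-12 dates kept as the index
df.sort_index().groupby('cusip').tail(1)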

I did it this way:
df = pd.read_csv('/home/desktop/test.csv')
if the date came in as the index, move it back to a column first
df = df.reset_index()
convert date to datetime
df['date'] = pd.to_datetime(df['date'])
sort the dataframe the way you want (newest date first within each CUSIP), then group
grouped = df.sort_values(['CUSIP', 'date'], ascending=[True, False]).groupby('CUSIP')
define what happens when you aggregate (according to the way you sorted)
def return_first(pd_series):
    return pd_series.values[0]
make a dict to apply the same function to every column except the group key
agg_dict = {c: return_first for c in df.columns if c != 'CUSIP'}
finally aggregate
df = grouped.agg(agg_dict)
EDIT:
converting the date to datetime avoids the wrong ordering you get when string dates are sorted lexically (note rows 8 and 9 below: as strings, '1997-12-4' sorts above '1997-12-05' even though it is the earlier date):
In [12]: df.sort_values(['CUSIP','date'], ascending=[True,False])
Out[12]:
date CUSIP COLA COLB COLC date_time
6 1999-12-12 CCC 223 5120 3.651450 1999-12-12 00:00:00
5 1998-12-12 CCC 234 5140 3.241450 1998-12-12 00:00:00
8 1997-12-4 DDD 999 9999 9.999999 1997-12-04 00:00:00
9 1997-12-05 DDD 245 6150 3.250000 1997-12-05 00:00:00
7 1992-07-6 DDD 234 4677 3.485577 1992-07-06 00:00:00
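The same "newest row per CUSIP" result can also be obtained without a custom aggregation by dropping duplicates after sorting; a sketch assuming the same columns and that 'date' has already been converted with pd.to_datetime:
# newest date first within each CUSIP, then keep the first row seen per CUSIP
latest = df.sort_values(['CUSIP', 'date'], ascending=[True, False]).drop_duplicates('CUSIP', keep='first')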


Python: Loop over multiple items

Let's start with a simple DataFrame:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
df:
a
0 100
1 100
2 105
3 110
4 100
5 106
6 120
7 110
8 105
9 70
10 90
11 100
Now, I want to calculate the returns on a 7-day rolling basis. So I apply the following:
df['delta_rol_a_last_first'] = np.nan
for i in range(7, len(df)):
    df['delta_rol_a_last_first'].iloc[i] = (df['a'].iloc[i] - df['a'].iloc[i-7])/df['a'].iloc[i-6]
df.dropna(inplace=True)
df:
a delta_rol_a_last_first
7 110 0.100000
8 105 0.047619
9 70 -0.318182
10 90 -0.200000
11 100 0.000000
Now I want just the negative returns, apply quantiles to them, and add identifying columns to the rows, as follows:
df_quant = df['delta_rol_a_last_first'][df['delta_rol_a_last_first'] <0].quantile([0.01,0.03,0.05,0.1])
df_quant.index.names = ['quantile']
df_quant=df_quant.to_frame()
df_quant['Type'] = 'pct'
df_quant['timeframe'] = 'weekly'
df_quant:
delta_rol_a_last_first Type timeframe
quantile
0.01 -0.317000 pct weekly
0.03 -0.314636 pct weekly
0.05 -0.312273 pct weekly
0.10 -0.306364 pct weekly
So that works perfectly.
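As an aside, the same column can be computed without the explicit loop by using shift; a sketch assuming the same df (the loop divides by the element six rows back, so shift(6) reproduces it exactly):
import pandas as pd
df = pd.DataFrame({"a": [100, 100, 105, 110, 100, 106, 120, 110, 105, 70, 90, 100]})
# (a[i] - a[i-7]) / a[i-6]; the first 7 rows are NaN and get dropped
df['delta_rol_a_last_first'] = (df['a'] - df['a'].shift(7)) / df['a'].shift(6)
df = df.dropna()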
Now imagine I want to do the same but more dynamically. So consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I will create a dictionary for the periods over which I want to calculate my rolling returns:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
Now I want to create the same DataFrame as df_quant above. So my DataFrame should look something like this:
col_A_rolling col_B_rolling col_C_rolling Type timeframe
quantile
0.01 -0.317000 -0.234 -0.0443 pct weekly
0.03 -0.314636 -0.022 ... pct weekly
0.05 ... ... ... ...
0.10 ... ...
0.01 ... ...
0.03 ... ...
0.05 ... ...
0.10 -0.306364 -.530023 pct daily
(NOTE: the numbers in this DataFrame are hypothetical)
EDIT:
My attempt is this:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling'] = np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df1[col].iloc[i-key])/df1[col].iloc[i-key]
What is the best way to do this? Any help would be appreciated.
I didn't test all the code, but the first part can be replaced by DataFrame.rolling:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
df[['delta_rol_a_last_first']] = df.rolling(8).apply(convert)
# ---
print(df)
or using lambda
df[['delta_rol_a_last_first']] = df.rolling(8).apply(lambda data: ((data.iloc[-1] - data.iloc[0])/data.iloc[1]))
The same for many columns:
import pandas as pd
data = [
[99330,12,122], [1123,1230,1287], [123,101,812739], [1143,1230123,252],
[234,342,4546], [2445,3453,3457], [7897,8657,5675], [46,5675,453],
[76,484,3735], [363,93,4568], [385,568,367], [458,846,4847],
[574,45747,658468], [57457,46534,4675]
]
df = pd.DataFrame(
data,
index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C']
)
df.index = pd.to_datetime(df.index)
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
#df[['col_A_weekly', 'col_B_weekly', 'col_C_weekly']] = df.rolling(8).apply(convert)
new_columns = [name+'_weekly' for name in df.columns]
df[new_columns] = df.rolling(8).apply(convert)
# ---
print(df)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN
2022-01-03 123 101 812739 NaN NaN NaN
2022-01-04 1143 1230123 252 NaN NaN NaN
2022-01-05 234 342 4546 NaN NaN NaN
2022-01-06 2445 3453 3457 NaN NaN NaN
2022-01-07 7897 8657 5675 NaN NaN NaN
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297
2022-01-12 458 846 4847 0.091616 0.145960 0.087070
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506
EDIT:
Using the two ranges, daily and weekly:
old_columns = df.columns
new_columns = [name+'_weekly' for name in old_columns]
df[new_columns] = df[old_columns].rolling(8).apply(convert)
new_columns = [name+'_daily' for name in old_columns]
df[new_columns] = df[old_columns].rolling(2).apply(convert)
or using a loop:
old_columns = df.columns
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    new_columns = [name+'_'+suffix for name in old_columns]
    df[new_columns] = df[old_columns].rolling(days+1).apply(convert)
or
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    for name in old_columns:
        new_name = name + '_' + suffix
        df[new_name] = df[name].rolling(days+1).apply(convert)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly col_A_daily col_B_daily col_C_daily
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN -87.450579 0.990244 0.905206
2022-01-03 123 101 812739 NaN NaN NaN -8.130081 -11.178218 0.998416
2022-01-04 1143 1230123 252 NaN NaN NaN 0.892388 0.999918 -3224.154762
2022-01-05 234 342 4546 NaN NaN NaN -3.884615 -3595.850877 0.944567
2022-01-06 2445 3453 3457 NaN NaN NaN 0.904294 0.900956 -0.315013
2022-01-07 7897 8657 5675 NaN NaN NaN 0.690389 0.601132 0.390837
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187 -170.673913 -0.525463 -11.527594
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012 0.394737 -10.725207 0.878715
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778 0.790634 -4.204301 0.182356
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297 0.057143 0.836268 -11.446866
2022-01-12 458 846 4847 0.091616 0.145960 0.087070 0.159389 0.328605 0.924283
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441 0.202091 0.981507 0.992639
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506 0.990010 0.016912 -139.848770
EDIT:
Quantile:
finall_df = pd.DataFrame()
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    df_quant = pd.DataFrame()
    for name in old_columns:
        new_name = name + '_' + suffix
        df_quant[name] = df[new_name][df[new_name] < 0].quantile([0.01, 0.03, 0.05, 0.1])
    df_quant.index.names = ['quantile']
    df_quant['Type'] = 'pct'
    df_quant['timeframe'] = suffix
    print(df_quant.to_string())
    #finall_df = finall_df.append(df_quant)
    finall_df = pd.concat([finall_df, df_quant])
print(finall_df)
Result:
col_A col_B col_C Type timeframe
quantile
0.01 -168.177213 -3452.463971 -3100.782522 pct daily
0.03 -163.183813 -3165.690158 -2854.038043 pct daily
0.05 -158.190413 -2878.916345 -2607.293564 pct daily
0.10 -145.706913 -2161.981813 -1990.432365 pct daily
0.01 -86.012694 -3523.433980 -3174.979575 pct weekly
0.03 -81.218849 -3379.921823 -3110.883170 pct weekly
0.05 -76.425004 -3236.409666 -3046.786764 pct weekly
0.10 -64.440391 -2877.629275 -2886.545751 pct weekly
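The inner loop over columns can also be collapsed: where() keeps only the negative returns (everything else becomes NaN) and DataFrame.quantile then works column-wise, skipping the NaNs. A sketch using the same names as above:
finall_df = pd.DataFrame()
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    cols = [name + '_' + suffix for name in old_columns]
    # column-wise quantiles of the negative returns only
    df_quant = df[cols].where(df[cols] < 0).quantile([0.01, 0.03, 0.05, 0.1])
    df_quant.columns = old_columns
    df_quant.index.name = 'quantile'
    df_quant['Type'] = 'pct'
    df_quant['timeframe'] = suffix
    finall_df = pd.concat([finall_df, df_quant])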

Filter dataframe per ID based on conditional timerange

Hi, I will try to explain the issue I am facing.
I have one dataframe (df) with the following:
ID   Date (dd-mm-yyyy)
AAA  01-09-2020
AAA  01-11-2020
AAA  18-03-2021
AAA  10-10-2022
BBB  01-01-2019
BBB  01-03-2019
CCC  01-05-2020
CCC  01-07-2020
CCC  01-08-2020
CCC  01-10-2021
I have created another dataframe (df2) with the first date (T) registered per ID and T+3 months:
ID   T (First Date Occurred)  T+3
AAA  01-09-2020               01-12-2020
BBB  01-01-2019               01-03-2020
CCC  01-05-2020               01-08-2020
The desired output, where I am struggling, is df filtered per ID by the two date bounds defined in df2 ("T" and "T+3"), e.g. for AAA: T < Date < T+3:
ID   Date (dd-mm-yyyy)
AAA  01-11-2020
BBB  01-03-2019
CCC  01-07-2020
CCC  01-08-2020
What is the best way to approach this? Any help is appreciated!
IIUC, you can use pandas.merge_asof with allow_exact_matches=False:
(pd.merge_asof(df1.sort_values(by='Date'), df2.sort_values(by='T'),
               allow_exact_matches=False,
               by='ID', left_on='Date', right_on='T')
   .loc[lambda d: d['Date'] <= d['T+3']]
)
NB: the exact condition on T+3 is unclear; you describe "< T+3" but the shown output implies "<= T+3", so just choose the one you want (< or <=) in the loc.
output:
ID Date T T+3
1 BBB 2019-03-01 2019-01-01 2020-03-01
3 CCC 2020-07-01 2020-05-01 2020-08-01
4 CCC 2020-08-01 2020-05-01 2020-08-01
6 AAA 2020-11-01 2020-09-01 2020-12-01
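Since df2 has exactly one row per ID, an alternative sketch is a plain merge followed by a boolean mask (this assumes the date columns are already datetime; pick < or <= for T+3 as noted above):
out = (df1.merge(df2, on='ID')
          .loc[lambda d: (d['Date'] > d['T']) & (d['Date'] <= d['T+3']), ['ID', 'Date']])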

How to shift a value in a dataframe using pandas?

I have data like this, without z1. What I need is to add a z1 column to the DataFrame that holds, for each row, the z value from one day earlier for the same Start date, as in the example.
I was thinking it could be done with apply and a lambda in pandas, but I'm not sure how to define the lambda function:
data = pd.read_csv("....")
data["Z"] = data[[
"Start", "Z"]].apply(lambda x:
You can use DataFrameGroupBy.shift with merge:
#if not datetime
df['date'] = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
df1 = df.groupby('start')['z'].shift(freq='1D',periods=1).reset_index()
print (pd.merge(df.reset_index(),df1, on=['start','date'], how='left', suffixes=('','1')))
date start z z1
0 2012-12-01 324 564545 NaN
1 2012-12-01 384 5555 NaN
2 2012-12-01 349 554 NaN
3 2012-12-02 855 635 NaN
4 2012-12-02 324 56 564545.0
5 2012-12-01 341 98 NaN
6 2012-12-03 324 888 56.0
EDIT:
Try finding the duplicates and filling NaN with 0:
df['date'] = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
df1 = df.groupby('start')['z'].shift(freq='1D',periods=1).reset_index()
df2 = pd.merge(df.reset_index(),df1, on=['start','date'], how='left', suffixes=('','1'))
mask = df2.start.duplicated(keep=False)
df2.loc[mask, 'z1'] = df2.loc[mask, 'z1'].fillna(0)
print (df2)
date start z z1
0 2012-12-01 324 564545 0.0
1 2012-12-01 384 5555 NaN
2 2012-12-01 349 554 NaN
3 2012-12-02 855 635 NaN
4 2012-12-02 324 56 564545.0
5 2012-12-01 341 98 NaN
6 2012-12-03 324 888 56.0
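The same "previous day, same start" lookup can also be written as an explicit self-merge; a sketch assuming each (date, start) pair occurs at most once:
tmp = df.reset_index()
prev = tmp[['date', 'start', 'z']].copy()
prev['date'] = prev['date'] + pd.Timedelta(days=1)   # a row's z becomes the next day's z1
out = tmp.merge(prev.rename(columns={'z': 'z1'}), on=['date', 'start'], how='left')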

Frequency count unique values Pandas

I have a Pandas Series as follows:
2014-05-24 23:59:49 1.3
2014-05-24 23:59:50 2.17
2014-05-24 23:59:50 1.28
2014-05-24 23:59:51 1.30
2014-05-24 23:59:51 2.17
2014-05-24 23:59:53 2.17
2014-05-24 23:59:58 2.17
Name: api_id, Length: 483677
I'm trying to count for each id the frequency per day.
For now I'm doing this :
count = {}
for x in apis.unique():
    count[x] = apis[apis == x].resample('D','count')
count_df = pd.DataFrame(count)
That gives me what I want, which is:
... 2.13 2.17 2.4 2.6 2.7 3.5(user) 3.9 4.2 5.1 5.6
timestamp ...
2014-05-22 ... 391 49962 3727 161 2 444 113 90 1398 90
2014-05-23 ... 450 49918 3861 187 1 450 170 90 629 90
2014-05-24 ... 396 46359 3603 172 3 513 171 89 622 90
But is there a way to do this without the for loop?
You can use the value_counts function for this, applied after a groupby (similar to the resample('D') you did, but resample expects an aggregated output, so we have to use the more general groupby in this case). A small example:
In [16]: s = pd.Series([1,1,2,2,1,2,5,6,2,5,4,1], index=pd.date_range('2012-01-01', periods=12, freq='8H'))
In [17]: counts = s.groupby(pd.Grouper(freq='D')).value_counts()
In [18]: counts
Out[18]:
2012-01-01 1 2
2 1
2012-01-02 2 2
1 1
2012-01-03 2 1
6 1
5 1
2012-01-04 1 1
5 1
4 1
dtype: int64
To get this in the desired format, you can just unstack this (move the second level row indices to the columns):
In [19]: counts.unstack()
Out[19]:
1 2 4 5 6
2012-01-01 2 1 NaN NaN NaN
2012-01-02 1 2 NaN NaN NaN
2012-01-03 NaN 1 NaN 1 1
2012-01-04 1 NaN 1 1 NaN
Note: for the use of groupby(pd.Grouper(freq='D')) you need pandas 0.14. If you have an older version, you can use groupby(pd.TimeGrouper(freq='D')) to obtain exactly the same result. This is also similar to doing groupby(s.index.date) (with the difference that you then have datetime.date objects in the index).
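In recent pandas the same per-day frequency table can also be built in one step with pd.crosstab; a sketch using the toy series s from above (normalize() floors each timestamp to midnight):
pd.crosstab(s.index.normalize(), s)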

Groupby - taking last element - how do I keep nan's?

I have a df (below) and I want to grab the most recent row per CUSIP.
In [374]: df.head()
Out[374]:
CUSIP COLA COLB COLC
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA NaN 4677 3.485577
1992-12-12 BBB 221 5150 3.24
1995-12-12 BBB 254 5150 3.25
1997-12-12 BBB 245 NaN 3.25
1998-12-12 CCC 234 5140 3.24145
1999-12-12 CCC 223 5120 3.65145
I am using:
df = df.reset_index().groupby('CUSIP').last().reset_index().set_index('date')
I want this:
CUSIP COLA COLB COLC
date
1992-07-13 AAA NaN 4677 3.485577
1997-12-12 BBB 245 NaN 3.25
1999-12-12 CCC 223 5120 3.65145
Instead I am getting:
CUSIP COLA COLB COLC
date
1992-07-13 AAA 238 4677 3.485577
1997-12-12 BBB 245 5150 3.25
1999-12-12 CCC 223 5120 3.65145
How do I get last() to take the last row of each group, including the NaNs?
Thank you.
You can do this directly with an apply instead of last (taking the iloc[-1] row of each group):
In [11]: df.reset_index().groupby('CUSIP').apply(lambda x: x.iloc[-1]).reset_index(drop=True).set_index('date')
Out[11]:
CUSIP COLA COLB COLC
date
1992-07-13 AAA NaN 4677 3.485577
1997-12-12 BBB 245 NaN 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
In 0.13 (rc out now), a faster and more direct way will be to use cumcount:
In [12]: df[df.groupby('CUSIP').cumcount(ascending=False) == 0]
Out[12]:
CUSIP COLA COLB COLC
date
1992-07-13 AAA NaN 4677 3.485577
1997-12-12 BBB 245 NaN 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
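Another option that keeps both the NaNs and the date index is groupby.tail, which returns whole rows rather than per-column last values; a sketch assuming the df from the question:
df.groupby('CUSIP').tail(1)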
