Groupby - taking last element - how do I keep nan's? - python

I have a df and I want to grab the most recent row for each CUSIP.
In [374]: df.head()
Out[374]:
CUSIP COLA COLB COLC
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA NaN 4677 3.485577
1992-12-12 BBB 221 5150 3.24
1995-12-12 BBB 254 5150 3.25
1997-12-12 BBB 245 NaN 3.25
1998-12-12 CCC 234 5140 3.24145
1999-12-12 CCC 223 5120 3.65145
I am using:
df = df.reset_index().groupby('CUSIP').last().reset_index().set_index('date')
I want this:
CUSIP COLA COLB COLC
date
1992-07-13 AAA NaN 4677 3.485577
1997-12-12 BBB 245 NaN 3.25
1999-12-12 CCC 223 5120 3.65145
Instead I am getting:
CUSIP COLA COLB COLC
date
1992-07-13 AAA 238 4677 3.485577
1997-12-12 BBB 245 5150 3.25
1999-12-12 CCC 223 5120 3.65145
How do I get last() to take the last row of the groupby including the NaN's?
Thank you.

You can do this directly with an apply instead of last (taking the last row of each group with iloc[-1]):
In [11]: df.reset_index().groupby('CUSIP').apply(lambda x: x.iloc[-1]).reset_index(drop=True).set_index('date')
Out[11]:
CUSIP COLA COLB COLC
date
1992-07-13 AAA NaN 4677 3.485577
1997-12-12 BBB 245 NaN 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
In 0.13 (release candidate out now), a faster and more direct way is to use cumcount:
In [12]: df[df.groupby('CUSIP').cumcount(ascending=False) == 0]
Out[12]:
CUSIP COLA COLB COLC
date
1992-07-13 AAA NaN 4677 3.485577
1997-12-12 BBB 245 NaN 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
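As an aside (not from the original answer, and assuming a reasonably recent pandas): GroupBy.tail(1) also keeps each group's last row verbatim, NaNs included, and preserves the date index without the reset_index round trip:
# Hedged sketch: last row per CUSIP as-is, with NaNs and the date index kept
last_rows = df.sort_index().groupby('CUSIP').tail(1)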

Related

Python: Loop over multiple items

Let's start with a simple DataFrame:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
df:
a
0 100
1 100
2 105
3 110
4 100
5 106
6 120
7 110
8 105
9 70
10 90
11 100
Now, I want to calculate the returns on a 7-day rolling basis. So I apply the following:
df['delta_rol_a_last_first'] = np.nan
for i in range(7,len(df)):
    df['delta_rol_a_last_first'].iloc[i] = (df['a'].iloc[i] - df['a'].iloc[i-7])/df['a'].iloc[i-6]
df.dropna(inplace=True)
df:
a delta_rol_a_last_first
7 110 0.100000
8 105 0.047619
9 70 -0.318182
10 90 -0.200000
11 100 0.000000
Now I want only the negative returns, apply quantiles to them, and add identifying columns to the rows as follows:
df_quant = df['delta_rol_a_last_first'][df['delta_rol_a_last_first'] <0].quantile([0.01,0.03,0.05,0.1])
df_quant.index.names = ['quantile']
df_quant=df_quant.to_frame()
df_quant['Type'] = 'pct'
df_quant['timeframe'] = 'weekly'
df_quant:
delta_rol_a_last_first Type timeframe
quantile
0.01 -0.317000 pct weekly
0.03 -0.314636 pct weekly
0.05 -0.312273 pct weekly
0.10 -0.306364 pct weekly
So that works perfectly.
Now imagine I want to do the same but more dynamically. So consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I will create a dictionary for the periods over which I want to calculate my rolling returns:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
Now I want to create the same DataFrame as df_quant above. So my DataFrame should look something like this:
col_A_rolling col_B_rolling col_C_rolling Type timeframe
quantile
0.01 -0.317000 -0.234 -0.0443 pct weekly
0.03 -0.314636 -0.022 ... pct weekly
0.05 ... ... ... ...
0.10 ... ...
0.01 ... ...
0.03 ... ...
0.05 ... ...
0.10 -0.306364 -.530023 pct daily
(NOTE: the numbers in this DataFrame are hypothetical)
EDIT:
My attempt is this:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling'] = np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df1[col].iloc[i-key])/df1[col].iloc[i-key]
What is the best way to do this? Any help would be appreciated.
I didn't test all the code, but the first part can be replaced with DataFrame.rolling:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
df[['delta_rol_a_last_first']] = df.rolling(8).apply(convert)
# ---
print(df)
or using lambda
df[['delta_rol_a_last_first']] = df.rolling(8).apply(lambda data: ((data.iloc[-1] - data.iloc[0])/data.iloc[1]))
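As an aside (my note, not part of the answer): if the intent was the usual 7-day return, i.e. dividing by the value seven rows back rather than by the window's second element, pct_change gives it directly:
# Sketch: (a[i] - a[i-7]) / a[i-7], NaN for the first 7 rows
df['ret_7d'] = df['a'].pct_change(periods=7)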
The same for many columns:
import pandas as pd
data = [
[99330,12,122], [1123,1230,1287], [123,101,812739], [1143,1230123,252],
[234,342,4546], [2445,3453,3457], [7897,8657,5675], [46,5675,453],
[76,484,3735], [363,93,4568], [385,568,367], [458,846,4847],
[574,45747,658468], [57457,46534,4675]
]
df = pd.DataFrame(
data,
index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C']
)
df.index = pd.to_datetime(df.index)
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
#df[['col_A_weekly', 'col_B_weekly', 'col_C_weekly']] = df.rolling(8).apply(convert)
new_columns = [name+'_weekly' for name in df.columns]
df[new_columns] = df.rolling(8).apply(convert)
# ---
print(df)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN
2022-01-03 123 101 812739 NaN NaN NaN
2022-01-04 1143 1230123 252 NaN NaN NaN
2022-01-05 234 342 4546 NaN NaN NaN
2022-01-06 2445 3453 3457 NaN NaN NaN
2022-01-07 7897 8657 5675 NaN NaN NaN
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297
2022-01-12 458 846 4847 0.091616 0.145960 0.087070
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506
EDIT:
Using two ranges daily and weekly
old_columns = df.columns
new_columns = [name+'_weekly' for name in old_columns]
df[new_columns] = df[old_columns].rolling(8).apply(convert)
new_columns = [name+'_daily' for name in old_columns]
df[new_columns] = df[old_columns].rolling(2).apply(convert)
or using a loop:
old_columns = df.columns
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    new_columns = [name+'_'+suffix for name in old_columns]
    df[new_columns] = df[old_columns].rolling(days+1).apply(convert)
or
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    for name in old_columns:
        new_name = name + '_' + suffix
        df[new_name] = df[name].rolling(days+1).apply(convert)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly col_A_daily col_B_daily col_C_daily
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN -87.450579 0.990244 0.905206
2022-01-03 123 101 812739 NaN NaN NaN -8.130081 -11.178218 0.998416
2022-01-04 1143 1230123 252 NaN NaN NaN 0.892388 0.999918 -3224.154762
2022-01-05 234 342 4546 NaN NaN NaN -3.884615 -3595.850877 0.944567
2022-01-06 2445 3453 3457 NaN NaN NaN 0.904294 0.900956 -0.315013
2022-01-07 7897 8657 5675 NaN NaN NaN 0.690389 0.601132 0.390837
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187 -170.673913 -0.525463 -11.527594
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012 0.394737 -10.725207 0.878715
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778 0.790634 -4.204301 0.182356
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297 0.057143 0.836268 -11.446866
2022-01-12 458 846 4847 0.091616 0.145960 0.087070 0.159389 0.328605 0.924283
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441 0.202091 0.981507 0.992639
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506 0.990010 0.016912 -139.848770
EDIT:
Quantile:
finall_df = pd.DataFrame()
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    df_quant = pd.DataFrame()
    for name in old_columns:
        new_name = name + '_' + suffix
        df_quant[name] = df[new_name][df[new_name]<0].quantile([0.01,0.03,0.05,0.1])
    df_quant.index.names = ['quantile']
    df_quant['Type'] = 'pct'
    df_quant['timeframe'] = suffix
    print(df_quant.to_string())
    #finall_df = finall_df.append(df_quant)
    finall_df = pd.concat([finall_df, df_quant])
print(finall_df)
Result:
col_A col_B col_C Type timeframe
quantile
0.01 -168.177213 -3452.463971 -3100.782522 pct daily
0.03 -163.183813 -3165.690158 -2854.038043 pct daily
0.05 -158.190413 -2878.916345 -2607.293564 pct daily
0.10 -145.706913 -2161.981813 -1990.432365 pct daily
0.01 -86.012694 -3523.433980 -3174.979575 pct weekly
0.03 -81.218849 -3379.921823 -3110.883170 pct weekly
0.05 -76.425004 -3236.409666 -3046.786764 pct weekly
0.10 -64.440391 -2877.629275 -2886.545751 pct weekly
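A small variation (my own aside; quant_daily and quant_weekly are hypothetical names for the per-timeframe quantile frames built above): pd.concat can attach the timeframe as an index level via keys, if that layout is preferred over a column:
# Sketch: label each frame with its timeframe in a MultiIndex instead of a 'timeframe' column
finall_df = pd.concat([quant_daily, quant_weekly], keys=['daily', 'weekly'], names=['timeframe', 'quantile'])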

Flip and shift multi-column data to the left in Pandas

Here's Employee - Supervisor mapping data.
I'd like to flip the supervisor columns and then shift the data one position to the left. Only the data should shift; the column headers should stay fixed. Could you tell me how I can do this?
Input: Bottom - Up approach
Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
123     234       456       678       789
234     456       678       789       NaN
456     678       789       NaN       NaN
678     789       NaN       NaN       NaN
789     NaN       NaN       NaN       NaN
Output: Top - Down approach
Emp_ID  Sup_1 ID  Sup_2 ID  Sup_3 ID  Sup_4 ID
123     789       678       456       234
234     789       678       456       NaN
456     789       678       NaN       NaN
678     789       NaN       NaN       NaN
789     NaN       NaN       NaN       NaN
Appreciate any kind of assistance
Try with fliplr:
# Get numpy structure
x = df.loc[:, 'Sup_1 ID':].to_numpy()
# flip left to right
a = np.fliplr(x)
# Overwrite the non-NaN positions in x with the non-NaN values of a
x[~np.isnan(x)] = a[~np.isnan(a)]
# Update DataFrame
df.loc[:, 'Sup_1 ID':] = x
df:
Emp_ID Sup_1 ID Sup_2 ID Sup_3 ID Sup_4 ID
0 123 789.0 678.0 456.0 234.0
1 234 789.0 678.0 456.0 NaN
2 456 789.0 678.0 NaN NaN
3 678 789.0 NaN NaN NaN
4 789 NaN NaN NaN NaN
DataFrame Constructor and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Emp_ID': [123, 234, 456, 678, 789],
'Sup_1 ID': [234.0, 456.0, 678.0, 789.0, np.nan],
'Sup_2 ID': [456.0, 678.0, 789.0, np.nan, np.nan],
'Sup_3 ID': [678.0, 789.0, np.nan, np.nan, np.nan],
'Sup_4 ID': [789.0, np.nan, np.nan, np.nan, np.nan]
})
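For intuition, a toy sketch of the masking trick above (my illustration, not from the answer): flipping left to right moves the NaNs to the front of each row, and the two boolean masks line the surviving values back up in the original non-NaN slots:
import numpy as np
x = np.array([[1., 2., np.nan], [3., np.nan, np.nan]])
a = np.fliplr(x)                     # [[nan, 2, 1], [nan, nan, 3]]
x[~np.isnan(x)] = a[~np.isnan(a)]    # non-NaN slots of x receive the reversed values
# x is now [[2., 1., nan], [3., nan, nan]]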
In your case try np.roll
df = df.set_index('Emp_ID')
out = df.apply(lambda x : np.roll(x[x.notnull()].values,1)).apply(pd.Series)
0 1 2 3
Sup_1 ID 789.0 234.0 456.0 678.0
Sup_2 ID 789.0 456.0 678.0 NaN
Sup_3 ID 789.0 678.0 NaN NaN
Sup_4 ID 789.0 NaN NaN NaN
out.columns = df.columns

how to shift value in dataframe using pandas?

I have data like this, without z1. What I need is to add a column z1 to the DataFrame, with values as in the example: for each row, z1 should be the z value from one day earlier for the same Start date.
I was thinking it could be done with apply and a lambda in pandas, but I'm not sure how to define the lambda function:
data = pd.read_csv("....")
data["Z"] = data[[
"Start", "Z"]].apply(lambda x:
You can use DataFrameGroupBy.shift with merge:
#if not datetime
df['date'] = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
df1 = df.groupby('start')['z'].shift(freq='1D',periods=1).reset_index()
print (pd.merge(df.reset_index(),df1, on=['start','date'], how='left', suffixes=('','1')))
date start z z1
0 2012-12-01 324 564545 NaN
1 2012-12-01 384 5555 NaN
2 2012-12-01 349 554 NaN
3 2012-12-02 855 635 NaN
4 2012-12-02 324 56 564545.0
5 2012-12-01 341 98 NaN
6 2012-12-03 324 888 56.0
EDIT:
Find the duplicates and fillna with 0:
df['date'] = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
df1 = df.groupby('start')['z'].shift(freq='1D',periods=1).reset_index()
df2 = pd.merge(df.reset_index(),df1, on=['start','date'], how='left', suffixes=('','1'))
mask = df2.start.duplicated(keep=False)
df2.ix[mask, 'z1'] = df2.ix[mask, 'z1'].fillna(0)
print (df2)
date start z z1
0 2012-12-01 324 564545 0.0
1 2012-12-01 384 5555 NaN
2 2012-12-01 349 554 NaN
3 2012-12-02 855 635 NaN
4 2012-12-02 324 56 564545.0
5 2012-12-01 341 98 NaN
6 2012-12-03 324 888 56.0
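An equivalent way to think about it (my sketch, assuming df has plain columns date, start and z): shift the dates forward by one day yourself and merge the result back as z1:
# Hedged sketch: z1 = the z recorded one day earlier for the same start
prev = df.assign(date=df['date'] + pd.Timedelta(days=1)).rename(columns={'z': 'z1'})
out = df.merge(prev[['date', 'start', 'z1']], on=['start', 'date'], how='left')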

Frequency count unique values Pandas

I have a Pandas Series as follow :
2014-05-24 23:59:49 1.3
2014-05-24 23:59:50 2.17
2014-05-24 23:59:50 1.28
2014-05-24 23:59:51 1.30
2014-05-24 23:59:51 2.17
2014-05-24 23:59:53 2.17
2014-05-24 23:59:58 2.17
Name: api_id, Length: 483677
I'm trying to count for each id the frequency per day.
For now I'm doing this :
count = {}
for x in apis.unique():
count[x] = apis[apis == x].resample('D','count')
count_df = pd.DataFrame(count)
That gives me what I want which is :
... 2.13 2.17 2.4 2.6 2.7 3.5(user) 3.9 4.2 5.1 5.6
timestamp ...
2014-05-22 ... 391 49962 3727 161 2 444 113 90 1398 90
2014-05-23 ... 450 49918 3861 187 1 450 170 90 629 90
2014-05-24 ... 396 46359 3603 172 3 513 171 89 622 90
But is there a way to do so without the for loop ?
You can use the value_counts function for this, applied after a groupby (which is similar to the resample('D') you did, but resample expects an aggregated output, so we have to use the more general groupby in this case). With a small example:
In [16]: s = pd.Series([1,1,2,2,1,2,5,6,2,5,4,1], index=pd.date_range('2012-01-01', periods=12, freq='8H'))
In [17]: counts = s.groupby(pd.Grouper(freq='D')).value_counts()
In [18]: counts
Out[18]:
2012-01-01 1 2
2 1
2012-01-02 2 2
1 1
2012-01-03 2 1
6 1
5 1
2012-01-04 1 1
5 1
4 1
dtype: int64
To get this in the desired format, you can just unstack this (move the second level row indices to the columns):
In [19]: counts.unstack()
Out[19]:
1 2 4 5 6
2012-01-01 2 1 NaN NaN NaN
2012-01-02 1 2 NaN NaN NaN
2012-01-03 NaN 1 NaN 1 1
2012-01-04 1 NaN 1 1 NaN
Note: for the use of groupby(pd.Grouper(freq='D')) you need pandas 0.14. If you have an older version, you can use groupby(pd.TimeGrouper(freq='D')) to obtain exactly the same result. This is also similar to doing groupby(s.index.date) (with the difference that you then have datetime.date objects in the index).
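On a reasonably recent pandas, a crosstab gets to the same day-by-value table in one call (my aside, assuming s is the datetime-indexed Series; missing combinations come out as 0 rather than NaN):
# Hedged sketch: rows are days, columns are observed values, cells are counts
table = pd.crosstab(s.index.normalize(), s)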

Groupby in pandas timeseries dataframe choosing most recent event

I have a pandas timeseries dataframe that has date set as index and a number of columns (one is cusip).
I want to iterate through the dataframe and create a new dataframe where, for each cusip, I take the most recent data available.
I tried to use groupby:
newData = []
for group in df.groupby(df['CUSIP']):
    newData.append(group[group.index == max(group.index)])
This fails with: 'builtin_function_or_method' object is not iterable
In [374]: df.head()
Out[374]:
CUSIP COLA COLB COLC
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.24
1995-12-12 BBB 254 5150 3.25
1997-12-12 BBB 245 6150 3.25
1998-12-12 CCC 234 5140 3.24145
1999-12-12 CCC 223 5120 3.65145
I want:
CUSIP COLA COLB COLC
date
1992-07-13 AAA 234 4677 3.485577
1997-12-12 BBB 245 6150 3.25
1999-12-12 CCC 223 5120 3.65145
Should I approach this another way? Thank you.
In [17]: df
Out[17]:
cusip a b c
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.240000
1995-12-12 BBB 254 5150 3.250000
1997-12-12 BBB 245 6150 3.250000
1998-12-12 CCC 234 5140 3.241450
1999-12-12 CCC 223 5120 3.651450
[7 rows x 4 columns]
Sort it
In [18]: df = df.sort_index()
In [19]: df
Out[19]:
cusip a b c
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.240000
1995-12-12 BBB 254 5150 3.250000
1997-12-12 BBB 245 6150 3.250000
1998-12-12 CCC 234 5140 3.241450
1999-12-12 CCC 223 5120 3.651450
[7 rows x 4 columns]
Take the last element from each group
In [20]: df.groupby('cusip').last()
Out[20]:
a b c
cusip
AAA 234 4677 3.485577
BBB 245 6150 3.250000
CCC 223 5120 3.651450
[3 rows x 3 columns]
If you want to keep the date index, reset first, group, then set the index back
In [9]: df.reset_index().groupby('cusip').last().reset_index().set_index('date')
Out[9]:
cusip a b c
date
1992-07-13 AAA 234 4677 3.485577
1997-12-12 BBB 245 6150 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
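Another option (my aside, not from the answers): once the frame is sorted by date, drop_duplicates can keep the last row per cusip directly:
# Hedged sketch: most recent row per cusip, date index preserved
out = df.sort_index().reset_index().drop_duplicates('cusip', keep='last').set_index('date')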
I did it this way
df = pd.read_csv('/home/desktop/test.csv' )
convert date to datetime
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'])
sort dataframe the way you want
df = df.sort(['CUSIP','date'], ascending=[True,False]).groupby('CUSIP')
define what happens when you aggregate (according to the way you sorted)
def return_first(pd_series):
    return pd_series.values[0]
make dict to apply same function to all columns
agg_dict = {c: return_first for c in df.columns}
finally aggregate
df = df.agg(agg_dict)
EDIT:
Converting the date to datetime avoids this kind of error (with unpadded date strings, a lexicographic sort puts 1997-12-4 before 1997-12-05):
In [12]: df.sort(['CUSIP','date'],ascending=[True,False])
Out[12]:
date CUSIP COLA COLB COLC date_time
6 1999-12-12 CCC 223 5120 3.651450 1999-12-12 00:00:00
5 1998-12-12 CCC 234 5140 3.241450 1998-12-12 00:00:00
8 1997-12-4 DDD 999 9999 9.999999 1997-12-04 00:00:00
9 1997-12-05 DDD 245 6150 3.250000 1997-12-05 00:00:00
7 1992-07-6 DDD 234 4677 3.485577 1992-07-06 00:00:00
