Suppose I have two DataFrames:
df1:
ticker A B C
date
2022-01-01 NaN NaN 100
2022-01-02 NaN 200 NaN
2022-01-03 100 NaN NaN
2022-01-04 NaN NaN 120
df2:
ticker A B C
date
2022-01-02 145 233 100
2022-01-03 231 200 241
2022-01-04 100 200 422
2022-01-05 424 324 222
2022-01-06 400 421 320
I want to set the value in df2 to np.nan at every index and column where the value in df1 is not null, to get the following:
df3:
ticker A B C
date
2022-01-02 145 NaN 100
2022-01-03 NaN 200 241
2022-01-04 100 200 NaN
2022-01-05 424 324 222
2022-01-06 400 421 320
I am applying the following code:
for col in df1.columns:
    idx = df1[df1[col].notna()].index
    if df2[col][idx] == df1[col][idx]:
        df2[col][idx] = np.nan
However, this gives the error: ValueError: The truth value of a Series is ambiguous. Use a.empty(), a.bool(), a.item(), a.any() or a.all().
How can I re-write the above loop?
You can use reindex_like to align df1 with df2, then mask the values of df2 where the matching df1 values are not NaN:
out = df2.mask(df1.reindex_like(df2).notna())
To modify df2 in place:
df2[df1.reindex_like(df2).notna()] = float('nan')
Output:
A B C
date
2022-01-02 145.0 NaN 100.0
2022-01-03 NaN 200.0 241.0
2022-01-04 100.0 200.0 NaN
2022-01-05 424.0 324.0 222.0
2022-01-06 400.0 421.0 320.0
Combining several conditions:
df1b = df1.reindex_like(df2)
out = df2.mask(df1b.notna() & df2.ne(df1b), df2 - df1b)
Output:
A B C
date
2022-01-02 145 33 100
2022-01-03 131 200 241
2022-01-04 100 200 302
2022-01-05 424 324 222
2022-01-06 400 421 320
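For reference, here is a minimal, self-contained sketch of the mask approach, with df1 and df2 rebuilt from the question's data:
import numpy as np
import pandas as pd
# Rebuild the two frames from the question (dates as index, tickers as columns)
df1 = pd.DataFrame(
    {'A': [np.nan, np.nan, 100, np.nan],
     'B': [np.nan, 200, np.nan, np.nan],
     'C': [100, np.nan, np.nan, 120]},
    index=pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04']))
df2 = pd.DataFrame(
    {'A': [145, 231, 100, 424, 400],
     'B': [233, 200, 200, 324, 421],
     'C': [100, 241, 422, 222, 320]},
    index=pd.to_datetime(['2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06']))
df1.index.name = df2.index.name = 'date'
df1.columns.name = df2.columns.name = 'ticker'
# Set df2 to NaN wherever the aligned df1 has a value
out = df2.mask(df1.reindex_like(df2).notna())
print(out)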
Let's start with a simple DataFrame:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
df:
a
0 100
1 100
2 105
3 110
4 100
5 106
6 120
7 110
8 105
9 70
10 90
11 100
Now, I want to calculate the returns on a 7-day rolling basis. So I apply the following:
df['delta_rol_a_last_first'] = np.nan
for i in range(7,len(df)):
    df['delta_rol_a_last_first'].iloc[i] = (df['a'].iloc[i] - df['a'].iloc[i-7])/df['a'].iloc[i-6]
df.dropna(inplace=True)
df:
a delta_rol_a_last_first
7 110 0.100000
8 105 0.047619
9 70 -0.318182
10 90 -0.200000
11 100 0.000000
Now I want just the negative returns, apply quantiles to them, and add identifying labels to the rows, as follows:
df_quant = df['delta_rol_a_last_first'][df['delta_rol_a_last_first'] <0].quantile([0.01,0.03,0.05,0.1])
df_quant.index.names = ['quantile']
df_quant=df_quant.to_frame()
df_quant['Type'] = 'pct'
df_quant['timeframe'] = 'weekly'
df_quant:
delta_rol_a_last_first Type timeframe
quantile
0.01 -0.317000 pct weekly
0.03 -0.314636 pct weekly
0.05 -0.312273 pct weekly
0.10 -0.306364 pct weekly
So that works perfectly.
Now imagine I want to do the same but more dynamically. So consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I will create a dictionary for the periods over which I want to calculate my rolling returns:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
Now I want to create the same DataFrame as df_quant above. So my DataFrame should look something like this:
col_A_rolling col_B_rolling col_C_rolling Type timeframe
quantile
0.01 -0.317000 -0.234 -0.0443 pct weekly
0.03 -0.314636 -0.022 ... pct weekly
0.05 ... ... ... ...
0.10 ... ...
0.01 ... ...
0.03 ... ...
0.05 ... ...
0.10 -0.306364 -.530023 pct daily
(NOTE: the numbers in this DataFrame are hypothetical)
EDIT:
My attempt is this:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling'] = np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df1[col].iloc[i-key])/df1[col].iloc[i-key]
What is the best way to do this? Any help would be appreciated.
I didn't test all the code, but the first part can be replaced by DataFrame.rolling:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
df[['delta_rol_a_last_first']] = df.rolling(8).apply(convert)
# ---
print(df)
Or using a lambda:
df[['delta_rol_a_last_first']] = df.rolling(8).apply(lambda data: ((data.iloc[-1] - data.iloc[0])/data.iloc[1]))
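As a side note, if the intended return is (last - first) / first rather than dividing by the second value of the window (which is what the question's loop and convert above do), pct_change avoids apply entirely; a sketch:
# A sketch, assuming the denominator should be the first value of the window:
# pct_change(7) computes (a[i] - a[i-7]) / a[i-7] directly.
df['delta_rol_a_pct_change'] = df['a'].pct_change(periods=7)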
The same for many columns:
import pandas as pd
data = [
[99330,12,122], [1123,1230,1287], [123,101,812739], [1143,1230123,252],
[234,342,4546], [2445,3453,3457], [7897,8657,5675], [46,5675,453],
[76,484,3735], [363,93,4568], [385,568,367], [458,846,4847],
[574,45747,658468], [57457,46534,4675]
]
df = pd.DataFrame(
data,
index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C']
)
df.index = pd.to_datetime(df.index)
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
#df[['col_A_weekly', 'col_B_weekly', 'col_C_weekly']] = df.rolling(8).apply(convert)
new_columns = [name+'_weekly' for name in df.columns]
df[new_columns] = df.rolling(8).apply(convert)
# ---
print(df)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN
2022-01-03 123 101 812739 NaN NaN NaN
2022-01-04 1143 1230123 252 NaN NaN NaN
2022-01-05 234 342 4546 NaN NaN NaN
2022-01-06 2445 3453 3457 NaN NaN NaN
2022-01-07 7897 8657 5675 NaN NaN NaN
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297
2022-01-12 458 846 4847 0.091616 0.145960 0.087070
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506
EDIT:
Using two ranges, daily and weekly:
old_columns = df.columns
new_columns = [name+'_weekly' for name in old_columns]
df[new_columns] = df[old_columns].rolling(8).apply(convert)
new_columns = [name+'_daily' for name in old_columns]
df[new_columns] = df[old_columns].rolling(2).apply(convert)
Or using a loop:
old_columns = df.columns
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    new_columns = [name+'_'+suffix for name in old_columns]
    df[new_columns] = df[old_columns].rolling(days+1).apply(convert)
or
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    for name in old_columns:
        new_name = name + '_' + suffix
        df[new_name] = df[name].rolling(days+1).apply(convert)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly col_A_daily col_B_daily col_C_daily
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN -87.450579 0.990244 0.905206
2022-01-03 123 101 812739 NaN NaN NaN -8.130081 -11.178218 0.998416
2022-01-04 1143 1230123 252 NaN NaN NaN 0.892388 0.999918 -3224.154762
2022-01-05 234 342 4546 NaN NaN NaN -3.884615 -3595.850877 0.944567
2022-01-06 2445 3453 3457 NaN NaN NaN 0.904294 0.900956 -0.315013
2022-01-07 7897 8657 5675 NaN NaN NaN 0.690389 0.601132 0.390837
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187 -170.673913 -0.525463 -11.527594
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012 0.394737 -10.725207 0.878715
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778 0.790634 -4.204301 0.182356
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297 0.057143 0.836268 -11.446866
2022-01-12 458 846 4847 0.091616 0.145960 0.087070 0.159389 0.328605 0.924283
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441 0.202091 0.981507 0.992639
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506 0.990010 0.016912 -139.848770
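As a side note, if the index is a DatetimeIndex that may contain calendar gaps, a time-based offset window can be used instead of a fixed row count. A sketch, assuming the same convert function and old_columns list as above, and gap-free daily data so that an 8-day window covers 8 rows:
# min_periods guards against partial windows at the start, where
# convert's data.iloc[1] would not exist.
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    window = f'{days + 1}D'
    for name in old_columns:
        df[name + '_' + suffix + '_time'] = (
            df[name].rolling(window, min_periods=days + 1).apply(convert)
        )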
EDIT:
Quantile:
finall_df = pd.DataFrame()
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    df_quant = pd.DataFrame()
    for name in old_columns:
        new_name = name + '_' + suffix
        df_quant[name] = df[new_name][df[new_name]<0].quantile([0.01,0.03,0.05,0.1])
    df_quant.index.names = ['quantile']
    df_quant['Type'] = 'pct'
    df_quant['timeframe'] = suffix
    print(df_quant.to_string())
    #finall_df = finall_df.append(df_quant)
    finall_df = pd.concat([finall_df, df_quant])
print(finall_df)
Result:
col_A col_B col_C Type timeframe
quantile
0.01 -168.177213 -3452.463971 -3100.782522 pct daily
0.03 -163.183813 -3165.690158 -2854.038043 pct daily
0.05 -158.190413 -2878.916345 -2607.293564 pct daily
0.10 -145.706913 -2161.981813 -1990.432365 pct daily
0.01 -86.012694 -3523.433980 -3174.979575 pct weekly
0.03 -81.218849 -3379.921823 -3110.883170 pct weekly
0.05 -76.425004 -3236.409666 -3046.786764 pct weekly
0.10 -64.440391 -2877.629275 -2886.545751 pct weekly
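If the column names should carry the '_rolling' suffix shown in the question's desired output, a final rename does it; a sketch:
# A sketch: rename only the value columns, leaving Type and timeframe as-is.
finall_df = finall_df.rename(
    columns={name: name + '_rolling' for name in ('col_A', 'col_B', 'col_C')})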
I am working with a pandas DataFrame with a shape of 7837 rows and 19 columns. I am interested in getting the number of times a product_id appears per month (the Date column) and the associated amount, because a product_id can have various amounts. So I am looking for a way to say, for example, that product_id 1921 with amount 59 appeared ....
Here is a small version of the pandas DataFrame:
print(df)
CompanyName Produktname product_id amount Date
0 companyA productA 1921 59.0 Jan-2020
1 companyB productB 114 NaN May-2020
2 companyC productC 469 NaN Feb-2020
3 companyD productD 569 18.0 Jun-2020
4 companyE productE 569 18.0 March-2020
I think pivot_table might be helpful. I wanted to first see how many times each product_id appeared, with the Date values as the columns:
pd.pivot_table(df, index="product_id", values= "product_id" ,columns="Date", aggfunc="count")
but I get an error:
ValueError: Grouper for 'product_id' not 1-dimensional
Is there a way around this or a more efficient way to handle this?
IIUC use:
df = df.pivot_table(index="product_id", values= "amount" ,columns="Date", aggfunc="count")
print (df)
Date Feb-2020 Jan-2020 Jun-2020 March-2020 May-2020
product_id
114 NaN NaN NaN NaN 0.0
469 0.0 NaN NaN NaN NaN
569 NaN NaN 1.0 1.0 NaN
1921 NaN 1.0 NaN NaN NaN
For the correct column order, it is possible to use:
df['Date'] = pd.to_datetime(df['Date'], format='%b-%Y')
df = df.pivot_table(index="product_id",
values= "amount" ,
columns="Date",
aggfunc="count",
fill_value=0).rename(columns = lambda x: x.strftime('%b-%Y'))
print (df)
Date Jan-2020 Feb-2020 Mar-2020 May-2020 Jun-2020
product_id
114 0 0 0 0 0
469 0 0 0 0 0
569 0 0 1 0 1
1921 1 0 0 0 0
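If the goal is simply to count every row in which a product_id appears per month, regardless of whether amount is NaN, pd.crosstab is an equivalent worth knowing; a sketch, assuming df is the original frame from the question rather than the pivoted one above:
# A sketch: crosstab counts rows, so entries with NaN amount are included,
# unlike aggfunc="count" on the amount column.
out = pd.crosstab(df['product_id'], df['Date'])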
I have a DataFrame with more than 200 features; here is a part of the dataset to show the problem:
index ID X1 X2 Date1 Y1
0 2 324 634 2016-01-01 NaN
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654
4 4 543 843 2017-02-01 NaN
5 5 523 843 2015-09-01 NaN
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
As you can see, the rows are duplicated (in ID, X1, X2 and Date1), and from each pair of rows that are similar in ID, X1, X2 and Date1 I want to remove the one whose Y1 is NaN. So, my desired DataFrame should be:
index ID X1 X2 Date1 Y1
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
Does anyone know how I can handle it?
Use sort_values on "Y1" to move NaNs to the bottom of the DataFrame, and then use drop_duplicates:
df2 = (df.sort_values('Y1', na_position='last')
.drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
.sort_index())
df2
ID X1 X2 Date1 Y1
index
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654.0
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
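An alternative, if the original index and row order do not matter, is to group on the key columns and take the first non-NaN Y1 per group; a sketch:
# A sketch: GroupBy.first() returns the first non-NaN value per column, so the
# NaN duplicate is dropped; a group whose only Y1 is NaN (ID 6) keeps NaN.
# Note that the original index is discarded.
out = df.groupby(['ID', 'X1', 'X2', 'Date1'], as_index=False, sort=False).first()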
Just use the drop_duplicates function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
df \
    .sort_values('Y1', ascending=False) \
    .drop_duplicates(subset='ID')
I have a pandas timeseries dataframe that has date set as index and a number of columns (one is cusip).
I want to iterate through the dataframe and create a new dataframe where, for each cusip, I take the most recent data available.
I tried to use groupby:
newData = []
for group in df.groupby(df['CUSIP']):
    newData.append(group[group.index == max(group.index)])
but I get the error: 'builtin_function_or_method' object is not iterable
In [374]: df.head()
Out[374]:
CUSIP COLA COLB COLC
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.24
1995-12-12 BBB 254 5150 3.25
1997-12-12 BBB 245 6150 3.25
1998-12-12 CCC 234 5140 3.24145
1999-12-12 CCC 223 5120 3.65145
I want:
CUSIP COLA COLB COLC
date
1992-07-13 AAA 234 4677 3.485577
1997-12-12 BBB 245 6150 3.25
1999-12-12 CCC 223 5120 3.65145
Should I approach this another way? Thank you.
In [17]: df
Out[17]:
cusip a b c
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.240000
1995-12-12 BBB 254 5150 3.250000
1997-12-12 BBB 245 6150 3.250000
1998-12-12 CCC 234 5140 3.241450
1999-12-12 CCC 223 5120 3.651450
[7 rows x 4 columns]
Sort it
In [18]: df = df.sort_index()
In [19]: df
Out[19]:
cusip a b c
date
1992-05-08 AAA 238 4256 3.523346
1992-07-13 AAA 234 4677 3.485577
1992-12-12 BBB 221 5150 3.240000
1995-12-12 BBB 254 5150 3.250000
1997-12-12 BBB 245 6150 3.250000
1998-12-12 CCC 234 5140 3.241450
1999-12-12 CCC 223 5120 3.651450
[7 rows x 4 columns]
Take the last element from each group
In [20]: df.groupby('cusip').last()
Out[20]:
a b c
cusip
AAA 234 4677 3.485577
BBB 245 6150 3.250000
CCC 223 5120 3.651450
[3 rows x 3 columns]
If you want to keep the date index, reset first, group, then set the index back
In [9]: df.reset_index().groupby('cusip').last().reset_index().set_index('date')
Out[9]:
cusip a b c
date
1992-07-13 AAA 234 4677 3.485577
1997-12-12 BBB 245 6150 3.250000
1999-12-12 CCC 223 5120 3.651450
[3 rows x 4 columns]
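An alternative that keeps the date index without the reset/set round trip is to take the last row of each group with tail; a sketch:
# A sketch: after sorting by date, tail(1) keeps the last row per cusip
# together with its original date index.
out = df.sort_index().groupby('cusip').tail(1)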
I did it this way
df = pd.read_csv('/home/desktop/test.csv' )
Convert date to datetime:
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'])
Sort the dataframe the way you want:
df = df.sort(['CUSIP','date'], ascending=[True,False]).groupby('CUSIP')
Define what happens when you aggregate (according to the way you sorted):
def return_first(pd_series):
    return pd_series.values[0]
Make a dict to apply the same function to all columns:
agg_dict = {c: return_first for c in df.columns}
Finally, aggregate:
df = df.agg(agg_dict)
EDIT:
converting the date to datetime avoids this kind of error:
In [12]: df.sort(['CUSIP','date'],ascending=[True,False])
Out[12]:
date CUSIP COLA COLB COLC date_time
6 1999-12-12 CCC 223 5120 3.651450 1999-12-12 00:00:00
5 1998-12-12 CCC 234 5140 3.241450 1998-12-12 00:00:00
8 1997-12-4 DDD 999 9999 9.999999 1997-12-04 00:00:00
9 1997-12-05 DDD 245 6150 3.250000 1997-12-05 00:00:00
7 1992-07-6 DDD 234 4677 3.485577 1992-07-06 00:00:00
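In newer pandas versions, where DataFrame.sort has been replaced by sort_values, the same "sort, then take the first row of each group" idea can be written as follows; a sketch:
# A sketch with the current API: sort by CUSIP, then by date descending,
# and keep the first row of each CUSIP group.
out = (df.sort_values(['CUSIP', 'date'], ascending=[True, False])
         .groupby('CUSIP')
         .head(1))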