How to calculate sumproduct in pandas by column?

I have a dataframe:
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
1 2847 2861 2875 2890 2904 94717
2 1338 1343 1348 1353 1358 70105
3 3301 3311 3321 3331 3341 60307
4 1425 1422 1419 1416 1413 79888
I want to add a new row to the table using the equivalent of Excel's SUMPRODUCT formula, =SUMPRODUCT(val array, 2000-xx array), applied to each 2000-xx column. The first value in the new row is computed as 2847x94717 + 1338x70105 + 3301x60307 + 1425x79888 = 676373596 (in Excel terms, B2xG2+B3xG3+B4xG4+B5xG5).
Output:
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
1 2847 2861 2875 2890 2904 94717
2 1338 1343 1348 1353 1358 70105
3 3301 3311 3321 3331 3341 60307
4 1425 1422 1419 1416 1413 79888
5 676373596 678413565 680453534 682588220 684628189
How do I go about this?

You can do this, assuming ID is not in the index:
df.loc[5, :] = df.iloc[:,1:-1].mul(df['val'], axis=0).sum()
Output:
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
0 1.0 2847.0 2861.0 2875.0 2890.0 2904.0 94717.0
1 2.0 1338.0 1343.0 1348.0 1353.0 1358.0 70105.0
2 3.0 3301.0 3311.0 3321.0 3331.0 3341.0 60307.0
3 4.0 1425.0 1422.0 1419.0 1416.0 1413.0 79888.0
5 NaN 676373596.0 678413565.0 680453534.0 682588220.0 684628189.0 NaN
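Use pandas.DataFrame.mul with axis=0 to multiply each date column by val row by row, then sum down each column. When the resulting Series is assigned to the new row, pandas aligns it on the column labels, so each total lands in the correct column and ID and val are left as NaN (which is also why the columns are displayed as floats).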
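As a minimal, self-contained sketch of the same idea, the snippet below rebuilds the first two date columns of the question's table and selects the date columns by name rather than by position. The variable names `date_cols` and `totals` are illustrative and not part of the original answer.

```python
import pandas as pd

# First two date columns of the question's table; ID is an ordinary column.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "2000-01": [2847, 1338, 3301, 1425],
    "2000-02": [2861, 1343, 3311, 1422],
    "val": [94717, 70105, 60307, 79888],
})

# Every column except ID and val is treated as a date column.
date_cols = df.columns.drop(["ID", "val"])

# Multiply each date column by val row by row, then sum down each column.
totals = df[date_cols].mul(df["val"], axis=0).sum()

# Assigning the Series to a new row label aligns it on the column names;
# ID and val are not in `totals`, so they become NaN.
df.loc[len(df)] = totals

print(totals["2000-01"])  # 676373596, the value worked out by hand in the question
```

Selecting by name avoids depending on ID being the first column and val being the last.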

You could compute the dot product with the @ operator and merge the result back into the original dataframe:
df.merge(pd.DataFrame(df.iloc[:,1:-1].T @ df['val']).T, how='outer')
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
0 1.0 2847 2861 2875 2890 2904 94717.0
1 2.0 1338 1343 1348 1353 1358 70105.0
2 3.0 3301 3311 3321 3331 3341 60307.0
3 4.0 1425 1422 1419 1416 1413 79888.0
4 NaN 676373596 678413565 680453534 682588220 684628189 NaN
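A small sketch of the dot-product route on a two-column version of the table. It appends the totals with pd.concat instead of merge, which is a substitution made here for illustration, and `date_cols`, `totals` and `appended` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "2000-01": [2847, 1338, 3301, 1425],
    "2000-02": [2861, 1343, 3311, 1422],
    "val": [94717, 70105, 60307, 79888],
})

date_cols = df.columns.drop(["ID", "val"])

# Transposed block (dates x rows) @ val (rows) gives one dot product per date column.
totals = df[date_cols].T @ df["val"]

# Turn the Series into a one-row frame and append it; ID and val become NaN.
appended = pd.concat([df, totals.to_frame().T], ignore_index=True)

print(appended)
```

Unlike the outer merge, concat does not try to match rows on the date columns, so it always adds exactly one row.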

Another option that gives the same result:
import numpy as np
columns_to_multiply = df.columns.drop(['ID', 'val'])
df1 = df.copy()
for x in columns_to_multiply:
    df1[x] *= df1['val']
# new row: the next ID, the column sums, and NaN for val
prod_sum_list = [len(df) + 1] + df1[columns_to_multiply].sum().tolist() + [np.nan]
df.loc[len(df.index)] = prod_sum_list
df

You can do:
row = [sum(df[col]*df['val']) for col in df.columns.drop(['ID','val'])]
row.insert(0, len(df)+1)
row.insert(len(row), 0)
df.loc[len(df)] = row
df.loc[len(df)-1,'val'] = ''

Related

Python: Loop over multiple items

Let's start with a simple DataFrame:
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
df:
a
0 100
1 100
2 105
3 110
4 100
5 106
6 120
7 110
8 105
9 70
10 90
11 100
Now, I want to calculate the returns on a 7-day rolling basis. So I apply the following:
df['delta_rol_a_last_first'] = np.nan
for i in range(7,len(df)):
    df['delta_rol_a_last_first'].iloc[i] = (df['a'].iloc[i] - df['a'].iloc[i-7])/df['a'].iloc[i-6]
df.dropna(inplace=True)
df:
a delta_rol_a_last_first
7 110 0.100000
8 105 0.047619
9 70 -0.318182
10 90 -0.200000
11 100 0.000000
Now I want to keep only the negative returns, compute quantiles of them, and add identifying columns to the rows as follows:
df_quant = df['delta_rol_a_last_first'][df['delta_rol_a_last_first'] <0].quantile([0.01,0.03,0.05,0.1])
df_quant.index.names = ['quantile']
df_quant=df_quant.to_frame()
df_quant['Type'] = 'pct'
df_quant['timeframe'] = 'weekly'
df_quant:
delta_rol_a_last_first Type timeframe
quantile
0.01 -0.317000 pct weekly
0.03 -0.314636 pct weekly
0.05 -0.312273 pct weekly
0.10 -0.306364 pct weekly
So that works perfectly.
Now imagine I want to do the same but more dynamically. So consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I will create a dictionary for the periods over which I want to calculate my rolling returns:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
Now I want to create the same DataFrame as df_quant above. So my DataFrame should look something like this:
col_A_rolling col_B_rolling col_C_rolling Type timeframe
quantile
0.01 -0.317000 -0.234 -0.0443 pct weekly
0.03 -0.314636 -0.022 ... pct weekly
0.05 ... ... ... ...
0.10 ... ...
0.01 ... ...
0.03 ... ...
0.05 ... ...
0.10 -0.306364 -.530023 pct daily
(NOTE: the numbers in this DataFrame are hypothetical)
EDIT:
My attempt is this:
periodicity_dict = {'1D':'daily', '1W':'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling']= np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df1[col].iloc[i-key])/df1[col].iloc[i-key]
What is the best way to do this? Any help would be appreciated.

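I didn't test all of the code, but the first part can be replaced by DataFrame.rolling. The window is 8 rows because it has to span positions i-7 through i:
import pandas as pd
df = pd.DataFrame({"a":[100,100,105,110,100,106,120,110,105,70,90, 100]})
# ---
def convert(data):
    # data is one window: data.iloc[0] is row i-7, data.iloc[1] is row i-6, data.iloc[-1] is row i
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
df[['delta_rol_a_last_first']] = df.rolling(8).apply(convert)
# ---
print(df)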
or using lambda
df[['delta_rol_a_last_first']] = df.rolling(8).apply(lambda data: ((data.iloc[-1] - data.iloc[0])/data.iloc[1]))
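As a quick check, the sketch below runs the rolling version on the single-column example and compares it with the numbers the loop in the question produced. The names `rolled` and `window` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [100, 100, 105, 110, 100, 106, 120, 110, 105, 70, 90, 100]})

def convert(window):
    # window is a Series of 8 consecutive rows (positions i-7 ... i):
    # (last - first) / second, which mirrors the loop's
    # (a[i] - a[i-7]) / a[i-6]
    return (window.iloc[-1] - window.iloc[0]) / window.iloc[1]

rolled = df["a"].rolling(8).apply(convert)

# The first 7 rows have no full window and stay NaN.
print(rolled.dropna().round(6).tolist())
# [0.1, 0.047619, -0.318182, -0.2, 0.0]
```

These are the same five values as in the delta_rol_a_last_first column of the question.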
The same for many columns:
import pandas as pd
data = [
[99330,12,122], [1123,1230,1287], [123,101,812739], [1143,1230123,252],
[234,342,4546], [2445,3453,3457], [7897,8657,5675], [46,5675,453],
[76,484,3735], [363,93,4568], [385,568,367], [458,846,4847],
[574,45747,658468], [57457,46534,4675]
]
df = pd.DataFrame(
data,
index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C']
)
df.index = pd.to_datetime(df.index)
# ---
def convert(data):
    return (data.iloc[-1] - data.iloc[0])/data.iloc[1]
#df[['col_A_weekly', 'col_B_weekly', 'col_C_weekly']] = df.rolling(8).apply(convert)
new_columns = [name+'_weekly' for name in df.columns]
df[new_columns] = df.rolling(8).apply(convert)
# ---
print(df)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN
2022-01-03 123 101 812739 NaN NaN NaN
2022-01-04 1143 1230123 252 NaN NaN NaN
2022-01-05 234 342 4546 NaN NaN NaN
2022-01-06 2445 3453 3457 NaN NaN NaN
2022-01-07 7897 8657 5675 NaN NaN NaN
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297
2022-01-12 458 846 4847 0.091616 0.145960 0.087070
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506
EDIT:
Using two windows, daily and weekly:
old_columns = df.columns
new_columns = [name+'_weekly' for name in old_columns]
df[new_columns] = df[old_columns].rolling(8).apply(convert)
new_columns = [name+'_daily' for name in old_columns]
df[new_columns] = df[old_columns].rolling(2).apply(convert)
Or using a loop:
old_columns = df.columns
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    new_columns = [name+'_'+suffix for name in old_columns]
    df[new_columns] = df[old_columns].rolling(days+1).apply(convert)
Or, column by column:
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    for name in old_columns:
        new_name = name + '_' + suffix
        df[new_name] = df[name].rolling(days+1).apply(convert)
Result:
col_A col_B col_C col_A_weekly col_B_weekly col_C_weekly col_A_daily col_B_daily col_C_daily
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 NaN NaN NaN -87.450579 0.990244 0.905206
2022-01-03 123 101 812739 NaN NaN NaN -8.130081 -11.178218 0.998416
2022-01-04 1143 1230123 252 NaN NaN NaN 0.892388 0.999918 -3224.154762
2022-01-05 234 342 4546 NaN NaN NaN -3.884615 -3595.850877 0.944567
2022-01-06 2445 3453 3457 NaN NaN NaN 0.904294 0.900956 -0.315013
2022-01-07 7897 8657 5675 NaN NaN NaN 0.690389 0.601132 0.390837
2022-01-08 46 5675 453 -88.409617 4.604065 0.257187 -170.673913 -0.525463 -11.527594
2022-01-09 76 484 3735 -8.512195 -7.386139 0.003012 0.394737 -10.725207 0.878715
2022-01-10 363 93 4568 0.209974 -0.000007 -3207.027778 0.790634 -4.204301 0.182356
2022-01-11 385 568 367 -3.239316 -3595.190058 0.025297 0.057143 0.836268 -11.446866
2022-01-12 458 846 4847 0.091616 0.145960 0.087070 0.159389 0.328605 0.924283
2022-01-13 574 45747 658468 -0.236925 4.885526 115.420441 0.202091 0.981507 0.992639
2022-01-14 57457 46534 4675 1077.391304 6.674361 -2.207506 0.990010 0.016912 -139.848770
EDIT:
Quantile:
finall_df = pd.DataFrame()
for days, suffix in ((1, 'daily'), (7, 'weekly')):
    df_quant = pd.DataFrame()
    for name in old_columns:
        new_name = name + '_' + suffix
        df_quant[name] = df[new_name][df[new_name]<0].quantile([0.01,0.03,0.05,0.1])
    df_quant.index.names = ['quantile']
    df_quant['Type'] = 'pct'
    df_quant['timeframe'] = suffix
    print(df_quant.to_string())
    #finall_df = finall_df.append(df_quant)
    finall_df = pd.concat([finall_df,df_quant])
print(finall_df)
Result:
col_A col_B col_C Type timeframe
quantile
0.01 -168.177213 -3452.463971 -3100.782522 pct daily
0.03 -163.183813 -3165.690158 -2854.038043 pct daily
0.05 -158.190413 -2878.916345 -2607.293564 pct daily
0.10 -145.706913 -2161.981813 -1990.432365 pct daily
0.01 -86.012694 -3523.433980 -3174.979575 pct weekly
0.03 -81.218849 -3379.921823 -3110.883170 pct weekly
0.05 -76.425004 -3236.409666 -3046.786764 pct weekly
0.10 -64.440391 -2877.629275 -2886.545751 pct weekly
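The attempt in the question divides by the value at the start of the period, i.e. the usual percentage change, whereas convert above divides by the second value in the window. As an alternative sketch (not the method used in this answer, so the numbers differ from the table above), pct_change can produce the same layout without a custom function. It assumes one row per day, so 7 rows stand in for a week, and the `_rolling` suffix follows the column names in the question's desired output.

```python
import pandas as pd

data = [
    [99330, 12, 122], [1123, 1230, 1287], [123, 101, 812739], [1143, 1230123, 252],
    [234, 342, 4546], [2445, 3453, 3457], [7897, 8657, 5675], [46, 5675, 453],
    [76, 484, 3735], [363, 93, 4568], [385, 568, 367], [458, 846, 4847],
    [574, 45747, 658468], [57457, 46534, 4675],
]
df1 = pd.DataFrame(
    data,
    index=pd.date_range("2022-01-01", periods=14, freq="D"),
    columns=["col_A", "col_B", "col_C"],
)

quantiles = [0.01, 0.03, 0.05, 0.1]
frames = []
for periods, label in ((1, "daily"), (7, "weekly")):
    # (x[i] - x[i - periods]) / x[i - periods] for every column at once
    returns = df1.pct_change(periods=periods)
    # keep negative returns only; everything else becomes NaN and is skipped
    negative = returns.where(returns < 0)
    q = negative.quantile(quantiles).add_suffix("_rolling")
    q.index.name = "quantile"
    q["Type"] = "pct"
    q["timeframe"] = label
    frames.append(q)

result = pd.concat(frames)
print(result)
```

A column with no negative returns for a given period simply yields NaN quantiles for that block.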

Replace the missing value NAN based on values of another columns (conditions)

Hi, I would like to fill in the NaN values in area based on the value of source.
I have tried np.select, but this method also overwrites the other, already correct values.
landline_area1['area'] = np.select(area_conditions, values)
Table overview
source codes area
4 1304 1304 Dover
5 1768 1768 Penrith
6 2077 NaN NaN
7 1225 1225 Bath
8 1142 NaN NaN
conditions
area_conditions = [
(landline_area1['source'].str.startswith('20')),
(landline_area1['source'].str.startswith('23')),
(landline_area1['source'].str.startswith('24'))]
values
values = [
'London',
'Southampton / Portsmouth',
'Coventry']
Expected result
source codes area
4 1304 1304 Dover
5 1768 1768 Penrith
6 2077 NaN London
7 1225 1225 Bath
8 1142 NaN Sheffield

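Use np.select to build the candidate values, then fill only the missing entries with fillna. Add astype(str) to the conditions if source is not a string column, and pass default=None so that rows matching no condition stay missing instead of receiving np.select's default of 0:
#landline_area1['source'].astype(str).str.startswith('20')
s = np.select(area_conditions, values, default=None)
landline_area1['area'] = landline_area1['area'].fillna(pd.Series(s, index=landline_area1.index))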
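A runnable sketch of this approach on the five rows shown in the question, assuming source holds strings. No condition in the question covers 1142, so that row is left missing here; producing the expected "Sheffield" would need an additional condition that the question does not give.

```python
import numpy as np
import pandas as pd

landline_area1 = pd.DataFrame(
    {
        "source": ["1304", "1768", "2077", "1225", "1142"],
        "area": ["Dover", "Penrith", np.nan, "Bath", np.nan],
    },
    index=[4, 5, 6, 7, 8],
)

source = landline_area1["source"].astype(str)
area_conditions = [
    source.str.startswith("20"),
    source.str.startswith("23"),
    source.str.startswith("24"),
]
values = ["London", "Southampton / Portsmouth", "Coventry"]

# default=None keeps unmatched rows missing instead of filling them with 0.
candidates = pd.Series(
    np.select(area_conditions, values, default=None),
    index=landline_area1.index,
)

# fillna only touches rows where area is missing, so Dover, Penrith and Bath survive.
landline_area1["area"] = landline_area1["area"].fillna(candidates)
print(landline_area1)
```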

Common values between multiple dataframes with different length

I have 3 huge dataframes whose columns have different lengths, for example:
A     B     C
2981  2952  1287
2759  2295  2952
1284  2235  1284
1295  1928  0887
2295  1284  1966
      1567  1928
      1287  2374
            2846
            2578
I want to find the common values between the three columns, like this:
A     B     C     Common
2981  2952  1287  1284
2759  2295  2952  2295
1284  2235  1284
1295  1928  0887
2295  1284  1966
      1567  2295
      1287  2374
            2846
            2578
I tried (from here)
df1['Common'] = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))
but I get this error, ValueError: Length of values does not match length of index

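The idea is to wrap the result in a Series whose index is the first len(a) labels of df1.index. The assignment then aligns on the index, and the remaining rows are filled with NaN: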
a = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
If all three columns are in the same DataFrame:
a = np.intersect1d(df1.A, np.intersect1d(df1.B, df1.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
print (df1)
A B C Common
0 2981.0 2952.0 1287 1284.0
1 2759.0 2295.0 2952 2295.0
2 1284.0 2235.0 1284 NaN
3 1295.0 1928.0 887 NaN
4 2295.0 1284.0 1966 NaN
5 NaN 1567.0 2295 NaN
6 NaN 1287.0 2374 NaN
7 NaN NaN 2846 NaN
8 NaN NaN 2578 NaN
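For reference, a self-contained sketch using the values from the output above. functools.reduce is used here only to avoid nesting the np.intersect1d calls, and the Series names A, B and C are illustrative.

```python
from functools import reduce

import numpy as np
import pandas as pd

A = pd.Series([2981, 2759, 1284, 1295, 2295])
B = pd.Series([2952, 2295, 2235, 1928, 1284, 1567, 1287])
C = pd.Series([1287, 2952, 1284, 887, 1966, 2295, 2374, 2846, 2578])

# Sorted unique values present in all three columns.
common = reduce(np.intersect1d, (A, B, C))

# Columns of different length are padded with NaN when combined.
df1 = pd.concat({"A": A, "B": B, "C": C}, axis=1)

# Give the short array an index so the assignment can align; the rest is NaN.
df1["Common"] = pd.Series(common, index=df1.index[:len(common)])

print(common.tolist())  # [1284, 2295]
```

Without the explicit index, assigning the two-element array to a nine-row frame raises the length-mismatch ValueError from the question.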

Subtracting fix date from whole panda data frame - python

I have data
customer_id purchase_amount date_of_purchase
0 760 25.0 06-11-2009
1 860 50.0 09-28-2012
2 1200 100.0 10-25-2005
3 1420 50.0 09-07-2009
4 1940 70.0 01-25-2013
5 1960 40.0 10-29-2013
6 2620 30.0 09-03-2006
7 3050 50.0 12-04-2007
8 3120 150.0 08-11-2006
9 3260 45.0 10-20-2010
10 3510 35.0 04-05-2013
11 3970 30.0 07-06-2007
12 4000 20.0 11-25-2005
13 4180 20.0 09-22-2010
14 4390 30.0 04-15-2011
15 4750 60.0 02-12-2013
16 4840 30.0 10-14-2005
17 4910 15.0 12-13-2006
18 4950 50.0 05-19-2010
19 4970 30.0 01-12-2006
20 5250 50.0 12-20-2005
Now I want to compute the number of days between each date_of_purchase and 01-01-2016.
I tried the following, expecting to get a new column days_since with the number of days:
NOW = pd.to_datetime('01/01/2016').strftime('%m-%d-%Y')
gb = customer_purchases_df.groupby('customer_id')
df2 = gb.agg({'date_of_purchase': lambda x: (NOW - x.max()).days})
Any suggestions on how I can achieve this?
Thanks in advance

pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days
0 2395
1 1190
2 3720
3 2307
4 1071
5 794
6 3407
7 2950
8 3430
9 1899
10 1001
11 3101
12 3689
13 1927
14 1722
15 1053
16 3731
17 3306
18 2053
19 3641
20 3664
Name: date_of_purchase, dtype: int64

I'm assuming the 'date_of_purchase' column already has the datetime dtype.
>>> df
customer_id purchase_amount date_of_purchase
0 760 25.0 2009-06-11
1 860 50.0 2012-09-28
2 1200 100.0 2005-10-25
>>> df['days_since'] = df['date_of_purchase'].sub(pd.to_datetime('01/01/2016')).dt.days.abs()
>>> df
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 2009-06-11 2395
1 860 50.0 2012-09-28 1190
2 1200 100.0 2005-10-25 3720
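The attempt in the question fails because strftime turns NOW into a plain string, and a string cannot be subtracted from a timestamp. A minimal sketch that keeps the reference date as a Timestamp and parses the MM-DD-YYYY strings explicitly, using the first three rows of the data (the names `reference` and `dates` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [760, 860, 1200],
    "purchase_amount": [25.0, 50.0, 100.0],
    "date_of_purchase": ["06-11-2009", "09-28-2012", "10-25-2005"],
})

# Keep the reference date as a Timestamp; do not call strftime on it.
reference = pd.Timestamp("2016-01-01")

# Parse the month-day-year strings with an explicit format.
dates = pd.to_datetime(df["date_of_purchase"], format="%m-%d-%Y")

# Timestamp minus datetime Series gives timedeltas; .dt.days extracts whole days.
df["days_since"] = (reference - dates).dt.days

print(df["days_since"].tolist())  # [2395, 1190, 3720]
```

Subtracting in this order gives positive day counts directly, so no abs() is needed.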

Python add an id for specific values

How do I add an id column to a dataframe? Values between 0 and 100 should get an id of 1, and all other values an id of 2.
time values
2018-03-19 14:31:17.200 1095
2018-03-19 14:31:17.300 2296
2018-03-19 14:31:17.400 2147
2018-03-19 14:31:17.500 309
2018-03-19 14:31:17.600 244
2018-03-19 14:31:17.700 263
2018-03-19 14:31:17.800 548

I think you need numpy.where with a condition created by between, which includes both endpoints by default:
df['id'] = np.where(df['values'].between(0,100), 1,2)
print (df)
time values id
1 2018-03-19 14:31:17.200 1095 2
2 2018-03-19 14:31:17.300 2296 2
3 2018-03-19 14:31:17.400 2147 2
4 2018-03-19 14:31:17.500 309 2
5 2018-03-19 14:31:17.600 244 2
6 2018-03-19 14:31:17.700 263 2
7 2018-03-19 14:31:17.800 548 2
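None of the sample values falls between 0 and 100, so every row above gets id 2. A small sketch with made-up values (not from the question) that exercises both branches and the inclusive endpoints:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to hit the boundaries 0 and 100.
df = pd.DataFrame({"values": [1095, 0, 57, 100, 101, 244]})

# between() is inclusive on both ends by default, so 0 and 100 count as inside.
df["id"] = np.where(df["values"].between(0, 100), 1, 2)

print(df["id"].tolist())  # [2, 1, 1, 1, 2, 2]
```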
