Multipy every five rows in python - python

I have two dataframes in pandas of the following form:
df1 df2
column factor
0 2 0 0.0
1 4 1 0.25
2 12 2 0.50
3 5 3 0.99
4 4 4 1.00
5 15
6 32
The work is to sumproduct every 5 row in df1 with df2 and put the new results in df3 (my actual data has about 500 rows in df1). The results should be like this:
df3
results **description (no need to add this column)**
0 15.95 df1.iloc[:4,0].dot(df2)
1 24.46 df1.iloc[1:5,0].dot(df2)
3 50.10 df1.iloc[2:6,0].dot(df2

Try:
import numpy as np
mx=df2.to_numpy()
df1.rolling(5).apply(lambda x: np.dot(x, mx), raw=True).iloc[4:]
Outputs:
column
4 15.95
5 24.46
6 50.10

Related

How do I perform rolling division with several columns in Pandas?

I'm having problems with pd.rolling() method that returns several outputs even though the function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using pd.iterrows(). But working with larger datasets makes this method ineffective.
This is the test data im working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want by using pd.iterrows()
RunningSum = []
for index, rows in df1.iterrows():
if index > 3:
Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop="True").values)-1)*100))
Average = Div.sum(axis=0)
SumOfAverages = np.sum(Average)
RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attemt to use pd.rolling() so that I dont need to loop through each row.
def SumOfAverageFunction(vals):
Div = abs((((df2.values / vals.reset_index(drop="True").values)-1)*100))
Average = Div.sum()
SumOfAverages = np.sum(Average)
return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem because printing RunningSums from above outputs several values and is not close to the results I'm getting using iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves, it's going to window around all of the columns and I don't know that there is a way around it. One solution is to apply rolling to a single column, and use the indexes from those windows to slice the dataframe inside your function. Still expensive, but probably not as bad as what you're doing.
Also the output of your first method looks wrong. You're actually starting your calculations a few rows too late.
import numpy as np
def SumOfAverageFunction(vals):
return (abs(np.divide(df2.values, df1.loc[vals.index].values)-1)*100).sum()
vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)

Pandas create new column based on division of two other columns

Hi I have the following df in which I want the new column to be the result of B/A unless B == 0 in which case take the average of C&D and divide by A so ((C+D)/2)/A.
I know how to do df["New Column"] = df["B"]/df["A"] But I am not sure how you would do it how I want. DO I need to iterate through each row of the df and use conditional if statements?
A B C D New Column Desired Column
5 3 2 4 0.6 0.6
6 2 2 3 0.333 0.333333333
8 4 3 4 0.5 0.5
9 0 3 4 0 0.388888889
14 3 3 4 0.214 0.214285714
5 0 2 4 0 0.6
Here you go:
import numpy as np
df["new Column"] = np.where(df["B"] != 0, df["B"]/df["A"], (df["C"]+df["D"])/2/df["A"])

How to explode aggregated pandas column

I have a df that looks like this:
df
time score
83623 4
83624 3
83625 3
83629 2
83633 1
I want to explode df.time so that the single digit increments by 1, and then the df.score value is duplicated for each added row. See example below:
time score
83623 4
83624 3
83625 3
83626 3
83627 3
83628 3
83629 2
83630 2
83631 2
83632 2
83633 1
From your sample, I assume df.time is integer. You may try this way
df_final = df.set_index('time').reindex(range(df.time.min(), df.time.max()+1),
method='pad').reset_index()
Out[89]:
time score
0 83623 4
1 83624 3
2 83625 3
3 83626 3
4 83627 3
5 83628 3
6 83629 2
7 83630 2
8 83631 2
9 83632 2
10 83633 1

How to find the maximum value of a column with pandas?

I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30-32nd (3 columns). How can it be done? I want to return the maximum value among these 3 columns and the index of dataframe.
print(Max_kVA_df.iloc[30:33].max())
hi you can refer this example
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],
'col2':[4,5,6,7,8],
'col3':[2,3,4,5,7]
})
print(df)
#print(df.iloc[:,0:3].max())# Mention range of the columns which you want, In your case change 0:3 to 30:33, here 33 will be excluded
ser=df.iloc[:,0:3].max()
print(ser.max())
Output
8
Select values by positions and use np.max:
Sample: for maximum by first 5 rows:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc this way:
print(df.iloc[[30, 31], 2].max())

Merging dataframes with different dimensions and related data

I have 2 dataframes with different size with related data to be merged in an efficient way:
master_df = pd.DataFrame({'kpi_1': [1,2,3,4]},
index=['dn1_app1_bar.com',
'dn1_app2_bar.com',
'dn2_app1_foo.com',
'dn2_app2_foo.com'])
guard_df = pd.DataFrame({'kpi_2': [1,2],
'kpi_3': [10,20]},
index=['dn1_bar.com', 'dn2_foo.com'])
master_df:
kpi_1
dn1_app1_bar.com 1
dn1_app2_bar.com 2
dn2_app1_foo.com 3
dn2_app2_foo.com 4
guard_df:
kpi_2 kpi_3
dn1_bar.com 1 10
dn2_foo.com 2 20
I want to get a dataframe with values from a guard_df's row indexed with <group>_<name> "propagated' to all master_df's rows matching
<group>_.*_<name>.
Expected result:
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1.0 10.0
dn1_app2_bar.com 2 1.0 10.0
dn2_app1_foo.com 3 2.0 20.0
dn2_app2_foo.com 4 2.0 20.0
What I've managed so far is the following basic approach:
def eval_base_dn(dn):
chunks = dn.split('_')
return '_'.join((chunks[0], chunks[2]))
for dn in master_df.index:
for col in guard_df.columns:
master_df.loc[dn, col] = guard_df.loc[eval_base_dn(dn), col]
but I'm looking for some more performant way to "broadcast" the values and merge the dataframes.
If use pandas 0.25+ is possible pass array, here index to on parameter of merge with left join:
master_df = master_df.merge(guard_df,
left_on=master_df.index.str.replace('_.+_', '_'),
right_index=True,
how='left')
print (master_df)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
Try this one:
>>> pd.merge(master_df.assign(guard_df_id=master_df.index.str.split("_").map(lambda x: "{0}_{1}".format(x[0], x[-1]))), guard_df, left_on="guard_df_id", right_index=True).drop(["guard_df_id"], axis=1)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20

Categories