I have the following code in Python:
import numpy as np
import pandas as pd
colum1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
colum2 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
df = pd.DataFrame({
    'colum1': colum1,
    'colum2': colum2
})
df.loc[df.colum1 == 1, 'result'] = df['colum2']
for i in range(len(colum2)):
    df.result = np.where(df.colum1 > 1, 5 - (df['colum2'] - df.result.shift(1)), df.result)
The resulting DataFrame is:
colum1 colum2 result
0 1 10 10.0
1 2 20 -5.0
2 3 30 -30.0
3 4 40 -65.0
4 5 50 -110.0
5 6 60 -165.0
6 7 70 -230.0
7 8 80 -305.0
8 9 90 -390.0
9 10 100 -485.0
10 11 110 -590.0
11 12 120 -705.0
I would like to know if there is a method that allows me to obtain the same result without using a for loop.
Your operation depends on two things: the previous row in the DataFrame, and the difference between consecutive values. That hints that the solution will require shift and diff. However, you also add a small constant (the 5) to the expanding sum, and the running term is subtracted from each row, not added.
To set the pieces of the problem up, first create your shifted series, where you add 5:
a = df.colum2.shift().add(5).cumsum().fillna(0)
Now you need the difference between elements in the Series, and fill missing results with their respective value in colum2:
b = df.colum2.diff().fillna(df.colum2)
To get your final result, simply subtract a from b:
b - a
0 10.0
1 -5.0
2 -30.0
3 -65.0
4 -110.0
5 -165.0
6 -230.0
7 -305.0
8 -390.0
9 -485.0
10 -590.0
11 -705.0
Name: colum2, dtype: float64
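Putting the pieces together, the whole computation collapses to a single vectorized assignment (a sketch equivalent to the two steps above, with no Python-level loop):
df['result'] = (df.colum2.diff().fillna(df.colum2)
                - df.colum2.shift().add(5).cumsum().fillna(0))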
It seems simple, but I can't find an efficient way to solve this in Python 3: is there a loop I can use on my DataFrame that takes every column after the current column (starting with the first column), subtracts it from the current column, and lets me add the resulting column to a new DataFrame?
This is what I have so far, but when running run_analysis the "result" line raises an error, and I don't know how to store the results in a new DataFrame. I'm a beginner at all of this, so any help would be much appreciated.
storage = []  # container that will store the results of the subtracted columns

def subtract(a, b):  # function to call to do the column-wise subtractions
    return a - b

def run_analysis(frame, store):
    for first_col_index in range(len(frame)):  # finding the first column to use
        temp = []  # temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)):  # finding the second column to subtract from the first
            if sec_col_index <= first_col_index:
                # if the column is before or equal to the current column, skip to the next column
                continue
            else:
                # if the column is after the current one, subtract the values and keep the result in temp
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
                temp.append(result)
        store.append(temp)  # save the complete analysis in the store
Something like this?
import pandas as pd

# dummy dataframe
df = pd.DataFrame({'a': list(range(10)), 'b': list(range(10, 20)), 'c': list(range(10))})
print(df)
output:
a b c
0 0 10 0
1 1 11 1
2 2 12 2
3 3 13 3
4 4 14 4
5 5 15 5
6 6 16 6
7 7 17 7
8 8 18 8
9 9 19 9
Now iterate over adjacent pairs of columns, subtract them, and assign the result as a new column in the DataFrame:
for c1, c2 in zip(df.columns[:-1], df.columns[1:]):
    df[f'{c2}-{c1}'] = df[c2] - df[c1]
print(df)
output:
a b c b-a c-b
0 0 10 0 10 -10
1 1 11 1 10 -10
2 2 12 2 10 -10
3 3 13 3 10 -10
4 4 14 4 10 -10
5 5 15 5 10 -10
6 6 16 6 10 -10
7 7 17 7 10 -10
8 8 18 8 10 -10
9 9 19 9 10 -10
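If you need every later column rather than just the adjacent one, itertools.combinations generates all the pairs; a minimal sketch following the same pattern:
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'a': list(range(10)), 'b': list(range(10, 20)), 'c': list(range(10))})

# pairs every column with every later column, so 'a' meets both 'b' and 'c'
for c1, c2 in combinations(df.columns, 2):
    df[f'{c2}-{c1}'] = df[c2] - df[c1]
print(df)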
To begin with, I would like to mention that I am new to Python. I am trying to iterate over rows in pandas. My data comes from an Excel file.
I would like to create a loop that calculates the mean of specific rows: for instance rows 0, 1, 2, then rows 9, 10, 11, and so on.
What I have already done:
import pandas as pd
import numpy as np

df = pd.read_excel("Excel name.xlsx")

for i in range([0, 1, 2], 154, 3):
    x = df.iloc[[i]].mean()
    print(x)
But I am not getting results. Any idea? Thank you in advance.
What I am doing now, and what actually works, is:
x1= df.iloc[[0,1,2]].mean()
x2= df.iloc[[9,10,11]].mean()
x3= df.iloc[[18,19,20]].mean()
x4= df.iloc[[27,28,29]].mean()
x5= df.iloc[[36,37,38]].mean()
x6= df.iloc[[45,46,47]].mean()
....
....
....
x17= df.iloc[[146,147,148]].mean()
What if I had 100 of these x variables? Writing them all out by hand would be impractical. So my question is whether there is a way to automate this procedure with a loop.
Don't loop. Instead select all the wanted rows with a little arithmetic: take the index modulo 9 and keep positions 0, 1, 2 with Index.isin, then group by the integer division of the index by 9 and aggregate with mean:
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(10, size=(20, 3)))
mask = (df.index % 9).isin([0,1,2])
print(df[mask].groupby(df[mask].index // 9).mean())
0 1 2
0 4.000000 5.666667 6.666667
1 3.666667 6.000000 8.333333
2 6.500000 8.000000 7.000000
Detail:
print(df[mask])
0 1 2
0 4 5 9
1 0 6 5
2 8 6 6
9 1 6 7
10 5 6 9
11 5 6 9
18 4 9 7
19 9 7 7
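Applied to the frame from the question, the same idea would look like this (a sketch; the file name, the 9-row stride, and the rows-0-1-2 pattern come from the question):
import pandas as pd

df = pd.read_excel("Excel name.xlsx")

# keep rows 0,1,2 of every block of nine (0-2, 9-11, 18-20, ...)
mask = (df.index % 9).isin([0, 1, 2])
# one mean per block: the x1, x2, x3, ... of the question as rows of one frame
means = df[mask].groupby(df[mask].index // 9).mean()
print(means)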
I'm having problems with the rolling() method: it returns several output columns even though my function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns each.
Sum all the values.
I can do this with DataFrame.iterrows(), but on larger datasets that approach is inefficient.
This is the test data I'm working with:
# import libraries
import pandas as pd
import numpy as np

# create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want, using DataFrame.iterrows():
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs(((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100)
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)

# printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs(((df2.values / vals.reset_index(drop=True).values) - 1) * 100)
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSums = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums outputs several columns of values, and they are nowhere near the results from the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves: it windows over each of the columns separately, and I don't know that there is a way around that. One solution is to apply rolling to a single column and use the indexes of each window to slice the DataFrame inside your function. Still expensive, but probably not as bad as what you're doing.
Also the output of your first method looks wrong. You're actually starting your calculations a few rows too late.
import numpy as np

def SumOfAverageFunction(vals):
    # vals is one rolling window of column1; its index picks the matching rows of df1
    return (abs(np.divide(df2.values, df1.loc[vals.index].values) - 1) * 100).sum()

vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
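A usage note (a sketch, assuming the df1 and df2 defined in the question): with raw=False each window keeps its original index, the result lands at the last row of each window, and the first two entries are NaN because those windows are incomplete. Dropping them gives one value per complete window (the list differs at the start from the iterrows version, which, as noted, begins a few rows too late):
rolled = df1['column1'].rolling(3).apply(SumOfAverageFunction, raw=False)
print(rolled.dropna().tolist())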
So I have a DataFrame:
import numpy
import pandas

df = pandas.DataFrame([[numpy.nan, 5], [numpy.nan, 5], [2015, 5], [2020, 5],
                       [numpy.nan, 10], [numpy.nan, 10], [numpy.nan, 10],
                       [2090, 10], [2100, 10]],
                      columns=["value", "interval"])
value interval
0 NaN 5
1 NaN 5
2 2015.0 5
3 2020.0 5
4 NaN 10
5 NaN 10
6 NaN 10
7 2090.0 10
8 2100.0 10
I need to backward-fill the NaN values based on their interval and the first non-NaN value following that index, so the expected output is:
value interval
0 2005.0 5 # corrected 2010 - 5(interval)
1 2010.0 5 # corrected 2015 - 5(interval)
2 2015.0 5 # no change ( use this to correct 2 previous rows)
3 2020.0 5 # no change
4 2060.0 10 # corrected 2070 - 10
5 2070.0 10 # corrected 2080 - 10
6 2080.0 10 # corrected 2090 - 10
7 2090.0 10 # no change (use this to correct 3 previous rows)
8 2100.0 10 # no change
I am at a loss as to how I can accomplish this task using pandas/numpy vectorized operations ...
I can do it with a pretty simple loop:
last_good_value = None
fixed_values = []
for val, interval in reversed(df.values):
    if numpy.isnan(val) and last_good_value is not None:
        fixed_values.append(last_good_value - interval)
        last_good_value = fixed_values[-1]
    else:
        fixed_values.append(val)
        if not numpy.isnan(val):
            last_good_value = val
print(fixed_values[::-1])
which strictly speaking works... but I would like to understand a pandas solution that can resolve the values and avoid the loop (the real list is quite big).
First, get the position of each row within its group of rows sharing the same 'interval' value.
Then get the last value of each group.
What you are looking for is last_value - pos * interval:
df = df.reset_index()
grouped_df = df.groupby(['interval'])

# position of each row counted from the end of its group (last row -> 0)
df['pos'] = grouped_df['index'].rank(method='first', ascending=False) - 1
# last value of each group (NaNs are skipped, so this is the last real year)
df['last'] = grouped_df['value'].transform('last')
df['value'] = df['last'] - df['interval'] * df['pos']

del df['pos'], df['last'], df['index']
Create a grouping Series that groups the last non-null value with all NaN rows before it, by reversing with [::-1]. Then you can bfill and use cumsum to determine how much to subtract off of every row.
s = df['value'].notnull()[::-1].cumsum()
subt = df.loc[df['value'].isnull(), 'interval'][::-1].groupby(s).cumsum()
df['value'] = df.groupby(s)['value'].bfill().subtract(subt, fill_value=0)
value interval
0 2005.0 5
1 2010.0 5
2 2015.0 5
3 2020.0 5
4 2060.0 10
5 2070.0 10
6 2080.0 10
7 2090.0 10
8 2100.0 10
Because subt is subset to only the NaN rows, fill_value=0 ensures that rows which already have values remain unchanged:
print(subt)
#6 10
#5 20
#4 30
#1 5
#0 10
#Name: interval, dtype: int64
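For reference, the grouping Series s built in the first step (shown in reversed index order because of the [::-1]); each group ties a run of NaNs to the first non-null value that follows them:
print(s)
#8    1
#7    2
#6    2
#5    2
#4    2
#3    3
#2    4
#1    4
#0    4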
I cannot figure out how to compare two columns so that, if one column is greater than or equal to the other, a 1 is entered in a new column. If the condition is not met, I would like Python to do nothing.
The data set for testing is here:
import pandas as pd

data = [[12,10],[15,10],[8,5],[4,5],[15,'NA'],[5,'NA'],[10,10],[9,10]]
df = pd.DataFrame(data, columns=['Score', 'Benchmark'])
Score Benchmark
0 12 10
1 15 10
2 8 5
3 4 5
4 15 NA
5 5 NA
6 10 10
7 9 10
The desired output is:
desired_output_data = [[12,10, 1],[15,10,1],[8,5,1],[4,5],[15,'NA'],[5,'NA'],[10,10,1], [9,10]]
desired_output_df = pd.DataFrame(desired_output_data, columns = ['Score', 'Benchmark', 'MetBench'])
Score Benchmark MetBench
0 12 10 1.0
1 15 10 1.0
2 8 5 1.0
3 4 5 NaN
4 15 NA NaN
5 5 NA NaN
6 10 10 1.0
7 9 10 NaN
I tried doing something like this:
if df['Score'] >= df['Benchmark']:
    df['MetBench'] = 1
I am new to programming in general so any guidance would be greatly appreciated.
Thank you!
You can use ge and map:
df.Score.ge(df.Benchmark).map({True: 1, False:np.nan})
Or use the mapping from False to np.nan implicitly, since pandas uses the dict.get method to apply the mapping and None is the default value (thanks @piRSquared):
df.Score.ge(df.Benchmark).map({True: 1})
Or simply Series.where:
df.Score.ge(df.Benchmark).where(lambda s: s)
Either way, the output is:
0 1.0
1 1.0
2 1.0
3 NaN
4 NaN
5 NaN
6 1.0
7 NaN
dtype: float64
Make sure to do
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
first, since you have 'NA' as a string, and you need the numeric value np.nan to be able to compare it with other numbers.
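Putting it together for the new column the question asks for (a sketch using the map variant from above):
import pandas as pd

data = [[12, 10], [15, 10], [8, 5], [4, 5], [15, 'NA'], [5, 'NA'], [10, 10], [9, 10]]
df = pd.DataFrame(data, columns=['Score', 'Benchmark'])

# coerce the 'NA' strings to NaN so the comparison is numeric
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')

# rows meeting the benchmark get 1, everything else stays NaN
df['MetBench'] = df.Score.ge(df.Benchmark).map({True: 1})
print(df)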