To begin with, I would like to mention that I am new to Python. I am trying to iterate over rows in pandas. My data comes from an Excel file and looks like this:
I would like to create a loop that calculates the mean of specific rows, for instance rows 0, 1, 2, then 9, 10, 11, and so on.
What I have already done:
import pandas as pd
import numpy as np
df = pd.read_excel("Excel name.xlsx")
for i in range([0,1,2],154,3)
    x = df.iloc[[i]].mean()
    print(x)
But I am not getting any results. Any ideas? Thank you in advance.
What I am doing now, and what actually works, is:
x1= df.iloc[[0,1,2]].mean()
x2= df.iloc[[9,10,11]].mean()
x3= df.iloc[[18,19,20]].mean()
x4= df.iloc[[27,28,29]].mean()
x5= df.iloc[[36,37,38]].mean()
x6= df.iloc[[45,46,47]].mean()
....
....
....
x17= df.iloc[[146,147,148]].mean()
What if I had 100 x variables? It would be impractical to write them all by hand. So my question is whether there is a way to automate this procedure with a loop.
Don't loop; instead, select all the relevant rows with a little arithmetic: take the index modulo 9 and keep positions 0, 1, 2 with Index.isin, then group by integer division of the index by 9 and aggregate with mean:
import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame(np.random.randint(10, size=(20, 3)))
mask = (df.index % 9).isin([0,1,2])
print(df[mask].groupby(df[mask].index // 9).mean())
0 1 2
0 4.000000 5.666667 6.666667
1 3.666667 6.000000 8.333333
2 6.500000 8.000000 7.000000
Detail:
print(df[mask])
0 1 2
0 4 5 9
1 0 6 5
2 8 6 6
9 1 6 7
10 5 6 9
11 5 6 9
18 4 9 7
19 9 7 7
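For completeness, the loop from the question can also be fixed directly. A minimal sketch, assuming the same df and the every-9-rows pattern (rows 0-2, then 9-11, then 18-20, and so on):
means = []
for start in range(0, len(df), 9):
    # slice three consecutive rows and take their column-wise mean
    means.append(df.iloc[start:start + 3].mean())
print(means)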
I'm having problems with the pd.rolling() method: it returns several outputs even though my function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames, each with 3 columns.
Sum all values
I can do this using iterrows(), but with larger datasets this method becomes inefficient.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want, using iterrows():
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100))
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use pd.rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs((((df2.values / vals.reset_index(drop=True).values) - 1) * 100))
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSums = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums from above outputs several values per row, and they are nowhere near the results I'm getting with the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves: it windows over every column separately, and I don't know of a way around that. One solution is to apply rolling to a single column and use the indexes of those windows to slice the DataFrame inside your function. Still expensive, but probably not as bad as what you're doing.
Also, the output of your first method looks wrong: you're actually starting your calculations a few rows too late.
import numpy as np

def SumOfAverageFunction(vals):
    # use the window's index to slice the matching rows of df1
    return (abs(np.divide(df2.values, df1.loc[vals.index].values) - 1) * 100).sum()

vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
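Note that raw=False matters here: it passes each window as a Series that keeps its original index, which the function needs for the df1.loc lookup. A possible usage sketch, dropping the NaN values that rolling produces for the first two incomplete windows:
RunningSums = df1.column1.rolling(3).apply(SumOfAverageFunction, raw=False).dropna()
print(RunningSums.tolist())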
This question already has answers here: Pandas conditional creation of a series/dataframe column (13 answers). Closed 3 years ago.
I cannot figure out how to compare two columns and, if one column is greater than or equal to the other, write 1 to a new column. If the condition is not met, I would like Python to do nothing.
The data set for testing is here:
import pandas as pd

data = [[12,10],[15,10],[8,5],[4,5],[15,'NA'],[5,'NA'],[10,10],[9,10]]
df = pd.DataFrame(data, columns=['Score', 'Benchmark'])
Score Benchmark
0 12 10
1 15 10
2 8 5
3 4 5
4 15 NA
5 5 NA
6 10 10
7 9 10
The desired output is:
desired_output_data = [[12,10, 1],[15,10,1],[8,5,1],[4,5],[15,'NA'],[5,'NA'],[10,10,1], [9,10]]
desired_output_df = pd.DataFrame(desired_output_data, columns = ['Score', 'Benchmark', 'MetBench'])
Score Benchmark MetBench
0 12 10 1.0
1 15 10 1.0
2 8 5 1.0
3 4 5 NaN
4 15 NA NaN
5 5 NA NaN
6 10 10 1.0
7 9 10 NaN
I tried doing something like this:
if df['Score'] >= df['Benchmark']:
    df['MetBench'] = 1
I am new to programming in general so any guidance would be greatly appreciated.
Thank you!
You can use ge and map:
df.Score.ge(df.Benchmark).map({True: 1, False:np.nan})
or rely on the implicit mapping from False to np.nan, since pandas uses the dict.get method to apply the mapping and None is the default value (thanks @piRSquared):
df.Score.ge(df.Benchmark).map({True: 1})
Or simply Series.where:
df.Score.ge(df.Benchmark).where(lambda s: s)
All of these output:
0 1.0
1 1.0
2 1.0
3 NaN
4 NaN
5 NaN
6 1.0
7 NaN
dtype: float64
Make sure to do
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
first, since you have 'NA' as a string, but you need the numeric value np.nan to be able to compare it with other numbers
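Putting both steps together on the question's data, a minimal end-to-end sketch:
import pandas as pd

data = [[12,10],[15,10],[8,5],[4,5],[15,'NA'],[5,'NA'],[10,10],[9,10]]
df = pd.DataFrame(data, columns=['Score', 'Benchmark'])

# coerce the 'NA' strings to NaN so the comparison is numeric
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
# True maps to 1; False has no mapping and becomes NaN
df['MetBench'] = df.Score.ge(df.Benchmark).map({True: 1})
print(df)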
I have the following code in Python:
import numpy as np
import pandas as pd
colum1 = [1,2,3,4,5,6,7,8,9,10,11,12]
colum2 = [10,20,30,40,50,60,70,80,90,100,110,120]
df = pd.DataFrame({
    'colum1': colum1,
    'colum2': colum2
})
df.loc[df.colum1 == 1, 'result'] = df['colum2']
for i in range(len(colum2)):
    df.result = np.where(df.colum1 > 1, 5 - (df['colum2'] - df.result.shift(1)), df.result)
The resulting df is:
colum1 colum2 result
0 1 10 10.0
1 2 20 -5.0
2 3 30 -30.0
3 4 40 -65.0
4 5 50 -110.0
5 6 60 -165.0
6 7 70 -230.0
7 8 80 -305.0
8 9 90 -390.0
9 10 100 -485.0
10 11 110 -590.0
11 12 120 -705.0
I would like to know if there is a method that allows me to obtain the same result without using a for loop.
Your operation depends on two things: the previous row in the DataFrame, and the difference between consecutive values. That hints that the solution will require shift and diff. In addition, a small constant (5) goes into the expanding sum, and the accumulated amount is subtracted from each row rather than added.
To set the pieces of the problem up, first create your shifted series, with 5 added and the cumulative sum taken:
a = df.colum2.shift().add(5).cumsum().fillna(0)
Now you need the differences between consecutive elements of the Series, with the missing first value filled from colum2:
b = df.colum2.diff().fillna(df.colum2)
To get your final result, simply subtract a from b:
b - a
0 10.0
1 -5.0
2 -30.0
3 -65.0
4 -110.0
5 -165.0
6 -230.0
7 -305.0
8 -390.0
9 -485.0
10 -590.0
11 -705.0
Name: colum2, dtype: float64
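Putting the pieces together as one runnable sketch on the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'colum1': np.arange(1, 13),
                   'colum2': np.arange(10, 130, 10)})

# running sum of each previous value plus the constant 5
a = df.colum2.shift().add(5).cumsum().fillna(0)
# consecutive differences, with the first entry filled from colum2 itself
b = df.colum2.diff().fillna(df.colum2)
df['result'] = b - a
print(df)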
I have a script that outputs multiple columns stacked beneath each other. I would like the columns to be merged together and the duplicates dropped. I've tried merge, combine, concatenate, and join, but I can't seem to figure it out. I also tried merging as a list, but that doesn't help either. Below is my code:
import pandas as pd
data = pd.ExcelFile('path')
newlist = [x for x in data.sheet_names if x.startswith("ZZZ")]
for x in newlist:
    sheets = pd.read_excel(data, sheetname=x)
    column = sheets.loc[:, 'YYY']
Any help is really appreciated!
Edit
Some more info about the code: data loads the Excel file; newlist collects the sheet names that start with ZZZ; the for loop then reads each of those sheets, and column selects the column named YYY. These columns are stacked beneath each other, but they aren't merged yet. For example:
Here is the output of the columns now; I would like them to be one list from 1 to 17.
I hope it is more clear now!
Edit 2.0
Here I tried the concat method mentioned below. However, I still get the output shown in the picture above instead of one list from 1 to 17.
my_concat_series = pd.Series()
for x in newlist:
    sheets = pd.read_excel(data, sheetname=x)
    column = sheets.loc[:, 'YYY']
    my_concat_series = pd.concat([my_concat_series, column]).drop_duplicates()
print(my_concat_series)
I don't see how pandas.concat doesn't work; let's try an example corresponding to the data picture you posted:
import numpy as np
import pandas as pd

col1 = pd.Series(np.arange(1, 12))
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
dtype: int64
col2 = pd.Series(np.arange(7,18))
0 7
1 8
2 9
3 10
4 11
5 12
6 13
7 14
8 15
9 16
10 17
dtype: int64
And then use pd.concat and drop_duplicates
pd.concat([col1,col2]).drop_duplicates()
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
5 12
6 13
7 14
8 15
9 16
10 17
dtype: int64
You can then reshape the data the way you want. For instance, if you don't want a duplicate index:
pd.concat([col1,col2]).drop_duplicates().reset_index(drop=True)
or, if you want the values as a numpy array instead of a pandas Series:
pd.concat([col1,col2]).drop_duplicates().values
Note that in the last case you can also use numpy arrays from the beginning, which is faster:
import numpy as np
np.unique(np.concatenate((col1.values,col2.values)))
If you want them as a list:
list(pd.concat([col1,col2]).drop_duplicates())
I'm confused by the highlighted line. What exactly is this line doing? What does .div do? I tried to look through the documentation, which said:
"Floating division of dataframe and other, element-wise (binary operator truediv)"
I'm not exactly sure what this means. Any help would be appreciated!
You can divide one DataFrame by another, and pandas will automatically align the indexes and columns before dividing the corresponding values, e.g. df1 / df2.
If you divide a DataFrame by a Series, pandas automatically aligns the Series index with the columns of the DataFrame. It may be that you want to align the index of the Series with the index of the DataFrame instead. If that is the case, you will have to use the div method.
So instead of:
df / s
You use
df.div(s, axis=0)
This says to align the index of s with the index of df, then perform the division while broadcasting over the other dimension, in this case the columns.
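A small sketch illustrating the difference, with a hypothetical df and s:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
s = pd.Series([1, 2])

# df / s aligns s with df's columns ('a', 'b'); nothing matches, so the result is all NaN
print(df / s)
# df.div(s, axis=0) aligns s with df's index: row 0 is divided by 1, row 1 by 2
print(df.div(s, axis=0))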
In the question's example, pclass_xt.sum(0) sums the values along axis=0 (down each column), giving the totals of survived and not survived across all pclasses. .div then divides the crosstab by those totals, so each value becomes that cell's share of its column total.
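A hedged reconstruction of that pattern (pclass_xt is not shown in the question, so the crosstab below uses made-up numbers):
import pandas as pd

# hypothetical crosstab: rows are passenger classes, columns are survived (0/1)
pclass_xt = pd.DataFrame({0: [80, 97, 372], 1: [136, 87, 119]}, index=[1, 2, 3])

totals = pclass_xt.sum(0)              # column totals across all classes
print(pclass_xt.div(totals, axis=1))   # each cell as a fraction of its column total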
import numpy as np
import pandas as pd

data = {"A": np.arange(10), "B": np.random.randint(1, 10, 10), "C": np.random.random(10)}
df2 = pd.DataFrame(data=data)
print("DataFrame values:\n", df2)
s1 = pd.Series(np.arange(1, 11))
print("s1 series values:\n", s1)
print("Result of Division:\n", df2.div(s1, axis=0))
# How div works here, row by row:
#   df2 row 0 / s1[0]  ->  0/1   2/1   0.265396/1
#   df2 row 1 / s1[1]  ->  1/2   2/2   0.055646/2
Output:
DataFrame values:
A B C
0 0 2 0.265396
1 1 2 0.055646
2 2 7 0.963006
3 3 9 0.958677
4 4 6 0.256558
5 5 6 0.859066
6 6 8 0.818831
7 7 4 0.656055
8 8 6 0.885797
9 9 4 0.412497
s1 series values:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
Result of Division:
A B C
0 0.000000 2.000000 0.265396
1 0.500000 1.000000 0.027823
2 0.666667 2.333333 0.321002
3 0.750000 2.250000 0.239669
4 0.800000 1.200000 0.051312
5 0.833333 1.000000 0.143178
6 0.857143 1.142857 0.116976
7 0.875000 0.500000 0.082007
8 0.888889 0.666667 0.098422
9 0.900000 0.400000 0.041250