Finding min of values across multiple columns in pandas - python

I am trying to find the minimum of values across columns in a pandas DataFrame where the columns are split into ranges. For example, I have the DataFrame shown in the image.
I am iterating over the DataFrame for further logic and would like to get the minimum of the values in columns T3:T6 and in columns T11:T14 in separate variables.
I tried print(df.iloc[2,2:,2:4].min(axis=1)), which does not work.
I expect 9 and 13 for Row1 when I iterate.

Create a simple dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
A B C D
0 2 0 5 1
1 9 7 5 5
2 5 5 3 0
3 0 6 3 8
4 4 4 4 0
5 8 2 1 4
6 4 1 1 8
7 6 5 2 9
8 2 4 3 0
9 4 7 1 8
Use the min() function:
df.min()
Result:
A 0
B 0
C 1
D 0
And if you wish to select specific columns, use loc:
df.loc[:,'B':'C'].min()
B 0
C 1
Bonus: take pandas to another level and highlight the minimum:
df.style.apply(lambda x: ['background-color: red; font-size: 16px' if v == x.min() else 'font-size: 16px' for v in x], axis=0)

print(df[['T' + str(x) for x in range(3, 7)]].min(axis=1))
print(df[['T' + str(x) for x in range(11, 15)]].min(axis=1))
This should print the minima for all rows over T3, T4, T5, T6 and over T11, T12, T13, T14 separately.
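Since the original dataframe only exists as an image, here is a minimal sketch with a hypothetical single-row dataframe of columns T1..T14 (holding the values 9..22) to show the calls in action:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataframe in the question's image:
# one row, columns T1..T14, values 9..22.
df = pd.DataFrame(np.arange(9, 23).reshape(1, 14),
                  columns=['T' + str(i) for i in range(1, 15)])

min_a = df[['T' + str(x) for x in range(3, 7)]].min(axis=1)    # min over T3..T6
min_b = df[['T' + str(x) for x in range(11, 15)]].min(axis=1)  # min over T11..T14
print(min_a.iloc[0], min_b.iloc[0])  # 11 19
```

With monotonically increasing values the minimum of each range is simply its first column, which makes the result easy to check by eye.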

For a test dataframe:
df = pd.DataFrame({'A':[x for x in range(100)], 'B': [x for x in range(10,110)], 'C' : [x for x in range(20,120)] })
Create a function that can be applied to each row to find the minimum:
def test(row):
    print(row[['A','B']].min())
Then use apply to run the function on each row:
df.apply(lambda row: test(row), axis=1)
This will print the minimum of whichever columns you put inside the test function.
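Note that apply runs a Python-level loop under the hood; the same per-row minimum can be computed in a single vectorized call (same test dataframe as above):

```python
import pandas as pd

df = pd.DataFrame({'A': [x for x in range(100)],
                   'B': [x for x in range(10, 110)],
                   'C': [x for x in range(20, 120)]})

# Row-wise minimum over the chosen columns, no apply needed:
row_min = df[['A', 'B']].min(axis=1)
```

Here B is always A + 10, so the row-wise minimum equals column A for every row.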

Related

How to find the maximum value of a column with pandas?

I have a table with 40 columns and 1500 rows. I want to find the maximum value among columns 30-32 (3 columns). How can it be done? I want to return the maximum value among these 3 columns along with its dataframe index.
print(Max_kVA_df.iloc[30:33].max())
You can refer to this example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [2, 3, 4, 5, 7]})
print(df)
# Select the range of columns you want; in your case change 0:3 to 30:33 (the end, 33, is excluded)
ser = df.iloc[:, 0:3].max()
print(ser.max())
Output
8
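The question also asked for the index where the maximum occurs; a sketch using idxmax on the same example dataframe:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [2, 3, 4, 5, 7]})

sub = df.iloc[:, 0:3]             # the columns of interest
max_value = sub.max().max()       # overall maximum: 8
max_col = sub.max().idxmax()      # column holding it: 'col2'
max_row = sub[max_col].idxmax()   # row index holding it: 4
print(max_value, max_col, max_row)
```

idxmax returns the first matching label, so ties resolve to the earliest occurrence.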
Select values by position and use np.max.
Sample, for the maximum over the first 5 rows:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df.head())
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc to select the columns of interest directly:
print(df.iloc[:, 30:33].max().max())

How to get dataframe of unique ids

I'm trying to group the following dataframe by unique binId, then parse the resulting rows based on 'z' and pick the row with the highest value of 'z'. Here is my dataframe.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3','4','5','6'], 'binId': ['1','2','2','1','1','3'], 'x':[1,4,5,6,3,4], 'y':[11,24,35,16,23,34],'z':[1,4,5,2,3,4]})
I tried following code which gives required answer,
def f(x):
    tp = df[df['binId'] == x][['binId','ID','x','y','z']].sort_values(by='z', ascending=False).iloc[0]
    return tp
and then,
binids= pd.Series(df.binId.unique())
print(binids.apply(f))
The output is,
binId ID x y z
0 1 5 3 23 3
1 2 3 5 35 5
2 3 6 4 34 4
But the execution is too slow. What is the faster way of doing this?
Use idxmax to get the indices of the maxima and select them with loc:
df1 = df.loc[df.groupby('binId')['z'].idxmax()]
Or, faster, use sort_values with drop_duplicates:
df1 = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
print (df1)
ID binId x y z
4 5 1 3 23 3
2 3 2 5 35 5
5 6 3 4 34 4
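Both approaches select the same rows; a quick sketch checking them against each other on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3', '4', '5', '6'],
                   'binId': ['1', '2', '2', '1', '1', '3'],
                   'x': [1, 4, 5, 6, 3, 4],
                   'y': [11, 24, 35, 16, 23, 34],
                   'z': [1, 4, 5, 2, 3, 4]})

# idxmax approach: one row label per group, then label-based selection
a = df.loc[df.groupby('binId')['z'].idxmax()]

# sort/drop_duplicates approach: last row per binId after sorting by z
b = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
```

Note idxmax keeps the first maximum in a tie, while keep='last' after a stable sort does the same, so the two agree on ties as well.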

In pandas Dataframe with multiindex how can I filter by order?

Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
2 3
0 1
1 A 9 9
C 8 8
D 4 5
2 H 7 7
L 5 5
I want to filter the rows in the nth position of level 1 of the multiindex, e.g. keeping the first
2 3
0 1
1 A 9 9
2 H 7 7
or filtering the third
2 3
0 1
1 D 4 5
How can I achieve this?
You can filter rows with the help of GroupBy.nth after performing grouping on the first level of the multi-index DF. Since n follows the 0-based indexing approach, you need to provide the values appropriately to it as shown:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
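A quick check of these calls on the question's dataframe (note: in recent pandas versions, nth returns the selected rows with their original index, so as_index can be omitted):

```python
import pandas as pd

L = [(1, 'A', 9, 9), (1, 'C', 8, 8), (1, 'D', 4, 5), (2, 'H', 7, 7), (2, 'L', 5, 5)]
df = pd.DataFrame.from_records(L).set_index([0, 1])

first = df.groupby(level=0).nth(0)  # first row of each level-0 group
third = df.groupby(level=0).nth(2)  # third row; groups without one are dropped
```

Group 2 has only two rows, so nth(2) silently drops it rather than raising, which is exactly the filtering behavior the question asks for.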

Pandas: Get highest n rows based on multiple columns and they are matching each other

Suppose I have a pandas DataFrame like this, where the red values in columns C and E are the highest 10 numbers in each column.
How can I get a dataframe that only returns the rows which are among the highest 10 in both columns? If a value is in the highest 10 of one column but not both, the row is ignored.
At the moment I do this with loops: I loop through each column separately, save the row index if the value is in the highest 10, and then loop a third time to exclude indexes that are not in both. This is very inefficient since I work with a table of over 100,000 rows. Is there a better way to do it?
Consider the example dataframe df
np.random.seed([3, 1415])
rng = np.arange(10)
df = pd.DataFrame(
    dict(
        A=rng,
        B=list('abcdefghij'),
        C=np.random.permutation(rng),
        D=np.random.permutation(rng)
    )
)
print(df)
A B C D
0 0 a 9 1
1 1 b 4 3
2 2 c 5 5
3 3 d 1 9
4 4 e 7 4
5 5 f 6 6
6 6 g 8 0
7 7 h 3 2
8 8 i 2 7
9 9 j 0 8
Use nlargest to identify the largest values in each column, then use query to filter the dataframe:
n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)
df.query('C in @c_lrgst & D in @d_lrgst')
A B C D
2 2 c 5 5
5 5 f 6 6
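query with @ references local variables; the same filter can also be written with isin masks, which avoids query's string parser entirely:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
rng = np.arange(10)
df = pd.DataFrame(dict(
    A=rng,
    B=list('abcdefghij'),
    C=np.random.permutation(rng),
    D=np.random.permutation(rng),
))

n = 5
# Keep rows whose C value AND D value are both among that column's n largest:
mask = df.C.isin(df.C.nlargest(n)) & df.D.isin(df.D.nlargest(n))
result = df[mask]
```

With the seeded data above this keeps rows 2 and 5, matching the query output shown.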

sum of multiplication of cells in the same row but different column for pandas data frame

I have a data frame
df = pd.DataFrame({'A':[1,2,3],'B':[2,3,4]})
My data looks like this
Index A B
0 1 2
1 2 3
2 3 4
I would like to calculate the sum of multiplication between A and B in each row.
The expected result should be (1x2)+(2x3)+(3x4) = 2 + 6 + 12 = 20.
May I know the pythonic way to do this instead of looping?
You can multiply columns A and B and then use sum:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],'B':[2,3,4]})
print(df)
A B
0 1 2
1 2 3
2 3 4
print(df['A'] * df['B'])
0 2
1 6
2 12
dtype: int64
print((df['A'] * df['B']).sum())
20
Or use prod to multiply all the columns:
print(df.prod(axis=1))
0 2
1 6
2 12
dtype: int64
print(df.prod(axis=1).sum())
20
Thank you ajcr for the comment:
If you have just two columns, you can also use df.A.dot(df.B) for extra speed, but for three or more columns this is the way to do it!
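For reference, the three equivalent formulations side by side (dot is typically fastest with exactly two columns):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})

elementwise = (df['A'] * df['B']).sum()  # multiply then sum
via_dot = df.A.dot(df.B)                 # dot product of the two columns
via_prod = df.prod(axis=1).sum()         # generalizes to 3+ columns
print(elementwise, via_dot, via_prod)    # 20 20 20
```

All three compute (1*2) + (2*3) + (3*4) = 20; prod(axis=1) is the one that scales to any number of columns.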
