Selecting minimum value between columns - python

I have a DataFrame that looks like the one below. "Name" is a student name, and the values under each Test column are that student's test grades.
Name Test1 Test2 Test3
Ana 87 93 82
Cole 62 73 84
Sia 64 58 60
Max 93 95 99
Leah 93 90 85
Cam 76 80 83
The desired result is the DataFrame below, where "MinTestGrade" is the lowest grade each student earned across the 3 tests and "TestNumber" is the test they got the lowest grade on.
Name TestNumber MinTestGrade
Ana 3 82
Cole 1 62
Sia 2 58
Max 1 93
Leah 3 85
Cam 1 76
How can I do this using Python?

You can pass idxmin and min to agg along axis=1 to find the minimum grade and the column name, i.e. TestNumber, that it corresponds to for each student. Then join the outcome with "Name", rename the columns, and finally strip the word "Test" from "TestNumber":
out = (df[['Name']]
       .join(df.filter(like='Test').agg(['idxmin', 'min'], axis=1))
       .rename(columns={'idxmin': 'TestNumber', 'min': 'MinTestGrade'}))
# note: lstrip removes a *set* of characters, not a prefix; it is safe here
# only because no digit appears in "Test"
out['TestNumber'] = out['TestNumber'].str.lstrip('Test').astype(int)
Output:
Name TestNumber MinTestGrade
0 Ana 3 82
1 Cole 1 62
2 Sia 2 58
3 Max 1 93
4 Leah 3 85
5 Cam 1 76
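For comparison, here is a more explicit sketch of the same idea, rebuilding the sample data from the question; the regex extraction is used instead of lstrip because lstrip strips a character set rather than a prefix:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ana', 'Cole', 'Sia', 'Max', 'Leah', 'Cam'],
    'Test1': [87, 62, 64, 93, 93, 76],
    'Test2': [93, 73, 58, 95, 90, 80],
    'Test3': [82, 84, 60, 99, 85, 83],
})

tests = df.filter(like='Test')   # only the Test columns
out = df[['Name']].copy()
# idxmin/min along axis=1 give the column label and value per row
out['TestNumber'] = tests.idxmin(axis=1).str.extract(r'(\d+)', expand=False).astype(int)
out['MinTestGrade'] = tests.min(axis=1)
print(out)
```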

Alternatively, set "Name" as the index first, so that agg returns the test label and minimum value directly:
df.set_index("Name").agg(["idxmin", "min"], axis=1).reset_index()
# Name idxmin min
# 0 Ana Test3 82
# 1 Cole Test1 62
# 2 Sia Test2 58
# 3 Max Test1 93
# 4 Leah Test3 85
# 5 Cam Test1 76

Related

Using Python, update the maximum value in each row of a dataframe with the sum of [column with maximum value] and [column named Threshold]

Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo data frame.
Here I want to choose the maximum for each row from 3 countries (INDIA, GERMANY, US), add the threshold value to that maximum record, and update it in the dataframe.
Let's take an example:
max[US,INDIA,GERMANY] = max[US,INDIA,GERMANY] + threshold
After performing this, the dataframe will be updated as below:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this using a for loop, but it is taking too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me with this. Looking forward to a good solution, thanks in advance!
The first solution compares the maximal value per row with all values of the filtered columns, then multiplies the mask by Threshold and adds it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                                 .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
Or use numpy: get the column names by idxmax, compare with an array built from the list cols, multiply and add to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
The solutions differ if there are multiple maximum values per row: the first solution adds the threshold to all maxima, the second only to the first maximum.
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5 <- changed data: duplicate maximum 100
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                                 .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
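For completeness, the same tie-to-first-maximum behavior as the numpy solution can be sketched with integer fancy indexing, rebuilding the demo data from the question:

```python
import numpy as np
import pandas as pd

df_final = pd.DataFrame({
    'Day': [11, 21, 32, 41],
    'US': [40, 60, 12, 99],
    'INDIA': [30, 70, 43, 23],
    'JAPAN': [20, 80, 57, 45],
    'GERMANY': [100, 55, 87, 65],
    'AUSTRALIA': [110, 57, 98, 78],
    'Threshold': [5, 8, 9, 12],
})
cols = ['US', 'INDIA', 'GERMANY']

vals = df_final[cols].to_numpy()
# argmax picks the first maximum per row; add Threshold at exactly that cell
vals[np.arange(len(vals)), vals.argmax(axis=1)] += df_final['Threshold'].to_numpy()
df_final[cols] = vals
print(df_final)
```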

How to get a column (Series) back after dropping it from a table?

print(df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
df.drop(['Chemistry'],axis=1,inplace=True)
df
Names Maths Physics
0 Khaja 75 91
1 Srihari 81 89
2 Krishna 69 77
3 jain 87 69
4 shakir 79 70
How can I get the dropped column back into the table? I tried reset_drop() but it doesn't work.
The final outcome should look like this:
print(df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
Use pop to extract the column as a Series, and join to add it back at the end of the DataFrame:
a = df.pop('Chemistry')
print (a)
0 84
1 71
2 76
3 68
4 74
Name: Chemistry, dtype: int64
print (df)
Names Maths Physics
0 Khaja 75 91
1 Srihari 81 89
2 Krishna 69 77
3 jain 87 69
4 shakir 79 70
df = df.join(a)
print (df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
If the column is not last, add reindex with the original columns:
cols = df.columns
a = df.pop('Maths')
print (a)
0 75
1 81
2 69
3 87
4 79
Name: Maths, dtype: int64
print (df)
Names Physics Chemistry
0 Khaja 91 84
1 Srihari 89 71
2 Krishna 77 76
3 jain 69 68
4 shakir 70 74
df = df.join(a).reindex(columns=cols)
print (df)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
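As an alternative to reindex, DataFrame.insert can put the Series back at its remembered position; a minimal sketch rebuilding the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Names': ['Khaja', 'Srihari', 'Krishna', 'jain', 'shakir'],
    'Maths': [75, 81, 69, 87, 79],
    'Physics': [91, 89, 77, 69, 70],
    'Chemistry': [84, 71, 76, 68, 74],
})

pos = df.columns.get_loc('Maths')   # remember the original position
a = df.pop('Maths')                 # remove the column as a Series
df.insert(pos, 'Maths', a)          # put it back exactly where it was
print(df.columns.tolist())
```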
It's always good practice to keep a master DataFrame and then do operations on subsets of it.
I would also suggest following good naming practice and giving subset dataframes meaningful names.
print (Master)
Names Maths Physics Chemistry
0 Khaja 75 91 84
1 Srihari 81 89 71
2 Krishna 69 77 76
3 jain 87 69 68
4 shakir 79 70 74
Chemistry = Master['Chemistry']
df_withoutChemistry = Master.drop(columns=['Chemistry'])
print(Chemistry)
0 84
1 71
2 76
3 68
4 74
Name: Chemistry, dtype: int64
print(df_withoutChemistry)
Names Maths Physics
0 Khaja 75 91
1 Srihari 81 89
2 Krishna 69 77
3 jain 87 69
4 shakir 79 70

How to build a data frame using pandas with the attributes arranged in order?

I want to make a data frame in pandas that looks like this:
Id Name Gender Math Science English
1 Ram Male 98 92 80
2 Hari Male 30 40 23
3 Gita Female 60 65 77
4 Sita Female 50 45 55
5 Shyam Male 80 88 82
I wrote code in Python like this:
import pandas as pd
d = {'Id':[1,2,3,4,5], 'Name':['Ram','Hari','Gita','Sita','Shyam'],'Gender':['Male','Male','Female','Female','Male'],'Math':[98,30,60,50,80],'Science':[92,40,65,45,88],'English':[80,23,77,55,82]}
df = pd.DataFrame(data=d)
print (df)
It gave me output like this:
English Gender Id Math Name Science
0 80 Male 1 98 Ram 92
1 23 Male 2 30 Hari 40
2 77 Female 3 60 Gita 65
3 55 Female 4 50 Sita 45
4 82 Male 5 80 Shyam 88
How do I remove the first column with no attribute name (the index), and arrange the attributes in the order given in the question?
I want Id, Name, Gender, Math, Science, English. Thanks
If you don't want the default index, you can set the index to a unique column like Id.
import pandas as pd
d = {'Id':[1,2,3,4,5], 'Name':['Ram','Hari','Gita','Sita','Shyam'],'Gender':['Male','Male','Female','Female','Male'],'Math':[98,30,60,50,80],'Science':[92,40,65,45,88],'English':[80,23,77,55,82]}
df = pd.DataFrame(data=d)
df.set_index('Id', inplace=True)
print (df)
Output:
Name Gender Math Science English
Id
1 Ram Male 98 92 80
2 Hari Male 30 40 23
3 Gita Female 60 65 77
4 Sita Female 50 45 55
5 Shyam Male 80 88 82
Try to create the DataFrame directly instead of going through "d":
df = pd.DataFrame({'Id': [1, 4, 7, 10], etc...})
Then use set_index to fix Id as the index:
df.set_index('Id')
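Older pandas versions sorted dict keys alphabetically, which is why the columns came out reordered; selecting an explicit column list (or passing columns=) fixes the order regardless of version. A minimal sketch with the question's data:

```python
import pandas as pd

d = {'Id': [1, 2, 3, 4, 5],
     'Name': ['Ram', 'Hari', 'Gita', 'Sita', 'Shyam'],
     'Gender': ['Male', 'Male', 'Female', 'Female', 'Male'],
     'Math': [98, 30, 60, 50, 80],
     'Science': [92, 40, 65, 45, 88],
     'English': [80, 23, 77, 55, 82]}
order = ['Id', 'Name', 'Gender', 'Math', 'Science', 'English']

# selecting the list reorders the columns explicitly
df = pd.DataFrame(d)[order]       # or: pd.DataFrame(d, columns=order)
print(df.columns.tolist())
```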

Performing operations on grouped rows in python

I have a dataframe where pic_code values may repeat. If a value repeats, I want to set the variable "keep" to "t" for the row whose "weight" is closest to its "mpe_wgt".
For example, the second pic_code row has "keep" set to t since it has the "weight" closest to its corresponding "mpe_wgt". My code leaves "keep" as 'f' and "diff" as 100 for every row.
df['keep']='f'
df['diff']=100
def cln_df(data):
    if pd.unique(data['mpe_wgt']).shape==(1,):
        data['keep'][0:1]='t'
    elif pd.unique(data['mpe_wgt']).shape!=(1,):
        data['diff']=abs(data['weight']-(data['mpe_wgt']/100))
        data['keep'][data['diff']==min(data['diff'])]='t'
    return data
df=df.groupby('pic_code').apply(cln_df)
df before
pic_code weight mpe_wgt keep diff
1234 45 34 f 100
1234 32 23 f 100
45344 54 35 f 100
234 76 98 f 100
234 65 12 f 100
df output should be
pic_code weight mpe_wgt keep diff
1234 45 34 f 11
1234 32 23 t 9
45344 54 35 t 100
234 76 98 t 22
234 65 12 f 53
I'm fairly new to Python, so please keep the solutions as simple as possible. I really want to make my method work, so please don't get too fancy. Thanks in advance for your help.
This is one way. Note I am using Boolean values True / False in place of strings "t" and "f". This is just good practice.
Note that all the below operations are vectorised, while groupby.apply with a custom function certainly is not.
Setup
print(df)
pic_code weight mpe_wgt
0 1234 45 34
1 1234 32 23
2 45344 54 35
3 234 76 98
4 234 65 12
Solution
# calculate difference
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
# sort by pic_code, then by diff
df = df.sort_values(['pic_code', 'diff'])
# define keep column as True only for non-duplicates by pic_code
df['keep'] = ~df.duplicated('pic_code')
Result
print(df)
pic_code weight mpe_wgt diff keep
3 234 76 98 22 True
4 234 65 12 53 False
1 1234 32 23 9 True
0 1234 45 34 11 False
2 45344 54 35 19 True
Use:
df['keep'] = df.assign(closest=(df['mpe_wgt']-df['weight']).abs())\
               .sort_values('closest').duplicated(subset=['pic_code'])\
               .replace({True:'f',False:'t'})
Output:
pic_code weight mpe_wgt keep
0 1234 45 34 f
1 1234 32 23 t
2 45344 54 35 t
3 234 76 98 t
4 234 65 12 f
Maybe you can try cumcount
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
df['keep'] = df.sort_values('diff').groupby('pic_code').cumcount().eq(0)
df
pic_code weight mpe_wgt diff keep
0 1234 45 34 11 False
1 1234 32 23 9 True
2 45344 54 35 19 True
3 234 76 98 22 True
4 234 65 12 53 False
Using eval and assign to execute similar logic as other answers.
m = dict(zip([False, True], 'tf'))
f = lambda d: d.sort_values('diff').duplicated('pic_code').map(m)
df.eval('diff=abs(weight - mpe_wgt)').assign(keep=f)
pic_code weight mpe_wgt keep diff
0 1234 45 34 f 11.0
1 1234 32 23 t 9.0
2 45344 54 35 t 19.0
3 234 76 98 t 22.0
4 234 65 12 f 53.0
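Another vectorised route is groupby.transform('min') instead of sorting; note that, unlike the duplicated-based answers, this marks all tied minima within a group. A sketch rebuilding the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'pic_code': [1234, 1234, 45344, 234, 234],
    'weight': [45, 32, 54, 76, 65],
    'mpe_wgt': [34, 23, 35, 98, 12],
})

df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
# keep is True where a row's diff equals its group's minimum diff
df['keep'] = df['diff'].eq(df.groupby('pic_code')['diff'].transform('min'))
print(df)
```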

How to split a DataFrame column using pandas

I have column values like 1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75] in a csv and need to extract the merged content into separate columns. How do I do that in pandas?
I need output like:
1ST 2ND 3RD S1 4TH 5TH 6TH S2 FIN
0 70 71 71 71 77 78 78 78 75
Here I have pasted some rows of that column:
1ST:[80]2ND:[79]3RD:[75]S1:[78]4TH:[76]5TH:[80]6TH:[87]S2:[81]FIN:[80]
1ST:[75]2ND:[74]3RD:[81]S1:[77]4TH:[80]5TH:[78]6TH:[87]S2:[82]FIN:[80]
1ST:[58]2ND:[54]3RD:[65]S1:[59]4TH:[80]5TH:[72]6TH:[74]S2:[75]FIN:[67]
1ST:[90]2ND:[91]3RD:[82]S1:[88]4TH:[84]5TH:[88]6TH:[87]S2:[86]FIN:[87]
1ST:[83]2ND:[79]3RD:[82]S1:[81]4TH:[85]5TH:[84]6TH:[90]S2:[86]FIN:[84]
In the dataframe I have one column containing the above values. I need to split it into different columns, with the values in the rows.
Your question seems confusing. What structure do you want the solution to produce?
Your file has values like this:
1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75]
Do you want the output in this form:
1ST 2ND 3RD S1 4TH 5TH 6TH S2 FIN
0 70 71 71 71 77 78 78 78 75
or like this
0 1
0 1ST 70
1 2ND 71
2 3RD 71
3 S1 71
4 4TH 77
5 5TH 78
6 6TH 78
7 S2 78
8 FIN 75
Now, an approach to get the output from the given input:
import pandas as pd
# consider your input is string (you can use csv)
file_val = "1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75]"
df = pd.DataFrame([i.split(':') for i in file_val.replace('[',"").split(']') if i!=""])
print(df)
0 1
0 1ST 70
1 2ND 71
2 3RD 71
3 S1 71
4 4TH 77
5 5TH 78
6 6TH 78
7 S2 78
8 FIN 75
Please share a snapshot of the csv file or a couple of rows, so that I can generate the output as per your requirement.
Coming back to the final solution as per your format:
import pandas as pd

# reading data
with open('sample.csv') as f:
    dat = f.read()
# splitting rows
dat1 = dat.split('\n')
# method to convert each row to a dict
def row_to_dict(row):
    return dict([i.split(":") for i in row.replace('[',"").split(']') if i!=""])
# now apply the method to each row of dat1 and create a single dataframe out of it
# that is nothing but the final output
res = pd.DataFrame(map(row_to_dict, dat1))
print(res)
1ST 2ND 3RD 4TH 5TH 6TH FIN S1 S2
0 80 79 75 76 80 87 80 78 81
1 75 74 81 80 78 87 80 77 82
2 58 54 65 80 72 74 67 59 75
3 90 91 82 84 88 87 87 88 86
4 83 79 82 85 84 90 84 81 86
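If the values already live in a DataFrame column, a vectorised sketch with str.extractall avoids the manual splitting; note pivot emits the columns in sorted order, so reindex afterwards if the original order matters:

```python
import pandas as pd

s = pd.Series([
    '1ST:[80]2ND:[79]3RD:[75]S1:[78]4TH:[76]5TH:[80]6TH:[87]S2:[81]FIN:[80]',
    '1ST:[75]2ND:[74]3RD:[81]S1:[77]4TH:[80]5TH:[78]6TH:[87]S2:[82]FIN:[80]',
])

# capture each NAME:[VALUE] pair, then pivot the names into columns
wide = (s.str.extractall(r'(?P<col>\w+):\[(?P<val>\d+)\]')
          .reset_index()
          .pivot(index='level_0', columns='col', values='val')
          .astype(int))
print(wide)
```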
The same result in R:
a1 = read.csv("c:/Users/Dell/Desktop/NewText.txt", header = FALSE)
a1$V1 = as.character(a1$V1)
g1 = NULL
g2 = NULL
l = list()
for (i in 1:nrow(a1))
{
  g1 = strsplit(a1$V1[i], "]")
  g1 = strsplit(g1[[1]], ":\\[")
  g2 = data.frame(g1)
  g2[] <- lapply(g2, as.character)
  colnames(g2) = g2[1,]
  g2 = g2[-1,]
  l[[i]] = g2
}
l = do.call('rbind', l)
