I have a list of students in a CSV file. Using Python, I want to display four columns showing the male students who have the highest marks in Maths, Computer, and Physics.
I tried to use the pandas library:
marks = pd.concat([data['name'],
data.loc[data['students']==1, 'maths'].nlargest(n=10)], 'computer'].nlargest(n=10)], 'physics'].nlargest(n=10)])
I used 1 for male students and 0 for female students.
It gives me an error saying: Invalid syntax.
Here's a way to show the top 10 students in each of the disciplines. You could of course just sum the three scores and select the students with the highest total if you want the combined as opposed to the individual performance (see illustration below).
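As an aside, the invalid syntax in your attempt comes from unbalanced brackets: each column selection needs its own complete data.loc[...] call inside the pd.concat list. A minimal corrected sketch of that expression, assuming your column names (maths, computer, physics) and students as the gender flag:
marks = pd.concat([data['name'],
                   data.loc[data['students'] == 1, 'maths'].nlargest(n=10),
                   data.loc[data['students'] == 1, 'computer'].nlargest(n=10),
                   data.loc[data['students'] == 1, 'physics'].nlargest(n=10)],
                  axis=1)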
import random
import numpy as np
import pandas as pd
# sample data: 100 students, gender flag (1 = male, 0 = female), three subject scores
df1 = pd.DataFrame(data={'name': [''.join(random.choice('abcdefgh') for _ in range(8)) for i in range(100)],
                         'students': np.random.randint(0, 2, size=100)})
df2 = pd.DataFrame(data=np.random.randint(0, 10, size=(100, 3)), columns=['math', 'physics', 'computers'])
data = pd.concat([df1, df2], axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
name 100 non-null object
students 100 non-null int64
math 100 non-null int64
physics 100 non-null int64
computers 100 non-null int64
dtypes: int64(4), object(1)
memory usage: 4.0+ KB
res = pd.concat([data.loc[:, ['name']],
                 data.loc[data['students'] == 1, 'math'].nlargest(n=10),
                 data.loc[data['students'] == 1, 'physics'].nlargest(n=10),
                 data.loc[data['students'] == 1, 'computers'].nlargest(n=10)],
                axis=1)
res.dropna(how='all', subset=['math', 'physics', 'computers'])
name math physics computers
0 geghhbce NaN 9.0 NaN
1 hbbdhcef NaN 7.0 NaN
4 ghgffgga NaN NaN 8.0
6 hfcaccgg 8.0 NaN NaN
14 feechdec NaN NaN 8.0
15 dfaabcgh 9.0 NaN NaN
16 ghbchgdg 9.0 NaN NaN
23 fbeggcha NaN NaN 9.0
27 agechbcf 8.0 NaN NaN
28 bcddedeg NaN NaN 9.0
30 hcdgbgdg NaN 8.0 NaN
38 fgdfeefd NaN NaN 9.0
39 fbcgbeda 9.0 NaN NaN
41 agbdaegg 8.0 NaN 9.0
49 adgbefgg NaN 8.0 NaN
50 dehdhhhh NaN NaN 9.0
55 ccbaaagc NaN 8.0 NaN
68 hhggfffe 8.0 9.0 NaN
71 bhggbheg NaN 9.0 NaN
84 aabcefhf NaN NaN 9.0
85 feeeefbd 9.0 NaN NaN
86 hgeecacc NaN 8.0 NaN
88 ggedgfeg 9.0 8.0 NaN
89 faafgbfe 9.0 NaN 9.0
94 degegegd NaN 8.0 NaN
99 beadccdb NaN NaN 9.0
For the combined performance mentioned above, sum the three scores and take the ten highest totals:
data['total'] = data.loc[:, ['math', 'physics', 'computers']].sum(axis=1)
data[data.students == 1].nlargest(10, 'total')  # nlargest already sorts by 'total' descending
name students math physics computers total
29 fahddafg 1 8 8 8 24
79 acchhcdb 1 8 9 7 24
9 ecacceff 1 7 9 7 23
16 dccefaeb 1 9 9 4 22
92 dhaechfb 1 4 9 9 22
47 eefbfeef 1 8 8 5 21
60 bbfaaada 1 4 7 9 20
82 fbbbehbf 1 9 3 8 20
18 dhhfgcbb 1 8 8 3 19
1 ehfdhegg 1 5 7 6 18
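Back to the original question: if the data lives in a CSV file, the same logic applies once it is loaded. A minimal sketch, assuming a hypothetical file students.csv with the column names used above:
import pandas as pd

data = pd.read_csv('students.csv')  # hypothetical file name/path
data['total'] = data[['math', 'physics', 'computers']].sum(axis=1)
print(data[data['students'] == 1].nlargest(10, 'total'))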
Let's say I have the following df with two columns:
col1 col2
NaN NaN
11 100
12 110
15 115
NaN NaN
NaN NaN
NaN NaN
9 142
12 144
NaN NaN
NaN NaN
NaN NaN
6 155
9 156
7 161
NaN NaN
NaN NaN
I'd like to forward fill and replace the NaN values with the median of the preceding run of values. For example, the median of 11, 12, 15 in 'col1' is 12, so the NaN values that follow should be filled with 12 until the next non-NaN values appear in the column, and so on down the frame. Appreciate any help on this. See below the expected df:
col1 col2
NaN NaN
11 100
12 110
15 115
12 110
12 110
12 110
9 142
12 144
10.5 143
10.5 143
10.5 143
6 155
9 156
7 161
7 156
7 156
Try:
# mark each run of consecutive NaN / non-NaN values with its own group id
m1 = (df.col1.isna() != df.col1.isna().shift(1)).cumsum()
m2 = (df.col2.isna() != df.col2.isna().shift(1)).cumsum()

# per-run medians are NaN for the all-NaN runs, so forward fill them
# from the preceding run of real values before using them to fill
df["col1"] = df["col1"].fillna(
    df.groupby(m1)["col1"].transform("median").ffill()
)
df["col2"] = df["col2"].fillna(
    df.groupby(m2)["col2"].transform("median").ffill()
)
print(df)
Prints:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
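If there were more columns, the same pattern could be written once and applied in a loop; a minimal sketch of that variant (same logic as above, just generalized):
for col in ["col1", "col2"]:
    # new group id whenever the column flips between NaN and non-NaN
    runs = (df[col].isna() != df[col].isna().shift(1)).cumsum()
    df[col] = df[col].fillna(df.groupby(runs)[col].transform("median").ffill())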
IIUC, if we fill null values like so:
1. Fill with the median of the last 3 non-null items (the rolling window of 4 covers the current NaN row plus the three rows before it).
2. Fill with the median of the last 2 non-null items (this catches runs of only two values before a gap, like the 9, 12 run).
3. Forward fill the remaining values.
We'll get what you're looking for.
out = (df.combine_first(df.rolling(4,3).median())
.combine_first(df.rolling(3,2).median())
.ffill())
print(out)
Output:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
I have a dataframe built by the code at the end of this question (the original post showed each intermediate result as a screenshot; the images are not reproduced here).
I'm grouping by consecutive runs of the 'Name' column to get total counts and consecutive counts, and applying min and max to the 'Age' column, to generate an intermediate dataframe.
Then I take only the first value of every column for each consecutive group.
Then I try to get all column values from each consecutive group whose max 'Age' lies in the 5-20 bin, and concatenate that dataframe with the one holding the first values. But the output I got differs from the expected output.
Also, this is for a single bin, i.e. 5-20. How do I extend it to more than one bin, for example a first bin of 5-20 and a second of 25-40?
This is the code I have written for the above outputs:
import numpy as np
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['tom', 5], ['nick', 15], ['juli', 14], ['tom', 20],['tom', 10], ['tom', 10], ['juli', 17], ['tom', 30], ['nick', 19], ['juli', 24], ['juli', 29],['tom', 0], ['juli', 76]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
# print dataframe.
print("df = ",df)
print("")
# acquire min, max, count, consecutive same names
df['min'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Age'].transform('min')
df['max'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Age'].transform('max')
df['count'] = df.groupby("Name", sort=False)['Name'].transform('count')
df['cons'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Name'].transform('size')
print(df)
# take the first value of each column for every consecutive group
df_t = df
temp_df = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum(), as_index=False)[df.columns].agg('first')
print("")
print("temp_df = ",temp_df)
df_t = df_t.reset_index()
df_t = df_t.drop(['index'], axis=1)
print("df_t = ", df_t)
# check max of bin 5-20 for every consecutive group
df_t1 = df_t.groupby(df_t['Name'].ne(df_t['Name'].shift()).cumsum(), as_index=False).apply(
    lambda x: x['Age'][(x['Age'] >= 5) & (x['Age'] < 20)].agg(lambda y: y.idxmax()))
print("")
print("df_t1 = ", df_t1)
# keep only entries that are integer row indices (np.int64),
# skipping the NaN entries of groups with no value in the bin
a = df_t1.tolist()
b = []
c = np.array([2])
c = c.astype('int64')
for i in a:
    if type(i) == type(c[0]):
        b.append(i)
    else:
        continue
df_t1 = df_t.iloc[b]
print("")
print("output df_t1 = ", df_t1)
# concat the bin max and first value df
concatdf = pd.concat([temp_df, df_t1],axis=1)
print("")
print("concatdf = ", concatdf)
Thank you in advance :)
You can greatly simplify your code by doing a single groupby for almost all indicators except the cumulated count.
Then just mask your data according to your criterion and concatenate.
I believe this is doing what you want:
group = df['Name'].ne(df['Name'].shift()).cumsum()
df2 = (df
.groupby(group, as_index=False)
.agg(**{'Name': ('Name', 'first'),
'Age': ('Age', 'first'),
'min': ('Age', 'min'),
'max': ('Age', 'max'),
'cons': ('Age', 'count')
})
.assign(count=lambda d: d.groupby('Name')['cons'].transform('sum'))
)
out = pd.concat([df2, df2.where(df2['max'].between(5,20))], axis=1)
output:
Name Age min max cons count Name Age min max cons count
0 tom 10 5 10 2 7 tom 10.0 5.0 10.0 2.0 7.0
1 nick 15 15 15 1 2 nick 15.0 15.0 15.0 1.0 2.0
2 juli 14 14 14 1 5 juli 14.0 14.0 14.0 1.0 5.0
3 tom 20 10 20 3 7 tom 20.0 10.0 20.0 3.0 7.0
4 juli 17 17 17 1 5 juli 17.0 17.0 17.0 1.0 5.0
5 tom 30 30 30 1 7 NaN NaN NaN NaN NaN NaN
6 nick 19 19 19 1 2 nick 19.0 19.0 19.0 1.0 2.0
7 juli 24 24 29 2 5 NaN NaN NaN NaN NaN NaN
8 tom 0 0 0 1 7 NaN NaN NaN NaN NaN NaN
9 juli 76 76 76 1 5 NaN NaN NaN NaN NaN NaN
For more bins:
bins = [(5,20), (25,40)]
out = pd.concat([df2]+[df2.where(df2['max'].between(a,b)) for a,b in bins], axis=1)
output:
Name Age min max cons count Name Age min max cons count Name Age min max cons count
0 tom 10 5 10 2 7 tom 10.0 5.0 10.0 2.0 7.0 NaN NaN NaN NaN NaN NaN
1 nick 15 15 15 1 2 nick 15.0 15.0 15.0 1.0 2.0 NaN NaN NaN NaN NaN NaN
2 juli 14 14 14 1 5 juli 14.0 14.0 14.0 1.0 5.0 NaN NaN NaN NaN NaN NaN
3 tom 20 10 20 3 7 tom 20.0 10.0 20.0 3.0 7.0 NaN NaN NaN NaN NaN NaN
4 juli 17 17 17 1 5 juli 17.0 17.0 17.0 1.0 5.0 NaN NaN NaN NaN NaN NaN
5 tom 30 30 30 1 7 NaN NaN NaN NaN NaN NaN tom 30.0 30.0 30.0 1.0 7.0
6 nick 19 19 19 1 2 nick 19.0 19.0 19.0 1.0 2.0 NaN NaN NaN NaN NaN NaN
7 juli 24 24 29 2 5 NaN NaN NaN NaN NaN NaN juli 24.0 24.0 29.0 2.0 5.0
8 tom 0 0 0 1 7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 juli 76 76 76 1 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
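Since the repeated blocks share column names, one optional refinement (a sketch, not part of the original output) is to label each block with keys so the columns stay distinguishable:
bins = [(5, 20), (25, 40)]
out = pd.concat([df2] + [df2.where(df2['max'].between(a, b)) for a, b in bins],
                axis=1,
                keys=['all'] + [f'{a}-{b}' for a, b in bins])
This puts 'all', '5-20' and '25-40' as a top level on the columns.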
I am trying to merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes have one row in common (although there could be several overlapping rows).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3 = df1.merge(df2, on='Date', how='outer'), but it gives extra suffixed B columns (B_x and B_y). Could you give me some idea how to get df3?
Thanks a lot.
Merge outer without specifying on (the default on is the intersection of columns between the two DataFrames, in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
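One caveat: Date stays a plain string column here, so chronological order is not guaranteed in general. If that matters, a small sketch to parse the day-first dates and sort:
df3['Date'] = pd.to_datetime(df3['Date'], format='%d.%m.%Y')
df3 = df3.sort_values('Date').reset_index(drop=True)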
Assuming you always want to keep the first full version, you can concatenate df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
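As the repeated index labels above show (1-6 appear twice), concat keeps each frame's original index; if a clean index matters downstream, reset it:
df3 = (pd.concat([df1, df2])
         .drop_duplicates(subset='Date')
         .reset_index(drop=True))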
I have a dataset similar to this
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe by Serial and find the maximum value of the A column in each group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, then keep only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now, for the B_corresponding column, I would like to get the B values corresponding to A_MAX. I thought of locating the A_MAX values in A, but there are duplicate max A values within a group. As an additional condition, in Serial 2 for example, I would prefer the smallest of the B values between the two 32s.
The idea is to use DataFrame.sort_values to put the maximal values first per group, then remove missing values with DataFrame.dropna and keep the first row per Serial with DataFrame.drop_duplicates. Create a Series with DataFrame.set_index, and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but you get mixed values - numeric and strings - so downstream processing may be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same if you are not inclined to use only pandas.
# smallest B seen for each A value, and max A for each Serial
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
rows = []
for serial, a in serial_to_a_mapping.items():
    rows.append((serial, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(rows, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this back to the original dataframe and mask duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['A_max'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['B_corresponding'].duplicated(), '')
dft
I've the following dataframe:
df =
c f V E
0 M 5 32 22
1 M 7 45 40
2 R 7 42 36
3 R 9 41 38
4 R 3 28 24
And I want a result like this, in which the values of column 'f' are my new columns, and my new indexes are a combination of column 'c' and the rest of columns in the dataframe (the order of rows doesn't matter):
df_result =
3 5 7 9
V(M) NaN 32 45 NaN
E(M) NaN 22 40 NaN
V(R) 28 NaN 42 41
E(R) 24 NaN 36 38
Currently, my code is:
df_result = pd.concat(
    [df.pivot('c', 'f', col)
       .rename(index={e: col + '(' + e + ')' for e in df.pivot('c', 'f', col).index})
     for col in [e for e in df.columns if e not in ['c', 'f']]]
)
With that code I'm getting:
df_result =
f 3 5 7 9
c
E(M) NaN 22 40 NaN
E(R) 24 NaN 36 38
V(M) NaN 32 45 NaN
V(R) 28 NaN 42 41
I think it's a valid result; however, I don't know if there is a way to get exactly my desired result or, at least, a better way to arrive at what I am already getting.
Thank you very much in advance.
To get the table, this is .melt + .pivot_table
df_result = df.melt(['f', 'c']).pivot_table(index=['variable', 'c'], columns='f')
Then we can clean up the naming:
df_result = df_result.rename_axis([None, None], axis=1)   # drop the axis names
df_result.columns = [y for _, y in df_result.columns]     # keep only the 'f' level
df_result.index = [f'{x}({y})' for x, y in df_result.index]
# Python 2: ['{0}({1})'.format(*x) for x in df_result.index]
Output:
3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0
You might consider keeping the MultiIndex instead of flattening to new strings, as it can be simpler for certain aggregations.
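For example, with the MultiIndex kept (a small sketch, not part of the original answer), you can slice by variable or by group directly:
mi = df.melt(['f', 'c']).pivot_table(index=['variable', 'c'], columns='f')
print(mi.loc['V'])         # both V rows, indexed by c
print(mi.loc[('E', 'R')])  # the single E(R) row as a Series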
Check with pivot_table:
s=pd.pivot_table(df,index='c',columns='f',values=['V','E']).stack(level=0).sort_index(level=1)
s.index=s.index.map('{0[1]}({0[0]})'.format)
s
Out[95]:
f 3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0