Pandas - select rows with best values - python

I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the row with the highest col1 value from all rows with A, then the one from all rows with B, etc. This is the desired output:
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.

Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
col2 col1
0 A 2
1 B 4

Use groupby('col2'), then idxmax to get the index of the max col1 value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
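For reference, a minimal runnable sketch of this approach; the DataFrame construction here is an assumption based on the example in the question:
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# index label of the row with the largest col1 within each col2 group
idx = df.groupby('col2').col1.idxmax()

# slice the original frame with those labels
print(df.loc[idx])
#    col1 col2  col3
# 0     2    A     1
# 3     4    B     2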

Related

Find duplicates in dataframe by compound criteria?

I have a dataframe which has data like:
col1 col2 col3
1 3 bob
2 1 alice
3 3 bob
4 3 rose
What I want to do is keep the rows that are duplicates of col2, but discard those whose col3 value has already appeared. Put another way: duplicates of col2, but only where col3's values are different. So in the above example, what I would end up with is:
col1 col2 col3
1 3 bob
4 3 rose
Alice wouldn't be in the output because there's no second occurrence of col2's '1' - it isn't a duplicate. The second entry of Bob (3 3 bob) wouldn't be in the output because, while col2's '3' is a duplicate, col3's 'bob' is already in the result set (1 3 bob). (I am aware of the keep= parameter for choosing whether to keep the first or last occurrence, but I'm ignoring it for simplicity.)
Any thoughts? Thank you.
Use a combination of .duplicated(), .drop_duplicates() and the .loc accessor:
df.loc[df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index,:]
col1 col2 col3
0 1 3 bob
3 4 3 rose
How it works
#Filter to the rows whose col2 value is duplicated, using duplicated(keep=False)
df[df['col2'].duplicated(False)]
#Drop duplicates in col3, retaining the first, using .drop_duplicates(keep='first')
df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first')
#Extract the index
df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index
#Finally filter using the loc accessor, where index is the result of the previous step and ":" selects all columns
df.loc[index, :]
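Putting those steps together as a self-contained sketch (the frame is rebuilt here from the example in the question):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [3, 1, 3, 3],
                   'col3': ['bob', 'alice', 'bob', 'rose']})

# rows whose col2 value appears more than once
dups = df[df['col2'].duplicated(keep=False)]

# within those, keep the first occurrence of each col3 value
idx = dups.col3.drop_duplicates(keep='first').index

print(df.loc[idx, :])
#    col1  col2 col3
# 0     1     3  bob
# 3     4     3 rose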
Try:
df.loc[df.drop_duplicates(['col2', 'col3'])
.duplicated(['col2'], keep=False).loc[lambda x: x].index]
Output:
col1 col2 col3
0 1 3 bob
3 4 3 rose
Details:
Inside df.loc, find the indexes by:
- first using drop_duplicates to get rid of duplicate records of col2 and col3
- then using duplicated with keep=False, which returns True for all records with a duplicate 'col2'
- lastly using loc with a lambda to boolean-select only those True indexes
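The same idea as a runnable sketch, step by step (same assumed frame as above):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [3, 1, 3, 3],
                   'col3': ['bob', 'alice', 'bob', 'rose']})

# drop exact (col2, col3) duplicates, then flag the remaining col2 duplicates
mask = df.drop_duplicates(['col2', 'col3']).duplicated(['col2'], keep=False)

# keep only the True labels and use them to slice the original frame
print(df.loc[mask.loc[lambda x: x].index])
#    col1  col2 col3
# 0     1     3  bob
# 3     4     3 rose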

Concatenate column values with the rows above when other columns are empty

I have a data frame like this,
df
col1 col2 col3
1    ab   4
     hn
     pr
2    ff   3
3    ty   3
     rt
4    ym   6
Now I want to create one data frame from the above: if both the col1 and col3 values are empty (''), just append (concatenate) the col2 value to the row above, where both col1 and col3 values are present.
So the final data frame will look like,
df
col1 col2 col3
1 abhnpr 4
2 ff 3
3 tyrt 3
4 ym 6
I could do this using a for loop, comparing each row with the one above, but the execution time would be longer, so I'm looking for a shortcut (a Pythonic way) to do the same task efficiently.
Replace the empty values with missing values and forward fill them, then aggregate col2 with a join via GroupBy.agg, and finally reorder the columns with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
df[c] = df[c].replace('', np.nan).ffill()
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print (df)
col1 col2 col3
0 1 abhnpr 4
1 2 ff 3
2 3 tyrt 3
3 4 ym 6
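A self-contained sketch of the above, with the values typed in as strings (an assumption, since the question only shows blank cells):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1', '', '', '2', '3', '', '4'],
                   'col2': ['ab', 'hn', 'pr', 'ff', 'ty', 'rt', 'ym'],
                   'col3': ['4', '', '', '3', '3', '', '6']})

c = ['col1', 'col3']
# turn '' into NaN so ffill can propagate the values from the row above
df[c] = df[c].replace('', np.nan).ffill()

# concatenate col2 within each (col1, col3) block and restore the column order
out = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print(out)
#   col1    col2 col3
# 0    1  abhnpr    4
# 1    2      ff    3
# 2    3    tyrt    3
# 3    4      ym    6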

Find where column matches more than one in another column

I need the results to have a 1:1 cardinality, so I need to test whether a value in COL1 matches more than one value in COL2.
COL1 COL2
A 1
B 2
B 2
B 3
C 4
D 5
E 5
E 5
Using Python (preferably Pandas unless a better way exists), I want to see all rows where a value in COL1 has more than one match in COL2. In the example above, I want to know that COL1 = B has more than one match in COL2 (i.e. COL1 = B matches/joins with COL2 = 2 and also 3).
If you just want the rows that violate this condition, use groupby and check with nunique:
df[df.groupby('COL1').COL2.transform('nunique') > 1]
Or, with groupby, nunique, and map:
df[df.COL1.map(df.groupby('COL1').COL2.nunique()) > 1]
COL1 COL2
1 B 2
2 B 2
3 B 3
If you want a mapping of COL1 value to COL2 values, you can use an additional groupby and apply:
df[df.groupby('COL1').COL2.transform('nunique') > 1].groupby('COL1').COL2.apply(set)
COL1
B {2, 3}
Name: COL2, dtype: object
And finally, if all you want is the "cardinality" for the COL1 values that have more than one match, use:
df.groupby('COL1').COL2.nunique().to_frame().query('COL2 > 1')
COL2
COL1
B 2
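As a combined runnable sketch of these pieces (the frame construction is assumed from the example above):
import pandas as pd

df = pd.DataFrame({'COL1': list('ABBBCDEE'),
                   'COL2': [1, 2, 2, 3, 4, 5, 5, 5]})

# rows whose COL1 value is paired with more than one distinct COL2 value
violations = df[df.groupby('COL1').COL2.transform('nunique') > 1]
print(violations)
#   COL1  COL2
# 1    B     2
# 2    B     2
# 3    B     3

# the offending COL1 values and the COL2 values they map to
print(violations.groupby('COL1').COL2.apply(set))
# COL1
# B    {2, 3}
# Name: COL2, dtype: object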

Sort a column based on the sorting of a column from another pandas data frame

I have a dataframe like this:
df1:
col1 col2
P 1
Q 3
M 2
I have another dataframe:
df2:
col1 col2
Q 1
M 3
P 9
I want to sort the col1 of df2 based on the order of col1 of df1. So the final dataframe will look like:
df3:
col1 col2
P 9
Q 1
M 3
How can I do this using pandas or any other effective method?
You could set col1 as the index in df2 using set_index and index the dataframe with df1.col1 using .loc:
df2.set_index('col1').loc[df1.col1].reset_index()
col1 col2
0 P 9
1 Q 1
2 M 3
Or, as @jpp suggests, you can also use .reindex instead of .loc:
df2.set_index('col1').reindex(df1.col1).reset_index()
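A minimal runnable sketch using the .reindex variant (the frames are reconstructed from the example in the question):
import pandas as pd

df1 = pd.DataFrame({'col1': ['P', 'Q', 'M'], 'col2': [1, 3, 2]})
df2 = pd.DataFrame({'col1': ['Q', 'M', 'P'], 'col2': [1, 3, 9]})

# reorder df2's rows so that col1 follows the order of df1.col1
df3 = df2.set_index('col1').reindex(df1.col1).reset_index()
print(df3)
#   col1  col2
# 0    P     9
# 1    Q     1
# 2    M     3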

Set value in separate pandas column when mapping dictionary

I have a dictionary:
d = {"A":1, "B":2, "C":3}
I also have a pandas dataframe:
col1
A
G
E
B
C
I'd like to create a new column by mapping the dictionary onto col1. Simultaneously I'd like to set the values in another column to indicate whether the value in that row has been mapped. The desired output would look like this:
col1 col2 col3
A 1 1
G NaN 0
E NaN 0
B 2 1
C 3 1
I know that col2 can be created using df.col1.map(d), but how can I simultaneously create col3?
You can create both columns in one assign call - the first with map and the second with isin to build a boolean mask, cast to integers:
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))
print (df)
col1 col2 col3
0 A 1.0 1
1 G NaN 0
2 E NaN 0
3 B 2.0 1
4 C 3.0 1
Another two-step solution with a different boolean mask, checking for non-missing values:
df['col2'] = df.col1.map(d)
df['col3'] = df['col2'].notnull().astype(int)
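A self-contained sketch of the one-line assign version (the DataFrame construction is assumed from the example in the question):
import pandas as pd

d = {"A": 1, "B": 2, "C": 3}
df = pd.DataFrame({'col1': list('AGEBC')})

# col2: the mapped value (NaN where col1 is not a key); col3: 1 if it was mapped, else 0
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))
print(df)
#   col1  col2  col3
# 0    A   1.0     1
# 1    G   NaN     0
# 2    E   NaN     0
# 3    B   2.0     1
# 4    C   3.0     1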
