I have a dataframe which has data like:
col1 col2 col3
1 3 bob
2 1 alice
3 3 bob
4 3 rose
What I want to do is keep the rows where col2 is duplicated, but discard rows whose col3 value has already appeared among those duplicates. Put another way: duplicates of col2, but only where col3's values are different. So in the above example, what I would end up with is:
col1 col2 col3
1 3 bob
4 3 rose
Alice wouldn't be in the output because there's no second occurrence of col2's '1' - it isn't a duplicate. The second entry of Bob (3 3 bob) wouldn't be in the output because, while col2's '3' is a duplicate, col3's 'bob' is already in the result set (1 3 bob). (I am aware of the keep= parameter for changing the behaviour of keeping first or last, but I'm ignoring it for simplicity.)
Any thoughts? Thank you.
Use a combination of .duplicated(), .drop_duplicates() and the .loc accessor:
df.loc[df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index,:]
col1 col2 col3
0 1 3 bob
3 4 3 rose
How it works
# Filter all rows whose col2 value is duplicated, using duplicated(keep=False)
df[df['col2'].duplicated(False)]
# Drop duplicates in col3, retaining the first, using .drop_duplicates(keep='first')
df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first')
# Extract the index
df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index
# Finally, filter with the loc accessor (all columns)
df.loc[index, :]
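Putting the steps together, here is a minimal self-contained sketch (the example frame from the question is assumed):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [3, 1, 3, 3],
                   'col3': ['bob', 'alice', 'bob', 'rose']})

# rows whose col2 value occurs more than once
dups = df[df['col2'].duplicated(keep=False)]
# within those, keep only the first occurrence of each col3 value
idx = dups['col3'].drop_duplicates(keep='first').index
print(df.loc[idx, :])   # rows (1, 3, bob) and (4, 3, rose)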
Try:
df.loc[df.drop_duplicates(['col2', 'col3'])
.duplicated(['col2'], keep=False).loc[lambda x: x].index]
Output:
col1 col2 col3
0 1 3 bob
3 4 3 rose
Details:
Inside df.loc, find the indexes as follows: first use drop_duplicates to get rid of duplicate records of col2 and col3; then use duplicated with keep=False, which returns True for all records with a duplicate 'col2'; lastly, use loc with a lambda to boolean-select only those True indexes.
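For reference, a minimal runnable sketch of this approach, again assuming the question's example frame:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [3, 1, 3, 3],
                   'col3': ['bob', 'alice', 'bob', 'rose']})

result = df.loc[df.drop_duplicates(['col2', 'col3'])
                  .duplicated(['col2'], keep=False)
                  .loc[lambda x: x].index]
print(result)   # rows (1, 3, bob) and (4, 3, rose)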
I have a dataframe df which looks like this:
col1 col2 col3
A 45 4
A 3 5
B 2 5
I want to make a separate dataframe, df2, which only has the rows where col1 in df equals A. Hence it should look like:
col1 col2 col3
A 45 4
A 3 5
I just use df2=df1.loc[df1['col1']=='A']. However, this returns the error: ValueError: Cannot index with multidimensional key. Any idea what is going wrong here?
What you tried works for me; alternatively, you can try this:
df2 = df[df.col1 == 'A']
Output
col1 col2 col3
0 A 45 4
1 A 3 5
Edit
Tested on pandas version
pd.__version__
'1.2.4'
Just try:
df.query("col1=='A'")
I have a data frame like this,
df
col1  col2  col3
1     ab    4
      hn
      pr
2     ff    3
3     ty    3
      rt
4     ym    6
Now I want to create one data frame from the above: wherever both the col1 and col3 values are empty (''), the col2 value should be appended (concatenated) to the row above, where both col1 and col3 values are present.
So the final data frame will look like,
df
col1 col2 col3
1 abhnpr 4
2 ff 3
3 tyrt 3
4 ym 6
I could do this with a for loop, comparing each row with the previous one, but the execution time would be longer, so I'm looking for a shortcut (a pythonic way) to do the same task efficiently.
Replace the empty values with missing values and forward fill them, then aggregate with a join via GroupBy.agg, and last reorder the columns with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
# treat empty strings as missing values and forward fill them
df[c] = df[c].replace('', np.nan).ffill()
# join col2 strings within each (col1, col3) group, then restore the original column order
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print(df)
col1 col2 col3
0 1 abhnpr 4
1 2 ff 3
2 3 tyrt 3
3 4 ym 6
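A self-contained sketch of the same approach, assuming the blank cells in col1 and col3 really are empty strings (''):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1', '', '', '2', '3', '', '4'],
                   'col2': ['ab', 'hn', 'pr', 'ff', 'ty', 'rt', 'ym'],
                   'col3': ['4', '', '', '3', '3', '', '6']})

c = ['col1', 'col3']
df[c] = df[c].replace('', np.nan).ffill()   # empty -> NaN, then forward fill
out = (df.groupby(c, sort=False)['col2']
         .agg(''.join)                      # concatenate col2 within each group
         .reset_index()
         .reindex(df.columns, axis=1))      # restore col1, col2, col3 order
print(out)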
I have a dataframe
Col1 Col2 Col3 Col4 Col5
A123 13500 2/03/19 0 NaN
B123 2000 3/04/19 0 Distinct
C123 500 8/09/19 1 Match
D123 100 11/01/19 1 NaN
E123 1350 2/03/19 2 NaN
F123 2000 3/04/19 2 Match
G123 500 8/09/19 3 Distinct
H123 100 11/01/19 3 NaN
I want to loop through the rows based on Col4 and fill the NaN values in Col5 accordingly.
That is, when I pick the rows where Col4 is 0, I want to fill the missing Col5 value from the other row that has the same Col4 value.
Output:
Col1 Col2 Col3 Col4 Col5
A123 13500 2/03/19 0 **Distinct**
B123 2000 3/04/19 0 Distinct
C123 500 8/09/19 1 Match
D123 100 11/01/19 1 **Match**
E123 1350 2/03/19 2 **Match**
F123 2000 3/04/19 2 Match
G123 500 8/09/19 3 Distinct
H123 100 11/01/19 3 **Distinct**
I think what you are looking for is the function np.where. I am assuming that you want to assign the value 'Distinct' to Col5 when Col4 = 0 and 'Match' when Col4 = 1. Then your code would be:
df['Col5'] = np.where(df.Col4==0, 'Distinct', 'Match')
Of course, you can adapt the code for whatever conditional logic you need.
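A minimal sketch of that idea, using only Col4 and this answer's assumed 0 → 'Distinct' / 1 → 'Match' mapping:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col4': [0, 1, 0, 1]})
df['Col5'] = np.where(df['Col4'] == 0, 'Distinct', 'Match')
print(df)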
From your logic, it appears that you wish to map values of 0,3 in Col4 to "Distinct" in Col5, and values of 1,2 to "Match". You only want to update NaN values in Col5.
Try:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col4': [0,1,2,3,0,1,2,3],
                   'Col5': ["Distinct", "Match", "Match", "Distinct", np.nan, np.nan, np.nan, np.nan]})
mapper = {
0: "**Distinct**",
1: "**Match**",
2: "**Match**",
3: "**Distinct**"
}
# fill only the missing Col5 values, using Col4 as the lookup key into mapper
df.loc[df.Col5.isna(), 'Col5'] = df[df.Col5.isna()]['Col4'].map(mapper)
You now get:
Col4 Col5
0 0 Distinct
1 1 Match
2 2 Match
3 3 Distinct
4 0 **Distinct**
5 1 **Match**
6 2 **Match**
7 3 **Distinct**
This makes it easy to later change your mapping if you change your mind about the logic or replacement values.
Okay, I am assuming two things here:
1) You only have two entries for each number in Col4.
2) Both entries with the same number in Col4 are placed adjacent to each other. (That does not actually matter; if it is not the case, you can always sort the dataframe by Col4 to get there.)
The code is as follows:
df = df.replace(np.nan,"None")
txt = "None"
for i in range(df.Col4.size):
if (df.loc[i,'Col5']=="None"):
df.loc[i,'Col5'] = txt
txt = "None"
else:
txt = df.loc[i,'Col5']
txt = "None"
for i in reversed(range(df.Col4.size)):
if (df.loc[i,'Col5']=="None"):
df.loc[i,'Col5'] = txt
txt = "None"
else:
txt = df.loc[i,'Col5']
I am doing 3 steps here:
1) Replace all NaN with a string, so that there are no datatype comparison issues in the if.
2) Loop in ascending order: if the value in Col5 is 'None', it is replaced by the value in txt; otherwise, txt stores the value from Col5.
3) Run the same loop in reverse order.
I hope this solves your problem.
I have a pandas data frame with 4 columns, say 'col1', 'col2', 'col3' and 'col4'. Now I want to group by col1 and col2 and take an aggregate like the one below:
Count(col3)/(Count(unique col4)) As result_col
How do I do this? I am using MySQL with pandas.
I have tried many things from the internet but haven't found an exact solution, which is why I am posting here. Please give a reason for any downvote so I can improve my question.
It seems you need to aggregate by size and nunique and then divide the output columns:
import pandas as pd

df = pd.DataFrame({'col1':[1,1,1],
                   'col2':[4,4,6],
                   'col3':[7,7,9],
                   'col4':[3,3,5]})
print (df)
col1 col2 col3 col4
0 1 4 7 3
1 1 4 7 3
2 1 6 9 5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
           col4  col3  result_col
col1 col2
1    4        1     2         2.0
     6        1     1         1.0
I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the row with the highest col1 value among the rows with A, then the one among the rows with B, etc. This is the desired output:
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
col2 col1
0 A 2
1 B 4
Use groupby('col2'), then idxmax to get the index of the max value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
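A compact, self-contained sketch of the idxmax approach on the question's data:

import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

idx = df.groupby('col2')['col1'].idxmax()   # index label of the max col1 per group
print(df.loc[idx])                          # rows (2, A, 1) and (4, B, 2)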