Determining if a Pandas dataframe row has multiple specific values - python

I have a Pandas data frame like the one below:
A B C D
| 1 1 1 3 |
| 1 1 1 2 |
| 2 3 4 5 |
I need to iterate through this data frame, looking for rows where the values in columns A, B, and C match; when they do, compare the values in column D for those rows and delete the row with the smaller value. So the above example would look like this afterwards:
A B C D
| 1 1 1 3 |
| 2 3 4 5 |
I've written the following code, but something isn't right and it's causing an error. It also looks more complicated than it may need to be, so I am wondering if there is a better, more concise way to write this.
for col, row in df.iterrows():
    df1 = df.copy()
    df1.drop(col, inplace=True)
    for col1, row1 in df1.iterrows():
        if df[0].iloc[col] == df1[0].iloc[col1] & df[1].iloc[col] == df1[1].iloc[col1] & \
           df[2].iloc[col] == df1[2].iloc[col1] & df1[3].iloc[col1] > df[3].iloc[col]:
            df.drop(col, inplace=True)
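(As a hedged aside on why this snippet errors: & binds more tightly than ==, so each comparison needs its own parentheses or plain and, and df[0] selects a column labeled 0, which this frame does not have, since its columns are 'A' through 'D'.)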

Here is one solution:
df[~(df[['A', 'B', 'C']].duplicated(keep=False) &
     (df.groupby(['A', 'B', 'C'])['D'].transform(min) == df['D']))]
Explanation:
df[['A', 'B', 'C']].duplicated(keep=False)
returns a mask for rows with duplicated values of ['A', 'B', 'C'] columns
df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']
returns a mask for rows that have the minimum value for ['D'] column, for each group of ['A', 'B', 'C']
The combination of these masks selects exactly those rows: duplicated ['A', 'B', 'C'] values and the minimum 'D' within their group. With ~ we select all rows except those.
Result for the provided input:
A B C D
0 1 1 1 3
2 2 3 4 5
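For reference, a minimal self-contained sketch of this solution (the frame construction simply mirrors the question's input):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 3],
                   'C': [1, 1, 4], 'D': [3, 2, 5]})

# drop rows that duplicate (A, B, C) and hold their group's minimum D
dup = df[['A', 'B', 'C']].duplicated(keep=False)
is_min = df.groupby(['A', 'B', 'C'])['D'].transform('min') == df['D']
print(df[~(dup & is_min)])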

You can group by all the variables that have to be equal (using groupby(['A', 'B', 'C'])) and then exclude the row with the minimum value of D (using func below) whenever there are multiple unique records, to get the boolean indices for the rows that have to be retained:
def func(x):
    if len(x.unique()) != 1:
        return x != x.min()
    else:
        return x == x

df[df.groupby(['A', 'B', 'C'])['D'].apply(func)]
A B C D
0 1 1 1 3
2 2 3 4 5
If only the row holding the maximum group value in D has to be retained, then you can use the below:
df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: x == x.max())]
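Note that in recent pandas versions, a groupby.apply that returns a Series can come back with the group keys prepended to the index, which breaks boolean indexing on df. A transform-based sketch of the same "keep the group maximum" idea sidesteps that:

# keep rows whose D equals the maximum D of their (A, B, C) group;
# transform broadcasts each group's maximum back onto the original index
df[df['D'] == df.groupby(['A', 'B', 'C'])['D'].transform('max')]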

Related

Filter DataFrame for most matches

I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:
Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d
4      c    b    c      d
5      a    b    c      g
6      a    b    c
7      a    s    c      f
8      a    f    c
9      a    b
10     a    b    t      d
11     a    b    g
...    ...  ...  ...    ...
100    a    b    c      d
My goal would be to filter for the rows with the most matches with the list, position by position (e.g. position 1 in the list has to match column 1, position 2 column 2, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; but if row 100 were included, row 100 and all the other rows matching all elements would be selected.
Also, the list might change in length, e.g. list_to_match = ['a','b'].
Thanks for your help!
I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]
print(out)
Output (ignoring the input row 100):
One Two Three Four
Index
5 a b c g
6 a b c None
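The cummin trick is what enforces matching in order: once one position mismatches, every later position is zeroed out, so only the leading run of matches is counted. A tiny sketch (the row values here are made up):

import pandas as pd

list_to_match = ['a', 'b', 'c', 'd']
row = pd.Series(['a', 'x', 'c', 'd'])  # 'x' breaks the match at position 2

m = row.eq(list_to_match)  # True, False, True, True
print(m.cummin())          # True, False, False, False -> sum counts only 1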
Here is my approach. Descriptions are commented below.
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
data = {'One': ['a', 'a', 'a', 'a'],
        'Two': ['b', 'b', 'b', 'b'],
        'Three': ['c', 'c', 'y', 'c'],
        'Four': ['g', 'g', 'z', 'd']}
dataframe_ = pd.DataFrame(data)
#encoding Letters into numerical values so we can compute the cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32)-64
#Our input data which we are going to compare with other rows
input_data = np.array(['a', 'b', 'c', 'd'])
#encode input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32)-64
#compute cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)
output:
0 0.999343
1 0.999343
2 0.973916
3 1.000000
Filtering rows based on their cosine similarities:
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0 0.999343
1 0.999343
2 NaN
3 1.000000
From here on you can easily find the rows with non-NaN values by their indexes.
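As a side note on the encoding step: a '<U1' NumPy array stores each character as a 4-byte code point, so viewing it as uint32 recovers the ordinal values (the -64 shift just keeps the numbers small):

import numpy as np

letters = np.array(['a', 'b', 'z'])   # dtype '<U1'
print(letters.view(np.uint32))        # [ 97  98 122]
print(letters.view(np.uint32) - 64)   # [33 34 58]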

Insert 2 Blank Rows In DF by Group

I basically want the solution from this question to be applied to 2 blank rows.
Insert Blank Row In Python Data frame when value in column changes?
I've messed around with the solution but don't understand the code enough to alter it correctly.
You can do:
num_empty_rows = 2
df = (df.groupby('Col1', as_index=False)
        .apply(lambda g: g.append(
            pd.DataFrame(data=[[''] * len(df.columns)] * num_empty_rows,
                         columns=df.columns)))
        .reset_index(drop=True)
        .iloc[:-num_empty_rows])
As you can see, each group of df is appended with a dataframe of num_empty_rows blank rows, and reset_index is performed at the end. The final iloc[:-num_empty_rows] is optional, i.e. it removes the blank rows trailing the last group.
Example input:
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'C'],
                   'Col2': ['s', 's', 'b', 'b', 'l'],
                   'Col3': ['b', 'j', 'd', 'a', 'k'],
                   'Col4': ['d', 'k', 'q', 'd', 'p']})
Output:
Col1 Col2 Col3 Col4
0 A s b d
1 A s j k
2 A b d q
3
4
5 B b a d
6
7
8 C l k p
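Note that DataFrame.append was removed in pandas 2.0; a concat-based sketch of the same idea (the pad helper name is mine):

import pandas as pd

num_empty_rows = 2

def pad(g):
    # add num_empty_rows blank rows after each group
    blank = pd.DataFrame([[''] * len(g.columns)] * num_empty_rows,
                         columns=g.columns)
    return pd.concat([g, blank])

df = (df.groupby('Col1', group_keys=False).apply(pad)
        .reset_index(drop=True)
        .iloc[:-num_empty_rows])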

Set values in a column based on the values of other columns as a group

I have a df that looks something like this:
   name  A  B  C  D
1  bar   1  0  1  1
2  foo   0  0  0  1
3  cat   1  0 -1  0
4  pet   0  0  0  1
5  ser   0  0 -1  0
6  chet  0  0  0  1
I need to use the loc method to add values in a new column ('E') based on the values of the other columns taken as a group; for instance, if the values are [1, 0, 0, 0], the value in column E will be 1. I've tried this:
d = {'A': 1, 'B': 0, 'C': 0, 'D': 0}
A = pd.Series(data=d, index=['A', 'B', 'C', 'D'])
df.loc[df.iloc[:, 1:] == A, 'E'] = 1
It didn't work. I need to use the loc method or another numpy-based method since the dataset is huge. If it is possible to avoid creating a Series to compare against, that would also be great: somehow extracting the values of columns A, B, C, D and comparing them as a group for each row.
You can compare values with A and test whether all values per row match with DataFrame.all:
df.loc[(df == A).all(axis=1), 'E'] = 1
For a 0/1 column:
df['E'] = (df == A).all(axis=1).astype(int)
df['E'] = np.where((df == A).all(axis=1), 1, 0)
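One caveat worth hedging: df here also has a name column that is not in A's index, so it is safer to restrict the comparison to the value columns, e.g.:

# compare only the A-D value columns, so the 'name' column cannot
# interfere with the row-wise .all()
cols = ['A', 'B', 'C', 'D']
df['E'] = (df[cols] == A).all(axis=1).astype(int)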

Is there any way to add a column with a specific condition?

There is a data frame.
I would like to add a column 'e' after checking the conditions below:
if the component of 'c' is in column 'a' AND the component of 'd' is in column 'b' in the same row, then the component of 'e' is 'OK',
else ''.
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}
You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
   a  b  c  d   e
0  0  4  1  1  OK
1  2  5  2  4
2  1  1  3  2
3  4  7  6  9
P.S. Before the merge, we need to convert a and b columns to str type (or c and d to numeric), so that we can compare c and a, and d and b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
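Worth knowing: a column-on-column merge returns a result with a fresh RangeIndex, so matching on the merged frame's index can be fragile. An index-free sketch of the same idea (run after the astype(str) conversion above):

# mark rows whose (c, d) pair occurs as some row's (a, b) pair
pairs = set(zip(df['a'], df['b']))
df['e'] = np.where([cd in pairs for cd in zip(df['c'], df['d'])], 'OK', '')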

how to groupby and join multiple rows from multiple columns at a time?

I want to know how to group by a single column and join the strings of multiple columns within each group.
Here's an example dataframe:
df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])
print(df)
print(df)
a b c
0 a 1 k
1 a 1 l
2 b 2 m
3 b 2 n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
b a c
0 1 a k,l
1 2 b m,n
But that is not my required output.
Desired output:
a b c
0 1 a,a k,l
1 2 b,b m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by the b column only and then, if necessary, create the list of columns to aggregate with GroupBy.agg:
df1 = df.groupby('b')[['a', 'c']].agg(','.join).reset_index()
# alternative if you want to join all columns except b
# df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
b a c
0 1 a,a k,l
1 2 b,b m,n
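If the columns need different treatment, agg also accepts a dict (a sketch; the second separator is arbitrary):

# join column a with ',' and column c with '|', still grouping by b
df1 = df.groupby('b').agg({'a': ','.join, 'c': '|'.join}).reset_index()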
