Filter DataFrame for most matches - python

I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:
Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d
4      c    b    c      d
5      a    b    c      g
6      a    b    c
7      a    s    c      f
8      a    f    c
9      a    b
10     a    b    t      d
11     a    b    g
...    ...  ...  ...    ...
100    a    b    c      d
My goal is to filter for the rows with the most matches with the list in the corresponding positions (e.g. position 1 in the list has to match column 1, position 2 column 2, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; if row 100 were included, then row 100 and all the other rows matching every element would be the ones selected.
Also, the list might change in length, e.g. list_to_match = ['a','b'].
Thanks for your help!

I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]
print(out)
Output (ignoring the input row 100):
One Two Three Four
Index
5 a b c g
6 a b c None
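To illustrate why the cummin step matters, here is a small sketch (same column names and list as above, with two rows borrowed from the example) showing that only the leading run of matches is counted:
import pandas as pd
# two sample rows from the example above (labels 4 and 10)
sample = pd.DataFrame({'One': ['c', 'a'], 'Two': ['b', 'b'],
                       'Three': ['c', 't'], 'Four': ['d', 'd']},
                      index=[4, 10])
list_to_match = ['a', 'b', 'c', 'd']
mask = sample.eq(list_to_match)            # elementwise, position-by-position equality
print(mask.cummin(axis=1).sum(axis=1))
# 4     0   -> 'One' already differs, so the later matches don't count
# 10    2   -> 'a' and 'b' match, then the run stops at 'Three'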

Here is my approach. Descriptions are commented below.
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
data = {'One': ['a', 'a', 'a', 'a'],
        'Two': ['b', 'b', 'b', 'b'],
        'Three': ['c', 'c', 'y', 'c'],
        'Four': ['g', 'g', 'z', 'd']}
dataframe_ = pd.DataFrame(data)
# encode the letters into numerical values so we can compute cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32) - 64
# our input data, which we are going to compare with the other rows
input_data = np.array(['a', 'b', 'c', 'd'])
# encode the input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32) - 64
# compute the cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)
output:
0 0.999343
1 0.999343
2 0.973916
3 1.000000
Filtering rows based on their cosine similarities:
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0 0.999343
1 0.999343
2 NaN
3 1.000000
From here on you can easily find the rows with non-NaN values by their indexes.
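For example (a small sketch, continuing from the variables above), the surviving indexes can be pulled out by dropping the NaNs:
# rows whose similarity passed the threshold keep their value, the rest are NaN,
# so dropping NaN leaves exactly the indexes we want
kept_index = df_filtered.dropna().index
print(kept_index.tolist())   # [0, 1, 3]
# dataframe_.loc[kept_index] then selects those rows (note that dataframe_
# was encoded to integers in place above)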

Related

How to compare a row of a dataframe against the whole table considering exact matches and column order importance

I have a pandas dataframe as follows:
col_1  col_2  col_3  col_4  col_5  col_6
a      b      c      d      e      f
a      b      c      h      j      f
a      b      c      k      e      l
x      b      c      d      e      f
And I want to get a score for all the rows to see which of them are most similar to the first one, considering:
First, the number of columns with the same values between the first row and the row we are considering.
If two rows have the same number of matching values compared to the first row, consider column importance, that is, go from left to right in column order and give more importance to rows whose matching columns are the leftmost ones.
In the example above, the scores should come out in the following order:
4th row (the last one), as it has 4 column values in common with row 1
3rd row, as it has 3 elements in common with row 1 and the non-matching columns are columns 4 and 6, while in row 2 these non-matching columns are 4 and 5
2nd row, as it has the same number of matches as row 3, but column 5 matches in row 3 and not in row 2
I want to solve this using a lambda function that assigns the score to each row of the dataframe, given row one as a constant.
You could use np.lexsort for this. It allows a nested sort: primarily by the count of columns that match row 0, with ties broken by a weighted sum over the matching column positions, where leftmost matches are worth more.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col_1': ['a', 'a', 'a', 'x'],
                   'col_2': ['b', 'b', 'b', 'b'],
                   'col_3': ['c', 'c', 'c', 'c'],
                   'col_4': ['d', 'h', 'k', 'd'],
                   'col_5': ['e', 'j', 'e', 'e'],
                   'col_6': ['f', 'f', 'l', 'f']})
df.loc[np.lexsort(((df.eq(df.iloc[0]) * df.columns.get_indexer(df.columns)[::-1]).sum(1).values,
                   df.eq(df.iloc[0]).sum(1).values))[::-1], 'rank'] = range(len(df))
print(df)
Output
col_1 col_2 col_3 col_4 col_5 col_6 rank
0 a b c d e f 0.0
1 a b c h j f 3.0
2 a b c k e l 2.0
3 x b c d e f 1.0
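If the one-liner is hard to read, the two sort keys can be computed separately; here is a rough unpacking (the intermediate names are only for illustration), restricted to the six original columns so it also works after the 'rank' column has been added:
cols = [f'col_{i}' for i in range(1, 7)]
matches = df[cols].eq(df[cols].iloc[0])     # boolean match matrix against row 0
count_key = matches.sum(1).values           # primary key: number of matching columns
# tie-breaker: leftmost columns carry the largest weights (5, 4, ..., 0)
weight_key = (matches * np.arange(len(cols) - 1, -1, -1)).sum(1).values
order = np.lexsort((weight_key, count_key))[::-1]   # row positions, best match first
print(order)   # [0 3 2 1]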
Set up the problem:
data = [list(s) for s in ["aaax", "bbbb", "cccc", "dhkd", "ejee", "fflf"]]
df = pd.DataFrame(data).T.set_axis([f"col_{n}" for n in range(1, 7)], axis=1)
Solution:
ranking = (df == df.loc[0]).sum(1).iloc[1:]
ties = ranking.index[ranking.duplicated(keep=False)]
ranking[~(ranking.duplicated('first') | ranking.duplicated('last'))] *= 10
ranking.update((df.loc[ties] == df.loc[0]).mul(np.arange(6, 0, -1)).sum(1))
ranking.argsort()[::-1]
Explanation:
We first calculate each row's similarity to the first row and rank them. Then we split ties and non-ties. The non-ties are multiplied by 10. The ties are recalculated but this time we weight them by a descending scale of weights, to give more weight to a column the further left it is. Then we sum the weights to get the score for each row and update our original ranking. We return the reverse argsort to show the desired order.
You don't need to use a lambda here; it would be slower.
Result:
3 2
2 1
1 0
The left column is the row index, in ranked order (best match first).

How to spread a list and put it into a new column based on the number of rows

Let's say I have a list:
cat_list = ['a', 'b', 'ab']
and I want this list to repeatedly fill a new column called category for as many rows as my DataFrame has.
I want the first row to have 'a', the second row 'b', the third row 'ab', and the cycle to repeat until the last row, like the example below:
type value category
a 25 a
a 25 b
a 25 ab
b 50 a
b 50 b
b 50 ab
What I have tried so far is:
cat_list = ['a', 'b', 'ab']
df['category'] = cat_list * len(df)
but I got this kind of error
Length of values does not match length of index
how should I fix my script in order to get the desired results?
thanks.
Use numpy.tile, with integer division to work out the number of repeats:
df = pd.DataFrame({'a':range(8)})
cat_list = ['a', 'b', 'ab']
df['category'] = np.tile(cat_list, (len(df.index) // len(cat_list)) + 1)[:len(df.index)]
print (df)
a category
0 0 a
1 1 b
2 2 ab
3 3 a
4 4 b
5 5 ab
6 6 a
7 7 b
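To see what the trailing slice is trimming, here is the intermediate tile for the 8-row frame above (a quick check, nothing more):
# (8 // 3) + 1 = 3 repeats of the 3-item list gives 9 values ...
print(np.tile(cat_list, (len(df.index) // len(cat_list)) + 1))
# ['a' 'b' 'ab' 'a' 'b' 'ab' 'a' 'b' 'ab']
# ... and [:len(df.index)] trims it back down to 8
print(np.tile(cat_list, (len(df.index) // len(cat_list)) + 1)[:len(df.index)])
# ['a' 'b' 'ab' 'a' 'b' 'ab' 'a' 'b']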

Is there any way to add a column with a specific condition?

There is a data frame.
I would like to add column 'e' after checking the conditions below:
if the value of 'c' appears in column 'a' AND the value of 'd' appears in column 'b' in that same row, then the value of 'e' is 'OK'
else ""
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}
You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
a b c d e
0 0 4 1 1 OK
1 2 5 2 4
2 1 1 3 2
3 4 7 6 9
P.S. Before the merge, we need to convert a and b columns to str type (or c and d to numeric), so that we can compare c and a, and d and b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
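Note that merge resets the index of its result, so checking the merged index directly can be fragile when several pairs match. A more explicit variant (just a sketch, same frame) carries the original row labels through as a regular column and checks those instead:
# reset_index() turns the row labels into an 'index' column that survives the merge;
# left_on is the pair we are testing, right_on is the pair it has to appear as
merged = df.reset_index().merge(df, left_on=['c', 'd'], right_on=['a', 'b'])
df['e'] = np.where(df.index.isin(merged['index']), 'OK', '')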

Determining if a Pandas dataframe row has multiple specific values

I have a Pandas data frame represented by the one below:
A B C D
| 1 1 1 3 |
| 1 1 1 2 |
| 2 3 4 5 |
I need to iterate through this data frame, looking for rows where the values in columns A, B, and C match; when they do, I need to check the values in column D for those rows and delete the row with the smaller value. So the example above would look like this afterwards.
A B C D
| 1 1 1 3 |
| 2 3 4 5 |
I've written the following code, but something isn't right and it's causing an error. It also looks more complicated than it may need to be, so I am wondering if there is a better, more concise way to write this.
for col, row in df.iterrows():
    df1 = df.copy()
    df1.drop(col, inplace=True)
    for col1, row1 in df1.iterrows():
        if df[0].iloc[col] == df1[0].iloc[col1] & df[1].iloc[col] == df1[1].iloc[col1] & \
           df[2].iloc[col] == df1[2].iloc[col1] & df1[3].iloc[col1] > df[3].iloc[col]:
            df.drop(col, inplace=True)
Here is one solution:
df[~((df[['A', 'B', 'C']].duplicated(keep=False)) & (df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']))]
Explanation:
df[['A', 'B', 'C']].duplicated(keep=False)
returns a mask for rows with duplicated values of ['A', 'B', 'C'] columns
df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']
returns a mask for rows that have the minimum value for ['D'] column, for each group of ['A', 'B', 'C']
The combination of these masks selects all such rows (duplicated ['A', 'B', 'C'] and minimum 'D' for the group). With ~ we select all rows except those.
Result for the provided input:
A B C D
0 1 1 1 3
2 2 3 4 5
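The two masks can also be inspected on their own (rebuilding the question's frame here for illustration):
import pandas as pd
# the question's frame, rebuilt for illustration
df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 3], 'C': [1, 1, 4], 'D': [3, 2, 5]})
dup_mask = df[['A', 'B', 'C']].duplicated(keep=False)             # [True, True, False]
min_mask = df.groupby(['A', 'B', 'C'])['D'].transform(min) == df['D']
# min_mask is [False, True, True]: row 1 holds the group minimum of D
print(df[~(dup_mask & min_mask)])                                 # drops only row 1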
You can group by all the columns that have to be equal (using groupby(['A', 'B', 'C'])) and then, when a group has more than one unique value, exclude the row with the minimum value of D (using func below) to get the boolean indices for the rows that have to be retained:
def func(x):
    if len(x.unique()) != 1:
        return x != x.min()
    else:
        return x == x
df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: func(x))]
A B C D
0 1 1 1 3
2 2 3 4 5
If only the row with the maximum group value of D has to be retained, you can use the following:
df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: x == x.max())]

Python - Group multiple values from a column to create "Other" values

I have this dataset:
Field
A
A
A
B
C
C
C
D
C
C
C
A
This has been read into pandas through the following code:
data = read_csv('data.csv', header=None)
print(data.describe())
How can I transform the column to get the below result?
Field
A
A
A
Others
C
C
C
Others
C
C
C
A
I want to transform values B and D, since they have low frequency, to an aggregate value "Others".
Here is one way:
import pandas as pd
df = pd.DataFrame({'Field': ['A', 'A', 'A', 'B', 'C', 'C', 'C',
                             'D', 'C', 'C', 'C', 'C', 'A']})
n = 2
counts = df['Field'].value_counts()
others = set(counts[counts < n].index)
df['Field'] = df['Field'].replace(list(others), 'Others')
Result
Field
0 A
1 A
2 A
3 Others
4 C
5 C
6 C
7 Others
8 C
9 C
10 C
11 C
12 A
Explanation
First get the counts of each value in Field via value_counts.
Filter for values which occur less than n times. n is user-configurable.
Finally replace those values with 'Others'.
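As a quick check of the intermediate steps for the frame above with n = 2:
print(counts)    # C appears 7 times, A 4 times, B and D once each
print(others)    # {'B', 'D'} -- fewer than n occurrences, so they get replaced by 'Others'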
