Search common items in 2 lists in pandas dataframe [duplicate] - python

This question already has answers here:
Find intersection of two nested lists?
(21 answers)
Closed 2 years ago.
I have 2 columns (input & target) in a pandas DataFrame, each containing a list. The goal is to count how many items the two lists have in common and save the result in a new column "common".
For the 1st row, only 'ELITE' is common. For the 2nd row, both 'ACURAWATCH' and 'PLUS' appear in both lists.
Input:
frame = pd.DataFrame({'input' : [['MINIVAN','TOURING','ELITE'], ['4D','SUV','ACURAWATCH','PLUS']], 'target' : [['MINI','TOUR','ELITE'], ['ACURAWATCH','PLUS']]})
Expect Output:
frame = pd.DataFrame({'input' : [['MINIVAN','TOURING','ELITE'], ['4D','SUV','ACURAWATCH','PLUS']], 'target' : [['MINI','TOUR','ELITE'], ['ACURAWATCH','PLUS']], 'common' :[1, 2]})

You can use set.intersection with df.apply:
In [4307]: frame["common"] = frame.apply(
lambda x: len(set(x["input"]).intersection(set(x["target"]))), 1)
In [4308]: frame
Out[4308]:
input target common
0 [MINIVAN, TOURING, ELITE] [MINI, TOUR, ELITE] 1
1 [4D, SUV, ACURAWATCH, PLUS] [ACURAWATCH, PLUS] 2

You can apply a custom function with np.intersect1d (this requires import numpy as np):
frame['common'] = frame.apply(lambda x: len(np.intersect1d(x['input'], x['target'])), axis=1)

You could use a list comprehension (note this assumes input and target are the frame's only columns):
frame['common'] = [len(set(x) & set(y)) for x, y in frame.to_numpy()]
print(frame)
Output
input target common
0 [MINIVAN, TOURING, ELITE] [MINI, TOUR, ELITE] 1
1 [4D, SUV, ACURAWATCH, PLUS] [ACURAWATCH, PLUS] 2

How to select rows from a pandas dataframe by a feature's data type when a feature contains more than one type of value [duplicate]

This question already has answers here:
Select row from a DataFrame based on the type of the object(i.e. str)
(3 answers)
Closed 3 months ago.
I have a dataframe with 3 features: id, name and point. I need to select the rows where the 'point' value is a string.
id name point
0  x    5
1  y    6
2  z    ten
3  t    nine
4  q    two
How can I split the dataframe just by looking at the type of one feature's values?
I tried to adapt the select_dtypes method but got lost. I also tried to divide the dataset using
df[df[point].dtype == str] or df[df[point].dtype is str]
but neither worked.
Technically, the answer would be:
out = df[df['point'].apply(lambda x: isinstance(x, str))]
But this would also select rows containing a string representation of a number ('5').
If you want to select "strings" as opposed to "numbers", whether those numbers are real numbers or string representations, you could use:
m = pd.to_numeric(df['point'], errors='coerce')
out = df[m.isna() & df['point'].notna()]
The question is now, what if you have '1A' or 'AB123' as value?
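A sketch of that edge case with hypothetical data: pd.to_numeric coerces '1A' (and any other non-numeric string) to NaN, so such values are still selected as strings:

```python
import pandas as pd

# Hypothetical column mixing real numbers, numeric strings, and non-numeric strings
df = pd.DataFrame({'point': [5, '5', 'ten', '1A', None]})

# Values that fail numeric coercion become NaN in m
m = pd.to_numeric(df['point'], errors='coerce')

# "True" strings: coercion failed, but the original value is not missing
out = df[m.isna() & df['point'].notna()]
print(out['point'].tolist())  # ['ten', '1A']
```

With this rule, '5' counts as a number and '1A' counts as a string; whether that is the right behavior depends on your data.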

pandas: getting the name of the column corresponding to the highest value in the row [duplicate]

This question already has answers here:
Find the column name which has the maximum value for each row
(5 answers)
Closed 1 year ago.
I have a pandas DF built from the matrix below.
What I want to get is a one-column df that contains, for each row, the name of the column holding the row's maximum value, or that column's position.
Any suggestion?
Here is the data for copy and paste:
matrix = np.array([[0.92234683, 0.94209485, 0.90884652, 0.99763808],
[0.86166401, 0.96755855, 0.9243107 , 0.94240756],
[0.85457367, 0.9169915 , 0.95042024, 0.90661279],
[0.83972504, 0.93902909, 0.91985442, 0.93765059],
[0.84373323, 0.87762977, 0.91005636, 0.88525626]])
thanks
Use idxmax:
df = pd.DataFrame(matrix, columns=['Y_clm1', 'Y_clm2', 'Y_clm3', 'Y_clm4'])
>>> df.idxmax(axis=1)
0 Y_clm4
1 Y_clm2
2 Y_clm3
3 Y_clm2
4 Y_clm3
dtype: object
use max
df = pd.DataFrame(matrix)
max(df) == 3
Note that max(df) returns the largest column label (3, which corresponds to Y_clm4), not a per-row result, so this does not answer the question row by row.
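If you want the 0-based column positions of each row's maximum rather than column names, a short sketch with np.argmax:

```python
import numpy as np

matrix = np.array([[0.92234683, 0.94209485, 0.90884652, 0.99763808],
                   [0.86166401, 0.96755855, 0.9243107 , 0.94240756],
                   [0.85457367, 0.9169915 , 0.95042024, 0.90661279],
                   [0.83972504, 0.93902909, 0.91985442, 0.93765059],
                   [0.84373323, 0.87762977, 0.91005636, 0.88525626]])

# 0-based position of each row's maximum value
positions = np.argmax(matrix, axis=1)
print(positions)  # [3 1 2 1 2]
```

These positions line up with the idxmax output above (position 3 is Y_clm4, 1 is Y_clm2, and so on).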

counting unique elements in lists

I have a dataframe containing one column of lists.
names unique_values
0 [B-PER, I-PER, I-PER, B-PER] 2
1 [I-PER, N-PER, B-PER, I-PER, A-PER] 4
2 [B-PER, A-PER, I-PER] 3
3 [B-PER, A-PER, A-PER, A-PER] 2
I have to count the distinct values in each list; if a value appears more than once, it should count only once. How can I achieve this?
Thanks
Combine explode with nunique
df["unique_values"] = df.names.explode().groupby(level = 0).nunique()
You can use the built-in set data type to do this -
df['unique_values'] = df['names'].apply(lambda a : len(set(a)))
This works as sets do not allow any duplicate elements in their construction so when you convert a list to a set it strips all duplicate elements and all you need to do is get the length of the resultant set.
to ignore NaN values in a list you can do the following -
df['unique_values'] = df['names'].apply(lambda a : len([x for x in set(a) if str(x) != 'nan']))
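A quick self-contained check of the set-based approach, with the frame rebuilt by hand from the question's sample data:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({'names': [['B-PER', 'I-PER', 'I-PER', 'B-PER'],
                             ['I-PER', 'N-PER', 'B-PER', 'I-PER', 'A-PER'],
                             ['B-PER', 'A-PER', 'I-PER'],
                             ['B-PER', 'A-PER', 'A-PER', 'A-PER']]})

# set() drops duplicates within each list; len() counts what remains
df['unique_values'] = df['names'].apply(lambda a: len(set(a)))
print(df['unique_values'].tolist())  # [2, 4, 3, 2]
```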
Try:
df["unique_values"] = df.names.explode().groupby(level = 0).unique().str.len()
Output
df
names unique_values
0 [B-PER, I-PER, I-PER, B-PER] 2
1 [I-PER, N-PER, B-PER, I-PER, A-PER] 4
2 [B-PER, A-PER, I-PER] 3
3 [B-PER, A-PER, A-PER, A-PER] 2

Changing pandas column values into another format [duplicate]

This question already has answers here:
How to convert string representation of list to a list
(19 answers)
Closed 3 years ago.
The labels column of my test DataFrame, test['labels'], looks like:
0 ['Edit Distance']
1 ['Island Perimeter']
2 ['Longest Substring with At Most K Distinct Ch...
3 ['Valid Parentheses']
4 ['Intersection of Two Arrays II']
5 ['N-Queens']
For each value in the column, which is a string representation of a list ("['Edit Distance']"), I want to apply the function below to convert it into an actual list.
ast.literal_eval(VALUE HERE)
What is a straightforward way to do this?
Use:
import ast
test['labels'] = test['labels'].apply(ast.literal_eval)
print (test)
labels
0 [Edit Distance]
1 [Island Perimeter]
2 [Longest Substring with At Most K Distinct Ch]
3 [Valid Parentheses]
4 [Intersection of Two Arrays II]
5 [N-Queens]
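For illustration, here is what ast.literal_eval does to a single such string, outside of any DataFrame:

```python
import ast

s = "['Edit Distance']"        # string representation of a list
parsed = ast.literal_eval(s)   # safely evaluates the Python literal
print(parsed)        # ['Edit Distance']
print(type(parsed))  # <class 'list'>
```

Unlike eval, literal_eval only accepts Python literals (lists, dicts, strings, numbers, etc.), so it is safe to run on untrusted column values.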

Slicing a Data frame by checking consecutive elements [duplicate]

This question already has answers here:
Pandas: Drop consecutive duplicates
(8 answers)
Closed 4 years ago.
I have a DF indexed by time, and one of its columns (with 2 distinct values) looks like [x,x,y,y,x,x,x,y,y,y,y,x]. I want to slice this DF so I get this column without repeated consecutive values; in this example: [x,y,x,y,x], where each value is the first of its run.
Still trying to figure it out...
Thanks!!
Assuming you have df like below
df=pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
We use shift to check whether each value equals the previous one
df[df[0].shift()!=df[0]]
Out[142]:
0
0 x
2 y
4 x
7 y
11 x
You could also just loop through and save the last element used
df = pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
result = []
old = None  # last value kept
for val in df[0]:
    if val != old:
        result.append(val)
        old = val
df2 = pd.DataFrame(result)
EDIT:
Or for a vector use a list
>>> L=[1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [x[0] for x in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]
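The same itertools.groupby idea can be applied directly to the DataFrame column used above:

```python
from itertools import groupby

import pandas as pd

df = pd.DataFrame(['x', 'x', 'y', 'y', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'x'])

# groupby collapses runs of equal consecutive values; keep each run's key
firsts = [key for key, _ in groupby(df[0])]
print(firsts)  # ['x', 'y', 'x', 'y', 'x']
```

This gives the values only; if you also need the original time index of each run's first row, the shift-based mask above is the better fit.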
