Dataframe check if 2 columns contain same elements [duplicate] - python

This question already has answers here:
Pandas - check for string matches in different columns with column values being comma separated
I have a data frame with two columns, X and Y.
import pandas as pd

df = pd.DataFrame({
    'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'],
    'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e']
})
I want to create a new column ('Match') in the dataframe such that if columns X and Y contain the same elements, the value is True, otherwise False.
df = pd.DataFrame({
    'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'],
    'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e'],
    'Match': ['True', 'True', 'True', 'False', 'False', 'True']
})
Kindly help me with this

This works:
df['Match']=df['X'].apply(set)==df['Y'].apply(set)
Basically, what I'm doing here is converting each value in each column into a set and then comparing them.
It should work regardless of the kind of data (numbers or strings, for example).
Note, however, that it won't differentiate duplicates. For example, 'a,c,c,b' vs 'a,c,b' would yield True.
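A minimal sketch illustrating that caveat (the values here are hypothetical, not from the question):
import pandas as pd

# Sets ignore duplicates, so 'a,c,c,b' and 'a,c,b' compare equal.
tmp = pd.DataFrame({'X': ['a,c,c,b'], 'Y': ['a,c,b']})
print((tmp['X'].apply(set) == tmp['Y'].apply(set)).iloc[0])     # True

# Sorting the split lists instead keeps duplicates significant.
print((tmp['X'].str.split(',').apply(sorted) ==
       tmp['Y'].str.split(',').apply(sorted)).iloc[0])          # False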

You can try splitting the columns into lists, then sorting and comparing:
df['Match2'] = df['X'].str.split(',').apply(sorted) == df['Y'].str.split(',').apply(sorted)
Or you can convert the lists to sets and compare, depending on whether you care about duplicates:
df['Match2'] = df['X'].str.split(',').apply(set) == df['Y'].str.split(',').apply(set)
print(df)
X Y Match Match2
0 a a True True
1 a,b,c a,c,b True True
2 a,d d,a True True
3 e,f e,g False False
4 a,c,d,f a,d,f,g False False
5 e e True True
To avoid repeating yourself, you can do:
df['Match'] = df[['X', 'Y']].apply(lambda col: col.str.split(',').apply(sorted)).eval('X == Y')

There are lots of ways to do this; one would be to explode your arrays, sort them, and test for equality.
import numpy as np

df1 = df.stack()\
        .str.split(',')\
        .explode()\
        .sort_values()\
        .groupby(level=[0, 1])\
        .agg(list)\
        .unstack(1)
df['match'] = np.where(df1['X'].eq(df1['Y']), True, False)
X Y match
0 a a True
1 a,b,c a,c,b True
2 a,d d,a True
3 e,f e,g False
4 a,c,d,f a,d,f,g False
5 e e True

Related

Is there a pandas aggregate function that combines features of 'any' and 'unique'?

I have a large dataset with similar data:
>>> df = pd.DataFrame(
... {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
... 'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
A B
0 one a
1 two b
2 two c
3 one a
4 one a
5 three NaN
There are two aggregation functions 'any' and 'unique':
>>> df.groupby('A')['B'].any()
A
one True
three False
two True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one [a]
three [nan]
two [b, c]
Name: B, dtype: object
but I want to get the following result (or something close to it):
A
one a
three False
two True
I can do it with some complex code, but it would be better for me to find an appropriate function in an existing package, or the easiest way to solve the problem. I'd be grateful if you could help me with that.
You can aggregate with Series.nunique to count the unique values, and collect the unique values with missing values removed in another column:
df1 = df.groupby('A').agg(count=('B', 'nunique'),
                          uniq_without_NaNs=('B', lambda x: x.dropna().unique()))
print (df1)
count uniq_without_NaNs
A
one 1 [a]
three 0 []
two 2 [b, c]
Then create a mask where count is greater than 1, and replace the values with the first element of uniq_without_NaNs where count equals 1:
out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one a
three False
two True
Name: count, dtype: object
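To spell out what that mask/replace step does, here is an equivalent, more verbose sketch using the df1 built above:
has_many = df1['count'].gt(1)                  # more than one unique value -> True
only_one = df1['count'].eq(1)                  # exactly one unique value
first_val = df1['uniq_without_NaNs'].str[0]    # that single value (NaN if the list is empty)

# Start from the True/False of has_many, then overwrite the rows that have
# exactly one unique value with the value itself; rows with no uniques stay False.
out = has_many.mask(only_one, first_val)
print(out)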
>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)
A
one a
three False
two True
dtype: object
get a hold on the group aggregator
count the number of uniques
if > 1, i.e., more than one unique, put True
if == 1, i.e., only one unique, put that unique value
else, i.e., no uniques (all NaNs), put False
You can combine groupby with agg and use a boolean mask to choose the correct output:
# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])
Output:
>>> out
A
one a
three False
two True
>>> agg
any unique
A
one True [a]
three False [nan]
two True [b, c]
>>> m
A
one True # choose 'unique' column
three False # choose 'any' column
two False # choose 'any' column
new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']
this will give you:
A B
0 one True
1 three False
2 two True
Now, if we want to find the actual values, we can do:
df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)
which gives:
A
one a
three NaN
two b
The expression
series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))
will give the following result:
one a
three NaN
two [b, c]
where a simple (scalar) value can be identified by its type:
series[series.apply(type) == str]
I think it is simple enough to use often, but it is probably not the optimal solution.

The built-in function any() is not consistent when applied to columns vs. the whole dataframe in Python

I have a dataframe that might contain NaN values.
array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
df.iloc[1,3] = np.NaN
df.isna().apply(lambda x: any(x), axis = 0)
Output:
0 False
1 False
2 False
3 True
4 False
dtype: bool
When I run:
any(df.isna())
It returns:
True
If there are no NaNs:
array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
#df.iloc[1,3] = np.NaN
df.isna().apply(lambda x: any(x), axis = 0)
0 False
1 False
2 False
3 False
4 False
dtype: bool
However when I run:
any(df.isna())
It returns:
True
Why is this the case? Am I misunderstanding the function any()?
Why is this the case? Am I misunderstanding the function any()?
When you loop over a DataFrame you are actually iterating over its column labels, not its rows or values as you might think. More precisely, the for loop calls DataFrame.__iter__, which returns an iterator over the column labels of the DataFrame.
For instance, in the following
df = pd.DataFrame(columns=['a', 'b', 'c'])
for x in df:
    print(x)
# Output:
#
# a
# b
# c
x holds the name of each column of df. You can also check the output of list(df).
This means that when you do any(df.isna()), under the hood any is actually iterating over the column labels of df and checking their truthiness. If at least one is truthy it returns True.
In both of your examples the column labels are the numbers list(df.isna()) = list(df.columns) = [0, 1, 2, 3, 4], of which only 0 is a falsy value. Therefore, in both cases any(df.isna()) = True.
Solution
The solution is to use DataFrame.any with axis=None instead of using the built-in any function.
df.isna().any(axis=None)
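For reference, a minimal self-contained sketch contrasting the built-in any() with DataFrame.any(axis=None):
import numpy as np
import pandas as pd

arr = np.full((4, 5), 10.0)
df = pd.DataFrame(arr)

print(any(df.isna()))              # True: any() only sees the column labels [0, 1, 2, 3, 4]
print(df.isna().any(axis=None))    # False: there are no NaNs yet

df.iloc[1, 3] = np.nan
print(df.isna().any(axis=None))    # True: now there is a NaN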

replace empty list with NaN in pandas dataframe

I'm trying to replace empty lists in my data with NaN values. But how do I represent an empty list in the expression?
import numpy as np
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [text] 3
3 [] 4
d.loc[d['x'] == [],['x']] = d.loc[d['x'] == [],'x'].apply(lambda x: np.nan)
d
ValueError: Arrays were different lengths: 4 vs 0
Also, I want to select [text] by using d[d['x'] == ["text"]], but that gives a ValueError: Arrays were different lengths: 4 vs 1, while selecting 3 by using d[d['y'] == 3] works correctly. Why?
If you wish to replace empty lists in the column x with numpy nan's, you can do the following:
d.x = d.x.apply(lambda y: np.nan if len(y)==0 else y)
If you want to subset the dataframe on rows equal to ['text'], try the following:
d[[y==['text'] for y in d.x]]
I hope this helps.
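For reference, another way to do the empty-list replacement is the following sketch using Series.mask together with Series.str.len() (not taken from the answers above):
import pandas as pd

d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], ["text"], []], 'y': [1, 2, 3, 4]})

# str.len() gives the length of each list; mask() replaces the rows where the
# condition is True with NaN by default.
d['x'] = d['x'].mask(d['x'].str.len() == 0)
print(d)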
You can use the apply function to match a specified cell value, regardless of whether it is an instance of string, list, and so on.
For example, in your case:
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [text] 3
3 [] 4
If you use d == 3 to select the cells whose value is 3, it works fine:
x y
0 False False
1 False False
2 False True
3 False False
However, if you use the equality operator to match a list, such as d == ['text'] or d == '[text]', the result may not be what you expect.
There are some solutions:
Use apply() on the specified Series of your DataFrame, just like the answer at the top.
A more general method, using applymap() on the whole DataFrame, may be used as a preprocessing step:
d.applymap(lambda x: x == [])
x y
0 False False
1 False False
2 False False
3 True False
I hope this helps you and future readers. It would also be better to add a type check in your applymap function, which might otherwise raise exceptions.
To answer your main question, just leave out the empty lists altogether. The NaNs will be populated automatically where there is a value in one column and not the other, if you use pandas.concat instead of building a dataframe from a dictionary.
>>> import pandas as pd
>>> ser1 = pd.Series([[1,2,3], [1,2], ["text"]], name='x')
>>> ser2 = pd.Series([1,2,3,4], name='y')
>>> result = pd.concat([ser1, ser2], axis=1)
>>> result
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [text] 3
3 NaN 4
About your second question, it seems that you can't search inside of an element. Perhaps you should make that a separate question since it's not really related to your main question.

Check for existence of multiple columns

Is there a more sophisticated way to check if a dataframe df contains 2 columns named Column 1 and Column 2:
if numpy.all(map(lambda c: c in df.columns, ['Column 1', 'Column 2'])):
    do_something()
You can use Index.isin:
df = pd.DataFrame({'A': [1,2,3],
                   'B': [4,5,6],
                   'C': [7,8,9],
                   'D': [1,3,5],
                   'E': [5,3,6],
                   'F': [7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If you need to check that at least one value is present, use any:
cols = ['A', 'B']
print (df.columns.isin(cols).any())
True
cols = ['W', 'B']
print (df.columns.isin(cols).any())
True
cols = ['W', 'Z']
print (df.columns.isin(cols).any())
False
If you need to check all values:
cols = ['A', 'B', 'C','D','E','F']
print (df.columns.isin(cols).all())
True
cols = ['W', 'Z']
print (df.columns.isin(cols).all())
False
I know it's an old post...
From this answer:
if set(['Column 1', 'Column 2']).issubset(df.columns):
    do_something()
or, a little more elegantly:
if {'Column 1', 'Column 2'}.issubset(df.columns):
    do_something()
The one issue with the given answer (and maybe it works for the OP) is that it tests whether all of the dataframe's columns are in a given list, but not whether all of the given list's items are in the dataframe's columns.
My solution was:
test = all([ i in df.columns for i in ['A', 'B'] ])
Where test is a simple True or False
Also, to check for the existence of a list of items among a dataframe's columns, still using isin, you can do the following:
col_list = ['A', 'B']
pd.Index(col_list).isin(df.columns).all()
As explained in the accepted answer, .all() checks whether all items in col_list are present in the columns, while .any() tests the presence of any of them.
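For example, with the dataframe from the accepted answer above:
cols = ['A', 'B']
print(pd.Index(cols).isin(df.columns).all())   # True: both 'A' and 'B' exist

cols = ['A', 'Z']
print(pd.Index(cols).isin(df.columns).all())   # False: 'Z' is missing
print(pd.Index(cols).isin(df.columns).any())   # True: 'A' exists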

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I added a new column containing the concatenated strings/identifiers, and then tried to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the dataframe's method .isin().
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that together form your ids? In that case you can combine two column-wise isin() checks with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
    ...:                    'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so we add the two boolean series from the two isin() calls to get something like an OR operation. Then, like before, we can index into this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
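As a side note on the design choice: adding boolean Series works, but the idiomatic way to express OR on boolean Series in pandas is the | operator; a minimal sketch using the variables from the session above:
# Same selection written with the idiomatic OR operator for boolean Series
mask = df.ids.isin(other_ids) | df.ids2.isin(other_ids)
new = df[mask]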
