Strings in pandas series in list - python

I have a series where I would like to check if any string exists in a list. For example:
Series A:
A, B
A, B, C
E, D, F
List = ['A', 'B']
I would like to return whether any element of List is in Series A, something like:
True
True
False
Thanks!

Assuming your series consists of strings, you can use set.intersection (&):
L = ['A', 'B']
s = pd.Series(['A, B', 'A, B, C', 'E, D, F'])
res = s.str.split(', ').map(set) & set(L)
print(res)
0 True
1 True
2 False
dtype: bool
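If the & step hands you sets instead of booleans on your pandas version, the same check can be spelled out explicitly with set.isdisjoint; a minimal, self-contained sketch:
import pandas as pd

s = pd.Series(['A, B', 'A, B, C', 'E, D, F'])
L = ['A', 'B']

# Split each string into its parts and test whether any part is also in L.
target = set(L)
res = s.str.split(', ').map(lambda parts: not target.isdisjoint(parts))
print(res)
# 0     True
# 1     True
# 2    False
# dtype: bool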

You can use np.isin (after splitting the strings into lists):
s.str.split(', ').agg(lambda k: np.isin(k, List).any())
0 True
1 True
2 False
dtype: bool

This filter should give you a T/F vector: A.apply(lambda _: _ in List)
Also, you mention Series but that looks like you're referring to a DataFrame.

Using issubset
s.str.split(', ').apply(lambda x : {'A','B'}.issubset(tuple(x)))
Out[615]:
0 True
1 True
2 False
Name: c, dtype: bool

You can use str.contains and join the list with | to look for any element of the list:
print(s.str.contains('|'.join(L)))  # with jpp's definitions of s and L
0 True
1 True
2 False
dtype: bool
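Note that str.contains does substring matching, so with longer tokens the joined pattern can match inside other words; escaping the terms and adding word boundaries is one way to tighten it (a sketch, assuming word-like tokens):
import re
import pandas as pd

s = pd.Series(['A, B', 'A, B, C', 'E, D, F'])
L = ['A', 'B']

# \b word boundaries keep 'A' from matching inside a longer token such as 'AB'.
pattern = '|'.join(r'\b{}\b'.format(re.escape(x)) for x in L)
print(s.str.contains(pattern))
# 0     True
# 1     True
# 2    False
# dtype: bool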

Related

Is there a pandas aggregate function that combines features of 'any' and 'unique'?

I have a large dataset with similar data:
>>> df = pd.DataFrame(
... {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
... 'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
A B
0 one a
1 two b
2 two c
3 one a
4 one a
5 three NaN
There are two aggregation functions 'any' and 'unique':
>>> df.groupby('A')['B'].any()
A
one True
three False
two True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one [a]
three [nan]
two [b, c]
Name: B, dtype: object
but I want to get the following result (or something close to it):
A
one a
three False
two True
I can do it with some complex code, but I would rather find an appropriate function in existing Python packages, or at least the easiest way to solve the problem. I'd be grateful if you could help me with that.
You can aggregate with Series.nunique for the count and collect the unique values, with missing values removed, in a second column:
df1 = df.groupby('A').agg(count=('B', 'nunique'),
                          uniq_without_NaNs=('B', lambda x: x.dropna().unique()))
print (df1)
count uniq_without_NaNs
A
one 1 [a]
three 0 []
two 2 [b, c]
Then create a mask for count greater than 1, and replace values with the first element of uniq_without_NaNs where count equals 1:
out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one a
three False
two True
Name: count, dtype: object
>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)
A
one a
three False
two True
dtype: object
Get a hold on the group aggregator.
Count the number of uniques.
If > 1, i.e. more than one unique value, put True.
If == 1, i.e. only one unique value, put that unique value.
Else, i.e. no uniques (all NaNs), put False.
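Putting it together as a self-contained script (a sketch using the df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'two', 'two', 'one', 'one', 'three'],
                   'B': ['a', 'b', 'c', 'a', 'a', np.nan]})

g = df.groupby('A')['B'].agg           # handle on the group aggregator
nun = g('nunique')                     # number of unique non-NaN values per group
out = pd.Series(np.select([nun > 1, nun == 1],          # conditions, checked in order
                          [True, g('unique').str[0]],   # True, or the single unique value
                          default=False),               # all-NaN group -> False
                index=nun.index)
print(out)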
You can combine groupby with agg and use a boolean mask to choose the correct output:
# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])
Output:
>>> out
A
one a
three False
two True
>>> agg
any unique
A
one True [a]
three False [nan]
two True [b, c]
>>> m
A
one True # choose 'unique' column
three False # choose 'any' column
two False # choose 'any' column
new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']
this will give you:
A B
0 one True
1 three False
2 two True
now if we want to find the values we can do:
df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)
which gives:
A
one a
three NaN
two b
The expression
series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))
will give the following result:
one a
three NaN
two [b, c]
where a single value can be identified by its type:
series[series.apply(type) == str]
I think it is simple enough to use often, but it is probably not the optimal solution.

Pandas dataframe only select the columns that have all True

Given a dataframe df, I need to select the columns that have only True values
df =
A B C D E
True False True False True
Output should be
output = [A, C, E]
Try boolean indexing with all (for only True values):
df.columns[df.all()]
Output:
Index(['A', 'C', 'E'], dtype='object')
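As a self-contained example (a sketch with the single-row frame from the question):
import pandas as pd

df = pd.DataFrame([[True, False, True, False, True]],
                  columns=list('ABCDE'))

# df.all() reduces each column to True only if every value in it is True;
# that boolean Series then filters df.columns.
print(df.columns[df.all()].tolist())   # ['A', 'C', 'E']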
Try iterating through it and putting the keys in a list (you can easily modify this to result in a dict, though).
result = []
for i in df.keys():
    if df[i].all():
        result.append(i)
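The same loop can also be written as a list comprehension:
result = [col for col in df.columns if df[col].all()]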

Unexpected behaviour from applying np.isin() on a pandas dataframe

While working on an answer to another question, I stumbled upon an unexpected behaviour:
Consider the following DataFrame:
df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA')
})
print(df)
A B E F
0 A 4 5 B  # <-- row contains 'A' and 5
1 A 5 3 a  # <-- row contains 'A' and 5
2 c 4 6 a
3 d 5 9 B
4 e 5 2 b
5 f 4 4 A
If we try to find all rows that contain ['A', 5], we can use jezrael's answer:
cond = [['A'],[5]]
print( np.logical_and.reduce([df.isin(x).any(1) for x in cond]) )
which (correctly) yields: [ True True False False False False]
If we however use:
cond = [['A'],[5]]
print( df.apply(lambda x: np.isin([cond],[x]).all(),axis=1) )
this yields:
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
Closer inspection of the second attempt reveals that:
np.isin(['A',5], df.loc[0]) "wrongly" yields array([ True, False]), likely due to numpy inferring a dtype <U1, and consequently 5 != '5'
np.isin(['A',5],['A',4,5,'B']) "correctly" yields array([ True, True]), which means we can (and should) use df.loc[0].values.tolist() in the .apply() method above
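A quick standalone way to see that coercion:
import numpy as np

arr = np.array(['A', 5])
print(arr.dtype)   # a string dtype such as <U21: the int was cast to text
print(arr)         # ['A' '5'], so 5 can no longer match an integer 5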
The question, simplified:
Why do I need to specify x.values.tolist() in one case, and can directly use x in the other?
print( np.logical_and.reduce([df.isin(x).any(1) for x in cond]) )
print( df.apply(lambda x: np.isin([cond],x.values.tolist()).all(),axis=1 ) )
Edit:
Even worse is what happens if we search for [4,5]:
cond = [[4],[5]]
## this returns False for row 0
print( df.apply(lambda x: np.isin([cond],x.values.tolist() ).all() ,axis=1) )
## this returns True for row 0
print( df.apply(lambda x: np.isin([cond],x.values ).all() ,axis=1) )
I think the DataFrame mixes string and numeric columns, so looping over the rows yields Series of mixed types, and numpy coerces them to strings.
A possible solution is to convert cond to an array and then to string values:
cond = [[4],[5]]
print(df.apply(lambda x: np.isin(np.array(cond).astype(str), x.values.tolist()).all(),axis=1))
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
Unfortunately, for a general solution (e.g. if there are only numeric columns) you need to convert both cond and the Series:
f = lambda x: np.isin(np.array(cond).astype(str), x.astype(str).tolist()).all()
print (df.apply(f, axis=1))
Or convert all the data:
f = lambda x: np.isin(np.array(cond).astype(str), x.tolist()).all()
print (df.astype(str).apply(f, axis=1))
If you use sets in pure Python, it works nicely:
print(df.apply(lambda x: set([4,5]).issubset(x),axis=1) )
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
print(df.apply(lambda x: set(['A',5]).issubset(x),axis=1) )
0 True
1 True
2 False
3 False
4 False
5 False
dtype: bool
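The set approach sidesteps the issue because Python membership tests compare the original objects directly, with no numpy dtype coercion; a tiny check:
row = ['A', 4, 5, 'B']            # mixed types, like df.loc[0]
print({'A', 5}.issubset(row))     # True: the int 5 is never turned into '5'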
Because
df.isin applies to pd.Series and np.isin does not.
df.loc returns a pd.Series.
To transform pd.Series to array-like, your x.values.tolist() should work.

Find the index of a string value in a pandas DataFrame

How can I identify which column(s) in my DataFrame contain a specific string 'foo'?
Sample DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[10,20,42], 'B':['foo','bar','blah'],'C':[3,4,5], 'D':['some','foo','thing']})
I want to find B and D here.
I can search for numbers:
If I'm looking for a number (e.g. 42) instead of a string, I can generate a boolean mask like this:
>>> ~(df.where(df==42)).isnull().all()
A True
B False
C False
D False
dtype: bool
but not strings:
>>> ~(df.where(df=='foo')).isnull().all()
TypeError: Could not compare ['foo'] with block values
I don't want to iterate over each column and row if possible (my actual data is much larger than this example). It feels like there should be a simple and efficient way.
How can I do this?
One way with underlying array data -
df.columns[(df.values=='foo').any(0)].tolist()
Sample run -
In [209]: df
Out[209]:
A B C D
0 10 foo 3 some
1 20 bar 4 foo
2 42 blah 5 thing
In [210]: df.columns[(df.values=='foo').any(0)].tolist()
Out[210]: ['B', 'D']
If you are looking for just the column-mask -
In [205]: (df.values=='foo').any(0)
Out[205]: array([False, True, False, True], dtype=bool)
Option 1 df.values
~(df.where(df.values=='foo')).isnull().all()
Out[91]:
A False
B True
C False
D True
dtype: bool
Option 2 isin
~(df.where(df.isin(['foo']))).isnull().all()
Out[94]:
A False
B True
C False
D True
dtype: bool
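A slightly shorter spelling of the same idea, combining isin with any (a sketch on the sample df):
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 42], 'B': ['foo', 'bar', 'blah'],
                   'C': [3, 4, 5], 'D': ['some', 'foo', 'thing']})

# isin handles mixed dtypes column by column, any() reduces each column to
# "does it contain 'foo' anywhere", and the mask then selects those columns.
print(df.columns[df.isin(['foo']).any()].tolist())   # ['B', 'D']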
Unfortunately, it won't compare against a str through the syntax you gave. The values have to be of string type to compare them with a string, unless I am missing something.
Try this:
~df101.where(df101.isin(['foo'])).isnull().all()
A False
B True
C False
D True
dtype: bool

Element wise comparison between two dataframes (one of them containing lists)

I have two DataFrames and one of them is made up of lists.
in [00]: table01
out[00]:
a b
0 1 2
1 2 3
in [01]: table02
out[01]:
a b
0 [2] [3]
1 [1,2] [1,2]
Now I want to compare the two tables. If the element in table01 is also in the list at the same position in table02, return True, otherwise return False. So the table I want to have is:
a b
0 False False
1 True False
I have tried table01 in table02 but got an error message: 'DataFrame' objects are mutable, thus they cannot be hashed.
Please share the correct solution to this problem with me. Thanks a lot!
Using sets and df.applymap:
df3 = df1.applymap(lambda x: {x})
df4 = df2.applymap(set)
df3 & df4
a b
0 {} {}
1 {2} {}
(df3 & df4).astype(bool)
a b
0 False False
1 True False
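The same approach as one runnable snippet (a sketch with the frames from the question; in recent pandas versions applymap is also available as DataFrame.map):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})
df2 = pd.DataFrame({'a': [[2], [1, 2]], 'b': [[3], [1, 2]]})

# Wrap each scalar in a one-element set, turn each list into a set,
# intersect element-wise, then a non-empty intersection means "found".
res = (df1.applymap(lambda x: {x}) & df2.applymap(set)).astype(bool)
print(res)
#        a      b
# 0  False  False
# 1   True  False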
user3847943's solution is a good alternative, but could be improved using a set membership test.
def find_in_array(a, b):
    return a in b

for c in df2.columns:
    df2[c] = df2[c].map(set)

vfunc = np.vectorize(find_in_array)
df = pd.DataFrame(vfunc(df1, df2), index=df1.index, columns=df1.columns)
df
a b
0 False False
1 True False
You can easily do this by using numpy.vectorize. Sample code below:
import numpy as np
import pandas as pd

t1 = pd.DataFrame([[1, 2], [2, 3]])
t2 = pd.DataFrame([[[2], [3]], [[1, 2], [1, 2]]])

def find_in_array(a, b):
    return a in b

vfunc = np.vectorize(find_in_array)
print(vfunc(t1, t2))
Try this:
df = pd.melt(df1.reset_index(), 'index')
df['v2'] = pd.melt(df2.reset_index(), 'index').value
df['BOOL'] = df.apply(lambda x: True if x.value in x.v2 else False, axis=1)
df.pivot('index', 'variable', 'BOOL')
Out[491]:
variable a b
index
0 False False
1 True False
Finally:
df1.apply(lambda x: [(x==df2.loc[y,x.name])[y] for y in x.index])
Out[668]:
a b
0 False False
1 True False
