I have a large dataset with similar data:
>>> df = pd.DataFrame(
... {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
... 'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
A B
0 one a
1 two b
2 two c
3 one a
4 one a
5 three NaN
There are two aggregation functions 'any' and 'unique':
>>> df.groupby('A')['B'].any()
A
one True
three False
two True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one [a]
three [nan]
two [b, c]
Name: B, dtype: object
but I want to get the following result (or something close to it):
A
one a
three False
two True
I can do it with some complex code, but I would rather find an appropriate function in an existing Python package, or at least the simplest way to solve the problem. I'd be grateful if you could help me with that.
You can aggregate with Series.nunique for one column, and with the unique values (missing values removed) for another column:
df1 = df.groupby('A').agg(count=('B', 'nunique'),
                          uniq_without_NaNs=('B', lambda x: x.dropna().unique()))
print (df1)
count uniq_without_NaNs
A
one 1 [a]
three 0 []
two 2 [b, c]
Then create a boolean mask where count is greater than 1, and where count equals 1 replace the values with the first element of uniq_without_NaNs:
out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one a
three False
two True
Name: count, dtype: object
>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)
A
one a
three False
two True
dtype: object
get a hold of the group aggregator
count the number of unique values
if > 1, i.e., more than one unique value, put True
if == 1, i.e., exactly one unique value, put that unique value
else, i.e., no unique values (all NaN), put False
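The same three-way decision can also be written as a single per-group function, if you find that easier to read. This is just a sketch equivalent to the np.select logic above, not a different method:
def summarize(x):
    u = x.dropna().unique()      # unique non-NaN values in the group
    if len(u) > 1:               # more than one unique value
        return True
    if len(u) == 1:              # exactly one unique value
        return u[0]
    return False                 # only NaNs in the group
out = df.groupby('A')['B'].apply(summarize)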
You can combine groupby with agg and use a boolean mask to choose the correct output:
# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])
Output:
>>> out
A
one a
three False
two True
>>> agg
any unique
A
one True [a]
three False [nan]
two True [b, c]
>>> m
A
one True # choose 'unique' column
three False # choose 'any' column
two False # choose 'any' column
new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']
this will give you:
A B
0 one True
1 three False
2 two True
Now, if we want to find the values, we can do:
df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)
which gives:
A
one a
three NaN
two b
The expression
series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))
will give the following result:
one a
three NaN
two [b, c]
where a simple scalar value can be identified by its type:
series[series.apply(type) == str]
I think it is easy enough to use often, but it is probably not the optimal solution.
I have a data frame with 2 columns, X & Y.
df = pd.DataFrame({
'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'],
'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e']
})
I want to create a new column ('Match') in the dataframe such that if columns X & Y have the same elements the value is True, else False.
df = pd.DataFrame({
'X': ['a', 'a,b,c', 'a,d', 'e,f', 'a,c,d,f', 'e'],
'Y': ['a', 'a,c,b', 'd,a', 'e,g', 'a,d,f,g', 'e'],
'Match':['True','True','True','False','False','True']
})
Kindly help me with this.
This works:
df['Match']=df['X'].apply(set)==df['Y'].apply(set)
Basically, what I'm doing here is converting each data point from each column into a set of its characters and then comparing them.
That is enough here because every element is a single character; for multi-character elements (or numbers) you would want to split on ',' first, as in the next answer.
Notice, however, that it won't differentiate duplicates. For example, 'a,c,c,b' vs 'a,c,b' would yield True.
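If duplicates should matter, one option (a sketch, not part of the original answer; the Match_strict column name is just illustrative) is to compare multisets with collections.Counter instead of sets:
from collections import Counter
# True only if both cells contain the same elements with the same multiplicities
df['Match_strict'] = df['X'].str.split(',').apply(Counter) == df['Y'].str.split(',').apply(Counter)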
You can try splitting the columns into lists, then sorting and comparing:
df['Match2'] = df['X'].str.split(',').apply(sorted) == df['Y'].str.split(',').apply(sorted)
Or you can convert the lists to sets and compare, depending on whether you care about duplicates:
df['Match2'] = df['X'].str.split(',').apply(set) == df['Y'].str.split(',').apply(set)
print(df)
X Y Match Match2
0 a a True True
1 a,b,c a,c,b True True
2 a,d d,a True True
3 e,f e,g False False
4 a,c,d,f a,d,f,g False False
5 e e True True
To avoid repeating yourself, you can do:
df['Match'] = df[['X', 'Y']].apply(lambda col: col.str.split(',').apply(sorted)).eval('X == Y')
Lots of ways to do this; one would be to explode your arrays, sort them, and check for equality.
import numpy as np
df1 = df.stack()\
.str.split(',')\
.explode()\
.sort_values()\
.groupby(level=[0,1])\
.agg(list).unstack(1)
df['match'] = np.where(df1['X'].eq(df1['Y']),True,False)
X Y match
0 a a True
1 a,b,c a,c,b True
2 a,d d,a True
3 e,f e,g False
4 a,c,d,f a,d,f,g False
5 e e True
I have a series where I would like to check if any string exists in a list. For example:
Series A:
A, B
A, B, C
E, D, F
List = ['A', 'B']
I would like to return whether any element of List is in each row of Series A, something like:
True
True
False
Thanks!
Assuming your series consists of strings, you can use set.intersection (&):
L = ['A', 'B']
s = pd.Series(['A, B', 'A, B, C', 'E, D, F'])
res = s.str.split(', ').map(set) & set(L)
print(res)
0 True
1 True
2 False
dtype: bool
You can use np.isin after splitting each string into a list:
s.str.split(', ').apply(lambda k: np.isin(k, List).any())
0 True
1 True
2 False
dtype: bool
This filter should give you a True/False vector (assuming each row of A is a list of elements): A.apply(lambda row: any(x in List for x in row))
Also, you mention Series but that looks like you're referring to a DataFrame.
Using issubset (note that this checks whether all of {'A', 'B'} are present in a row; for "any", negate set.isdisjoint instead):
s.str.split(', ').apply(lambda x: {'A', 'B'}.issubset(x))
Out[615]:
0 True
1 True
2 False
Name: c, dtype: bool
You can use str.contains and join the list with | to look for any element of the list, such as:
print(s.str.contains('|'.join(L)))  # with @jpp's definitions of s and L
0 True
1 True
2 False
dtype: bool
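One caveat worth adding (not from the original answer, but a documented default of pandas): str.contains interprets the pattern as a regular expression by default, so if the list items could contain regex metacharacters you may want to escape them first:
import re
print(s.str.contains('|'.join(map(re.escape, L))))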
I have an empty dataframe.
df=pd.DataFrame(columns=['a'])
For some reason, I want to generate df2, another empty dataframe, with two columns 'a' and 'b'.
If I do
df.columns=df.columns+'b'
it does not work (I get the columns renamed to 'ab')
and neither does the following
df.columns=df.columns.tolist()+['b']
How can I add a separate column 'b' to df so that df.empty stays True?
Using .loc is also not possible
df.loc[:,'b']=None
as it returns
Cannot set dataframe with no defined index and a scalar
Here are a few ways to add an empty column to an empty dataframe:
df=pd.DataFrame(columns=['a'])
df['b'] = None
df = df.assign(c=None)
df = df.assign(d=df['a'])
df['e'] = pd.Series(index=df.index)
df = pd.concat([df,pd.DataFrame(columns=list('f'))])
print(df)
Output:
Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []
I hope it helps.
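As a quick sanity check against the original requirement (just a verification sketch for the code above), the frame stays empty after all of these additions:
print(df.empty)   # True  - columns were added, but there are still no rows
print(df.shape)   # (0, 6)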
If you just do df['b'] = None then df.empty is still True and df is:
Empty DataFrame
Columns: [a, b]
Index: []
EDIT:
To create an empty df2 from the columns of df and adding new columns, you can do:
df2 = pd.DataFrame(columns = df.columns.tolist() + ['b', 'c', 'd'])
If you want to add multiple columns at the same time, you can also reindex:
new_cols = ['c', 'd', 'e', 'f', 'g']
df2 = df.reindex(df.columns.union(new_cols), axis=1)
#Empty DataFrame
#Columns: [a, c, d, e, f, g]
#Index: []
This is one way:
df2 = df.join(pd.DataFrame(columns=['b']))
The advantage of this method is you can add an arbitrary number of columns without explicit loops.
In addition, this satisfies your requirement of df.empty evaluating to True if no data exists.
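For example (a small sketch reusing the OP's df, which has only column 'a'; the extra column names are just placeholders), joining several empty columns at once still leaves the frame empty:
df2 = df.join(pd.DataFrame(columns=['b', 'c', 'd']))
print(df2.empty)     # True
print(df2.columns)   # Index(['a', 'b', 'c', 'd'], dtype='object')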
You can use concat:
df=pd.DataFrame(columns=['a'])
df
Out[568]:
Empty DataFrame
Columns: [a]
Index: []
df2=pd.DataFrame(columns=['b', 'c', 'd'])
pd.concat([df,df2])
Out[571]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
Is there a more sophisticated way to check if a dataframe df contains 2 columns named Column 1 and Column 2:
if numpy.all(map(lambda c: c in df.columns, ['Column 1', 'Column 2'])):
do_something()
You can use Index.isin:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If you need to check for at least one value, use any:
cols = ['A', 'B']
print (df.columns.isin(cols).any())
True
cols = ['W', 'B']
print (df.columns.isin(cols).any())
True
cols = ['W', 'Z']
print (df.columns.isin(cols).any())
False
If you need to check all values:
cols = ['A', 'B', 'C','D','E','F']
print (df.columns.isin(cols).all())
True
cols = ['W', 'Z']
print (df.columns.isin(cols).all())
False
I know it's an old post...
From this answer:
if set(['Column 1', 'Column 2']).issubset(df.columns):
do_something()
or, a little more elegantly:
if {'Column 1', 'Column 2'}.issubset(df.columns):
do_something()
The one issue with the given answer (and maybe it works for the OP) is that it tests to see if all of the dataframe's columns are in a given list - but not that all of the given list's items are in the dataframe columns.
My solution was:
test = all([ i in df.columns for i in ['A', 'B'] ])
where test is simply True or False.
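To make the difference concrete, here is a small comparison using the df from the isin answer above (cols = ['A', 'B'] is the list being checked):
cols = ['A', 'B']
# "are all of the dataframe's columns in cols?" - False, because C..F are not in cols
print(df.columns.isin(cols).all())
# "are all items of cols among the dataframe's columns?" - True
print(all(c in df.columns for c in cols))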
Also, to check for the existence of a list of items among a dataframe's columns, still using isin, you can do the following:
col_list = ['A', 'B']
pd.Index(col_list).isin(df.columns).all()
As explained in the accepted answer, .all() is to check if all items in col_list are present in the columns, while .any() is to test the presence
of any of them.