I have a dataframe that has characters in it - I want a boolean result by row that tells me if all columns for that row have the same value.
For example, I have
df = [ a b c d
0 'C' 'C' 'C' 'C'
1 'C' 'C' 'A' 'A'
2 'A' 'A' 'A' 'A' ]
and I want the result to be
0 True
1 False
2 True
I've tried .all, but it seems I can only check whether all values equal one particular letter. The only other way I can think of is doing a unique on each row and seeing if the number of unique values equals 1? Thanks in advance.
I think the cleanest way is to check all columns against the first column using eq:
In [11]: df
Out[11]:
a b c d
0 C C C C
1 C C A A
2 A A A A
In [12]: df.iloc[:, 0]
Out[12]:
0 C
1 C
2 A
Name: a, dtype: object
In [13]: df.eq(df.iloc[:, 0], axis=0)
Out[13]:
a b c d
0 True True True True
1 True True False False
2 True True True True
Now you can use all (if they are all equal to the first item, they are all equal):
In [14]: df.eq(df.iloc[:, 0], axis=0).all(1)
Out[14]:
0 True
1 False
2 True
dtype: bool
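One edge case worth noting (not covered in the answer above): NaN compares unequal even to another NaN, so a row of all NaNs comes out False under this approach. A minimal sketch, assuming such rows should count as "all equal":
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['C', np.nan], 'b': ['C', np.nan]})

# eq against the first column marks the all-NaN row False...
base = df.eq(df.iloc[:, 0], axis=0).all(axis=1)

# ...so OR in an explicit all-NaN check if those rows should count as equal
print(base | df.isna().all(axis=1))
0    True
1    True
dtype: bool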
The same solution in NumPy, for better performance: compare the array against its first column and check whether all values per row are True:
a = df.values
b = (a == a[:, [0]]).all(axis=1)
print (b)
[ True False True]
And if you need a Series:
s = pd.Series(b, index=df.index)
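Put together, a minimal runnable sketch (recreating the question's example frame, since the snippet above assumes df already exists):
import pandas as pd

df = pd.DataFrame({'a': list('CCA'), 'b': list('CCA'),
                   'c': list('CAA'), 'd': list('CAA')})

a = df.values                      # 2-D object array
b = (a == a[:, [0]]).all(axis=1)   # compare every column to the first
print(pd.Series(b, index=df.index))
0     True
1    False
2     True
dtype: bool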
Comparing solutions:
data = [[10,10,10],[12,12,12],[10,12,10]]
df = pd.DataFrame(data,columns=['Col1','Col2','Col3'])
df = pd.concat([df] * 10000, ignore_index=True)
#[30000 rows x 3 columns]
#jez - numpy array
In [14]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
141 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#jez - Series
In [15]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
...: pd.Series(b, index=df.index)
169 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Andy Hayden
In [16]: %%timeit
...: df.eq(df.iloc[:, 0], axis=0).all(axis=1)
2.22 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Wen1
In [17]: %%timeit
...: list(map(lambda x : len(set(x))==1,df.values))
56.8 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#K.-Michael Aye
In [18]: %%timeit
...: df.apply(lambda x: len(set(x)) == 1, axis=1)
686 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Wen2
In [19]: %%timeit
...: df.nunique(1).eq(1)
2.87 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
nunique is new in pandas 0.20.0. (Based on the timing benchmark from jezrael above, use this one only if performance is not important.)
df.nunique(axis = 1).eq(1)
Out[308]:
0 True
1 False
2 True
dtype: bool
Or you can use map with set:
list(map(lambda x : len(set(x))==1,df.values))
df = pd.DataFrame.from_dict({'a': 'C C A'.split(),
                             'b': 'C C A'.split(),
                             'c': 'C A A'.split(),
                             'd': 'C A A'.split()})
df.apply(lambda x: len(set(x)) == 1, axis=1)
0 True
1 False
2 True
dtype: bool
Explanation: set(x) has only one element if all elements of the row are the same; for instance, set(['C', 'C', 'A', 'A']) is {'C', 'A'}, which has length 2. The axis=1 option applies the function over rows instead of columns.
You can use nunique(axis=1), so the result (added as a new column) can be obtained with:
df['unique'] = df.nunique(axis=1) == 1
The answer by @yo-and-ben-w uses eq(1), but I think == 1 is easier to read.
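A minimal runnable sketch of this variant, on the question's example frame:
import pandas as pd

df = pd.DataFrame({'a': list('CCA'), 'b': list('CCA'),
                   'c': list('CAA'), 'd': list('CAA')})

# one unique value in the row means all columns are equal
df['unique'] = df.nunique(axis=1) == 1
print(df)
   a  b  c  d  unique
0  C  C  C  C    True
1  C  C  A  A   False
2  A  A  A  A    True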
I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I apply the "DuplicateSample" changes made on rdtRowsSampleGrouped back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if need faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
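For example, a minimal sketch with made-up Sample values:
import pandas as pd

df = pd.DataFrame({'Sample': [1, 2, 2, 3]})

# keep=False marks every member of a duplicate group, not just the later repeats
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
print(df)
   Sample  DuplicateSample
0       1            False
1       2             True
2       2             True
3       3            False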
Performance on some sample data (real results will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Stef's solution is unfortunately 2734 times slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
I've got a dataframe:
a-line abstract ... zippered
0 0 ... 0
0 1 ... 0
0 0 ... 1
Where the value of the cell is 1 I need to replace it with the header name.
df.dtypes returns Length: 1000, dtype: object
I have tried df.apply(lambda x: x.astype(object).replace(1, x.name))
but get TypeError: Invalid "to_replace" type: 'int'
other attempts:
df.apply(lambda x: x.astype(object).replace(str(1), x.name))  # TypeError: Invalid "to_replace" type: 'str'
df.apply(lambda x: x.astype(str).replace(str(1), x.name))     # TypeError: Invalid "to_replace" type: 'str'
The key idea to all three solutions below is to loop through columns. The first method is with replace.
for col in df:
    df[col] = df[col].replace(1, df[col].name)
Alternatively, per your attempt to apply a lambda:
for col in df_new:
    df_new[col] = df_new[col].astype(str).apply(lambda x: x.replace('1', df_new[col].name))
Finally, this is with np.where:
for col in df_new:
    df_new[col] = np.where(df_new[col] == 1, df_new[col].name, df_new[col])
Output for all three:
a-line abstract ... zippered
0 0 0 ... 0
1 0 abstract ... 0
2 0 0 ... zippered
You might consider playing with this idea:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]],
                  columns=["a", "b", "c"])

df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
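For reference, on the 4x3 frame above each 1 becomes the name of its column while everything else is left alone (note that np.where returns a plain object array, so the rebuilt frame has object dtype):
print(df)
   a  b  c
0  0  0  0
1  0  b  0
2  0  0  c
3  0  b  0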
UPDATE: Timing
@David Erickson's solutions work fine, but you can avoid the loop entirely; this matters in particular if you have many columns.
Generate data
import pandas as pd
import numpy as np
n = 1_000
columns = ["{:04d}".format(i) for i in range(n)]
df = pd.DataFrame(np.random.randint(0, high=2, size=(4, n)),
                  columns=columns)
# we test against the same dataframe
df_bk = df.copy()
David's solution #1
%%timeit -n10
for col in df:
    df[col] = df[col].replace(1, df[col].name)
1.01 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #2
%%timeit -n10
df = df_bk.copy()
for col in df:
    df[col] = df[col].astype(str).apply(lambda x: x.replace('1', df[col].name))
890 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #3
%%timeit -n10
for col in df:
    df[col] = np.where(df[col] == 1, df[col].name, df[col])
886 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Avoiding loops
%%timeit -n10
df = df_bk.copy()
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
455 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have 2 columns in pandas DF:
col_A col_B
0 1
0 0
0 1
0 1
1 0
1 0
1 1
I want to create a new column for each combination of values of col_A and col_B, similar to get_dummies(); the only difference is that here I am using a combination of columns.
Example output - this column is 1 where the value of col_A is 0 and col_B is 1:
col_A_0_col_B_1
1
0
1
1
0
0
0
I am currently using iterrows() to iterate through every row, check the values, and then set the new column. Is there a shorter, more idiomatic pandas approach to achieve this?
Convert chained boolean masks to integers:
df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
For better performance:
df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
Performance: it depends on the number of rows and the distribution of 0 and 1 values:
np.random.seed(343)
#10k rows
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)
In [92]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
...:
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [93]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
...:
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [94]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
...:
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: %%timeit
...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
...:
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [96]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
...:
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [97]: %%timeit
...: df['col_A_0_col_B_1'] = 0
...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
...:
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use np.where
df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
First create your column and assign it e.g. 0 for False:
df['col_A_0_col_B_1'] = 0
Then, using loc, you can filter where col_A == 0 and col_B == 1 and assign 1 to the new column:
df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
If I understood correctly, you could do something like this:
import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)
Output
col_A col_B col_A_0_col_B_1
0 0 1 1
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 0
5 1 0 0
6 1 1 0
Or as alternative:
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)
You can use pandas ~ for boolean not, relying on 1 and 0 acting as true and false:
df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
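One caveat worth adding: on integer 0/1 columns, ~ and & are bitwise operators (~0 is -1, ~1 is -2), which happens to produce the right 0/1 result here but is easy to get wrong. A safer sketch casts to bool first:
import pandas as pd

df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# cast to bool so ~ and & are logical operations, then back to 0/1
df['col_A_0_col_B_1'] = (~df['col_A'].astype(bool)
                         & df['col_B'].astype(bool)).astype(int)
print(df['col_A_0_col_B_1'].tolist())  # [1, 0, 1, 1, 0, 0, 0]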
I have a Pandas dataframe that looks like the following:
Column_X Column_Y A-Indicator
Val1 A True
Val1 B True
Val2 B False
Val2 B False
I want to create the "A-Indicator" column. It should be True for all rows with Column_X = 'Val1' because at least one 'Val1' row has Column_Y = 'A'. Since no row with Column_X = 'Val2' has Column_Y = 'A', the indicator is False for all those rows. Is there an easy way to achieve this?
If performance is important, don't use groupby:
df['A-Indicator'] = df['Column_X'].isin(df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique())
print (df)
Column_X Column_Y A-Indicator
0 Val1 A True
1 Val1 B True
2 Val2 B False
3 Val2 B False
Explanation:
First compare by eq (==):
print (df['Column_Y'].eq('A'))
0 True
1 False
2 False
3 False
Name: Column_Y, dtype: bool
Select the values of Column_X for those matching rows:
print (df.loc[df['Column_Y'].eq('A'), 'Column_X'])
0 Val1
Name: Column_X, dtype: object
Get unique values for better performance:
print (df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique())
['Val1']
And finally compare with isin:
print (df['Column_X'].isin(df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique()))
0 True
1 True
2 False
3 False
Name: Column_X, dtype: bool
Performance: it depends on the number of rows and the number of matched values:
np.random.seed(123)
N = 1000000
L = list('ABCDEFGHIJK')
df = pd.DataFrame({
    'Column_X': np.random.randint(1000, size=N),
    'Column_Y': np.random.choice(L, N),
})
print (df)
In [193]: %timeit df['A-Indicator'] = df['Column_X'].isin(df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique())
92.1 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [194]: %timeit df['A-Indicator']=df.groupby('Column_X')['Column_Y'].transform(lambda x: x.isin(['A']).any())
724 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [195]: %timeit df['A-Indicator']=df.groupby('Column_X')['Column_Y'].transform(lambda x: 'A' in x.unique())
770 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Example:
import pandas as pd
arr = pd.Series(['a',['a','b'],'c'])
I would like to get the indices where the series elements contain 'a'. So I would like to get back indices 0 and 1.
I've tried writing
arr.str.contains('a')
but this returns
0 True
1 NaN
2 False
dtype: object
while I'd like it to return
0 True
1 True
2 False
dtype: object
Use Series.str.join() to concatenate lists/arrays in cells into a single string, and then use .str.contains('a'). (A plain string is itself iterable, so sep.join('abc') gives 'a~b~c' and the contains check still works on string cells.)
In [78]: arr.str.join(sep='~').str.contains('a')
Out[78]:
0 True
1 True
2 False
dtype: bool
Use Series.apply and Python's in keyword, which works on both lists and strings:
arr.apply(lambda x: 'a' in x)
This will work fine if you don't have any NaN values in your Series, but if you do, you can use:
import numpy as np
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
This is much faster than using Series.str.
Benchmarks:
%%timeit
arr.str.join(sep='~').str.contains('a')
Takes: 249 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x)
Takes: 70.1 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
Takes: 69 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)