I have a dataframe that has characters in it - I want a boolean result by row that tells me if all columns for that row have the same value.
For example, I have
df = [ a b c d
0 'C' 'C' 'C' 'C'
1 'C' 'C' 'A' 'A'
2 'A' 'A' 'A' 'A' ]
and I want the result to be
0 True
1 False
2 True
I've tried .all, but it seems I can only check whether all values equal one particular letter. The only other way I can think of is doing a unique on each row and seeing if the number of unique values equals 1? Thanks in advance.
I think the cleanest way is to check all columns against the first column using eq:
In [11]: df
Out[11]:
a b c d
0 C C C C
1 C C A A
2 A A A A
In [12]: df.iloc[:, 0]
Out[12]:
0 C
1 C
2 A
Name: a, dtype: object
In [13]: df.eq(df.iloc[:, 0], axis=0)
Out[13]:
a b c d
0 True True True True
1 True True False False
2 True True True True
Now you can use all (if they are all equal to the first item, they are all equal):
In [14]: df.eq(df.iloc[:, 0], axis=0).all(1)
Out[14]:
0 True
1 False
2 True
dtype: bool
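One edge case worth noting (not covered in the answer above): NaN compares unequal even to another NaN, so a row of all NaNs comes out False under this approach. A minimal sketch, assuming such rows should count as "all equal":
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['C', np.nan], 'b': ['C', np.nan]})

# eq against the first column marks the all-NaN row False...
base = df.eq(df.iloc[:, 0], axis=0).all(axis=1)

# ...so OR in an explicit all-NaN check if those rows should count as equal
print(base | df.isna().all(axis=1))
0    True
1    True
dtype: bool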
The same solution in NumPy, for better performance: compare the array against its first column and check whether all values per row are True:
a = df.values
b = (a == a[:, [0]]).all(axis=1)
print (b)
[ True False True]
And if you need a Series:
s = pd.Series(b, index=df.index)
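Put together, a minimal runnable sketch (recreating the question's example frame, since the snippet above assumes df already exists):
import pandas as pd

df = pd.DataFrame({'a': list('CCA'), 'b': list('CCA'),
                   'c': list('CAA'), 'd': list('CAA')})

a = df.values                      # 2-D object array
b = (a == a[:, [0]]).all(axis=1)   # compare every column to the first
print(pd.Series(b, index=df.index))
0     True
1    False
2     True
dtype: bool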
Comparing solutions:
data = [[10,10,10],[12,12,12],[10,12,10]]
df = pd.DataFrame(data,columns=['Col1','Col2','Col3'])
df = pd.concat([df] * 10000, ignore_index=True)
#[30000 rows x 3 columns]
#jez - numpy array
In [14]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
141 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#jez - Series
In [15]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
...: pd.Series(b, index=df.index)
169 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Andy Hayden
In [16]: %%timeit
...: df.eq(df.iloc[:, 0], axis=0).all(axis=1)
2.22 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Wen1
In [17]: %%timeit
...: list(map(lambda x : len(set(x))==1,df.values))
56.8 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#K.-Michael Aye
In [18]: %%timeit
...: df.apply(lambda x: len(set(x)) == 1, axis=1)
686 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Wen2
In [19]: %%timeit
...: df.nunique(1).eq(1)
2.87 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
nunique is new in pandas 0.20.0. (Based on the timing benchmark from jezrael above, use this one only if performance is not important.)
df.nunique(axis = 1).eq(1)
Out[308]:
0 True
1 False
2 True
dtype: bool
Or you can use map with set:
list(map(lambda x : len(set(x))==1,df.values))
df = pd.DataFrame.from_dict({'a': 'C C A'.split(),
                             'b': 'C C A'.split(),
                             'c': 'C A A'.split(),
                             'd': 'C A A'.split()})
df.apply(lambda x: len(set(x)) == 1, axis=1)
0 True
1 False
2 True
dtype: bool
Explanation: set(x) has only one element if all elements of the row are the same; for instance, set(['C', 'C', 'A', 'A']) is {'C', 'A'}, which has length 2. The axis=1 option applies the function over rows instead of columns.
You can use nunique(axis=1), so the result (added as a new column) can be obtained with:
df['unique'] = df.nunique(axis=1) == 1
The answer by @yo-and-ben-w uses eq(1), but I think == 1 is easier to read.
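A minimal runnable sketch of this variant, on the question's example frame:
import pandas as pd

df = pd.DataFrame({'a': list('CCA'), 'b': list('CCA'),
                   'c': list('CAA'), 'd': list('CAA')})

# one unique value in the row means all columns are equal
df['unique'] = df.nunique(axis=1) == 1
print(df)
   a  b  c  d  unique
0  C  C  C  C    True
1  C  C  A  A   False
2  A  A  A  A    True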
I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I apply the "DuplicateSample" changes made on rdtRowsSampleGrouped back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if need faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
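For example, a minimal sketch with made-up Sample values:
import pandas as pd

df = pd.DataFrame({'Sample': [1, 2, 2, 3]})

# keep=False marks every member of a duplicate group, not just the later repeats
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
print(df)
   Sample  DuplicateSample
0       1            False
1       2             True
2       2             True
3       3            False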
Performance on some sample data (real results will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Stef's solution is unfortunately 2734 times slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
I've got a dataframe:
a-line abstract ... zippered
0 0 ... 0
0 1 ... 0
0 0 ... 1
Where the value of the cell is 1 I need to replace it with the header name.
df.dtypes returns Length: 1000, dtype: object
I have tried df.apply(lambda x: x.astype(object).replace(1, x.name))
but get TypeError: Invalid "to_replace" type: 'int'
other attempts:
df.apply(lambda x: x.astype(object).replace(str(1), x.name))  # TypeError: Invalid "to_replace" type: 'str'
df.apply(lambda x: x.astype(str).replace(str(1), x.name))     # TypeError: Invalid "to_replace" type: 'str'
The key idea to all three solutions below is to loop through columns. The first method is with replace.
for col in df:
    df[col] = df[col].replace(1, df[col].name)
Alternatively, per your attempt to apply a lambda:
for col in df_new:
    df_new[col] = df_new[col].astype(str).apply(lambda x: x.replace('1', df_new[col].name))
Finally, this is with np.where:
for col in df_new:
    df_new[col] = np.where(df_new[col] == 1, df_new[col].name, df_new[col])
Output for all three:
a-line abstract ... zippered
0 0 0 ... 0
1 0 abstract ... 0
2 0 0 ... zippered
You might consider playing with this idea:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]],
                  columns=["a", "b", "c"])

df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
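For reference, on the 4x3 frame above each 1 becomes the name of its column while everything else is left alone (note that np.where returns a plain object array, so the rebuilt frame has object dtype):
print(df)
   a  b  c
0  0  0  0
1  0  b  0
2  0  0  c
3  0  b  0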
UPDATE: Timing
@David Erickson's solutions work fine, but you can avoid the loop entirely; this matters in particular if you have many columns.
Generate data
import pandas as pd
import numpy as np
n = 1_000
columns = ["{:04d}".format(i) for i in range(n)]
df = pd.DataFrame(np.random.randint(0, high=2, size=(4, n)),
                  columns=columns)
# we test against the same dataframe
df_bk = df.copy()
David's solution #1
%%timeit -n10
for col in df:
    df[col] = df[col].replace(1, df[col].name)
1.01 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #2
%%timeit -n10
df = df_bk.copy()
for col in df:
    df[col] = df[col].astype(str).apply(lambda x: x.replace('1', df[col].name))
890 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #3
%%timeit -n10
for col in df:
    df[col] = np.where(df[col] == 1, df[col].name, df[col])
886 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Avoiding loops
%%timeit -n10
df = df_bk.copy()
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
455 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have 2 columns in pandas DF:
col_A col_B
0 1
0 0
0 1
0 1
1 0
1 0
1 1
I want to create a new column for each combination of values of col_A and col_B, similar to get_dummies(); the only difference is that here I am using a combination of columns.
Example output - this column is 1 where the value of col_A is 0 and col_B is 1:
col_A_0_col_B_1
1
0
1
1
0
0
0
I am currently using iterrows() to iterate through every row, check the values, and then set the new column. Is there a shorter, more idiomatic pandas approach to achieve this?
Convert chained boolean masks to integers:
df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
For better performance:
df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
Performance: it depends on the number of rows and the distribution of 0 and 1 values:
np.random.seed(343)
#10k rows
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)
In [92]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
...:
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [93]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
...:
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [94]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
...:
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: %%timeit
...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
...:
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [96]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
...:
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [97]: %%timeit
...: df['col_A_0_col_B_1'] = 0
...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
...:
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use np.where
df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
First create your column and assign it e.g. 0 for False:
df['col_A_0_col_B_1'] = 0
Then, using loc, you can filter where col_A == 0 and col_B == 1 and assign 1 to the new column:
df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
If I understood correctly, you could do something like this:
import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)
Output
col_A col_B col_A_0_col_B_1
0 0 1 1
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 0
5 1 0 0
6 1 1 0
Or as alternative:
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)
You can use pandas ~ for boolean not, relying on 1 and 0 acting as true and false:
df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
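One caveat worth adding: on integer 0/1 columns, ~ and & are bitwise operators (~0 is -1, ~1 is -2), which happens to produce the right 0/1 result here but is easy to get wrong. A safer sketch casts to bool first:
import pandas as pd

df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# cast to bool so ~ and & are logical operations, then back to 0/1
df['col_A_0_col_B_1'] = (~df['col_A'].astype(bool)
                         & df['col_B'].astype(bool)).astype(int)
print(df['col_A_0_col_B_1'].tolist())  # [1, 0, 1, 1, 0, 0, 0]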
I have a Pandas dataframe that looks like the following:
Column_X Column_Y A-Indicator
Val1 A True
Val1 B True
Val2 B False
Val2 B False
I want to create the "A-Indicator" column. It should be True for all rows with Column_X = 'Val1' because at least one 'Val1' row has Column_Y = 'A'. Since no row with Column_X = 'Val2' has Column_Y = 'A', the indicator is False for all those rows. Is there an easy way to achieve this?
If performance is important, don't use groupby:
df['A-Indicator'] = df['Column_X'].isin(df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique())
print (df)
Column_X Column_Y A-Indicator
0 Val1 A True
1 Val1 B True
2 Val2 B False
3 Val2 B False
Explanation:
First compare by eq (==):
print (df['Column_Y'].eq('A'))
0 True
1 False
2 False
3 False
Name: Column_Y, dtype: bool
Select the values of Column_X for those matching rows:
print (df.loc[df['Column_Y'].eq('A'), 'Column_X'])
0 Val1
Name: Column_X, dtype: object
Get unique values for better performance:
print (df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique())
['Val1']
And finally compare with isin:
print (df['Column_X'].isin(df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique()))
0 True
1 True
2 False
3 False
Name: Column_X, dtype: bool
Performance: it depends on the number of rows and the number of matched values:
np.random.seed(123)
N = 1000000
L = list('ABCDEFGHIJK')
df = pd.DataFrame({
    'Column_X': np.random.randint(1000, size=N),
    'Column_Y': np.random.choice(L, N),
})
print (df)
In [193]: %timeit df['A-Indicator'] = df['Column_X'].isin(df.loc[df['Column_Y'].eq('A'), 'Column_X'].unique())
92.1 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [194]: %timeit df['A-Indicator']=df.groupby('Column_X')['Column_Y'].transform(lambda x: x.isin(['A']).any())
724 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [195]: %timeit df['A-Indicator']=df.groupby('Column_X')['Column_Y'].transform(lambda x: 'A' in x.unique())
770 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Example:
import pandas as pd
arr = pd.Series(['a',['a','b'],'c'])
I would like to get the indices where the series elements contain 'a'. So I would like to get back indices 0 and 1.
I've tried writing
arr.str.contains('a')
but this returns
0 True
1 NaN
2 False
dtype: object
while I'd like it to return
0 True
1 True
2 False
dtype: object
Use Series.str.join() to concatenate lists/arrays in cells into a single string, and then use .str.contains('a'). (A plain string is itself iterable, so sep.join('abc') gives 'a~b~c' and the contains check still works on string cells.)
In [78]: arr.str.join(sep='~').str.contains('a')
Out[78]:
0 True
1 True
2 False
dtype: bool
Use Series.apply and Python's in keyword, which works on both lists and strings:
arr.apply(lambda x: 'a' in x)
This will work fine if you don't have any NaN values in your Series, but if you do, you can use:
import numpy as np
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
This is much faster than using Series.str.
Benchmarks:
%%timeit
arr.str.join(sep='~').str.contains('a')
Takes: 249 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x)
Takes: 70.1 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
Takes: 69 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)