feature crossing in pandas - python

I have 2 columns in a pandas DataFrame:
col_A col_B
0 1
0 0
0 1
0 1
1 0
1 0
1 1
I want to create a new column for each value combination of col_A and col_B, similar to get_dummies(), except that here I am trying to use a combination of columns.
Example output - in this column the value of col_A is 0 and col_B is 1:
col_A_0_col_B_1
1
0
1
1
0
0
0
I am currently using iterrows() to iterate through every row, check the values, and then set the new column.
Is there a shorter, more idiomatic pandas approach to achieve this?

Convert chained boolean masks to integers:
df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
For better performance, compare the underlying NumPy arrays:
df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
Performance: depends on the number of rows and on the distribution of 0/1 values:
np.random.seed(343)
#10k rows
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)
In [92]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
...:
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [93]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
...:
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [94]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
...:
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: %%timeit
...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
...:
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [96]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
...:
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [97]: %%timeit
...: df['col_A_0_col_B_1'] = 0
...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
...:
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
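If you really need one indicator column per value combination (as the question suggests), a sketch of a generalization: build a crossed string key per row and one-hot encode it with get_dummies(). The naming scheme of the new columns is an assumption.
import pandas as pd

df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# Build one string key per row, e.g. "col_A_0_col_B_1", then one-hot encode it
key = 'col_A_' + df['col_A'].astype(str) + '_col_B_' + df['col_B'].astype(str)
df = df.join(pd.get_dummies(key).astype(int))
print(df)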

You can use np.where:
df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)

First create the column and assign it e.g. 0 for False:
df['col_A_0_col_B_1'] = 0
Then, using loc, you can filter where col_A == 0 and col_B == 1 and assign 1 to the new column:
df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1

If I understood correctly, you could do something like this:
import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)
Output
col_A col_B col_A_0_col_B_1
0 0 1 1
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 0
5 1 0 0
6 1 1 0
Or as an alternative:
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)

You can use pandas ~ for boolean NOT, together with 1 and 0 acting as True and False (this relies on the columns containing only 0 and 1):
df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
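Note that on integer columns ~ is a bitwise NOT, so a sketch of a more explicit variant casts to bool first:
# Cast to bool so ~ is a logical NOT rather than an integer bitwise NOT
df['col_A_0_col_B_1'] = (~df['col_A'].astype(bool) & df['col_B'].astype(bool)).astype(int)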

Related

Replacing cell values with column header value

I've got a dataframe:
a-line abstract ... zippered
0 0 ... 0
0 1 ... 0
0 0 ... 1
Where the value of a cell is 1, I need to replace it with the column header name.
df.dtypes returns Length: 1000, dtype: object
I have tried df.apply(lambda x: x.astype(object).replace(1, x.name))
but get TypeError: Invalid "to_replace" type: 'int'
Other attempts:
df.apply(lambda x: x.astype(object).replace(str(1), x.name))  # TypeError: Invalid "to_replace" type: 'str'
df.apply(lambda x: x.astype(str).replace(str(1), x.name))  # TypeError: Invalid "to_replace" type: 'str'
The key idea in all three solutions below is to loop through the columns. The first method uses replace:
for col in df:
    df[col] = df[col].replace(1, df[col].name)
Alternatively, per your attempt to apply a lambda:
for col in df_new:
    df_new[col] = df_new[col].astype(str).apply(lambda x: x.replace('1', df_new[col].name))
Finally, with np.where:
for col in df_new:
    df_new[col] = np.where(df_new[col] == 1, df_new[col].name, df_new[col])
Output for all three:
a-line abstract ... zippered
0 0 0 ... 0
1 0 abstract ... 0
2 0 0 ... zippered
You might consider building on this idea:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]],
                  columns=["a", "b", "c"])
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
UPDATE: Timing
@David Erickson's solution works well, but you can avoid the loop, in particular if you have many columns.
Generate data
import pandas as pd
import numpy as np
n = 1_000
columns = ["{:04d}".format(i) for i in range(n)]
df = pd.DataFrame(np.random.randint(0, high=2, size=(4, n)),
                  columns=columns)
# we test against the same dataframe
df_bk = df.copy()
David's solution #1
%%timeit -n10
for col in df:
    df[col] = df[col].replace(1, df[col].name)
1.01 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #2
%%timeit -n10
df = df_bk.copy()
for col in df:
    df[col] = df[col].astype(str).apply(lambda x: x.replace('1', df[col].name))
890 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #3
%%timeit -n10
for col in df:
    df[col] = np.where(df[col] == 1, df[col].name, df[col])
886 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Avoiding loops
%%timeit -n10
df = df_bk.copy()
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
455 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
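One caveat with the loop-free version: np.where returns a single object-dtype array, so the rebuilt frame loses the original integer dtypes. A sketch of recovering them where possible (infer_objects() re-infers dtypes column by column):
out = pd.DataFrame(np.where(df == 1, df.columns, df),
                   index=df.index, columns=df.columns)
out = out.infer_objects()  # columns that stayed purely numeric go back to int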

Pandas: add column with progressive count of elements meeting a condition

Given the following dataframe df:
df = pd.DataFrame({'A':['Tony', 'Mike', 'Jen', 'Anna'], 'B': ['no', 'yes', 'no', 'yes']})
A B
0 Tony no
1 Mike yes
2 Jen no
3 Anna yes
I want to add another column that counts, progressively, the elements where df['B'] == 'yes':
A B C
0 Tony no 0
1 Mike yes 1
2 Jen no 0
3 Anna yes 2
How can I do this?
You can use numpy.where with the cumsum of a boolean mask:
m = df['B']=='yes'
df['C'] = np.where(m, m.cumsum(), 0)
Another solution is to count the boolean mask created by filtering, and then add the 0 values back with reindex:
m = df['B']=='yes'
df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
print (df)
A B C
0 Tony no 0
1 Mike yes 1
2 Jen no 0
3 Anna yes 2
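A compact arithmetic variant of the first solution (presumably similar in speed): multiplying the cumulative sum by the mask zeroes out the 'no' rows.
m = df['B'] == 'yes'
df['C'] = m.cumsum() * m  # running count where True, 0 where False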
Performance (results may differ on real data, so best to check there first):
np.random.seed(123)
N = 10000
L = ['yes','no']
df = pd.DataFrame({'B': np.random.choice(L, N)})
print (df)
In [150]: %%timeit
...: m = df['B']=='yes'
...: df['C'] = np.where(m, m.cumsum(), 0)
...:
1.57 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [151]: %%timeit
...: m = df['B']=='yes'
...: df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
...:
2.53 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %%timeit
...: df['C'] = df.groupby('B').cumcount() + 1
...: df['C'].where(df['B'] == 'yes', 0, inplace=True)
4.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use GroupBy + cumcount followed by pd.Series.where:
df['C'] = df.groupby('B').cumcount() + 1
df['C'].where(df['B'] == 'yes', 0, inplace=True)
print(df)
A B C
0 Tony no 0
1 Mike yes 1
2 Jen no 0
3 Anna yes 2
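Note that under copy-on-write (opt-in since pandas 2.0, scheduled as the default in 3.0), inplace=True on a selected column may not write back to df; a sketch of the plain-assignment form of the same answer:
df['C'] = (df.groupby('B').cumcount() + 1).where(df['B'] == 'yes', 0)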

How to map new variable in pandas in effective way

Here's my data
Id Amount
1 6
2 2
3 0
4 6
What I need is a mapping: if Amount is 3 or more, Map is 1; if Amount is less than 3, Map is 0.
Id Amount Map
1 6 1
2 2 0
3 0 0
4 6 1
What I did
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Amount'].fillna(0)
It works, but it is neither very configurable nor efficient.
Convert boolean mask to integer:
#for better performance convert to numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
#pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
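If the mapping later needs more than one threshold, np.select keeps it configurable; a sketch (the single condition mirrors the answer above, any further bins would be assumptions):
import numpy as np

conditions = [df['Amount'] >= 3]  # add more conditions as the rules grow
choices = [1]                     # value mapped for each condition
df['Map'] = np.select(conditions, choices, default=0)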

Python: Assign value to a new column in Pandas as list using other columns

I have below pandas dataframe:
Name1 Name2 Score1 Score2
Bruce Jacob 3 4
Aida Stephan 0 1
I want to create a new column "list_score" in the dataframe, which is a list of Score1 and Score2.
Expected result:
Name1 Name2 Score1 Score2 list_score
Bruce Jacob 3 4 [3,4]
Aida Stephan 0 1 [0,1]
Use zip and convert the tuples to lists:
df['list_score'] = [list(x) for x in zip(df['Score1'], df['Score2'])]
Or:
df['list_score'] = list(map(list, zip(df['Score1'], df['Score2'])))
print (df)
Name1 Name2 Score1 Score2 list_score
0 Bruce Jacob 3 4 [3, 4]
1 Aida Stephan 0 1 [0, 1]
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [105]: %timeit df['list_score'] = [list(x) for x in zip(df['Score1'], df['Score2'])]
851 µs ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [106]: %timeit df['list_score'] = list(map(list, zip(df['Score1'], df['Score2'])))
745 µs ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %timeit df['list_score'] = df[['Score1', 'Score2']].apply(tuple, axis=1).apply(list)
35.5 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [108]: %timeit df['list_score'] = df[['Score1', 'Score2']].values.tolist()
949 µs ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This was the setup used to generate a perfplot benchmark of the four approaches:
import numpy as np
import pandas as pd
import perfplot

def list_comp(df):
    df['list_score'] = [list(x) for x in zip(df['Score1'], df['Score2'])]
    return df

def map_list(df):
    df['list_score'] = list(map(list, zip(df['Score1'], df['Score2'])))
    return df

def apply(df):
    df['list_score'] = df[['Score1', 'Score2']].apply(tuple, axis=1).apply(list)
    return df

def values(df):
    df['list_score'] = df[['Score1', 'Score2']].values.tolist()
    return df

def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=(n, 2)), columns=['Score1', 'Score2'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_comp, map_list, apply, values],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')
df['list_score'] = df[['Score1', 'Score2']].values.tolist()
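In pandas 0.24+, to_numpy() is the recommended spelling of .values, so presumably the equivalent is:
df['list_score'] = df[['Score1', 'Score2']].to_numpy().tolist()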
One way is to use pd.DataFrame.apply to convert each row to a tuple and then to a list. If a tuple is sufficient, the second apply may be omitted.
df['list_score'] = df[['Score1', 'Score2']].apply(tuple, axis=1).apply(list)
print(df)
Name1 Name2 Score1 Score2 list_score
0 Bruce Jacob 3 4 [3, 4]
1 Aida Stephan 0 1 [0, 1]
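A shorter variant of the same row-wise idea (still apply-based, so presumably comparably slow) passes list directly:
df['list_score'] = df[['Score1', 'Score2']].apply(list, axis=1)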

Pandas Dataframe Find Rows Where all Columns Equal

I have a dataframe that has characters in it - I want a boolean result by row that tells me if all columns for that row have the same value.
For example, I have
df = [ a b c d
0 'C' 'C' 'C' 'C'
1 'C' 'C' 'A' 'A'
2 'A' 'A' 'A' 'A' ]
and I want the result to be
0 True
1 False
2 True
I've tried .all, but it seems I can only check if all are equal to one letter. The only other way I can think of doing it is by doing a unique on each row and seeing if that equals 1? Thanks in advance.
I think the cleanest way is to check all columns against the first column using eq:
In [11]: df
Out[11]:
a b c d
0 C C C C
1 C C A A
2 A A A A
In [12]: df.iloc[:, 0]
Out[12]:
0 C
1 C
2 A
Name: a, dtype: object
In [13]: df.eq(df.iloc[:, 0], axis=0)
Out[13]:
a b c d
0 True True True True
1 True True False False
2 True True True True
Now you can use all (if they are all equal to the first item, they are all equal):
In [14]: df.eq(df.iloc[:, 0], axis=0).all(1)
Out[14]:
0 True
1 False
2 True
dtype: bool
The same solution in NumPy, for better performance: compare the array against its first column and check whether all values in each row are True:
a = df.values
b = (a == a[:, [0]]).all(axis=1)
print (b)
[ True False True]
And if you need a Series:
s = pd.Series(b, index=df.index)
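In newer pandas, to_numpy() is preferred over .values; presumably the same check reads:
a = df.to_numpy()
s = pd.Series((a == a[:, [0]]).all(axis=1), index=df.index)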
Comparing solutions:
data = [[10,10,10],[12,12,12],[10,12,10]]
df = pd.DataFrame(data,columns=['Col1','Col2','Col3'])
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#jez - numpy array
In [14]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
141 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#jez - Series
In [15]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
...: pd.Series(b, index=df.index)
169 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Andy Hayden
In [16]: %%timeit
...: df.eq(df.iloc[:, 0], axis=0).all(axis=1)
2.22 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Wen1
In [17]: %%timeit
...: list(map(lambda x : len(set(x))==1,df.values))
56.8 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#K.-Michael Aye
In [18]: %%timeit
...: df.apply(lambda x: len(set(x)) == 1, axis=1)
686 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Wen2
In [19]: %%timeit
...: df.nunique(1).eq(1)
2.87 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
nunique: new in version 0.20.0. (Based on the timing benchmark from jez: if performance is not important, you can use this one.)
df.nunique(axis=1).eq(1)
Out[308]:
0 True
1 False
2 True
dtype: bool
Or you can use map with set:
list(map(lambda x : len(set(x))==1,df.values))
df = pd.DataFrame.from_dict({'a': 'C C A'.split(),
                             'b': 'C C A'.split(),
                             'c': 'C A A'.split(),
                             'd': 'C A A'.split()})
df.apply(lambda x: len(set(x)) == 1, axis=1)
0 True
1 False
2 True
dtype: bool
Explanation: set(x) has only one element if all elements of the row are the same. The axis=1 option applies the function over rows instead of columns.
You can use nunique(axis=1), so the results (added to a new column) can be obtained by:
df['unique'] = df.nunique(axis=1) == 1
The answer by @yo-and-ben-w uses eq(1), but I think == 1 is easier to read.
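A minimal end-to-end sketch of this approach with the question's data:
import pandas as pd

df = pd.DataFrame({'a': list('CCA'), 'b': list('CCA'),
                   'c': list('CAA'), 'd': list('CAA')})
df['unique'] = df.nunique(axis=1) == 1
print(df)
#    a  b  c  d  unique
# 0  C  C  C  C    True
# 1  C  C  A  A   False
# 2  A  A  A  A    True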
