How to efficiently partial argsort Pandas dataframe across columns - python

I would like to replace values with column labels according to the largest 3 values for each row. Let's assume this input:
p1 p2 p3 p4
0 0 9 1 4
1 0 2 3 4
2 1 3 10 7
3 1 5 3 1
4 2 3 7 10
Given n = 3, I am looking for:
Top1 Top2 Top3
0 p2 p4 p3
1 p4 p3 p2
2 p3 p4 p2
3 p2 p3 p1
4 p4 p3 p2
I'm not concerned about duplicates, e.g. for index 3, Top3 can be 'p1' or 'p4'.
Attempt 1
My first attempt is a full sort using np.ndarray.argsort:
res = pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]
But in reality I have more than 4 columns and this will be inefficient.
Attempt 2
Next I tried np.argpartition. But since values within each partition are not sorted, this required a subsequent sort:
n = 3
parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
res = pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]],
                   columns=[f'Top{i}' for i in range(1, n+1)])
This, in fact, works out slower than the first attempt for larger dataframes. Is there a more efficient way which takes advantage of partial sorting? You can use the below code for benchmarking purposes.
Benchmarking
# Python 3.6.0, NumPy 1.11.3, Pandas 0.19.2
import pandas as pd, numpy as np
df = pd.DataFrame({'p1': [0, 0, 1, 1, 2],
                   'p2': [9, 2, 3, 5, 3],
                   'p3': [1, 3, 10, 3, 7],
                   'p4': [4, 4, 7, 1, 10]})
def full_sort(df):
    return pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]
def partial_sort(df):
    n = 3
    parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
    args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
    return pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]])
df = pd.concat([df]*10**5)
%timeit full_sort(df) # 86.3 ms per loop
%timeit partial_sort(df) # 158 ms per loop

With a decent number of columns, we can use np.argpartition with some slicing and indexing, like so -
def topN_perrow_colsindexed(df, N):
    # Extract array data
    a = df.values

    # Get top N indices per row with not necessarily sorted order
    idxtopNpart = np.argpartition(a,-N,axis=1)[:,-1:-N-1:-1]

    # Index into input data with those and use argsort to force sorted order
    sidx = np.take_along_axis(a,idxtopNpart,axis=1).argsort(1)
    idxtopN = np.take_along_axis(idxtopNpart,sidx[:,::-1],axis=1)

    # Index into column values with those for final output
    c = df.columns.values
    return pd.DataFrame(c[idxtopN], columns=['Top'+str(i+1) for i in range(N)])
Sample run -
In [65]: df
Out[65]:
p1 p2 p3 p4
0 0 9 1 4
1 0 2 3 4
2 1 3 10 7
3 1 5 3 1
4 2 3 7 10
In [66]: topN_perrow_colsindexed(df, N=3)
Out[66]:
Top1 Top2 Top3
0 p2 p4 p3
1 p4 p3 p2
2 p3 p4 p2
3 p2 p3 p4
4 p4 p3 p2
Timings -
In [143]: np.random.seed(0)
In [144]: df = pd.DataFrame(np.random.rand(10000,30))
In [145]: %timeit full_sort(df)
...: %timeit partial_sort(df)
...: %timeit topN_perrow_colsindexed(df,N=3)
100 loops, best of 3: 7.96 ms per loop
100 loops, best of 3: 13.9 ms per loop
100 loops, best of 3: 5.47 ms per loop
In [146]: df = pd.DataFrame(np.random.rand(10000,100))
In [147]: %timeit full_sort(df)
...: %timeit partial_sort(df)
...: %timeit topN_perrow_colsindexed(df,N=3)
10 loops, best of 3: 34 ms per loop
10 loops, best of 3: 56.1 ms per loop
100 loops, best of 3: 13.6 ms per loop
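For intuition on why the follow-up argsort step is needed at all: np.argpartition only guarantees which indices land in the top-N slots, not their order within those slots. A minimal standalone sketch (not part of the benchmarks above) on row 0 of the sample df:
import numpy as np

a = np.array([[0, 9, 1, 4]])                         # row 0 of the sample df
unordered = np.argpartition(a, -3, axis=1)[:, -3:]   # indices 1, 3, 2 in some arbitrary order
order = np.take_along_axis(a, unordered, axis=1).argsort(1)[:, ::-1]
print(np.take_along_axis(unordered, order, axis=1))  # [[1 3 2]] -> p2, p4, p3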

Related

Python pandas dataframe get all combinations of column values?

I have a pandas dataframe which looks like this:
colour points
0 red 1
1 yellow 10
2 black -3
Then I'm trying to do the following algorithm:
combos = []
points = []
for i1 in range(len(df)):
    for i2 in range(len(df)):
        colour_main = df['colour'].values[i1]
        colour_secondary = df['colour'].values[i2]
        combo = colour_main + "_" + colour_secondary
        point1 = df['points'].values[i1]
        point2 = df['points'].values[i2]
        new_points = point1 + point2
        combos.append(combo)
        points.append(new_points)
df_new = pd.DataFrame({'colours': combos,
                       'points': points})
print(df_new)
I want to get all combinations and sum the points:
if the colour is used as the main colour, I want to add its value
if the colour is used as the secondary colour, I want to add the opposite (negated) value
Example:
red_yellow = 1 + (-10) = -9
red_black = 1 + ( +3) = 4
black_red = -3 + ( -1) = -4
The output I currently get:
colours points
0 red_red 2
1 red_yellow 11
2 red_black -2
3 yellow_red 11
4 yellow_yellow 20
5 yellow_black 7
6 black_red -2
7 black_yellow 7
8 black_black -6
The output I'm looking for:
red_yellow -9
red_black 4
yellow_red 9
yellow_black 13
black_red -4
black_yellow -13
I don't know how to apply my logic to this code. Also, I bet there is a simpler way to get all combinations without doing two loops, but currently that's the only thing that comes to my mind.
I would like to:
get the desired output
improve the performance for cases with around 20 input colours
remove duplicates like red_red
Here is a timeit comparison of a few alternatives.
| method | ms per loop |
|--------------------+-------------|
| alt2 | 2.36 |
| using_concat | 3.26 |
| using_double_merge | 22.4 |
| orig | 22.6 |
| alt | 45.8 |
The timeit results were generated using IPython:
In [138]: df = make_df(20)
In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop
import itertools
import numpy as np
import pandas as pd
def alt(df):
    df['const'] = 1
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[(result['colour_x'] != result['colour_y'])]
    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result

def alt2(df):
    points = np.add.outer(df['points'], -df['points'])
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    mask = color.labels[0] != color.labels[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points': points.ravel(), 'color': color})
    result = result.loc[mask]
    return result

def orig(df):
    combos = []
    points = []
    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary
                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2
                combos.append(combo)
                points.append(new_points)
    return pd.DataFrame({'color': combos, 'points': points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
             .reset_index()
             .drop(['colour', 'points'], 1)
             .merge(df.set_index('colour'), left_on='level_0', right_index=True)
             .merge(df.set_index('colour'), left_on='level_1', right_index=True))
    new['points_y'] *= -1
    new['sum'] = new.sum(axis=1)
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N),
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df
The main idea in alt2 is to use np.add.outer to construct an addition table
out of df['points']:
In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]:
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])
ravel is used to make the array 1-dimensional:
In [152]: points.ravel()
Out[152]: array([ 0, -9, 4, 9, 0, 13, -4, -13, 0])
and the color combinations are generated with pd.MultiIndex.from_product:
In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]:
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
'yellow_black', 'black_red', 'black_yellow', 'black_black'],
dtype='object')
A mask is generated to remove duplicates:
mask = color.labels[0] != color.labels[1]
and then the result is generated from these parts:
result = pd.DataFrame({'points':points.ravel(), 'color':color})
result = result.loc[mask]
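(A note for newer pandas: MultiIndex.labels was renamed to MultiIndex.codes in 0.24, so on a recent install the mask line of alt2 would be written as follows; the logic is unchanged.)
mask = color.codes[0] != color.codes[1]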
The idea behind alt is explained in my original answer, here.
This is a bit long-winded, but gets you the output you want:
new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
         .reset_index()
         .drop(['colour', 'points'], 1)
         .merge(df.set_index('colour'), left_on='level_0', right_index=True)
         .merge(df.set_index('colour'), left_on='level_1', right_index=True))
new['points_y'] *= -1
new['sum'] = new.sum(axis=1)
new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)
>>> new
  level_0 level_1  sum       colours
3  yellow     red    9    yellow_red
6   black     red   -4     black_red
1     red  yellow   -9    red_yellow
7   black  yellow  -13  black_yellow
2     red   black    4     red_black
5  yellow   black   13  yellow_black
import itertools

d = df.set_index('colour').to_dict()['points']
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2 = pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})
comb values
0 red_yellow -9
1 red_black 4
2 yellow_black 13
0 yellow_red 9
1 black_red -4
2 black_yellow -13
You have to change this line in your code
new_points = point1 + point2
to this
new_points = point1 - point2
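Putting that sign fix together with the duplicate removal the question asks for, here is a minimal sketch of the corrected loop; itertools.permutations skips the red_red style pairs automatically:
import itertools
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [1, 10, -3]})

rows = [{'colours': f'{main}_{second}', 'points': p_main - p_second}
        for (main, p_main), (second, p_second)
        in itertools.permutations(zip(df['colour'], df['points']), 2)]
print(pd.DataFrame(rows))  # red_yellow -9, red_black 4, ..., black_yellow -13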

How to find the maximum consecutive number for multiple columns?

I need to identify the highest number of consecutive values that meet a certain criterion for multiple columns.
If my df is:
A B C D E
26 24 21 23 24
26 23 22 15 23
24 19 17 11 15
27 22 28 24 24
26 27 30 23 11
26 26 29 27 29
I want to know the maximum consecutive times that numbers over 25 occur for each column. So the output would be:
A 3
B 2
C 3
D 1
E 1
Using the following code, I can obtain the outcome for one column at a time; is there a way to create a table as above rather than repeating for each column (I have over 40 columns in total).
df.A.isnull().astype(int).groupby(df.A.notnull().astype(int).cumsum()).sum().max()
Thanks in advance.
Is this what you want? A pandas approach (PS: never thought I could make it one line, LOL)
(df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max()
Out[320]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64
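To see what the one-liner is doing, here is a step-by-step sketch on column A of the sample df (26, 26, 24, 27, 26, 26):
cond = df['A'] > 25                            # True, True, False, True, True, True
runs = cond.astype(int).diff().ne(0).cumsum()  # new run id every time the condition flips
lengths = cond.groupby(runs).cumcount() + 1    # running position inside each run
print(lengths.where(cond).max())               # 3 -> longest run of values over 25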
One option using numpy to calculate the max consecutive:
def max_consecutive(arr):
    # calculate the indices where the condition changes
    split_indices = np.flatnonzero(np.ediff1d(arr.values, to_begin=1, to_end=1))
    # calculate the chunk length of consecutive values and pick every other value
    # based on the initial value
    try:
        max_size = np.diff(split_indices)[not arr.iat[0]::2].max()
    except ValueError:
        max_size = 0
    return max_size
df.gt(25).apply(max_consecutive)
#A 3
#B 2
#C 3
#D 1
#E 1
#dtype: int64
Timing compared with the other approach:
%timeit df.gt(25).apply(max_consecutive)
# 520 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
# 10.3 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here's one with NumPy -
# mask is 2D boolean array representing islands as True values per col
def max_island_len_cols(mask):
    m, n = mask.shape
    out = np.zeros(n, dtype=int)
    b = np.zeros((m+2, n), dtype=bool)
    b[1:-1] = mask
    for i in range(mask.shape[1]):
        idx = np.flatnonzero(b[1:, i] != b[:-1, i])
        if len(idx) > 0:
            out[i] = (idx[1::2] - idx[::2]).max()
    return out
output = pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Sample run -
In [690]: df
Out[690]:
A B C D E
0 26 24 21 23 24
1 26 23 22 15 23
2 24 19 17 11 15
3 27 22 28 24 24
4 26 27 30 23 11
5 26 26 29 27 29
In [690]:
In [691]: pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Out[691]:
A 3
B 2
C 3
D 1
E 1
dtype: int64
Runtime test
Inspired by the given sample that has numbers in the range (24,28) and 40 cols, let's set up a bigger input dataframe and test out all the solutions -
# Input dataframe
In [692]: df = pd.DataFrame(np.random.randint(24,28,(1000,40)))
# Proposed in this post
In [693]: %timeit pd.Series(max_island_len_cols(df.values>25), index=df.columns)
1000 loops, best of 3: 539 µs per loop
# @Psidom's solution
In [694]: %timeit df.gt(25).apply(max_consecutive)
1000 loops, best of 3: 1.81 ms per loop
# @Wen's solution
In [695]: %timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
10 loops, best of 3: 95.2 ms per loop
An approach using pandas and scipy.ndimage.label, for fun.
import pandas as pd
from scipy.ndimage import label
struct = [[0, 1, 0],  # Structure used for segmentation
          [0, 1, 0],  # Equivalent to axis=0 in `numpy`
          [0, 1, 0]]  # Or 'columns' in `pandas`

labels, nlabels = label(df > 25, structure=struct)

>>> labels  # Labels for each column-wise block of consecutive numbers > 25
Out[]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 3, 0, 0],
       [2, 4, 3, 0, 0],
       [2, 4, 3, 5, 6]])
labels_df = pd.DataFrame(columns=df.columns, data=labels)  # Add original column names

res = (labels_df.apply(lambda x: x.value_counts())  # Execute `value_counts` on each column
                .iloc[1:]                           # slice results for labels > 0
                .max())                             # and get max value
>>> res
Out[]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64

Removing usernames from a dataframe that do not appear a certain number of times?

I am trying to understand the code provided below (which I found online but do not fully understand). I want to essentially remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves this? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check filtration in the pandas groupby documentation.
Faster solution in bigger DataFrame is with transform and boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)  # integer code per row, plus the unique names
    c = np.bincount(f)          # how many times each unique name occurs
    m = c[f] > k                # row-wise mask: name occurs more than k times
    return df[m]
pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
# @jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop
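For completeness, a value_counts based sketch (using the question's threshold from the filter, i.e. keep names appearing more than 4 times); it is easy to read, though not tuned for speed:
counts = df['userName'].value_counts()
df[df['userName'].isin(counts[counts > 4].index)]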

Marking all groups in DataFrame smaller than N

I'm trying to mark (in ok) all groups in a pandas DataFrame which are smaller than 'N'. I have a working solution but it's slow, is there a way to speed this up?
import pandas as pd
df = pd.DataFrame([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 3],
    [2, 3, 1],
    [2, 3, 2],
    [4, 5, 1],
    [4, 5, 2],
    [4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
df['ok'] = True
c = df.groupby(keys)['ok'].count()
for vals in c[c < N].index:
    local_dict = dict(zip(keys, vals))
    query = ' & '.join(f'{key}==@{key}' for key in keys)
    idx = df.query(query, local_dict=local_dict).index
    df.loc[idx, 'ok'] = False
print(df)
Instead of using groupby/count, use groupby/transform/count to form a Series which is the same length as the original DataFrame df:
c = df.groupby(keys)['z'].transform('count')
Then you can form a boolean mask which has the same length as df:
In [35]: c<N
Out[35]:
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
Name: ok, dtype: bool
Assignment to ok goes much more smoothly now, without a loop, querying or sub-indexing:
df['ok'] = c >= N
import pandas as pd
df = pd.DataFrame([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 3],
    [2, 3, 1],
    [2, 3, 2],
    [4, 5, 1],
    [4, 5, 2],
    [4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
print(df)
yields
x y z ok
0 1 2 1 True
1 1 2 2 True
2 1 2 3 True
3 2 3 1 False
4 2 3 2 False
5 4 5 1 True
6 4 5 2 True
7 4 5 3 True
Since the builtin groupby/transform methods (such as transform('count')) are
Cythonized, they are in general faster than calling groupby/transform
with a custom lambda function.
Thus, computing the ok column in two steps using
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
is faster than
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
In addition, vectorized operations over an entire column (such as c >= N) are
faster than multiple operations over subgroups. transform(lambda x: x.size >=
N) performs the comparison x.size >= N once for each group. If there are
many groups, then computing c >= N once yields an improvement in performance.
For example, with this 1000-row DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2017)
df = pd.DataFrame(np.random.randint(10, size=(1000, 3)), columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
using transform('count') is about 12x faster:
In [37]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.69 ms per loop
In [38]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
1 loop, best of 3: 20.2 ms per loop
In [39]: 20.2/1.69
Out[39]: 11.95266272189349
In the example above there were 100 groups:
In [47]: df.groupby(keys).ngroups
Out[47]: 100
The speed advantage of using transform('count') increases as the number of
groups increase. For example, with 955 groups:
In [48]: np.random.seed(2017); df = pd.DataFrame(np.random.randint(100, size=(1000, 3)), columns=['x', 'y', 'z'])
In [51]: df.groupby(keys).ngroups
Out[51]: 955
the transform('count') method performs about 92x faster:
In [49]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.88 ms per loop
In [50]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
10 loops, best of 3: 173 ms per loop
In [52]: 173/1.88
Out[52]: 92.02127659574468
Input variables:
keys = ['x','y']
N = 3
Calculate ok or not with groupby, transform and size:
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
Output:
x y z ok
0 1 2 1 True
1 1 2 2 True
2 1 2 3 True
3 2 3 1 False
4 2 3 2 False
5 4 5 1 True
6 4 5 2 True
7 4 5 3 True
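A closely related variant, sketched under the same setup: transform('size') counts every row in the group, while transform('count') skips NaNs in the chosen column, so 'size' is slightly more robust if z can contain missing values:
c = df.groupby(keys)['z'].transform('size')
df['ok'] = c >= N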

how to compute a new column based on the values of other columns in pandas - python

Let's say my data frame contains these data:
>>> df = pd.DataFrame({'a':['l1','l2','l1','l2','l1','l2'],
...                    'b':['1','2','2','1','2','2']})
>>> df
a b
0 l1 1
1 l2 2
2 l1 2
3 l2 1
4 l1 2
5 l2 2
l1 should correspond to 1 whereas l2 should correspond to 2.
I'd like to create a new column 'c' such that, for each row, c = 1 if a = l1 and b = 1 (or a = l2 and b = 2). If a = l1 and b = 2 (or a = l2 and b = 1) then c = 0.
The resulting data frame should look like this:
a b c
0 l1 1 1
1 l2 2 1
2 l1 2 0
3 l2 1 0
4 l1 2 0
5 l2 2 1
My data frame is very large so I'm really looking for the most efficient way to do this using pandas.
df = pd.DataFrame({'a': numpy.random.choice(['l1', 'l2'], 1000000),
                   'b': numpy.random.choice(['1', '2'], 1000000)})
A fast solution assuming only two distinct values:
%timeit df['c'] = ((df.a == 'l1') == (df.b == '1')).astype(int)
10 loops, best of 3: 178 ms per loop
#Viktor Kerkes:
%timeit df['c'] = (df.a.str[-1] == df.b).astype(int)
1 loops, best of 3: 412 ms per loop
#user1470788:
%timeit df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
1 loops, best of 3: 363 ms per loop
#herrfz
%timeit df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
1 loops, best of 3: 387 ms per loop
You can also use the string methods.
df['c'] = (df.a.str[-1] == df.b).astype(int)
df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
You can just use logical operators. I'm not sure why you're using strings of 1 and 2 rather than ints, but here's a solution. The astype at the end converts it from boolean to 0's and 1's.
df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
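Another vectorized sketch, assuming the same string dtypes as in the question: map 'l1'/'l2' onto the matching digit and compare against b:
df['c'] = (df['a'].map({'l1': '1', 'l2': '2'}) == df['b']).astype(int)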
