Creating a Column in a Dataframe Based on a Conditional - python

I have a data frame
pd.DataFrame({"A":[0,1,0,1],
"B":[-1,0,0,0],
"C":[0,0,0,0]},
index = [.1,.2,.3, .4])
The way I first logically approached the problem:
for index, row in df.iterrows():
    if df['A'] == 1:
        df['C'] == 1
    elif df['B'] == -1:
        df['C'] == -1
    else:
        df['C'] == 0
I want:
pd.DataFrame({"A": [0, 1, 0, 1],
              "B": [-1, 0, 0, 0],
              "C": [-1, 1, 0, 1]},
             index=[.1, .2, .3, .4])
After trying the first method I tried a variety of methods proposed in other questions, but none seem to fit my problem.

You could use nested np.where calls:
import numpy as np
df.C = np.where(df.A == 1, 1, np.where(df.B == -1, -1, 0))
df
       A  B  C
0.1    0 -1 -1
0.2    1  0  1
0.3    0  0  0
0.4    1  0  1
Performance
df = pd.concat([df] * 100000)
%timeit np.select([df.A == 1, df.B == -1], [1, -1])
100 loops, best of 3: 5.25 ms per loop
%timeit np.where(df.A == 1, 1, np.where(df.B == -1, -1, 0))
100 loops, best of 3: 2.86 ms per loop

Use numpy.select:
import numpy as np
df['C'] = np.select([df.A == 1, df.B == -1], [1, -1])
df
#      A  B  C
# 0.1  0 -1 -1
# 0.2  1  0  1
# 0.3  0  0  0
# 0.4  1  0  1


Python pandas dataframe get all combinations of column values?

I have a pandas dataframe which looks like this:
colour points
0 red 1
1 yellow 10
2 black -3
Then I'm trying to do the following algorithm:
combos = []
points = []
for i1 in range(len(df)):
    for i2 in range(len(df)):
        colour_main = df['colour'].values[i1]
        colour_secondary = df['colour'].values[i2]
        combo = colour_main + "_" + colour_secondary
        point1 = df['points'].values[i1]
        point2 = df['points'].values[i2]
        new_points = point1 + point2
        combos.append(combo)
        points.append(new_points)
df_new = pd.DataFrame({'colours': combos,
                       'points': points})
print(df_new)
I want to get all combinations and sum the points:
if the colour is used as main, add its value
if the colour is used as secondary, add the opposite of its value
Example:
red_yellow = 1 + (-10) = -9
red_black = 1 + (+3) = 4
black_red = -3 + (-1) = -4
The output I currently get:
colours points
0 red_red 2
1 red_yellow 11
2 red_black -2
3 yellow_red 11
4 yellow_yellow 20
5 yellow_black 7
6 black_red -2
7 black_yellow 7
8 black_black -6
The output I'm looking for:
red_yellow -9
red_black 4
yellow_red 9
yellow_black 13
black_red -4
black_yellow -13
I don't know how to apply my logic to this code. I also suspect there is a simpler way to get all the combinations without two nested loops, but currently that's the only approach that comes to mind.
I would like to:
get the desired output
improve performance for larger inputs (say, 20 colours)
remove duplicates like red_red
Here is a timeit comparison of a few alternatives.
| method | ms per loop |
|--------------------+-------------|
| alt2 | 2.36 |
| using_concat | 3.26 |
| using_double_merge | 22.4 |
| orig | 22.6 |
| alt | 45.8 |
The timeit results were generated using IPython:
In [138]: df = make_df(20)
In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop
import itertools
import numpy as np
import pandas as pd
def alt(df):
    df['const'] = 1
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[(result['colour_x'] != result['colour_y'])]
    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result
def alt2(df):
    points = np.add.outer(df['points'], -df['points'])
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    mask = color.labels[0] != color.labels[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points': points.ravel(), 'color': color})
    result = result.loc[mask]
    return result
def orig(df):
    combos = []
    points = []
    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary
                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2
                combos.append(combo)
                points.append(new_points)
    return pd.DataFrame({'color': combos, 'points': points})
def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
    return df2
def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
             .reset_index()
             .drop(['colour', 'points'], 1)
             .merge(df.set_index('colour'), left_on='level_0', right_index=True)
             .merge(df.set_index('colour'), left_on='level_1', right_index=True))
    new['points_y'] *= -1
    new['sum'] = new.sum(axis=1)
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)
    return new[['colours', 'sum']]
def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N),
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df
The main idea in alt2 is to use np.add.outer to construct an addition table
out of df['points']:
In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]:
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])
ravel is used to make the array 1-dimensional:
In [152]: points.ravel()
Out[152]: array([ 0, -9, 4, 9, 0, 13, -4, -13, 0])
and the color combinations are generated with pd.MultiIndex.from_product:
In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]:
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
'yellow_black', 'black_red', 'black_yellow', 'black_black'],
dtype='object')
A mask is generated to remove duplicates:
mask = color.labels[0] != color.labels[1]
and then the result is generated from these parts:
result = pd.DataFrame({'points':points.ravel(), 'color':color})
result = result.loc[mask]
The idea behind alt is explained in my original answer, here.
This is a bit long-winded, but gets you the output you want:
new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
         .reset_index()
         .drop(['colour', 'points'], 1)
         .merge(df.set_index('colour'), left_on='level_0', right_index=True)
         .merge(df.set_index('colour'), left_on='level_1', right_index=True))
new['points_y'] *= -1
new['sum'] = new.sum(axis=1)
new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)
>>> new
  level_0 level_1  sum       colours
3  yellow     red    9    yellow_red
6   black     red   -4     black_red
1     red  yellow   -9    red_yellow
7   black  yellow  -13  black_yellow
2     red   black    4     red_black
5  yellow   black   13  yellow_black
import itertools
d = df.set_index('colour').to_dict()['points']
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
comb values
0 red_yellow -9
1 red_black 4
2 yellow_black 13
0 yellow_red 9
1 black_red -4
2 black_yellow -13
You have to change this line in your code
new_points = point1 + point2
to this
new_points = point1 - point2

Create columns for every category in Pandas DataFrame

I have a data frame with many binary columns indicating the presence of a category in the observation. Each observation has exactly 3 categories with a value of 1, the rest 0. I want to create 3 new columns, one for each category, where the value is instead the name of the category (i.e. the name of the binary column) where it equals 1. To make it clearer:
I have:
x|y|z|k|w
---------
0|1|1|0|1
To be:
cat1|cat2|cat3
--------------
y |z |w
Can I do this?
For better performance use a numpy solution:
print (df)
x y z k w
0 0 1 1 0 1
1 1 1 0 0 1
c = df.columns.values
df = pd.DataFrame(c[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat')
print (df)
cat0 cat1 cat2
0 y z w
1 x y w
Details:
#get indices of 1s
print (np.where(df))
(array([0, 0, 0, 1, 1, 1], dtype=int64), array([1, 2, 4, 0, 1, 4], dtype=int64))
#select second array
print (np.where(df)[1])
[1 2 4 0 1 4]
#reshape to 3 columns
print (np.where(df)[1].reshape(-1, 3))
[[1 2 4]
[0 1 4]]
#indexing
print (c[np.where(df)[1].reshape(-1, 3)])
[['y' 'z' 'w']
['x' 'y' 'w']]
Timings:
df = pd.concat([df] * 1000, ignore_index=True)
#jezrael solution
In [390]: %timeit (pd.DataFrame(df.columns.values[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat'))
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 503 µs per loop
#jpp solution
In [391]: %timeit (pd.DataFrame(df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()))
10 loops, best of 3: 111 ms per loop
#Zero's solution works only with a one-row DataFrame, so it is not included
Here is one way:
import pandas as pd
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 1], 'w': [1, 1]})
split = df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()
df2 = pd.DataFrame(split)
# 0 1 2 3
# 0 w y z None
# 1 k w x y
You could do:
In [13]: pd.DataFrame([df.columns[df.astype(bool).values[0]]]).add_prefix('cat')
Out[13]:
cat0 cat1 cat2
0 y z w

Marking all groups in DataFrame smaller than N

I'm trying to mark (in ok) all groups in a pandas DataFrame which are smaller than N. I have a working solution, but it's slow; is there a way to speed this up?
import pandas as pd
df = pd.DataFrame([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 3],
    [2, 3, 1],
    [2, 3, 2],
    [4, 5, 1],
    [4, 5, 2],
    [4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
df['ok'] = True
c = df.groupby(keys)['ok'].count()
for vals in c[c < N].index:
    local_dict = dict(zip(keys, vals))
    query = ' & '.join(f'{key} == @{key}' for key in keys)
    idx = df.query(query, local_dict=local_dict).index
    df.loc[idx, 'ok'] = False
print(df)
Instead of using groupby/count, use groupby/transform/count to form a Series which is the same length as the original DataFrame df:
c = df.groupby(keys)['z'].transform('count')
Then you can form a boolean mask which has the same length as df:
In [35]: c<N
Out[35]:
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
Name: z, dtype: bool
Assignment to ok goes much more smoothly now, without a loop, querying or sub-indexing:
df['ok'] = c >= N
import pandas as pd
df = pd.DataFrame([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 3],
    [2, 3, 1],
    [2, 3, 2],
    [4, 5, 1],
    [4, 5, 2],
    [4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
print(df)
yields
x y z ok
0 1 2 1 True
1 1 2 2 True
2 1 2 3 True
3 2 3 1 False
4 2 3 2 False
5 4 5 1 True
6 4 5 2 True
7 4 5 3 True
Since the builtin groupby/transform methods (such as transform('count')) are
Cythonized, they are in general faster than calling groupby/transform
with a custom lambda function.
Thus, computing the ok column in two steps using
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
is faster than
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
In addition, vectorized operations over an entire column (such as c >= N) are
faster than many small operations over subgroups. transform(lambda x: x.size >=
N) performs the comparison x.size >= N once for each group. If there are
many groups, computing c >= N once yields a performance improvement.
For example, with this 1000-row DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2017)
df = pd.DataFrame(np.random.randint(10, size=(1000, 3)), columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
using transform('count') is about 12x faster:
In [37]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.69 ms per loop
In [38]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
1 loop, best of 3: 20.2 ms per loop
In [39]: 20.2/1.69
Out[39]: 11.95266272189349
In the example above there were 100 groups:
In [47]: df.groupby(keys).ngroups
Out[47]: 100
The speed advantage of using transform('count') increases as the number of
groups increases. For example, with 955 groups:
In [48]: np.random.seed(2017); df = pd.DataFrame(np.random.randint(100, size=(1000, 3)), columns=['x', 'y', 'z'])
In [51]: df.groupby(keys).ngroups
Out[51]: 955
the transform('count') method performs about 92x faster:
In [49]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.88 ms per loop
In [50]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
10 loops, best of 3: 173 ms per loop
In [52]: 173/1.88
Out[52]: 92.02127659574468
Input variables:
keys = ['x','y']
N = 3
Compute the ok flag with groupby, transform and size:
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
Output:
x y z ok
0 1 2 1 True
1 1 2 2 True
2 1 2 3 True
3 2 3 1 False
4 2 3 2 False
5 4 5 1 True
6 4 5 2 True
7 4 5 3 True

vectorize conditional assignment in pandas dataframe

If I have a dataframe df with column x and want to create column y based on values of x using this in pseudo code:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where is the best way to do this but not sure how to code it correctly.
One simple method would be to assign the default value first and then perform 2 loc calls:
In [66]:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
x
0 0
1 -3
2 5
3 -1
4 1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
If you wanted to use np.where then you could do it with a nested np.where:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
Here the outer np.where tests the first condition, x less than -2, and returns 1 where it holds; the inner np.where then tests whether x is greater than 2 and returns -1, otherwise 0.
timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where method is about twice as fast.
Use np.select for multiple conditions
np.select(condlist, choicelist, default=0)
Return elements in choicelist depending on the corresponding condition in condlist.
The default element is used when all conditions evaluate to False.
condlist = [
    df['x'] < -2,
    df['x'] > 2,
]
choicelist = [
    1,
    -1,
]
df['y'] = np.select(condlist, choicelist, default=0)
np.select is much more readable than a nested np.where but just as fast; for example, benchmarking on a large random frame:
n = 10**6  # illustrative sample size; n was left undefined in the original snippet
df = pd.DataFrame({'x': np.random.randint(-5, 5, size=n)})
This is a good use case for pd.cut where you define ranges and based on those ranges you can assign labels:
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
Set a fixed value in 'c2' wherever a condition on 'c1' is met:
df.loc[df['c1'] == 'Value', 'c2'] = 10
You can do it easily using the index and 2 loc calls:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
x
0 0
1 -3
2 5
3 -1
4 1
df['y'] = 0
idx_1 = df.loc[df['x'] < -2, 'y'].index
idx_2 = df.loc[df['x'] > 2, 'y'].index
df.loc[idx_1, 'y'] = 1
df.loc[idx_2, 'y'] = -1
df
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0

how to compute a new column based on the values of other columns in pandas - python

Let's say my data frame contains these data:
>>> df = pd.DataFrame({'a': ['l1','l2','l1','l2','l1','l2'],
...                    'b': ['1','2','2','1','2','2']})
>>> df
a b
0 l1 1
1 l2 2
2 l1 2
3 l2 1
4 l1 2
5 l2 2
l1 should correspond to 1 whereas l2 should correspond to 2.
I'd like to create a new column 'c' such that, for each row, c = 1 if a = l1 and b = 1 (or a = l2 and b = 2). If a = l1 and b = 2 (or a = l2 and b = 1) then c = 0.
The resulting data frame should look like this:
a b c
0 l1 1 1
1 l2 2 1
2 l1 2 0
3 l2 1 0
4 l1 2 0
5 l2 2 1
My data frame is very large so I'm really looking for the most efficient way to do this using pandas.
import numpy
df = pd.DataFrame({'a': numpy.random.choice(['l1', 'l2'], 1000000),
                   'b': numpy.random.choice(['1', '2'], 1000000)})
A fast solution assuming only two distinct values:
%timeit df['c'] = ((df.a == 'l1') == (df.b == '1')).astype(int)
10 loops, best of 3: 178 ms per loop
#Viktor Kerkes:
%timeit df['c'] = (df.a.str[-1] == df.b).astype(int)
1 loops, best of 3: 412 ms per loop
#user1470788:
%timeit df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
1 loops, best of 3: 363 ms per loop
#herrfz
%timeit df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
1 loops, best of 3: 387 ms per loop
You can also use the string methods.
df['c'] = (df.a.str[-1] == df.b).astype(int)
df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
You can just use logical operators. I'm not sure why you're using strings of 1 and 2 rather than ints, but here's a solution. The astype at the end converts the result from boolean to 0s and 1s.
df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
