vectorize conditional assignment in pandas dataframe - python

If I have a dataframe df with column x and want to create column y based on values of x using this in pseudo code:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where is the best way to do this but not sure how to code it correctly.

One simple method would be to assign the default value first and then perform 2 loc calls:
In [66]:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
x
0 0
1 -3
2 5
3 -1
4 1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
If you wanted to use np.where then you could do it with a nested np.where:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
So here we define the first condition as where x is less than -2, return 1, then we have another np.where which tests the other condition where x is greater than 2 and returns -1, otherwise return 0
timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where method is twice as fast

Use np.select for multiple conditions
np.select(condlist, choicelist, default=0)
Return elements in choicelist depending on the corresponding condition in condlist.
The default element is used when all conditions evaluate to False.
condlist = [
df['x'] < -2,
df['x'] > 2,
]
choicelist = [
1,
-1,
]
df['y'] = np.select(condlist, choicelist, default=0)
np.select is much more readable than a nested np.where but just as fast:
df = pd.DataFrame({'x': np.random.randint(-5, 5, size=n)})

This is a good use case for pd.cut where you define ranges and based on those ranges you can assign labels:
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0

set fixed value to 'c2' where the condition is met
df.loc[df['c1'] == 'Value', 'c2'] = 10

You can do it easily using the index and 2 loc calls:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
x
0 0
1 -3
2 5
3 -1
4 1
df['y'] = 0
idx_1 = df.loc[df['x'] < -2, 'y'].index
idx_2 = df.loc[df['x'] > 2, 'y'].index
df.loc[idx_1, 'y'] = 1
df.loc[idx_2, 'y'] = -1
df
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0

Related

Cumsum column while skipping rows or setting fixed values on a condition based on the result of the actual cumsum

I'm trying to find a vectorized solution in pandas that is quite common in spreadsheets which is to cumsum while skipping or setting fixed values on a condition based on the result of the actual cumsum. I have the following:
A
1 0
2 -1
3 2
4 3
5 -2
6 -3
7 1
8 -1
9 1
10 -2
11 1
12 2
13 -1
14 -2
What I need is to add a second column with the cumsum of 'A' and if one of these sums gives a positive value replace it with 0 and continue the cumsum using that 0. At the same time if the cumsum gives a negative value that is lower than the lowest value in column A recorded after a 0 in column B I will need to replace it with that lowest value in column A. I know this is quite a problem but is there a vectorized solution for this? Maybe using an auxiliary column. The result should look like this:
A B
1 0 0
2 -1 -1 # -1+0 = -1
3 2 0 # -1 + 2 = 1 but 1>0 so this is 0
4 3 0 # same as previous row
5 -2 -2 # -2+0 = -2
6 -3 -3 # -2-3 = -5 but the lowest value in column A since last 0 is -3 so this is replaced by -3
7 1 -2 # 1-3 = -2
8 -1 -3 # -1-2 = -3
9 1 -2 # -3 + 1 = -2
10 -2 -3 # -2-2 = -4 but the lowest value in column A since last 0 is -3 so this is replaced by -3
11 1 -2 # -3 +1 = -2
12 2 0 # -2+2 = 0
13 -1 -1 # 0-1 = -1
14 -2 -2 # -1-2 = -3 but the lowest value in column A since last cap is -2 so this is -2 instead of -3
For the moment I made this but does not work 100% and again is not really efficient:
df['B'] = 0
df['B'][0] = 0
for x in range(len(df)-1):
A = df['A'][x + 1]
B = df['B'][x] + A
if B >= 0:
df['B'][x+1] = 0
elif B < 0 and A < 0 and B < A:
df['B'][x+1] = A
else:
df['B'][x + 1] = B
Using df['A'].expanding(1).apply(function) I could run own function which first get only one row, next 2 rows, next 3 rows, etc. I doesn't give result from previous calculation and it needs to make all calculations again and again but it doesn't need global
variables and hardcoded df['A']
Doc: Series.expanding
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
def function(values):
#print(values)
#print(type(valuse)
#print(len(values))
result = 0
last_zero = 0
for index, value in enumerate(values):
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(values[last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
#print('result:', result)
return result
df['B'] = df['A'].expanding(1).apply(function)
df['B'] = df['B'].astype(int)
print(df)
Result:
A B
0 0 0
1 -1 -1
2 2 0
3 3 0
4 -2 -2
5 -3 -3
6 1 -2
7 -1 -3
8 1 -2
9 -2 -3
10 1 -2
11 2 0
12 -1 -1
13 -2 -2
The same but with normal apply() - it needs global variables and hardcoded df['A']
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
result = 0
last_zero = 0
index = 0
def function(value):
global result
global last_zero
global index
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
index += 1
#print('result:', result)
return result
df['B'] = df['A'].apply(function)
df['B'] = df['B'].astype(int)
print(df)
The same using normal for-loop
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
all_values = []
result = 0
last_zero = 0
for index, value in df['A'].iteritems():
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
all_values.append(result)
df['B'] = all_values
print(df)

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous and non-equal length rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and end idx per column of x
"""
0 1
0 0 3
1 2 3
2 1 3
3 0 1
"""
What I'm looking for is a way to effectively select different length slices of x based on the rows of y
x[y] = 0
"""
x afterwards:
array([[ 0, 1, 2, 0],
[ 0, 5, 0, 7],
[ 0, 0, 0, 11]])
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
0 1 2 3
0 0 1 2 0
1 0 5 0 7
2 0 0 0 11
As a further improvement, consider defining y as a NumPy array if possible.
I customized this answer to your problem:
y_t = y.values.transpose()
y_t[1,:] = y_t[1,:] - 1 # or remove this line and change '>= r' below to '> r`
r = np.arange(x.shape[0])
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
# 0 1 2 3
# 0 0 1 2 0
# 1 0 5 0 7
# 2 0 0 0 11

How to remove rows from DataFrame

I have a DataFrame with n rows and an ndarray with n values (-1 for outliers and 1 for inlier). Is there a pythonic way to remove DataFrame rows that match the indices of the elements of the nparray marked as -1?
You can just do: new_df = old_df[arr == 1].
Example:
df = pd.DataFrame(np.random.randn(5,5))
arr = np.random.choice([1,-1], 5)
>>> df
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317 0.342051 -0.787938 0.464699
2 -0.965481 0.727143 -0.887149 -0.430592 -2.074865
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
4 0.228538 0.799445 -0.217787 0.398572 -1.255639
>>> arr
array([ 1, -1, -1, 1, -1])
>>> df[arr == 1]
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973

Pandas convert positive number to 1 and negative number to -1

I have a column of positive and negative number. How to convert this column to a new column to realize convert positive number to 1 and negative number to -1?
You need numpy.sign
df['new'] = np.sign(df['col'])
Sample:
df = pd.DataFrame({ 'col':[-1,3,-5,7,1,0]})
df['new'] = np.sign(df['col'])
print (df)
col new
0 -1 -1
1 3 1
2 -5 -1
3 7 1
4 1 1
5 0 0
It's really easy to perform this task by -
For whole data frame -
df[df < 0] = -1
df[df > 0] = 1
For specific column -
df['column_name'][df['column_name'] < 0] = -1
df['column_name'][df['column_name'] > 0] = 1
df[df < 0] = -1
df[df > 0] = 1
no behaviour defined for df == 0

Creating a Column in a Dataframe Based on a Conditional

I have a data frame
pd.DataFrame({"A":[0,1,0,1],
"B":[-1,0,0,0],
"C":[0,0,0,0]},
index = [.1,.2,.3, .4])
The way I first logically approached the problem
for index, row in iterrows():
if df['A'] == 1:
df['C'] == 1
elif df['B'] == -1
df['C'] == -1
else:
df['C'] == 0
I want
pd.DataFrame({"A":[0,1,0,1],
"B":[-1,0,0,0],
"C":[-1,1,0,1]},
index = [.1,.2,.3, .4])
After trying the first method I tried a variety of methods proposed in other questions, but none seem to fit my problem.
You could use nested np.where calls:
df.C = np.where(df.A == 1, 1, np.where(df.B == -1, -1, 0))
df
A B C
0.1 0 -1 -1
0.2 1 0 1
0.3 0 0 0
0.4 1 0 1
Performance
df = pd.concat([df] * 100000)
%timeit np.select([df.A == 1, df.B == -1], [1, -1])
100 loops, best of 3: 5.25 ms per loop
%timeit np.where(df.A == 1, 1, np.where(df.B == -1, -1, 0))
100 loops, best of 3: 2.86 ms per loop
Use numpy.select:
df['C'] = pd.np.select([df.A == 1, df.B == -1], [1, -1])
df
# A B C
#0.1 0 -1 -1
#0.2 1 0 1
#0.3 0 0 0
#0.4 1 -1 1

Categories