I have a column with numbers (some of them are infinite).
What I must do is the following:
All numbers greater than 15 or less than -15 must be assigned the value 1; otherwise (between -15 and 15) they must be assigned the value 0.
I have tried with:
df['B'] = df['B'].mask((df['B'] > 15, 1) | (df['B'] < -15, 1))
df['B'] = df['B'].where(df['B'] == 1, 0)
But got:
TypeError: unsupported operand type(s) for |: 'tuple' and 'tuple'
You could do this with the .between() method:
>>> df  # example DF
    B
0   1
1   9
2 -27
3  15
4  45
5  -6
>>> df["B"][df["B"].between(-15, 15)] = 0
>>> df["B"][~df["B"].between(-15, 15)] = 1
>>> df
B
0 0
1 0
2 1
3 0
4 1
5 0
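As an aside, the original .mask call raised the TypeError because (df['B'] > 15, 1) is a tuple, so | was applied to two tuples rather than two boolean Series. A minimal corrected sketch (computing the condition once also avoids the follow-up where(df['B'] == 1, 0) mislabeling in-range values that already equal 1):
import pandas as pd

df = pd.DataFrame({'B': [1, 9, -27, 15, 45, -6]})     # sample values from above
outside = (df['B'] > 15) | (df['B'] < -15)            # True where |B| > 15
df['B'] = df['B'].mask(outside, 1).where(outside, 0)  # 1 outside the range, 0 inside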
You can also use the .apply() method:
df = pd.DataFrame({'B': [-20, -12, 8, 11, 24]})
df['B'] = df['B'].apply(lambda x: 1 if x > 15 or x < -15 else 0)
I felt bad spamming sj95126, so I'll just provide some extra solutions here.
If you actually need 0s and 1s:
(~df["B"].between(-15, 15)).astype(int)
If you're already using numpy, but need more generic replacement (not 0s and 1s):
np.where(df["B"].between(-15, 15), val_if_between, val_if_not_between)
If you're not using numpy but still need more generic replacement:
df["B"].between(-15, 15).replace({True: val_if_between, False: val_if_not_between})
To closely follow your thinking process:
df.loc[
(df['B'] > 15) | (df['B'] < -15),
'B',
] = 1
df.loc[...] is essential syntax to master in pandas.
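Note that the .loc line above only writes the 1s; a complete sketch computes the mask before zeroing the column (otherwise the condition would test the freshly written zeros):
import pandas as pd

df = pd.DataFrame({'B': [1, 9, -27, 15, 45, -6]})  # sample values for illustration
mask = (df['B'] > 15) | (df['B'] < -15)  # evaluate against the original values
df['B'] = 0            # default for the in-range rows
df.loc[mask, 'B'] = 1  # overwrite the out-of-range rows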
Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending up or down compared to the value at [i, 'A'].
Here is a loop:
import pandas as pd
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i + 5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # label 2 means trending down
        j = j + 1
[out]
   A   B  label
0  0   1      1
1  1  10      2
2  2 -10      2
3  3   2      0
4  5   3      0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using pandas' vectorized methods.
The task can be accomplished using vectorized methods:
the rolling method, which does computations over a rolling window
the min & max methods, which we compute within the rolling window
np.select, which lets us assign values based upon a list of conditions
Code
import numpy as np
import pandas as pd

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find max/min values in a window starting at the current point.
    A rolling window normally looks at backward values.
    We use the technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values.
    '''
    # To have a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]
    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]
    nrows = df.shape[0] - 1  # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives the index of the max in the non-reversed rows
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]
    # Use np.select to implement the label-assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,   # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,  # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)
    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
I have a pandas dataframe (100,000 obs) with 11 columns.
I'm trying to assign df['trade_sign'] values based on df['diff'] (which is a pd.Series of integer values):
If diff is positive, then trade_sign = 1
if diff is negative, then trade_sign = -1
if diff is 0, then trade_sign = 0
What I've tried so far:
pos['trade_sign'] = (pos['trade_sign']>0)
pos['trade_sign'].replace({False: -1, True: 1}, inplace=True)
But this obviously doesn't take into account 0 values.
I also tried for loops with if conditions but that didn't work.
Essentially, how do I fix my .replace approach to take into account diff values of 0?
Ideally, I'd prefer a solution that uses numpy over for loops with if conditions.
There's a sign function in numpy:
df["trade_sign"] = np.sign(df["diff"])
If you want integers,
df["trade_sign"] = np.sign(df["diff"]).astype(int)
a = [-1 if v < 0 else 1 if v > 0 else 0 for v in df['diff'].values]
df['trade_sign'] = a
You could do it this way:
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
The boolean results of the element-wise > and < comparisons automatically get converted to int in order to allow multiplication with 1 and -1, respectively.
This sample input and test code:
import pandas as pd
pos = pd.DataFrame({'diff': [-9, 0, 9, -8, 0, 8, -18, 4, 3, 2, 0]})
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
print(pos)
... gives this output:
diff trade_sign
0 -9 -1
1 0 0
2 9 1
3 -8 -1
4 0 0
5 8 1
6 -18 -1
7 4 1
8 3 1
9 2 1
10 0 0
UPDATE: In addition to the solution above, as well as some of the other excellent ideas in other answers, you can use numpy where:
pos['trade_sign'] = np.where(pos['diff'] > 0, 1, np.where(pos['diff'] < 0, -1, 0))
I would like to replace certain value-thresholds in a df with another value.
For example, all values from 1 up to (but not including) 3.3 should be summarized as 1.
After that, all values from 3.3 up to (but not including) 10 should be summarized as 2, and so on.
I tried it like this:
tndf is my df and tnn the column
tndf.loc[(tndf.tnn < 1), 'tnn'] = 0
tndf.loc[((tndf.tnn >= 1) | (tndf.tnn < 3.3)), 'tnn'] = 1
tndf.loc[((tndf.tnn >=3.3) | (tndf.tnn < 10)), 'tnn'] = 2
tndf.loc[((tndf.tnn >=10) | (tndf.tnn < 20)), 'tnn'] = 3
tndf.loc[((tndf.tnn >=20) | (tndf.tnn < 33.3)), 'tnn'] = 4
tndf.loc[((tndf.tnn >=33.3) | (tndf.tnn < 50)), 'tnn'] = 5
tndf.loc[((tndf.tnn >=50) | (tndf.tnn < 100)), 'tnn'] = 6
tndf.loc[(tndf.tnn == 100), 'tnn'] = 7
But every value at the end is summarized as a 6. I think that's because of the second part of each condition, but I don't know how to tell the program to only look in a specific range (for example from >=3.3 to <10).
I would use np.where(); here is the documentation:
np.where()
import numpy as np
tnddf0 = np.where(tndf.tnn < 1, 0, tndf.tnn)
tnddf1 = np.where((tndf.tnn >= 1) & (tndf.tnn < 3.3), 1, tndf.tnn)
# and so on....
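Chaining the remaining bins into a single nested np.where would look like this (a sketch using the question's thresholds; tnn_new is a hypothetical output column so the original values aren't overwritten mid-computation):
import numpy as np

tndf['tnn_new'] = np.where(tndf.tnn < 1, 0,
                  np.where(tndf.tnn < 3.3, 1,
                  np.where(tndf.tnn < 10, 2,
                  np.where(tndf.tnn < 20, 3,
                  np.where(tndf.tnn < 33.3, 4,
                  np.where(tndf.tnn < 50, 5,
                  np.where(tndf.tnn < 100, 6, 7)))))))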
To form categories like these use pd.cut
pd.cut(df.tnn, [0, 1, 3.3, 10, 20, 33.3, 50, 100], right=False, labels=range(0, 7))
Sample output of pd.cut
tnn cat
0 76.518227 6
1 44.808386 5
2 46.798994 5
3 70.798699 6
4 67.301112 6
5 13.701745 3
6 47.310570 5
7 74.048936 6
8 37.904632 5
9 38.617358 5
OR
Use np.select. It is meant exactly for your use-case.
conditions = [tndf.tnn < 1, (tndf.tnn >= 1) & (tndf.tnn < 3.3)]
values = [0, 1]
np.select(conditions, values, default="unknown")
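Spelled out for all of the question's bins, the np.select version becomes (a sketch; default=-1 is an assumed sentinel for anything the bins don't cover):
import numpy as np

conditions = [
    tndf.tnn < 1,
    (tndf.tnn >= 1) & (tndf.tnn < 3.3),
    (tndf.tnn >= 3.3) & (tndf.tnn < 10),
    (tndf.tnn >= 10) & (tndf.tnn < 20),
    (tndf.tnn >= 20) & (tndf.tnn < 33.3),
    (tndf.tnn >= 33.3) & (tndf.tnn < 50),
    (tndf.tnn >= 50) & (tndf.tnn < 100),
    tndf.tnn == 100,
]
values = [0, 1, 2, 3, 4, 5, 6, 7]
tndf['tnn'] = np.select(conditions, values, default=-1)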
I am trying to create another label column which is based on multiple conditions in my existing data
df
ind  group  people  value  value_50  val_minmax
  1      1       5    100         1          10
  1      2       2     90         1          na
  2      1      10     80         1          80
  2      2      20     40         0          na
  3      1       7     10         0          10
  3      2      23     30         0          na
import pandas as pd
import numpy as np
df = pd.read_clipboard()
Then I am trying to label the rows according to the conditions below:
df['label'] = np.where(np.logical_and(df.group == 2, df.value_50 == 1, df.value > 50), 1, 0)
but it is giving me an error
TypeError: return arrays must be of ArrayType
How can I do this in Python?
Use & between masks:
df['label'] = np.where((df.group == 2) & (df.value_50 == 1) & (df.value > 50), 1, 0)
Alternative:
df['label'] = ((df.group == 2) & (df.value_50 == 1) & (df.value > 50)).astype(int)
Your solution should work if you use np.logical_and.reduce with a list of boolean masks:
mask = np.logical_and.reduce([df.group == 2, df.value_50 == 1, df.value > 50])
df['label'] = np.where(mask, 1, 0)
#alternative
#df['label'] = mask.astype(int)
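A quick check on a toy frame (values assumed for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 2, 2], 'value_50': [1, 1, 0], 'value': [100, 90, 40]})
mask = np.logical_and.reduce([df.group == 2, df.value_50 == 1, df.value > 50])
df['label'] = np.where(mask, 1, 0)  # only the middle row satisfies all three masks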
If I have a dataframe df with column x and want to create column y based on values of x using this in pseudo code:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where is the best way to do this but not sure how to code it correctly.
One simple method would be to assign the default value first and then perform 2 loc calls:
In [66]:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
x
0 0
1 -3
2 5
3 -1
4 1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
If you wanted to use np.where then you could do it with a nested np.where:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
So here we define the first condition: where x is less than -2, return 1. Then the nested np.where tests the other condition: where x is greater than 2, return -1; otherwise return 0.
timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset, the np.where method is roughly twice as fast.
Use np.select for multiple conditions
np.select(condlist, choicelist, default=0)
Return elements in choicelist depending on the corresponding condition in condlist.
The default element is used when all conditions evaluate to False.
condlist = [
df['x'] < -2,
df['x'] > 2,
]
choicelist = [
1,
-1,
]
df['y'] = np.select(condlist, choicelist, default=0)
np.select is much more readable than a nested np.where but just as fast.
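A sketch of how the two could be timed against each other (n is an assumed benchmark size; results will vary by machine):
import numpy as np
import pandas as pd

n = 10_000_000  # assumed size, large enough for a meaningful timing
df = pd.DataFrame({'x': np.random.randint(-5, 5, size=n)})
%timeit np.select([df['x'] < -2, df['x'] > 2], [1, -1], default=0)
%timeit np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))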
This is a good use case for pd.cut where you define ranges and based on those ranges you can assign labels:
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
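Note that pd.cut returns a Categorical column; if you need plain integers you can cast the result:
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False).astype(int)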
Set a fixed value in 'c2' where the condition on 'c1' is met:
df.loc[df['c1'] == 'Value', 'c2'] = 10
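For instance, on a toy frame (column names from the snippet, values assumed):
import pandas as pd

df = pd.DataFrame({'c1': ['Value', 'other'], 'c2': [0, 0]})
df.loc[df['c1'] == 'Value', 'c2'] = 10  # only the first row's c2 becomes 10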
You can do it easily using the index and 2 loc calls:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
x
0 0
1 -3
2 5
3 -1
4 1
df['y'] = 0
idx_1 = df.loc[df['x'] < -2, 'y'].index
idx_2 = df.loc[df['x'] > 2, 'y'].index
df.loc[idx_1, 'y'] = 1
df.loc[idx_2, 'y'] = -1
df
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0