I'm relatively new to Python and hoping someone can help point me in the right direction.
For context, I want to create a new column in a Pandas DataFrame that assigns a score of linearly increasing integer values based on which range the values in an existing column fall into.
There is a lower and an upper bound, say 0 and 0.75. Values below or above those bounds should receive the lowest / highest score respectively.
Written manually with relatively few conditions, it looks like this using np.select():
import numpy as np
import pandas as pd

d = {'col1': [-1, 0, .1, .6, .8], 'col2': [-4, -0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)
conditions = [
    (df['col1'] < 0),
    (df['col1'] >= 0) & (df['col1'] <= .25),
    (df['col1'] >= .25) & (df['col1'] <= .5),
    (df['col1'] >= .5) & (df['col1'] <= .75),
    (df['col1'] >= .75)
]
values = [0, 1, 2, 3, 4]
df['col3'] = np.select(conditions, values, default=None)
I would like to be able to dynamically divide the mid-range between the bounds into many more conditions, which is easy enough using np.linspace.
Where I'm having trouble is in assigning the values. I have tried to do this using pd.cut, and by operating on a list of strings to feed into np.select. This is the closest I have come with these:
d = {'col1': [-1, 0, .1, .6, .8], 'col2': [-4, -0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)
conditions_no = 9  # Choose number of conditions to divide the mid-range
choices = [n for n in range(1, conditions_no + 2)]  # Assign values to apply starting from 1
mid_range = np.linspace(0, .75, conditions_no)  # Divide mid-range by number of conditions
mid_range = np.append(mid_range[0], mid_range)  # Repeat lower bound at start for < condition
cols = ['df["col1"]' for c in range(0, conditions_no + 1)]  # Generate list of column references
conditions = list(zip(cols, mid_range))  # Pair column references with thresholds
conditions = [f'{k} >= {v}' for k, v in conditions]  # Combine column references and thresholds into strings
conditions[0] = conditions[0].replace('>=', '<')  # Change first condition to less than lower bound
conditions = conditions[::-1]  # Reverse conditions and choices to check the highest value first
choices = choices[::-1]
Here the conditions are a list of strings rather than executable boolean expressions:
['df["col1"] >= 0.75',
'df["col1"] >= 0.65625',
'df["col1"] >= 0.5625',
'df["col1"] >= 0.46875',
'df["col1"] >= 0.375',
'df["col1"] >= 0.28125',
'df["col1"] >= 0.1875',
'df["col1"] >= 0.09375',
'df["col1"] >= 0.0',
'df["col1"] < 0.0']
So they understandably throw an error:
df['col3'] = np.select(conditions, choices, default=None)
# TypeError: invalid entry 0 in condlist: should be boolean ndarray
I understand that eval() might be able to help here, but I haven't been able to find a way to get it to run with np.select. I've also read that it's best to avoid using eval().
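For what it's worth, the string-building (and eval()) can be avoided entirely by constructing the condition list as real boolean Series. A minimal sketch, reusing df, conditions_no and mid_range from the block above:
conditions = [df['col1'] >= v for v in mid_range[1:][::-1]]  # highest threshold first
conditions.append(df['col1'] < mid_range[0])  # catch-all for values below the lower bound
choices = list(range(conditions_no + 1, 0, -1))  # matching values, highest first
df['col3'] = np.select(conditions, choices, default=None)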
This is the effort so far using pd.cut:
conditions_no = 9
choices = [n for n in range(1, conditions_no + 2)]
mid_range = np.linspace(0, .75, conditions_no)
mid_range = np.append(-float("inf"), mid_range)
mid_range = np.append(mid_range, float("inf"))
df['col3'] = pd.cut(df['col1'], mid_range, labels=choices)
df['col4'] = pd.cut(df['col2'], mid_range, labels=choices)
This works, but assigns a categorical that I then can't operate on as needed:
df['col3'] + df['col4']
# TypeError: unsupported operand type(s) for +: 'Categorical' and 'Categorical'
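One note on the categorical issue: since the labels passed to pd.cut here are plain integers, casting the result with .astype(int) makes it arithmetic-friendly again. A small sketch under the same setup as above:
df['col3'] = pd.cut(df['col1'], mid_range, labels=choices).astype(int)
df['col4'] = pd.cut(df['col2'], mid_range, labels=choices).astype(int)
df['col3'] + df['col4']  # no longer raises TypeError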
After everything I've looked up, I keep coming back to np.select as likely being the best solution here. However, I can't figure out how to dynamically create the conditions - are any of these efforts along the right lines or is there a better approach I should look at?
Related
I have a dataset for which I want to create a new column based on dividing two other columns, using a for-loop with if-conditions.
This is the dataset, with the empty 'solo_fare' column created beforehand.
The task is to loop through each row and divide 'Fare' by 'relatives' to get the per-passenger fare. However, there are certain if-conditions to follow (passengers in this category should see per-passenger prices of between 3 and 8).
The code I have tried here doesn't seem to fill in the 'solo_fare' rows at all. It returns an empty column (the same as the df above).
for i in range(0, len(fare_result)):
    p = fare_result.iloc[i]['Fare'] / fare_result.iloc[i]['relatives']
    q = fare_result.iloc[i]['Fare']
    r = fare_result.iloc[i]['relatives']
    # if relatives == 0, return the original Fare amount
    if r == 0:
        fare_result.iloc[i]['solo_fare'] = q
    # if the divided fare is below 3 or above 8, return the original Fare amount again
    elif (p < 3) or (p > 8):
        fare_result.iloc[i]['solo_fare'] = q
    # else, return the divided fare to get solo_fare
    else:
        fare_result.iloc[i]['solo_fare'] = p
How can I get this to work?
You should probably not use a loop for this, but instead just use .loc.
If you first create the 'solo_fare' column and give every row the default value from 'Fare', you can then change the value for the conditions you have set out:
fare_result['solo_fare'] = fare_result['Fare']
fare_result.loc[(
    (fare_result.Fare / fare_result.relatives) >= 3) & (
    (fare_result.Fare / fare_result.relatives) <= 8), 'solo_fare'] = (
    fare_result.Fare / fare_result.relatives)
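For comparison, a sketch of the same logic with np.where (an alternative formulation, not part of the answer above); note that relatives == 0 yields inf for the division, which fails the between-check and so falls back to Fare:
import numpy as np

per_head = fare_result['Fare'] / fare_result['relatives']  # inf where relatives == 0
fare_result['solo_fare'] = np.where(per_head.between(3, 8), per_head, fare_result['Fare'])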
Did you try to initialize the new column first?
By that I mean that the statement fare_result.iloc[i]['solo_fare'] = q
only assigns the value q to the solo_fare field of row i.
The issue is that, at that moment, row i does not have any solo_fare key, so the values never make it into your table.
To solve this issue, try declaring the solo_fare column before the for loop like:
fare_result['solo_fare'] = np.nan
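One caveat worth adding: even with the column initialized, fare_result.iloc[i]['solo_fare'] = q is chained indexing and may still write to a temporary copy rather than the frame. A safer in-loop assignment (assuming a default RangeIndex, so position i equals label i) is a single .loc call:
fare_result.loc[i, 'solo_fare'] = q  # one label-based step writes into the frame itself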
One way to do it is to define a row-wise function and apply it to the dataframe:
# row-wise function (mock-up)
def foo(fare, relative):
    # your logic here; mine just serves as an example
    if relative > 100:
        res = fare / relative
    elif relative < 10:
        res = fare
    else:
        res = 10
    return res
Then apply it to the dataframe (row-wise):
fare_result['solo_fare'] = fare_result.apply(lambda row: foo(row['Fare'], row['relatives']) , axis=1)
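Filled in with the question's actual conditions (a sketch, since foo above is only a mock-up):
def solo_fare(fare, relatives):
    # keep the original fare when relatives is 0 or the split is out of range
    if relatives == 0:
        return fare
    p = fare / relatives
    return p if 3 <= p <= 8 else fare

fare_result['solo_fare'] = fare_result.apply(
    lambda row: solo_fare(row['Fare'], row['relatives']), axis=1)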
I am new to Python / NumPy. Currently, I am working on code to evaluate which condition or conditions from a condition list are satisfied (multiple conditions are given in the condition list).
I am aware of NumPy's select function, which returns an array drawn from elements in choicelist, depending on conditions.
https://numpy.org/doc/stable/reference/generated/numpy.select.html
Syntax: numpy.select(condlist, choicelist, default=0)
Parameters:
condlist : [list of bool ndarrays] Determines from which array in choicelist the output elements are taken. When multiple conditions are satisfied, the first one encountered in condlist is used.
choicelist : [list of ndarrays] The list of arrays from which the output elements are taken. It has to be of the same length as condlist.
default : [scalar, optional] The element inserted in the output when all conditions evaluate to False.
Return : [ndarray] An array drawn from elements in choicelist, depending on conditions.
When using numpy.select, if multiple conditions are satisfied, the first one encountered in condlist is used.
Problem:
Is there a function available in NumPy / Python which outputs ALL SATISFIED CONDITIONS (not just the first satisfied condition, as numpy.select provides) from condlist?
If no such function is available, can someone help build one?
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 10, 20, 20], "B": [10, 0, 10, 0]})
condlist = [(df.A + df.B == 20), (df.A == 10) & (df.B == 0), (df.A == 20) & (df.B == 10), (df.A == 20) & (df.B == 0)]
choicelist = [(df.A + df.B), 'No', 'Hi', 'YES']
calculate = np.select(condlist, choicelist)
df['RESULT'] = pd.Series(calculate, index=df.index)
df
Output:
   A   B RESULT
0  10  10     20
1  10   0     No
2  20  10     Hi
3  20   0     20   # Desired output: 20, YES
For the last row, the desired RESULT output should be 20, YES (as condition #4, (df.A == 20) & (df.B == 0), is also True).
Reference: the numpy.select source code is linked below; refer to lines 626 to 719 for detail.
https://github.com/numpy/numpy/blob/v1.21.0/numpy/lib/function_base.py#L626-L719
Will something like this work?
import numpy as np

def myselect(A, B):
    conditions = [lambda x, y: x == y,
                  lambda x, y: (x == 20) & (y == 20),
                  lambda x, y: x + y == 20,
                  ...
                  ]
    # These are the desired outputs for each satisfied condition
    result_values = ['0isTrue', '1isTrue' ...]
    assert len(conditions) == len(result_values)
    assert len(A) == len(B)
    C = [[] for _ in range(len(A))]  # one result list per element ([] * len(A) would be empty)
    for i in range(len(conditions)):
        r = conditions[i](A, B)  # evaluate condition i over all elements
        for j, v in enumerate(r):
            if v:
                C[j].append(result_values[i])
    return C
# Generate a bunch of values to test with
top = <top-value-to-generate>
elements = <number-of-elements-to-test>
rng = np.random.default_rng()
A = rng.integers(0, top, elements)
B = rng.integers(0, top, elements)
result = myselect(A, B)
Notes:
- The output will be "ragged" (a different number of values in the 2nd dimension), so it is not suitable as a NumPy array and therefore not something that belongs inside NumPy. Also, your example mixes str and int in the result_values, which is not really NumPy-friendly either. So I used a list-of-lists for C.
- Perhaps there is a way to do this with less looping. But as always when writing code: first make it correct, then make it fast.
- The lambdas (you could use functions instead) are evaluated one at a time. np.select appears to take a pre-evaluated condition list, so this solution might use less memory when elements is very large.
- You could pass conditions and result_values into myselect.
The correct code is as below with help from one of the experts.
df = pd.DataFrame({"A": [10, 10, 20, 20], "B": [10, 0, 10, 0]})
condlist = [(df.A + df.B == 20), (df.A == 10) & (df.B == 0), (df.A == 20) & (df.B == 10), (df.A == 20) & (df.B == 0)]
condarray = np.array(condlist)  # bool array: each row is a condition, each column a row of the df
cond_true = [np.where(i)[0] for i in condarray.T]  # indices of all True conditions per df row
df['result'] = cond_true
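If the labels rather than the raw condition indices are wanted, a small follow-up sketch (with simplified string labels standing in for the original mixed choicelist) could map them:
labels = ['20', 'No', 'Hi', 'YES']  # one label per condition, in condlist order
df['result_labels'] = [', '.join(labels[i] for i in idx) for idx in cond_true]
# the last row now reads '20, YES'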
I am using np.select() to construct an ndarray with values of either 1, -1, or 0, depending on some conditions. It is possible that none of these will be met, so I need a default value. I would like this value to be the value that the array holds in the previous index, if that makes sense. My naive code, which runs on some columns of a DataFrame named "total" and which raises an error, is below:
condlist = [total.ratios > total.s_entry, total.ratios < total.b_entry, (total.ratios > total.b_entry) & (total.ratios < total.s_entry)]
choicelist = (-1, 1, 0)
pos1 = pd.Series(np.select(condlist, choicelist, pos1))
Is there a way to do what I am asking? For example, having the array start
1
1
0
-1
-1
and then the sixth element doesn't satisfy any of the conditions, so its value defaults to -1 due to that being the most recent value of the array?
I had the same problem, but didn't want to go through a complicated mechanism just for the trouble of the default value (given I already had a working version using .loc instead), as seen in the responses here.
I simply tried passing the dataframe column/series as the default to keep its existing value (already populated in my case), and it worked:
# e.g. if task_type is not NaN, it already has a value of "C"
# that I want to keep
conditions = [
result_df["task_type"].isna() & result_df["maintenance_task"],
result_df["task_type"].isna(),
]
choices = ["A", "B"]
result_df["task_type"] = np.select(conditions, choices, default=result_df["task_type"])
I noticed this approach was slightly more performant than the one I had working with .loc, and it should scale/read better in the long run if more conditions appear.
I am not sure if you will be happy with this solution but you could assign some default value and then change it to what you want while iterating:
x = np.arange(20)
condlist = [x < 4, np.logical_and(x > 8, x < 15), x > 15, np.full(x.shape, True)]  # last entry is a catch-all
choicelist = (-1, 1, 0, None)
pos1 = pd.Series(np.select(condlist, choicelist, x))
for index, row in pos1.items():
    if row is None and index == 0:
        pass  # Not sure what you want to do here
    elif row is None:
        pos1.at[index] = pos1.at[index - 1]
Try leaving None as the default value in np.select. Then you can fill the gaps using the .fillna() method, which accepts a pd.Series as an argument for index-wise filling.
In your case the argument is the same series with a shifted index (the shift can be done using deque's .rotate() method). Hope this works for you:
from collections import deque
condlist = [total.ratios > total.s_entry, total.ratios < total.b_entry, (total.ratios > total.b_entry) & (total.ratios < total.s_entry)]
choicelist = (-1, 1, 0)
pos1 = pd.Series(np.select(condlist, choicelist, None))
pos1_index_shift = deque(pos1.index) # [0, 1, 2, ...]
pos1_index_shift.rotate(1) # [n, 0, 1, ...] - done inplace
pos1_prev = pos1.copy()
pos1_prev.index = pos1_index_shift
pos1 = pos1.fillna(pos1_prev)
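Note that the rotated copy only looks back a single row, so two consecutive unmatched rows would leave the second one unfilled. If the last matched value should propagate through longer gaps, a forward-fill (a different method from the one above) is a shorter sketch:
pos1 = pd.Series(np.select(condlist, choicelist, None)).ffill()  # propagates the last non-missing value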
Is there a way to get rid of the loop in the code below and replace it with a vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
[0.35, 0.5],
[0.45, 0.6],
[0.52, 0.65],
[0.6, 0.8],
[0.75, 0.92]])
# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6, dtype=int)
for i in range(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i, ind])]

print(result)
# Result: [35 8 22 8 34 78]
print(data[np.arange(6), result])
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
Approach #3 : If you are only iterating a handful of times (as with the loop of 6 iterations in the sample), you might want to stick to a loop for memory efficiency, but make use of more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in range(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i, mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data, then use the two range limits for each row to decide on start and stop indices with help from np.searchsorted. Further, we would use those indices to slice and get the minimum values. The benefit of slicing that way is that we would be working with views, which is very efficient in both memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in range(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i, start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on a suggestion by @Daniel F, for the case when we are dealing with ranges that are within the limits of the given data, we could simply use the start indices -
out[i] = sdata[i, start]
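As a rough illustration of the vectorized-searchsorted idea (a sketch that leans on the sample's values lying in [0, 1), so offsetting each sorted row by its row index keeps the flattened array globally sorted):
n_rows, n_cols = sdata.shape
row_shift = np.arange(n_rows)
flat = (sdata + row_shift[:, None]).ravel()  # globally sorted, given values in [0, 1)
start = np.searchsorted(flat, ranges[:, 0] + row_shift) - row_shift * n_cols
out = sdata[np.arange(n_rows), start]  # valid when every range is non-empty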
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6, dtype=int)
for i in range(6):
    lt = (ranges[i,0] >= data[i]).sum()  # number of values at or below the lower bound
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting.
How do you replace a value in a DataFrame cell based on a condition over the entire DataFrame, not just a column? I have tried to use df.where, but this doesn't work as planned:
import operator

df = df.where(operator.and_(df > (-1 * .2), df < 0), 0)
df = df.where(df > 0, df * 1.2)
Basically, what I'm trying to do here is replace all values between -0.2 and 0 with zero across all columns in my dataframe, and multiply all values greater than zero by 1.2.
You've misunderstood the way pandas.where works: it keeps the values of the original object where the condition is true, and replaces them otherwise. You can reverse your logic:
df = df.where((df <= (-1 * .2)) | (df >= 0), 0)
df = df.where(df <= 0 , df * 1.2)
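A quick check of the reversed logic on a small frame (hypothetical sample data):
import pandas as pd

df = pd.DataFrame({'a': [-0.3, -0.1, 0.5], 'b': [-0.05, 0.0, 2.0]})
df = df.where((df <= -0.2) | (df >= 0), 0)  # zero out values in (-0.2, 0)
df = df.where(df <= 0, df * 1.2)  # scale positive values by 1.2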
where allows you to have a one-line solution, which is great. I prefer to use a mask, like so:
idx = (df < 0) & (df >= -0.2)
df[idx] = 0
I prefer breaking this into two lines because, using this method, it is easier to read. You could force this onto a single line as well:
df[(df < 0) & (df >= -0.2)] = 0
Just another option.
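For completeness, both replacements could also be sketched in a single np.select call over the underlying array (an alternative not given in the answers above; per the earlier thread, np.select accepts an array-like default):
import numpy as np

arr = df.to_numpy()
df = pd.DataFrame(np.select([(arr > -0.2) & (arr < 0), arr > 0],
                            [0, arr * 1.2], default=arr),
                  index=df.index, columns=df.columns)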