how to use pandas for condition selection - python

I have a table like this
AREA AMOUNT
A 1000
A 10
B 30
B 3000
C 22
D 300
What I want to get is more that 100 in AREA A & more than 100 in AREA B & less than 100 in AREA C and more than 100 in AREA D . I have many of these kind of area to analyse.
so what I want to get is below.
AREA AMOUNT
A 1000
B 3000
C 22
D 300

You can use .isin() and pass the three columns > 100 and then == for just the C column using & and | for and and or. Pay attention to parentheses here:
df = df[((df['AREA'].isin(['A','B','D'])) & (df['AMOUNT'] > 100)) |
((df['AREA'] == 'C') & (df['AMOUNT'] < 100))]
df
Out[1]:
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300

You can write in this way also by creating custom function for setting up the condition
import operator
ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
g = lambda x, y, z: (df['AREA'].eq(x)) & (ops[z](df['AMOUNT'], y))
df[g('A', 100, 'gt')| g('B', 100, 'gt') | g('C', 100, 'lt') | g('D', 100, 'gt') ]
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300

Related

More than one condition meet numpy select

I hv the following dataframe:
A B C D E F
100 0 0 0 100 0
0 100 0 0 0 100
-100 0 0 0 100 0
and this code:
cond = [
(df['A'] == 100),
(df['A'] == -100),
(df['B'] == 100),
(df['C'] == 100),
(df['D'] == 100),
(df['E'] == 100),
(df['F'] == 100),
]
choices = ['A','neg_A', 'B', 'C','D', 'E', 'F']
df['result'] = np.select(cond, choices)
For both rows there will be two results but I want only one to be selected. I want the selection to be made with this criteria:
+A = 67%
-A = 68%
B = 70%
C = 75%
D = 66%
E = 54%
F = 98%
Percentage shows accuracy rate so i would want the one with highest percentage to be preferred over the other.
Intended result:
A B C D E F result
100 0 0 0 100 0 A
0 100 0 0 0 100 F
-100 0 0 0 100 0 neg_A
Little help will be appreciated. THANKS!
EDIT:
Some of the columns (like A) may have a mix of 100 and -100. Positive 100 will yield a simple A (see row 1) but a -100 should yield some other name like "neg_A" in the result (see row 3).
Let's sort the columns of dataframe based on the priority values then use .eq + .idxmax on axis=1 to get the column name with first occurrence of 100:
# define a dict with col names and priority values
d = {'A': .67, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
df['result'] = df[sorted(d, key=lambda x: -d[x])].eq(100).idxmax(axis=1)
A B C D E F result
0 100 0 0 0 100 0 A
1 0 100 0 0 0 100 F

Create custom buckets for df based on column

I want to add a new column with custom buckets (see example below)based on the price values in the price column.
< 400 = low
>=401 and <=1000 = medium
>1000 = expensive
Table
product_id price
2 1203
4 500
5 490
6 200
3 429
5 321
Output table
product_id price price_category
2 1001 high
4 500 medium
5 490 medium
6 200 low
3 429 medium
5 321 low
This what I have tried so far:
from numba import njit
def cut(arr):
bins = np.empty(arr.shape[0])
for idx, x in enumerate(arr):
if (x >= 0) & (x <= 50):
bins[idx] = 1
elif (x >= 51) & (x <= 100):
bins[idx] = 2
elif (x >= 101) & (x <= 250):
bins[idx] = 3
elif (x >= 251) & (x <= 1000):
bins[idx] = 4
else:
bins[idx] = 5
return bins
a = cut(df2['average_listings'].to_numpy())
conversion_dict = {1: 'S',
2: 'M',
3: 'L',
4: 'XL',
5: 'XXL'}
bins = list(map(conversion_dict.get, a))
--> But I am struggling to add this to the main df
pandas has it's own cut method. Specify the right bin edges and the corresponding labels.
df['price_category'] = pd.cut(df.price, [-np.inf, 400, 1000, np.inf],
labels=['low', 'medium', 'high'])
product_id price price_category
0 2 1203 high
1 4 500 medium
2 5 490 medium
3 6 200 low
4 3 429 medium
5 5 321 low
Without the labels argument, you get back the exact bins (and closure, right by default) used for the data, which in this case are:
Categories (3, interval[float64]): [(-inf, 400.0] < (400.0, 1000.0] < (1000.0, inf]]
You can use, np.select:
conditions = [
df['price'].lt(400),
df['price'].ge(401) & df['price'].le(1000),
df['price'].gt(1000)]
choices = ['low', 'medium', 'high']
df['price_category'] = np.select(conditions, choices)
# print(df)
product_id price price_category
0 2 1203 high
1 4 500 medium
2 5 490 medium
3 6 200 low
4 3 429 medium
5 5 321 low
A simple solution would be something like this:
df.loc[df.price < 400, 'price_category'] = 'low'

New column based in multiple conditions

a b
0 100 90
1 30 117
2 90 99
3 200 94
I want to create a new df["c"] with next conditions:
If a > 50 and b is into (a ± 0.5a), then c = a
If a > 50 and b is out (a ± 0.5a), then c = b
If a <= 50, then *c = a*
Output should be:
a b c
0 100 90 100
1 30 117 30
2 90 99 90
3 200 94 94
I´ve tried:
df['c'] = np.where(df.eval("0.5 * a <= b <= 1.5 * a"), df.a, df.b)
But I don´t know how to include the last condition (If a <= 50, then c = a) in this sentence.
You're almost there, you'll just need to add an or clause inside your eval string.
np.where(df.eval("(0.5 * a <= b <= 1.5 * a) or (a <= 50)"), df.a, df.b)
# ~~~~~~~~~~~~
array([100, 30, 90, 94])

Comparing a value from one dataframe with values from columns in another dataframe and getting the data from third column

The title is bit confusing but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:
>> print a
id | value
1 | 250
2 | 150
3 | 350
4 | 550
5 | 450
>> print b
low | high | class
100 | 200 | 'A'
200 | 300 | 'B'
300 | 500 | 'A'
500 | 600 | 'C'
I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:
>> print a
id | value | class
1 | 250 | 'B'
2 | 150 | 'A'
3 | 350 | 'A'
4 | 550 | 'C'
5 | 450 | 'A'
I have the following code written that sort of does what I want:
a['class'] = pd.Series()
for i in range(len(a)):
val = a['value'][i]
cl = (b['class'][ (b['low'] <= val) \
(b['high'] >= val) ].iat[0])
a['class'].set_value(i,cl)
Problem is, this is quick for tables length of 10 or so, but I am trying to do this with a table size of 100,000+ for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?
Here is a way to do a range join inspired by #piRSquared's solution:
A = a['value'].values
bh = b.high.values
bl = b.low.values
i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))
pd.DataFrame(
np.column_stack([a.values[i], b.values[j]]),
columns=a.columns.append(b.columns)
)
Output:
id value low high class
0 1 250 200 300 'B'
1 2 150 100 200 'A'
2 3 350 300 500 'A'
3 4 550 500 600 'C'
4 5 450 300 500 'A'
Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!
I pull data out from the pandas DataFrames and convert them to lists and then use np.where to populate a variable called "aclass" where the conditions are satified (in brute force for loops). Then I write "aclass" to the original data frame a.
The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!
# create 200,000 fake a data points
avalue = 100+600*np.random.random(200000) # assuming you extracted this from a with avalue = np.array(a['value'])
blow = [100,200,300,500] # assuming you extracted this from b with list(b['low'])
bhigh = [200,300,500,600] # assuming you extracted this from b with list(b['high'])
bclass = ['A','B','A','C'] # assuming you extracted this from b with list(b['class'])
aclass = [[]]*len(avalue) # initialize aclass
start_time = time.time() # this is just for timing the execution
for i in range(len(blow)):
for j in np.where((avalue>=blow[i]) & (avalue<=bhigh[i]))[0]:
aclass[j]=bclass[i]
# add the class column to the original a DataFrame
a['class'] = aclass
print("--- %s seconds ---" % np.round(time.time() - start_time,decimals = 8))

Applying diff on selected rows for comparing angles from math.atan2

I have a data frame like this that want to apply diff function on:
test = pd.DataFrame({ 'Observation' : ['0','1','2',
'3','4','5',
'6','7','8'],
'Value' : [30,60,170,-170,-130,-60,-30,10,20]
})
Observation Value
0 30
1 60
2 170
3 -170
4 -130
5 -60
6 -30
7 10
8 20
The column 'Value' is in degrees. So, the difference between -170 and 170 should be 20, not -340. In other words, when d2*d1 < 0, instead of d2-d1, I'd like to get 360-(abs(d1)+abs(d2))
Here's why I try. But then I don't know how to continue it without using a for loop:
test['Value_diff_1st_attempt'] = test['Value'].diff(1)
test['sign_temp'] = test['Value'].shift()
test['Sign'] = np.sign(test['Value']*test['sign_temp'])
Here's what the result should look like:
Observation Value Delta_Value
0 30 NAN
1 60 30
2 170 110
3 -170 20
4 -130 40
5 -60 70
6 -30 30
7 10 40
8 20 10
Eventually I'd like to get just the magnitude of differences all in positive values. Thanks.
Update: So, the value results are derived from math.atan2 function. The values are from 0<theta<180 or -180<theta<0. The problem arises when we are dealing with a change of direction from 170 (upper left corner) to -170 (lower left corner) for example, where the change is really just 20 degrees. However, when we go from -30 (Lower right corner) to 10 (upper right corner), the change is really 40 degrees. I hope I explained it well.
I believe this should work (took the definition from #JasonD's answer):
test["Value"].rolling(2).apply(lambda x: 180 - abs(abs(x[0] - x[1]) - 180))
Out[45]:
0 NaN
1 30.0
2 110.0
3 20.0
4 40.0
5 70.0
6 30.0
7 40.0
8 10.0
Name: Value, dtype: float64
How it works:
Based on your question, the two angles a and b are between 0 and +/-180. For 0 < d < 180 I will write d < 180 and for -180 < d < 0 I will write d < 0. There are four possibilities:
a < 180, b < 180 -> the result is simply |a - b|. And since |a - b| - 180 cannot be greater than 180, the formula will simplify to a - b if a > b and b - a if b > a.
a < 0, b < 0 - > The same logic applies here. Both negative and their absolute difference cannot be greater than 180. The result will be |a - b|.
a < 180, b < 0 - > a - b will be greater than 0 for sure. For the cases where |a - b| > 180, we should look at the other angle and this translates to 360 - |a - b|.
a < 0, b < 180 -> again, similar to the above. If the absolute difference is greater than 180, calculate 360 - absolute difference.
For the pandas part: rolling(n) creates arrays of size n. For 2: (row 0, row1), (row1, row2), ... With apply, you apply that formula to every rolling pair where x[0] is the first element (a) and x[1] is the second element.

Categories