I have a table like this:
AREA  AMOUNT
A     1000
A     10
B     30
B     3000
C     22
D     300
What I want is rows with AMOUNT more than 100 in area A, more than 100 in area B, less than 100 in area C, and more than 100 in area D. I have many areas like this to analyse.
So what I want to get is below:
AREA  AMOUNT
A     1000
B     3000
C     22
D     300
You can use .isin() for the three areas that need AMOUNT > 100, and == for just the C column, combining the masks with & and | (element-wise "and" and "or"). Pay attention to the parentheses here:
df = df[((df['AREA'].isin(['A','B','D'])) & (df['AMOUNT'] > 100)) |
        ((df['AREA'] == 'C') & (df['AMOUNT'] < 100))]
df
Out[1]:
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300
You can also write it this way, creating a small helper to set up each condition:
import operator
# map short codes to comparison operators
ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt,
       'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
# build a mask for one area: AREA == x and AMOUNT <op z> y
g = lambda x, y, z: (df['AREA'].eq(x)) & (ops[z](df['AMOUNT'], y))
df[g('A', 100, 'gt') | g('B', 100, 'gt') | g('C', 100, 'lt') | g('D', 100, 'gt')]
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300
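Since you mention having many such areas to analyse, the helper above generalises to a fully data-driven version. Here is a minimal sketch, assuming the per-area rules live in a dict (the rules mapping below is hypothetical, so adjust it to your data):
import operator
import pandas as pd
# hypothetical mapping: area -> (comparison, threshold)
rules = {'A': (operator.gt, 100),
         'B': (operator.gt, 100),
         'C': (operator.lt, 100),
         'D': (operator.gt, 100)}
mask = pd.Series(False, index=df.index)
for area, (op, threshold) in rules.items():
    mask |= df['AREA'].eq(area) & op(df['AMOUNT'], threshold)
df[mask]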
I have the following dataframe:
A B C D E F
100 0 0 0 100 0
0 100 0 0 0 100
-100 0 0 0 100 0
and this code:
import numpy as np

cond = [
    (df['A'] == 100),
    (df['A'] == -100),
    (df['B'] == 100),
    (df['C'] == 100),
    (df['D'] == 100),
    (df['E'] == 100),
    (df['F'] == 100),
]
choices = ['A', 'neg_A', 'B', 'C', 'D', 'E', 'F']
df['result'] = np.select(cond, choices)
For some rows two of the conditions are true, so there are two possible results, but I want only one to be selected. I want the selection to be made with these criteria:
+A = 67%
-A = 68%
B = 70%
C = 75%
D = 66%
E = 54%
F = 98%
The percentage shows the accuracy rate, so I would want the one with the highest percentage to be preferred over the other.
Intended result:
A B C D E F result
100 0 0 0 100 0 A
0 100 0 0 0 100 F
-100 0 0 0 100 0 neg_A
Any help will be appreciated. Thanks!
EDIT:
Some of the columns (like A) may have a mix of 100 and -100. Positive 100 will yield a simple A (see row 1) but a -100 should yield some other name like "neg_A" in the result (see row 3).
Sort the columns of the dataframe by the priority values, then use .eq + .idxmax on axis=1 to get the column name of the first occurrence of 100:
# define a dict with col names and priority values
d = {'A': .67, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
df['result'] = df[sorted(d, key=lambda x: -d[x])].eq(100).idxmax(axis=1)
A B C D E F result
0 100 0 0 0 100 0 A
1 0 100 0 0 0 100 F
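Regarding the EDIT: the -100 case can be folded into the same idea by adding an extra column to the boolean hit matrix before taking the priority-ordered idxmax. A sketch, assuming 'neg_A' gets its 68% priority from the question:
d = {'A': .67, 'neg_A': .68, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
hits = df.eq(100)                 # where each signal fired with +100
hits['neg_A'] = df['A'].eq(-100)  # extra column for the A == -100 case
df['result'] = hits[sorted(d, key=d.get, reverse=True)].idxmax(axis=1)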
I want to add a new column with custom buckets (see example below) based on the values in the price column.
< 400 = low
>=401 and <=1000 = medium
>1000 = expensive
Table
product_id price
2 1203
4 500
5 490
6 200
3 429
5 321
Output table
product_id price price_category
2 1203 high
4 500 medium
5 490 medium
6 200 low
3 429 medium
5 321 low
This is what I have tried so far:
import numpy as np
from numba import njit

@njit
def cut(arr):
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x <= 50):
            bins[idx] = 1
        elif (x >= 51) & (x <= 100):
            bins[idx] = 2
        elif (x >= 101) & (x <= 250):
            bins[idx] = 3
        elif (x >= 251) & (x <= 1000):
            bins[idx] = 4
        else:
            bins[idx] = 5
    return bins
a = cut(df2['average_listings'].to_numpy())
conversion_dict = {1: 'S', 2: 'M', 3: 'L', 4: 'XL', 5: 'XXL'}
bins = list(map(conversion_dict.get, a))
But I am struggling to add this back to the main df.
pandas has its own cut method. Specify the right bin edges and the corresponding labels:
df['price_category'] = pd.cut(df.price, [-np.inf, 400, 1000, np.inf],
                              labels=['low', 'medium', 'high'])
product_id price price_category
0 2 1203 high
1 4 500 medium
2 5 490 medium
3 6 200 low
4 3 429 medium
5 5 321 low
Without the labels argument, you get back the exact bins used for the data (right-closed by default), which in this case are:
Categories (3, interval[float64]): [(-inf, 400.0] < (400.0, 1000.0] < (1000.0, inf]]
You can use np.select:
conditions = [
    df['price'].lt(400),
    df['price'].ge(401) & df['price'].le(1000),
    df['price'].gt(1000)]
choices = ['low', 'medium', 'high']
df['price_category'] = np.select(conditions, choices)
# print(df)
product_id price price_category
0 2 1203 high
1 4 500 medium
2 5 490 medium
3 6 200 low
4 3 429 medium
5 5 321 low
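One caveat: strictly read, these conditions leave prices from 400 up to (but not including) 401 unmatched, and np.select falls back to its default (0) for such rows. If fractional prices are possible, a sketch like this uses the same right-closed boundaries as the pd.cut answer and makes the fallback explicit:
conditions = [
    df['price'].le(400),
    df['price'].gt(400) & df['price'].le(1000),
    df['price'].gt(1000)]
choices = ['low', 'medium', 'high']
# default is returned wherever no condition matches
df['price_category'] = np.select(conditions, choices, default='other')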
A simple solution would be something like this:
df.loc[df.price < 400, 'price_category'] = 'low'
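Extending that to all three buckets takes just two more .loc assignments; a sketch that keeps the original < 400 rule for 'low':
df.loc[df.price < 400, 'price_category'] = 'low'
df.loc[(df.price >= 400) & (df.price <= 1000), 'price_category'] = 'medium'
df.loc[df.price > 1000, 'price_category'] = 'high'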
a b
0 100 90
1 30 117
2 90 99
3 200 94
I want to create a new column df['c'] with these conditions:
If a > 50 and b is within (a ± 0.5a), then c = a
If a > 50 and b is outside (a ± 0.5a), then c = b
If a <= 50, then c = a
Output should be:
a b c
0 100 90 100
1 30 117 30
2 90 99 90
3 200 94 94
I've tried:
df['c'] = np.where(df.eval("0.5 * a <= b <= 1.5 * a"), df.a, df.b)
But I don't know how to include the last condition (if a <= 50, then c = a) in this expression.
You're almost there, you'll just need to add an or clause inside your eval string.
np.where(df.eval("(0.5 * a <= b <= 1.5 * a) or (a <= 50)"), df.a, df.b)
# ~~~~~~~~~~~~
array([100, 30, 90, 94])
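For what it's worth, the same mask can also be written without eval using Series.between (inclusive on both ends by default) and assigned straight to the new column:
cond = df['b'].between(0.5 * df['a'], 1.5 * df['a']) | (df['a'] <= 50)
df['c'] = np.where(cond, df['a'], df['b'])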
The title is a bit confusing but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:
>> print a
id | value
1 | 250
2 | 150
3 | 350
4 | 550
5 | 450
>> print b
low | high | class
100 | 200 | 'A'
200 | 300 | 'B'
300 | 500 | 'A'
500 | 600 | 'C'
I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:
>> print a
id | value | class
1 | 250 | 'B'
2 | 150 | 'A'
3 | 350 | 'A'
4 | 550 | 'C'
5 | 450 | 'A'
I have the following code written that sort of does what I want:
a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = (b['class'][(b['low'] <= val) &
                     (b['high'] >= val)].iat[0])
    a['class'].set_value(i, cl)  # note: set_value is deprecated; a.at[i, 'class'] = cl in newer pandas
Problem is, this is quick for tables length of 10 or so, but I am trying to do this with a table size of 100,000+ for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?
Here is a way to do a range join inspired by @piRSquared's solution:
A = a['value'].values
bh = b.high.values
bl = b.low.values
# compare every value in a against every interval in b at once
i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))
pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)
Output:
id value low high class
0 1 250 200 300 'B'
1 2 150 100 200 'A'
2 3 350 300 500 'A'
3 4 550 500 600 'C'
4 5 450 300 500 'A'
Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!
I pull the data out of the pandas DataFrames, convert it to lists, and then use np.where to populate a variable called "aclass" where the conditions are satisfied (in brute-force for loops). Then I write "aclass" back to the original data frame a.
The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!
import time
import numpy as np

# create 200,000 fake a data points
avalue = 100 + 600*np.random.random(200000)  # assuming you extracted this from a with avalue = np.array(a['value'])
blow = [100, 200, 300, 500]    # assuming you extracted this from b with list(b['low'])
bhigh = [200, 300, 500, 600]   # assuming you extracted this from b with list(b['high'])
bclass = ['A', 'B', 'A', 'C']  # assuming you extracted this from b with list(b['class'])
aclass = [[]]*len(avalue)      # initialize aclass
start_time = time.time()       # this is just for timing the execution
for i in range(len(blow)):
    for j in np.where((avalue >= blow[i]) & (avalue <= bhigh[i]))[0]:
        aclass[j] = bclass[i]
# add the class column to the original a DataFrame
a['class'] = aclass
print("--- %s seconds ---" % np.round(time.time() - start_time, decimals=8))
I have a data frame like this that I want to apply the diff function on:
test = pd.DataFrame({'Observation': ['0', '1', '2',
                                     '3', '4', '5',
                                     '6', '7', '8'],
                     'Value': [30, 60, 170, -170, -130, -60, -30, 10, 20]})
Observation Value
0 30
1 60
2 170
3 -170
4 -130
5 -60
6 -30
7 10
8 20
The column 'Value' is in degrees. So, the difference between -170 and 170 should be 20, not -340. In other words, when d2*d1 < 0, instead of d2-d1, I'd like to get 360-(abs(d1)+abs(d2))
Here's what I tried. But I don't know how to continue without using a for loop:
test['Value_diff_1st_attempt'] = test['Value'].diff(1)
test['sign_temp'] = test['Value'].shift()
test['Sign'] = np.sign(test['Value']*test['sign_temp'])
Here's what the result should look like:
Observation Value Delta_Value
0 30 NaN
1 60 30
2 170 110
3 -170 20
4 -130 40
5 -60 70
6 -30 30
7 10 40
8 20 10
Eventually I'd like to get just the magnitude of differences all in positive values. Thanks.
Update: So, the value results are derived from math.atan2 function. The values are from 0<theta<180 or -180<theta<0. The problem arises when we are dealing with a change of direction from 170 (upper left corner) to -170 (lower left corner) for example, where the change is really just 20 degrees. However, when we go from -30 (Lower right corner) to 10 (upper right corner), the change is really 40 degrees. I hope I explained it well.
I believe this should work (I took the definition from @JasonD's answer):
test["Value"].rolling(2).apply(lambda x: 180 - abs(abs(x[0] - x[1]) - 180))
Out[45]:
0 NaN
1 30.0
2 110.0
3 20.0
4 40.0
5 70.0
6 30.0
7 40.0
8 10.0
Name: Value, dtype: float64
How it works:
Based on your question, the two angles a and b lie between -180 and 180. Writing "a < 180" for 0 < a < 180 and "a < 0" for -180 < a < 0, there are four possibilities:
a < 180, b < 180 -> both are positive and their absolute difference cannot exceed 180, so 180 - ||a - b| - 180| simplifies to |a - b|.
a < 0, b < 0 -> the same logic applies here. Both are negative and their absolute difference cannot exceed 180, so the result is again |a - b|.
a < 180, b < 0 -> a - b is certainly positive. Where |a - b| > 180, we should go around the other side of the circle, and that translates to 360 - |a - b|, which is exactly what the formula yields.
a < 0, b < 180 -> again, similar to the above: if the absolute difference is greater than 180, the result is 360 - |a - b|.
For the pandas part: rolling(2) creates sliding windows of size 2: (row 0, row 1), (row 1, row 2), ... With apply, the formula above is evaluated on every window, where x[0] is the first element (a) and x[1] is the second (b).
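To get the Delta_Value column the question asked for, the formula can be pulled into a small helper and assigned back to the frame. A sketch (angular_diff is just a name introduced here):
def angular_diff(a, b):
    # smallest absolute difference between two angles, in degrees
    return 180 - abs(abs(a - b) - 180)
test['Delta_Value'] = test['Value'].rolling(2).apply(
    lambda x: angular_diff(x[0], x[1]), raw=True)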