Create custom buckets for df based on column - python

I want to add a new column with custom buckets (see example below) based on the price values in the price column.
< 400 = low
>=401 and <=1000 = medium
>1000 = high
Table
product_id price
2 1203
4 500
5 490
6 200
3 429
5 321
Output table
product_id price price_category
2 1203 high
4 500 medium
5 490 medium
6 200 low
3 429 medium
5 321 low
This is what I have tried so far:
import numpy as np
from numba import njit

# compile the binning loop with numba for speed
@njit
def cut(arr):
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x <= 50):
            bins[idx] = 1
        elif (x >= 51) & (x <= 100):
            bins[idx] = 2
        elif (x >= 101) & (x <= 250):
            bins[idx] = 3
        elif (x >= 251) & (x <= 1000):
            bins[idx] = 4
        else:
            bins[idx] = 5
    return bins
a = cut(df2['average_listings'].to_numpy())

conversion_dict = {1: 'S',
                   2: 'M',
                   3: 'L',
                   4: 'XL',
                   5: 'XXL'}
bins = list(map(conversion_dict.get, a))
--> But I am struggling to add this to the main df

pandas has its own cut method. Specify the bin edges and the corresponding labels.
df['price_category'] = pd.cut(df.price, [-np.inf, 400, 1000, np.inf],
                              labels=['low', 'medium', 'high'])
product_id price price_category
0 2 1203 high
1 4 500 medium
2 5 490 medium
3 6 200 low
4 3 429 medium
5 5 321 low
Without the labels argument, you get back the exact bins used for the data (closed on the right by default), which in this case are:
Categories (3, interval[float64]): [(-inf, 400.0] < (400.0, 1000.0] < (1000.0, inf]]
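If you need a price of exactly 400 or 1000 to land in a particular bucket, the closure can be flipped with right=False, which makes the intervals left-closed (a sketch using the same edges; note that 1000 then falls into 'high'):

# intervals become [-inf, 400), [400, 1000), [1000, inf)
df['price_category'] = pd.cut(df.price, [-np.inf, 400, 1000, np.inf],
                              labels=['low', 'medium', 'high'], right=False)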

You can use np.select:
conditions = [
    df['price'].lt(400),
    df['price'].ge(401) & df['price'].le(1000),
    df['price'].gt(1000)]
choices = ['low', 'medium', 'high']
df['price_category'] = np.select(conditions, choices)
# print(df)
product_id price price_category
0 2 1203 high
1 4 500 medium
2 5 490 medium
3 6 200 low
4 3 429 medium
5 5 321 low
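Note that np.select fills rows that match none of the conditions with 0 by default; with these conditions a price of exactly 400 (or anything between 400 and 401) would hit that fallback, so an explicit default may be safer, e.g.:

# rows matching no condition now get 'unknown' instead of 0
df['price_category'] = np.select(conditions, choices, default='unknown')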

A simple solution would be something like this:
df.loc[df.price < 400, 'price_category'] = 'low'
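Extended to all three buckets from the question, that could look like this (a sketch; between is inclusive on both ends):

df.loc[df.price < 400, 'price_category'] = 'low'
df.loc[df.price.between(401, 1000), 'price_category'] = 'medium'
df.loc[df.price > 1000, 'price_category'] = 'high'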

Related

More than one condition met with numpy select

I have the following dataframe:
A B C D E F
100 0 0 0 100 0
0 100 0 0 0 100
-100 0 0 0 100 0
and this code:
cond = [
    (df['A'] == 100),
    (df['A'] == -100),
    (df['B'] == 100),
    (df['C'] == 100),
    (df['D'] == 100),
    (df['E'] == 100),
    (df['F'] == 100),
]
choices = ['A','neg_A', 'B', 'C','D', 'E', 'F']
df['result'] = np.select(cond, choices)
For both rows there will be two matches, but I want only one to be selected. I want the selection to be made with these criteria:
+A = 67%
-A = 68%
B = 70%
C = 75%
D = 66%
E = 54%
F = 98%
The percentage shows the accuracy rate, so I would want the one with the highest percentage to be preferred over the other.
Intended result:
A B C D E F result
100 0 0 0 100 0 A
0 100 0 0 0 100 F
-100 0 0 0 100 0 neg_A
A little help will be appreciated. Thanks!
EDIT:
Some of the columns (like A) may have a mix of 100 and -100. Positive 100 will yield a simple A (see row 1) but a -100 should yield some other name like "neg_A" in the result (see row 3).
Let's sort the columns of the dataframe based on the priority values, then use .eq + .idxmax on axis=1 to get the column name with the first occurrence of 100:
# define a dict with col names and priority values
d = {'A': .67, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}
df['result'] = df[sorted(d, key=lambda x: -d[x])].eq(100).idxmax(axis=1)
A B C D E F result
0 100 0 0 0 100 0 A
1 0 100 0 0 0 100 F
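To also cover the -100 case from the edit, one option (a sketch, not part of the answer above) is to keep np.select but order the conditions by accuracy, so the first matching condition is always the most accurate one:

import numpy as np

# accuracy per outcome, including the negative-A case taken from the question
acc = {'A': .67, 'neg_A': .68, 'B': .70, 'C': .75, 'D': .66, 'E': .54, 'F': .98}

cond_map = {
    'A': df['A'].eq(100),
    'neg_A': df['A'].eq(-100),
    'B': df['B'].eq(100),
    'C': df['C'].eq(100),
    'D': df['D'].eq(100),
    'E': df['E'].eq(100),
    'F': df['F'].eq(100),
}

# sort outcomes by accuracy, highest first, so np.select picks the best match
order = sorted(cond_map, key=lambda k: -acc[k])
df['result'] = np.select([cond_map[k] for k in order], order, default='none')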

how to use pandas for condition selection

I have a table like this
AREA AMOUNT
A 1000
A 10
B 30
B 3000
C 22
D 300
What I want to get is: more than 100 in AREA A, more than 100 in AREA B, less than 100 in AREA C, and more than 100 in AREA D. I have many of these kinds of areas to analyse.
so what I want to get is below.
AREA AMOUNT
A 1000
B 3000
C 22
D 300
You can use .isin() for the three areas that need > 100, and == for just the C area, combining the conditions with & and | for and and or. Pay attention to parentheses here:
df = df[((df['AREA'].isin(['A','B','D'])) & (df['AMOUNT'] > 100)) |
        ((df['AREA'] == 'C') & (df['AMOUNT'] < 100))]
df
Out[1]:
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300
You can also write it this way by creating a custom function for setting up the condition:
import operator
ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
g = lambda x, y, z: (df['AREA'].eq(x)) & (ops[z](df['AMOUNT'], y))
df[g('A', 100, 'gt')| g('B', 100, 'gt') | g('C', 100, 'lt') | g('D', 100, 'gt') ]
AREA AMOUNT
0 A 1000
3 B 3000
4 C 22
5 D 300
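If there are many areas, the per-area rules can also be kept in a dict (hypothetical names below) and combined in a loop, reusing the ops mapping and df from above:

# hypothetical rules: area -> (threshold, comparison name from ops)
rules = {'A': (100, 'gt'), 'B': (100, 'gt'), 'C': (100, 'lt'), 'D': (100, 'gt')}

mask = pd.Series(False, index=df.index)
for area, (threshold, op_name) in rules.items():
    mask |= df['AREA'].eq(area) & ops[op_name](df['AMOUNT'], threshold)

df[mask]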

rolling mean with a moving window

My dataframe has a daily price column and a window size column :
df = pd.DataFrame(columns=['price', 'window'],
                  data=[[100, 1], [120, 2], [115, 2], [116, 2], [100, 4]])
df
price window
0 100 1
1 120 2
2 115 2
3 116 2
4 100 4
I would like to compute the rolling mean of price for each row using the window of the window column.
The result would be this :
df
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
I don't find any elegant way to do it with apply and I refuse to loop over each row of my DataFrame...
The best solutions, in terms of raw speed and complexity, are based on the idea of a summed-area table (here a one-dimensional prefix sum). Below you can find several approaches, ranked from best to worst.
Numpy + Linear complexity
size = len(df['price'])

# prefix sums of price, with a leading zero so price[i + 1] - price[start] spans the window
price = np.zeros(size + 1)
price[1:] = df['price'].values.cumsum()

# start index of each row's window, clipped at 0 so the index never goes negative
window = np.clip(np.arange(size) - (df['window'].values - 1), 0, None)

df['rolling_mean_price'] = (price[1:] - price[window]) / df['window'].values
print(df)
Output
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
Loopy + Linear complexity
# cumulative sum of price; subtract the prefix just before the window start,
# unless the window covers the whole head of the series
price = df['price'].values.cumsum()
df['rolling_mean_price'] = [(price[i] - float((i - w) > -1) * price[i - w]) / w
                            for i, w in enumerate(df['window'])]
Loopy + Quadratic complexity
# slice the last w prices for each row and take their mean directly
price = df['price'].values
df['rolling_mean_price'] = [price[i - (w - 1):i + 1].mean() for i, w in enumerate(df['window'])]
I would not recommend this approach using pandas.DataFrame.apply() (reasons described here), but if you insist on it, here is one solution:
df['rolling_mean_price'] = df.apply(
    lambda row: df.rolling(row.window).price.mean().iloc[row.name], axis=1)
The output looks like this:
>>> print(df)
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
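As a sanity check, any of the variants above can be compared against a straightforward per-row slice (a sketch assuming the small df from the question, where the index matches the row position):

expected = df.apply(
    lambda row: df.price.iloc[max(row.name - int(row.window) + 1, 0):row.name + 1].mean(),
    axis=1)
assert np.allclose(df['rolling_mean_price'], expected)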

Pandas Group by a range

I have a data like
{a: 100, b: 102, c: 500, d: 99, e: 78, f: 88}
I want group it by a range with interval of 100.
Example:
{ 100: 2, 0: 3, 500:1 }
that is, in English:
2 occurrences of a number between 100..199
1 occurrence of a number between 500..599
3 occurrences of a number between 0..99
How to express this in pandas?
IIUC, group by a range is usually pd.cut:
d = {'a': 100, 'b': 102, 'c': 500, 'd': 99, 'e': 78, 'f': 88}
bins = np.arange(0,601,100)
pd.cut(pd.Series(d), bins=bins, labels=bins[:-1], right=False).value_counts(sort=False)
Output:
0 3
100 2
200 0
300 0
400 0
500 1
dtype: int64
Update: actually, pd.cut seems like overkill and your case is a bit easier:
(pd.Series(d)//100).value_counts(sort=False)
Output:
0 3
1 2
5 1
dtype: int64
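To get back the exact dict shape from the question (keys being the lower bound of each range), the bucket index can be scaled up again, e.g.:

counts = (pd.Series(d) // 100).value_counts(sort=False)
counts.index = counts.index * 100   # bucket number -> lower bound of the range
counts.to_dict()
# counts: 0 -> 3, 100 -> 2, 500 -> 1 (key order may differ)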
Solution using the maximal value of the Series to build the bins, with all bin edges except the last (b[:-1]) used as labels in cut, then counting values with GroupBy.size:
d = {'a' : 100, 'b':102, 'c':500, 'd':99, 'e':78, 'f':88}
s = pd.Series(d)
max1 = int(s.max() // 100 + 1) * 100
b = np.arange(0, max1 + 100, 100)
print (b)
[ 0 100 200 300 400 500 600]
d1 = s.groupby(pd.cut(s, bins=b, labels=b[:-1], right=False)).size().to_dict()
print (d1)
{0: 3, 100: 2, 200: 0, 300: 0, 400: 0, 500: 1}

Fill with the values from the nearest neighbor by comparing another column in Pandas

I have a dataframe like this:
azimuth id
15 100
15 1
15 100
150 2
150 100
240 3
240 100
240 100
350 100
What I need is to replace the 100 values with the id from the row whose azimuth is the closest:
Desired output:
azimuth id
15 1
15 1
15 1
150 2
150 2
240 3
240 3
240 3
350 1
350 is near 15 because these are angles on a circle: the difference is 25.
What I have:
def mysubstitution(x):
    for i in x.index[x['id'] == 100]:
        i = int(i)
        diff = (x['azimuth'] - x.loc[i, 'azimuth']).abs()
        for ind in diff.index:
            if diff[ind] > 180:
                diff[ind] = 360 - diff[ind]
            else:
                pass
        exclude = [y for y in x.index if y not in x.index[x['id'] == 100]]
        closer_idx = diff[exclude]
        closer_df = pd.DataFrame(closer_idx)
        sorted_df = closer_df.sort_values('azimuth', ascending=True)
        try:
            a = sorted_df.index[0]
            x.loc[i, 'id'] = x.loc[a, 'id']
        except Exception as a:
            print(a)
    return x
Which works ok most of the time, but I guess there is some simpler solution.
Thanks in advance.
I tried to implement the functionality in two steps. First, I built another dataframe grouped by azimuth that holds the id value for each azimuth (for values other than 100).
Then, using this grouped dataframe I implemented the replaceAzimuth function, which takes each row in the dataframe and first checks whether an id already exists for that azimuth. If so, it uses it directly. Otherwise, it replaces the id with the one belonging to the closest azimuth in the grouped dataframe.
Here is the implementation:
df = pd.DataFrame([[15, 100], [15, 1], [15, 100], [150, 2], [150, 100],
                   [240, 3], [240, 100], [240, 100], [350, 100]],
                  columns=['azimuth', 'id'])
df_non100 = df[df['id'] != 100]
df_grouped = df_non100.groupby(['azimuth'])['id'].min().reset_index()

def replaceAzimuth(df_grouped, id_val):
    real_id = df_grouped[df_grouped['azimuth'] == id_val['azimuth']]['id']
    if real_id.size == 0:
        # work on a copy so the computed distances do not overwrite df_grouped itself
        df_diff = df_grouped.copy()
        df_diff['azimuth'] = df_diff['azimuth'].apply(
            lambda x: min(abs(id_val['azimuth'] - x), (360 - id_val['azimuth'] + x)))
        id_val['id'] = df_grouped.iloc[df_diff['azimuth'].idxmin()]['id']
    else:
        id_val['id'] = real_id.iloc[0]
    return id_val

df = df.apply(lambda x: replaceAzimuth(df_grouped, x), axis=1)
df
For me, the code seems to give the output you have shown, but I'm not sure it will work in all cases!
First, set all ids to NaN where they are 100.
df.id = np.where(df.id==100, np.nan, df.id)
Then calculate the pairwise angle differences and take the id of the closest row to fill the NaNs.
df.id = df.id.combine_first(
    # pairwise absolute angle difference, wrapped into [0, 180]
    pd.DataFrame(np.abs(((df.azimuth.values[:, None] - df.azimuth.values) + 180) % 360 - 180))
    # for each row, the positions of the other rows ordered by distance
    .pipe(np.argsort)
    # map those positions to their id values (NaN where the id was 100)
    .applymap(lambda x: df.id.iloc[x])
    # take the first non-NaN id, i.e. the id of the nearest row that has one
    .apply(lambda x: x.dropna().iloc[0], axis=1)
)
df
df
azimuth id
0 15 1.0
1 15 1.0
2 15 1.0
3 150 2.0
4 150 2.0
5 240 3.0
6 240 3.0
7 240 3.0
8 350 1.0
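The ((diff + 180) % 360 - 180) trick in the pairwise step maps every angle difference into [-180, 180), which is what handles the wrap-around between 350 and 15; a quick check:

import numpy as np

a = np.array([350, 15, 240])
np.abs(((a[:, None] - a) + 180) % 360 - 180)
# the distance between 350 and 15 comes out as 25, not 335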
