Python: set average values to outliers - python

I have a dataframe:
ID Value
A 70
A 80
A 1000
A 100
A 200
A 130
A 60
A 300
A 800
A 200
A 150
A 250
I need to replace the outliers with the median value.
I use:
import pandas as pd

df = pd.read_excel("test.xlsx")
grouped = df.groupby('ID')
statBefore = pd.DataFrame({
    'q1': grouped['Value'].quantile(.25),
    'median': grouped['Value'].median(),
    'q3': grouped['Value'].quantile(.75),
})

def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    q3 = statBefore.loc[row.ID]['q3']
    q1 = statBefore.loc[row.ID]['q1']
    if row.Value > (q3 + (3 * iq_range)) or row.Value < (q1 - (3 * iq_range)):
        return True
    else:
        return False

# apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
It gives me a median of 175 and a q1 of 92, but by hand I get 90 for q1; it gives me a q3 of 262.5, but by hand I get 275.
What is wrong here?

This is simple and performant, with no Python for-loops to slow it down:
s = pd.Series([30, 31, 32, 45, 50, 999]) # example data
s.where(s.between(*s.quantile([0.25, 0.75])), s.median())
It gives you:
0 38.5
1 38.5
2 32.0
3 45.0
4 38.5
5 38.5
Unpacking that code, we have s.quantile([0.25, 0.75]) to get this:
0.25 31.25
0.75 48.75
We then use the values (31.25 and 48.75) as arguments to between(), with the * operator to unpack them because between() expects two separate arguments, not an array of length 2. That gives us:
0 False
1 False
2 True
3 True
4 False
5 False
Now that we have the binary mask, we use s.where() to choose the original values at the True locations, and fall back to s.median() otherwise.
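The same idea can be applied per ID with the 3 × IQR fences from the question. This is only a sketch: the helper name replace_outliers, its parameter k and the Cleaned column are made up for illustration, and the data is the single group from the question.

```python
import pandas as pd

# Data from the question (a single group, ID "A")
df = pd.DataFrame({
    "ID": ["A"] * 12,
    "Value": [70, 80, 1000, 100, 200, 130, 60, 300, 800, 200, 150, 250],
})

def replace_outliers(s, k=3.0):
    """Replace values outside [q1 - k*IQR, q3 + k*IQR] with the group median.

    Hypothetical helper; k=3.0 mirrors the factor used in the question.
    """
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    inside = s.between(q1 - k * iqr, q3 + k * iqr)
    return s.where(inside, s.median())

# transform() calls the helper once per ID and keeps the original row order
df["Cleaned"] = df.groupby("ID")["Value"].transform(replace_outliers)
print(df)
```

With this data only 1000 and 800 lie above the upper fence (262.5 + 3 × 167.5 = 765), so only those two rows are replaced by the median, 175.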

This is just how quantiles are defined: pandas interpolates linearly between the two nearest data points.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([60, 70, 80, 100, 130, 150, 200, 200, 250, 300, 800, 1000]))
print(df.quantile(.25))
print(df.quantile(.50))
print(df.quantile(.75))
(The q1 for your data set is 95, by the way.)
The median is halfway between 150 and 200 (175).
The first quartile is three quarters of the way from 80 to 100 (95).
The third quartile is one quarter of the way from 250 to 300 (262.5).
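To make the arithmetic concrete, here is a small sketch of linear interpolation between neighbouring sorted values, checked against np.quantile. The function name linear_quantile is hypothetical; it is not how pandas implements quantiles internally, only a reproduction of the default "linear" rule.

```python
import numpy as np

values = np.sort(np.array([70, 80, 1000, 100, 200, 130, 60, 300, 800, 200, 150, 250]))

def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation (hypothetical helper for illustration)."""
    pos = q * (len(sorted_values) - 1)          # fractional index into the sorted data
    lo = int(np.floor(pos))                     # index of the value just below
    hi = min(lo + 1, len(sorted_values) - 1)    # index of the value just above
    frac = pos - lo                             # how far between the two neighbours
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

q1 = linear_quantile(values, 0.25)   # index 2.75: three quarters of the way from 80 to 100
med = linear_quantile(values, 0.50)  # index 5.5: halfway between 150 and 200
q3 = linear_quantile(values, 0.75)   # index 8.25: one quarter of the way from 250 to 300
print(q1, med, q3)
```

With 12 values the fractional index is q × 11, which is why the quartiles fall between data points rather than on them.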

Related

Assign new column in DataFrame based on if value is in a certain value range

I have two DataFrames as follows:
df_discount = pd.DataFrame(data={'Graduation' : np.arange(0,1000,100), 'Discount %' : np.arange(0,50,5)})
df_values = pd.DataFrame(data={'Sum' : [20,801,972,1061,1251]})
Now my goal is to get a new column df_values['New Sum'] for my df_values that applies the corresponding discount to df_values['Sum'] based on the value of df_discount['Graduation']. If the Sum is >= the Graduation the corresponding discount is applied.
Examples: Sum 801 should get a discount of 40% resulting in 480.6, Sum 1061 gets 45% resulting in 583.55.
I know I could write a function with if/else conditions that returns the values. However, is there a better way to do this when there are very many different conditions?
You could try if pd.merge_asof() works for you:
df_discount = pd.DataFrame({
    'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)
})
df_values = pd.DataFrame({'Sum': [20, 100, 101, 350, 801, 972, 1061, 1251]})
df_values = (
    pd.merge_asof(
        df_values, df_discount,
        left_on="Sum", right_on="Graduation",
        direction="backward"
    )
    .assign(New_Sum=lambda df: df["Sum"] * (1 - df["Discount %"] / 100))
    .drop(columns=["Graduation", "Discount %"])
)
Result (without the last .drop(columns=...) to see what's happening):
    Sum  Graduation  Discount %  New_Sum
0    20           0           0    20.00
1   100         100           5    95.00
2   101         100           5    95.95
3   350         300          15   297.50
4   801         800          40   480.60
5   972         900          45   534.60
6  1061         900          45   583.55
7  1251         900          45   688.05
pandas.cut() is made for problems like this, where you need to segment your data into bins (here, a discount % for each value range).
First define the column, the bin edges, and the label for each bin.
# The column we need to segment
col = df_values['Sum']
# The bin edges: [0, 100, 200, ..., 900, np.inf]; with right=False below they mean [0, 100), [100, 200), ..., [900, inf)
graduation = np.append(df_discount['Graduation'], np.inf)
# The label (i.e. discount) for each bin
discount = df_discount['Discount %']
Now call pandas.cut() and do the discount calculation. Passing right=False makes each bin include its left edge, so a Sum equal to a Graduation value gets that discount, matching the ">=" rule in the question.
df_values['Discount %'] = pd.cut(col,
                                 graduation,
                                 labels=discount,
                                 right=False)
# Convert the categorical label to an int for calculation
df_values['Discount %'] = df_values['Discount %'].astype(int)
df_values['New Sum'] = df_values['Sum'] * (1 - df_values['Discount %'] / 100)
    Sum  Discount %  New Sum
0    20           0    20.00
1   801          40   480.60
2   972          45   534.60
3  1061          45   583.55
4  1251          45   688.05
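One detail worth checking with pd.cut() is what happens to a Sum that sits exactly on a Graduation value. The sketch below compares the default right-closed bins with right=False; the test values in sums are made up to probe the edges.

```python
import numpy as np
import pandas as pd

df_discount = pd.DataFrame({
    "Graduation": np.arange(0, 1000, 100),
    "Discount %": np.arange(0, 50, 5),
})

# Made-up sums that sit on or next to bin edges
sums = pd.Series([0, 100, 101, 900])
bins = np.append(df_discount["Graduation"].to_numpy(), np.inf)
labels = df_discount["Discount %"].tolist()

# Default: bins are (0, 100], (100, 200], ... so 0 falls outside and 100 gets 0 %
right_closed = pd.cut(sums, bins, labels=labels)

# right=False: bins are [0, 100), [100, 200), ... so 100 gets 5 %, as ">=" requires
left_closed = pd.cut(sums, bins, labels=labels, right=False)

print(pd.DataFrame({"Sum": sums, "default": right_closed, "right=False": left_closed}))
```

With the default bins, a Sum of 0 comes back as NaN and a Sum of 100 is put in the 0 % bin, so right=False seems the closer match to the rule stated in the question.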
You can use pandas.DataFrame.mask, which replaces a value wherever your condition is true. For that to work, though, the Sum column has to be in the first dataframe.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html

Process and return data from a group of a group

I have a pandas dataframe of 3 variables, 2 categorical and 2 numeric.
ID  Trimester  State  Tax  rate
45  T1         NY     20   0.25
23  T3         FL     34   0.3
35  T2         TX     45   0.6
I would like to get a new table of the form:
ID  Trimester  State  Tax  rate  Tax_per_state_per_trimester
45  T1         NY     20   0.25  H
23  T3         FL     34   0.3   L
35  T2         TX     45   0.6   M
where the new variable 'Tax_per_state_per_trimester' is a categorical variable representing the tertiles of the corresponding subgroup, where L = first tertile, M = second tertile, H = last tertile.
I understand I can do a double grouping with:
df.groupby(['State', 'Trimester'])
but I don't know how to go from there.
I guess apply or transform with the quantile function should prove useful, but how?
Can you take a look and see if this gives you the results you want ?
df = pd.read_excel('Tax.xlsx')

def mx(tri, state):
    # Largest Tax within the given (Trimester, State) subgroup
    return df[(df['Trimester'].eq(tri)) & (df['State'].eq(state))] \
        .groupby(['Trimester', 'State'])['Tax'].max().iloc[0]

for i, v in df.iterrows():
    # Ratio of this row's Tax to its subgroup maximum, split into thirds
    t = v['Tax'] / mx(v['Trimester'], v['State'])
    df.loc[i, 'Tax_per_state_per_trimester'] = 'L' if t < 1/3 else 'M' if t < 2/3 else 'H'
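If true tertiles per subgroup are wanted (rather than thirds of the subgroup maximum), one possible sketch is groupby().transform() combined with pd.qcut(). The sample data and the helper name tertile_labels are invented for illustration; real groups need enough distinct Tax values, otherwise pd.qcut() raises an error about duplicate bin edges.

```python
import pandas as pd

# Invented sample data: two (State, Trimester) subgroups with six rows each
df = pd.DataFrame({
    "State": ["NY"] * 6 + ["FL"] * 6,
    "Trimester": ["T1"] * 6 + ["T3"] * 6,
    "Tax": [10, 20, 30, 40, 50, 60, 5, 15, 25, 35, 45, 55],
})

def tertile_labels(s):
    """Label each value L/M/H by the tertile it falls into within its group."""
    # q=3 splits the group at its 1/3 and 2/3 quantiles
    return pd.qcut(s, q=3, labels=["L", "M", "H"]).astype(str)

df["Tax_per_state_per_trimester"] = (
    df.groupby(["State", "Trimester"])["Tax"].transform(tertile_labels)
)
print(df)
```

Because transform() returns a result aligned with the original index, the new column can be assigned directly without any merge.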

Creating a dataframe based on 2 dataframes that have different lengths

I have two dataframes and I want to create a third one from them. I am trying to write code that does the following:
If A_pd["from"] and A_pd["To"] both fall within the range B_pd["from"] to B_pd["To"] of one row, add A_pd["from"], A_pd["To"] and B_pd["Value"] to the C_pd dataframe.
If A_pd["from"] falls within the range B_pd["from"] to B_pd["To"] of one row and A_pd["To"] falls within the range of the next row, I want to split the range A_pd["from"] to A_pd["To"] into two ranges, (A_pd["from"], B_pd["To"]) and (B_pd["To"], A_pd["To"]), each with its corresponding B_pd["Value"].
I created the following code:
import pandas as pd

A_pd = {'from': [0, 20, 80, 180, 250],
        'To': [20, 50, 120, 210, 300]}
A_pd = pd.DataFrame(A_pd)
B_pd = {'from': [0, 20, 100, 200],
        'To': [20, 100, 200, 300],
        'Value': [20, 17, 15, 12]}
B_pd = pd.DataFrame(B_pd)
for i in range(len(A_pd)):
    numberOfIntrupt = 0
    for j in range(len(B_pd)):
        if A_pd["from"].values[i] >= B_pd["from"].values[j] and A_pd["from"].values[i] > B_pd["To"].values[j]:
            numberOfIntrupt += 1
cols = ['C_from', 'C_To', 'C_value']
C_dp = pd.DataFrame(columns=cols, index=range(len(A_pd) + numberOfIntrupt))
for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a = A_pd["from"].values[i]
        b = A_pd["To"].values[i]
        c_eval = B_pd["Value"].values[j]
        range_s = B_pd["from"].values[j]
        range_f = B_pd["To"].values[j]
        if a >= range_s and a <= range_f and b >= range_s and b <= range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = b
            C_dp['C_value'].loc[i] = c_eval
        elif a >= range_s and b > range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = range_f
            C_dp['C_value'].loc[i] = c_eval
            C_dp['C_from'].loc[i+1] = range_f
            C_dp['C_To'].loc[i+1] = b
            C_dp['C_value'].loc[i+1] = B_pd["Value"].values[j+1]
print(C_dp)
The current result is C_dp:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 180 200 15
4 250 300 12
5 200 300 12
6 NaN NaN NaN
7 NaN NaN NaN
The expected result is:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
Thank you a lot for the support
I'm sure there is a better way to do this without loops, but this will help your logic flow.
import pandas as pd

A_pd = pd.DataFrame({'from': [0, 20, 80, 180, 250],
                     'To': [20, 50, 120, 210, 300]})
B_pd = pd.DataFrame({'from': [0, 20, 100, 200],
                     'To': [20, 100, 200, 300],
                     'Value': [20, 17, 15, 12]})
cols = ['C_from', 'C_To', 'C_value']
# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
spillover = False
for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a_from = A_pd["from"].values[i]
        a_to = A_pd["To"].values[i]
        b_from = B_pd["from"].values[j]
        b_to = B_pd["To"].values[j]
        b_value = B_pd['Value'].values[j]
        if a_from >= b_to:
            # a_from outside b range
            continue  # next b
        elif a_from >= b_from:
            # a_from within b range
            if a_to <= b_to:
                rows.append({"C_from": a_from, "C_To": a_to, "C_value": b_value})
                break  # next a
            else:
                # a range runs past the end of this b range: emit the first part
                rows.append({"C_from": a_from, "C_To": b_to, "C_value": b_value})
                if j < len(B_pd):
                    spillover = True
                continue
        if spillover:
            # remainder of the previous a range carried over into this b range
            if a_to <= b_to:
                rows.append({"C_from": b_from, "C_To": a_to, "C_value": b_value})
                spillover = False
                break
            else:
                rows.append({"C_from": b_from, "C_To": b_to, "C_value": b_value})
                spillover = True
                continue
C_dp = pd.DataFrame(rows, columns=cols)
print(C_dp)
Output
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
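For what it's worth, a loop-free variant is possible by treating the task as an interval intersection: pair every A range with every B range, clip each pair to its overlap, and keep the pairs whose overlap is non-empty. This is a sketch of a different technique from the loop above, and it assumes pandas 1.2 or later for how="cross"; the short names A, B, pairs and C are made up here.

```python
import pandas as pd

A = pd.DataFrame({"from": [0, 20, 80, 180, 250],
                  "To": [20, 50, 120, 210, 300]})
B = pd.DataFrame({"from": [0, 20, 100, 200],
                  "To": [20, 100, 200, 300],
                  "Value": [20, 17, 15, 12]})

# Every A row paired with every B row; shared column names get the suffixes
pairs = A.merge(B, how="cross", suffixes=("_a", "_b"))

# The overlap of two ranges starts at the later start and ends at the earlier end
pairs["C_from"] = pairs[["from_a", "from_b"]].max(axis=1)
pairs["C_To"] = pairs[["To_a", "To_b"]].min(axis=1)

# Keep only pairs that really overlap (ranges that merely touch are dropped)
C = (
    pairs.loc[pairs["C_from"] < pairs["C_To"], ["C_from", "C_To", "Value"]]
    .rename(columns={"Value": "C_value"})
    .reset_index(drop=True)
)
print(C)
```

The cross join builds len(A) × len(B) rows, so this trades memory for simplicity and is best suited to small or medium inputs.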

How can we select columns from a pandas dataframe based on a certain condition?

I have a pandas dataframe and I want to build a list of column names: a variable qualifies if its P_BUYER(%) column has one entry greater than or equal to 97 and the others below it. For each qualifying block, the name of the column next to T should be saved in the list. In the example below, the columns next to T are TENRCT, ADVNTG_MARITAL, NEWLSGOLFIN and ADV_INC, and the list should contain TENRCT and ADV_INC.
Input :
T TENRCT P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N (1,2,3) = Renter N (1,2,3) = Renter 35.88 0.1 33 8 2
Q <0> = Unknown Q <0> = Unknown 3.26 0.1 36 8 2
Q1 <4> = Owner Q <4> = Owner 60.86 99.8 143 5 1
E2
T ADVNTG_MARITAL P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
Q2<1> = 1+Marrd Q<1> = 1+Marrd 52.91 78.98 149 5 2
Q<2> = 1+Sngl Q<2> = 1+Sngl 45.23 17.6 39 8 3
Q1<3> = Mrrd_Sngl Q<3> = Mrrd_Sngl 1.87 3.42 183 4 1
E3
T ADV_INC P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N1('1','Y') = Yes N('1','Y') = Yes 3.26 1.2 182 4 1
N('0','-1')= No N('0','-1')= No 96.74 98.8 97 7 2
E2
output:
Finallist=['TENRCT','ADV_INC']
You can do it like this:
# In your code, you have 3 dataframes E1, E2, E3; iterate over them
output = []
for df in [E1, E2, E3]:
    # Filter your dataframe
    df = df[df['P_BUYER(%)'] >= 97]
    if not df.empty:
        cols = df.columns.values.tolist()
        # Find index of 'T' column
        t_index = cols.index('T')
        # Your desired parallel column will be at t_index+1
        output.append(cols[t_index+1])
print(output)
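A self-contained version of the same idea might look as follows. The three frames are hypothetical stand-ins that keep only the T column, the variable column and P_BUYER(%) from the question's blocks, and the function name flagged_columns is made up.

```python
import pandas as pd

# Hypothetical stand-ins for the three blocks in the question (reduced columns)
E1 = pd.DataFrame({"T": ["N", "Q", "Q1"],
                   "TENRCT": ["Renter", "Unknown", "Owner"],
                   "P_BUYER(%)": [0.1, 0.1, 99.8]})
E2 = pd.DataFrame({"T": ["Q2", "Q", "Q1"],
                   "ADVNTG_MARITAL": ["1+Marrd", "1+Sngl", "Mrrd_Sngl"],
                   "P_BUYER(%)": [78.98, 17.6, 3.42]})
E3 = pd.DataFrame({"T": ["N1", "N"],
                   "ADV_INC": ["Yes", "No"],
                   "P_BUYER(%)": [1.2, 98.8]})

def flagged_columns(frames, threshold=97):
    """Return the name of the column right after 'T' for each frame
    that has at least one P_BUYER(%) value at or above the threshold."""
    result = []
    for frame in frames:
        if (frame["P_BUYER(%)"] >= threshold).any():
            cols = frame.columns.tolist()
            result.append(cols[cols.index("T") + 1])
    return result

print(flagged_columns([E1, E2, E3]))
```

Using .any() on the boolean mask avoids building a filtered copy of each frame just to test whether it is empty.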

Inverse Score in Python

I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output,
value score
54 scaled value
74 scaled value
71 scaled value
78 50.000
12 600.00
I want to assign every value a score between 50 and 600, with the lowest value getting the highest score. Do you have an idea?
Not sure what you want to achieve, maybe you could provide the exact expected output for this input.
But if I understand well, maybe you could try
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
# Use vmin/vmax so the built-in min() and max() are not shadowed
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
value score
0 54 250.000000
1 74 83.333333
2 71 108.333333
3 78 50.000000
4 12 600.000000
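The same linear mapping can also be written with np.interp(), which maps the value range onto the score range in one call. This is only an alternative sketch; it assumes the scores should be spread linearly, as in the code above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [54, 74, 71, 78, 12]})

vmin, vmax = df["value"].min(), df["value"].max()

# Map vmin -> 600 and vmax -> 50, interpolating linearly in between.
# The x-coordinates must be increasing; the y-coordinates may decrease.
df["score"] = np.interp(df["value"], [vmin, vmax], [600, 50])
print(df)
```

Swapping the two endpoints in the last argument is all it takes to reverse the direction of the scale.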
This is my idea. But I think the scale for your scores is missing from your question.
import numpy as np
dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange
# Pin the extremes to 600 and 50; everything else is offset from 600
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                              np.where(df['value'] == dfmax,
                                       50,
                                       600 - ((df['value'] - dfmin) * (1 / score_value))))
df
that produces:
value score
0 54 594.96
1 74 592.56
2 71 592.92
3 78 50.00
4 12 600.00
Not matching your output, because of the missing scale.
