Data manipulation in Python

I am trying to get the mode of the model names across the iteration columns and then find the error associated with that model.
If two or more modes are returned, we select the model with the least error among them.
import pandas as pd
data1 = {'Iteration1': ["M2",'M1',"M3","M5","M4","M6"],
'Iteration1_error': [96,98,34,19,22,9],
'Iteration2': ["M3",'M1',"M1","M5","M6","M4"],
'Iteration2_error': [76,88,54,12,92,19],
'Iteration3': ["M3",'M1',"M1","M5","M6","M4"],
'Iteration3_error': [66,68,84,52,72,89]}
Input1 = pd.DataFrame(data1,
columns=['Iteration1','Iteration1_error','Iteration2','Iteration2_error','Iteration3','Iteration3_error'],
index=['I1', 'I2','I3','I4','I5','I6'])
print(Input1)
data2 = {'Iteration1': ["M2",'M1',"M3","M5","M4","M6"],
'Iteration1_error': [96,98,34,19,22,9],
'Iteration2': ["M3",'M1',"M1","M5","M6","M4"],
'Iteration2_error': [76,88,54,12,92,19],
'Iteration3': ["M3",'M1',"M1","M5","M6","M4"],
'Iteration3_error': [66,68,84,52,72,89],
'Mode of model name in all iterations':['M3','M1','M1','M5','M6','M4'],
'Best model error':[66,68,54,12,72,19]
}
Output1 = pd.DataFrame(data2,
columns=['Iteration1','Iteration1_error','Iteration2','Iteration2_error','Iteration3','Iteration3_error','Mode of model name in all iterations','Best model error'],
index=['I1', 'I2','I3','I4','I5','I6'])
print(Output1)
Question: So we are expecting an output with two extra columns at the end: the first gives the mode of the model names across the iteration columns, the second gives the error of that mode; the first six columns are the input dataframe. If two or more modes are returned, for example ("M1","M2","M3") where all three values are different and so there are technically three modes, the model with the least error is selected.
What I tried: I was able to get the mode across the model columns using .mode(numeric_only=False), but the issue I am stuck on is how to get the error corresponding to that mode from the 2nd, 4th and 6th columns.

Use:
# filter only the model-name columns, i.e. 'Iteration' followed by a number
df = Input1.filter(regex=r'Iteration\d+$')
# get the first mode per row
s = df.mode(axis=1).iloc[:, 0]
# compare df against the mode and add the '_error' suffix so the boolean mask
# aligns with the error columns, then take the minimum matching error per row
s1 = Input1.where(df.eq(s, axis=0).add_suffix('_error')).min(axis=1)
# add the new columns
Output1 = Input1.assign(best_mode=s, best_error=s1)
print(Output1)
   Iteration1  Iteration1_error Iteration2  Iteration2_error Iteration3  \
I1         M2                96         M3                76         M3
I2         M1                98         M1                88         M1
I3         M3                34         M1                54         M1
I4         M5                19         M5                12         M5
I5         M4                22         M6                92         M6
I6         M6                 9         M4                19         M4

    Iteration3_error best_mode  best_error
I1                66        M3        66.0
I2                68        M1        68.0
I3                84        M1        54.0
I4                52        M5        12.0
I5                72        M6        72.0
I6                89        M4        19.0
Another idea, if it is possible to rely on paired model/error columns (the data has to contain all pairs next to each other, in order):
df = Input1.iloc[:, ::2]
s = df.mode(axis=1).iloc[:, 0]
s1 = Input1.iloc[:, 1::2].where(df.eq(s, axis=0).to_numpy()).min(axis=1)
Output1 = Input1.assign(best_mode = s, best_error=s1)
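As a quick sanity check (a sketch reusing the expected values from data2 above), you can confirm that either variant reproduces the hand-built result:
# expected columns taken from the Output1 example in the question
expected_mode = pd.Series(['M3', 'M1', 'M1', 'M5', 'M6', 'M4'], index=Input1.index)
expected_error = pd.Series([66, 68, 54, 12, 72, 19], index=Input1.index, dtype=float)

print(Output1['best_mode'].equals(expected_mode))    # True
print(Output1['best_error'].equals(expected_error))  # True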

Related

Match standalone words in string in DataFrame column

I have a dataframe such as:
import pandas as pd
import re
df = pd.DataFrame({"Name": ["D1", "D2", "D3", "D4", "M1", "M2", "M3"],
"Requirements": ["3 meters|2/3 meters|3.5 meters",
"3 meters",
"3/5 meters|3 meters",
"2/3 meters",
"steel|g1_steel",
"steel",
"g1_steel"]})
dataframe df
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
3 D4 2/3 meters
4 M1 steel|g1_steel
5 M2 steel
6 M3 g1_steel
I have a list of words req_list = ['3 meters', 'steel'] and I am trying to extract rows from df where the strings in column Requirements contain standalone words that are from req_list. This is what I have done:
This one prints just D2 and M2
df[df.Requirements.apply(lambda x: any(len(x.replace(y, '')) == 0 for y in req_list))]
This one prints all rows
df[df['Requirements'].str.contains(fr"\b(?:{'|'.join(req_list)})\b")]
My desired result is as follows:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
In this desired output, D4 and M3 are eliminated because they do not have words from req_list as standalone strings. Is there any way to achieve this, preferably in a one-liner without using custom functions?
EDIT
The strings in the column Requirements can come in any pattern such as:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
3 D4 2/3 meters
4 D5 3::3 meters # New pattern which needs to be eliminated
5 D6 3.3 meters # New pattern which needs to be eliminated
6 D7 3?3 meters # New pattern which needs to be eliminated
7 M1 steel|g1_steel
8 M2 steel
9 M3 g1_steel
Since you want to make sure you do not match 3 meters when it is preceded by a digit + /, you may add a (?<!\d/) negative lookbehind after the initial word boundary:
df[df['Requirements'].str.contains(fr"\b(?<!\d/)(?:{'|'.join(req_list)})\b")]
Output:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
See the regex demo.
Notes
Since req_list contains phrases (multiword strings), you might have to sort the items by length in descending order before joining them with the | OR operator, so it is better to use fr"\b(?<!\d/)(?:{'|'.join(sorted(req_list, key=len, reverse=True))})\b" as the regex.
If req_list ever contains items with special characters, you should also use adaptive dynamic word boundaries, i.e. fr"(?!\B\w)(?<!\d/)(?:{'|'.join(sorted(map(re.escape, req_list), key=len, reverse=True))})(?<!\w\B)".
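For instance, here is a minimal sketch of that second note, assuming req_list = ['3 meters', 'steel'] and the df defined in the question:
# build the pattern with escaped items, longest first, and adaptive boundaries
pattern = (r"(?!\B\w)(?<!\d/)(?:"
           + "|".join(sorted(map(re.escape, req_list), key=len, reverse=True))
           + r")(?<!\w\B)")
# on the question's original df this selects the same five rows (D1, D2, D3, M1, M2)
print(df[df['Requirements'].str.contains(pattern)])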
Here is one more way to do it:
def chk(row):
    for r in row:
        if r.strip() in req_list:
            return True
    return False

df[df.assign(lst=df['Requirements'].str.split('|'))['lst'].apply(chk) == True]
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel

How can we select columns from a pandas dataframe based on a certain condition?

I have a pandas dataframe and I want to create a list of column names, one per block, whenever that block's P_BUYER column has an entry greater than or equal to 97 (and the others less). For example, below, the list should contain TENRCT and ADV_INC. If P_BUYER has a value greater than or equal to 97, then the value that sits in parallel to T for that particular block should be saved in a list (e.g. the values in parallel to T in the example below are TENRCT, ADVNTG_MARITAL, NEWLSGOLFIN, ADV_INC).
Input:
T TENRCT P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N (1,2,3) = Renter N (1,2,3) = Renter 35.88 0.1 33 8 2
Q <0> = Unknown Q <0> = Unknown 3.26 0.1 36 8 2
Q1 <4> = Owner Q <4> = Owner 60.86 99.8 143 5 1
E2
T ADVNTG_MARITAL P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
Q2<1> = 1+Marrd Q<1> = 1+Marrd 52.91 78.98 149 5 2
Q<2> = 1+Sngl Q<2> = 1+Sngl 45.23 17.6 39 8 3
Q1<3> = Mrrd_Sngl Q<3> = Mrrd_Sngl 1.87 3.42 183 4 1
E3
T ADV_INC P_NONBUY(%) P_BUYER(%) INDEX PBIN NEWBIN
N1('1','Y') = Yes N('1','Y') = Yes 3.26 1.2 182 4 1
N('0','-1')= No N('0','-1')= No 96.74 98.8 97 7 2
E2
Output:
Finallist = ['TENRCT', 'ADV_INC']
You can do it like this:
# In your code, you have 3 dataframes E1, E2, E3 - iterate over them
output = []
for df in [E1, E2, E3]:
    # Filter the dataframe on the condition
    df = df[df['P_BUYER(%)'] >= 97]
    if not df.empty:
        cols = df.columns.values.tolist()
        # Find the index of the 'T' column
        t_index = cols.index('T')
        # The desired parallel column is at t_index + 1
        output.append(cols[t_index + 1])
print(output)
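As a minimal, self-contained sketch of that logic (the E3-like dataframe below is a hypothetical, partial reconstruction of the ADV_INC block above, with only the columns needed for the example):
import pandas as pd

# hypothetical reconstruction of the third block (ADV_INC) from the question
E3 = pd.DataFrame({
    'T': ["N1('1','Y') = Yes", "N('0','-1')= No"],
    'ADV_INC': ["N('1','Y') = Yes", "N('0','-1')= No"],
    'P_NONBUY(%)': [3.26, 96.74],
    'P_BUYER(%)': [1.2, 98.8],
})

output = []
for df in [E3]:
    hit = df[df['P_BUYER(%)'] >= 97]
    if not hit.empty:
        cols = df.columns.tolist()
        output.append(cols[cols.index('T') + 1])

print(output)  # ['ADV_INC']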

How to compute a custom weighted average, handling NaN values, in pandas?

I have a data frame df_ss_g as
ent_id,WA,WB,WC,WD
123,0.045251836,0.614582906,0.225930615,0.559766482
124,0.722324239,0.057781167,,0.123603561
125,,0.361074325,0.768542766,0.080434134
126,0.085781742,0.698045853,0.763116684,0.029084545
127,0.909758657,,0.760993759,0.998406211
128,,0.32961283,,0.90038336
129,0.714585519,,0.671905291,
130,0.151888772,0.279261613,0.641133263,0.188231227
Now I have to compute a weighted average (AVG_WEIGHTAGE) based on the weightage, i.e. (WA*0.5 + WB*1 + WC*0.5 + WD*1) / (0.5 + 1 + 0.5 + 1).
But when I compute it using the method below, i.e.
df_ss_g['AVG_WEIGHTAGE'] = df_ss_g.apply(lambda x: ((x['WA']*0.5)+(x['WB']*1)+(x['WC']*0.5)+(x['WD']*1))/(0.5+1+0.5+1), axis=1)
it outputs NaN as AVG_WEIGHTAGE for any row that contains a NaN value, which is wrong.
All I want is that nulls should not be counted in either the numerator or the denominator,
e.g.
ent_id,WA,WB,WC,WD,AVG_WEIGHTAGE
128,,0.32961283,,0.90038336,0.614998095 i.e. (WB*1 + WD*1)/(1 + 1)
129,0.714585519,,0.671905291,,0.693245405 i.e. (WA*0.5 + WC*0.5)/(0.5 + 0.5)
IIUC:
import numpy as np
weights = np.array([0.5, 1, 0.5, 1])
values = df.drop('ent_id', axis=1)
df['AVG_WEIGHTAGE'] = np.dot(values.fillna(0).to_numpy(), weights)/np.dot(values.notna().to_numpy(), weights)
df['AVG_WEIGHTAGE']
0 0.436647
1 0.217019
2 0.330312
3 0.383860
4 0.916891
5 0.614998
6 0.693245
7 0.288001
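A pandas-only variant of the same idea (a sketch, not part of the original answer) multiplies by a weight Series and lets sum skip the NaNs:
# hypothetical weight mapping matching the question's weightage
w = pd.Series({'WA': 0.5, 'WB': 1, 'WC': 0.5, 'WD': 1})
vals = df[w.index]
# NaN * weight stays NaN and is skipped by sum(); the denominator only
# counts the weights of the non-null cells
df['AVG_WEIGHTAGE'] = vals.mul(w).sum(axis=1) / vals.notna().mul(w).sum(axis=1)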
Try this method using dot products (note: it assumes ent_id is the index of df, so that each row passed to av holds only the four weight columns):
def av(t):
    # Define weights
    wt = [0.5, 1, 0.5, 1]
    # Create a vector with 0 for null and 1 for non-null
    nulls = [int(i) for i in ~t.isna()]
    # Weighted sum of the values, with nulls filled as 0
    t_new = np.dot(t.fillna(0), wt)
    # Sum of the weights that belong to non-null values
    wt_new = np.dot(nulls, wt)
    # return the division
    return np.divide(t_new, wt_new)

df['WEIGHTED AVG'] = df.apply(av, axis=1)
df = df.reset_index()
print(df)
   ent_id        WA        WB        WC        WD  WEIGHTED AVG
0     123  0.045252  0.614583  0.225931  0.559766      0.436647
1     124  0.722324  0.057781       NaN  0.123604      0.217019
2     125       NaN  0.361074  0.768543  0.080434      0.330312
3     126  0.085782  0.698046  0.763117  0.029085      0.383860
4     127  0.909759       NaN  0.760994  0.998406      0.916891
5     128       NaN  0.329613       NaN  0.900383      0.614998
6     129  0.714586       NaN  0.671905       NaN      0.693245
7     130  0.151889  0.279262  0.641133  0.188231      0.288001
It boils down to masking the nan values with 0 so they don't contribute to either weights or sum:
# this is the weights
weights = np.array([0.5,1,0.5,1])
# the columns of interest
s = df.iloc[:,1:]
# where the valid values are
mask = s.notnull()
# use `fillna` and then `@` for matrix multiplication
df['AVG_WEIGHTAGE'] = (s.fillna(0) @ weights) / (mask @ weights)
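As a quick check (a sketch, assuming df holds the question's data with ent_id as a regular column), the two rows worked out by hand in the question come back as expected:
# rows 128 and 129 from the question's worked example
print(df.loc[df['ent_id'].isin([128, 129]), 'AVG_WEIGHTAGE'].round(6).tolist())
# [0.614998, 0.693245]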

Using pandas to identify nearest objects

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience using these and thought it would be a good learning experience. I was able to complete the assignment using traditional loops that I know from traditional computer programming, and it ran okay over thousands of rows, but it brought my laptop down to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
The two dataframes above correspond to this:
start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. In the case of the illustration above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output needs to be appended to the cars dataframe along with the following new columns:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
   id  start  end back_id  back_start  back_end  back_distance across_id  across_start  across_end front_id  front_start  front_end  front_distance
0  C1    200  215      T1         115       175             25        T2           200         260
1  C2    110  125      T4          25        85             25                                            T1          115        175             -10
2  C3    240  255                                                      T2           200         260       T3          280        340              25
Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.
So with pandas you can use merge_asof. Here is one way, though it may not be efficient with millions of rows:
# first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

# create back condition
df_back = pd.merge_asof(trucks.rename(columns={col: f'back_{col}'
                                               for col in trucks.columns}),
                        cars.assign(back_end=lambda x: x['end']),
                        on='back_end', direction='forward')\
            .query('end>back_end')\
            .assign(back_distance=lambda x: x['start']-x['back_end'])

# create across condition: here note that cars is the first of the 2 dataframes
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col: f'across_{col}'
                                                 for col in trucks.columns}),
                          on=['across_start'], direction='backward')\
              .query('end<=across_end')

# create front condition
df_front = pd.merge_asof(trucks.rename(columns={col: f'front_{col}'
                                                for col in trucks.columns}),
                         cars.assign(front_start=lambda x: x['start']),
                         on='front_start', direction='backward')\
             .query('start<front_start')\
             .assign(front_distance=lambda x: x['front_start']-x['end'])

# merge all back to cars
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')
and you get
print (df_f)
   id  start  end back_id  back_start  back_end  back_distance  across_start  \
0  C2    110  125      T4        25.0      85.0           25.0           NaN
1  C1    200  215      T1       115.0     175.0           25.0         200.0
2  C3    240  255     NaN         NaN       NaN            NaN         240.0

  across_id  across_end front_id  front_start  front_end  front_distance
0       NaN         NaN       T1        115.0      175.0           -10.0
1        T2       260.0      NaN          NaN        NaN             NaN
2        T2       260.0       T3        280.0      340.0            25.0
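For reference, here is a minimal sketch of building the two toy frames from the question so the snippet above can be run end to end:
import pandas as pd

# toy data copied from the cars/trucks tables in the question
cars = pd.DataFrame({'id': ['C1', 'C2', 'C3'],
                     'start': [200, 110, 240],
                     'end': [215, 125, 255]})
trucks = pd.DataFrame({'id': ['T1', 'T2', 'T3', 'T4'],
                       'start': [115, 200, 280, 25],
                       'end': [175, 260, 340, 85]})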

Dataframe/Row Indexing for Pandas

I was wondering how I could index datasets so that a row number from df1 corresponds to a different row number in df2, e.g. row 1 in df1 = row 3 in df2.
What I would like. (In this case: row 1 2011 = row 2 2016)
Row 49:50 of the 2011 data (b1) is the same item as row 51:52 of the 2016 data (bt), just with a different value in a different year, but it is sliced differently because it sits in a different cell in 2016.
I've been using pd.concat and pd.Series but still no success.
# slicing 2011 data (total)
b1 = df1.iloc[49:50, 6:7]
m1 = df1.iloc[127:128, 6:7]
a1 = df1.iloc[84:85, 6:7]
data2011 = pd.concat([b1, m1, a1])
# slicing 2016 data (total)
bt = df2.iloc[51:52, 6:7]
mt = df2.iloc[129:130, 6:7]
at = df2.iloc[86:87, 6:7]
data2016 = pd.concat([bt, mt, at])
data20112016 = pd.concat([data2011, data2016])
print(data20112016)
Output I'm getting:
What I need to fix (in this case: row 49 = row 51, so 11849 in the left column and 13500 in the right column):
49 11849
127 22622
84 13658
51 13500
129 25281
86 18594
I would like to do a bar graph comparing b1 (2011) to bt (2016) and so on, meaning 49 = 51, 127 = 129, etc.
# Tot_x Tot_y
# 49=51 11849 13500
# 127=129 22622 25281
# 84=86 13658 18594
I hope this clear things up.
Thanks in advance.
If I understood your question correctly, here is a solution using merge:
df1 = pd.DataFrame([9337, 2953, 8184], index=[49, 127, 84], columns=['Tot'])
df2 = pd.DataFrame([13500, 25281, 18594], index=[51, 129, 86], columns=['Tot'])
total_df = (df1.reset_index()
               .merge(df2.reset_index(), left_index=True, right_index=True))
And here it is using concat:
total_df = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
And here is how to get the resulting barplot:
total_df.index = total_df['index_x'].astype(str) + '=' + total_df['index_y'].astype(str)
total_df
# index_x Tot_x index_y Tot_y
# 49=51 49 9337 51 13500
# 127=129 127 2953 129 25281
# 84=86 84 8184 86 18594
(total_df.drop(['index_x', 'index_y'], axis=1)
         .plot(kind='bar', rot=0))
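If you run this as a plain script rather than in a notebook, you may also need matplotlib to actually render the figure (an assumption about your environment; pandas plotting uses matplotlib under the hood):
import matplotlib.pyplot as plt

# same plot call as above, followed by an explicit show()
(total_df.drop(['index_x', 'index_y'], axis=1)
         .plot(kind='bar', rot=0))
plt.show()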
