I have a dataset as shown below, and I am looking to assign a new student wherever the score ratio is <= 0.05.
import pandas as pd
df = pd.DataFrame({'Teacher': ['P','P','N','N','N','N','P','N','N','P','P','N'],
                   'Class': ['A','A','A','A','B','B','B','C','C','C','C','C'],
                   'Student': [1,2,3,4,1,2,3,1,2,3,4,5],
                   'Total Score': [75,10,10,5,75,20,5,60,20,10,6,4],
                   'Percent': [43,32,30,36,35,28,34,33,31,36,37,29]})
I built a score ratio column as below:
df_2 = df.groupby(['Teacher','Class']).agg({'Total Score': 'sum'}).reset_index()
final_data=pd.merge(df,df_2, on=['Teacher','Class'], how='inner')
final_data['score ratio']=final_data['Total Score_x']/final_data['Total Score_y']
If a student's score ratio is <= 0.05, then I need to assign a new student for the same teacher (e.g. N) within the same class (e.g. C) whose percent is the next best (in the example below, student 2 has the next best percent of 31).
Expected output with new column 'new_assigned_student':
Here is a solution with nested iterrows which works but is not efficient. I would be interested to see if someone provides a more efficient vectorized solution:
import math

for idx, row in final_data.iterrows():
    if row['score ratio'] <= 0.05:
        min_distance = math.inf
        target_index = -1
        for idx2, row2 in final_data.iterrows():
            if row2['Teacher'] == row['Teacher'] and \
               row2['Class'] == row['Class'] and \
               row2['Percent'] > row['Percent'] and \
               row2['Percent'] - row['Percent'] < min_distance:
                min_distance = row2['Percent'] - row['Percent']
                target_index = idx2
        if target_index != -1:  # a candidate with a higher Percent was found
            final_data.loc[idx, 'new_assigned_student'] = str(final_data.loc[target_index, 'Student'])
#output:
Teacher Class Student ... Total Score_y score ratio new_assigned_student
0 P A 1 ... 85 0.882353 NaN
1 P A 2 ... 85 0.117647 NaN
2 N A 3 ... 15 0.666667 NaN
3 N A 4 ... 15 0.333333 NaN
4 N B 1 ... 95 0.789474 NaN
5 N B 2 ... 95 0.210526 NaN
6 P B 3 ... 5 1.000000 NaN
7 N C 1 ... 84 0.714286 NaN
8 N C 2 ... 84 0.238095 NaN
9 N C 5 ... 84 0.047619 2
10 P C 3 ... 16 0.625000 NaN
11 P C 4 ... 16 0.375000 NaN
This should do it: just use a shift. It assumes that, within each Teacher/Class group, the rows are ordered so the previous row holds the next-best percent, as they are in your example.
import numpy as np

final_data['new_assigned_student'] = final_data.groupby(['Teacher','Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan
The result:
Teacher Class Student Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- --------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
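If the rows are not already in that order, a minimal sketch (assuming the same final_data frame as above) is to sort by Percent first, so that within each group the next row is the student with the next-higher percent; assigning back aligns on the original index:

import numpy as np

# Order rows by Percent so that, inside each Teacher/Class group, the row
# after a student is the one with the next-best (next-higher) Percent.
ordered = final_data.sort_values('Percent')
final_data['new_assigned_student'] = (
    ordered.groupby(['Teacher', 'Class'])['Student'].shift(-1)
)
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan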
Solution 2
Here is a more robust, if somewhat more involved, solution
df3 = final_data
df_min_pct = (df3.groupby(['Teacher','Class'],
                          as_index=False,
                          sort=False)
                 # use idxmin (an index label) so the row lookup stays aligned
                 # with g even when the kept rows are not a prefix of the group
                 .apply(lambda g: g.loc[g.loc[g['score ratio'] > 0.05, 'Percent'].idxmin()])
             )
Here df_min_pct shows, for each Teacher/Class group, the details of the student in that group with the lowest Percent among students whose score ratio is above 0.05:
Teacher Class Student Total Score_x Percent Total Score_y score ratio
-- --------- ------- --------- --------------- --------- --------------- -------------
0 P A 2 10 32 85 0.117647
1 N A 3 10 30 15 0.666667
2 N B 2 20 28 95 0.210526
3 P B 3 5 34 5 1
4 N C 2 20 31 84 0.238095
5 P C 3 10 36 16 0.625
Now we merge with the original df, and blank out the new assignment on the rows where it is not relevant:
df4 = (df3.merge(df_min_pct[['Teacher', 'Class', 'Student']],
                 on=['Teacher', 'Class'], sort=False)
          .rename(columns={'Student_y': 'new_assigned_student'}))
df4.loc[df4['score ratio'] > 0.05, 'new_assigned_student'] = np.nan
This produces the desired result
Teacher Class Student_x Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- ----------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
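For completeness, here is a sketch of a fully vectorized version of the original rule (pick the smallest Percent strictly greater than the flagged student's) using pd.merge_asof; it assumes the final_data frame built above:

import numpy as np
import pandas as pd

# Split flagged students from valid candidates; reset_index keeps the
# original row labels of the flagged rows in an 'index' column.
flagged = final_data[final_data['score ratio'] <= 0.05].reset_index()
candidates = final_data[final_data['score ratio'] > 0.05]

# direction='forward' with allow_exact_matches=False finds, per
# Teacher/Class, the nearest Percent strictly above the flagged one.
matched = pd.merge_asof(
    flagged.sort_values('Percent'),
    candidates.sort_values('Percent')[['Teacher', 'Class', 'Percent', 'Student']],
    on='Percent', by=['Teacher', 'Class'],
    direction='forward', allow_exact_matches=False,
    suffixes=('', '_next'))

final_data['new_assigned_student'] = np.nan
final_data.loc[matched['index'], 'new_assigned_student'] = matched['Student_next'].to_numpy()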
Related
I have some data like this:
df = pd.DataFrame({'x': [1,2,3,1,1,2,3,3,2],
                   'y': ['n', 'n', 'p', 'p', 'n', 'n', 'n', 'p', 'n'],
                   'z': [52,75,77,68,92,32,62,70,34]})
I'd like to first group it by x, then check whether p exists in any of the rows of each group, and add another column to the original dataframe (or to the grouped one, and then somehow flatten it back out?) that holds either None if there were no p's in that group, or the smallest z value among the p rows.
So here it'd be:
x y z t
0 1 n 52 68
3 1 p 68 68
4 1 n 92 68
x y z t
1 2 n 75 None
5 2 n 32 None
8 2 n 34 None
x y z t
2 3 p 77 70
6 3 n 62 70
7 3 p 70 70
or flattened:
x y z t
0 1 n 52 68
3 1 p 68 68
4 1 n 92 68
1 2 n 75 None
5 2 n 32 None
8 2 n 34 None
2 3 p 77 70
6 3 n 62 70
7 3 p 70 70
So first we'd do
g = df.groupby('x')
But then I'm not sure how to proceed.
I'm just having a hard time wrapping my head around it and running into all sorts of pandas errors.
One option is to filter only the rows in the DataFrame where y is p. Then use groupby min to get the minimal z value per group (of remaining rows). Then join back to the DataFrame on x. NaN will automatically be added for any missing values (groups which did not have any values equal to p).
df = df.join(
    df[df['y'].eq('p')].groupby('x')['z'].min().rename('t'),
    on='x'
)
x y z t
0 1 n 52 68.0
1 2 n 75 NaN
2 3 p 77 70.0
3 1 p 68 68.0
4 1 n 92 68.0
5 2 n 32 NaN
6 3 n 62 70.0
7 3 p 70 70.0
8 2 n 34 NaN
*rename is used here to change the name of the column to the desired name before joining back.
We can also sort by x with sort_values if needing the x values grouped together:
df = df.sort_values('x', ignore_index=True).join(
    df[df['y'].eq('p')].groupby('x')['z'].min().rename('t'),
    on='x'
)
x y z t
0 1 n 52 68.0
1 1 p 68 68.0
2 1 n 92 68.0
3 2 n 75 NaN
4 2 n 32 NaN
5 2 n 34 NaN
6 3 p 77 70.0
7 3 n 62 70.0
8 3 p 70 70.0
Depending on the size of the DataFrame it may be more efficient to select only the z column initially with loc:
df = df.sort_values('x', ignore_index=True).join(
    df.loc[df['y'].eq('p'), 'z'].groupby(df['x']).min().rename('t'),
    on='x'
)
x y z t
0 1 n 52 68.0
1 1 p 68 68.0
2 1 n 92 68.0
3 2 n 75 NaN
4 2 n 32 NaN
5 2 n 34 NaN
6 3 p 77 70.0
7 3 n 62 70.0
8 3 p 70 70.0
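A rough way to compare the two variants on your own data is timeit (a sketch, assuming the df above):

import timeit

# Time each variant; on larger frames, selecting only 'z' up front can win.
t_filter = timeit.timeit(lambda: df[df['y'].eq('p')].groupby('x')['z'].min(), number=1000)
t_loc = timeit.timeit(lambda: df.loc[df['y'].eq('p'), 'z'].groupby(df['x']).min(), number=1000)
print(t_filter, t_loc)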
@HenryEcker covered all the nice intuitive solutions. This one's just for fun.
The basic idea is filter the rows where "y" is 'p' and among these rows find the minimum value of "z" for each "x". Then map it back to "x":
df['t'] = df['x'].map(df[df['y'].eq('p')].groupby('x')['z'].min())
df = df.sort_values(by='x')
An alternative method using eq + where. The basic idea is to mask the "z" values corresponding to non-"p" values in column "y"; then groupby "x" and transform the minimum "z":
df['t'] = df['z'].where(df['y'].eq('p')).groupby(df['x']).transform('min')
Output:
x y z t
0 1 n 52 68.0
3 1 p 68 68.0
4 1 n 92 68.0
1 2 n 75 NaN
5 2 n 32 NaN
8 2 n 34 NaN
2 3 p 77 70.0
6 3 n 62 70.0
7 3 p 70 70.0
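To see what the masking step does, here is the intermediate series on its own (a sketch, on the original unsorted df):

# z survives only where y == 'p'; everything else becomes NaN, which the
# grouped min then ignores within each x group.
df['z'].where(df['y'].eq('p'))
# 0     NaN
# 1     NaN
# 2    77.0
# 3    68.0
# 4     NaN
# 5     NaN
# 6     NaN
# 7    70.0
# 8     NaN
# Name: z, dtype: float64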
Say I have a vector valsHR which looks like this:
valsHR = [78.8, 82.3, 91.0]
And I have a dataframe mainData:
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using this:
mainData['HR'] = mainData['HR'].fillna(valsHR)
but it fills all the NaNs with the first value in the vector.
I've also tried to use this:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR)
which fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?
Create a dictionary from the Patient values that have missing HR, map it to the original column, and replace the missing values only:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
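To see why patient 2 is skipped here, inspect the intermediate mapping (a sketch):

# Only patients that still have NaNs are paired with values from valsHR,
# in order; patient 2 has no NaNs, so 82.3 goes to patient 3.
p = df.loc[df.HR.isna(), 'Patient'].unique()   # array([1, 3])
print(dict(zip(p, valsHR)))                    # {1: 78.8, 3: 82.3}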
If all of the NaNs should be replaced, it is simply a mapping:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep="\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
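If some HR values already exist and must be kept, the same mapping can be combined with fillna so that only the gaps are filled (a sketch):

# Keep any existing HR values; only NaNs receive the mapped patient value.
df["HR"] = df["HR"].fillna(df["Patient"].map(vals))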
I would like to cluster the dataframe below on column X3, then for each cluster find the mean of X3, and assign 3 to the cluster with the highest mean, 2 to the middle one, and 1 to the lowest. Here is the data frame:
df = pd.DataFrame({'Month': [1,1,1,1,1,1,3,3,3,3,3,3,3],
                   'X1': [10,15,24,32,8,6,10,23,24,56,45,10,56],
                   'X2': [12,90,20,40,10,15,30,40,60,42,2,4,10],
                   'X3': [34,65,34,87,100,65,78,67,34,98,96,46,76]})
I clustered on column X3 as below:
from sklearn.cluster import KMeans

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
Now I want to find the mean of X3 for each cluster and month, rank it, and assign 3 to the highest mean, 2 to the medium one, and 1 to the lowest. Below is what I did, but it is not working. How can I fix this? Thank you.
mapping = {1: 'weak', 2: 'average', 3: 'good'}
cols = df.columns[3]
df['product_rank'] = (df.groupby(['Month', 'X3_cluster_id'])[cols]
                        .transform('mean').rank(method='dense').astype(int))
df['product_category'] = df['product_rank'].map(mapping)
While assigning ranks, make sure to group on the basis of Month.
Complete code:
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'Month': [1,1,1,1,1,1,3,3,3,3,3,3,3],
                   'X1': [10,15,24,32,8,6,10,23,24,56,45,10,56],
                   'X2': [12,90,20,40,10,15,30,40,60,42,2,4,10],
                   'X3': [34,65,34,87,100,65,78,67,34,98,96,46,76]})

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
mapping = {1: 'weak', 2: 'average', 3: 'good'}
df['mean_X3'] = df.groupby(["Month", "X3_cluster_id"])["X3"].transform("mean")
df["product_category"] = df.groupby("Month")['mean_X3'].rank(method='dense').astype(int).map(mapping)
print(df)
    Month  X1  X2   X3  X3_cluster_id    mean_X3 product_category
0       1  10  12   34              1  34.000000             weak
1       1  15  90   65              2  65.000000          average
2       1  24  20   34              1  34.000000             weak
3       1  32  40   87              0  93.500000             good
4       1   8  10  100              0  93.500000             good
5       1   6  15   65              2  65.000000          average
6       3  10  30   78              1  73.666667          average
7       3  23  40   67              1  73.666667          average
8       3  24  60   34              0  40.000000             weak
9       3  56  42   98              2  97.000000             good
10      3  45   2   96              2  97.000000             good
11      3  10   4   46              0  40.000000             weak
12      3  56  10   76              1  73.666667          average
When you apply k-means, the means are already calculated, so I would suggest doing one fit and returning the labels, means and rankings from within each groupby:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X)
    # double argsort converts the cluster centers into 1-based ranks
    # (a single argsort gives sorting positions, not ranks)
    ranks = k_means.cluster_centers_.ravel().argsort().argsort() + 1
    res = pd.DataFrame({'cluster': range(k_means.n_clusters),
                        'means': k_means.cluster_centers_.ravel(),
                        'ranks': ranks}).loc[k_means.labels_, :]
    res.index = X.index
    return res
Then simply apply the function above to obtain the ranks and means in one shot:
mapping = {1: 'weak', 2: 'average', 3: 'good'}
res = df.groupby("Month")[['X3']].apply(cluster, n_clusters=3)
    cluster      means  ranks
0         1  34.000000      1
1         0  65.000000      2
2         1  34.000000      1
3         2  93.500000      3
4         2  93.500000      3
5         0  65.000000      2
6         0  73.666667      2
7         0  73.666667      2
8         1  40.000000      1
9         2  97.000000      3
10        2  97.000000      3
11        1  40.000000      1
12        0  73.666667      2
You can then apply the map and build the complete dataframe with a join on the index:
res['product_category'] = res['ranks'].map(mapping)
df.merge(res,left_index=True,right_index=True)
Month X1 X2 X3 cluster means ranks product_category
0 1 10 12 34 1 34.000000 1 weak
1 1 15 90 65 0 65.000000 2 average
2 1 24 20 34 1 34.000000 1 weak
3 1 32 40 87 2 93.500000 3 good
4 1 8 10 100 2 93.500000 3 good
5 1 6 15 65 0 65.000000 2 average
6 3 10 30 78 0 73.666667 2 average
7 3 23 40 67 0 73.666667 2 average
8 3 24 60 34 1 40.000000 1 weak
9 3 56 42 98 2 97.000000 3 good
10 3 45 2 96 2 97.000000 3 good
11 3 10 4 46 1 40.000000 1 weak
12 3 56 10 76 0 73.666667 2 average
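One caveat, assuming a recent pandas version: newer releases of groupby.apply prepend the group keys to the result index by default, which would break the index-aligned merge above. Passing group_keys=False keeps the flat index (a sketch):

# Same call as above, but group_keys=False preserves the original row index
# that df.merge(res, left_index=True, right_index=True) relies on.
res = df.groupby("Month", group_keys=False)[['X3']].apply(cluster, n_clusters=3)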
I have two data frames: one with a list of all possible mutations (+ an associated score), and another with the subset of mutations actually observed (+ a measured value).
I want to merge my second data frame (the observed subset) into my larger data frame (all possible) and bring with it the data associated with the observed mutations (fit values). However, when I do this, my merged data frame shows NaN for all the fit values.
The code I tried for merging is below, with samples of my data frames and the resultant output (as s1).
s1 = pd.merge(data_frame, data_frame_2, how='left', on=['position', 'mutation'])
data_frame #all possible
position mutation A_score Normalized_A_Score
0 1 * 0.00 0.000000
1 1 A 849.69 100.007062
2 1 C 849.94 100.036486
3 1 D 849.76 100.015301
4 1 E 849.67 100.004708
5 1 F 849.00 99.925850
6 1 G 849.56 99.991761
7 1 H 849.83 100.023540
8 1 I 849.63 100.000000
9 1 K 851.51 100.221273
10 1 L 849.56 99.991761
11 1 M 849.63 100.000000
12 1 N 849.63 100.000000
13 1 P 849.00 99.925850
14 1 Q 849.13 99.941151
15 1 R 851.70 100.243635
16 1 S 849.15 99.943505
17 1 T 849.94 100.036486
18 1 V 849.63 100.000000
19 1 W 849.00 99.925850
20 1 Y 849.10 99.937620
data_frame_2 #observed
position mutation fit_val adjusted_fit_val
0 1 * 0.633847 0.274555
1 1 A 0.832698 0.473406
2 1 C 0.857012 0.497719
3 1 D 0.873119 0.513827
4 1 E 0.859805 0.500512
5 1 F 0.359053 -0.000239
6 1 G 0.786489 0.427197
7 1 H 0.876687 0.517395
8 1 I 0.820826 0.461534
9 1 K 0.886447 0.527154
10 1 L 0.868197 0.508905
11 1 N 0.909416 0.550124
12 1 P 0.843697 0.484405
13 1 Q 0.838892 0.479600
14 1 R 0.878175 0.518883
15 1 S 0.981739 0.622446
16 1 T 0.709694 0.350402
17 1 W 0.866746 0.507453
18 1 Y 0.876647 0.517355
s1 #merged
position mutation A_score Normalized_A_Score fit_val adjusted_fit_val
0 1 * 0.00 0.000000 NaN NaN
1 1 A 849.69 100.007062 NaN NaN
2 1 C 849.94 100.036486 NaN NaN
3 1 D 849.76 100.015301 NaN NaN
4 1 E 849.67 100.004708 NaN NaN
5 1 F 849.00 99.925850 NaN NaN
6 1 G 849.56 99.991761 NaN NaN
7 1 H 849.83 100.023540 NaN NaN
8 1 I 849.63 100.000000 NaN NaN
9 1 K 851.51 100.221273 NaN NaN
10 1 L 849.56 99.991761 NaN NaN
11 1 M 849.63 100.000000 NaN NaN
12 1 N 849.63 100.000000 NaN NaN
13 1 P 849.00 99.925850 NaN NaN
14 1 Q 849.13 99.941151 NaN NaN
15 1 R 851.70 100.243635 NaN NaN
16 1 S 849.15 99.943505 NaN NaN
17 1 T 849.94 100.036486 NaN NaN
18 1 V 849.63 100.000000 NaN NaN
19 1 W 849.00 99.925850 NaN NaN
20 1 Y 849.10 99.937620 NaN NaN
Why won't the fit_val or adjusted_fit_val column values from data_frame_2 show up when I merge the data frames together? Thanks for any help in understanding!
I think the position columns have different types: strings in one dataframe and integers in the other. Convert both to integers before merging:
data_frame['position'] = data_frame['position'].astype(int)
data_frame_2['position'] = data_frame_2['position'].astype(int)
s1 = pd.merge(data_frame, data_frame_2, how='left', on=['position', 'mutation'])
print (s1)
position mutation A_score Normalized_A_Score fit_val adjusted_fit_val
0 1 * 0.00 0.000000 0.633847 0.274555
1 1 A 849.69 100.007062 0.832698 0.473406
2 1 C 849.94 100.036486 0.857012 0.497719
3 1 D 849.76 100.015301 0.873119 0.513827
4 1 E 849.67 100.004708 0.859805 0.500512
5 1 F 849.00 99.925850 0.359053 -0.000239
6 1 G 849.56 99.991761 0.786489 0.427197
7 1 H 849.83 100.023540 0.876687 0.517395
8 1 I 849.63 100.000000 0.820826 0.461534
9 1 K 851.51 100.221273 0.886447 0.527154
10 1 L 849.56 99.991761 0.868197 0.508905
11 1 M 849.63 100.000000 NaN NaN
12 1 N 849.63 100.000000 0.909416 0.550124
13 1 P 849.00 99.925850 0.843697 0.484405
14 1 Q 849.13 99.941151 0.838892 0.479600
15 1 R 851.70 100.243635 0.878175 0.518883
16 1 S 849.15 99.943505 0.981739 0.622446
17 1 T 849.94 100.036486 0.709694 0.350402
18 1 V 849.63 100.000000 NaN NaN
19 1 W 849.00 99.925850 0.866746 0.507453
20 1 Y 849.10 99.937620 0.876647 0.517355
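A quick way to confirm this kind of key mismatch before merging (a sketch, assuming the two frames above):

# If one 'position' is object (strings) and the other int64, '1' != 1, so
# every left row misses and the right-hand columns come back as NaN.
print(data_frame['position'].dtype, data_frame_2['position'].dtype)

# indicator=True labels each row; all 'left_only' means no key ever matched.
check = data_frame.merge(data_frame_2, how='left',
                         on=['position', 'mutation'], indicator=True)
print(check['_merge'].value_counts())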
Suppose we have a dataframe and we calculate the percent change between rows:
y_axis = [1,2,3,4,5,6,7,8,9]
x_axis = [100,105,115,95,90,88,110,100,0]
DF = pd.DataFrame({'Y':y_axis, 'X':x_axis})
DF = DF[['Y','X']]
DF['PCT'] = DF['X'].pct_change()
Y X PCT
0 1 100 NaN
1 2 105 0.050000
2 3 115 0.095238
3 4 95 -0.173913
4 5 90 -0.052632
5 6 88 -0.022222
6 7 110 0.250000
7 8 100 -0.090909
8 9 0 -1.000000
That way it starts from the first row.
I want to calculate pct_change() starting from the last row.
One way to do it:
DF['Reverse'] = list(reversed(x_axis))
DF['PCT_rev'] = DF['Reverse'].pct_change()
pct_rev = DF.PCT_rev.tolist()
DF['_PCT_'] = list(reversed(pct_rev))
DF2 = DF[['Y','X','PCT','_PCT_']]
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
But that is a very ugly and inefficient solution.
I was wondering if there are more elegant solutions?
DF.assign(_PCT_=DF.X.pct_change(-1))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
Series.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
periods : int, default 1. Periods to shift for forming percent change. With periods=-1 each element is compared with the next row; for example, the first row gives (100 - 105) / 105 ≈ -0.047619, matching the _PCT_ column above.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
I deleted my other answer because @su79eu7k's is way better.
You can cut your time in half by using the underlying arrays. But you also have to suppress a warning.
import numpy as np

a = DF.X.values
DF.assign(_PCT_=np.append((a[:-1] - a[1:]) / a[1:], np.nan))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
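The warning mentioned above comes from the division by zero in the last data row (X contains a 0). A minimal sketch for silencing it locally with np.errstate:

import numpy as np

a = DF.X.values
with np.errstate(divide='ignore', invalid='ignore'):  # silence the 0-division warning
    rev_pct = np.append((a[:-1] - a[1:]) / a[1:], np.nan)
DF = DF.assign(_PCT_=rev_pct)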