Merging Data Frames based on two columns - python

I have two data frames: one with a list of all possible mutations (plus an associated score), and another with the subset of mutations actually observed (plus a measured value).
I want to merge my second data frame (the observed subset) into my larger data frame (all possible) and bring with it the data associated with the observed mutations (the fit values). However, when I do this, my merged data frame shows NaN for all the fit values.
The code I tried for merging is below, with samples of my data frames and the resultant output (as s1).
s1 = pd.merge(data_frame, data_frame_2, how='left', on=['position', 'mutation'])
data_frame #all possible
position mutation A_score Normalized_A_Score
0 1 * 0.00 0.000000
1 1 A 849.69 100.007062
2 1 C 849.94 100.036486
3 1 D 849.76 100.015301
4 1 E 849.67 100.004708
5 1 F 849.00 99.925850
6 1 G 849.56 99.991761
7 1 H 849.83 100.023540
8 1 I 849.63 100.000000
9 1 K 851.51 100.221273
10 1 L 849.56 99.991761
11 1 M 849.63 100.000000
12 1 N 849.63 100.000000
13 1 P 849.00 99.925850
14 1 Q 849.13 99.941151
15 1 R 851.70 100.243635
16 1 S 849.15 99.943505
17 1 T 849.94 100.036486
18 1 V 849.63 100.000000
19 1 W 849.00 99.925850
20 1 Y 849.10 99.937620
data_frame_2 #observed
position mutation fit_val adjusted_fit_val
0 1 * 0.633847 0.274555
1 1 A 0.832698 0.473406
2 1 C 0.857012 0.497719
3 1 D 0.873119 0.513827
4 1 E 0.859805 0.500512
5 1 F 0.359053 -0.000239
6 1 G 0.786489 0.427197
7 1 H 0.876687 0.517395
8 1 I 0.820826 0.461534
9 1 K 0.886447 0.527154
10 1 L 0.868197 0.508905
11 1 N 0.909416 0.550124
12 1 P 0.843697 0.484405
13 1 Q 0.838892 0.479600
14 1 R 0.878175 0.518883
15 1 S 0.981739 0.622446
16 1 T 0.709694 0.350402
17 1 W 0.866746 0.507453
18 1 Y 0.876647 0.517355
s1 #merged
position mutation A_score Normalized_A_Score fit_val adjusted_fit_val
0 1 * 0.00 0.000000 NaN NaN
1 1 A 849.69 100.007062 NaN NaN
2 1 C 849.94 100.036486 NaN NaN
3 1 D 849.76 100.015301 NaN NaN
4 1 E 849.67 100.004708 NaN NaN
5 1 F 849.00 99.925850 NaN NaN
6 1 G 849.56 99.991761 NaN NaN
7 1 H 849.83 100.023540 NaN NaN
8 1 I 849.63 100.000000 NaN NaN
9 1 K 851.51 100.221273 NaN NaN
10 1 L 849.56 99.991761 NaN NaN
11 1 M 849.63 100.000000 NaN NaN
12 1 N 849.63 100.000000 NaN NaN
13 1 P 849.00 99.925850 NaN NaN
14 1 Q 849.13 99.941151 NaN NaN
15 1 R 851.70 100.243635 NaN NaN
16 1 S 849.15 99.943505 NaN NaN
17 1 T 849.94 100.036486 NaN NaN
18 1 V 849.63 100.000000 NaN NaN
19 1 W 849.00 99.925850 NaN NaN
20 1 Y 849.10 99.937620 NaN NaN
Why won't the fit_val or adjusted_fit_val column values from data_frame_2 show up when I merge the data frames together? Thanks for any help in understanding!

I think the position columns have different types - string in one data frame and integer in the other - so the merge keys never compare equal. Cast both to int before merging:
data_frame['position'] = data_frame['position'].astype(int)
data_frame_2['position'] = data_frame_2['position'].astype(int)
s1 = pd.merge(data_frame, data_frame_2, how='left', on=['position', 'mutation'])
print (s1)
position mutation A_score Normalized_A_Score fit_val adjusted_fit_val
0 1 * 0.00 0.000000 0.633847 0.274555
1 1 A 849.69 100.007062 0.832698 0.473406
2 1 C 849.94 100.036486 0.857012 0.497719
3 1 D 849.76 100.015301 0.873119 0.513827
4 1 E 849.67 100.004708 0.859805 0.500512
5 1 F 849.00 99.925850 0.359053 -0.000239
6 1 G 849.56 99.991761 0.786489 0.427197
7 1 H 849.83 100.023540 0.876687 0.517395
8 1 I 849.63 100.000000 0.820826 0.461534
9 1 K 851.51 100.221273 0.886447 0.527154
10 1 L 849.56 99.991761 0.868197 0.508905
11 1 M 849.63 100.000000 NaN NaN
12 1 N 849.63 100.000000 0.909416 0.550124
13 1 P 849.00 99.925850 0.843697 0.484405
14 1 Q 849.13 99.941151 0.838892 0.479600
15 1 R 851.70 100.243635 0.878175 0.518883
16 1 S 849.15 99.943505 0.981739 0.622446
17 1 T 849.94 100.036486 0.709694 0.350402
18 1 V 849.63 100.000000 NaN NaN
19 1 W 849.00 99.925850 0.866746 0.507453
20 1 Y 849.10 99.937620 0.876647 0.517355
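You can confirm the mismatch before casting by printing the dtypes of the key columns (a quick diagnostic, assuming the frames are named as in the question):
print(data_frame[['position', 'mutation']].dtypes)
print(data_frame_2[['position', 'mutation']].dtypes)
If position shows object in one frame and int64 in the other, the keys never compare equal, so every left row gets NaN for the right-hand columns.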

Related

Pandas Dataframe - iterate and assign

I have a dataset as shown below. I am looking to assign a new student if the score ratio is <= 0.05.
import pandas as pd
df = pd.DataFrame({'Teacher': ['P','P','N','N','N','N','P','N','N','P','P','N'],
                   'Class': ['A','A','A','A','B','B','B','C','C','C','C','C'],
                   'Student': [1,2,3,4,1,2,3,1,2,3,4,5],
                   'Total Score': [75,10,10,5,75,20,5,60,20,10,6,4],
                   'Percent': [43,32,30,36,35,28,34,33,31,36,37,29]})
I built a score ratio column as below:
df_2 = df.groupby(['Teacher','Class']).agg({'Total Score': 'sum'}).reset_index()
final_data=pd.merge(df,df_2, on=['Teacher','Class'], how='inner')
final_data['score ratio']=final_data['Total Score_x']/final_data['Total Score_y']
If a student's score ratio is <= 0.05, then I need to assign a new student for the same teacher (e.g. N) within the same class (e.g. C) whose percent is next best (in the example below, student 2 has the next best percent of 31).
Expected output with the new column 'new_assigned_student':
Here is a solution with nested iterrows which works but is not efficient. I would be interested to see if someone provides a more efficient vectorized solution:
import math

for idx, row in final_data.iterrows():
    if row['score ratio'] <= 0.05:
        min_distance = math.inf
        target_index = -1
        for idx2, row2 in final_data.iterrows():
            if row2['Teacher'] == row['Teacher'] and \
               row2['Class'] == row['Class'] and \
               row2['Percent'] > row['Percent'] and \
               row2['Percent'] - row['Percent'] < min_distance:
                min_distance = row2['Percent'] - row['Percent']
                target_index = idx2
        final_data.loc[idx, 'new_assigned-student'] = str(final_data.loc[target_index, 'Student'])
#output:
Teacher Class Student ... Total Score_y score ratio new_assigned-student
0 P A 1 ... 85 0.882353 NaN
1 P A 2 ... 85 0.117647 NaN
2 N A 3 ... 15 0.666667 NaN
3 N A 4 ... 15 0.333333 NaN
4 N B 1 ... 95 0.789474 NaN
5 N B 2 ... 95 0.210526 NaN
6 P B 3 ... 5 1.000000 NaN
7 N C 1 ... 84 0.714286 NaN
8 N C 2 ... 84 0.238095 NaN
9 N C 5 ... 84 0.047619 2
10 P C 3 ... 16 0.625000 NaN
11 P C 4 ... 16 0.375000 NaN
This should do it; just use a shift. It assumes your rows are sorted per Teacher/Class, as they are in your example (see the note after the result for a defensive sort):
import numpy as np

final_data['new_assigned_student'] = final_data.groupby(['Teacher','Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan
The result:
Teacher Class Student Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- --------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
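If the input ordering is not guaranteed, a defensive sort makes the shift explicit. This is a sketch, assuming "next best" means the closest higher Percent within the group:
final_data = final_data.sort_values(['Teacher', 'Class', 'Percent'],
                                    ascending=[True, True, False])
# after sorting by descending Percent, the previous row within each
# Teacher/Class group is the student with the next higher Percent
final_data['new_assigned_student'] = final_data.groupby(['Teacher', 'Class'])['Student'].shift()
final_data.loc[final_data['score ratio'] > 0.05, 'new_assigned_student'] = np.nan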
Solution 2
Here is a more robust, if somewhat more involved, solution:
df3 = final_data
# idxmin returns the row label of the smallest Percent among students whose
# score ratio is above 0.05; .loc then pulls that row out of the group
df_min_pct = (df3.groupby(['Teacher','Class'],
                          as_index=False,
                          sort=False)
                 .apply(lambda g: g.loc[g.loc[g['score ratio'] > 0.05, 'Percent'].idxmin()]))
Here df_min_pct shows, for each Teacher/Class group, the details of the student in that group with the lowest Percent among those whose score ratio is above 0.05:
Teacher Class Student Total Score_x Percent Total Score_y score ratio
-- --------- ------- --------- --------------- --------- --------------- -------------
0 P A 2 10 32 85 0.117647
1 N A 3 10 30 15 0.666667
2 N B 2 20 28 95 0.210526
3 P B 3 5 34 5 1
4 N C 2 20 31 84 0.238095
5 P C 3 10 36 16 0.625
Now we merge with the original df, and remove the assignment from those rows where it is not relevant:
df4 = (df3.merge(df_min_pct[['Teacher', 'Class', 'Student']],
                 on=['Teacher', 'Class'], sort=False)
          .rename(columns={'Student_y': 'new_assigned_student'}))
df4.loc[df4['score ratio'] > 0.05, 'new_assigned_student'] = np.nan
This produces the desired result:
Teacher Class Student_x Total Score_x Percent Total Score_y score ratio new_assigned_student
-- --------- ------- ----------- --------------- --------- --------------- ------------- ----------------------
0 P A 1 75 43 85 0.882353 nan
1 P A 2 10 32 85 0.117647 nan
2 N A 3 10 30 15 0.666667 nan
3 N A 4 5 36 15 0.333333 nan
4 N B 1 75 35 95 0.789474 nan
5 N B 2 20 28 95 0.210526 nan
6 P B 3 5 34 5 1 nan
7 N C 1 60 33 84 0.714286 nan
8 N C 2 20 31 84 0.238095 nan
9 N C 5 4 29 84 0.047619 2
10 P C 3 10 36 16 0.625 nan
11 P C 4 6 37 16 0.375 nan
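As an aside, the score ratio itself can be built without the helper merge by using groupby transform, which broadcasts the group sums back onto the original rows (a sketch with the same column names):
df['score ratio'] = df['Total Score'] / df.groupby(['Teacher', 'Class'])['Total Score'].transform('sum')
This also avoids the Total Score_x / Total Score_y suffixes that the merge introduces.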

Flatten DataFrame by group with columns creation in Pandas

I have the following pandas DataFrame
Id_household Age_Father Age_child
0 1 30 2
1 1 30 4
2 1 30 4
3 1 30 1
4 2 27 4
5 3 40 14
6 3 40 18
and I want to achieve the following result
Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
Id_household
1 30 1 2.0 4.0 4.0
2 27 4 NaN NaN NaN
3 40 14 18.0 NaN NaN
I tried stacking with multi-index renaming, but I am not very happy with it and I am not able to make everything work properly.
Use this:
df_out = df.set_index([df.groupby('Id_household').cumcount()+1,
                       'Id_household',
                       'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()
Output:
Id_household Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
0 1 30 2.0 4.0 4.0 1.0
1 2 27 4.0 NaN NaN NaN
2 3 40 14.0 18.0 NaN NaN
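If you also want the children's ages sorted within each household, as in the desired output above, you could sort before taking the cumcount (a sketch under that assumption):
df_sorted = df.sort_values(['Id_household', 'Age_child'])
df_out = df_sorted.set_index([df_sorted.groupby('Id_household').cumcount()+1,
                              'Id_household',
                              'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()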

python: inserting row at specific index from one dataframe to another

I have two dataframes as follows:
df1:
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
2 0 7 6 5 8
df2:
M N O P Q R S T
0 1 2 3
1 4 5 6
2 7 8 9
3 8 6 5
4 5 4 3
I have taken out a slice of data from df1 as follows:
>data_1 = df1.loc[0:1]
>data_1
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
Now I need to insert this data_1 into df2 at the specific location (0, 'P') (row, column). Is there any way to do it? I do not want to disturb the other columns in df2.
I can extract individual values of each cell and do it, but since I have to do it for a large dataset, it's not feasible to do it cell-wise.
Cellwise method:
>var1 = df1.iat[0,1]
>var2 = df1.iat[0,0]
>df2.at[0, 'P'] = var1
>df2.at[0, 'Q'] = var2
If you specify all the columns, it is possible to do it as follows:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values
Resulting dataframe:
M N O P Q R S T
0 1 2 3 8.0 6.0 4.0 9.0 7.0
1 4 5 6 2.0 6.0 3.0 8.0 5.0
2 7 8 9
3 8 6 5
4 5 4 3
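The .values on the right-hand side matters here: it strips df1's labels so the assignment is purely positional. Without it, pandas aligns on column names, and since A-E and P-T share no labels, every assigned cell would be NaN. A minimal illustration of the difference:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1]         # aligns on labels -> all NaN
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values  # positional -> values copied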
You can rename the columns and index values of data_1 to match the second DataFrame, which makes it possible to use DataFrame.update, with the target position specified by the tuple pos:
data_1 = df1.loc[0:1]
print (data_1)
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
pos = (2, 'P')
data_1 = data_1.rename(columns=dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)),
                       index=dict(zip(data_1.index, df2.loc[pos[0]:].index)))
print (data_1)
P Q R S T
2 8 6 4 9 7
3 2 6 3 8 5
df2.update(data_1)
print (df2)
M N O P Q R S T
0 1 2 3 NaN NaN NaN NaN NaN
1 4 5 6 NaN NaN NaN NaN NaN
2 7 8 9 8.0 6.0 4.0 9.0 7.0
3 8 6 5 2.0 6.0 3.0 8.0 5.0
4 5 4 3 NaN NaN NaN NaN NaN
How the rename works: the idea is to select all column labels from the target column onward and all index values from the target row onward with loc, zip them against data_1's own columns and index, and convert the pairs to dictionaries. The rename then relabels both the columns and the index of data_1 with the target labels in df2, so update writes the values into the right place.
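To see the mappings the rename builds, you can print the two dictionaries (constructed from the original data_1 slice, before the rename; values taken from the frames above):
print(dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)))
# {'A': 'P', 'B': 'Q', 'C': 'R', 'D': 'S', 'E': 'T'}
print(dict(zip(data_1.index, df2.loc[pos[0]:].index)))
# {0: 2, 1: 3}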

Pandas Rolling Groupby Shift back 1, Trying to lag rolling sum

I am trying to get a rolling sum of the past 3 rows for the same ID, but lagging this by 1 row. My attempt looked like the code below, where i is the column. There has to be a way to do this, but this method doesn't seem to work.
for i in df.columns.values:
    # the result is never assigned back to df, and mean() is used where a sum was wanted
    df.groupby('Id', group_keys=False)[i].rolling(window=3, min_periods=2).mean().shift(1)
id dollars lag
1 6 nan
1 7 nan
1 6 6.5
3 7 nan
3 4 nan
3 4 5.5
3 3 5
5 6 nan
5 5 nan
5 6 5.5
5 12 5.67
5 7 8.3
I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row.
You can create the lagged rolling sum by chaining DataFrame.groupby(ID), .shift(1) for the lag 1, .rolling(3) for the window 3, and .sum() for the sum.
Example: Let's say your dataset is:
import pandas as pd

# Reproducible datasets are your friend!
d = pd.DataFrame({'grp': pd.Series(['A']*4 + ['B']*5 + ['C']*6),
                  'x': pd.Series(range(15))})
print(d)
print(d)
grp x
A 0
A 1
A 2
A 3
B 4
B 5
B 6
B 7
B 8
C 9
C 10
C 11
C 12
C 13
C 14
I think what you're asking for is this:
d['y'] = d.groupby('grp')['x'].shift(1).rolling(3).sum()
print(d)
grp x y
A 0 NaN
A 1 NaN
A 2 NaN
A 3 3.0
B 4 NaN
B 5 NaN
B 6 NaN
B 7 15.0
B 8 18.0
C 9 NaN
C 10 NaN
C 11 NaN
C 12 30.0
C 13 33.0
C 14 36.0
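Keeping the shift inside the per-group pipeline is what prevents leakage across groups. An equivalent formulation (a sketch, relying on group_keys=False to keep the original index) rolls first and shifts second:
d['y_alt'] = (d.groupby('grp', group_keys=False)['x']
                .apply(lambda s: s.rolling(3).sum().shift(1)))
Both orderings give the same result, since the shift and the rolling window are each applied within a single group either way.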

Pandas: rolling count if within a loop

In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with a rolling count of historical data that's close to the peak. I wonder if there is an easier way to simplify or, ideally, vectorise the calculation.
This is my code, which works in a plain but complicated way:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]],
                  columns=list('ABC'))
df['5D_Peak'] = df['C'].rolling(window=5, center=False).max()
for i in range(5, len(df.A)):
    val = 0
    for j in range(i-5, i):
        if df.loc[j,'C'] > df.loc[i,'5D_Peak'] - 2 and df.loc[j,'C'] < df.loc[i,'5D_Peak'] + 2:
            val += 1
    df.loc[i,'5D_Close_to_Peak_Count'] = val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0
I believe this is what you want. You can set the two values below:
# the window within which to search for "close-to-peak" values
lkp_rng = 5
# how close is close?
closeness_measure = 2
# function to count the number of "close-to-peak" values in the window
fc = lambda x: np.count_nonzero(x >= x.max() - closeness_measure)
# apply fc to the column you choose
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_rng, center=False).apply(fc, raw=True)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 4.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 3.0
I am guessing what you mean by "historical data".
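If instead you want the loop's exact semantics - compare the previous 5 values against the current 5-row peak, strictly within plus/minus 2 - one possible vectorised sketch uses a 6-row window in which the last element only contributes to the peak:
fc6 = lambda x: ((x[:-1] > x[1:].max() - 2) & (x[:-1] < x[1:].max() + 2)).sum()
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=6).apply(fc6, raw=True)
On the sample frame this reproduces the NaN, 0.0, 1.0, 2.0, 2.0 column from the question.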
