Merging One Column on to Multiple Columns - python

I have the following two dataframes, DF1:
location vaccine1 vaccine2 vaccine3 vaccine4
0 Afghanistan Oxford/AstraZeneca Pfizer/BioNTech Sinopharm/Beijing None
1 Albania Oxford/AstraZeneca Pfizer/BioNTech Sinovac Sputnik V
2 Algeria Sputnik V None None None
3 Andorra Oxford/AstraZeneca Pfizer/BioNTech None None
DF2:
Vaccine Efficacy
0 Oxford/AstraZeneca 0.70
1 Pfizer/BioNTech 0.95
2 Sinopharm/Beijing 0.79
3 Sinovac 0.50
4 Sputnik V 0.92
I understand that you can merge like this below but the process is repeated 4 times which is inefficient:
v1 = pd.merge(df1, vacc_eff, how='left', left_on='vaccine1', right_on='Vaccine')[['location', 'Efficacy']]
v2 = pd.merge(df1, vacc_eff, how='left', left_on='vaccine2', right_on='Vaccine')[['location', 'Efficacy']]
vmerged = pd.merge(v1,v2,on=['location'])
How can I merge the DF2 column 'Efficacy' onto each of the vaccine columns in DF1 without writing the same merge function again and again?

Here is a solution you can try: stack the vaccine columns, map each vaccine name to its efficacy, then unstack:
map_ = vacc_eff.set_index('Vaccine')['Efficacy'].to_dict()
print(
df1[['location', 'vaccine1', 'vaccine2']].set_index('location')
.stack().map(map_).unstack()
)
vaccine1 vaccine2
location
Afghanistan 0.70 0.95
Albania 0.70 0.95
Algeria 0.92 NaN
Andorra 0.70 0.95
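The same mapping extends to all four vaccine columns without listing them by hand. The following is a minimal sketch, assuming the frames are named df1 and vacc_eff as in the question. The names eff, vaccine_cols and efficacy are introduced here for illustration, and it uses a per-column Series.map rather than stack/unstack:

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "location": ["Afghanistan", "Albania", "Algeria", "Andorra"],
    "vaccine1": ["Oxford/AstraZeneca", "Oxford/AstraZeneca", "Sputnik V", "Oxford/AstraZeneca"],
    "vaccine2": ["Pfizer/BioNTech", "Pfizer/BioNTech", None, "Pfizer/BioNTech"],
    "vaccine3": ["Sinopharm/Beijing", "Sinovac", None, None],
    "vaccine4": [None, "Sputnik V", None, None],
})
vacc_eff = pd.DataFrame({
    "Vaccine": ["Oxford/AstraZeneca", "Pfizer/BioNTech", "Sinopharm/Beijing", "Sinovac", "Sputnik V"],
    "Efficacy": [0.70, 0.95, 0.79, 0.50, 0.92],
})

# Lookup Series: vaccine name -> efficacy.
eff = vacc_eff.set_index("Vaccine")["Efficacy"]

# Every column whose name starts with "vaccine" (hypothetical naming rule).
vaccine_cols = [c for c in df1.columns if c.startswith("vaccine")]

# Map each vaccine column through the lookup; missing vaccines become NaN.
efficacy = df1.set_index("location")[vaccine_cols].apply(lambda col: col.map(eff))
print(efficacy)
```

The result keeps one efficacy column per vaccine column, indexed by location, so it can be joined back to df1 on location if needed.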


Python String Matching Using Loops and Iterations and Score Calculation using two dataframes

df1
Place Location
Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow
Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot
Delhi,Punjab,Kerala Delhi,Jaipur,Madras
df2
Target1 Target2 Strength
Jaipur Rajkot 0.94
Jaipur Punjab 0.84
Jaipur Noida 0.62
Jaipur Jodhpur 0.59
Punjab Amritsar 0.97
Punjab Delhi 0.85
Punjab Bhopal 0.91
Punjab Jodhpur 0.75
Kerala Varkala 0.85
Kerala Kochi 0.88
The task is to match each 'Place' value against the 'Location' values, assigning a score of 1 for a direct match and, for an indirect match, the strength score looked up in df2. For example, in row 1 Delhi and Punjab are direct matches because both appear in 'Place' and 'Location', whereas Jaipur appears in 'Place' but not in 'Location'. So Jaipur is looked up in df2 'Target1', and the 'Location' values of row 1 are searched for among the corresponding 'Target2' values. In df2 Jaipur is related to Punjab and Noida, both of which are in the row 1 'Location' values. Jaipur is therefore allotted Punjab's strength, since 0.84 is higher than Noida's 0.62. The final score is (1+1+0.84)/3, i.e. the sum of direct and indirect match scores divided by the number of 'Place' items.
Expected output is :
Place Location Avg. Score
Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow (1+1+0.84)/3 = 0.95
Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot (1+0.91+1)/3 = 0.97
Delhi,Punjab,Kerala Delhi,Jaipur,Madras (1+0.85+0)/3 = 0.62
My try
data1 = df1['Place'].to_list()
data2 = df1['Location'].to_list()
dict3 = {}
exac_match = []
for el in data1:
    #print(el)
    el=[x.strip() for x in el.split(',')]
    for ell in data2:
        ell=[x.strip() for x in ell.split(',')]
        dict1 = {}
        dict2 = {}
        for elll in el:
            if elll in ell:
                #print("Exact match:::", elll)
                dict1[elll]=1
                dict2[elll]=elll
Use:
import numpy as np
import pandas as pd

# split df1['Place'] on commas and explode to one row per place
df = df1.assign(Place = df1['Place'].str.split(',')).explode('Place').reset_index()
# test whether each Place is among the split Location values (direct match)
mask = [a in b for a, b in zip(df.Place, df['Location'].str.split(','))]
# keep the direct matches, remove duplicates, and assign 1 to the final column
df11 = df[mask].drop_duplicates(['index','Place','Location']).assign(final=1)
# take the rows without a direct match (indirect match) and join df2 on Target1
df12 = df[~np.array(mask)].merge(df2, left_on='Place', right_on='Target1')
# test whether Target2 is among the split Location values
mask = [a in b for a, b in zip(df12.Target2, df12['Location'].str.split(','))]
# keep the row with the maximal Strength per Place
df12 = df12[mask].copy()
df12 = (df12.loc[df12.groupby(['index','Place'])['Strength'].idxmax()]
            .assign(final = lambda x: x['Strength']))
# concatenate the direct and indirect matches
df3 = pd.concat([df11, df12[['index','Place','final','Location']]])
# left join back to the exploded DataFrame and replace NaN with 0 in the final column
df = df.merge(df3, how='left', on=['index','Place']).fillna({'final':0})
print (df)
index Place Location_x Location_y \
0 0 Delhi Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
1 0 Punjab Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
2 0 Jaipur Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
3 1 Delhi Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
4 1 Punjab Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
5 1 Jaipur Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
6 2 Delhi Delhi,Jaipur,Madras Delhi,Jaipur,Madras
7 2 Punjab Delhi,Jaipur,Madras Delhi,Jaipur,Madras
8 2 Kerala Delhi,Jaipur,Madras NaN
final
0 1.00
1 1.00
2 0.84
3 1.00
4 0.91
5 1.00
6 1.00
7 0.85
8 0.00
# finally, take the mean per original row and assign it to df1['Score']
df1['Score'] = df.groupby('index')['final'].mean()
print (df1)
Place Location Score
0 Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow 0.946667
1 Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot 0.970000
2 Delhi,Punjab,Kerala Delhi,Jaipur,Madras 0.616667
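As a cross-check of the scores above, the rule from the question can also be written row by row in plain Python. This is only a sketch under the question's stated rule, not the answer's method. The names strength and row_score are hypothetical helpers introduced here:

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "Place": ["Delhi,Punjab,Jaipur", "Delhi,Punjab,Jaipur", "Delhi,Punjab,Kerala"],
    "Location": ["Delhi,Punjab,Noida,Lucknow", "Delhi,Bhopal,Jaipur,Rajkot", "Delhi,Jaipur,Madras"],
})
df2 = pd.DataFrame({
    "Target1": ["Jaipur"] * 4 + ["Punjab"] * 4 + ["Kerala"] * 2,
    "Target2": ["Rajkot", "Punjab", "Noida", "Jodhpur", "Amritsar",
                "Delhi", "Bhopal", "Jodhpur", "Varkala", "Kochi"],
    "Strength": [0.94, 0.84, 0.62, 0.59, 0.97, 0.85, 0.91, 0.75, 0.85, 0.88],
})

# (Target1, Target2) -> Strength lookup for indirect matches.
strength = {(a, b): s for a, b, s in zip(df2["Target1"], df2["Target2"], df2["Strength"])}

def row_score(place, location):
    """Average of per-place scores: 1 for a direct match, else the best Strength, else 0."""
    places = [p.strip() for p in place.split(",")]
    locations = {loc.strip() for loc in location.split(",")}
    total = 0.0
    for p in places:
        if p in locations:
            total += 1.0  # direct match
        else:
            # best indirect match among this row's locations, 0 if none exists
            total += max((strength.get((p, loc), 0.0) for loc in locations), default=0.0)
    return total / len(places)

df1["Score"] = [row_score(p, loc) for p, loc in zip(df1["Place"], df1["Location"])]
print(df1)
```

A Python loop like this is easier to read than the merge-based version but will likely be slower on large frames.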

Python: conditional_join from the janitor package

Hi, I want to look up the factor value for my dataset based on three conditions. Below is the lookup table:
Lookup_Table = {'State_Cd': ['TX','TX','TX','TX','CA','CA','CA','CA'],
'Deductible': [0,0,1000,1000,0,0,1000,1000],
'Revenue_1': [-99999999,25000000,-99999999,25000000,-99999999,25000000,-99999999,25000000],
'Revenue_2': [24999999,99000000,24999999,99000000,24999999,99000000,24999999,99000000],
'Factor': [0.15,0.25,0.2,0.3,0.11,0.15,0.13,0.45]
}
Lookup_Table = pd.DataFrame(Lookup_Table, columns = ['State_Cd','Deductible','Revenue_1','Revenue_2','Factor'])
lookup output:
Lookup_Table
State_Cd Deductible Revenue_1 Revenue_2 Factor
0 TX 0 -99999999 24999999 0.15
1 TX 0 25000000 99000000 0.25
2 TX 1000 -99999999 24999999 0.20
3 TX 1000 25000000 99000000 0.30
4 CA 0 -99999999 24999999 0.11
5 CA 0 25000000 99000000 0.15
6 CA 1000 -99999999 24999999 0.13
7 CA 1000 25000000 99000000 0.45
And then below is my dataset.
Dataset = {'Policy': ['A','B','C'],
'State': ['CA','TX','TX'],
'Deductible': [0,1000,0],
'Revenue': [1500000,30000000,1000000]
}
Dataset = pd.DataFrame(Dataset, columns = ['Policy','State','Deductible','Revenue'])
Dataset output:
Dataset
Policy State Deductible Revenue
0 A CA 0 1500000
1 B TX 1000 30000000
2 C TX 0 1000000
So basically, for the lookup, State must match State_Cd in the lookup table, Deductible must match Deductible in the lookup table, and Revenue must lie between Revenue_1 and Revenue_2 (Revenue_1<=Revenue<=Revenue_2) to get the desired factor value. Below is my expected output:
Policy State Deductible Revenue Factor
0 A CA 0 1500000 0.11
1 B TX 1000 30000000 0.30
2 C TX 0 1000000 0.15
I'm trying conditional_join from the janitor package, but I'm getting an error. Is there something missing in my code?
import janitor
Data_Final = (Dataset.conditional_join(Lookup_Table,
# variable arguments
# col_from_left_df, col_from_right_df, comparator
('Revenue', 'Revenue_1', '>='),
('Revenue', 'Revenue_2', '<='),
('State', 'State_Cd', '=='),
('Deductible', 'Deductible', '=='),
how = 'left',sort_by_appearance = False
))
Below is the error
TypeError: __init__() got an unexpected keyword argument 'copy'
Resolved by installing an older version of pandas (earlier than 1.5), e.g.:
pip install pandas==1.4
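If pinning pandas is not an option, the same lookup can be sketched with plain pandas: an equality merge on state and deductible, followed by a range filter on revenue. This is an alternative to conditional_join, not a fix for the error above. The names lookup, merged and in_range are introduced here for illustration, and the revenue for policy A is taken from the printed Dataset output:

```python
import pandas as pd

# Data copied from the question.
Lookup_Table = pd.DataFrame({
    "State_Cd": ["TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA"],
    "Deductible": [0, 0, 1000, 1000, 0, 0, 1000, 1000],
    "Revenue_1": [-99999999, 25000000] * 4,
    "Revenue_2": [24999999, 99000000] * 4,
    "Factor": [0.15, 0.25, 0.2, 0.3, 0.11, 0.15, 0.13, 0.45],
})
Dataset = pd.DataFrame({
    "Policy": ["A", "B", "C"],
    "State": ["CA", "TX", "TX"],
    "Deductible": [0, 1000, 0],
    "Revenue": [1500000, 30000000, 1000000],
})

# Align the key names so both equality conditions can go through `on`.
lookup = Lookup_Table.rename(columns={"State_Cd": "State"})

# Equality conditions: one merged row per candidate revenue band.
merged = Dataset.merge(lookup, on=["State", "Deductible"], how="left")

# Range condition: Revenue_1 <= Revenue <= Revenue_2 (inclusive on both ends).
in_range = merged["Revenue"].between(merged["Revenue_1"], merged["Revenue_2"])

Data_Final = merged.loc[
    in_range, ["Policy", "State", "Deductible", "Revenue", "Factor"]
].reset_index(drop=True)
print(Data_Final)
```

Two caveats: the intermediate merge holds every candidate band per policy, which can grow large for big lookup tables, and policies with no matching band are dropped by the filter rather than kept with a missing Factor.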

How to highlight selected rows of a pandas DataFrame

I have the data like this:
df:
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
I would like to color-highlight the data (index 1-5 in this df) by comparing the Max and Min of the data (last two rows) to USL and LSL respectively. If Max > USL or Min < LSL, I would like to highlight the corresponding data points in red; if Max == USL or Min == LSL, in yellow; otherwise everything in green.
I tried this :
highlight = np.where(df.loc['Max']>df.loc['USL'], 'background-color: red', '')
df.style.apply(lambda _: highlight)
but I get the error:
ValueError: Function <function <lambda> at 0x7fb681b601f0> created invalid index labels.
Usually, this is the result of the function returning a Series which contains invalid labels, or returning an incorrectly shaped, list-like object which cannot be mapped to labels, possibly due to applying the function along the wrong axis.
Result index has shape: (5,)
Expected index shape: (10,)
Out[58]:
<pandas.io.formats.style.Styler at 0x7fb681b52e20>
Use a custom function that creates a DataFrame of styles from the conditions:
#changed data for test
print (df)
A-A A-B A-C A-D
Tg 0.37 10.24 5.02 0.63
USL 0.39 10.26 5.04 0.65
LSL 0.33 0.22 5.00 10.63
1 0.35 10.23 5.05 0.65
2 0.36 10.19 5.07 0.67
3 0.34 10.25 5.03 0.66
4 0.35 10.20 5.08 0.69
5 0.33 10.17 5.05 0.62
Max 0.36 10.25 5.08 0.69
Min 0.33 10.17 5.03 0.62
def hightlight(x):
    c1 = 'background-color:red'
    c2 = 'background-color:yellow'
    c3 = 'background-color:green'
    #if values of index are strings
    r = list('12345')
    #if values of index are integers, use this line instead
    # r = [1,2,3,4,5]
    m1 = (x.loc['Max']>x.loc['USL']) | (x.loc['Min']<x.loc['LSL'])
    print (m1)
    m2 = (x.loc['Max']==x.loc['USL']) | (x.loc['Min']==x.loc['LSL'])
    print (m2)
    #DataFrame with same index and columns names as original filled empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    #modify values of df1 columns by boolean mask
    df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
    return df1
df.style.apply(hightlight, axis=None)
EDIT: To compare rows 1-5 and Max/Min individually against USL and LSL, use:
def hightlight(x):
    c1 = 'background-color:red'
    c2 = 'background-color:yellow'
    c3 = 'background-color:green'
    #if values of index are strings
    r = list('12345')
    #if values of index are integers
    # r = [1,2,3,4,5]
    r += ['Max','Min']
    m1 = (x.loc[r]>x.loc['USL']) | (x.loc[r]<x.loc['LSL'])
    m2 = (x.loc[r]==x.loc['USL']) | (x.loc[r]==x.loc['LSL'])
    #DataFrame with same index and columns names as original filled empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    #modify values of df1 columns by boolean mask
    df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
    return df1
df.style.apply(hightlight, axis=None)
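A style function used with axis=None is an ordinary function from DataFrame to DataFrame, so it can be checked without rendering anything. The sketch below does the cell-by-cell comparison on two columns of the question's data, assuming the index labels are strings. The name cell_styles and its rows parameter are hypothetical, not part of the answer above:

```python
import numpy as np
import pandas as pd

RED = "background-color:red"
YELLOW = "background-color:yellow"
GREEN = "background-color:green"

# Two columns from the question, with string index labels (an assumption).
df = pd.DataFrame(
    {
        "A-A": [0.37, 0.39, 0.35, 0.35, 0.36, 0.34, 0.35, 0.33, 0.36, 0.33],
        "A-B": [10.24, 10.26, 10.22, 10.23, 10.19, 10.25, 10.20, 10.17, 10.25, 10.17],
    },
    index=["Tg", "USL", "LSL", "1", "2", "3", "4", "5", "Max", "Min"],
)

def cell_styles(x, rows):
    """Return a DataFrame of CSS strings with the same shape as x."""
    data = x.loc[rows]
    usl, lsl = x.loc["USL"], x.loc["LSL"]
    # Compare every selected cell with the limits of its own column.
    out_of_spec = (data.gt(usl, axis=1) | data.lt(lsl, axis=1)).to_numpy()
    on_limit = (data.eq(usl, axis=1) | data.eq(lsl, axis=1)).to_numpy()
    styles = pd.DataFrame("", index=x.index, columns=x.columns)
    styles.loc[rows, :] = np.select([out_of_spec, on_limit], [RED, YELLOW], default=GREEN)
    return styles

rows = list("12345")
styles = cell_styles(df, rows)
print(styles)
```

To render the colors, the same function can be passed to the Styler as df.style.apply(cell_styles, rows=rows, axis=None), since Styler.apply forwards extra keyword arguments to the function.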

Remove Columns from DataFrame based on Standard Deviation

I am trying to do something that I think should be rather simple but I am stuck.
I would like to be able to get the standard deviation of each column in my dataframe and remove that column if the standard deviation is below a set number. This is as far as I have gotten.
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(20, 5), columns=list('ABCDE'))
namelist = list(df.columns.values.tolist())
stdev = pd.DataFrame(df.std())
I've tried a few things but nothing worth mentioning, any help would be greatly appreciated.
You don't need any loops.
You rarely do with pandas.
In this case, you need boolean indexing:
import pandas
import numpy
numpy.random.seed(37)
stdev_min = 0.95
df = pandas.DataFrame(numpy.random.randn(20, 5), columns=list('ABCDE'))
So now df.std() gives me:
A 0.928547
B 0.859394
C 0.998692
D 1.187380
E 1.092970
dtype: float64
so I can do
df.loc[:, df.std() > stdev_min]
And get:
C D E
0 0.35 -1.30 1.52
1 -0.45 0.96 -0.83
2 0.52 -0.06 -0.03
3 1.89 0.40 0.19
4 -0.27 -2.07 -0.71
5 -1.72 -0.40 1.27
6 0.44 -2.05 -0.23
7 1.76 0.06 0.36
8 -0.30 -2.05 1.68
9 0.34 1.26 -1.08
10 0.10 -0.48 -1.74
11 1.95 -0.08 1.51
12 0.43 -0.06 -0.63
13 -0.30 -1.06 0.57
14 -0.95 -1.45 0.93
15 -1.13 2.23 -0.88
16 -0.77 0.86 0.58
17 0.93 -0.11 -1.29
18 -0.82 0.03 -0.44
19 0.40 1.13 -1.89
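Because the random data above changes with the seed, a small deterministic frame makes the behavior easier to verify. This is a minimal sketch with made-up values; kept and dropped are names introduced here. Note that DataFrame.std computes the sample standard deviation (ddof=1) by default:

```python
import pandas as pd

stdev_min = 0.6

# Made-up columns: A is constant, B varies a lot, C varies a little.
df = pd.DataFrame({
    "A": [0.0, 0.0, 0.0, 0.0],
    "B": [0.0, 1.0, 2.0, 3.0],
    "C": [0.0, 0.1, 0.2, 0.3],
})

stdev = df.std()  # one value per column, sample standard deviation

# Keep the columns whose standard deviation is at least the minimum ...
kept = df.loc[:, stdev >= stdev_min]

# ... or, equivalently, drop the columns that fall below it.
dropped = df.drop(columns=stdev.index[stdev < stdev_min])

print(stdev)
print(kept)
```

Both forms return a new DataFrame and leave df unchanged; assign the result back to df to make the removal stick.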
Here's a way to do this.
Iterate through the columns. For each one, compute the standard deviation and check whether it is below the minimum; if it is, drop the column in place with inplace=True.
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
for col in df.columns:
    print (col, df[col].std())
    if df[col].std() < stdev_min:
        df.drop(col,axis='columns', inplace=True)
print (df)
Output:
A 0.5046725928657507
B 1.1382221163449697
C 1.0318169576864502
D 0.7129102193331575
E 1.3805207184389312
The standard deviation of A is less than 0.6, so it got dropped.
B C D E
0 -0.923822 1.155547 -0.601033 -0.066207
1 0.068844 0.426304 -0.376052 0.368574
2 0.585187 -0.367270 0.530934 0.086811
3 0.021466 1.381579 0.483134 -0.300033
4 0.351492 -0.648734 -0.736213 0.827953
5 0.155731 -0.004504 0.315432 0.310515
6 -1.092933 1.341933 -0.672240 -3.482960
7 -0.587766 0.227846 0.246781 1.978528
8 1.565055 0.527668 -0.371854 -0.030196
9 -2.634862 -1.973874 1.508080 -0.362073
Did a few more runs. Here's an example with before and after.
DF before
A B C D E
0 0.496740 0.799021 1.655287 0.091138 0.309186
1 -0.580667 -0.749337 -0.521909 -0.529410 1.010981
2 0.212731 0.126389 -2.244500 0.400540 -0.148761
3 -0.424375 -0.832478 -0.030865 -0.561107 0.196268
4 0.229766 0.688040 0.580294 0.941885 1.554929
5 0.676926 -0.062092 -1.452619 0.952388 -0.963857
6 0.683216 0.747429 -1.834337 -0.402467 -0.383881
7 0.834815 -0.770804 1.299346 1.694612 1.171190
8 0.500445 -1.517488 0.610287 -0.601442 0.343389
9 -0.182286 -0.713332 0.526507 1.042717 1.229628
Standard Deviations for each column of DF:
A 0.49088743174291477
B 0.8047513692231202
C 1.333382184686379
D 0.8248456756163864
E 0.8033725216710547
The standard deviation of df['A'] is less than 0.6, so it got dropped.
DF after dropping the column.
B C D E
0 0.799021 1.655287 0.091138 0.309186
1 -0.749337 -0.521909 -0.529410 1.010981
2 0.126389 -2.244500 0.400540 -0.148761
3 -0.832478 -0.030865 -0.561107 0.196268
4 0.688040 0.580294 0.941885 1.554929
5 -0.062092 -1.452619 0.952388 -0.963857
6 0.747429 -1.834337 -0.402467 -0.383881
7 -0.770804 1.299346 1.694612 1.171190
8 -1.517488 0.610287 -0.601442 0.343389
9 -0.713332 0.526507 1.042717 1.229628

Pairwise correlation from 2 dataframes in python

I have 2 dataframes:
df = pd.DataFrame({'SAMs': ['GOS', 'BUM', 'BEN', 'AUD', 'VWA','HON'],
'GN1': [22, 22, 2, 2, 2,5],
'GN2':[1.1,5.7,4.8,7.09,10.876,0.178]})
df
GN1 GN2 SAMs
0 22 1.100 GOS
1 22 5.700 BUM
2 2 4.800 BEN
3 2 7.090 AUD
4 2 10.876 VWA
5 5 0.178 HON
and df2:
df2 = pd.DataFrame({'SAMs': ['FAMS', 'SAP', 'KLM', 'SOS', 'LUD','EJT'],
'GN1': [22, 22, 2, 2, 2,5],
'GN2':[1.1,5.7,4.8,7.09,10.876,0.178]})
I need to calculate the Pearson correlations between the column SAMs from df and df2. For each value in column SAMs from both df and df2, I'd like to make pairwise combinations and calculate their correlations.
At the end, the output should look like:
SAMs correlation_value P-value
GOS-FAMS 0.45 0.87
GOS-SAP 0.55 1
GOS-KLM 0.15 0.89
...
HON-EJT 0.156 0.98
Any suggestions would be great!
