Pairwise correlation from 2 dataframes in python

I have 2 dataframes:
df1 = pd.DataFrame({'SAMs': ['GOS', 'BUM', 'BEN', 'AUD', 'VWA', 'HON'],
                    'GN1': [22, 22, 2, 2, 2, 5],
                    'GN2': [1.1, 5.7, 4.8, 7.09, 10.876, 0.178]})
df1
GN1 GN2 SAMs
0 22 1.100 GOS
1 22 5.700 BUM
2 2 4.800 BEN
3 2 7.090 AUD
4 2 10.876 VWA
5 5 0.178 HON
and df2:
df2 = pd.DataFrame({'SAMs': ['FAMS', 'SAP', 'KLM', 'SOS', 'LUD', 'EJT'],
                    'GN1': [22, 22, 2, 2, 2, 5],
                    'GN2': [1.1, 5.7, 4.8, 7.09, 10.876, 0.178]})
I need to calculate the Pearson correlations between the samples in column SAMs of df1 and df2. For each value in column SAMs of df1 paired with each value in column SAMs of df2, I'd like to form the pairwise combination and calculate the correlation of their GN values.
At the end, the output should look like:
SAMs correlation_value P-value
GOS-FAMS 0.45 0.87
GOS-SAP 0.55 1
GOS-KLM 0.15 0.89
...
HON-EJT 0.156 0.98
Any suggestions would be great!
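No answer is included above, so here is a minimal sketch of one way to do it, assuming scipy is available and that the real data has more GN columns than the two shown (a Pearson correlation over only two points is degenerate):
from itertools import product

import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame({'SAMs': ['GOS', 'BUM', 'BEN', 'AUD', 'VWA', 'HON'],
                    'GN1': [22, 22, 2, 2, 2, 5],
                    'GN2': [1.1, 5.7, 4.8, 7.09, 10.876, 0.178]})
df2 = pd.DataFrame({'SAMs': ['FAMS', 'SAP', 'KLM', 'SOS', 'LUD', 'EJT'],
                    'GN1': [22, 22, 2, 2, 2, 5],
                    'GN2': [1.1, 5.7, 4.8, 7.09, 10.876, 0.178]})

gn_cols = [c for c in df1.columns if c != 'SAMs']   # measurement columns (GN1, GN2, ...)
a = df1.set_index('SAMs')[gn_cols]
b = df2.set_index('SAMs')[gn_cols]

rows = []
for s1, s2 in product(a.index, b.index):            # every pairwise combination of samples
    r, p = pearsonr(a.loc[s1], b.loc[s2])
    rows.append({'SAMs': f'{s1}-{s2}', 'correlation_value': r, 'P-value': p})

out = pd.DataFrame(rows)
print(out)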

Related

Python String Matching Using Loops and Iterations and Score Calculation using two dataframes

df1
Place Location
Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow
Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot
Delhi,Punjab,Kerala Delhi,Jaipur,Madras
df2
Target1 Target2 Strength
Jaipur Rajkot 0.94
Jaipur Punjab 0.84
Jaipur Noida 0.62
Jaipur Jodhpur 0.59
Punjab Amritsar 0.97
Punjab Delhi 0.85
Punjab Bhopal 0.91
Punjab Jodhpur 0.75
Kerala Varkala 0.85
Kerala Kochi 0.88
The task is to match each 'Place' value with the 'Location' values and assign a score of 1 for a direct match; for an indirect match, refer to df2 and assign the strength score from there. For example, in Row 1, Delhi and Punjab are direct matches since both are present in 'Place' and 'Location', whereas Jaipur is in 'Place' but not in 'Location'. So Jaipur is looked up in df2's Target1 to find any of Row 1's 'Location' values in Target2. In df2, Jaipur is related to Punjab and Noida, both of which are in Row 1's Location values. So, for Jaipur, Punjab's strength is allotted because 0.84 is higher than Noida's 0.62. The final score is calculated as (1 + 1 + 0.84) / 3, i.e. the sum of direct and indirect matches divided by the number of 'Place' items.
Expected output is :
Place Location Avg. Score
Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow (1+1+0.84)/3 = 0.95
Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot (1+0.91+1)/3 = 0.97
Delhi,Punjab,Kerala Delhi,Jaipur,Madras (1+0.85+0)/3 = 0.62
My try
data1 = df1['Place'].to_list()
data2 = df1['Location'].to_list()
dict3 = {}
exac_match = []
for el in data1:
    #print(el)
    el = [x.strip() for x in el.split(',')]
    for ell in data2:
        ell = [x.strip() for x in ell.split(',')]
        dict1 = {}
        dict2 = {}
        for elll in el:
            if elll in ell:
                #print("Exact match:::", elll)
                dict1[elll] = 1
                dict2[elll] = elll
Use:
import numpy as np
import pandas as pd

#convert split values of df1['Place'] to rows
df = df1.assign(Place = df1['Place'].str.split(',')).explode('Place').reset_index()
#test if Place matches Location (split values)
mask = [a in b for a, b in zip(df.Place, df['Location'].str.split(','))]
#filter matched rows, remove duplicates, assign 1 to final column
df11 = df[mask].drop_duplicates(['index','Place','Location']).assign(final=1)
#filter unmatched rows (indirect match) and join df2
df12 = df[~np.array(mask)].merge(df2, left_on='Place', right_on='Target1')
#test if Target2 in Location
mask = [a in b for a, b in zip(df12.Target2, df12['Location'].str.split(','))]
#get maximal Strength per Place
df12 = df12[mask].copy()
df12 = (df12.loc[df12.groupby(['index','Place'])['Strength'].idxmax()]
.assign(final = lambda x: x['Strength']))
#join together
df3 = pd.concat([df11, df12[['index','Place','final','Location']]])
#join to exploded DataFrame, replacing NaN with 0 in the final column
df = df.merge(df3, how='left', on=['index','Place']).fillna({'final':0})
print (df)
index Place Location_x Location_y \
0 0 Delhi Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
1 0 Punjab Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
2 0 Jaipur Delhi,Punjab,Noida,Lucknow Delhi,Punjab,Noida,Lucknow
3 1 Delhi Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
4 1 Punjab Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
5 1 Jaipur Delhi,Bhopal,Jaipur,Rajkot Delhi,Bhopal,Jaipur,Rajkot
6 2 Delhi Delhi,Jaipur,Madras Delhi,Jaipur,Madras
7 2 Punjab Delhi,Jaipur,Madras Delhi,Jaipur,Madras
8 2 Kerala Delhi,Jaipur,Madras NaN
final
0 1.00
1 1.00
2 0.84
3 1.00
4 0.91
5 1.00
6 1.00
7 0.85
8 0.00
#last aggregate mean and assign to df1['Score']
df1['Score'] = df.groupby('index')['final'].mean()
print (df1)
Place Location Score
0 Delhi,Punjab,Jaipur Delhi,Punjab,Noida,Lucknow 0.946667
1 Delhi,Punjab,Jaipur Delhi,Bhopal,Jaipur,Rajkot 0.970000
2 Delhi,Punjab,Kerala Delhi,Jaipur,Madras 0.616667

Find the second largest date in df python [duplicate]

I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.loc[:, 'Unnamed: 9':'Zeb']
I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:
RawResults = Results.loc[:, 'Unnamed: 9':'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest() :
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
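The question asks per row rather than per column, so here is a minimal sketch applying the same idea row-wise to the frame from the question (the Votes2/Party2 column names and the to_numeric coercion are assumptions, not part of the original answer):
import pandas as pd

# Party-vote columns, coerced to numeric so nlargest works even if some cells are blank.
RawResults = Results.loc[:, 'Unnamed: 9':'Zeb'].apply(pd.to_numeric, errors='coerce')

# Runner-up votes and party name for every constituency (row).
Results['Votes2'] = RawResults.apply(lambda row: row.nlargest(2).iloc[-1], axis=1)
Results['Party2'] = RawResults.apply(lambda row: row.nlargest(2).index[-1], axis=1)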
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas Data Frame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from @Divakar using np.argpartition, which moves the indices of each row's two largest values to the front of the row without fully sorting it:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
Here is an interesting approach: what if we replace the maximum value with the minimum value and then take idxmax again? It is a quick hack and not recommended!
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(), df.min()).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could also sort the party columns within each row in descending order, so the first positions hold the winner, the runner-up, and so on; positional indexing then gives the nth place:
RawResults = Results.loc[:, 'Unnamed: 9':'Zeb']
ranked = RawResults.apply(lambda row: row.sort_values(ascending=False).reset_index(drop=True), axis=1)
ranked.iloc[:, 0] # First place votes
ranked.iloc[:, 1] # Second place votes
ranked.iloc[:, n] # nth place votes
Here is a solution using nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col,n):
...     largest = col.nlargest(n).reset_index(drop = True)
...     data = [x for x in largest]
...     index = [f'{i}_largest' for i in range(1,len(largest)+1)]
...     return pd.Series(data,index=index)
...
...
>>> def n_largest(df, axis, n):
...     '''
...     Function to return the n largest values of each
...     column/row of the input DataFrame.
...     '''
...     return df.apply(give_largest, axis = axis, n = n)
...
>>> n_largest(df,axis = 1, n = 2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
def second_largest(df):
    return df.nlargest(2).min()
print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
Transpose and use nlargest in a for loop to get the results ordered for each row:
df1 = df.T
results = list()
for col in df1.columns:
    results.append(df1[col].nlargest(len(df.columns)))
The results variable is a list of pandas Series, where the first item in the list is the df's first row sorted in descending order, and so on. Since each item is a pandas Series, it carries the df's columns as its index (the frame was transposed), so for each row you get both the sorted values and the corresponding column names.
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....

separating dataset into 2 groups (group 1: ID starting with u and group 2: ID starting with s)

ID A1 A2 Exam
0 u123456 10.00 0.00 21
1 s123457 6.80 9.40 30
2 u123458 13.35 20.00 25
3 u123459 0.00 10.15 24
4 u123460 4.50 8.09 21
5 u123461 5.50 13.30 14
6 u123462 20.00 12.75 16
7 s123463 20.00 17.50 22
8 u123464 11.75 17.30 31
9 s123465 0.00 12.65 15
The above is a sample of my dataset. I'm confused about how to make two datasets based on whether the ID starts with 'u' or 's'. I am new to coding, so sorry for asking a silly question.
You can group the DataFrame using a function that takes the first letter into account.
import pandas as pd
df = pd.DataFrame(
[
['s123457', 6.80, 9.40, 30],
['u123458', 13.35, 20.00, 25],
['u123459', 0.00, 10.15, 24],
['u123460', 4.50, 8.09, 21],
['u123461', 5.50, 13.30, 14],
['u123462', 20.00, 12.75, 16],
['s123463', 20.00, 17.50, 22],
['u123464', 11.75, 17.30, 31],
['s123465', 0.00, 12.65, 15]
],
columns=['ID', 'A1', 'A2', 'Exam']
)
# Group by the first letter of the ID column.
grouped = df.groupby(lambda index: df['ID'].loc[index][0])
# Output key and associated group, with the index of the group being reset.
for key, group in grouped:
    print(key)
    print(group.reset_index(drop=True))
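If you literally need two separate DataFrames rather than a grouped view, boolean masking on the first character works as well; a minimal sketch reusing the df defined above (the u_df/s_df names are only for illustration):
# Keep rows whose ID starts with the given letter.
u_df = df[df['ID'].str.startswith('u')].reset_index(drop=True)
s_df = df[df['ID'].str.startswith('s')].reset_index(drop=True)
print(u_df)
print(s_df)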

Merging One Column onto Multiple Columns

I have the following two dataframes, DF1:
location vaccine1 vaccine2 vaccine3 vaccine4
0 Afghanistan Oxford/AstraZeneca Pfizer/BioNTech Sinopharm/Beijing None
1 Albania Oxford/AstraZeneca Pfizer/BioNTech Sinovac Sputnik V
2 Algeria Sputnik V None None None
3 Andorra Oxford/AstraZeneca Pfizer/BioNTech None None
DF2:
Vaccine Efficacy
0 Oxford/AstraZeneca 0.70
1 Pfizer/BioNTech 0.95
2 Sinopharm/Beijing 0.79
3 Sinovac 0.50
4 Sputnik V 0.92
I understand that you can merge like this below but the process is repeated 4 times which is inefficient:
v1 = pd.merge(df1, vacc_eff, how='left', left_on='vaccine1', right_on='Vaccine')[['location', 'Efficacy']]
v2 = pd.merge(df1, vacc_eff, how='left', left_on='vaccine2', right_on='Vaccine')[['location', 'Efficacy']]
vmerged = pd.merge(v1,v2,on=['location'])
How can I merge the DF2 column 'Efficacy' onto each of the vaccine columns in DF1 without writing the same merge function again and again?
Here is a solution you can try out: stack + map, then unstack.
map_ = vacc_eff.set_index('Vaccine')['Efficacy'].to_dict()
print(
df1[['location', 'vaccine1', 'vaccine2']].set_index('location')
.stack().map(map_).unstack()
)
vaccine1 vaccine2
location
Afghanistan 0.70 0.95
Albania 0.70 0.95
Algeria 0.92 NaN
Andorra 0.70 0.95
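The same idea extends to all four vaccine columns; a hedged sketch reusing df1, vacc_eff and map_ from above, with filter(like='vaccine') picking up the columns so they don't have to be listed one by one:
# Pick up vaccine1..vaccine4 without naming them individually.
vaccine_cols = df1.filter(like='vaccine').columns

efficacy = (
    df1.set_index('location')[vaccine_cols]
       .stack()       # long format: one row per (location, vaccine slot); missing vaccines are dropped
       .map(map_)     # look up the efficacy of each vaccine name
       .unstack()     # back to one column per vaccine slot
)
print(efficacy)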

The most elegant way to do a calculation on dataframe column

I'm a newbie in Python.
I have a column in a pandas DataFrame called [weight].
What is the most efficient and smartest way to rescale the securities' weights so that they sum to 1 (or 100%)? Something like the sample calculation below:
        weight    new weight
        0.05          14%
        0.10          29%
        0.20          57%
total   0.35         100%
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
print(df)
Security Rating Weight
ABC AAA 0.05
DEF BBB 0.10
GHI AA 0.20
I think we can divide each weight by the sum of weights to get the percentage weight (newWeight):
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
df['newWeight'] = 100 * df['Weight'] / sum(df['Weight'])
print(df)
## Rating Security Weight newWeight
## 0 AAA ABC 0.05 14.285714
## 1 BBB DEF 0.10 28.571429
## 2 AA GHI 0.20 57.142857
Using the apply method is a neat way to solve this problem. You can do something like this:
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
total = df.Weight.sum()
df['newWeight'] = df.Weight.apply(lambda x: x / total)
The resulting DataFrame looks like this:
Security Rating Weight newWeight
0 ABC AAA 0.05 0.142857
1 DEF BBB 0.10 0.285714
2 GHI AA 0.20 0.571429
If you want to represent these as percentages, you need to convert them to strings; here's an example:
df['percentWeight'] = df.newWeight.apply(lambda x: "{}%".format(round(x * 100)))
And you get the result:
Security Rating Weight newWeight percentWeight
0 ABC AAA 0.05 0.142857 14%
1 DEF BBB 0.10 0.285714 29%
2 GHI AA 0.20 0.571429 57%
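A shorter variant (a sketch, not one of the answers above) that does the normalization with plain vectorized division and formats the percentages with Python's percent format spec:
import pandas as pd

df = pd.DataFrame({'Security': ['ABC', 'DEF', 'GHI'],
                   'Rating': ['AAA', 'BBB', 'AA'],
                   'Weight': [0.05, 0.1, 0.2]})

# Vectorized normalization: divide each weight by the column total.
df['newWeight'] = df['Weight'] / df['Weight'].sum()
# Render as whole-percent strings, e.g. '14%'.
df['percentWeight'] = df['newWeight'].map('{:.0%}'.format)
print(df)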
