I have a dataframe as follows:
print(df)
SAS_a1 SAS2_a1 SAS3_a1 FDF_b1 FDF2_b1
0 0.673114 0.745755 0.989468 0.498920 0.837440
1 0.811218 0.392196 0.505301 0.615603 0.946847
2 0.252856 0.709125 0.321580 0.826123 0.224813
3 0.566833 0.738661 0.626808 0.815460 0.003738
4 0.102995 0.171741 0.246565 0.784519 0.980965
I am aiming to compute pairwise correlations with pearsonr, but only between the columns ending with a1 and the columns ending with b1. The final result should look like this:
PCC p-value
SAS_a1__FDF_b1 -0.293373 0.631895
SAS_a1__FDF2_b1 -0.947724 0.014235
SAS2_a1__FDF_b1 0.771389 0.126618
SAS2_a1__FDF2_b1 0.132380 0.831942
SAS3_a1__FDF_b1 0.422249 0.478808
SAS3_a1__FDF2_b1 0.346411 0.567923
Any suggestions would be great ..!!!
Here is what I tried:
import itertools
from pandas import DataFrame
from scipy.stats import pearsonr

correlations = {}
columns = df.columns.tolist()
for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

results = DataFrame.from_dict(correlations, orient='index')
results.columns = ['PCC', 'p-value']
I don't know if it's the most elegant solution, but you can use a list comprehension to select the relevant columns:
import pandas as pd
from scipy.stats import pearsonr

result = pd.DataFrame()
for a1 in [column for column in df.columns if 'a1' in column]:
    for b1 in [column for column in df.columns if 'b1' in column]:
        # note: DataFrame.append is deprecated since pandas 1.4 and removed in 2.0
        result = result.append(
            pd.Series(
                pearsonr(df[a1], df[b1]),
                index=['PCC', 'p-value'],
                name=a1 + '__' + b1
            ))
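If your pandas version no longer has append, here is a minimal alternative sketch (my own addition, matching on the column suffixes with str.endswith and building the frame once with from_dict instead of repeated append):
import itertools

import pandas as pd
from scipy.stats import pearsonr

a1_cols = [c for c in df.columns if c.endswith('a1')]
b1_cols = [c for c in df.columns if c.endswith('b1')]

# one row per (a1, b1) pair, built in one go
correlations = {
    f'{a}__{b}': pearsonr(df[a], df[b])
    for a, b in itertools.product(a1_cols, b1_cols)
}
result = pd.DataFrame.from_dict(correlations, orient='index', columns=['PCC', 'p-value'])
print(result)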
PS: It would be great if you would include your imports in your next question. (So that people answering don't have to google it)
I have two pandas DataFrames that look like this. I'm trying to join the two data sets on 'Name', 'Longitude', and 'Latitude', but using a fuzzy/approximate match. Is there a way to join these using a combination of the 'Name' strings being, for example, at least an 80% match, and the 'Latitude' and 'Longitude' columns being the nearest value or within roughly 0.001 of each other? I tried using pd.merge_asof but couldn't figure out how to make it work. Thank you for the help!
import pandas as pd
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
merge_asof won't work here, since it can only merge on a single numeric column, such as datetimelike, integer, or float (see the docs).
Here you can compute the (Euclidean) distance between the coordinates of df1 and df2 and pick the best match:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
# Replacing 'Latitude' and 'Longitude' columns with a 'Coord' Tuple
df1['Coord'] = df1[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df1.drop(columns=['Latitude', 'Longitude'], inplace=True)
df2['Coord'] = df2[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df2.drop(columns=['Latitude', 'Longitude'], inplace=True)
# Creating a distance matrix between df1['Coord'] and df2['Coord']
distances_df1_df2 = cdist(df1['Coord'].to_list(), df2['Coord'].to_list())
# Creating df1['Price'] column from df2 and the distance matrix
for i in df1.index:
    # you can replace the following lines with a loop over distances_df1_df2[i]
    # and reject names that are too far from each other
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # df1.loc[i, 'df2_Name'] = df2.loc[closest_match, 'Name']  # keep track of the merged row
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']
print(df1)
Output:
Name Rating Coord Price
0 Game Time Bar 4.5 (42.3734, -71.1204) $$
1 Sports Grill 4.6 (42.3739, -71.1214) $
2 Sports Grill 2 4.3 (42.3839, -71.1315) $$
Edit: your requirement on 'Name' ("at least an 80% match") isn't really appropriate. Take a look at fuzzywuzzy to get a sense of how string distances can be measured.
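If you do want a rough name check on top of the distance gate, here is a small sketch (my own addition; it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzzywuzzy and reuses df1, df2 and distances_df1_df2 from the code above; the 0.8 and 0.001 thresholds are just illustrative):
from difflib import SequenceMatcher

import numpy as np

def name_similarity(a, b):
    # ratio() returns a value in [0, 1]; 0.8 loosely corresponds to "at least an 80% match"
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for i in df1.index:
    j = np.argmin(distances_df1_df2[i])  # nearest df2 row by coordinates
    close_enough = distances_df1_df2[i][j] <= 0.001
    similar_enough = name_similarity(df1.loc[i, 'Name'], df2.loc[j, 'Name']) >= 0.8
    if close_enough and similar_enough:
        df1.loc[i, 'Price'] = df2.loc[j, 'Price']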
In a column in a Dask Dataframe, I have strings like this:
column_name_1    column_name_2
a^b^c            j
e^f^g            k^l
h^i              m
I need to split these strings into columns in the same dataframe, like this:
column_name_1    column_name_2    column_name_1_1    column_name_1_2    column_name_1_3    column_name_2_1    column_name_2_2
a^b^c            j                a                  b                  c                  j
e^f^g            k^l              e                  f                  g                  k                  l
h^i              m                h                  i                                     m
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the Dataframe that are to be left alone, so I need to be able to specify which columns to split like this.
My best effort includes something like
df[["column_name_1_1", "column_name_1_2", "column_name_1_3"]] = df["column_name_1"].str.split('^', n=2, expand=True)
But it fails with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here are two solutions that work without stack, using a loop over the selected column names:
cols = ['column_name_1','column_name_2']
for c in cols:
    df = df.join(df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna(''))

print (df)
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 \
0 a^b^c j a b c
1 e^f^g k^l e f g
2 h^i m h i
column_name_2_0 column_name_2_1
0 j
1 k l
2 m
Or modify another solution:
cols = ['column_name_1','column_name_2']
dfs = [df[c].str.split('^',n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print (df)
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 \
0 a^b^c j a b c
1 e^f^g k^l e f g
2 h^i m h i
column_name_2_0 column_name_2_1
0 j
1 k l
2 m
Unfortunately, using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask; the following raises a NotImplementedError:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'],
        'column_name_2': ['j', 'k^l', 'm']
    }),
    npartitions=2
)

# raises NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()
Usually when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function on each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, as provided with a meta argument. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split, and returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.
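For illustration, here is a rough sketch of that map_partitions route (my own addition, assuming you can fix the maximum number of parts ahead of time, here 3, and spell the output columns out in meta):
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'],
        'column_name_2': ['j', 'k^l', 'm']
    }),
    npartitions=2
)

def split_fixed(pdf, col, n_parts):
    # split within one pandas partition, padding to a fixed number of columns
    parts = pdf[col].str.split('^', expand=True).reindex(columns=range(n_parts))
    parts.columns = [f'{col}_{i}' for i in range(n_parts)]
    return pdf.join(parts.fillna(''))

# meta declares the output columns up front so Dask can stay lazy
meta = pd.DataFrame({c: pd.Series(dtype='object')
                     for c in ['column_name_1', 'column_name_2',
                               'column_name_1_0', 'column_name_1_1', 'column_name_1_2']})
new_ddf = ddf.map_partitions(split_fixed, 'column_name_1', 3, meta=meta)
print(new_ddf.compute())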
Here is one solution (building from #Fontanka16's answer) if you do know the number of splits ahead of time:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'],
        'column_name_2': ['j', 'k^l', 'm']
    }),
    npartitions=2
)

ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)

new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()
I am trying to have pandas compare two dataframes with each other. In dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of dataframe 2. If a match is found between one of the columns of dataframe 2 and the value of dataframe 1 being studied, I want pandas to copy the header of the dataframe 2 column in which the match is found to a new column in dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not produce the desired output. For example, take the B737 AC-Cat value: I want pandas to find this value in DF2, in the column CAT-D, and copy that header to the new column of DF1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not have access to WCat_df. I also split the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:]==AC_cat]
        Wcat = d.columns[(d==AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT,WCat_df ))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem
This will give you 2 new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'), on='AC-Cat', how='left')
                  .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
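Another small option (my own sketch, not from the answers above): build a value-to-column lookup dict once and map it onto 'AC-Cat'; unmatched values come back as NaN.
# reuses Flight_df and WCat_df from the question
lookup = {value: col for col in WCat_df.columns for value in WCat_df[col].dropna()}
Flight_df['CAT'] = Flight_df['AC-Cat'].map(lookup)
print(Flight_df)
If a value appears in more than one category column, the last column wins, so this only makes sense when the categories are disjoint.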
Good day SO community,
I have been having an issue with trying to highlight errors in my df, row by row.
import pandas as pd

reference_dict = {'jobclass' : ['A','B'], 'Jobs' : ['Teacher','Plumber']}
dict = {'jobclass': ['A','C','A'], 'Jobs': ['Teacher', 'Plumber','Policeman']}
df = pd.DataFrame(data=dict)
def highlight_rows(df):
    for i in df.index:
        if df.jobclass[i] in reference_dict['jobclass']:
            print(df.jobclass[i])
            return 'background-color: green'

df.style.apply(highlight_rows, axis = 1)
I am getting the error:
TypeError: ('string indices must be integers', 'occurred at index 0')
What I hope to get is my df with the values not found in my reference_dict highlighted.
Any help would be greatly appreciated.. Cheers!
Edit:
x = {'jobclass' : ['A','B'], 'Jobs' : ['Teacher','Plumber']}
d = {'jobclass': ['A','C','A'], 'Jobs': ['Teacher', 'Plumber','Policeman']}
df = pd.DataFrame(data=d)
print(df)
def highlight_rows(s):
    ret = ["" for i in s.index]
    for i in df.index:
        if df.jobclass[i] not in x['jobclass']:
            ret[s.index.get_loc('Jobs')] = "background-color: yellow"
    return ret

df.style.apply(highlight_rows, axis = 1)
I tried this and got the whole column highlighted instead of the specific row values that I want. =/
You can use merge with the indicator parameter to find the non-matching values and then create a DataFrame of styles:
x = {'jobclass' : ['A','B'], 'Jobs' : ['Teacher','Plumber']}
d = {'jobclass': ['A','C','A'], 'Jobs': ['Teacher', 'Plumber','Policeman']}
df = pd.DataFrame(data=d)
print (df)
jobclass Jobs
0 A Teacher
1 C Plumber
2 A Policeman
Detail:
print (df.merge(pd.DataFrame(x) , on='jobclass', how='left', indicator=True))
jobclass Jobs_x Jobs_y _merge
0 A Teacher Teacher both
1 C Plumber NaN left_only
2 A Policeman Teacher both
def highlight_rows(s):
    c1 = 'background-color: yellow'
    c2 = ''
    df1 = pd.DataFrame(x)
    m = s.merge(df1, on='jobclass', how='left', indicator=True)['_merge'] == 'left_only'
    df2 = pd.DataFrame(c2, index=s.index, columns=s.columns)
    df2.loc[m, 'Jobs'] = c1
    return df2

df.style.apply(highlight_rows, axis = None)
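For comparison, a smaller sketch (my own addition, not the merge approach above) that checks membership row by row with Styler.apply(axis=1); each call receives one row as a Series and must return one style string per column:
import pandas as pd

x = {'jobclass': ['A', 'B'], 'Jobs': ['Teacher', 'Plumber']}
d = {'jobclass': ['A', 'C', 'A'], 'Jobs': ['Teacher', 'Plumber', 'Policeman']}
df = pd.DataFrame(data=d)

def highlight_unknown(row):
    # highlight the whole row when its jobclass is not in the reference dict
    missing = row['jobclass'] not in x['jobclass']
    return ['background-color: yellow' if missing else '' for _ in row]

styled = df.style.apply(highlight_unknown, axis=1)  # render `styled` in a notebook to see it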
Good day to you as well!
What i hope to get is my df with values not found in my reference_dict being highlighted.
If you're looking for values not found in reference_dict to be highlighted, do you mean for the function to be the following?
def highlight_rows(df):
    for i in df.index:
        if df.jobclass[i] not in reference_dict['jobclass']:
            print(df.jobclass[i])
            return 'background-color: green'
Either way, why highlight the rows when you could isolate them? It seems like you want to look at all of the job classes in df where there is not one in reference_dict.
import pandas as pd
reference_dict = {'jobclass' : ['A','B'], 'Jobs' : ['Teacher','Plumber']}
data_dict = {'jobclass': ['A','C','A'], 'Jobs': ['Teacher', 'Plumber','Policeman']}
ref_df = pd.DataFrame(reference_dict)
df = pd.DataFrame(data_dict)
# Merge the two tables together; how='outer' keeps jobclasses the DataFrames do not have
# in common. Columns Jobs_x and Jobs_y are generated automatically because both frames
# have a 'Jobs' column.
outliers = df.merge(ref_df, how='outer', on='jobclass')
# Jobs_y is null when there is no matching jobclass in the reference DataFrame,
# so we can filter on that.
outliers = outliers[ outliers['Jobs_y'].isnull() ]
# Drop the junk column after we used it to filter for what we wanted.
outliers = outliers.drop('Jobs_y', axis=1)
print("The reference DataFrame is:")
print(ref_df,'\n')
print("The input DataFrame is:")
print(df,'\n')
print("The result is a list of all the jobclasses not in the reference DataFrame and what job is with it:")
print(outliers)
The result is:
The reference DataFrame is:
jobclass Jobs
0 A Teacher
1 B Plumber
The input DataFrame is:
jobclass Jobs
0 A Teacher
1 C Plumber
2 A Policeman
The result is a list of all the jobclasses not in the reference DataFrame and what job is with it:
jobclass Jobs_x
2 C Plumber
This could have been a tangent but it's what I'd do. I was not aware you could highlight rows in pandas at all, cool trick.
After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, e.g.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-build command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
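If the goal is literally a string of code that recreates the frame, here is a minimal sketch for a flat index (my own addition; it only uses df.values.tolist(), df.columns and df.index, and ignores MultiIndex, which the fuller recipe below handles):
import pandas as pd

df = pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])

# build a string of code that recreates df
cmd = "DataFrame({data}, columns={cols}, index={idx})".format(
    data=df.values.tolist(),
    cols=list(df.columns),
    idx=list(df.index),
)
print(cmd)
# DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])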
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# Python 2 style method binding; in Python 3 a plain assignment works: DataFrame.command = _gencmd
DataFrame.command = MethodType(_gencmd, None, DataFrame)
I have only tested it on a few cases so far and would love a more general solution.