Diagonal heatmap with matplotlib - python

I have a heatmap that I created from Pandas in this way:
tukey = tukey.set_index('index')
fig, ax = plt.subplots(figsize=(12,6))
ax.set_title(str(date)+' '+ str(hour)+':'+'00',fontsize=14)
heatmap_args = {'linewidths': 0.35, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.75, 0.35, 0.04, 0.3]}
sp.sign_plot(tukey, **heatmap_args)
I have tried to do this with seaborn but I haven't gotten the desired output:
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(tukey, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 6))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(tukey, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
As seen, it still shows squares where they are supposed to be masked, and obviously the cbar is different.
My question is whether there is any way to make it diagonal (show only one triangle) without using seaborn, or at least to get rid of the repeated half?
Edit: sample of my dataframe (the tukey):
>>> 1_a 1_b 1_c 1_d 1_e 1_f
index
1_a 1.00 0.900 0.75 0.736 0.900 0.400
1_b 0.9000 1.000 0.72 0.715 0.900 0.508
1_c 0.756 0.342 1.000 0.005 0.124 0.034
1_d 0.736 0.715 0.900 1.000 0.081 0.030
1_e 0.900 0.900 0.804 0.793 1.000 0.475
1_f 0.400 0.508 0.036 0.030 0.475 1.000
*I might have made some typos; the two triangular halves are supposed to be equal (the matrix is symmetric).
Edit:
imports:
import scikit_posthocs as sp
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns

scikit_posthocs' sign_plot() seems to create a QuadMesh (as does sns.heatmap). Setting an edge color on such a mesh draws horizontal and vertical lines across the full width and height of the mesh. To make the edges invisible in the "empty" region, they can be colored the same as the background (for example white). Individual cells can be made invisible by setting their values to NaN, as in the code below.
Removing a column and a row (e.g. tukey.drop('1_f', axis=1, inplace=True) and tukey.drop('1_a', axis=0, inplace=True)) doesn't make the plot any smaller, because sign_plot adds them back in automatically.
import matplotlib.pyplot as plt
import scikit_posthocs as sp
import pandas as pd
import numpy as np
from io import StringIO
data_str = ''' 1_a 1_b 1_c 1_d 1_e 1_f
1_a 1.00 0.900 0.75 0.736 0.900 0.400
1_b 0.9000 1.000 0.72 0.715 0.900 0.508
1_c 0.756 0.342 1.000 0.005 0.124 0.034
1_d 0.736 0.715 0.900 1.000 0.081 0.030
1_e 0.900 0.900 0.804 0.793 1.000 0.475
1_f 0.400 0.508 0.036 0.030 0.475 1.000'''
tukey = pd.read_csv(StringIO(data_str), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
cols = tukey.columns
for i in range(len(cols)):
    for j in range(i, len(cols)):
        tukey.iloc[i, j] = np.nan  # hide the diagonal and the upper triangle
fig, ax = plt.subplots(figsize=(12, 6))
heatmap_args = {'linewidths': 0.35, 'linecolor': 'white', 'clip_on': False, 'square': True,
'cbar_ax_bbox': [0.75, 0.35, 0.04, 0.3]}
sp.sign_plot(tukey, **heatmap_args)
plt.show()

Related

Adding a weighting to curve fit

So I have a curve fit, and I'm wondering how to include the weighting based on the standard error.
Here's the df (defined as df_altered):
Temperature Growth_rate Standard_Error Result Final_results Weight
14.0 0.363 0.110 0.000 0.363 9.091
18.0 0.677 0.043 0.767 0.673 23.256
22.0 0.822 0.044 0.975 0.832 22.727
26.0 0.936 0.073 0.975 0.920 13.699
30.0 0.897 0.051 0.767 0.911 19.608
And here's the curve fit setup (I don't really know if this qualifies as a curve fit, though):
import numpy as np
import matplotlib.pyplot as plt
x = df_altered['Temperature']
y1 = df_altered['Growth_rate']
y2 = df_altered['Final_results']
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(x, y1, 'r-')
ax2.plot(x, y2, 'g-')
ax1.set_xlabel('Temperature (°C)')
ax1.set_ylabel('Observed growth rate', color='r')
ax2.set_ylabel('Optimised modelled growth rate', color='g')
plt.show()
So essentially, I'd like to use the Weight and Standard_Error columns to determine the weighting of the fit (the smaller the standard error, the greater the weight). I've already set this up with popt2:
popt2, pcov2 = curve_fit(boatman_temperature_function_optimised, xdata = df_altered.Temperature, ydata = df_altered.Growth_rate, p0 = array_of_maxima, sigma=df_altered.Weight, bounds = (array_of_minima, array_of_maxima), absolute_sigma=True)
Any thoughts?
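With scipy.optimize.curve_fit, the sigma argument is interpreted as the per-point standard deviation of ydata, so the fit minimizes the sum of ((y - f(x)) / sigma)**2: passing Standard_Error directly already gives low-error points more weight, with no separate Weight column needed. Below is a minimal sketch, using a placeholder model in place of boatman_temperature_function_optimised and the data from the question:
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# data from the question
df_altered = pd.DataFrame({
    'Temperature':    [14.0, 18.0, 22.0, 26.0, 30.0],
    'Growth_rate':    [0.363, 0.677, 0.822, 0.936, 0.897],
    'Standard_Error': [0.110, 0.043, 0.044, 0.073, 0.051],
})

def model(x, a, b, c):
    # placeholder quadratic; stands in for boatman_temperature_function_optimised
    return a + b * x + c * x ** 2

popt, pcov = curve_fit(
    model,
    xdata=df_altered['Temperature'],
    ydata=df_altered['Growth_rate'],
    sigma=df_altered['Standard_Error'],  # smaller standard error -> greater weight
    absolute_sigma=True,                 # treat sigma as absolute 1-sigma errors
)
perr = np.sqrt(np.diag(pcov))            # 1-sigma uncertainties of the fitted parameters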

Different Scikit Learn R^2 results on different computers

In the code below, the 'correct' R^2 value for sigma = 1 is 0.33, which is what I get when I run it on my work computer. However, on my personal computer I get R^2 = 0.119. The 0.33 result has been confirmed across multiple other computers running my exact code.
Only my personal computer produces this strange 0.119 result (even running the 'solution' code produces 0.119). I have tried multiple clean installs of Anaconda to no avail.
The only thing I can think of is that maybe my 'clean' installs aren't clean enough. I have tried a few methods of fully deleting Anaconda and Python; does anyone have a robust method for this?
x_peak
[2688.126327 2692.813829 2697.501331 2702.188833 2706.876334 2711.563836
2716.251338 2720.93884 2725.626341 2730.313843 2735.001345 2739.688846
2744.376348 2749.06385 2753.751352 2758.438853 2763.126355 2767.813857
2772.501359 2777.18886 2781.876362 2786.563864 2791.251366 2795.938867
2800.626369 2805.313871 2810.001373 2814.688874 2819.376376 2824.063878
2828.75138 2833.438881 2838.126383 2842.813885 2847.501387 2852.188888
2856.87639 2861.563892 2866.251394 2870.938895 2875.626397 2880.313899
2885.0014 2889.688902 2894.376404 2899.063906 2903.751407 2908.438909
2913.126411 2917.813913 2922.501414 2927.188916 2931.876418 2936.56392
2941.251421 2945.938923 2950.626425 2955.313927 2960.001428 2964.68893
2969.376432 2974.063934 2978.751435 2983.438937 2988.126439 2992.813941
2997.501442 3002.188944 3006.876446 3011.563948 3016.251449 3020.938951
3025.626453 3030.313954 3035.001456 3039.688958 3044.37646 3049.063961
3053.751463 3058.438965 3063.126467 3067.813968 3072.50147 3077.188972
3081.876474 3086.563975 3091.251477 3095.938979 3100.626481 3105.313982
3110.001484 3114.688986 3119.376488 3124.063989 3128.751491 3133.438993
3138.126495 3142.813996 3147.501498 3152.189 ]
y_peak
[0.01 0.011 0.011 0.012 0.013 0.015 0.017 0.018 0.02 0.021 0.024 0.027
0.029 0.03 0.031 0.033 0.034 0.036 0.037 0.039 0.04 0.043 0.047 0.049
0.052 0.055 0.058 0.062 0.066 0.071 0.077 0.085 0.097 0.111 0.141 0.169
0.183 0.235 0.265 0.324 0.35 0.396 0.421 0.45 0.467 0.486 0.514 0.51
0.464 0.444 0.437 0.432 0.432 0.437 0.442 0.45 0.475 0.501 0.541 0.553
0.594 0.611 0.611 0.607 0.612 0.607 0.521 0.471 0.424 0.331 0.264 0.216
0.161 0.114 0.094 0.054 0.034 0.021 0.014 0.008 0.007 0.005 0.004 0.003
0.003 0.002 0.002 0.002 0.001 0.001 0.001 0.001 0. 0. 0. 0.
0. 0. 0. 0. ]
import numpy as np
import pandas as pd
import pylab as plt
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/ethanol_IR.csv')
x_all = df['wavenumber [cm^-1]'].values
y_all = df['absorbance'].values
x_peak = x_all[475:575]
y_peak = y_all[475:575]
x_train = x_peak[::3]
y_train = y_peak[::3]
sigmas = [1, 10, 50, 100, 150]
def rbf(x_train, x_test=None, gamma=1):
    if x_test is None:
        x_test = x_train
    N = len(x_test)   # <- number of data points
    M = len(x_train)  # <- number of features
    X = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            X[i, j] = np.exp(-gamma*(x_test[i] - x_train[j])**2)
    return X
model_rbf = LinearRegression()  # create a linear regression model instance
n = len(sigmas)
def gam(sigma):
    gam = 1./(2*sigma**2)
    return gam
for i in range(n):
    total = []
    gamma = gam(sigmas[i])
    print('Sigma = {}'.format(sigmas[i]))
    X_train = rbf(x_train, gamma=gamma)
    model_rbf.fit(X_train, y_train)  # fit the model
    r2 = model_rbf.score(X_train, y_train)  # get the "score", which is equivalent to r^2
    print('r^2 training = {}'.format(r2))
    X_all = rbf(x_train, x_test=x_peak, gamma=gamma)
    yhat = model_rbf.predict(X_all)
    r2 = model_rbf.score(X_all, y_peak)  # get the "score", which is equivalent to r^2
    print('r^2 testing = {}'.format(r2))
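One thing worth ruling out before more reinstalls is a plain version mismatch: the numerical output of LinearRegression on these RBF features can shift between numpy/scipy/scikit-learn releases and the BLAS/LAPACK builds they link against. A quick check to run on both machines (my suggestion, not part of the original post):
import sys
import numpy as np
import scipy
import sklearn

print('python ', sys.version.split()[0])
print('numpy  ', np.__version__)
print('scipy  ', scipy.__version__)
print('sklearn', sklearn.__version__)
np.show_config()  # shows which BLAS/LAPACK numpy was built against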

3D Surface Colormap in Python

There's another problem I'm facing now. I have a data file like the following:
# Time THR-1 ALA-2 PRO-3 VAL-4 PRO-5 MET-6 PRO-7 ASP-8 LEU-9 LYS-10 ASN-11 VAL-12 LYS-13 SER-14 LYS-15 ILE-16 GLY-17 SER-18 THR-19 GLU-20 ASN-21 LEU-22 LYS-23 HIS-24 GLN-25 PRO-26 GLY-27 GLY-28 GLY-29 LYS-30 VAL-31 GLN-32 ILE-33 ILE-34 ASN-35 LYS-36 LYS-37 LEU-38 ASP-39 LEU-40 SER-41 ASN-42 VAL-43 GLN-44 SER-45 LYS-46 CYS-47 GLY-48 SER-49 LYS-50 ASP-51 ASN-52 ILE-53 LYS-54 HIS-55 VAL-56 PRO-57 GLY-58 GLY-59 GLY-60 SER-61 VAL-62 GLN-63 ILE-64 VAL-65 TYR-66 LYS-67 PRO-68 VAL-69 ASP-70 LEU-71 SER-72 LYS-73 VAL-74 THR-75 SER-76 LYS-77 CYS-78 GLY-79 SER-80 LEU-81 GLY-82 ASN-83 ILE-84 HIS-85 HIS-86 LYS-87 PRO-88 GLY-89 GLY-90 GLY-91 GLN-92 VAL-93 GLU-94 VAL-95 LYS-96 SER-97 GLU-98 LYS-99 LEU-100 ASP-101 PHE-102 LYS-103 ASP-104 ARG-105 VAL-106 GLN-107 SER-108 LYS-109 ILE-110 GLY-111 SER-112 LEU-113 ASP-114 ASN-115 ILE-116 THR-117 HIS-118 VAL-119 PRO-120 GLY-121 GLY-122 GLY-123 ASN-124 DA-1 DA-2 DA-3 DC-4 DA-5 DT-6 DG-7 DT-8 DT-9 DA-10 DA-11 DA-12 DC-13 DA-14 DT-15 DG-16 DT-17 DT-18 DT-19 DA-1 DA-2 DA-3 DC-4 DA-5 DT-6 DG-7 DT-8 DT-9 DT-10 DA-11 DA-12 DC-13 DA-14 DT-15 DG-16 DT-17 DT-18 DT-19
0.000 84.841 0.274 8.595 -4.939 1.713 -1.704 0.768 -127.825 5.554 108.207 5.297 8.390 212.124 2.830 39.479 8.168 0.458 8.848 6.897 -83.882 29.016 9.647 308.856 6.400 32.481 11.327 10.372 0.247 -3.669 45.391 7.648 -6.990 16.870 11.946 18.778 29.161 127.841 -1.885 -49.943 4.716 6.552 16.029 4.803 7.307 5.423 35.449 -1.362 0.703 0.817 5.544 -14.168 -2.450 0.138 10.984 2.680 -0.238 -0.204 -1.814 -0.273 0.971 -0.256 2.553 -1.172 0.337 0.659 -3.890 8.570 1.180 2.319 -10.711 0.433 0.320 7.904 -0.021 1.672 -0.895 -1.804 -0.317 0.233 0.013 1.462 -1.310 -3.139 -1.453 -4.536 0.559 59.050 -10.891 3.089 5.579 9.818 6.599 -1.635 -34.622 2.576 14.145 9.062 -82.518 51.319 -5.944 -42.734 -0.065 5.200 -18.819 -1.670 0.354 -0.142 -0.938 -4.108 -0.582 -0.511 -0.452 0.763 -21.291 2.587 -5.088 -0.458 5.958 -0.746 -0.587 0.600 6.134 9.432 -47.476 0.517 -0.958 -1.246 0.005 -1.422 -5.105 -2.815 -6.459 -1.618 56.055 117.408 92.845 60.554 -6.065 -9.293 -3.752 -5.407 -1.491 -4.924 -0.944 13.894 32.688 15.937 2.866 -0.934 25.169 1.291 -5.292 -8.727 5.852 -8.092 -40.334 -18.542 0.468 -6.011 -2.043 -1.305 -0.959
10.000 127.315 0.993 15.230 12.627 0.804 0.642 -2.810 -101.634 5.500 114.097 3.368 9.100 162.819 -10.033 39.935 6.920 9.887 9.732 4.997 -79.368 25.134 -5.714 307.359 5.781 34.996 8.885 7.234 -5.875 -0.094 31.674 3.963 -8.064 14.720 12.726 25.431 25.011 108.108 -0.293 -63.815 4.442 1.071 12.768 2.871 1.451 2.179 30.666 -2.066 0.995 1.496 3.384 -1.398 -0.776 -0.101 5.159 1.092 -0.829 -0.205 -0.125 1.054 0.574 -0.291 1.106 0.875 -1.106 -1.955 1.153 4.273 0.628 1.305 -5.547 0.755 0.126 3.704 0.925 0.074 -0.516 3.643 -0.133 -0.064 0.717 0.547 0.197 -0.408 -0.912 -1.296 0.508 35.027 -3.056 10.216 5.885 8.755 -0.792 -1.442 -28.498 2.122 6.803 1.344 -58.583 47.395 -2.332 -32.863 -2.826 5.311 -23.087 6.478 -0.205 0.288 -0.373 4.358 0.362 -1.010 -0.352 2.271 -13.406 -2.747 -4.616 -2.275 3.943 -4.391 -7.063 -0.599 3.081 12.778 -40.043 0.327 -1.940 -2.012 2.592 2.909 1.041 0.658 -0.868 -3.206 16.355 109.843 107.372 63.801 8.499 0.931 2.639 -0.884 0.214 1.880 -2.379 8.408 12.583 10.883 23.083 7.955 31.277 0.539 3.992 -0.887 12.925 -4.248 -31.420 -4.812 1.125 3.287 -0.532 -0.438 0.291
20.000 84.636 5.538 15.954 10.437 0.439 1.773 -1.913 -96.625 5.704 132.598 -0.572 6.877 174.628 -9.400 32.417 -0.264 3.812 6.175 5.056 -62.617 25.479 -1.171 288.031 8.114 37.636 10.461 4.612 -3.521 -0.335 37.957 6.596 -11.250 12.510 11.557 21.128 37.344 135.293 -2.163 -80.896 0.912 1.963 1.101 2.815 6.051 5.374 28.443 0.905 1.734 0.813 5.060 -1.365 1.653 -0.415 4.862 1.758 -0.572 -0.339 0.423 0.759 1.036 -0.543 0.783 0.102 -0.971 -1.529 -1.595 5.519 0.587 1.306 -2.813 0.605 0.761 4.542 0.698 0.767 -0.050 2.201 -0.084 0.563 0.357 0.422 0.642 0.588 -1.426 -1.375 1.455 31.332 -3.390 16.696 15.616 13.449 0.096 -2.711 -24.804 1.969 4.095 2.078 -58.303 47.776 -1.047 -22.013 -2.270 4.204 -11.059 3.952 0.382 -0.863 0.010 3.473 0.375 -1.301 -0.037 1.396 -14.392 -2.887 -5.915 -2.315 5.888 -3.365 -5.950 -2.439 4.814 7.125 -46.399 4.393 5.939 -0.508 2.461 2.562 -0.717 4.225 3.642 4.664 27.859 104.835 114.077 74.730 8.410 1.862 0.061 -1.288 -1.181 2.106 4.346 9.017 29.050 -5.088 14.618 4.149 5.062 1.369 15.083 9.537 18.306 -1.165 -8.966 3.864 3.523 7.232 4.275 1.888 4.708
30.000 91.953 11.008 15.794 12.043 0.596 4.611 1.048 -70.764 7.475 72.100 1.360 6.891 150.455 -7.180 11.932 4.845 9.519 6.184 4.684 -57.283 24.797 0.393 275.626 14.021 22.233 10.877 0.934 -7.551 -2.439 27.929 5.098 -6.797 12.784 12.140 19.698 25.762 108.882 0.267 -54.801 1.470 2.139 1.302 1.996 2.021 3.090 22.690 0.669 1.347 0.113 5.378 -1.570 0.585 -0.143 1.156 -0.050 -1.086 0.148 -0.017 -0.417 -0.201 -1.304 0.808 -0.950 -0.958 -1.741 0.200 2.846 0.633 1.279 -3.693 0.338 -1.058 3.651 0.009 0.202 -1.009 0.037 -0.245 -0.183 -0.615 0.192 -0.386 0.426 -1.800 -2.009 0.496 33.517 -4.213 15.421 16.942 14.559 0.109 -2.553 -25.113 1.199 2.074 -0.265 -56.399 40.657 -0.746 -24.020 -1.986 3.400 -9.631 1.384 0.502 -1.001 0.547 2.622 -0.201 -1.062 -0.916 0.493 -14.621 -2.660 -4.459 -1.066 3.788 -4.289 -7.086 2.460 5.341 8.759 -39.474 -0.051 2.116 0.498 1.267 0.728 1.071 1.155 0.824 3.214 32.413 124.028 144.011 80.795 11.199 5.365 1.969 0.659 2.780 2.311 1.671 14.244 33.170 -6.859 -6.106 13.690 4.742 0.645 17.301 12.245 15.829 -11.976 -22.289 3.100 1.725 5.538 5.041 3.517 -0.205
40.000 149.956 11.453 22.603 13.125 1.909 5.563 1.533 -90.126 5.479 90.590 4.141 6.652 173.681 -3.703 24.551 3.012 10.247 12.607 7.241 -64.707 21.636 -0.285 276.445 6.223 29.727 8.346 5.092 -5.591 -2.969 27.881 3.581 -6.824 13.884 11.709 21.034 25.732 104.610 -0.237 -54.221 1.960 1.674 2.394 1.727 6.499 3.453 25.335 0.636 0.754 -0.591 5.789 -3.344 1.182 -0.366 0.810 0.901 -0.625 -0.997 -0.241 0.214 0.311 -0.312 0.498 -1.336 -0.911 -1.210 -2.459 3.182 0.599 0.713 -4.273 0.326 0.522 3.207 0.312 0.830 -0.558 1.351 -0.017 0.569 -0.367 0.966 -0.637 -2.392 -2.722 -3.405 0.818 39.708 -2.537 16.297 14.229 10.427 0.837 -1.855 -24.033 0.996 5.579 -1.055 -65.068 48.891 -2.411 -21.785 -2.094 1.285 -3.668 1.264 0.463 0.070 -0.034 2.779 0.115 -0.947 1.107 0.337 -16.009 -3.881 -5.203 -1.503 0.358 -4.410 -8.007 -1.383 10.872 17.390 -47.147 1.140 -2.218 -0.597 -0.312 0.685 1.781 5.662 1.917 1.504 32.806 123.230 132.991 68.245 11.523 3.048 0.389 -0.890 0.170 2.100 1.166 11.693 31.756 2.595 19.844 24.565 30.414 11.828 18.563 22.426 20.596 -13.383 -18.574 -2.142 4.737 1.680 0.071 3.983 -0.001
From this file, I'm trying to plot a 3D colormap with time on the x-axis, the column number on the y-axis and the corresponding values on the z-axis.
I have written the following code to extract data from the file and to plot it:
#!/usr/bin/python
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np
data = np.loadtxt('contrib_pol.dat', skiprows=1)
x = data[:,0]
y = range(1,len(data[0,:]))
z = []
fig=plt.figure()
ax=fig.gca(projection='3d')
for r, row in enumerate(data):
    for c, col in enumerate(row[1:], start=1):
        z.append(col)
surf = ax.plot_surface(x, y, z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
And I'm having the errors:
Traceback (most recent call last):
File "./barplot.py", line 31, in <module>
linewidth=0, antialiased=False)
File "/usr/lib/python2.7/dist-packages/mpl_toolkits/mplot3d/axes3d.py", line 1586, in plot_surface
X, Y, Z = np.broadcast_arrays(X, Y, Z)
File "/home/microbio/.local/lib/python2.7/site-packages/numpy/lib/stride_tricks.py", line 250, in broadcast_arrays
shape = _broadcast_shape(*args)
File "/home/microbio/.local/lib/python2.7/site-packages/numpy/lib/stride_tricks.py", line 185, in _broadcast_shape
b = np.broadcast(*args[:32])
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Can you help???
Check the docstring of plot_surface: it states that you need to supply the data as 2D arrays. With two additional lines of code you can make it work, using numpy.meshgrid to build the base grid and numpy.reshape to bring your z values into the matching shape.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np
data = np.loadtxt('contrib_pol.dat', skiprows=1)
x = data[:,0]
y = range(1,len(data[0,:]))
z = []
fig=plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') was removed in newer matplotlib
for r, row in enumerate(data):
    for c, col in enumerate(row[1:], start=1):
        z.append(col)
# generate the grid; xx and yy have shape (len(y), len(x))
xx, yy = np.meshgrid(x, y)
# reshape z to match the grid: z was filled one time step at a time, so reshape
# to (len(x), len(y)) first and then transpose
zz = np.reshape(z, (len(x), len(y))).T
surf = ax.plot_surface(xx, yy, zz, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
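As a side note (a small simplification on my part, not from the original answer): the two loops just collect every value except the time column, row by row, so zz can also be built directly from the data block:
zz = data[:, 1:].T  # shape (len(y), len(x)), matching xx and yy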

How can i plot this in matplotlib with 3 dimensions?

My data looks like this:
Rate 4 8 16 32 64 128 256
David 0.25 0.176 0.652 0.126 0.123 0.142 0.318
Saul 0.132 0.244 0.142 0.162 0.174 0.244 0.149
Maria 0.145 0.189 0.65 0.42 0.111 0.197 0.182
I need to plot this data in 3D. For clarity: I have three dimensions. The first is the names (David etc.), the second is the rate, and the third is a value for every combination of name and rate.
Which kind of plot should I use? I am a little confused because the rates and values are numeric but the names are not, and when I use a regular 3D plot like scatter it says xs, ys and zs should have the same dimension.
This is my code:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig=plt.figure()
ax=fig.add_subplot(111, projection='3d')
X=[1,2,3]
Y=[4,5,6]
Z=[7,8,9]
ax.scatter(X,Y,Z, c='r', marker='o')
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_zlabel('z axis')
plt.show()
If I use X, Y and Z for the names, rates and values, an error is raised saying that xs, ys and zs must have the same dimension!
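One way to work around the dimension error, as a minimal sketch (assuming the goal is one marker per (name, rate) combination): map the names and rates to integer grid positions with numpy.meshgrid so that xs, ys and zs all have the same length, then relabel the ticks with the original names and rates.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection

names = ['David', 'Saul', 'Maria']
rates = [4, 8, 16, 32, 64, 128, 256]
values = np.array([[0.25, 0.176, 0.652, 0.126, 0.123, 0.142, 0.318],
                   [0.132, 0.244, 0.142, 0.162, 0.174, 0.244, 0.149],
                   [0.145, 0.189, 0.65, 0.42, 0.111, 0.197, 0.182]])

# integer grid positions; X, Y and values all have shape (len(names), len(rates))
X, Y = np.meshgrid(np.arange(len(rates)), np.arange(len(names)))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.ravel(), Y.ravel(), values.ravel(), c='r', marker='o')
ax.set_xticks(np.arange(len(rates)))
ax.set_xticklabels(rates)
ax.set_yticks(np.arange(len(names)))
ax.set_yticklabels(names)
ax.set_xlabel('Rate')
ax.set_ylabel('Name')
ax.set_zlabel('Value')
plt.show()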

How can I plot a correlation matrix as a set of ellipses, similar to the R open-air package?

The figure below is plotted using the open-air R package:
I know matplotlib has the plt.matshow function, but it can't show the relations between the variables as clearly as the ellipse plot does.
Here is my early work:
df is a pandas dataframe with 7 variables, shown below:
I don't know how to attach a .csv file to StackOverflow.
Using plt.matshow(df.corr(), cmap=plt.cm.Greens), the figure looks like this:
The second figure doesn't represent the correlations between the variables as clearly as the first one.
Edit:
I upload the csv file to Google docs here.
I'm not aware of any existing Python library that does these "ellipse plots", but it's not particularly hard to implement using a matplotlib.collections.EllipseCollection:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.collections import EllipseCollection
def plot_corr_ellipses(data, ax=None, **kwargs):
    M = np.array(data)
    if not M.ndim == 2:
        raise ValueError('data must be a 2D array')
    if ax is None:
        fig, ax = plt.subplots(1, 1, subplot_kw={'aspect': 'equal'})
        ax.set_xlim(-0.5, M.shape[1] - 0.5)
        ax.set_ylim(-0.5, M.shape[0] - 0.5)
    # xy locations of each ellipse center
    xy = np.indices(M.shape)[::-1].reshape(2, -1).T
    # set the relative sizes of the major/minor axes according to the strength of
    # the positive/negative correlation
    w = np.ones_like(M).ravel()
    h = 1 - np.abs(M).ravel()
    a = 45 * np.sign(M).ravel()
    ec = EllipseCollection(widths=w, heights=h, angles=a, units='x', offsets=xy,
                           transOffset=ax.transData, array=M.ravel(), **kwargs)
    ax.add_collection(ec)
    # if data is a DataFrame, use the row/column names as tick labels
    if isinstance(data, pd.DataFrame):
        ax.set_xticks(np.arange(M.shape[1]))
        ax.set_xticklabels(data.columns, rotation=90)
        ax.set_yticks(np.arange(M.shape[0]))
        ax.set_yticklabels(data.index)
    return ec
For example, using your data:
data = df.corr()
fig, ax = plt.subplots(1, 1)
m = plot_corr_ellipses(data, ax=ax, cmap='Greens')
cb = fig.colorbar(m)
cb.set_label('Correlation coefficient')
ax.margins(0.1)
Negative correlations can be plotted as ellipses with the opposite orientation:
fig2, ax2 = plt.subplots(1, 1)
data2 = np.linspace(-1, 1, 9).reshape(3, 3)
m2 = plot_corr_ellipses(data2, ax=ax2, cmap='seismic', clim=[-1, 1])
cb2 = fig2.colorbar(m2)
ax2.margins(0.3)
Assuming you are interested in showing cluster relations, the seaborn package mentioned in the comments also has a clustermap. Using your correlation matrix (it looks like you want to show the correlation coefficients as integers in the [-100, 100] range), you could do the following:
corr = df.corr().mul(100).astype(int)
GX HG RM SJ XB XN ZG
GX 100 77 62 71 48 66 57
HG 77 100 69 74 61 61 58
RM 62 69 100 75 48 64 68
SJ 71 74 75 100 50 70 65
XB 48 61 48 50 100 46 51
XN 66 61 64 70 46 100 75
ZG 57 58 68 65 51 75 100
and then use seaborn.clustermap() as follows:
import seaborn as sns
sns.clustermap(data=corr, annot=True, fmt='d', cmap='Greens').savefig('cluster.png')
I just discovered this Python package biokit today. It provides a very handy function to create various kinds of correlation charts. For example:
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
...: from biokit.viz import corrplot
In [6]: corr
Out[6]:
GX HG RM SJ XB XN ZG
GX 1.00 -0.77 0.62 0.71 0.48 0.66 0.57
HG -0.77 1.00 0.69 0.74 0.61 0.61 0.58
RM 0.62 0.69 1.00 0.75 0.48 0.64 0.68
SJ 0.71 0.74 0.75 1.00 0.50 0.70 0.65
XB 0.48 0.61 0.48 0.50 1.00 -0.46 0.51
XN 0.66 0.61 0.64 0.70 -0.46 1.00 0.75
ZG 0.57 0.58 0.68 0.65 0.51 0.75 1.00
I took Stefan's data and modified it a little bit. Let's assume this is a correlation matrix. Now to create a correlation chart, you can simply do this:
In [7]: c = corrplot.Corrplot(corr)
...: c.plot()
Correlation chart with ellipses
You can read more examples here.
