Plot existing covariance dataframe - python

I have computed a covariance matrix for 26 inputs in another software package and have the results as an existing table. See image below:
What I want to do is load the table into a pandas dataframe and plot the matrix. I have seen the thread here: Plot correlation matrix using pandas. However, that example computes the covariance first and then plots the resulting 'covariance' object. In my case, I want to plot the dataframe itself so that it looks like the covariance matrix in that example.
Link to data: HERE.

IIUC, you can use seaborn.heatmap with annot=True:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.figure(figsize=(6, 4))
(
    pd.read_excel("/tmp/Covariance Matrix.xlsx", header=None)
    # plot a random sample of 10 rows / 10 columns so the annotations stay readable
    .pipe(lambda df: sns.heatmap(df.sample(10).sample(10, axis=1), annot=True, fmt=".1f"))
);
Output:
And, as suggested by stukituk in the comments, you can add cmap="coolwarm" for colors:
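If the table is easier to build in code than to read from Excel, the same plot works on any dataframe that already holds the matrix. A minimal sketch, assuming a small made-up symmetric matrix stands in for the real 26 x 26 table (the input names and values are invented for illustration). Passing center=0 pins the middle of the coolwarm map to zero, so negative and positive covariances get different hues:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the exported table: a small symmetric matrix.
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 5))
names = [f"x{i}" for i in range(1, 6)]  # invented input names
cov_table = pd.DataFrame(np.cov(raw, rowvar=False), index=names, columns=names)

fig, ax = plt.subplots(figsize=(6, 4))
# The dataframe is plotted as-is; seaborn does not recompute anything.
sns.heatmap(cov_table, annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=ax)
```

Call plt.show() or fig.savefig(...) afterwards to view the figure.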

A clean option, in my opinion, comes from this other answer: How to plot only the lower triangle of a seaborn heatmap?
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline  (only needed inside a Jupyter notebook)
df = pd.read_excel('Covariance Matrix.xlsx', header=None)
# Boolean mask for the upper triangle (diagonal included) of the covariance matrix;
# unlike np.triu(df), it also hides upper-triangle cells whose value is exactly 0
matrix = np.triu(np.ones_like(df, dtype=bool))
# cells where the mask is True are hidden, so only the lower triangle is drawn
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(df, ax=ax, fmt='.1g', annot=True, mask=matrix)
plt.show()
Hope this helps.
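A detail worth checking when building the mask: seaborn hides every cell whose mask entry is truthy, so a mask taken from the matrix values themselves leaves any upper-triangle cell that is exactly 0 visible. A minimal sketch on a made-up 3 x 3 matrix (the values are invented), comparing a value-based mask with explicit boolean ones:

```python
import numpy as np
import pandas as pd

# Invented symmetric matrix with a zero covariance between inputs 0 and 2.
cov = pd.DataFrame([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 4.0]])

# Mask taken from the values: the zero at (0, 2) is falsy, so it is NOT hidden.
value_mask = np.triu(cov).astype(bool)

# Explicit boolean mask: every upper-triangle cell (diagonal included) is hidden.
bool_mask = np.triu(np.ones_like(cov, dtype=bool))

# k=1 keeps the diagonal visible and hides only the cells above it.
strict_mask = np.triu(np.ones_like(cov, dtype=bool), k=1)

print(value_mask[0, 2], bool_mask[0, 2], strict_mask[1, 1])
```

Either boolean mask can be passed as mask= to sns.heatmap.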

Related

Creating a 2 colour heatmap with Python

I have numerous sets of seasonal data that I am looking to show in a heatmap format. I am not worried about the magnitude of the values in the dataset but more about the overall direction and any patterns that I can look at in more detail later. To do this I want to create a heatmap that only shows 2 colours (red for below zero and green for zero and above).
I can create a normal heatmap with seaborn, but the standard colour maps do not have only 2 colours and I have not been able to create one myself. Even if I could, I do not know how to set the parameters to reflect the criteria of below zero = red and zero or above = green.
I managed to create this simply by styling the dataframe, but I was unable to export it as a .png because the table_conversion='matplotlib' option removes the formatting.
Below is an example of what I would like to create, made from random data. Could someone help or point me in the direction of a helpful Stack Overflow answer?
I have also included the code I used to style and export the dataframe.
Desired output - this is created with random data in an Excel spreadsheet
#Code to create a regular heatmap - can this be easily amended?
df_hm = pd.read_csv(filename+h)
pivot = df_hm.pivot_table(index='Year', columns='Month', values='delta', aggfunc='sum')
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title('M1 '+h[:-7])
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='RdYlGn')
plt.savefig(chartpath+h[:-7]+" M1.png", bbox_inches='tight')
plt.close()
# Code used to export the dataframe, which loses its formatting in the .png
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dataframe_image as dfi
# pivot is the dataframe name
pivot = pd.DataFrame(np.random.randint(-100, 100, size=(5, 12)), columns=list('ABCDEFGHIJKL'))
styles = [dict(selector="caption", props=[("font-size", "120%"),("font-weight", "bold")])]
pivot = pivot.style.format(precision=2).highlight_between(left=-100000, right=-0.01, props='color:white;background-color:red').highlight_between(left=0, right= 100000, props='color:white;background-color:green').set_caption(title).set_table_styles(styles)
dfi.export(pivot, root+'testhm.png', table_conversion='matplotlib',chrome_path=None)
You can set the cmap parameter to a list of colors yourself. Annotations still work and show the original values, because the data itself is not converted to -1 or 1.
import numpy as np
import seaborn as sns
arr = np.random.randn(10, 10)
# center=0 makes zero the dividing point between the two colors
sns.heatmap(arr, cmap=["grey", "green"], annot=True, center=0)
Output:
PS. If you don't want the color bar, you can pass cbar=False to sns.heatmap().
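To get exactly the red/green split from the question while keeping the real numbers in the cells, one option is to color by a 0/1 indicator and annotate with the original values. A minimal sketch, assuming random data in place of the pivot table (the column letters and the seed are arbitrary); zero counts as green here:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in for the Year x Month pivot table.
rng = np.random.default_rng(1)
pivot = pd.DataFrame(rng.integers(-100, 100, size=(5, 12)),
                     columns=list("ABCDEFGHIJKL"))

# 0 for values below zero, 1 for zero and above: this drives the color only.
indicator = (pivot >= 0).astype(int)

fig, ax = plt.subplots(figsize=(10, 5))
# annot=pivot prints the original numbers on top of the two-color cells.
sns.heatmap(indicator, cmap=["red", "green"], vmin=0, vmax=1,
            annot=pivot, fmt="d", cbar=False, linewidths=0.5, ax=ax)
```

Because this is an ordinary matplotlib figure, fig.savefig("name.png", bbox_inches="tight") should write the PNG with the colors intact.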
Welcome to SO!
To achieve what you need, you just need to pass delta through the sign function, here's an example code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
arr = np.random.randn(25,25)
sns.heatmap(np.sign(arr))
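This results in a binary heatmap, albeit one with a rather ugly colormap. You can still experiment with seaborn's colormaps to make it look like the Excel example. Note that np.sign returns 0 for values that are exactly zero, so such cells get a third, intermediate value.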

Finding the correlation between variables using python

I am trying to find the correlation of all the columns in this dataset excluding quality and then plot the frequency distribution of wine quality.
I am doing it the following way, but how do I remove quality?
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
df.corr()
It returns this output:
How can I graph the frequency distribution of wine quality with pandas?
I previously used R for the correlation and it worked fine, but with this dataset I am learning to use pandas and Python:
winecor = cor(wine[-12])
hist(wine$quality)
So in R I get the following output, and I am looking for the same in Python.
1. Histogram
# Import plotting library
import matplotlib.pyplot as plt
### Option 1 - histogram
# quality is an integer score from 3 to 9; edges 3, 4, ..., 10 give one bin per score
plt.hist(df['quality'], bins=range(3, 11))
plt.show()
### Option 2 - bar plot (looks nicer)
# Get frequency per quality group
x = df.groupby('quality').size()
# Plot
plt.bar(x.index, x.values)
plt.show()
2. Correlation matrix
In order to get the correlation matrix of features, excluding quality:
# Option 1 - very similar to R
df.iloc[:, :-1].corr()
# Option 2 - more Pythonic
df.drop('quality', axis=1).corr()
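Both parts can also be done with pandas alone, since a Series can count and plot its own values. A minimal sketch, assuming a tiny invented dataframe in place of the wine CSV (only the quality column name matches the real data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd

# Invented rows standing in for winequality-white.csv.
df = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1, 9.9, 12.0, 11.2],
    "pH":      [3.00, 3.30, 3.26, 3.19, 3.10, 3.40],
    "quality": [6, 6, 5, 7, 8, 5],
})

# Correlation of every column except quality (the R call was cor(wine[-12])).
features_corr = df.drop(columns="quality").corr()

# Frequency of each quality score, ordered by score rather than by count.
quality_counts = df["quality"].value_counts().sort_index()
ax = quality_counts.plot(kind="bar")
ax.set_xlabel("quality")
ax.set_ylabel("count")
```

Sorting by index keeps the bars in score order, which is closer to what hist(wine$quality) shows in R.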
You can plot histograms with:
import matplotlib.pyplot as plt
plt.hist(x=df['quality'], bins=30)
plt.show()
Read the documentation for plt.hist() to get a better understanding of all its parameters.

Creating whisker plots from grouped pandas Series

I have a dataset of values arriving at 5-minute timestamped intervals that I'm visualising grouped by hour of day, like this:
I want to turn this into a whisker/box plot for the added information. However, the implementations of matplotlib, seaborn and pandas of this plot all want an array of raw data to compute the plot's contents themselves.
Is there a way to create whisker plots from pre-computed/grouped mean, median, std and quartiles? I would like to avoid reinventing the wheel with a comparatively inefficient grouping algorithm to build per-day datasets just for this.
This is some code to produce toy data and a version of the current plot.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# some toy data in a 15-day range
data = [1.5+np.sin(x)*5 for x in np.arange(0, 403.3, .1)]
s = pd.Series(data=data, index=pd.date_range('2019-01-01', '2019-01-15', freq='5min'))
s.groupby(s.index.hour).mean().plot(kind='bar')
plt.show()
Adding to @Quang Hoang's solution: you can use hlines() to display the median as well:
# axis is the matplotlib Axes and wd is the bar width (both defined elsewhere)
axis.bar(data.index, data['q75'] - data['q25'], bottom=data['q25'], width=wd)
axis.hlines(y=data['median'], xmin=data.index-wd/2, xmax=data.index+wd/2, color='black', linewidth=1)
I don't think there is anything for that. But you can create a whisker plot fairly simply with two plot commands:
# precomputed data:
data = (s.groupby(s.index.hour)
         .agg(['mean', 'std', 'median',
               lambda x: x.quantile(.25),
               lambda x: x.quantile(.75)])
       )
data.columns = ['mean', 'std', 'median', 'q25', 'q75']
# plot the whiskers with `errorbar` from `mean` and `std`
fig, ax = plt.subplots(figsize=(12, 6))
ax.errorbar(data.index, data['mean'],
            yerr=data['std'] * 1.96,
            linestyle='none',
            capsize=5
            )
# plot the boxes with `bar` at bottoms from quantiles
ax.bar(data.index, data['q75'] - data['q25'], bottom=data['q25'])
Output:
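For completeness, matplotlib can also draw a standard box plot directly from summary statistics through Axes.bxp, which takes one dict per box instead of raw samples. A minimal sketch on toy data similar to the question's; taking the hourly minimum and maximum as whisker ends is an assumption made for illustration, so substitute whatever whisker definition your precomputed statistics follow:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data in a 15-day range at 5-minute resolution.
idx = pd.date_range("2019-01-01", "2019-01-15", freq="5min")
s = pd.Series(1.5 + 5 * np.sin(0.1 * np.arange(len(idx))), index=idx)

# Precomputed per-hour statistics; nothing below touches the raw samples again.
grouped = s.groupby(s.index.hour)
stats = pd.DataFrame({
    "med": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
    "whislo": grouped.min(),   # assumption: whiskers span min..max
    "whishi": grouped.max(),
    "mean": grouped.mean(),
})

# Axes.bxp expects a list of dicts with the keys med, q1, q3, whislo, whishi.
bxp_stats = [
    {**row, "label": str(hour), "fliers": []}
    for hour, row in stats.to_dict("index").items()
]

fig, ax = plt.subplots(figsize=(12, 6))
artists = ax.bxp(bxp_stats, showmeans=True, showfliers=False)
ax.set_xlabel("hour of day")
```

The returned dict holds the drawn artists (boxes, medians, whiskers and so on), so they can be restyled afterwards.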

Pandas DataFrame.hist Seaborn equivalent

When exploring a dataset I often use Pandas' DataFrame.hist() method to quickly display a grid of histograms for every numeric column in the dataframe, for example:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.hist(bins=50, figsize=(10,7))
plt.show()
Which produces a figure with separate plots for each column:
I've tried the following:
import pandas as pd
import seaborn as sns
from sklearn import datasets
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
for col_id in df.columns:
    sns.distplot(df[col_id])
But this produces a figure with a single plot and all columns overlaid:
Is there a way to produce a grid of histograms showing the data from a DataFrame's columns with Seaborn?
You can take advantage of seaborn's FacetGrid if you reorganize your dataframe using melt. Seaborn typically expects data organized this way (long format).
g = sns.FacetGrid(df.melt(), col='variable', col_wrap=2)
g.map(plt.hist, 'value')
There is no direct equivalent, as seaborn's distplot only accepts a 1-D array or list. You can generate the subplots yourself instead:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        sns.distplot(df[df.columns[i*2+j]], ax=ax[i][j])
https://seaborn.pydata.org/examples/distplot_options.html
Here is an example of how you can show 4 graphs using subplots with seaborn.
Another useful seaborn method for quickly displaying a grid of histograms for every numeric column in the dataframe is the quick, clean and handy sns.pairplot().
Try:
sns.pairplot(df)
This has a lot of useful parameters you can explore, such as hue.
pairplot example for iris dataset
If you DON'T want the scatter plots, you can create a customised grid very quickly using sns.PairGrid(df).
This creates an empty grid with all the axes, and you can map whatever you want onto them: g = sns.PairGrid(df)
`g.map_diag(sns.distplot)` or `g.map_offdiag(plt.scatter)`
etc.
I ended up adapting jcaliz's answer to make it work more generally, i.e. not just when the DataFrame has four columns. I also added code to remove any unused axes and to ensure the axes appear in alphabetical order (as with df.hist()).
import math
size = int(math.ceil(len(df.columns)**0.5))
# squeeze=False keeps ax two-dimensional even for a single column
fig, ax = plt.subplots(size, size, figsize=(10, 10), squeeze=False)
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        data_index = i*ax.shape[1]+j
        if data_index < len(df.columns):
            sns.distplot(df[df.columns.sort_values()[data_index]], ax=ax[i][j])
# remove the axes left over when the grid has more cells than columns
for i in range(len(df.columns), size ** 2):
    fig.delaxes(ax[i // size][i % size])

Heatmap correlation plot half with values number and half color map in seaborn

In previous versions of seaborn (<0.7) there was a corrplot() function, which could plot a correlation matrix so that half of the matrix is numeric and the other half is a color map. Now seaborn (0.7.1) only has the heatmap() function, which does not offer this directly. Is there a way to obtain the same result?
I spent some time on this. Basically it requires overlaying two heatmaps, where each uses a mask to cover half of the matrix. A code example is shown below.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
arr_name = ['D','S','P','E','C','KW','K','EF']
data = np.random.randn(8,8)
df = pd.DataFrame(data, columns=arr_name)
# np.bool was removed in NumPy 1.24; the builtin bool works in all versions
mask = np.triu(np.ones(df.shape)).astype(bool)
# numeric labels for the upper triangle (diagonal included), blank elsewhere
labels = df.where(mask)
labels = labels.round(2)
labels = labels.replace(np.nan,' ', regex=True)
# first heatmap: colors in the lower triangle (upper triangle masked)
ax = seaborn.heatmap(df, mask=mask, cmap='RdYlGn_r', fmt='', square=True, linewidths=1.5)
# second heatmap: white cells with numbers in the upper triangle (lower triangle masked)
ax = seaborn.heatmap(df, mask=~mask, cmap=ListedColormap(['white']),annot=labels,cbar=False, fmt='', linewidths=1.5)
ax.set_xticks([])
ax.set_yticks([])
plt.show()
The final result is the following:
