Is there a function to make scatterplot matrices in matplotlib?

Example of scatterplot matrix
Is there such a function in matplotlib.pyplot?

For those who do not want to define their own functions, there is a great data analysis library in Python called pandas, which provides the scatter_matrix() function:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(1000, 4), columns = ['a', 'b', 'c', 'd'])
scatter_matrix(df, alpha = 0.2, figsize = (6, 6), diagonal = 'kde')

Generally speaking, matplotlib doesn't usually contain plotting functions that operate on more than one axes object (subplot, in this case). The expectation is that you'd write a simple function to string things together however you'd like.
I'm not quite sure what your data looks like, but it's quite simple to just build a function to do this from scratch. If you're always going to be working with structured or rec arrays, then you can simplify this a touch. (i.e. There's always a name associated with each data series, so you can omit having to specify names.)
As an example:
import itertools
import numpy as np
import matplotlib.pyplot as plt
def main():
    np.random.seed(1977)
    numvars, numdata = 4, 10
    data = 10 * np.random.random((numvars, numdata))
    fig = scatterplot_matrix(data, ['mpg', 'disp', 'drat', 'wt'],
            linestyle='none', marker='o', color='black', mfc='none')
    fig.suptitle('Simple Scatterplot Matrix')
    plt.show()
def scatterplot_matrix(data, names, **kwargs):
    """Plots a scatterplot matrix of subplots. Each row of "data" is plotted
    against other rows, resulting in a nrows by nrows grid of subplots with the
    diagonal subplots labeled with "names". Additional keyword arguments are
    passed on to matplotlib's "plot" command. Returns the matplotlib figure
    object containing the subplot grid."""
    numvars, numdata = data.shape
    fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(8,8))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    for ax in axes.flat:
        # Hide all ticks and labels
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        # Set up ticks only on one side for the "edge" subplots...
        # (ax.is_first_col() and friends were removed in Matplotlib 3.6; the
        # SubplotSpec methods used here need Matplotlib 3.4 or newer.)
        spec = ax.get_subplotspec()
        if spec.is_first_col():
            ax.yaxis.set_ticks_position('left')
        if spec.is_last_col():
            ax.yaxis.set_ticks_position('right')
        if spec.is_first_row():
            ax.xaxis.set_ticks_position('top')
        if spec.is_last_row():
            ax.xaxis.set_ticks_position('bottom')
    # Plot the data.
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        for x, y in [(i,j), (j,i)]:
            axes[x,y].plot(data[x], data[y], **kwargs)
    # Label the diagonal subplots...
    for i, label in enumerate(names):
        axes[i,i].annotate(label, (0.5, 0.5), xycoords='axes fraction',
                ha='center', va='center')
    # Turn on the proper x or y axes ticks.
    for i, j in zip(range(numvars), itertools.cycle((-1, 0))):
        axes[j,i].xaxis.set_visible(True)
        axes[i,j].yaxis.set_visible(True)
    return fig
main()

You can also use Seaborn's pairplot function:
import seaborn as sns
sns.set()
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

Thanks for sharing your code! You figured out all the hard stuff for us. As I was working with it, I noticed a few little things that didn't look quite right.
[FIX #1] The axis ticks weren't lining up as I would expect. In your example above, you should be able to draw a vertical and a horizontal line through any point across all plots, and the lines should cross the corresponding point in the other plots, but as it stands this doesn't happen.
[FIX #2] If you are plotting an odd number of variables, the bottom-right corner axes doesn't pick up the correct xticks or yticks. It just keeps the default 0..1 ticks.
Not a fix, but I made the names argument optional, so that variable i gets a default label xi in its diagonal position.
Below you'll find an updated version of your code that addresses these two points, otherwise preserving the beauty of your code.
import itertools
import numpy as np
import matplotlib.pyplot as plt
def scatterplot_matrix(data, names=None, **kwargs):
    """
    Plots a scatterplot matrix of subplots. Each row of "data" is plotted
    against other rows, resulting in a nrows by nrows grid of subplots with the
    diagonal subplots labeled with "names". Additional keyword arguments are
    passed on to matplotlib's "plot" command. Returns the matplotlib figure
    object containing the subplot grid.
    """
    numvars, numdata = data.shape
    fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(8,8))
    fig.subplots_adjust(hspace=0.0, wspace=0.0)
    for ax in axes.flat:
        # Hide all ticks and labels
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        # Set up ticks only on one side for the "edge" subplots...
        # (ax.is_first_col() and friends were removed in Matplotlib 3.6; the
        # SubplotSpec methods used here need Matplotlib 3.4 or newer.)
        spec = ax.get_subplotspec()
        if spec.is_first_col():
            ax.yaxis.set_ticks_position('left')
        if spec.is_last_col():
            ax.yaxis.set_ticks_position('right')
        if spec.is_first_row():
            ax.xaxis.set_ticks_position('top')
        if spec.is_last_row():
            ax.xaxis.set_ticks_position('bottom')
    # Plot the data.
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        for x, y in [(i,j), (j,i)]:
            # FIX #1: this needed to be changed from ...(data[x], data[y],...)
            axes[x,y].plot(data[y], data[x], **kwargs)
    # Label the diagonal subplots (default labels x0, x1, ... if none are given)
    if not names:
        names = ['x'+str(i) for i in range(numvars)]
    for i, label in enumerate(names):
        axes[i,i].annotate(label, (0.5, 0.5), xycoords='axes fraction',
                ha='center', va='center')
    # Turn on the proper x or y axes ticks.
    for i, j in zip(range(numvars), itertools.cycle((-1, 0))):
        axes[j,i].xaxis.set_visible(True)
        axes[i,j].yaxis.set_visible(True)
    # FIX #2: if numvars is odd, the bottom right corner plot doesn't have the
    # correct axes limits, so we pull them from other axes
    if numvars%2:
        xlimits = axes[0,-1].get_xlim()
        ylimits = axes[-1,0].get_ylim()
        axes[-1,-1].set_xlim(xlimits)
        axes[-1,-1].set_ylim(ylimits)
    return fig
if __name__=='__main__':
    np.random.seed(1977)
    numvars, numdata = 4, 10
    data = 10 * np.random.random((numvars, numdata))
    fig = scatterplot_matrix(data, ['mpg', 'disp', 'drat', 'wt'],
            linestyle='none', marker='o', color='black', mfc='none')
    fig.suptitle('Simple Scatterplot Matrix')
    plt.show()
Thanks again for sharing this with us. I have used it many times! Oh, and I re-arranged the main() part of the code so that it can be a formal example code or not get called if it is being imported into another piece of code.

While reading the question I expected to see an answer including rpy. I think this is a nice option taking advantage of two beautiful languages. So here it is:
import rpy
import numpy as np
def main():
    np.random.seed(1977)
    numvars, numdata = 4, 10
    data = 10 * np.random.random((numvars, numdata))
    mpg = data[0,:]
    disp = data[1,:]
    drat = data[2,:]
    wt = data[3,:]
    rpy.set_default_mode(rpy.NO_CONVERSION)
    R_data = rpy.r.data_frame(mpg=mpg,disp=disp,drat=drat,wt=wt)
    # Figure saved as eps
    rpy.r.postscript('pairsPlot.eps')
    rpy.r.pairs(R_data,
        main="Simple Scatterplot Matrix Via RPy")
    rpy.r.dev_off()
    # Figure saved as png
    rpy.r.png('pairsPlot.png')
    rpy.r.pairs(R_data,
        main="Simple Scatterplot Matrix Via RPy")
    rpy.r.dev_off()
    rpy.set_default_mode(rpy.BASIC_CONVERSION)
if __name__ == '__main__': main()
I can't post an image to show the result :( sorry!

Related

multiple boxplots, side by side, using matplotlib from a dataframe

I'm trying to plot 60+ boxplots side by side from a dataframe and I was wondering if someone could suggest some possible solutions.
At the moment I have df_new, a dataframe with 66 columns, which I'm using to plot boxplots. The easiest way I found to plot the boxplots was to use the boxplot method in pandas:
boxplot = df_new.boxplot(column=x, figsize = (100,50))
This gives me a very very tiny chart with illegible axis which I cannot seem to change the font size for, so I'm trying to do this natively in matplotlib but I cannot think of an efficient way of doing it. I'm trying to avoid creating 66 separate boxplots using something like:
fig, ax = plt.subplots(nrows = 1,
ncols = 66,
figsize = (10,5),
sharex = True)
ax[0,0].boxplot(#insert parameters here)
I actually do not know how to get the data from df_new.describe() into the boxplot function, so any tips on this would be greatly appreciated! The documentation is confusing, and I am not sure what the x vectors should be.
Ideally I'd like to just give the boxplot function the dataframe and for it to automatically create all the boxplots by working out all the quartiles, column separations etc on the fly - is this even possible?
Thanks!
I tried to replace the boxplot with a ridge plot, which takes up less space because:
it requires half of the width
you can partially overlap the ridges
it develops vertically, so you can scroll down all the plot
I took the code from the seaborn documentation and adapted it a little to get 60 different, normally distributed ridges. Here is the code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# # Create the data
n = 20
x = list(np.random.randn(1, 60)[0])
g = [item[0] + item[1] for item in list(itertools.product(list('ABCDEFGHIJ'), list('123456')))]
df = pd.DataFrame({'x': n*x,
'g': n*g})
# Initialize the FacetGrid object
pal = sns.cubehelix_palette(10, rot=-.25, light=.7)
g = sns.FacetGrid(df, row="g", hue="g", aspect=15, height=.5, palette=pal)
# Draw the densities in a few steps
g.map(sns.kdeplot, "x", clip_on=False, shade=True, alpha=1, lw=1.5, bw=.2)
g.map(sns.kdeplot, "x", clip_on=False, color="w", lw=2, bw=.2)
g.map(plt.axhline, y=0, lw=2, clip_on=False)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label, fontweight="bold", color=color,
            ha="left", va="center", transform=ax.transAxes)
g.map(label, "x")
# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)
plt.show()
This is the result I get:
I don't know if this will suit your needs. In any case, keep in mind that showing so many distributions next to each other will always require a lot of space (and a very big screen).
Maybe you could try dividing the distributions into smaller groups and plotting them a few at a time?
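The idea of splitting the columns into smaller groups can also be applied to the boxplots the question asked for. The following is only a sketch, assuming a purely numeric DataFrame without NaNs. The helper name boxplots_in_chunks, the chunk size of 11 and the random stand-in for df_new are illustrative choices, not something taken from the question:

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def boxplots_in_chunks(df, chunk_size=11, row_height=3.0, width=12.0):
    """Draw one row of boxplots per group of `chunk_size` columns (hypothetical helper)."""
    n_rows = math.ceil(df.shape[1] / chunk_size)
    fig, axes = plt.subplots(nrows=n_rows, ncols=1,
                             figsize=(width, row_height * n_rows),
                             squeeze=False)
    axes = axes[:, 0]
    for row, ax in enumerate(axes):
        chunk = df.iloc[:, row * chunk_size:(row + 1) * chunk_size]
        # A 2D array gives one box per column; matplotlib computes the quartiles.
        ax.boxplot(chunk.to_numpy())
        ax.set_xticklabels(chunk.columns, rotation=90, fontsize=8)
    fig.tight_layout()
    return fig, axes


# Random stand-in for the 66-column df_new from the question.
rng = np.random.default_rng(0)
df_new = pd.DataFrame(rng.normal(size=(200, 66)),
                      columns=[f"col{i}" for i in range(66)])
fig, axes = boxplots_in_chunks(df_new)
```

Each row is an ordinary Axes, so font sizes and label rotation can be adjusted per row before calling plt.show() or fig.savefig(...).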

How can I rotate annotated seaborn heatmap data and legend?

I created a seaborn heatmap to summarize Theil's U coefficients. The data is displayed horizontally in the heatmap. Now, I would like to rotate the data and the legend. I know that you can rotate the x axis and y axis labels in a plot, but how can I rotate the data and the legend?
This is my code:
# create a pandas DataFrame to hold the values
theilu = pd.DataFrame(index=['Y'],columns=matrix.columns)
# store column names in variable columns
columns = matrix.columns
# iterate through each variable
for j in range(0,len(columns)):
    # call the theil_u function on the "zipped" independent and dependent variables -> respectively x & y in the functions section
    u = theil_u(matrix['Y'].tolist(),matrix[columns[j]].tolist())
    # select the respective column needed for output
    theilu.loc[:,columns[j]] = u
# handle NaNs if any
theilu.fillna(value=np.nan,inplace=True)
#plot correlation between fraud reported (y) and all other variables (x)
plt.figure(figsize=(20,1))
sns.heatmap(theilu,annot=True,fmt='.2f')
plt.show()
Here an image of what I am looking for:
Please let me know if you need any sample data or the theil_u function to recreate the problem. Thank you.
The parameters of the annotation can be changed via annot_kws. One of them is the rotation.
Some parameters of the colorbar can be changed via the cbar_kws dict, but unfortunately the orientation of the labels isn't one of them. Therefore, you need a handle to the colorbar's ax. One way is to create an ax beforehand, and pass it to sns.heatmap(..., cbar_ax=ax). An easier way is to get the handle afterwards: cbar = heatmap.collections[0].colorbar.
With this ax handle, you can change more properties of the colorbar, such as the orientation of its labels. Also, their vertical alignment can be changed to get them centered.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = np.random.rand(1, 12)
fig, ax = plt.subplots(figsize=(10,2))
heatmap = sns.heatmap(data, cbar=True, ax=ax,
annot=True, fmt='.2f', annot_kws={'rotation': 90})
cbar = heatmap.collections[0].colorbar
# heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=90)
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=90)
cbar.ax.set_yticklabels(cbar.ax.get_yticklabels(), rotation=90, va='center')
plt.tight_layout()
plt.show()
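For completeness, the first approach mentioned above (creating the colorbar axes beforehand and passing it as cbar_ax) could look roughly like this. It is a sketch only: the 30:1 width ratio and the variable names are arbitrary choices made for the illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.rand(1, 12)
# Two axes side by side: a wide one for the heatmap, a narrow one for the colorbar.
fig, (ax, cbar_ax) = plt.subplots(ncols=2, figsize=(10, 2),
                                  gridspec_kw={"width_ratios": [30, 1]})
sns.heatmap(data, ax=ax, cbar_ax=cbar_ax,
            annot=True, fmt=".2f", annot_kws={"rotation": 90})
# cbar_ax is an ordinary Axes, so its tick labels can be styled directly.
cbar_ax.tick_params(axis="y", labelrotation=90)
plt.setp(cbar_ax.get_yticklabels(), va="center")
fig.tight_layout()
```

The result should be close to the version above. The difference is that the colorbar's position and width are controlled by the subplot layout instead of being chosen by seaborn.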
You can pass arguments to ax.text() (which is used to write the annotations) through the annot_kws argument.
Therefore:
import matplotlib.pyplot as plt
import seaborn as sns
flights = sns.load_dataset("flights")
# DataFrame.pivot only accepts keyword arguments in pandas 2.0 and later
flights = flights.pivot(index="month", columns="year", values="passengers")
fig, ax = plt.subplots(figsize=(8,8))
ax = sns.heatmap(flights, annot=True, fmt='d', annot_kws={'rotation':90})

Scatter Matrix showing too many floating point values on graph

I'm trying to plot a scatter matrix using Python, but the ticks on the y-axis of the top left plot have a large number of unnecessary digits. I'm plotting the graph directly from pandas using the scatter_matrix function from pandas.plotting.
Also, I am quite new to Python, so sorry if this is a stupid question, but I just couldn't find the right answer to fit my needs.
I've tried different axis formatting options using yaxis.set_major_formatter (not sure if this doesn't work because I'm plotting from pandas, but it yields no results either way), and pandas.set_option to customise the display.
from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.3, figsize=(9,9), diagonal='kde')
df: Tesla Ret Ford Ret GM Ret
Date
2012-01-03 NaN NaN NaN
2012-01-04 -0.013177 0.015274 0.004751
2012-01-05 -0.021292 0.025664 0.048227
2012-01-06 -0.008481 0.010354 0.033829
2012-01-09 0.013388 0.007686 -0.003490
2012-01-10 0.013578 0.000000 0.017513
2012-01-11 0.022085 0.022881 0.052926
2012-01-12 0.000708 0.005800 0.008173
2012-01-13 -0.193274 -0.008237 -0.015403
2012-01-17 0.167179 -0.001661 -0.003705
...
I've tried to use:
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.2f}')) and ax.yaxis.set_major_formatter(FormatStrFormatter('%.2f')) after importing the respective modules, to no avail.
Figure is available here
Everything else in the figure is just as it should be, just the y-axis of the top left plot. I would like it to show one or two decimal point values like the rest of the figure.
I'd greatly appreciate any help that could fix my issue.
Thanks.
P.S.: I have edited this answer based on the problem pointed out by @ImportanceOfBeingEarnest (thanks to him). Please read the comments below the answer to see what I mean.
The new solution is to get the displayed ticks for that particular axis and format them up to 2 decimal places.
new_labels = [round(float(i.get_text()), 2) for i in axes[0,0].get_yticklabels()]
axes[0,0].set_yticklabels(new_labels)
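A self-contained version of this relabelling might look as follows. It is a sketch under a few assumptions: random numbers stand in for the real returns, diagonal="hist" is used so that SciPy is not required, and the labels are formatted with an f-string instead of round() so that trailing zeros are kept (0.10 rather than 0.1).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random stand-in for the daily returns shown in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 0.05, size=(250, 3)),
                  columns=["Tesla Ret", "Ford Ret", "GM Ret"])

axes = scatter_matrix(df, alpha=0.3, figsize=(9, 9), diagonal="hist")
top_left = axes[0, 0]

# Read the label texts that are currently displayed and rewrite them with two decimals.
# (A typographic minus sign, if present, is mapped back to "-" before parsing.)
new_labels = [f"{float(t.get_text().replace(chr(0x2212), '-')):.2f}"
              for t in top_left.get_yticklabels()]
top_left.set_yticklabels(new_labels)
```

Only the label texts change here. The tick positions set by scatter_matrix stay where they are.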
OLD ANSWER (kept for history; as you will see, the y-ticks in the figure generated below are not correct)
The problem is that you are using an ax object to format the labels, but what scatter_matrix returns is not a single axes object. It is an array containing 9 axes (a 3x3 grid of subplots). You can verify this by printing the shape of the axes variable.
axes = scatter_matrix(df, alpha=0.3, figsize=(9,9), diagonal='kde')
print (axes.shape)
# (3, 3)
The solution is either to iterate through all the axes or to change the formatting only for the problematic one. P.S.: The figure below doesn't match yours because I just used the small DataFrame you posted.
The following shows how to do it for all the y-axes:
from pandas.plotting import scatter_matrix
from matplotlib.ticker import FormatStrFormatter
axes = scatter_matrix(df, alpha=0.3, figsize=(9,9), diagonal='kde')
for ax in axes.flatten():
    ax.yaxis.set_major_formatter(FormatStrFormatter('%.2f'))
Alternatively, you can format just one particular axes. Here, your top left subplot can be accessed using axes[0,0]:
axes[0,0].yaxis.set_major_formatter(FormatStrFormatter('%.2f'))
pandas' scatter_matrix suffers from an unfortunate design choice: it plots the kde or histogram on the diagonal into the same axes that shows the ticks for the rest of the row. This requires faking the ticks and labels so that they fit the data. In the course of this, a FixedLocator and a FixedFormatter are used. The format of the tick labels is hence taken directly from the string representation of a number.
I would propose a completely different design here: the diagonal axes stay empty, and twin axes are used instead to show the histogram or kde curve. The problem from the question then cannot occur.
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def scatter_matrix(df, axes=None, **kw):
    n = df.columns.size
    diagonal = kw.pop("diagonal", "hist")
    # "if not axes" is ambiguous for an array of axes, so test for None explicitly
    if axes is None:
        fig, axes = plt.subplots(n,n, figsize=kw.pop("figsize", None),
                                 squeeze=False, sharex="col", sharey="row")
    else:
        flax = axes.flatten()
        fig = flax[0].figure
        assert len(flax) == n*n
    # no gaps between subplots
    fig.subplots_adjust(wspace=0, hspace=0)
    hist_kwds = kw.pop("hist_kwds", {})
    density_kwds = kw.pop("density_kwds", {})
    # off-diagonal subplots: one scatter per ordered pair of columns
    pairs = itertools.permutations(df.columns, r=2)
    indices = itertools.permutations(np.arange(len(df.columns)), r=2)
    for (i,j), (y,x) in zip(indices, pairs):
        axes[i,j].scatter(df[x].values, df[y].values, **kw)
        axes[i,j].tick_params(left=False, labelleft=False,
                              bottom=False, labelbottom=False)
    # diagonal subplots: histogram or kde on a twin axes
    diagaxes = []
    for i, c in enumerate(df.columns):
        ax = axes[i,i].twinx()
        diagaxes.append(ax)
        if diagonal == 'hist':
            ax.hist(df[c].values, **hist_kwds)
        elif diagonal in ('kde', 'density'):
            from scipy.stats import gaussian_kde
            y = df[c].values
            gkde = gaussian_kde(y)
            ind = np.linspace(y.min(), y.max(), 1000)
            ax.plot(ind, gkde.evaluate(ind), **density_kwds)
        if i != 0:
            # get_shared_y_axes().join() was removed in Matplotlib 3.8;
            # Axes.sharey() (Matplotlib 3.3+) is used here instead
            ax.sharey(diagaxes[0])
        ax.axis("off")
    # rescale the shared y axis so that every diagonal curve fits
    diagaxes[0].autoscale_view(scalex=False)
    for i,c in enumerate(df.columns):
        axes[i,i].tick_params(left=False, labelleft=False,
                              bottom=False, labelbottom=False)
        axes[i,0].set_ylabel(c)
        axes[-1,i].set_xlabel(c)
        axes[i,0].tick_params(left=True, labelleft=True)
        axes[-1,i].tick_params(bottom=True, labelbottom=True)
    return axes, diagaxes
df = pd.DataFrame(np.random.randn(1000, 4), columns=['A','B','C','D'])
axes,diagaxes = scatter_matrix(df, diagonal='kde', alpha=0.5)
plt.show()

Adding second legend to scatter plot

Is there a way to add a secondary legend to a scatterplot, where the size of the scatter is proportional to some data?
I have written the following code that generates a scatterplot. The color of the scatter represents the year (and is taken from a user-defined df) while the size of the scatter represents variable 3 (also taken from a df but is raw data):
import pandas as pd
colors = pd.DataFrame({'1985':'red','1990':'b','1995':'k','2000':'g','2005':'m','2010':'y'}, index=[0,1,2,3,4,5])
fig = plt.figure()
ax = fig.add_subplot(111)
for i in df.keys():
    df[i].plot(kind='scatter',x='variable1',y='variable2',ax=ax,label=i,s=df[i]['variable3']/100, c=colors[i])
ax.legend(loc='upper right')
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
This code (with my data) produces the following graph:
So while the colors/years are well and clearly defined, the size of the scatter is not.
How can I add a secondary or additional legend that defines what the size of the scatter means?
You will need to create the second legend yourself, i.e. you need to create some artists to populate the legend with. In the case of a scatter we can use a normal plot and set the marker accordingly.
This is shown in the below example. To actually add a second legend we need to add the first legend to the axes, such that the new legend does not overwrite the first one.
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np; np.random.seed(1)
import pandas as pd
plt.rcParams["figure.subplot.right"] = 0.8
v = np.random.rand(30,4)
v[:,2] = np.random.choice(np.arange(1980,2015,5), size=30)
v[:,3] = np.random.randint(5,13,size=30)
df= pd.DataFrame(v, columns=["x","y","year","quality"])
df.year = df.year.values.astype(int)
fig, ax = plt.subplots()
for i, (name, dff) in enumerate(df.groupby("year")):
    c = matplotlib.colors.to_hex(plt.cm.jet(i/7.))
    dff.plot(kind='scatter',x='x',y='y', label=name, c=c,
             s=dff.quality**2, ax=ax)
leg = plt.legend(loc=(1.03,0), title="Year")
ax.add_artist(leg)
h = [plt.plot([],[], color="gray", marker="o", ms=i, ls="")[0] for i in range(5,13)]
plt.legend(handles=h, labels=range(5,13),loc=(1.03,0.5), title="Quality")
plt.show()
Have a look at http://matplotlib.org/users/legend_guide.html.
It shows how to have multiple legends (about halfway down) and there is another example that shows how to set the marker size.
If that doesn't work, then you can also create a custom legend (last example).
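As a rough illustration of the multiple-legend pattern from that guide, here is a sketch that lets matplotlib generate the size entries itself via PathCollection.legend_elements (available since Matplotlib 3.1). The random data and the "year"/"quality" names are assumptions made for the example, and it uses a single scatter call with a colormap rather than one call per year:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random(30), rng.random(30)
year = rng.choice(np.arange(1985, 2015, 5), size=30)
quality = rng.integers(5, 13, size=30)

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=year, s=quality**2, cmap="viridis")

# First legend: one entry per colour (year).
handles, labels = sc.legend_elements(prop="colors")
year_legend = ax.legend(handles, labels, loc="upper left", title="Year")
ax.add_artist(year_legend)  # keep it when the next legend is created

# Second legend: marker sizes, mapped back from s = quality**2 to quality.
handles, labels = sc.legend_elements(prop="sizes", num=4, func=np.sqrt)
ax.legend(handles, labels, loc="lower right", title="Quality")
```

The add_artist call is the important part: without it, the second ax.legend call would replace the first legend.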

How to get rid of extra white space on subplots with shared axes?

I'm creating a plot using python 3.5.1 and matplotlib 1.5.1 that has two subplots (side by side) with a shared Y axis. A sample output image is shown below:
Notice the extra white space at the top and bottom of each set of axes. Try as I might I can't seem to get rid of it. The overall goal of the figure is to have a waterfall type plot on the left with a shared Y axes with the plot on the right.
Here's some sample code to reproduce the image above.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
# create some X values
periods = np.linspace(1/1440, 1, 1000)
# create some Y values (will be datetimes, not necessarily evenly spaced
# like they are in this example)
day_ints = np.linspace(1, 100, 100)
days = pd.to_timedelta(day_ints, 'D') + pd.to_datetime('2016-01-01')
# create some fake data for the number of points
points = np.random.random(len(day_ints))
# create some fake data for the color mesh
Sxx = np.random.random((len(days), len(periods)))
# Create the plots
fig = plt.figure(figsize=(8, 6))
# create first plot
ax1 = plt.subplot2grid((1,5), (0,0), colspan=4)
im = ax1.pcolormesh(periods, days, Sxx, cmap='viridis', vmin=0, vmax=1)
ax1.invert_yaxis()
ax1.autoscale(enable=True, axis='Y', tight=True)
# create second plot and use the same y axis as the first one
ax2 = plt.subplot2grid((1,5), (0,4), sharey=ax1)
ax2.scatter(points, days)
ax2.autoscale(enable=True, axis='Y', tight=True)
# Hide the Y axis scale on the second plot
plt.setp(ax2.get_yticklabels(), visible=False)
#ax1.set_adjustable('box-forced')
#ax2.set_adjustable('box-forced')
fig.colorbar(im, ax=ax1)
As you can see in the commented out code I've tried a number of approaches, as suggested by posts like https://github.com/matplotlib/matplotlib/issues/1789/ and Matplotlib: set axis tight only to x or y axis.
As soon as I remove the sharey=ax1 part of the second subplot2grid call the problem goes away, but then I also don't have a common Y axis.
Autoscale tends to add a buffer to the data so that all of the data points are easily visible and not part-way cut off by the axes.
Change:
ax1.autoscale(enable=True, axis='Y', tight=True)
to:
ax1.set_ylim(days.min(),days.max())
and
ax2.autoscale(enable=True, axis='Y', tight=True)
to:
ax2.set_ylim(days.min(),days.max())
This removes the extra white space at the top and bottom of both axes.
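The effect of setting the limits explicitly can be reproduced in a smaller example. This is a sketch with plain numbers standing in for the datetime values, and with plt.subplots(sharey=True) instead of subplot2grid to keep it short:

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.linspace(1, 100, 100)      # stand-in for the datetime values
x = np.random.random(len(y))

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True,
                               gridspec_kw={"width_ratios": [4, 1]})
ax1.plot(np.sin(y / 10), y)
ax2.scatter(x, y)

# Explicit limits on one axes propagate to every axes sharing its y axis,
# and they leave no margin around the data.
ax1.set_ylim(y.min(), y.max())
```

Because the y axis is shared, a single set_ylim call is enough in this layout. Calling it on both axes, as in the answer, is harmless.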
