Adding second legend to scatter plot - python

Is there a way to add a secondary legend to a scatterplot, where the size of the scatter is proportional to some data?
I have written the following code that generates a scatterplot. The color of the scatter represents the year (and is taken from a user-defined df) while the size of the scatter represents variable 3 (also taken from a df but is raw data):
import pandas as pd
import matplotlib.pyplot as plt
colors = pd.DataFrame({'1985':'red','1990':'b','1995':'k','2000':'g','2005':'m','2010':'y'}, index=[0,1,2,3,4,5])
fig = plt.figure()
ax = fig.add_subplot(111)
for i in df.keys():
    df[i].plot(kind='scatter',x='variable1',y='variable2',ax=ax,label=i,s=df[i]['variable3']/100, c=colors[i])
ax.legend(loc='upper right')
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
This code (with my data) produces the following graph:
So while the colors/years are well and clearly defined, the size of the scatter is not.
How can I add a secondary or additional legend that defines what the size of the scatter means?

You will need to create the second legend yourself, i.e. you need to create some artists to populate the legend with. For a scatter plot, a normal plot call with the marker set accordingly works as a legend handle.
This is shown in the example below. To actually get two legends, the first legend must be added to the axes as an artist (ax.add_artist) before the second one is created; otherwise the second legend call replaces the first.
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np; np.random.seed(1)
import pandas as pd
plt.rcParams["figure.subplot.right"] = 0.8
v = np.random.rand(30,4)
v[:,2] = np.random.choice(np.arange(1980,2015,5), size=30)
v[:,3] = np.random.randint(5,13,size=30)
df= pd.DataFrame(v, columns=["x","y","year","quality"])
df.year = df.year.values.astype(int)
fig, ax = plt.subplots()
for i, (name, dff) in enumerate(df.groupby("year")):
    c = matplotlib.colors.to_hex(plt.cm.jet(i/7.))
    dff.plot(kind='scatter',x='x',y='y', label=name, c=c,
             s=dff.quality**2, ax=ax)
leg = plt.legend(loc=(1.03,0), title="Year")
ax.add_artist(leg)
h = [plt.plot([],[], color="gray", marker="o", ms=i, ls="")[0] for i in range(5,13)]
plt.legend(handles=h, labels=range(5,13),loc=(1.03,0.5), title="Quality")
plt.show()

Have a look at http://matplotlib.org/users/legend_guide.html.
It shows how to have multiple legends (about halfway down) and there is another example that shows how to set the marker size.
If that doesn't work, then you can also create a custom legend (last example).
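As a rough sketch of the multiple-legend approach from the guide: a single ax.scatter call can supply the handles for both legends through PathCollection.legend_elements (available in matplotlib 3.1 and later). The data, the variable names (year, quality) and the s = quality**2 size mapping are assumptions made up for illustration, not the asker's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 'year' drives the colour, 'quality' drives the marker size.
rng = np.random.default_rng(1)
x, y = rng.random(30), rng.random(30)
year = rng.choice(np.arange(1980, 2015, 5), size=30)
quality = rng.integers(5, 13, size=30)

fig, ax = plt.subplots()
fig.subplots_adjust(right=0.8)  # leave room for the legends beside the axes
sc = ax.scatter(x, y, c=year, s=quality**2, cmap="jet")

# First legend: one entry per colour value.
color_handles, color_labels = sc.legend_elements(prop="colors")
year_legend = ax.legend(color_handles, color_labels, loc=(1.03, 0), title="Year")
# Register the first legend as an ordinary artist; the next ax.legend()
# call would otherwise replace it.
ax.add_artist(year_legend)

# Second legend: marker sizes, with labels mapped back from s = quality**2.
size_handles, size_labels = sc.legend_elements(prop="sizes", func=np.sqrt)
quality_legend = ax.legend(size_handles, size_labels, loc=(1.03, 0.5), title="Quality")

fig.canvas.draw()  # use plt.show() instead when working interactively
```

The func argument only affects the labels: it converts the stored marker areas back into the original quality values, so it has to be the inverse of whatever mapping was used for s.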

Related

How can I rotate annotated seaborn heatmap data and legend?

I created a seaborn heatmap to summarize Theil's U coefficients. The data is displayed horizontally in the heatmap. Now I would like to rotate the data and the legend. I know that you can rotate the x-axis and y-axis labels in a plot, but how can I rotate the data and the legend?
This is my code:
# create a pandas DataFrame to hold the values
theilu = pd.DataFrame(index=['Y'],columns=matrix.columns)
# store column names in variable columns
columns = matrix.columns
# iterate through each variable
for j in range(0,len(columns)):
    # call the theil_u function on the "zipped" independent and dependent variables -> respectively x & y in the functions section
    u = theil_u(matrix['Y'].tolist(),matrix[columns[j]].tolist())
    # select the respective column needed for the output
    theilu.loc[:,columns[j]] = u
# handle NaNs if any
theilu.fillna(value=np.nan,inplace=True)
#plot correlation between fraud reported (y) and all other variables (x)
plt.figure(figsize=(20,1))
sns.heatmap(theilu,annot=True,fmt='.2f')
plt.show()
Here is an image of what I am looking for:
Please let me know if you need any sample data or the theil_u function to recreate the problem. Thank you.
The parameters of the annotation can be changed via annot_kws. One of them is the rotation.
Some parameters of the colorbar can be changed via the cbar_kws dict, but unfortunately the orientation of the labels isn't one of them. Therefore, you need a handle to the colorbar's ax. One way is to create an ax beforehand and pass it to sns.heatmap(..., cbar_ax=ax). An easier way is to get the colorbar afterwards with cbar = heatmap.collections[0].colorbar and use cbar.ax.
With this ax handle, you can change more properties of the colorbar, such as the orientation of its labels. Their vertical alignment can also be changed to get them centered.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = np.random.rand(1, 12)
fig, ax = plt.subplots(figsize=(10,2))
heatmap = sns.heatmap(data, cbar=True, ax=ax,
                      annot=True, fmt='.2f', annot_kws={'rotation': 90})
cbar = heatmap.collections[0].colorbar
# heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=90)
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=90)
cbar.ax.set_yticklabels(cbar.ax.get_yticklabels(), rotation=90, va='center')
plt.tight_layout()
plt.show()
You can pass arguments to ax.text() (which is used to write the annotations) using the annot_kws= argument.
Therefore:
import matplotlib.pyplot as plt
import seaborn as sns
flights = sns.load_dataset("flights")
# recent pandas versions require keyword arguments for pivot()
flights = flights.pivot(index="month", columns="year", values="passengers")
fig, ax = plt.subplots(figsize=(8,8))
ax = sns.heatmap(flights, annot=True, fmt='d', annot_kws={'rotation':90})

Plotting a variable number of series and data forecasts onto subplots, including axis labels, formatting and line colours

I'm writing a program which gets data and then uses time series forecasting to predict data values for the next, say, 300 data points.
However, only data which fulfills a certain condition will be plotted, so there is no predefined number of subplots for the add_subplot() method. I'm aware of the plt.subplots() function, but something such as
fig, (ax1, ax2) = plt.subplots(1, 2)
implies that exactly two graphs will be plotted, whereas I need the number of plots to vary, for example via a parameter.
Here is a simplified version of the current code which results in each plot being in separate windows:
fig = plt.figure() # creates a figure instance for the final graph output
plots = 1 # indicates the total number of plots to plot, starting from 1
# passed as a parameter to the add_subplot() function
for data in dataSet:
    forecast(data, fig, plots)
plt.figure(fig.number)
plt.show()
And the function:
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
from pandas import Series
from statsmodels.tsa.holtwinters import ExponentialSmoothing
def forecast(data, superFigure, plotNumber):
    index = range(0, len(data))
    plotData = Series(data, index)
    # fit the data values into a specific model:
    modelFit = ExponentialSmoothing(plotData, trend="add").fit()
    # forecast for the next 300 points:
    modelForecast = modelFit.forecast(300)
    if [condition]:
        # plot the original data points:
        points = plotData.plot(marker='x', color='black', label='Base Data')
        points.set_xlim(0, len(data) + 300)
        # plot the forecast in a different colour:
        modelForecast.plot(marker='x', ax=points, color='blue', label='Forecasted Data')
        plt.title("Plot Title")
        plt.xlabel("X Axis")
        plt.ylabel("Y Axis")
        # format the axes, adding thousand separator
        points.get_xaxis().set_major_formatter(
            tick.FuncFormatter(lambda x, p: format(int(x), ',')))
        points.get_yaxis().set_major_formatter(
            tick.FuncFormatter(lambda x, p: format(int(x), ',')))
        plt.legend()
        plt.show()
This produces multiple graphs such as this (actual labels have been cut out).
Unfortunately you have to close each graph before viewing the next one, and I want every graph to be visible on one page.
I tried changing the code within the "if [condition]" to:
if [condition]:
    points = plotData.plot(marker='x', color='black', label='Base Data')
    modelForecast.plot(marker='x', ax=points, color='blue', label='Forecasted Data')
    dataLine = plt.gca().get_lines()[0]
    forecastLine = plt.gca().get_lines()[1]
    # put all x and y values into single lists by concatenating them
    totalXData = [*dataLine.get_xdata(), *forecastLine.get_xdata()]
    totalYData = [*dataLine.get_ydata(), *forecastLine.get_ydata()]
    subset = superFigure.add_subplot(10, 10, plotNumber)
    for i in range(0, len(totalXData)):
        subset.plot(totalXData[i], totalYData[i])
    plotNumber += 1
These changes produce this exact graph which seems to have the other graphs squished in the top-left corner, and I get "MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance" warnings.
If I change "superFigure.add_subplot(10, 10, plotNumber)" to "superFigure.add_subplot(20, 20, plotNumber)" I also get "UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations".
I then tried to change it to:
if [condition]:
    fig, ax = plt.subplots()
    plotData.plot(marker='x', ax=ax, color='black', label='Base Data')
    modelForecast.plot(marker='x', ax=ax, color='blue', label='Forecasted Data')
    ax.set([...])
    ax.legend()
    plt.show()
which doesn't produce the desired output, presumably because it creates a new figure on each call of forecast() (unless a figure window can contain multiple figures).
I also sometimes get the following warning:
RuntimeWarning: More than 20 figures have been opened. Figures created
through the pyplot interface (matplotlib.pyplot.figure) are retained
until explicitly closed and may consume too much memory.
fig, ax = plt.subplots()
How can I create subplots which include all the formatting and are displayed in one window all together?
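One possible approach, sketched under assumptions rather than tested against the original data: decide which data sets qualify first, create a single figure with a grid sized to that count via plt.subplots(nrows, ncols), and pass each Axes into the plotting function instead of letting it create or show a figure. Everything named here is hypothetical: naive_forecast is a straight-line stand-in for the ExponentialSmoothing model, plot_forecast is an illustrative helper, and the "at least 40 points" condition replaces the unspecified [condition].

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
import numpy as np
import pandas as pd


def naive_forecast(series, steps):
    """Hypothetical stand-in for the real model: extend a straight-line fit."""
    slope, intercept = np.polyfit(np.arange(len(series)), series.to_numpy(), 1)
    future = np.arange(len(series), len(series) + steps)
    return pd.Series(slope * future + intercept, index=future)


def plot_forecast(ax, series, forecast):
    """Draw one data set and its forecast onto the Axes that is passed in."""
    series.plot(ax=ax, marker="x", color="black", label="Base Data")
    forecast.plot(ax=ax, marker="x", color="blue", label="Forecasted Data")
    ax.set_xlim(0, len(series) + len(forecast))
    ax.set(title="Plot Title", xlabel="X Axis", ylabel="Y Axis")
    # thousands separator on both axes
    ax.xaxis.set_major_formatter(tick.StrMethodFormatter("{x:,.0f}"))
    ax.yaxis.set_major_formatter(tick.StrMethodFormatter("{x:,.0f}"))
    ax.legend()


# Made-up data sets of different lengths.
rng = np.random.default_rng(0)
lengths = [30, 60, 45, 80, 20, 70, 55]
data_sets = [pd.Series(rng.normal(size=n).cumsum()) for n in lengths]

# Filter first, so the number of subplots is known before the grid is built.
selected = [s for s in data_sets if len(s) >= 40]  # placeholder condition

ncols = 3
nrows = math.ceil(len(selected) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)

for ax, series in zip(axes.flat, selected):
    plot_forecast(ax, series, naive_forecast(series, 300))

# Hide grid cells that are left over when the count is not a multiple of ncols.
for ax in axes.flat[len(selected):]:
    ax.set_visible(False)

fig.tight_layout()  # call plt.show() once here when working interactively
```

Calling plt.show() only once, after the loop, is what keeps all plots in one window; squeeze=False keeps axes two-dimensional even when there is a single row.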

Matplotlib scatter legend with colors using categorical variable

I have made a simple scatterplot using matplotlib showing data from 2 numerical variables (varA and varB) with colors that I defined with a 3rd categorical string variable (col) containing 10 unique colors (corresponding to another string variable with 10 unique names), all in the same Pandas DataFrame with 100+ rows.
Is there an easy way to create a legend for this scatterplot that shows the unique colored dots and their corresponding category names? Or should I somehow group the data and plot each category in a subplot to do this? This is what I have so far:
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
varA = df['A']
varB = df['B']
col = df['Color']
plt.scatter(varA,varB, c=col, alpha=0.8)
plt.legend()
plt.show()
I had to chime in, because I could not accept that I needed a for-loop to accomplish this. It just seems really annoying and unpythonic, especially when I'm not using Pandas. However, after some searching, I found the answer. The object returned by ax.scatter() is a PathCollection (defined in matplotlib.collections), and its legend_elements() method creates the legend handles for you. See implementation below:
# imports
import matplotlib.collections
import matplotlib.pyplot as plt
import numpy as np
# create random data and numerical labels
x = np.random.rand(10,2)
y = np.random.randint(4, size=10)
# create list of categories
labels = ['type1', 'type2', 'type3', 'type4']
# plot
fig, ax = plt.subplots()
scatter = ax.scatter(x[:,0], x[:,1], c=y)
handles, _ = scatter.legend_elements(prop="colors", alpha=0.6) # use my own labels
legend1 = ax.legend(handles, labels, loc="upper right")
ax.add_artist(legend1)
plt.show()
[Figure: scatter plot legend with custom labels]
Source:
https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html
https://matplotlib.org/stable/api/collections_api.html#matplotlib.collections.PathCollection.legend_elements
Assuming Color is the column that holds the color (and therefore the label) of each row, you can simply do the following:
colors = list(df['Color'].unique())
for i in range(0 , len(colors)):
    data = df.loc[df['Color'] == colors[i]]
    plt.scatter('A', 'B', data=data, color='Color', label=colors[i])
plt.legend()
plt.show()
A simple way is to group your data by color, then plot all of the data on one plot. Pandas has a built in groupby function. For example:
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
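# group by the column name itself (not a one-element list), so that
# each key is a plain color string rather than a 1-tuple
for color, group in df.groupby('Color'):
    plt.scatter(group['A'], group['B'], c=color, alpha=0.8, label=color)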
plt.legend()
plt.show()
Notice that we call plt.scatter once for each grouping of data. Then we only need to call plt.legend and plt.show once all of the data is in our plot.
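A self-contained version of the group-and-plot idea, as a minimal sketch: the DataFrame is made up, and the Name column and the palette dict are assumptions standing in for the asker's category names and colors.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 'Name' is the category, 'Color' the color assigned to it.
rng = np.random.default_rng(42)
palette = {"alpha": "tab:blue", "beta": "tab:orange", "gamma": "tab:green"}
df = pd.DataFrame({
    "A": rng.random(60),
    "B": rng.random(60),
    "Name": rng.choice(list(palette), size=60),
})
df["Color"] = df["Name"].map(palette)

fig, ax = plt.subplots()
# One scatter call per category; each call contributes one legend entry.
for name, group in df.groupby("Name"):
    ax.scatter(group["A"], group["B"], color=palette[name], alpha=0.8, label=name)
ax.legend(title="Name")
```

Grouping by the name column rather than the color column puts readable category names in the legend while the colors still come from the lookup.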

Setting xticks in pandas bar plot

I came across this different behaviour in the third example plot below. Why am I able to correctly edit the x-axis' ticks with pandas line() and area() plots, but not with bar()? What's the best way to fix the (general) third example?
import numpy as np
import pandas as pd
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
x = np.arange(73,145,1)
y = np.cos(x)
df = pd.Series(y,x)
ax1 = df.plot.line()
ax1.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax1.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
plt.show()
ax2 = df.plot.area(stacked=False)
ax2.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax2.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
plt.show()
ax3 = df.plot.bar()
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
plt.show()
Problem:
The bar plot is meant to be used with categorical data. Therefore the bars are not actually at the positions of x but at positions 0,1,2,...N-1. The bar labels are then adjusted to the values of x.
If you then put a tick only on every tenth bar, the labels are still handed out in order, so the second label will be placed at the tenth bar, and so on.
You can see that the bars are actually positioned at integer values starting at 0 by using a normal ScalarFormatter on the axes:
ax3 = df.plot.bar()
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
ax3.xaxis.set_major_formatter(ticker.ScalarFormatter())
Now you can of course define your own fixed formatter like this
n = 10
ax3 = df.plot.bar()
ax3.xaxis.set_major_locator(ticker.MultipleLocator(n))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(n/4.))
seq = ax3.xaxis.get_major_formatter().seq
ax3.xaxis.set_major_formatter(ticker.FixedFormatter([""]+seq[::n]))
which has the drawback that it starts at some arbitrary value.
Solution:
I would guess the best general solution is not to use the pandas plotting function at all (it is only a wrapper anyway), but to call the matplotlib bar function directly:
fig, ax3 = plt.subplots()
ax3.bar(df.index, df.values, width=0.72)
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))
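The difference in bar positions described above can be checked directly. This is only a sketch: bar_centers is a hypothetical helper, and the printed pandas positions depend on the pandas version in use (the explanation above expects 0, 1, 2, ...).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd

x = np.arange(73, 145)
s = pd.Series(np.cos(x), index=x)

fig, (ax_pd, ax_mpl) = plt.subplots(1, 2, figsize=(10, 3))

# pandas wrapper: bars are treated as categories.
s.plot.bar(ax=ax_pd)

# matplotlib directly: bars sit at the actual index values,
# so numeric locators behave as they do for line plots.
ax_mpl.bar(s.index, s.values, width=0.72)
ax_mpl.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax_mpl.xaxis.set_minor_locator(ticker.MultipleLocator(2.5))


def bar_centers(ax):
    """Hypothetical helper: x coordinate of the centre of every bar."""
    return [p.get_x() + p.get_width() / 2 for p in ax.patches]


print("pandas bar centres:    ", bar_centers(ax_pd)[:3])
print("matplotlib bar centres:", bar_centers(ax_mpl)[:3])
```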

Is there a function to make scatterplot matrices in matplotlib?

Example of scatterplot matrix
Is there such a function in matplotlib.pyplot?
For those who do not want to define their own functions, there is a great data analysis library in Python, called Pandas, which provides the scatter_matrix() function:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(1000, 4), columns = ['a', 'b', 'c', 'd'])
scatter_matrix(df, alpha = 0.2, figsize = (6, 6), diagonal = 'kde')
Generally speaking, matplotlib doesn't usually contain plotting functions that operate on more than one axes object (subplot, in this case). The expectation is that you'd write a simple function to string things together however you'd like.
I'm not quite sure what your data looks like, but it's quite simple to just build a function to do this from scratch. If you're always going to be working with structured or rec arrays, then you can simplify this a touch. (i.e. There's always a name associated with each data series, so you can omit having to specify names.)
As an example:
import itertools
import numpy as np
import matplotlib.pyplot as plt
def main():
    np.random.seed(1977)
    numvars, numdata = 4, 10
    data = 10 * np.random.random((numvars, numdata))
    fig = scatterplot_matrix(data, ['mpg', 'disp', 'drat', 'wt'],
            linestyle='none', marker='o', color='black', mfc='none')
    fig.suptitle('Simple Scatterplot Matrix')
    plt.show()
def scatterplot_matrix(data, names, **kwargs):
    """Plots a scatterplot matrix of subplots. Each row of "data" is plotted
    against other rows, resulting in a nrows by nrows grid of subplots with the
    diagonal subplots labeled with "names". Additional keyword arguments are
    passed on to matplotlib's "plot" command. Returns the matplotlib figure
    object containing the subplot grid."""
    numvars, numdata = data.shape
    fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(8,8))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    for ax in axes.flat:
        # Hide all ticks and labels
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        # Set up ticks only on one side for the "edge" subplots...
        # (ax.is_first_col() etc. were removed in matplotlib 3.6;
        # the SubplotSpec offers the same checks)
        spec = ax.get_subplotspec()
        if spec.is_first_col():
            ax.yaxis.set_ticks_position('left')
        if spec.is_last_col():
            ax.yaxis.set_ticks_position('right')
        if spec.is_first_row():
            ax.xaxis.set_ticks_position('top')
        if spec.is_last_row():
            ax.xaxis.set_ticks_position('bottom')
    # Plot the data.
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        for x, y in [(i,j), (j,i)]:
            axes[x,y].plot(data[x], data[y], **kwargs)
    # Label the diagonal subplots...
    for i, label in enumerate(names):
        axes[i,i].annotate(label, (0.5, 0.5), xycoords='axes fraction',
                ha='center', va='center')
    # Turn on the proper x or y axes ticks.
    for i, j in zip(range(numvars), itertools.cycle((-1, 0))):
        axes[j,i].xaxis.set_visible(True)
        axes[i,j].yaxis.set_visible(True)
    return fig
main()
You can also use Seaborn's pairplot function:
import seaborn as sns
sns.set()
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")
Thanks for sharing your code! You figured out all the hard stuff for us. As I was working with it, I noticed a few little things that didn't look quite right.
[FIX #1] The axis ticks weren't lining up like I would expect (i.e., in your example above, you should be able to draw a vertical and a horizontal line through any point across all plots, and the lines should cross through the corresponding point in the other plots, but as it stands this doesn't happen).
[FIX #2] If you are plotting an odd number of variables, the bottom right corner axes doesn't pull the correct x or y ticks. It just leaves them as the default 0..1 ticks.
Not a fix, but I made it optional to explicitly input names, so that a default label xi for variable i is put in the diagonal positions.
Below you'll find an updated version of your code that addresses these two points, otherwise preserving the beauty of your code.
import itertools
import numpy as np
import matplotlib.pyplot as plt
def scatterplot_matrix(data, names=[], **kwargs):
    """
    Plots a scatterplot matrix of subplots. Each row of "data" is plotted
    against other rows, resulting in a nrows by nrows grid of subplots with the
    diagonal subplots labeled with "names". Additional keyword arguments are
    passed on to matplotlib's "plot" command. Returns the matplotlib figure
    object containing the subplot grid.
    """
    numvars, numdata = data.shape
    fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(8,8))
    fig.subplots_adjust(hspace=0.0, wspace=0.0)
    for ax in axes.flat:
        # Hide all ticks and labels
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        # Set up ticks only on one side for the "edge" subplots...
        # (ax.is_first_col() etc. were removed in matplotlib 3.6;
        # the SubplotSpec offers the same checks)
        spec = ax.get_subplotspec()
        if spec.is_first_col():
            ax.yaxis.set_ticks_position('left')
        if spec.is_last_col():
            ax.yaxis.set_ticks_position('right')
        if spec.is_first_row():
            ax.xaxis.set_ticks_position('top')
        if spec.is_last_row():
            ax.xaxis.set_ticks_position('bottom')
    # Plot the data.
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        for x, y in [(i,j), (j,i)]:
            # FIX #1: this needed to be changed from ...(data[x], data[y],...)
            axes[x,y].plot(data[y], data[x], **kwargs)
    # Label the diagonal subplots...
    if not names:
        names = ['x'+str(i) for i in range(numvars)]
    for i, label in enumerate(names):
        axes[i,i].annotate(label, (0.5, 0.5), xycoords='axes fraction',
                ha='center', va='center')
    # Turn on the proper x or y axes ticks.
    for i, j in zip(range(numvars), itertools.cycle((-1, 0))):
        axes[j,i].xaxis.set_visible(True)
        axes[i,j].yaxis.set_visible(True)
    # FIX #2: if numvars is odd, the bottom right corner plot doesn't have the
    # correct axes limits, so we pull them from other axes
    if numvars%2:
        xlimits = axes[0,-1].get_xlim()
        ylimits = axes[-1,0].get_ylim()
        axes[-1,-1].set_xlim(xlimits)
        axes[-1,-1].set_ylim(ylimits)
    return fig
if __name__=='__main__':
    np.random.seed(1977)
    numvars, numdata = 4, 10
    data = 10 * np.random.random((numvars, numdata))
    fig = scatterplot_matrix(data, ['mpg', 'disp', 'drat', 'wt'],
            linestyle='none', marker='o', color='black', mfc='none')
    fig.suptitle('Simple Scatterplot Matrix')
    plt.show()
Thanks again for sharing this with us. I have used it many times! Oh, and I re-arranged the main() part of the code so that it can be a formal example code or not get called if it is being imported into another piece of code.
While reading the question I expected to see an answer including rpy. I think this is a nice option taking advantage of two beautiful languages. So here it is:
import rpy
import numpy as np
def main():
    np.random.seed(1977)
    numvars, numdata = 4, 10
    data = 10 * np.random.random((numvars, numdata))
    mpg = data[0,:]
    disp = data[1,:]
    drat = data[2,:]
    wt = data[3,:]
    rpy.set_default_mode(rpy.NO_CONVERSION)
    R_data = rpy.r.data_frame(mpg=mpg,disp=disp,drat=drat,wt=wt)
    # Figure saved as eps
    rpy.r.postscript('pairsPlot.eps')
    rpy.r.pairs(R_data,
                main="Simple Scatterplot Matrix Via RPy")
    rpy.r.dev_off()
    # Figure saved as png
    rpy.r.png('pairsPlot.png')
    rpy.r.pairs(R_data,
                main="Simple Scatterplot Matrix Via RPy")
    rpy.r.dev_off()
    rpy.set_default_mode(rpy.BASIC_CONVERSION)
if __name__ == '__main__': main()
I can't post an image to show the result :( sorry!
