pandas 3x3 scatter-matrix missing labels - python

I create a pandas scatter-matrix usng the following code:
import numpy as np
import pandas as pd
a = np.random.normal(1, 3, 100)
b = np.random.normal(3, 1, 100)
c = np.random.normal(2, 2, 100)
df = pd.DataFrame({'A':a,'B':b,'C':c})
pd.scatter_matrix(df, diagonal='kde')
This result in the following scatter-matrix:
The first row has no ytick labels, the 3th column no xtick labels, the 3th item 'C' is not labeled.
Any idea how to complete this plot with the missing labels ?

Access the subplot in question and change its settings like so.
axes = pd.scatter_matrix(df, diagonal='kde')
ax = axes[2, 2] # your bottom-right subplot
ax.xaxis.set_visible(True)
draw()
You can inspect how the scatter_matrix function goes about labeling at the link below. If you find yourself doing this over and over, consider copying the code into file and creating your own custom scatter_matrix function.
https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py#L160
Edit, in response to a rejected comment:
The obvious extensions of this, doing ax[0, 0].xaxis.set_visible(True) and so forth, do not work. For some reason, scatter_matrix seems to set up ticks and labels for axes[2, 2] without making them visible, but it does not set up ticks and labels for the rest. If you decide that it is necessary to display ticks and labels on other subplots, you'll have to dig deeper into the code linked above.
Specifically, change the conditions on the if statements to:
if i == 0
if i == n-1
if j == 0
if j == n-1
respectively. I haven't tested that, but I think it will do the trick.

Since I can't reply above, the non-changing source code version for anybody googling is:
n = len(features)
for x in range(n):
for y in range(n):
sax = axes[x, y]
if ((x%2)==0) and (y==0):
if not sax.get_ylabel():
sax.set_ylabel(features[-1])
sax.yaxis.set_visible(True)
if (x==(n-1)) and ((y%2)==0):
sax.xaxis.set_visible(True)
if ((x%2)==1) and (y==(n-1)):
if not sax.get_ylabel():
sax.set_ylabel(features[-1])
sax.yaxis.set_visible(True)
if (x==0) and ((y%2)==1):
sax.xaxis.set_visible(True)
features is the list of column names

Related

Python / Seaborn - How to plot the names of each value in a scatterplot

first of all, in case I comment on any mistakes while writing this, sorry, English is not my first language.
I'm a begginer with Data vizualiation with python, I have a dataframe with 115 rows, and I want to do a scatterplot with 4 quadrants and show the values in R1 (image below for reference)
enter image description here
At moment this is my scatterplot. It's a football player dataset so I want to plot the name of the players name in the 'R1'. Is that possible?
enter image description here
You can annotate each point by making a sub-dataframe of just the players in a quadrant that you care about based on their x/y values using plt.annotate. So something like this:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
##### Making a mock dataset #################################################
names = ['one', 'two', 'three', 'four', 'five', 'six']
value_1 = [1, 2, 3, 4, 5, 6]
value_2 = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(zip(names, value_1, value_2), columns = ['name', 'v_1', 'v_2'])
#############################################################################
plt.rcParams['figure.figsize'] = (10, 5) # sizing parameter to make the graph bigger
ax1 = sns.scatterplot(x = value_1, y = value_2, s = 100) # graph code
# Code to make a subset of data that fits the specific conditions that I want to annotate
quadrant = df[(df.v_1 > 3) & (df.v_2 > 3)].reset_index(drop = True)
# Code to annotate the above "quadrant" on the graph
for x in range(len(quadrant)):
plt.annotate('Player: {}\nValue 1: {}\nValue 2: {}'.format(quadrant.name[x], quadrant.v_1[x], quadrant.v_2[x]),
(quadrant.v_1[x], quadrant.v_2[x])
Output graph:
If you're just working in the notebook and don't need to save the image with all the player's names, then using a "hover" feature might be a better idea. Annotating every player's name might become too busy for the graph, so just hovering over the point might work out better for you.
%matplotlib widget # place this or "%matplotlib notebook" at the top of your notebook
# This allows you to work with matplotlib graphs in line
import mplcursors # this is the module that allows hovering features on your graph
# using the same dataframe from above
ax1 = sns.scatterplot(x = value_1, y = value_2, s = 100)
#mplcursors.cursor(ax1, hover=2).connect("add") # add the plot to the hover feature of mplcursors
def _(sel):
sel.annotation.set_text('Player: {}\nValue 1: {}\nValue 2: {}'.format(df.name[sel.index], sel.target[0], sel.target[0])) # set the text
# you don't need any of the below but I like to customize
sel.annotation.get_bbox_patch().set(fc="lightcoral", alpha=1) # set the box color
sel.annotation.arrow_patch.set(arrowstyle='-|>', connectionstyle='angle3', fc='black', alpha=.5) # set the arrow style
Example outputs from hovering:
You can do two (or more) scatter plots on a single figure.
If I understand correctly what you want to do, you could separate your dataset in two :
Points for which you don't want the name to be plotted
Points for which you want the name to be plotted
You can then plot the second data set and display the name.
Without any other details on your problem, it is difficult to do more. You could edit your question and add a minimal example of your data set.

Add Axessubplot object to specific subplot index

There is an easy way to generate word frequency plots with NLTK:
my_plot = nltk.FreqDist(some_dynamically_changing_list[0]).plot(20)
some_dynamically_changing_list contains a list of texts that I want to see the word frequency for. The resulting my_plot is an AxesSubplot object. I want to be able to take this object and directly insert it to a dynamically sized subplotted grid. The closest I've gotten to the solution after scouring SO and Google is this. However, I have an issue because I don't want to use plt's plot function. So for example I'd have a dynamic subplot grid:
import matplotlib.pyplot as plt
tot = len(some_dynamically_changing_list)
cols = 5
rows = tot // cols
if tot % cols != 0:
rows += 1
position = range(1, tot + 1)
fig = plt.figure()
for k in range(tot):
ax = fig.add_subplot(rows, cols, position[k])
my_plot = nltk.FreqDist(some_dynamically_changing_list[k]).plot(20)
ax.plot(my_plot) #This is where I have my issue.
Of course this won't work, as ax.plot expects data, not a plot. But I want to plug my my_plot into that specific slot. How can I do that (if I can)?

Pyplot/Matplotlib: Binary data with strings on x-axis

I know it's such a basic thing, but due to ridiculous time constraints and the severity of the situation I'm forced to ask something like this:
I've got two arrays of 160 000 entries. One contains strings(names I need to use), the other contains corresponding 1's and 0's.
I'm trying to make a simple "step" graph in pyplot with the array of names along the X-axis and 0 and 1 along the Y-axis.
I have this currently:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 4, 5, 9]
bindata = [0,1,1,0,1,1,0,0,0,1]
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
plt.step(xaxis, yaxis)
plt.xlabel('Filter Degree Combinations')
plt.ylabel('Negative Or Positive')
plt.title("Car 1")
#plt.savefig('foo.png') #For saving
plt.show()
It gives me this:
But I want something like this:
I cobbled the code together from some examples, tutorials and stackoverflow questions, but I run into "ValueError: x and y must have same first dimension" so often that I'm not getting anywhere when I try to experiment my way forward.
You can achieve the desired plot by specifying the tick labels and their positions on the x-axis using plt.xticks. The first argument range(0, 10, 2) is the positions followed by the strings
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 4, 5, 9]
bindata = [0,1,1,0,1,1,0,0,0,1]
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
plt.step(xaxis, yaxis)
xlabels = ['Josh', 'Anna', 'Kevin', 'Sophie', 'Steve'] # <-- specify tick-labels
plt.xlabel('Filter Degree Combinations')
plt.ylabel('Negative Or Positive')
plt.title("Car 1")
plt.xticks(range(0, 10, 2), xlabels) # <-- assign tick-labels
plt.show()

Annotated heatmap with multiple color schemes

I have the following dataframe and would like to differentiate the minor decimal differences in each "step" with a different color scheme in a heatmap.
Sample data:
Sample Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
A 64.847 54.821 20.897 39.733 23.257 74.942 75.945
B 64.885 54.767 20.828 39.613 23.093 74.963 75.928
C 65.036 54.772 20.939 39.835 23.283 74.944 75.871
D 64.869 54.740 21.039 39.889 23.322 74.925 75.894
E 64.911 54.730 20.858 39.608 23.101 74.956 75.930
F 64.838 54.749 20.707 39.394 22.984 74.929 75.941
G 64.887 54.781 20.948 39.748 23.238 74.957 75.909
H 64.903 54.720 20.783 39.540 23.028 74.898 75.911
I 64.875 54.761 20.911 39.695 23.082 74.897 75.866
J 64.839 54.717 20.692 39.377 22.853 74.849 75.939
K 64.857 54.736 20.934 39.699 23.130 74.880 75.903
L 64.754 54.746 20.777 39.536 22.991 74.877 75.902
M 64.798 54.811 20.963 39.824 23.187 74.886 75.895
An example of what I am looking for:
My first approach would be based on a figure with multiple subplots. Number of plots would equal number of columns in your dataframe; the gap between the plots could be shrinked down to zero:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
if i>0:
a.yaxis.set_ticks([])
Result:
Not sure if this will lead to a helpful or even self describing visualization of data, but that's your choice - perhaps this helps to start...
Supplemental:
Regarding adding the colorbars: of course you can. But - besides not knowing the background of your data and the purpose of the visualization - I'd like to add some thoughts on all that:
First: adding all those colorbars as a separate bunch of bars on one side or below the heatmap is probably possible, but I find it already quite hard to read the data, plus: you already have all those annotations - it would mess all up I think.
Additionally: in the meantime #ImportanceOfBeingErnest provided such a beutiful solution on that topic, that this would be not too meaningful imo here.
Second: if you really want to stick to the heatmap thing, perhaps splitting up and giving every column its colorbar would suit better:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, figsize=(10, 3))
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c)
if i>0:
a.yaxis.set_ticks([])
f.tight_layout()
However, all that said - I dare to doubt that this is the best visualization for your data. Of course, I don't know what you want to say, see or find with these plots, but that's the point: if the visualization type would fit to the needs, I guess I'd know (or at least could imagine).
Just for example:
A simple df.plot() results in
and I feel that this tells more about different characteristics of your columns within some tenths of a second than the heatmap.
Or are you explicitely after the differences to each columns' means?
(df - df.mean()).plot()
... or the distribution of each column around them?
(df - df.mean()).boxplot()
What I want to say: data visualization becomes powerful when a plot begins to tell sth about the underlying data before you begin/have to explain anything...
I suppose the problem can be divided into several parts.
Getting several heatmaps with different colormaps into the same picture. This can be done masking the complete array column-wise, plot each masked array seperately via imshow and apply a different colormap. To visualize the concept:
Obtaining variable number of distinct colormaps. Matplotlib provides a large number of colormaps, however, they are in general very different concerning luminosity and saturation. Here it seems desireable to have colormaps of differing hue, but otherwise same saturation and luminosity.
An option is to create the colormaps on the fly, choosing n different (and equally spaced) hues, and create a colormap using the same saturation and luminosity.
Obtaining a distinct colorbar for each column. Since the values within columns might be on totally different scales, a colorbar for each column would be needed to know the values shown, e.g. in the first column the brightest color may correspond to a value of 1, while in the second column it may correspond to a value of 100. Several colorbars can be created inside of the axes of a GridSpec which is placed next to the actual heatmap axes. The number of columns and rows of that gridspec would be dependent of the number of columns in the dataframe.
In total this may then look as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
def get_hsvcmap(i, N, rot=0.):
nsc = 24
chsv = mcolors.rgb_to_hsv(plt.cm.hsv(((np.arange(N)/N)+rot) % 1.)[i,:3])
rhsv = mcolors.rgb_to_hsv(plt.cm.Reds(np.linspace(.2,1,nsc))[:,:3])
arhsv = np.tile(chsv,nsc).reshape(nsc,3)
arhsv[:,1:] = rhsv[:,1:]
rgb = mcolors.hsv_to_rgb(arhsv)
return mcolors.LinearSegmentedColormap.from_list("",rgb)
def columnwise_heatmap(array, ax=None, **kw):
ax = ax or plt.gca()
premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)
images = []
for i in range(array.shape[1]):
col = np.ma.array(array, mask = premask != i)
im = ax.imshow(col, cmap=get_hsvcmap(i, array.shape[1], rot=0.5), **kw)
images.append(im)
return images
### Create some dataset
ind = list("ABCDEFGHIJKLM")
m = len(ind)
n = 8
df = pd.DataFrame(np.random.randn(m,n) + np.random.randint(20,70,n),
index=ind, columns=[f"Step {i}" for i in range(2,2+n)])
### Plot data
fig, ax = plt.subplots(figsize=(8,4.5))
ims = columnwise_heatmap(df.values, ax=ax, aspect="auto")
ax.set(xticks=np.arange(len(df.columns)), yticks=np.arange(len(df)),
xticklabels=df.columns, yticklabels=df.index)
ax.tick_params(bottom=False, top=False,
labelbottom=False, labeltop=True, left=False)
### Optionally add colorbars.
fig.subplots_adjust(left=0.06, right=0.65)
rows = 3
cols = len(df.columns) // rows + int(len(df.columns)%rows > 0)
gs = GridSpec(rows, cols)
gs.update(left=0.7, right=0.95, wspace=1, hspace=0.3)
for i, im in enumerate(ims):
cax = fig.add_subplot(gs[i//cols, i % cols])
fig.colorbar(im, cax = cax)
cax.set_title(df.columns[i], fontsize=10)
plt.show()

Python: How to create a legend using an example

This is from Chapter 2 in the book Machine Learning In Action and I am trying to make the plot pictured here:
The author has posted the plot's code here, which I believe may be a bit hacky (he also mentions this code is sloppy since it is out of the book's scope).
Here is my attempt to re-create the plot:
First, the .txt file holding the data is as follows (source: "datingTestSet2.txt" in Ch.2 here):
40920 8.326976 0.953952 largeDoses
14488 7.153469 1.673904 smallDoses
26052 1.441871 0.805124 didntLike
75136 13.147394 0.428964 didntLike
38344 1.669788 0.134296 didntLike
...
Assume datingDataMat is a numpy.ndarray of shape `(1000L, 2L) where column 0 is "Frequent Flier Miles Per Year", column 1 is "% Time Playing Video Games", and column 2 is "liter of ice cream consumed per week", as shown in the sample above.
Assume datingLabels is a list of ints 1, 2, or 3 meaning "Did Not Like", "Liked in Small Doses", and "Liked in Large Doses" respectively - associated with column 3 above.
Here is the code I have to create the plot (full details for file2matrix are at the end):
datingDataMat,datingLabels = file2matrix("datingTestSet2.txt")
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot (111)
plt.xlabel("Freq flier miles")
plt.ylabel("% time video games")
# Not sure how to finish this: plt.legend([1, 2, 3], ["did not like", "small doses", "large doses"])
plt.scatter(datingDataMat[:,0], datingDataMat[:,1], 15.0*np.array(datingLabels), 15.0*np.array(datingLabels)) # Change marker color and size
plt.show()
The output is here:
My main concern is how to create this legend. Is there a way to do this without needing a direct handle to the points?
Next, I am curious whether I can find a way to switch the colors to match those of the plot. Is there a way to do this without having some kind of "handle" on the individual points?
Also, if interested, here is the file2matrix implementation:
def file2matrix(filename):
fr = open(filename)
numberOfLines = len(fr.readlines())
returnMat = np.zeros((numberOfLines,3)) #numpy.zeros(shape, dtype=float, order='C')
classLabelVector = []
fr = open(filename)
index = 0
for line in fr.readlines():
line = line.strip()
listFromLine = line.split('\t')
returnMat[index,:] = listFromLine[0:3] # FFmiles/yr, % time gaming, L ice cream/wk
classLabelVector.append(int(listFromLine[-1]))
index += 1
return returnMat,classLabelVector
Here's an example that mimics the code you already have that shows the approach described in Saullo Castro's example.
It also shows how to set the colors in the example.
If you want more information on the colors available, see the documentation at http://matplotlib.org/api/colors_api.html
It would also be worth looking at the scatter plot documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.scatter
from numpy.random import rand, randint
from matplotlib import pyplot as plt
n = 1000
# Generate random data
data = rand(n, 2)
# Make a random array to mimic datingLabels
labels = randint(1, 4, n)
# Separate the data according to the labels
data_1 = data[labels==1]
data_2 = data[labels==2]
data_3 = data[labels==3]
# Plot each set of points separately
# 's' is the size parameter.
# 'c' is the color parameter.
# I have chosen the colors so that they match the plot shown.
# With each set of points, input the desired label for the legend.
plt.scatter(data_1[:,0], data_1[:,1], s=15, c='r', label="label 1")
plt.scatter(data_2[:,0], data_2[:,1], s=30, c='g', label="label 2")
plt.scatter(data_3[:,0], data_3[:,1], s=45, c='b', label="label 3")
# Put labels on the axes
plt.ylabel("ylabel")
plt.xlabel("xlabel")
# Place the Legend in the plot.
plt.gca().legend(loc="upper left")
# Display it.
plt.show()
The gray borders should become white if you use plt.savefig to save the figure to file instead of displaying it.
Remember to run plt.clf() or plt.cla() after saving to file to clear the axes so you don't end up replotting the same data on top of itself over and over again.
To create the legend you have to:
give labels to each curve
call the legend() method from the current AxesSubplot object, which can be obtained using plt.gca(), for example.
See the example below:
plt.scatter(datingDataMat[:,0], datingDataMat[:,1],
15.0*np.array(datingLabels), 15.0*np.array(datingLabels),
label='Label for this data')
plt.gca().legend(loc='upper left')

Categories