I'm plotting a matrix, as shown below, and the legend repeats over and over again. I've tried using numpoints = 1 and this didn't seem to have any effect. Any hints?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 8) # set default figure size, 10in by 8in
data = pd.read_csv('data/assg-03-data.csv', names=['exam1', 'exam2', 'admitted'])
x = data[['exam1', 'exam2']].values # as_matrix() was removed in pandas 1.0
y = data.admitted.values
# plot the visualization of the exam scores here
no_admit = np.where(y == 0)
admit = np.where(y == 1)
from pylab import *
# plot the example figure
plt.figure()
# plot the points in our two categories, y=0 and y=1, using markers to indicate
# the category or output
plt.plot(x[no_admit,0], x[no_admit,1],'yo', label = 'Not admitted', markersize=8, markeredgewidth=1)
plt.plot(x[admit,0], x[admit,1], 'r^', label = 'Admitted', markersize=8, markeredgewidth=1)
# add some labels and titles
plt.xlabel('$Exam 1 score$')
plt.ylabel('$Exam 2 score$')
plt.title('Admit/No Admit as a function of Exam Scores')
plt.legend()
It's nearly impossible to understand the problem without an example of the data format, especially for anyone not familiar with pandas.
However, assuming your input has this format:
x=pd.DataFrame(np.array([np.arange(10),np.arange(10)**2]).T,columns=['exam1','exam2']).values
y=pd.DataFrame(np.arange(10)%2).values
>>x
array([[ 0,  0],
       [ 1,  1],
       [ 2,  4],
       [ 3,  9],
       [ 4, 16],
       [ 5, 25],
       [ 6, 36],
       [ 7, 49],
       [ 8, 64],
       [ 9, 81]])
>> y
array([[0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1]])
The reason is the shape of the arrays passed to plot: indexing with the output of np.where returns 2D arrays, and plot then draws every column as a separate line with its own legend entry. I guess it wouldn't happen if you had vectors (1D arrays).
For my example this works (not sure if it is the cleanest form; I don't know where the 2D matrix for x and y comes from):
plt.plot(x[no_admit,0][0], x[no_admit,1][0],'yo', label = 'Not admitted', markersize=8, markeredgewidth=1)
plt.plot(x[admit,0][0], x[admit,1][0], 'r^', label = 'Admitted', markersize=8, markeredgewidth=1)
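As an alternative to the trailing [0], a boolean mask keeps the selection 1D from the start. The following is only a minimal sketch: the data is a made-up stand-in for the exam scores, and the Agg backend is chosen just so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in data: two scores per row and a 0/1 label per row
x = np.column_stack([np.arange(10), np.arange(10) ** 2])
y = np.arange(10) % 2

# np.where returns a tuple, so indexing with it yields a 2D array of shape (1, n)
idx = np.where(y == 0)
shape_with_where = x[idx, 0].shape

# plot() treats each column of a 2D input as its own line, all sharing one label
fig_bad, ax_bad = plt.subplots()
bad_lines = ax_bad.plot(x[idx, 0], x[idx, 1], 'yo', label='Not admitted')

# A boolean mask selects rows directly and keeps the result 1D
no_admit = (y == 0)
admit = (y == 1)

fig, ax = plt.subplots()
ax.plot(x[no_admit, 0], x[no_admit, 1], 'yo', label='Not admitted',
        markersize=8, markeredgewidth=1)
ax.plot(x[admit, 0], x[admit, 1], 'r^', label='Admitted',
        markersize=8, markeredgewidth=1)
legend = ax.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
```

With the mask, each plot call creates a single line, so the legend should show each label once.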
Using the physt library you can create a polar histogram of data that automatically returns a colorbar, such as:
from physt.histogram_nd import Histogram2D
# Histogram2D([radian_bins, angular_bins], [histogram values for given bins])
hist = Histogram2D([[0, 0.5, 1], [0, 1, 2, 3]], [[0.2, 2.2, 7.3], [6, 5, 3]])
ax = hist.plot.polar_map(cmap = 'viridis', show_zero=False)
I can't link an image of this output as I don't yet have enough reputation it seems.
The colorbar is created and looks great but has no label whatsoever. Is there some keyword or argument I can use in the polar_map function to:
Label my colorbar or
extract the colorbar object so I can use established functions such as:
cbar.ax.set_ylabel("colorbar name")
A tutorial exists (https://physt.readthedocs.io/en/latest/tutorial.html) for this library, but it doesn't really interact with the colorbar anywhere.
You can indeed reach the colorbar through the figure returned by ax.get_figure(); the colorbar lives in its own axes, here fig.axes[1]:
from physt.histogram_nd import Histogram2D
#
# Histogram2D([radian_bins, angular_bins], [histogram values for given bins])
hist = Histogram2D([[0, 0.5, 1], [0, 1, 2, 3]], [[0.2, 2.2, 7.3], [6, 5, 3]])
ax = hist.plot.polar_map(cmap = 'viridis', show_zero=False)
fig = ax.get_figure()
fig.axes[1].set_ylabel("colorbar name")
Resulting figure:
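The same idea can be tried without physt. The sketch below uses plain matplotlib with made-up values on a polar axes, so it only illustrates how the colorbar ends up as a second axes of the figure; it says nothing about how polar_map builds its colorbar internally.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up polar grid: 6 angular bins and 2 radial bins
theta_edges = np.linspace(0, 2 * np.pi, 7)
r_edges = np.linspace(0, 1, 3)
values = np.arange(12).reshape(2, 6)

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
mesh = ax.pcolormesh(theta_edges, r_edges, values, cmap="viridis")
cbar = fig.colorbar(mesh, ax=ax)

# The colorbar axes is appended after the main axes in fig.axes
colorbar_axes = fig.axes[1]
colorbar_axes.set_ylabel("colorbar name")
```

When you create the colorbar yourself, cbar.ax is the same axes as fig.axes[1]; going through fig.axes is only needed when a library creates the colorbar for you and does not return it.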
I am stuck on what seems like an easy problem trying to color the different groups on a scatterplot I am creating. I have the following example dataframe and graph:
test_df = pd.DataFrame({ 'A' : 1.,
'B' : np.array([1, 5, 9, 7, 3], dtype='int32'),
'C' : np.array([6, 7, 8, 9, 3], dtype='int32'),
'D' : np.array([2, 2, 3, 4, 4], dtype='int32'),
'E' : pd.Categorical(["test","train","test","train","train"]),
'F' : 'foo' })
# fix to category
# test_df['D'] = test_df["D"].astype('category')
# and test plot
f, ax = plt.subplots(figsize=(6,6))
ax = sns.scatterplot(x="B", y="C", hue="D", s=100,
data=test_df)
which creates this graph:
However, instead of a continuous scale, I'd like a categorical scale for each of the 3 categories [2, 3, 4]. After I uncomment the line of code test_df['D'] = ..., to change this column to a category column-type for category-coloring in the seaborn plot, I receive the following error from the seaborn plot: TypeError: data type not understood
Does anybody know the correct way to convert this numeric column to a factor / categorical column to use for coloring?
Thanks!
I copy/pasted your code, added the library imports and uncommented the conversion line, as it looked fine to me. I get a plot with 'categorical' colouring for the values [2,3,4] without changing anything else in your code.
Try updating your seaborn module using: pip install --upgrade seaborn
Here is a list of working libraries used with your code.
matplotlib==3.1.2
numpy==1.18.1
seaborn==0.10.0
pandas==0.25.3
... with which the code below ran without errors.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
test_df = pd.DataFrame({ 'A' : 1.,
'B' : np.array([1, 5, 9, 7, 3], dtype='int32'),
'C' : np.array([6, 7, 8, 9, 3], dtype='int32'),
'D' : np.array([2, 2, 3, 4, 4], dtype='int32'),
'E' : pd.Categorical(["test","train","test","train","train"]),
'F' : 'foo' })
# fix to category
test_df['D'] = test_df["D"].astype('category')
# and test plot
f, ax = plt.subplots(figsize=(6,6))
ax = sns.scatterplot(x="B", y="C", hue="D", s=100,
data=test_df)
plt.show()
I encountered the same error, TypeError: data type not understood.
A workaround that works is to use the option legend="full". Conversion to a categorical type is not necessary with this approach:
ax = sns.scatterplot(x="B", y="C", hue="D", s=100, legend="full", data=test_df)
Another solution is to use a custom palette:
ax = sns.scatterplot(x="B", y="C", hue="D", s=100, palette=["b", "g", "r"], data=test_df)
In this case the number of colours must equal the number of unique values in column "D".
I generated a dendrogram plot for my dataset and I am not happy with how the splits at some levels have been ordered. I am thus looking for a way to swap the two branches (or leaves) of a single split.
If we look at the code and dendrogram plot at the bottom, there are two labels, 11 and 25, split away from the rest of the big cluster. I am really unhappy about this, and would like the branch with 11 and 25 to be the right branch of the split and the rest of the cluster to be the left branch. The shown distances would stay the same, so the data would not be changed, just the aesthetics.
Can this be done? And how? I am specifically looking for a manual intervention because the optimal leaf ordering algorithm supposedly does not work in this case.
import numpy as np
import matplotlib.pyplot as plt
# random data set with two clusters
np.random.seed(65) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
# create linkage and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward')
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=12., # font size for the x axis labels
)
plt.show()
I had a similar problem and solved it by using the optimal_ordering option in linkage. I attach the code and result for your case, which might not be exactly what you want but seems much improved to me.
import numpy as np
import matplotlib.pyplot as plt
# random data set with two clusters
np.random.seed(65) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
# create linkage and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward', optimal_ordering = True)
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=12., # font size for the x axis labels
distance_sort=False,
show_leaf_counts=True,
count_sort=False
)
plt.show()
result of using optimal_ordering in linkage
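For a purely manual swap, one option is to exchange the two child indices in a single row of the linkage matrix. With the default count_sort=False and distance_sort=False, dendrogram draws the first child of each merge on the left and the second on the right, so swapping columns 0 and 1 of a row flips that split while the distances stay untouched. This is a sketch under those assumptions: swap_branches is a hypothetical helper, the data is random, and which row to swap has to be found by inspecting Z (the last row is the root merge).

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def swap_branches(Z, row):
    """Return a copy of linkage matrix Z with the two children of merge `row` swapped."""
    Z_swapped = Z.copy()
    Z_swapped[row, [0, 1]] = Z_swapped[row, [1, 0]]
    return Z_swapped


# Small random data set, only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
Z = linkage(X, 'ward')

# Leaf order before and after flipping the root split (row -1);
# no_plot=True computes the layout without drawing anything
leaves_before = dendrogram(Z, no_plot=True)['leaves']
Z_swapped = swap_branches(Z, -1)
leaves_after = dendrogram(Z_swapped, no_plot=True)['leaves']
```

Passing Z_swapped to dendrogram in place of Z should then draw the same tree with the chosen split mirrored. Sorting options such as count_sort or distance_sort would override the manual order, so they need to stay off.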
I would like to be able to plot multiple overlaid kde plots on the y axis margin (don't need the x axis margin plot). Each kde plot would correspond to the color category (there are 4) so that I would have 4 kde's each depicting the distribution of one of the categories. This is as far as I got:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
x = [106405611, 107148674, 107151119, 107159869, 107183396, 107229405, 107231917, 107236097,
107239994, 107259338, 107273842, 107275873, 107281000, 107287770, 106452671, 106471246,
106478110, 106494135, 106518400, 106539079]
y = np.array([ 9.09803208, 5.357552 , 8.98868469, 6.84549005,
8.17990909, 10.60640521, 9.89935692, 9.24079133,
8.97441459, 9.09803208, 10.63753055, 11.82336724,
7.93663794, 8.74819285, 8.07146236, 9.82336724,
8.4429435 , 10.53332973, 8.23361968, 10.30035256])
x1 = pd.Series(x, name="$V$")
x2 = pd.Series(y, name="$Distance$")
col = np.array([2, 4, 4, 1, 3, 4, 3, 3, 4, 1, 4, 3, 2, 4, 1, 1, 2, 2, 3, 1])
g = sns.JointGrid(x1, x2)
g = g.plot_joint(plt.scatter, color=col, edgecolor="black", cmap=plt.cm.get_cmap('RdBu', 11))
cax = g.fig.add_axes([1, .25, .02, .4])
plt.colorbar(cax=cax, ticks=np.linspace(1,11,11))
g.plot_marginals(sns.kdeplot, color="black", shade=True)
To plot a distribution of each category, I think the best way is to first combine the data into a pandas dataframe. Then you can loop through each unique category by filtering the dataframe and plot the distribution using calls to sns.kdeplot.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
x = np.array([106405611, 107148674, 107151119, 107159869, 107183396, 107229405,
107231917, 107236097, 107239994, 107259338, 107273842, 107275873,
107281000, 107287770, 106452671, 106471246, 106478110, 106494135,
106518400, 106539079])
y = np.array([9.09803208, 5.357552 , 8.98868469, 6.84549005,
8.17990909, 10.60640521, 9.89935692, 9.24079133,
8.97441459, 9.09803208, 10.63753055, 11.82336724,
7.93663794, 8.74819285, 8.07146236, 9.82336724,
8.4429435 , 10.53332973, 8.23361968, 10.30035256])
col = np.array([2, 4, 4, 1, 3, 4, 3, 3, 4, 1, 4, 3, 2, 4, 1, 1, 2, 2, 3, 1])
# Combine data into DataFrame
df = pd.DataFrame({'V': x, 'Distance': y, 'col': col})
# Define colormap and create corresponding color palette
cmap = sns.diverging_palette(20, 220, as_cmap=True)
colors = sns.diverging_palette(20, 220, n=4)
# Plot data onto seaborn JointGrid
g = sns.JointGrid(x='V', y='Distance', data=df, ratio=2)
g = g.plot_joint(plt.scatter, c=df['col'], edgecolor="black", cmap=cmap)
# Loop through unique categories and plot individual kdes
for c in df['col'].unique():
    sns.kdeplot(df['Distance'][df['col']==c], ax=g.ax_marg_y, vertical=True,
                color=colors[c-1], shade=True)
    sns.kdeplot(df['V'][df['col']==c], ax=g.ax_marg_x, vertical=False,
                color=colors[c-1], shade=True)
This is, in my opinion, a much better and cleaner solution than my original answer, in which I needlessly redefined the seaborn kdeplot because I had not thought to do it this way. Thanks to mwaskom for pointing that out. Also note that the legend labels are removed in the posted solution using
g.ax_marg_x.legend_.remove()
g.ax_marg_y.legend_.remove()
I have a matrix that represents temperature distribution in a hollow square plate (hope the attached figure helps). The problem is with the hollow part in the plate which doesn't represent any solid material so I need to exclude this part from the plot.
The simulation returns an np.array() with the temperature results (except, of course, for the hollow part). This is the part where I define the dimensions of the grid:
import numpy as np
plate_height = 0.4 #meters
hollow_square_height = 0.2 #meters
#discretization data
delta_x = delta_y = 0.05 #meters
grid_points_n = int(round(plate_height/delta_x)) + 1 # must be an int to be used as a shape
grid = np.zeros(shape=(grid_points_n, grid_points_n))
# the simulation assures that the hollow part will remain zero valued.
So, how do I approach this?
Instead of changing the original data, you can mask the values that you don't want to be used in calculations, plots, etc.:
import matplotlib.pyplot as plt
import numpy as np
data = [
[11, 11, 12, 13],
[9, 0, 0, 12],
[8, 0, 0, 11],
[8, 9, 10, 11]
]
#Here's what you have:
data_array = np.array(data)
#Mask every position where there is a 0:
masked_data = np.ma.masked_equal(data_array, 0)
#Plot the matrix:
fig = plt.figure()
ax = fig.gca()
ax.matshow(masked_data, cmap=plt.cm.autumn_r) #_r => reverse the standard color map
plt.show()
#plt.savefig('heatmap.png')
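If a real temperature of exactly 0 could occur in the solid part, masking by value would hide it too. A possible variation is to build the mask from the plate geometry instead. The sketch below reuses the dimensions from the question but makes two assumptions that are not stated there: the hole is centered, and only nodes strictly inside it count as hollow. The temperature values are placeholders, not simulation output.

```python
import numpy as np

plate_height = 0.4          # meters
hollow_square_height = 0.2  # meters
delta = 0.05                # meters, grid spacing in x and y

n = int(round(plate_height / delta)) + 1  # grid points per side
coords = np.arange(n) * delta             # node coordinates along one axis

# Assumption: centered hole; nodes on its edge still belong to the solid
hole_start = (plate_height - hollow_square_height) / 2
hole_end = hole_start + hollow_square_height
tol = 1e-9  # guards against floating-point noise in the comparisons
inside = (coords > hole_start + tol) & (coords < hole_end - tol)

# A node is hollow when both its x and its y coordinate lie inside the hole
hollow = np.outer(inside, inside)

temperature = np.full((n, n), 20.0)  # placeholder values
masked_temperature = np.ma.masked_where(hollow, temperature)
```

masked_temperature can be passed to ax.matshow in the same way as masked_data above.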
Replace the zeros with nan; matshow leaves nan cells blank. For example:
import matplotlib.pyplot as plt
from numpy import nan,matrix
M = matrix([
[20,30,25,20,50],
[22,nan,nan,nan,27],
[30,nan,nan,nan,20],
[33,nan,nan,nan,31],
[21,28,29,23,36]])
fig = plt.figure()
ax = fig.add_subplot(111)
ax.matshow(M, cmap=plt.cm.jet) # Show matrix color
plt.show()
You can replace zeros with nan in a matrix as follows (the matrix needs a float dtype, because nan cannot be stored in an integer array):
from numpy import nan
A[A==0.0]=nan # A is your matrix
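If the array holds integers, the in-place assignment fails, so it may be easier to build a float copy. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up integer matrix; the 0 marks the cell to hide
A = np.array([[11, 11, 12],
              [9, 0, 12],
              [8, 9, 10]])

# Build a float copy with nan wherever the original holds 0
A_nan = np.where(A == 0, np.nan, A.astype(float))
```

The original A is left unchanged, which can be handy if the zeros are still needed elsewhere.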