How to draw distribution plot for discrete variables in seaborn - python

When I draw a displot for a discrete variable, the distribution does not look the way I expect.
There are gaps between the bars of the histogram, so the KDE curve sits lower on the y-axis than the bars do.
In my work, it was even worse.
I think this may be because the "width" or "weight" of each bar is not 1, but I didn't find any parameter that adjusts it.
I'd like to draw a curve that follows the bars and is smoother.

One way to deal with this problem might be to adjust the "bandwidth" of the KDE (see the documentation for seaborn.kdeplot()):
import numpy as np; import seaborn as sns
n = np.round(np.random.normal(5,2,size=(10000,)))
sns.distplot(n, kde_kws={'bw':1})
EDIT: Here is an alternative that uses a different scale for the bars and the KDE:
import matplotlib.pyplot as plt
n = np.round(np.random.normal(5,2,size=(10000,)))
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
sns.distplot(n, kde=False, ax=ax1)
sns.distplot(n, hist=False, ax=ax2, kde_kws={'bw':1})

If the problem is that there are some empty bins in the histogram, it probably makes sense to specify the bins to match the data. In this case, use bins=np.arange(0,16) to get one bin for each integer in the data.
import numpy as np; np.random.seed(1)
import matplotlib.pyplot as plt
import seaborn as sns
n = np.random.randint(0,15,10000)
sns.distplot(n, bins=np.arange(0,16), hist_kws=dict(ec="k"))
plt.show()

It seems that sns.distplot (or displot, https://seaborn.pydata.org/generated/seaborn.displot.html) is meant for plotting histograms, not bar plots. Both a histogram and a KDE (which approximates the probability density function) make sense only for continuous random variables.
So in your case, since you want to plot the distribution of a discrete random variable, you should use a bar plot and plot the probability mass function (PMF) instead.
import numpy as np
import matplotlib.pyplot as plt
array = np.random.randint(15, size=10000)
unique, counts = np.unique(array, return_counts=True)
freq = counts / counts.sum()  # convert counts to relative frequencies
# plotting the points
plt.bar(unique, freq)
# naming the x axis
plt.xlabel('Value')
# naming the y axis
plt.ylabel('Frequency')
#Title
plt.title("Discrete uniform distribution")
# function to show the plot
plt.show()
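One caveat with np.unique is that values which never occur are simply missing from the result. For non-negative integers, a possible alternative is np.bincount, sketched below; the upper bound of 15 and the variable names are assumptions carried over from the example above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.integers(0, 15, size=10000)  # integers 0..14

# bincount counts occurrences of each non-negative integer;
# minlength keeps a zero entry for values that never occur
counts = np.bincount(values, minlength=15)
pmf = counts / counts.sum()

fig, ax = plt.subplots()
ax.bar(np.arange(len(pmf)), pmf, width=1.0, edgecolor="k")
ax.set_xlabel("Value")
ax.set_ylabel("Probability")
# call plt.show() to display the figure
```

Setting width=1.0 makes neighbouring bars touch, which addresses the gaps mentioned in the question.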

Related

How to Generate Two Separate Y-Axes For A Histogram on the Same Figure In Seaborn

I'd like to generate a single figure that has two y axes: Count (from the histogram) and Density (from the KDE).
I want to use sns.displot in Seaborn >= v 0.11.
import seaborn as sns
df = sns.load_dataset('tips')
# graph 1: This should be the Y-Axis on the left side of the figure
sns.displot(df['total_bill'], kind='hist', bins=10)
# graph 2: This should be the Y-axis on the right side of the figure
sns.displot(df['total_bill'], kind='kde')
The code I've written generates two separate graphs; I could just use a facet grid for two separate graphs, but I want to be more concise and place the two y-axes on the two separate grids into a single figure sharing the same x-axis.
displot() is a figure-level function, which can create multiple subplots inside a figure. As such, you don't have control over individual axes.
To create combined plots, you can use the underlying axes-level functions: histplot() and kdeplot() for Seaborn v.0.11. These functions accept an ax= parameter. twinx() creates a second y-axis.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('tips')
fig, ax = plt.subplots()
sns.histplot(df['total_bill'], bins=10, ax=ax)
ax2 = ax.twinx()
sns.kdeplot(df['total_bill'], ax=ax2)
plt.tight_layout()
plt.show()
Edit:
As mentioned in the comments, the y-axes aren't aligned. The left axis only tells something about the histogram. E.g. the highest bin having height 68 means that there are exactly 68 total bills between 12.618 and 17.392. The right axis only tells something about the kde. E.g. a y-value of 0.043 for x=20 would mean there is about 4.3 % probability that the total bill would be between 19.5 and 20.5.
To align both similar to sns.histplot(..., kde=True), the area of the histogram can be calculated (bin width times number of data values) and used as a scaling factor. Such scaling would make the area of the histogram and the area below the kde curve equal when measured in pixels:
num_bins = 10
bin_width = (df['total_bill'].max() - df['total_bill'].min()) / num_bins
hist_area = len(df) * bin_width
ax2.set_ylim(ymax=ax.get_ylim()[1] / hist_area)
Note that the right axis would read more like a percentage if the histogram used a bin width that is a power of ten (e.g. sns.histplot(..., bins=np.arange(0, df['total_bill'].max()+10, 10))). Which bins are most suitable depends strongly on how you want to interpret your data.
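To see why hist_area is the right scaling factor, the relation between counts and density can be checked numerically. This is only a sketch: it uses synthetic gamma-distributed data as a stand-in for total_bill (an assumption, so that no dataset download is needed) and plain NumPy histograms rather than seaborn.

```python
import numpy as np

rng = np.random.default_rng(0)
bills = rng.gamma(shape=4.0, scale=5.0, size=244)  # synthetic stand-in for total_bill

num_bins = 10
counts, edges = np.histogram(bills, bins=num_bins)
density, _ = np.histogram(bills, bins=num_bins, density=True)

bin_width = edges[1] - edges[0]
hist_area = len(bills) * bin_width  # same scaling factor as in the answer

# Multiplying the density by the histogram area recovers the counts
rescaled = density * hist_area
print(np.allclose(rescaled, counts))
```

Dividing the upper limit of the count axis by hist_area therefore puts the density axis on the matching scale.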

Pyplot: How to make a colorbar with a nonlinear scale?

I want to add to my plot a colorbar, which has a nonlinear scale. For example, for such a plot:
I would like to have just 5 different colors on the bar on the right-hand side, instead of the gradient (don't pay attention to the plot itself; it's just an example).
I don't want to use contourf and would like to find some more general solution.
If you want discrete values in your colorbar, a quick way is to pass the name of the colormap you are working with, along with the desired number of bins, to plt.get_cmap() and use the result as the cmap argument. Note that plt.cm.get_cmap() was deprecated in Matplotlib 3.7 and removed in 3.9; plt.get_cmap() takes the same arguments.
import matplotlib.pyplot as plt
plt.style.use('classic')
# %matplotlib inline  (IPython/Jupyter only)
import numpy as np
# Random data for visualization
x = np.linspace(0, 10, 1000)
data = np.sin(x) * np.cos(x[:, np.newaxis])
plt.imshow(data, cmap=plt.get_cmap('viridis', 5))
plt.colorbar()
plt.clim(-1, 1);
See the Matplotlib documentation on colormaps for more.
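A resampled colormap gives five colours but still maps values linearly. If the colour boundaries themselves should be unevenly spaced, one option is matplotlib.colors.BoundaryNorm. The sketch below is a minimal illustration; the boundary values are made up for the example and are not taken from the question.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm

x = np.linspace(0, 10, 200)
data = np.sin(x) * np.cos(x[:, np.newaxis])

# Unevenly spaced boundaries (illustrative values): 6 edges define 5 colour bins
bounds = [-1.0, -0.5, -0.1, 0.1, 0.5, 1.0]
cmap = plt.get_cmap("viridis", 5)
norm = BoundaryNorm(bounds, ncolors=cmap.N)

fig, ax = plt.subplots()
im = ax.imshow(data, cmap=cmap, norm=norm)
fig.colorbar(im, ax=ax, ticks=bounds)
# call plt.show() to display the figure
```

The colorbar then shows five equally sized colour blocks whose tick labels follow the nonlinear boundaries, and no contourf call is involved.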

Density plot from plotting multiple arrays

I have an MxN (say, 1000x50) array. I want to plot each 50-point line onto the same plot, and have a heatmap of their density.
Simply doing a plt.pcolor(data) is not what I want, since I don't want to plot the matrix.
This is what I want to plot, but as I said it doesn't provide me with the heatmap I need.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.rand(1000, 50)
fig, ax = plt.subplots()
for i in range(0,1000):
    ax.plot(data[i], '.')
plt.show()
I would like a way of getting this together (I assume it will have something to do with histograms and binning?).
EDIT: simply adding an alpha value to the plot ( ax.plot(data[i], '.r', alpha=0.01)) achieves something similar to what I want. I would like, however, to have a heatmap with different colours.
As you already pointed out in your question, probably one of the simplest approaches involves histograms. A linear approximation of the histogram is probably enough for this application.
You can use np.histogram to calculate bin heights and edges and use scipy.interpolate.interp1d to obtain a function that provides an interpolation of the histogram. We can define a simple helper function to get the approximate density around each value in one column of the data array:
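import numpy as np
import scipy.interpolate as interp

def get_density(vals, bins=30, kind="linear"):
    # Histogram heights (as a density) and bin edges
    y, bin_edges = np.histogram(vals, bins=bins, density=True)
    # Bin centers
    x = (bin_edges[1:] + bin_edges[:-1])/2.
    # Interpolate between bin centers, extrapolating beyond the outermost ones
    f = interp.interp1d(x, y, kind=kind, fill_value="extrapolate")
    return f(vals)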
Then you can use any colormap you want to map the density to a color value. The easiest way to go from here is to use plt.scatter instead of plot, where you can provide a specific color for every data point.
I would do something like this:
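fig, ax = plt.subplots()
for i in range(data.shape[1]):
    dens = get_density(data[:, i])
    # Colormaps expect values in [0, 1], so rescale the density by its maximum
    colors = plt.cm.viridis(dens / dens.max())
    ax.scatter(i*np.ones(data.shape[0]), data[:, i], c=colors, marker='.')
plt.show()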
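If a true heatmap is preferred over coloured scatter points, the binning can also be done in two dimensions at once. The following is a sketch of that alternative using np.histogram2d, not the method of the answer above; the choice of 30 vertical bins is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((1000, 50))  # same shape as in the question

n_lines, n_points = data.shape
x = np.tile(np.arange(n_points), n_lines)  # column index of every sample
y = data.ravel()                           # matching values, row by row

# One x-bin per column, 30 y-bins (the bin count is an arbitrary choice)
H, xedges, yedges = np.histogram2d(
    x, y, bins=[n_points, 30],
    range=[[-0.5, n_points - 0.5], [data.min(), data.max()]],
)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(xedges, yedges, H.T, cmap="viridis")
fig.colorbar(mesh, ax=ax, label="Points per bin")
# call plt.show() to display the figure
```

Each cell of H counts how many of the 1000 lines pass through that value range at that x position, so the colour scale has a direct interpretation.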

How to plot normalized histogram with pdf properly using matplotlib?

I am trying to plot a normalized histogram using the example from the numpy.random.normal documentation. For this purpose I generate a normally distributed random sample.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
mu_true = 0
sigma_true = 0.1
s = np.random.normal(mu_true, sigma_true, 2000)
Then I fit a normal distribution to the data and calculate the pdf.
mu, sigma = stats.norm.fit(s)
points = np.linspace(stats.norm.ppf(0.01,loc=mu,scale=sigma),
                     stats.norm.ppf(0.9999,loc=mu,scale=sigma),100)
pdf = stats.norm.pdf(points,loc=mu,scale=sigma)
Then I display the fitted pdf and the data histogram.
plt.hist(s, 30, density=True);
plt.plot(points, pdf, color='r')
plt.show()
I use density=True, but it is obvious that the pdf and histogram are not normalized.
How can I plot a truly normalized histogram and pdf?
Seaborn's distplot doesn't solve the problem either.
import seaborn as sns
ax = sns.distplot(s)
What makes you think it is not normalised? At a guess, it's probably because the heights of the columns extend to values greater than 1. However, this thinking is flawed: in a normalised histogram/pdf, it is the total area under it that equals one, not the sum of the heights. When the steps in x are smaller than one (as yours are), it is not surprising that the column heights are greater than one!
You can see this clearly in the scipy example you link: the x-values there are much greater (by an order of magnitude), so the y-values are correspondingly smaller. You will see the same effect if you change your distribution to cover a wider range of values. Try a sigma of 10 instead of 0.1 and see what happens!
import numpy as np
from numpy.random import seed, randn
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# Try this, for μ = 0
seed(0)
points = np.linspace(-5,5,100)
pdf = norm.pdf(points,0,1)
plt.plot(points, pdf, color='r')
plt.hist(randn(50), density=True);
plt.show()
# or this, for μ = 10
seed(0)
points = np.linspace(5,15,100)
pdf = norm.pdf(points,10,1)
plt.plot(points, pdf, color='r')
plt.hist(10+randn(50), density=True);
plt.show()
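The claim that the area, not the height, is normalised can be verified directly. Here is a small numerical check, assuming the same sigma of 0.1 and 30 bins as in the question (the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(0.0, 0.1, size=2000)

heights, edges = np.histogram(s, bins=30, density=True)
# Area of the histogram: sum of bar height times bar width
area = np.sum(heights * np.diff(edges))

print(f"tallest bar: {heights.max():.2f}")
print(f"total area:  {area:.6f}")
```

The tallest bar is well above 1 because each bin is only a few hundredths wide, while the total area still comes out as 1 up to floating-point error.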

Matplotlib axis custom scale adjustment

I am plotting from a pandas dataframe with commands like
fig1 = plt.hist(dataset_1[dataset_1>-1.0],bins=bins,alpha=0.75,label=label1,density=True)
and the plots comprise multiple histograms on one canvas. Since each histogram is normalised to its own integral (hence the histograms have the same area, because the purpose of the histograms is to illustrate the shape of the datasets rather than their relative sizes), the numbers on the y axis are not meaningful. For now, I am suppressing y axis labelling using
axes.set_ylabel("(Normalised to unity)")
axes.get_yaxis().set_ticks([])
Is there a way of adjusting the scaling of the y axis such that "1" corresponds to the highest value on any histogram? This would display a vertical scale to guide the eye and with which to judge the relative values of different bins. In essence, I mean re-normalising the maximum displayed y value without affecting the scaling of the histograms (i.e. decoupling the axis scale from what it represents).
You have two options:
Drawing histogram, adjusting y axis tick.
You may set the y tick to the location of the maximum and label it with 1 afterwards.
import numpy as np; np.random.seed(1)
import matplotlib.pyplot as plt
a = np.random.rayleigh(scale=3, size=2000)
hist, edges,_ = plt.hist(a, ec="k")
plt.yticks([0,hist.max()], [0,1])
plt.show()
Normalizing histogram, drawing to scale.
You may normalize the histogram in the way you desire by first calculating the histogram, dividing it by its maximum and then plot a bar plot of it.
import numpy as np; np.random.seed(1)
import matplotlib.pyplot as plt
a = np.random.rayleigh(scale=3, size=2000)
hist, edges = np.histogram(a)
hist = hist/float(hist.max())
plt.bar(edges[:-1], hist, width=np.diff(edges), align="edge", ec="k")
plt.yticks([0,1])
plt.show()
The output in both cases would look the same.
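The question involves several histograms on one canvas, with "1" marking the highest bin of any of them. A possible extension of the second option is sketched below; the two Rayleigh samples, the 20 shared bins and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
a = rng.rayleigh(scale=3, size=2000)
b = rng.rayleigh(scale=5, size=500)

bins = np.linspace(0, max(a.max(), b.max()), 21)  # shared bin edges
dens_a, _ = np.histogram(a, bins=bins, density=True)
dens_b, _ = np.histogram(b, bins=bins, density=True)

# Divide both by the same factor so their relative heights are preserved
peak = max(dens_a.max(), dens_b.max())
scaled_a, scaled_b = dens_a / peak, dens_b / peak

fig, ax = plt.subplots()
widths = np.diff(bins)
ax.bar(bins[:-1], scaled_a, width=widths, align="edge", alpha=0.75, label="a")
ax.bar(bins[:-1], scaled_b, width=widths, align="edge", alpha=0.75, label="b")
ax.set_yticks([0, 1])
ax.legend()
# call plt.show() to display the figure
```

Because both histograms are first normalised to unit area and then divided by one common peak value, their shapes stay comparable while the y-axis runs from 0 to 1.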
