How do I normalize a histogram using Matplotlib? - python

I am trying to generate a histogram using matplotlib. I am reading data from the following file:
https://github.com/meghnasubramani/Files/blob/master/class_id.txt
My intent is to generate a histogram with the following bins: 1, 2-5, 5-100, 100-200, 200-1000, >1000.
When I generate the graph it doesn't look nice.
I would like to normalize the y axis to (frequency of occurrence in a bin/total items). I tried using the density parameter, but whenever I try that my graph ends up completely blank. How do I go about doing this?
How do I get the widths of the bars to be the same, even though the bin ranges vary?
Is it also possible to specify the ticks on the histogram? I want to have the ticks correspond to the bin ranges.
import matplotlib.pyplot as plt
FILE_NAME = 'class_id.txt'
class_id = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [1, 2, 5, 100, 200, 1000, max(class_id)]
x = plt.hist(class_id, bins=num_bins, histtype='bar', align='mid', rwidth=0.5, color='b')
print (x)
plt.legend()
plt.xlabel('Items')
plt.ylabel('Frequency')

As suggested by importanceofbeingernest, we can use a bar chart to plot categorical data. We first need to sort the values into bins, for example with pandas:
import matplotlib.pyplot as plt
import pandas
FILE_NAME = 'class_id.txt'
class_id_file = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [0, 2, 5, 100, 200, 1000, max(class_id_file)]
categories = pandas.cut(class_id_file, num_bins)
df = pandas.DataFrame(class_id_file)
dfg = df.groupby(categories).count()
bins_labels = ["1-2", "2-5", "5-100", "100-200", "200-1000", ">1000"]
plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=bins_labels)
#plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=categories.categories)
plt.xlabel('Items')
plt.ylabel('Frequency')
Not what you asked for, but you could also stay with histogram and choose logarithm scale to improve readability:
plt.xscale('log')
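A small sketch of the normalization itself, in case it helps. The sample values below are made up (they are not the contents of class_id.txt), and the last bin edge is an assumption chosen so that every value falls inside a bin. It computes count/total per bin and also shows one possible reason a density plot can look empty: density divides by the bin width as well, so very wide bins give very small bars.

```python
import numpy as np

# Hypothetical sample standing in for the contents of class_id.txt
values = np.array([1, 1, 3, 4, 50, 150, 150, 500, 2000, 3000])
edges = [1, 2, 5, 100, 200, 1000, values.max() + 1]

# Fraction of all items that fall into each bin (count / total)
counts, _ = np.histogram(values, bins=edges)
fractions = counts / counts.sum()

# density=True divides by the bin width as well: counts / (total * width)
density, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)

print(fractions)          # bar heights that sum to 1
print(density * widths)   # same numbers, recovered from the density
```

The fractions can then be passed to plt.bar() with evenly spaced positions, as in the pandas answer above. With plt.hist() itself, passing weights=np.ones(len(values)) / len(values) should give the same heights without the division by bin width.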

Related

Create a Seaborn style histogram / kernel density plot using the actual density function

I really like the look of Seaborn's KDE plot:
I was wondering how I can replicate this for a line plot.
In my case I actually have the function to generate the density instead of samples of the data.
So assuming I have the data in a data frame:
x - The value of x per sample.
y - The value of the density function at x.
μσ - Categorical variable to group data from the same density (In the code, I use the mean and standard deviation of a normal distribution).
I can use Seaborn's lineplot to get what I want without the area below the curve as in the image above.
I'm after achieving the look as above for the data I have.
Is there a way to replicate this theme, area under the curve included, for lineplot?
The code below shows what I got so far:
import numpy as np
import scipy as sp
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
num_grid_pts = 1000
val_μ = [0, -1, 1, 0]
val_σ = [1, 2, 3, 4]
num_var = len(val_μ) # variations
x = np.linspace(-10, 10, num_grid_pts)
P = np.zeros((num_grid_pts, num_var)) # PDF
μσ = [f'μ = {μ}, σ = {σ}' for μ, σ in zip(val_μ, val_σ)]
for ii, (μ, σ) in enumerate(zip(val_μ, val_σ)):
    randVar = norm(μ, σ)
    P[:, ii] = randVar.pdf(x)
df_P = pd.DataFrame(data = {'x': np.tile(x, num_var), 'PDF': P.flatten('F'), 'μσ': np.repeat(μσ, len(x))})
f, ax = plt.subplots(figsize=(15, 10))
sns.lineplot(data=df_P, x='x', y='PDF', hue='μσ', ax=ax)
plot_lines = ax.get_lines()
for ii in range(num_var):
    ax.fill_between(x=plot_lines[ii].get_xdata(), y1=plot_lines[ii].get_ydata(), alpha=0.25, color=plot_lines[ii].get_color())
ax.set_title(f'Normal Distribution')
ax.set_xlabel(f'Value')
ax.set_ylabel(f'Probability')
plt.show()
I used the lineplot to create the lines and then created the fills. But this is a hack, I was wondering if I can do it more naturally within Seaborn.
I found a way to do this manually using the Area object from the seaborn.objects interface:
(
so.Plot(healthexp, "Year", "Spending_USD", color="Country")
.add(so.Area(alpha=.7), so.Stack())
)
The result is:
Yet for some reason the example code doesn't work.
What I did was use Seaborn's lineplot() and then manually add a fill_between() polygon:
ax = sns.lineplot(data=data_frame, x='data_x', y='data_y', hue='data_color')
plot_lines = ax.get_lines()
for i in range(num_unique_colors):
    ax.fill_between(x=plot_lines[i].get_xdata(), y1=plot_lines[i].get_ydata(), alpha=0.25, color=plot_lines[i].get_color())

How to plot histogram, when the number of values in interval is given? (python)

I know that when you usually plot a histogram you have an array of values and intervals.
But if I have intervals and the number of values that are in those intervals, how can I plot the histogram?
I have something that looks like this:
amounts = np.array([23, 7, 18, 5])
and my interval is from 0 to 4 with step 1,
so on interval [0,1] there are 23 values and so on.
You could probably try matplotlib.pyplot.stairs for this.
import matplotlib.pyplot as plt
import numpy as np
amounts = np.array([23, 7, 18, 5])
plt.stairs(amounts, range(5))
plt.show()
Please mark it as solved if this helps.
I find it easier to just simulate some data having the desired distribution, and then use plt.hist to plot the histogram.
Here is an example. Hopefully it will be helpful!
import numpy as np
import matplotlib.pyplot as plt
amounts = np.array([23, 7, 18, 5])
bin_edges = np.arange(5)
bin_centres = (bin_edges[1:] + bin_edges[:-1]) / 2
# fake some data having the desired distribution
data = [[bc] * amount for bc, amount in zip(bin_centres, amounts)]
data = np.concatenate(data)
hist = plt.hist(data, bins=bin_edges, histtype='step')[0]
plt.show()
# the plotted distribution is consistent with amounts
assert np.allclose(hist, amounts)
If you already know the values, then the histogram just becomes a bar plot.
import numpy as np
import matplotlib.pyplot as plt
amounts = np.array([23, 7, 18, 5])
interval = np.arange(5)
# bar centres: shift every edge by half a bin and drop the last one
midvals = (interval + 0.5)[0:len(interval)-1] # 0.5, 1.5, 2.5, 3.5
plt.bar(midvals, amounts)
plt.xticks(interval) # Shows the interval ranges rather than the centers of the bars
plt.show()
If the gap between the bars looks too wide, you can change the width of the bars by passing a width argument to plt.bar() (the default is 0.8; with bins of width 1 this is the fraction of the bin that the bar fills).
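Another option for pre-counted data, sketched below, is to hand plt.hist one dummy sample per bin and pass the known counts as weights. The Agg backend line is only there so the snippet runs without a display; drop it if you want an interactive window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

amounts = np.array([23, 7, 18, 5])
edges = np.arange(5)  # bin edges 0, 1, 2, 3, 4

fig, ax = plt.subplots()
# One dummy sample per bin (its left edge), weighted by that bin's count
counts, _, _ = ax.hist(edges[:-1], bins=edges, weights=amounts)
ax.set_xticks(edges)
fig.savefig("counts_hist.png")
```

This keeps the bars exactly as wide as the bins, so it also works when the bins are not all the same width.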

Change the width of merged bins in Matplotlib and Seaborn

I have a table of grades and I want all of the bins to be of the same width
I want the bins to have the edges [0,56,60,65,70,80,85,90,95,100],
so the first bin is 0-56, then 56-60, and so on, all drawn with the same width.
sns.set_style('darkgrid')
newBins = [0,56,60,65,70,80,85,90,95,100]
sns.displot(data= scores , bins=newBins)
plt.xlabel('grade')
plt.xlim(0,100)
plt.xticks(newBins);
Expected output
How can I make all the bins the same width?
You need to cheat a bit. Define your own bins and label them with a linear range. Here is an example:
s = pd.Series(np.random.randint(100, size=100000))
bins = [-0.1, 50, 75, 95, 101]
s2 = pd.cut(s, bins=bins, labels=range(len(bins)-1))
ax = s2.astype(int).plot.hist(bins=len(bins)-1)
ax.set_xticks(np.linspace(0, len(bins)-2, len(bins)))
ax.set_xticklabels(bins)
Output:
Old answer:
Why don't you let seaborn pick the bins for you:
sns.displot(data=scores, bins='auto')
Or set the number of bins that you want:
sns.displot(data=scores, bins=10)
They will be evenly distributed
You are assigning a list to the bins argument of sns.displot(). This specifies the edges of the bins. Since these edges are not spaced evenly, the widths of the bins vary.
I think that you may want to use a bar plot (sns.barplot()) and not a histogram. You would need to compute how many data points are in each bin, and then plot bars that carry no information about the range of values each bar represents. Something like this:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
import numpy as np
# sample data
data = np.random.randint(0, 100, 200)
newBins = [0,56,60,65,70,80,85,90,95,100]
# compute bar heights
hist, _ = np.histogram(data, bins=newBins)
# plot a bar diagram
sns.barplot(x = list(range(len(hist))), y = hist)
plt.show()
It gives:
Just change the list of values that you are using as bins:
newBins = numpy.arange(0, 100, 1)
You can use the bins parameter of histplot, but to get exactly this result you have to use pd.cut() to create your own bins.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(101)
# create the scores first so that pd.cut() can refer to them
scores = pd.Series(np.random.randint(100, size=175))
df = pd.DataFrame({'scores': scores,
                   'bins_created': pd.cut(scores, bins=[0,55,60,65,70,75,80,85,90,95,100])})
new_data = df['bins_created'].value_counts()
plt.figure(figsize=(10,5),dpi=100)
plots = sns.barplot(x=new_data.index,y=new_data.values)
plt.xlabel('grades')
plt.ylabel('counts')
# write the count above each bar
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=10, xytext=(0,5),
                   textcoords='offset points')
plt.show()
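The bar-plot idea from the answers above can also be sketched with plain NumPy and Matplotlib, building the tick labels from the bin edges so each bar still says which range it covers. The scores here are random placeholder data, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=200)  # placeholder grades between 0 and 100

edges = [0, 56, 60, 65, 70, 80, 85, 90, 95, 100]
counts, _ = np.histogram(scores, bins=edges)

# One label per bin, e.g. "0-56", "56-60", ...
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

fig, ax = plt.subplots()
# Evenly spaced positions give every bin the same drawn width
ax.bar(range(len(counts)), counts, width=1.0, edgecolor="white", tick_label=labels)
ax.set_xlabel("grade")
fig.savefig("equal_width_bins.png")
```

Keep in mind that the x axis is no longer a numeric scale after this, so the bar widths say nothing about how wide each grade range really is.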

python violin plot regular axis

I want to do a violin plot of binned data but at the same time be able to plot a model prediction and visualize how well the model describes the main part of the individual data distributions. My problem here is, I guess, that the x-axis after the violin plot does not behave like a regular axis with numbers, but more like string values that just happen to be numbers. Maybe not a good description, but in the example I would like to have a "normal" plot of a function, e.g. f(x) = 2*x**2, and at x=1, x=5.2, x=18.3 and x=27 I would like to have the violins in the background.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
np.random.seed(10)
collectn_1 = np.random.normal(1, 2, 200)
collectn_2 = np.random.normal(802, 30, 200)
collectn_3 = np.random.normal(90, 20, 200)
collectn_4 = np.random.normal(70, 25, 200)
ys = [collectn_1, collectn_2, collectn_3, collectn_4]
xs = [1, 5.2, 18.3, 27]
sns.violinplot(x=xs, y=ys)
xx = np.arange(0, 30, 10)
plt.plot(xx, 2*xx**2)
plt.show()
Somehow this code actually does not plot violins but only bars, this is only a problem in this example and not in the original code though. In my real code I want to have different "half-violins" on both sides, therefore I use sns.violinplot(x="..", y="..", hue="..", data=.., split=True).
I think that would be hard to do with seaborn because it does not provide an easy way to manipulate the artists that it creates, particularly if there are other things plotted on the same Axes. Matplotlib's violinplot allows setting the position of the violins, but does not provide an option for plotting only half violins. Therefore, I would suggest using statsmodels.graphics.boxplots.violinplot, which does both.
import matplotlib.pyplot as plt
import matplotlib.ticker
import numpy as np
import seaborn as sns
from statsmodels.graphics.boxplots import violinplot
df = sns.load_dataset('tips')
x_col = 'day'
y_col = 'total_bill'
hue_col = 'smoker'
xs = [1, 5.2, 18.3, 27]
xx = np.arange(0, 30, 1)
yy = 0.1*xx**2
cs = ['C0','C1']
fig, ax = plt.subplots()
ax.plot(xx,yy)
# one pass per hue level: the first is drawn on the left half, the second on the right
for (_,gr0),side,c in zip(df.groupby(hue_col),['left','right'],cs):
    print(side)
    data = [gr1 for (_,gr1) in gr0.groupby(x_col)[y_col]]
    violinplot(ax=ax, data=data, positions=xs, side=side, show_boxplot=False, plot_opts=dict(violin_fc=c))
# violinplot above messes up which ticks are shown, the line below restores a sensible tick locator
ax.xaxis.set_major_locator(matplotlib.ticker.MaxNLocator())
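If full violins are enough, the positions option of Matplotlib's own violinplot mentioned above can be sketched as follows. The data are random placeholders centred on the model curve, the widths value is an arbitrary choice in x-axis units, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)
xs = [1, 5.2, 18.3, 27]
# Placeholder samples scattered around the model value at each x
ys = [rng.normal(2 * x**2, 20, 200) for x in xs]

fig, ax = plt.subplots()
# positions puts each violin at its numeric x value on a regular axis
parts = ax.violinplot(ys, positions=xs, widths=2.0, showextrema=False)

# The model prediction is drawn on the same numeric axis
xx = np.linspace(0, 30, 100)
ax.plot(xx, 2 * xx**2)
fig.savefig("violins_with_model.png")
```

The returned dictionary holds the violin bodies under the "bodies" key, so their colors or transparency can be adjusted afterwards if needed.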

Line-based heatmap or 2D line histogram

I have a synthetic dataset with 1000 noisy polynomials of various orders and sin/cos curves that I can plot as lines using python seaborn.
Since I have quite a few lines that are overlapping, I'd like to plot some sort of heatmap or histogram of my line graphs.
I've tried iterating over the columns and aggregating the counts to use seaborn's heatmap graph, but with many lines this takes quite a while.
The next best thing that gets close to what I want was a hexbin graph (with seaborn's jointplot).
But it's a compromise between runtime and granularity (the shown graph has gridsize 750). I couldn't find any other graph-type for my problem. But I also don't know exactly what it might be called.
I've also tried with line alpha set to 0.2. This results in a similar graph to what I want. But it's less precise (if more than 5 lines overlap at the same point I already have zero transparency left). Also, it misses the typical coloration of heatmaps.
(Moot search terms were: heatmap, 2D line histogram, line histogram, density plots...)
Does anybody know a package that plots this more efficiently and in high(er) quality, or know how to do it with the popular python plotters (i.e. the matplotlib family: matplotlib, seaborn, bokeh)? I'm really fine with any package though.
It took me awhile, but I finally solved this using Datashader. If using a notebook, the plots can be embedded into interactive Bokeh plots, which looks really nice.
Anyhow, here is the code for static images, in case someone else is in need of something similar:
# coding: utf-8
import time
import numpy as np
from numpy.polynomial import polynomial
import pandas as pd
import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf
plt.style.use("seaborn-whitegrid")
def create_data():
    # ...
# Each column is one data sample
df = create_data()
# Following will append a nan-row and reshape the dataframe into two columns, with each sample stacked on top of each other
# THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286
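# Append row with nan-values
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df = pd.concat([df, pd.DataFrame([np.array([np.nan] * len(df.columns))], columns=df.columns, index=[np.nan])])
# Reshape
x, y = df.shape
# (DataFrame.as_matrix was removed in pandas 1.0; to_numpy is the replacement)
arr = df.to_numpy().reshape((x * y, 1), order='F')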
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns.values[0] = 'x'
# Plotting parameters
x_range = (min(df.index.values), max(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))
# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))
# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)
# Heat map - This uses a equalized histogram (built-in default), there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))
# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)
With following run times (in seconds):
Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242
The resulting plots:
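The NaN-separator reshaping used in the Datashader code above can be tried in isolation on a tiny frame. This is only a sketch with made-up data (two samples "a" and "b" with three points each); the idea is that each column is stacked under the previous one, with a NaN row in between so that the line of one sample is not connected to the next.

```python
import numpy as np
import pandas as pd

# Made-up wide frame: the index holds x, each column is one sample
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
                  index=[0.0, 0.5, 1.0])

# Append one all-NaN row (with a NaN index) as a separator
sep = pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns, index=[np.nan])
padded = pd.concat([df, sep])

# Stack the columns on top of each other (column-major order)
n_rows, n_cols = padded.shape
long = pd.DataFrame({
    "x": np.tile(padded.index.to_numpy(), n_cols),
    "y": padded.to_numpy().reshape(-1, order="F"),
})
print(long)
```

The resulting two-column frame has the same layout as df_reshaped in the answer, so it could be passed to cvs.line(long, 'x', 'y', ds.count()) in the same way.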
Although it seems you have tried this, plotting the counts seems to give a good representation of the data. However, it really depends on what you're trying to find in your data: what is it supposed to tell you?
The long run time comes from plotting so many lines; a heatmap based on the counts, however, plots fairly quickly.
I created some dummy sine-wave data, parameterised by noise, number of lines, amplitude and shift, and added both a boxplot and a heatmap.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd
np.random.seed(0)
#create dummy data
N = 200
sinuses = []
no_lines = 200
for i in range(no_lines):
    a = np.random.randint(5, 40)/5 #amplitude
    x = random.choice([int(N/5), int(N/(2/5))]) #random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N)) + np.random.randn(N), x))
fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )
ax1 = plt.subplot2grid((3,10), (0,0), colspan=10)
ax2 = plt.subplot2grid((3,10), (1,0), colspan=10)
ax3 = plt.subplot2grid((3,10), (2,0), colspan=9)
ax4 = plt.subplot2grid((3,10), (2,9))
# plot line data
sins.T.plot(ax=ax1, color='lightblue',linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)
# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False) # hide ticks where labels are hidden
#make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min())+sins.max().max())/no_bins))
bins.append(sins.max().max())
# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:,col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())
cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
                                  '#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])
#heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)
yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i,1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()
for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0: #make some labels invisible
        yticks[index].set_visible(False) # hide ticks where labels are hidden
plt.show()
Although the boxplot is easy to interpret, it doesn't show the actual distribution of the data very well, but knowing where the median and quantiles lie may be helpful.
Increasing the number of lines and the number of values per line increases plotting time considerably for the line plots, while the heatmap is still fairly quick to generate. The boxplot, however, becomes indiscernible.
I couldn't exactly replicate your data (or know the actual size of it), but perhaps the heatmap may be helpful.
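A more compact variant of the count-based heatmap, sketched here with np.histogram2d, bins all (x, y) sample points of all lines in one call. The sine data are made up, the bin counts are arbitrary, and the Agg backend line is only there so the snippet runs without a display. Note that this counts the sample points only and does not rasterize the segments between them, so it is an approximation of what Datashader does.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_pts = 200, 100
x = np.linspace(0, 2 * np.pi, n_pts)
# Made-up noisy sine curves, one row per line
amplitudes = rng.uniform(1, 8, size=(n_lines, 1))
lines = amplitudes * np.sin(x) + rng.normal(size=(n_lines, n_pts))

# Pair every y value with its x value and count the points per 2D bin
xs = np.tile(x, n_lines)
counts, xedges, yedges = np.histogram2d(xs, lines.ravel(), bins=(n_pts, 40))

fig, ax = plt.subplots()
# counts has shape (x bins, y bins); pcolormesh expects (rows = y, cols = x)
mesh = ax.pcolormesh(xedges, yedges, counts.T, cmap="Spectral_r")
fig.colorbar(mesh, ax=ax, label="Count")
fig.savefig("line_density.png")
```

Since the binning is a single vectorized call, the run time depends mostly on the total number of points and much less on the number of lines.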
