How to plot a histogram using numpy.histogram output? [duplicate] - python

I'd like to use Matplotlib to plot a histogram over data that's been pre-counted. For example, say I have the raw data
data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]
Given this data, I can use
pylab.hist(data, bins=[...])
to plot a histogram.
In my case, the data has been pre-counted and is represented as a dictionary:
counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
Ideally, I'd like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc, as if I had passed it the raw data. As a workaround, I'm expanding my counts into the raw data:
from itertools import chain, repeat
data = list(chain.from_iterable(repeat(value, count)
                                for (value, count) in counted_data.items()))
This is inefficient when counted_data contains counts for millions of data points.
Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?
Alternatively, if it's easiest to just bar-plot data that's been pre-binned, is there a convenience method to "roll-up" my per-item counts into binned counts?

You can use the weights keyword argument to np.histogram (which plt.hist calls underneath):
val, weight = zip(*[(k, v) for k,v in counted_data.items()])
plt.hist(val, weights=weight)
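As a quick check that weighting really is equivalent to expanding the counts, here is a minimal sketch using np.histogram directly. The bin edges are illustrative and not taken from the question.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
raw_data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]

values = list(counted_data.keys())
counts = list(counted_data.values())
bin_edges = [0, 3, 6, 9, 12]  # illustrative edges (assumption, not from the question)

# Each distinct value contributes its count as a weight to the bin it falls in.
weighted, _ = np.histogram(values, bins=bin_edges, weights=counts)

# Reference result computed from the expanded raw data.
expanded, _ = np.histogram(raw_data, bins=bin_edges)

print(weighted)  # binned counts rolled up from the per-item counts
```

The same values, bin_edges and counts can be passed to plt.hist(values, bins=bin_edges, weights=counts) to draw the result.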
Assuming you only have integers as the keys, you can also use bar directly:
min_bin = min(counted_data)
max_bin = max(counted_data)
bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)
for k, v in counted_data.items():
    vals[k - min_bin] = v  # place each count at its offset from the smallest key
plt.bar(bins, vals, ...)
where ... is whatever arguments you want to pass to bar (doc).
If you want to re-bin your data see Histogram with separate list denoting frequency
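For integer keys, the loop that fills the bar heights can probably be replaced by np.bincount. This is a sketch under that assumption; the names positions and heights are introduced here for illustration only.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

keys = np.fromiter(counted_data.keys(), dtype=int)
counts = np.fromiter(counted_data.values(), dtype=int)

min_bin = keys.min()
positions = np.arange(min_bin, keys.max() + 1)  # one bar per integer in the key range

# bincount sums the weights per offset; keys that never occur stay at zero.
heights = np.bincount(keys - min_bin, weights=counts, minlength=positions.size)
```

positions and heights can then be handed to plt.bar(positions, heights).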

I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:
pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))
This allows me to rely on hist to re-bin my data.

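You can also use seaborn to plot the histogram (distplot has been deprecated since seaborn 0.11; in newer versions, histplot accepts a weights argument directly):
import seaborn as sns
sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)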

The length of the "bins" array should be one longer than the length of "counts", because "bins" holds the bin edges. Here's a way to fully reconstruct the histogram:
import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)

Adding to tacaswell's comment, plt.bar can be much more efficient than plt.hist here for large numbers of bins (>1e4), especially for a crowded random plot where you only need to plot the highest bars, because the width required to see them will cover most of their neighbors anyway. You can pick out the highest bars and plot them with
i, = np.where(vals > min_height)
plt.bar(i, vals[i], width=len(bins)//50)
For other kinds of data it may be preferable to plot every 100th bar or something similar.
The trick here is that plt.hist wants to plot all of your bins, whereas plt.bar will let you plot just the sparser set of visible bins.
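To make the selection step concrete, here is a small sketch on synthetic data. The Poisson heights and the 99th-percentile cut-off are assumptions chosen only for illustration; any threshold that suits the plot would do.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is repeatable
vals = rng.poisson(5, size=20_000)   # synthetic bar heights, one per bin (assumption)

# Hypothetical cut-off: keep only bars strictly taller than the 99th percentile.
min_height = np.percentile(vals, 99)
i, = np.where(vals > min_height)     # indices of the bars that survive the cut

# Only these positions and heights would be passed on to plt.bar.
sparse_positions, sparse_heights = i, vals[i]
print(f"plotting {i.size} of {vals.size} bars")
```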

hist uses bar under the hood, so this will produce something similar to what hist creates (assuming bins of equal size):
bins = [1,2,3]
heights = [10,20,30]
ax = plt.gca()
ax.bar(bins, heights, align='center', width=bins[-1] - bins[-2])
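If the bins are not of equal size, one option is to pass the bin edges and let each bar take its width from the gap to the next edge. This is a sketch with made-up edges and heights, not part of the original answer.

```python
import numpy as np
import matplotlib.pyplot as plt

edges = np.array([0.0, 1.0, 2.0, 5.0, 10.0])  # illustrative unequal bin edges (assumption)
heights = np.array([10, 20, 30, 5])           # one height per bin, so len(edges) - 1 values

fig, ax = plt.subplots()
# align='edge' anchors each bar at its left edge; np.diff gives each bin's width.
bars = ax.bar(edges[:-1], heights, width=np.diff(edges), align='edge', edgecolor='black')
```

Call plt.show() afterwards to display the figure.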

Related

How to plot histogram, when the number of values in interval is given? (python)

I know that when you usually plot a histogram you have an array of values and intervals.
But if I have intervals and the number of values that are in those intervals, how can I plot the histogram?
I have something that looks like this:
amounts = np.array([23, 7, 18, 5])
and my interval is from 0 to 4 with step 1,
so on interval [0,1] there are 23 values and so on.
You could probably try matplotlib.pyplot.stairs for this.
import matplotlib.pyplot as plt
import numpy as np
amounts = np.array([23, 7, 18, 5])
plt.stairs(amounts, range(5))
plt.show()
I find it easier to just simulate some data having the desired distribution, and then use plt.hist to plot the histogram.
Here is an example. Hopefully it will be helpful!
import numpy as np
import matplotlib.pyplot as plt
amounts = np.array([23, 7, 18, 5])
bin_edges = np.arange(5)
bin_centres = (bin_edges[1:] + bin_edges[:-1]) / 2
# fake some data having the desired distribution
data = [[bc] * amount for bc, amount in zip(bin_centres, amounts)]
data = np.concatenate(data)
hist = plt.hist(data, bins=bin_edges, histtype='step')[0]
plt.show()
# the plotted distribution is consistent with amounts
assert np.allclose(hist, amounts)
If you already know the values, then the histogram just becomes a bar plot.
amounts = np.array([23, 7, 18, 5])
interval = np.arange(5)
midvals = (interval + 0.5)[:len(interval) - 1]  # 0.5, 1.5, 2.5, 3.5
plt.bar(midvals, amounts)
plt.xticks(interval)  # Shows the interval ranges rather than the centers of the bars
plt.show()
If the gap between the bars looks too wide, you can change the width of the bars by passing a width argument (as a fraction of 1; the default is 0.8) to plt.bar().

Disable hue nesting in Seaborn

When plotting a bar chart with Seaborn and using the hue parameter to color the bars according to their column value, bars with identical column values are nested, or aggregated, and only a single bar is shown. The image below illustrates the problem. Patient number 1 has two samples of sample_type 1, with values 10 and 20. The two values have been nested, and both values are represented as a single bar (as the average of the two).
I'd like to avoid this nesting, and rather have something like in the image below.
Is this possible to achieve? MVE below. Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
"patient_number": [1, 1, 1, 2, 2, 2],
"sample_type": [1, 1, 2, 1, 2, 3],
"value": [10, 20, 15, 10, 11, 12]
})
sns.barplot(x="patient_number", y="value", hue="sample_type", data=df)
plt.show()
The following approach obtains the desired plot:
Seaborn's hue= parameter both defines the color and the position of the bars.
Per patient, an extra field ('idx') contains a unique number for each of the desired bars. This field 'idx' restarts from 0 for every next patient and is added to the dataframe.
'idx' can then be used as hue='idx' to get the desired columns, although they will just be colored sequentially.
In order to get one color per sample type, an extra column now contains a factorized version of the sample types (so, 0 for the first type, 1 for the next, etc.).
Seaborn generates the bars per hue, one for each patient. These bars can be accessed as a list via ax.patches. If some patient doesn't have a value for a given 'idx', a dummy bar will be added to the list.
By iterating through the patients and then through the 'idx', all bars can be visited and colored via 'sample_type'. As the ordering of the bars is a bit tricky, an adequate renumbering is needed.
The legend needs to be changed to reflect the sample types.
The given data is extended a bit to be able to test different numbers of samples per patient, and sample types that aren't simple subsequent numbers.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'patient_number': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
'sample_type': ['st1', 'st1', 'st2', 'st1', 'st2', 'st3', 'st4', 'st4', 'st4', 'st4', 'st4'],
'value': [10, 20, 15, 10, 11, 12, 1, 2, 3, 4, 5]
})
df['idx'] = df.groupby('patient_number').cumcount()
df['sample_factors'], sample_labels = pd.factorize(df['sample_type'])
ax = sns.barplot(x='patient_number', y='value', hue='idx', data=df)
colors = plt.get_cmap('Set2').colors # https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
handles = [None for _ in sample_labels]
num_patients = len(ax.patches) // (df['idx'].max() + 1)
for i, (patient_id, group) in enumerate(df.groupby('patient_number')):
    for j, factor in enumerate(group['sample_factors']):
        patch = ax.patches[i + j * num_patients]
        patch.set_color(colors[factor])
        handles[factor] = patch
ax.legend(handles=handles, labels=list(sample_labels), title='Sample type')
plt.show()

Pyplot/Matplotlib: Binary data with strings on x-axis

I know it's such a basic thing, but due to ridiculous time constraints and the severity of the situation I'm forced to ask something like this:
I've got two arrays of 160 000 entries. One contains strings (names I need to use), the other contains the corresponding 1's and 0's.
I'm trying to make a simple "step" graph in pyplot with the array of names along the X-axis and 0 and 1 along the Y-axis.
I have this currently:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 4, 5, 9]
bindata = [0,1,1,0,1,1,0,0,0,1]
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
plt.step(xaxis, yaxis)
plt.xlabel('Filter Degree Combinations')
plt.ylabel('Negative Or Positive')
plt.title("Car 1")
#plt.savefig('foo.png') #For saving
plt.show()
It gives me this:
But I want something like this:
I cobbled the code together from some examples, tutorials and stackoverflow questions, but I run into "ValueError: x and y must have same first dimension" so often that I'm not getting anywhere when I try to experiment my way forward.
You can achieve the desired plot by specifying the tick positions and labels on the x-axis using plt.xticks. The first argument, range(0, 10, 2), gives the positions; the second gives the strings to show at those positions:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 4, 5, 9]
bindata = [0,1,1,0,1,1,0,0,0,1]
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
plt.step(xaxis, yaxis)
xlabels = ['Josh', 'Anna', 'Kevin', 'Sophie', 'Steve'] # <-- specify tick-labels
plt.xlabel('Filter Degree Combinations')
plt.ylabel('Negative Or Positive')
plt.title("Car 1")
plt.xticks(range(0, 10, 2), xlabels) # <-- assign tick-labels
plt.show()
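The "x and y must have same first dimension" error mentioned in the question usually means the x and y arrays differ in length. A sketch that avoids it by deriving the x positions from the data itself is shown below. The names list is hypothetical, and it assumes one name per binary value.

```python
import numpy as np
import matplotlib.pyplot as plt

names = ['Josh', 'Anna', 'Kevin', 'Sophie', 'Steve', 'Maria']  # hypothetical labels
bindata = [0, 1, 1, 0, 1, 0]                                   # one 0/1 value per name

x = np.arange(len(bindata))  # built from the data, so x and y always match in length

fig, ax = plt.subplots()
ax.step(x, bindata, where='mid')                # where='mid' centers each step on its tick
ax.set_xticks(x)                                # one tick per entry
tick_texts = ax.set_xticklabels(names, rotation=45)
ax.set_yticks([0, 1])
```

With 160 000 entries, labelling every tick is not readable; slicing both arrays consistently, for example ax.set_xticks(x[::step]) with names[::step] for some step you choose, keeps the two in sync.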

How do I normalize a histogram using Matplotlib?

I am trying to generate a histogram using matplotlib. I am reading data from the following file:
https://github.com/meghnasubramani/Files/blob/master/class_id.txt
My intent is to generate a histogram with the following bins: 1, 2-5, 5-100, 100-200, 200-1000, >1000.
When I generate the graph it doesn't look nice.
I would like to normalize the y axis to (frequency of occurrence in a bin/total items). I tried using the density parameter, but whenever I try that my graph ends up completely blank. How do I go about doing this?
How do I get the widths of the bars to be the same, even though the bin ranges are varied?
Is it also possible to specify the ticks on the histogram? I want to have the ticks correspond to the bin ranges.
import matplotlib.pyplot as plt
FILE_NAME = 'class_id.txt'
class_id = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [1, 2, 5, 100, 200, 1000, max(class_id)]
x = plt.hist(class_id, bins=num_bins, histtype='bar', align='mid', rwidth=0.5, color='b')
print (x)
plt.legend()
plt.xlabel('Items')
plt.ylabel('Frequency')
As suggested by importanceofbeingernest, we can use bar charts to plot categorical data, and we need to categorize the values into bins, for example with pandas:
import matplotlib.pyplot as plt
import pandas
FILE_NAME = 'class_id.txt'
class_id_file = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [0, 2, 5, 100, 200, 1000, max(class_id_file)]
categories = pandas.cut(class_id_file, num_bins)
df = pandas.DataFrame(class_id_file)
dfg = df.groupby(categories).count()
bins_labels = ["1-2", "2-5", "5-100", "100-200", "200-1000", ">1000"]
plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=bins_labels)
#plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=categories.categories)
plt.xlabel('Items')
plt.ylabel('Frequency')
Not what you asked for, but you could also stay with the histogram and choose a logarithmic scale to improve readability:
plt.xscale('log')
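If you stay with a histogram, one way to get (count in bin / total items) on the y axis is to give every item a weight of 1/N. The sketch below uses np.histogram and a made-up stand-in for the contents of class_id.txt, since the real file is not reproduced here.

```python
import numpy as np

# Made-up stand-in for the values read from class_id.txt (assumption).
class_id = [1, 1, 3, 4, 50, 150, 150, 300, 2000, 5000]
bin_edges = [1, 2, 5, 100, 200, 1000, max(class_id)]

# Each item contributes 1/N, so every bin ends up holding its share of the total.
weights = np.ones(len(class_id)) / len(class_id)
fractions, _ = np.histogram(class_id, bins=bin_edges, weights=weights)

print(fractions)  # per-bin fraction of all items
```

The same weights can be passed to plt.hist(class_id, bins=bin_edges, weights=weights). Note that density=True is a different normalization: it also divides by the bin width so that the area sums to 1, which makes very wide bins nearly invisible and may explain the seemingly blank plot.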
