How to re-scale the counts in a matplotlib histogram

How to re-scale the counts in a matplotlib histogram - python

I have a matplotlib histogram that works fine.
hist_bin_width = 4
on_hist = plt.hist(my_data,bins=range(-100, 200,hist_bin_width),alpha=.3,color='#6e9bd1',label='on')
All I want to do is to rescale by a factor of, say, 2. I don't want to change the bin width, or to change the y axis labels. I want to take the counts in all the bins (e.g. bin 1 has 17 counts) and multiply by 2 so that bin 1 now has 34 counts in it.
Is this possible?
Thank you.

As it's just a simple rescaling of the y-axis, this must be possible. The complication arises because Matplotlib's hist computes and draws the histogram, making it difficult to intervene. However, as the documentation also notes, you can use the weights parameter to "draw a histogram of data that has already been binned". You can bin the data in a first step with Numpy's histogram function. Applying the scaling factor is then straightforward:
from matplotlib import pyplot
import numpy
numpy.random.seed(0)
data = numpy.random.normal(50, 20, 10000)
(counts, bins) = numpy.histogram(data, bins=range(101))
factor = 2
pyplot.hist(bins[:-1], bins, weights=factor*counts)
pyplot.show()

pyplot.hist's weights argument can be used to weight each data point with a factor like
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
data = np.random.normal(50, 20, 10000)
factor = 2
hist_bin_width = 40
plt.hist(data, bins=range(-100, 200, hist_bin_width),
weights=factor*np.ones_like(data))
plt.show()

Related

Radial heatmap from similarity matrix in Python

Summary
I have a 2880x2880 similarity matrix (8.5 mil points). My attempt with Holoviews resulted in a 500 MB HTML file which never finishes "opening". So how do I make a round heatmap of the matrix?
Details
I had data from 10 different places, measured over 1 whole year. The hours of each month were turned into arrays, so each month had 24 arrays (one for all 00:00, one for all 01:00 ... 22:00, 23:00).
These were about 28-31 cells long, and each cell had the measurement of the thing I'm trying to analyze. So there are these 24 arrays for each month of 1 whole year, i.e. 24x12 = 288 arrays per place. And there are measurements from 10 places. So a total of 2880 arrays were created and all compared to each other, and saved in a 2880x2880 matrix with similarity coefficients.
I'm trying to turn it into a radial similarity matrix like the one from holoviews, but without the ticks and tags (since the format Place01Jan0800 would be cumbersome to look at for 2880 rows), just the shape and colors and divisions:
I managed to create the HTML file itself, but it ended up being 500 MB big, so it never shows up when I open it up. It's just blank. I've added a minimal example below of what I have, and replaced the loading of the datafile with some randomly generated data.
import sys
sys.setrecursionlimit(10000)
import random
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
from bokeh.plotting import show
import gc
# Function creating dummy data for this example
def transformer():
dimension = 2880
dummy_matrix = ([[ random.random() for i in range(dimension) ] for j in range(dimension)]) #Fake, similar data
col_vals = [str(i) for i in range(dimension*dimension)] # Placeholder
row_vals = [str(i) for i in range(dimension*dimension)] # Placeholder
val_vals = (np.reshape(np.array(dummy_matrix), -1)).tolist() # Turn matrix into an array
idx_vals = [i for i in range(dimension*dimension)] # Placeholder
return idx_vals, val_vals, row_vals, col_vals
idx_arr, val_arr, row_arr, col_arr = transformer()
df = pd.DataFrame({"values": val_arr, "x-label": row_arr, "y-label": col_arr}, index=idx_arr)
hv.extension('bokeh')
heatmap = hv.HeatMap(df, ["x-label", "y-label"])
heatmap.opts(opts.HeatMap(cmap="viridis", radial=True))
gc.collect() # Attempt to save memory, because this thing is huge
show(hv.render(heatmap))
I had a look at datashader to see if it would help, but I have no idea how to plug it in (if it's possible for this case) to this radial heatmap, since it seems like the radial heatmap doesn't have that datashade-feature.
So I have no idea how to tackle this. I would be content with a broad overview too, I don't need the details nor the hover-infobox nor ability to zoom or any fancy extra features, I just need the general overview for a presentation. I'm open to any solution really.

I recommend you to use heatmp instead of radial heatamp for showing the similarity matrix. The reasons are:
The radial heatmap is designed for periodic variable. The time varible(288 hours) can be considered to be periodic data, however, I think the 288*10(288 hours, 10 places) is no longer periodic because of the existence of the "place".
Near the center of the radial heatmap, the color points will be too dense to be understood by the human.
The following is a simple code to show a heatmap.
import matplotlib.cm
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import numpy as np
n = 2880
m = 2880
dummy_matrix = np.random.rand(m, n)
fig = plt.figure(figsize=(50,50)) # change the figsize to control the resolution
ax = fig.add_subplot(111)
cmap = matplotlib.cm.get_cmap("Blues") # you may use other build-in colormap or define you own colormap
# if your data is not in range[0,1], use a normalization. Here is normalized by min and max values.
norm = Normalize(vmin=np.amin(dummy_matrix), vmax=np.amax(dummy_matrix))
image = ax.imshow(dummy_matrix, cmap=cmap, norm=norm)
plt.colorbar(image)
plt.show()
Which gives:
Another idea that comes to me is that, perhaps the computation of similarity matrix is unnecessary, and you can plot the orginial 288 * 10 data using radial heat map or just a normal heatmap, and one can get to know the data similarity from the color distribution directly.

Plain Matplotlib seems to be able to handle it, based on answers from here: How do I create radial heatmap in matplotlib?
import random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
fig = plt.figure()
ax = Axes3D(fig)
n = 2880
m = 2880
rad = np.linspace(0, 10, m)
a = np.linspace(0, 2 * np.pi, n)
r, th = np.meshgrid(rad, a)
dummy_matrix = ([[ random.random() for i in range(n) ] for j in range(m)])
plt.subplot(projection="polar")
plt.pcolormesh(th, r, dummy_matrix, cmap = 'Blues')
plt.plot(a, r, ls='none', color = 'k')
plt.grid()
plt.colorbar()
plt.savefig("custom_radial_heatmap.png")
plt.show()
And it didn't even take an eternity, took only about 20 seconds max.
You would think it would turn out monstrous like that
But the sheer amount of points drowns out the jaggedness, WOOHOO!
There's some things left to be desired, like tags and ticks, but I think I'll figure that out.

How can I create a list of the values on the y-axis without having to plot a graph in Python?

I have a piece of code that plots a random walk with a specified number of bins on my y-axis. Is there a way in Python to replicate/recreate the values on my y-axis, without having to plot the graph? Below is the code I've been working on and the method I've tried is to divide the min-max range by the number of
wanted bins and thereafter create a list with these values. However, I find my method far from optimal and not close to the results I get by using the below code.
I am greatful for any help on this matter!
import matplotlib.pyplot as plt
import numpy as np
import random
dims = 1
step_n = 2000
step_set = [-1, 0, 1]
origin = np.zeros((1,dims))
random.seed(30)
step_shape = (step_n,dims)
steps = np.random.choice(a=step_set, size=step_shape)
path = np.concatenate([origin, steps]).cumsum(0)
# create subplot
fig, ax = plt.subplots(1,1, figsize=(20, 11))
img = ax.plot(path)
plt.locator_params(axis='y', nbins=20)
y_values = ax.get_yticks() # y_values is a numpy array with my y values

I am not sure, if I understood your problem correctly.
Matplotlib defines the differences between the ticks in a way, that I assume are mostly multiples of 5.
But a general approach could be, to calculate a padding based on the bins you want and add/subtract it. For your given example the following gives the same result as ax.get_yticks()
bins = 19
padding = np.ceil((np.max(path) - np.min(path)) / bins)
np.linspace(np.min(path) - padding, np.max(path) + padding, bins, dtype=np.int32)

display a histogram with very non-uniform bin widths

Here is the histogram
To generate this plot, I did:
bins = np.array([0.03, 0.3, 2, 100])
plt.hist(m, bins = bins, weights=np.zeros_like(m) + 1. / m.size)
However, as you noticed, I want to plot the histogram of the relative frequency of each data point with only 3 bins that have different sizes:
bin1 = 0.03 -> 0.3
bin2 = 0.3 -> 2
bin3 = 2 -> 100
The histogram looks ugly since the size of the last bin is extremely large relative to the other bins. How can I fix the histogram? I want to change the width of the bins but I do not want to change the range of each bin.

As #cel pointed out, this is no longer a histogram, but you can do what you are asking using plt.bar and np.histogram. You then just need to set the xticklabels to a string describing the bin edges. For example:
import numpy as np
import matplotlib.pyplot as plt
bins = [0.03,0.3,2,100] # your bins
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99] # random data
hist, bin_edges = np.histogram(data,bins) # make the histogram
fig,ax = plt.subplots()
# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1)
# Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])
# Set the xticklabels to a string that tells us what the bin edges were
ax.set_xticklabels(['{} - {}'.format(bins[i],bins[i+1]) for i,j in enumerate(hist)])
plt.show()
EDIT
If you update to matplotlib v1.5.0, you will find that bar now takes a kwarg tick_label, which can make this plotting even easier (see here):
hist, bin_edges = np.histogram(data,bins)
ax.bar(range(len(hist)),hist,width=1,align='center',tick_label=
['{} - {}'.format(bins[i],bins[i+1]) for i,j in enumerate(hist)])

If your actual values of the bins are not important but you want to have a histogram of values of completely different orders of magnitude, you can use a logarithmic scaling along the x axis. This here gives you bars with equal widths
import numpy as np
import matplotlib.pyplot as plt
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99]
plt.hist(data,bins=10**np.linspace(-2,2,5))
plt.xscale('log')
plt.show()
When you have to use your bin values you can do
import numpy as np
import matplotlib.pyplot as plt
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99]
bins = [0.03,0.3,2,100]
plt.hist(data,bins=bins)
plt.xscale('log')
plt.show()
However, in this case the widths are not perfectly equal but still readable. If the widths must be equal and you have to use your bins I recommend #tom's solution.

Plot a histogram such that the total height equals 1

This is a follow-up question to this answer. I'm trying to plot normed histogram, but instead of getting 1 as maximum value on y axis, I'm getting different numbers.
For array k=(1,4,3,1)
import numpy as np
def plotGraph():
import matplotlib.pyplot as plt
k=(1,4,3,1)
plt.hist(k, normed=1)
from numpy import *
plt.xticks( arange(10) ) # 10 ticks on x axis
plt.show()
plotGraph()
I get this histogram, that doesn't look like normed.
For a different array k=(3,3,3,3)
import numpy as np
def plotGraph():
import matplotlib.pyplot as plt
k=(3,3,3,3)
plt.hist(k, normed=1)
from numpy import *
plt.xticks( arange(10) ) # 10 ticks on x axis
plt.show()
plotGraph()
I get this histogram with max y-value is 10.
For different k I get different max value of y even though normed=1 or normed=True.
Why the normalization (if it works) changes based on the data and how can I make maximum value of y equals to 1?
UPDATE:
I am trying to implement Carsten König answer from plotting histograms whose bar heights sum to 1 in matplotlib and getting very weird result:
import numpy as np
def plotGraph():
import matplotlib.pyplot as plt
k=(1,4,3,1)
weights = np.ones_like(k)/len(k)
plt.hist(k, weights=weights)
from numpy import *
plt.xticks( arange(10) ) # 10 ticks on x axis
plt.show()
plotGraph()
Result:
What am I doing wrong?

When plotting a normalized histogram, the area under the curve should sum to 1, not the height.
In [44]:
import matplotlib.pyplot as plt
k=(3,3,3,3)
x, bins, p=plt.hist(k, density=True) # used to be normed=True in older versions
from numpy import *
plt.xticks( arange(10) ) # 10 ticks on x axis
plt.show()
In [45]:
print bins
[ 2.5 2.6 2.7 2.8 2.9 3. 3.1 3.2 3.3 3.4 3.5]
Here, this example, the bin width is 0.1, the area underneath the curve sums up to one (0.1*10).
x stores the height for each bins. p stores each of those individual bins objects (actually, they are patches. So we just sum up x and modify the height of each bin object.
To have the sum of height to be 1, add the following before plt.show():
for item in p:
item.set_height(item.get_height()/sum(x))

You could use the solution outlined here:
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)

One way is to get the probabilities on your own, and then plot with plt.bar:
In [91]: from collections import Counter
...: c=Counter(k)
...: print c
Counter({1: 2, 3: 1, 4: 1})
In [92]: plt.bar(c.keys(), c.values())
...: plt.show()
result:

A normed histogram is defined such that the sum of products of width and height of each column is equal to the total count. That's why you are not getting your max equal to one.
However, if you still want to force it to be 1, you could use numpy and matplotlib.pyplot.bar in the following way
sample = np.random.normal(0,10,100)
#generate bins boundaries and heights
bin_height,bin_boundary = np.histogram(sample,bins=10)
#define width of each column
width = bin_boundary[1]-bin_boundary[0]
#standardize each column by dividing with the maximum height
bin_height = bin_height/float(max(bin_height))
#plot
plt.bar(bin_boundary[:-1],bin_height,width = width)
plt.show()

I found it very easy to use plotly express. Here is my code for your example:
import plotly.express as px
k= [1,4,3,1]
px.histogram(k,nbins=10,range_x=[0,10],histnorm='probability')
Which gives the normalize histogram the way that you want it. If you want to use percentage instead of probability you can simply change the last line of code to
px.histogram(k,nbins=10,range_x=[0,10],histnorm='percent')
If you don't want to manually set the range_x and nbins to make sure area of histogram is always one, you can use the following codes:
x_min=int(min(k))-1
x_max=int(max(k))+1
x_bins = x_max-x_min
px.histogram(k,nbins=x_bins,range_x=[x_min,x_max],histnorm='probability')

How to have logarithmic bins in a Python histogram

As far as I know the option Log=True in the histogram function only refers to the y-axis.
P.hist(d,bins=50,log=True,alpha=0.5,color='b',histtype='step')
I need the bins to be equally spaced in log10. Is there something that can do this?

use logspace() to create a geometric sequence, and pass it to bins parameter. And set the scale of xaxis to log scale.
import pylab as pl
import numpy as np
data = np.random.normal(size=10000)
pl.hist(data, bins=np.logspace(np.log10(0.1),np.log10(1.0), 50))
pl.gca().set_xscale("log")
pl.show()

The most direct way is to just compute the log10 of the limits, compute linearly spaced bins, and then convert back by raising to the power of 10, as below:
import pylab as pl
import numpy as np
data = np.random.normal(size=10000)
MIN, MAX = .01, 10.0
pl.figure()
pl.hist(data, bins = 10 ** np.linspace(np.log10(MIN), np.log10(MAX), 50))
pl.gca().set_xscale("log")
pl.show()

The following code indicates how you can use bins='auto' with the log scale.
import numpy as np
import matplotlib.pyplot as plt
data = 10**np.random.normal(size=500)
_, bins = np.histogram(np.log10(data + 1), bins='auto')
plt.hist(data, bins=10**bins);
plt.gca().set_xscale("log")

In addition to what was stated, performing this on pandas dataframes works as well:
some_column_hist = dataframe['some_column'].plot(bins=np.logspace(-2, np.log10(max_value), 100), kind='hist', loglog=True, xlim=(0,max_value))
I would caution, that there may be an issue with normalizing the bins. Each bin is larger than the previous one, and therefore must be divided by it's size to normalize the frequencies before plotting, and it seems that neither my solution, nor HYRY's solution accounts for this.
Source: https://arxiv.org/pdf/cond-mat/0412004.pdf

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to re-scale the counts in a matplotlib histogram - python

Related

Radial heatmap from similarity matrix in Python

How can I create a list of the values on the y-axis without having to plot a graph in Python?

display a histogram with very non-uniform bin widths

Plot a histogram such that the total height equals 1

How to have logarithmic bins in a Python histogram

Categories

Resources