How to have logarithmic bins in a Python histogram - python

As far as I know, the option log=True in the histogram function only refers to the y-axis.
P.hist(d,bins=50,log=True,alpha=0.5,color='b',histtype='step')
I need the bins to be equally spaced in log10. Is there something that can do this?

Use np.logspace() to create a geometric sequence and pass it to the bins parameter. Then set the x-axis to a log scale.
import pylab as pl
import numpy as np
data = np.random.normal(size=10000)
pl.hist(data, bins=np.logspace(np.log10(0.1),np.log10(1.0), 50))
pl.gca().set_xscale("log")
pl.show()

The most direct way is to just compute the log10 of the limits, compute linearly spaced bins, and then convert back by raising to the power of 10, as below:
import pylab as pl
import numpy as np
data = np.random.normal(size=10000)
MIN, MAX = .01, 10.0
pl.figure()
pl.hist(data, bins = 10 ** np.linspace(np.log10(MIN), np.log10(MAX), 50))
pl.gca().set_xscale("log")
pl.show()
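If the limits should come from the data instead of hard-coded MIN and MAX values, the same idea can be wrapped in a small helper. This is only a sketch: the name log_bin_edges is made up for illustration, and it assumes that non-positive values (which have no logarithm) can simply be ignored.

```python
import numpy as np

def log_bin_edges(data, n_bins=50):
    """Return n_bins + 1 edges equally spaced in log10 over the positive data.

    Hypothetical helper, not part of NumPy or Matplotlib.
    """
    data = np.asarray(data, dtype=float)
    positive = data[data > 0]          # log10 is undefined for values <= 0
    if positive.size == 0:
        raise ValueError("need at least one positive value")
    # geomspace spaces the edges evenly on a log scale between the two limits
    return np.geomspace(positive.min(), positive.max(), n_bins + 1)

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # made-up positive data
edges = log_bin_edges(data, n_bins=50)
```

The resulting edges can be passed as bins to pl.hist in the same way as the arrays above, followed by pl.gca().set_xscale("log").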

The following code indicates how you can use bins='auto' with the log scale.
import numpy as np
import matplotlib.pyplot as plt
data = 10**np.random.normal(size=500)
_, bins = np.histogram(np.log10(data), bins='auto')  # data is strictly positive, so no offset is needed
plt.hist(data, bins=10**bins);
plt.gca().set_xscale("log")

In addition to what was stated, performing this on pandas dataframes works as well:
some_column_hist = dataframe['some_column'].plot(bins=np.logspace(-2, np.log10(max_value), 100), kind='hist', loglog=True, xlim=(0,max_value))
I would caution that there may be an issue with normalizing the bins. Each bin is wider than the previous one, so each count must be divided by its bin width to normalize the frequencies before plotting. It seems that neither my solution nor HYRY's solution accounts for this.
Source: https://arxiv.org/pdf/cond-mat/0412004.pdf
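The normalization cautioned about above can be sketched with NumPy alone: count with np.histogram, then divide each count by its bin width (and by the total count if a probability density is wanted). The data and the bin limits below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # made-up positive data

edges = np.logspace(-2, 2, 41)            # 40 bins, equally spaced in log10
counts, _ = np.histogram(data, bins=edges)
widths = np.diff(edges)                   # each bin is wider than the previous one

# Counts per unit of x, scaled so the histogram integrates to 1
# over the values that fall inside the bin range.
density = counts / widths / counts.sum()

# np.histogram can apply the same normalization directly.
density_np, _ = np.histogram(data, bins=edges, density=True)
```

Both arrays hold one value per bin and can be plotted against the bin edges on a log x-axis.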

Related

Radial heatmap from similarity matrix in Python

Summary
I have a 2880x2880 similarity matrix (8.5 mil points). My attempt with Holoviews resulted in a 500 MB HTML file which never finishes "opening". So how do I make a round heatmap of the matrix?
Details
I had data from 10 different places, measured over 1 whole year. The hours of each month were turned into arrays, so each month had 24 arrays (one for all 00:00, one for all 01:00 ... 22:00, 23:00).
These were about 28-31 cells long, and each cell had the measurement of the thing I'm trying to analyze. So there are these 24 arrays for each month of 1 whole year, i.e. 24x12 = 288 arrays per place. And there are measurements from 10 places. So a total of 2880 arrays were created and all compared to each other, and saved in a 2880x2880 matrix with similarity coefficients.
I'm trying to turn it into a radial similarity matrix like the one from holoviews, but without the ticks and tags (since the format Place01Jan0800 would be cumbersome to look at for 2880 rows), just the shape and colors and divisions:
I managed to create the HTML file itself, but it ended up being 500 MB big, so it never shows up when I open it up. It's just blank. I've added a minimal example below of what I have, and replaced the loading of the datafile with some randomly generated data.
import sys
sys.setrecursionlimit(10000)
import random
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts
from bokeh.plotting import show
import gc
# Function creating dummy data for this example
def transformer():
    dimension = 2880
    dummy_matrix = ([[ random.random() for i in range(dimension) ] for j in range(dimension)]) #Fake, similar data
    col_vals = [str(i) for i in range(dimension*dimension)] # Placeholder
    row_vals = [str(i) for i in range(dimension*dimension)] # Placeholder
    val_vals = (np.reshape(np.array(dummy_matrix), -1)).tolist() # Turn matrix into an array
    idx_vals = [i for i in range(dimension*dimension)] # Placeholder
    return idx_vals, val_vals, row_vals, col_vals
idx_arr, val_arr, row_arr, col_arr = transformer()
df = pd.DataFrame({"values": val_arr, "x-label": row_arr, "y-label": col_arr}, index=idx_arr)
hv.extension('bokeh')
heatmap = hv.HeatMap(df, ["x-label", "y-label"])
heatmap.opts(opts.HeatMap(cmap="viridis", radial=True))
gc.collect() # Attempt to save memory, because this thing is huge
show(hv.render(heatmap))
I had a look at datashader to see if it would help, but I have no idea how to plug it in (if it's possible for this case) to this radial heatmap, since it seems like the radial heatmap doesn't have that datashade-feature.
So I have no idea how to tackle this. I would be content with a broad overview too, I don't need the details nor the hover-infobox nor ability to zoom or any fancy extra features, I just need the general overview for a presentation. I'm open to any solution really.
I recommend using a heatmap instead of a radial heatmap to show the similarity matrix. The reasons are:
The radial heatmap is designed for periodic variables. The time variable (288 hours) can be considered periodic, but I think the 288*10 data (288 hours, 10 places) is no longer periodic because of the "place" dimension.
Near the center of the radial heatmap, the colored points will be too dense for a human to make out.
The following is simple code that shows a heatmap.
import matplotlib.cm
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import numpy as np
n = 2880
m = 2880
dummy_matrix = np.random.rand(m, n)
fig = plt.figure(figsize=(50,50)) # change the figsize to control the resolution
ax = fig.add_subplot(111)
cmap = plt.get_cmap("Blues") # you may use another built-in colormap or define your own colormap
# if your data is not in the range [0, 1], use a normalization. Here it is normalized by the min and max values.
norm = Normalize(vmin=np.amin(dummy_matrix), vmax=np.amax(dummy_matrix))
image = ax.imshow(dummy_matrix, cmap=cmap, norm=norm)
plt.colorbar(image)
plt.show()
Which gives:
Another idea that comes to mind is that the computation of the similarity matrix may be unnecessary. You could plot the original 288 * 10 data using a radial heatmap or just a normal heatmap, and the similarity of the data can then be read directly from the color distribution.
Plain Matplotlib seems to be able to handle it, based on answers from here: How do I create radial heatmap in matplotlib?
import random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
fig = plt.figure()
ax = Axes3D(fig)
n = 2880
m = 2880
rad = np.linspace(0, 10, m)
a = np.linspace(0, 2 * np.pi, n)
r, th = np.meshgrid(rad, a)
dummy_matrix = ([[ random.random() for i in range(n) ] for j in range(m)])
plt.subplot(projection="polar")
plt.pcolormesh(th, r, dummy_matrix, cmap = 'Blues')
plt.plot(a, r, ls='none', color = 'k')
plt.grid()
plt.colorbar()
plt.savefig("custom_radial_heatmap.png")
plt.show()
And it didn't even take an eternity, took only about 20 seconds max.
You would think it would turn out monstrous like that
But the sheer amount of points drowns out the jaggedness, WOOHOO!
There's some things left to be desired, like tags and ticks, but I think I'll figure that out.
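For the missing ticks, one option is to create the polar axes explicitly and place one angular tick per place. The following is a sketch only: the matrix is random, its size is shrunk to keep the example fast, the labels Place01 to Place10 are made up, and it assumes the columns of the matrix are ordered place by place.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for an interactive window
import matplotlib.pyplot as plt
import numpy as np

n_places = 10                      # from the question: 10 places
per_place = 24                     # assumption: shrunk from 288 to keep the sketch fast
n = n_places * per_place

rng = np.random.default_rng(0)
matrix = rng.random((n, n))        # stand-in for the similarity matrix

# Cell edges: n + 1 angles and n + 1 radii for an n x n matrix.
# Columns of the matrix run along the angle, rows along the radius.
theta_edges = np.linspace(0, 2 * np.pi, n + 1)
r_edges = np.linspace(0, 10, n + 1)

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
mesh = ax.pcolormesh(theta_edges, r_edges, matrix, cmap="Blues")

# One angular tick per place, positioned at the middle of that place's sector.
sector = 2 * np.pi / n_places
tick_angles = sector * (np.arange(n_places) + 0.5)
ax.set_xticks(tick_angles)
ax.set_xticklabels([f"Place{i + 1:02d}" for i in range(n_places)])
ax.set_yticks([])                  # hide the radial ticks
fig.colorbar(mesh, ax=ax)
```

Calling fig.savefig(...) or plt.show() afterwards writes or displays the figure as in the code above.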

How to re-scale the counts in a matplotlib histogram

I have a matplotlib histogram that works fine.
hist_bin_width = 4
on_hist = plt.hist(my_data,bins=range(-100, 200,hist_bin_width),alpha=.3,color='#6e9bd1',label='on')
All I want to do is to rescale by a factor of, say, 2. I don't want to change the bin width, or to change the y axis labels. I want to take the counts in all the bins (e.g. bin 1 has 17 counts) and multiply by 2 so that bin 1 now has 34 counts in it.
Is this possible?
Thank you.
As it's just a simple rescaling of the y-axis, this must be possible. The complication arises because Matplotlib's hist computes and draws the histogram, making it difficult to intervene. However, as the documentation also notes, you can use the weights parameter to "draw a histogram of data that has already been binned". You can bin the data in a first step with Numpy's histogram function. Applying the scaling factor is then straightforward:
from matplotlib import pyplot
import numpy
numpy.random.seed(0)
data = numpy.random.normal(50, 20, 10000)
(counts, bins) = numpy.histogram(data, bins=range(101))
factor = 2
pyplot.hist(bins[:-1], bins, weights=factor*counts)
pyplot.show()
pyplot.hist's weights argument can also be used to weight each data point with a factor, for example:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
data = np.random.normal(50, 20, 10000)
factor = 2
hist_bin_width = 40
plt.hist(data, bins=range(-100, 200, hist_bin_width),
         weights=factor*np.ones_like(data))
plt.show()
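The two answers above should produce the same bar heights. A quick way to convince yourself is to compare them with NumPy only, without plotting. This is a sketch with made-up data and the bin width of 4 from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 20, 10_000)  # made-up data
edges = np.arange(-100, 200, 4)    # bin width 4, as in the question
factor = 2

# Approach 1: bin first, then scale the counts.
counts, _ = np.histogram(data, bins=edges)
scaled_after = factor * counts

# Approach 2: give every data point the weight `factor` while binning.
weights = np.full(data.shape, factor)
scaled_during, _ = np.histogram(data, bins=edges, weights=weights)

print(np.array_equal(scaled_after, scaled_during))
```

Either array can then be drawn with pyplot.hist(edges[:-1], edges, weights=...) as in the first answer.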

Python matplotlib: how to let matrixplot have variable column widths

I have a simple need but cannot find its simple solution. I have a matrix to plot, but I wish the row/columns to have given widths.
Something looking like a blocked matrix where you tell block sizes.
Any workaround with the same visual result is accepted.
import matplotlib.pyplot as plt
import numpy as np
samplemat = np.random.rand(3,3)
widths = np.array([.7, .2, .1])
# Display matrix
plt.matshow(samplemat)
plt.show()
matshow and imshow work with equal-sized cells, so they cannot be used here. Instead you can use pcolor or pcolormesh, which require you to supply the coordinates of the cell edges.
So you first need to calculate those edges from the given widths. Assuming you want them to start at 0, you can just take the cumulative sum.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(43)
samplemat = np.random.rand(3,3)
widths = np.array([.7, .2, .1])
coords = np.cumsum(np.append([0], widths))
X,Y = np.meshgrid(coords,coords)
# Display matrix
plt.pcolormesh(X,Y,samplemat)
plt.show()
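If rows and columns need different sizes, the same approach works with one set of edges per axis. The following is a sketch with made-up widths and heights; the y-axis is inverted so that row 0 ends up at the top, as it does with matshow.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for an interactive window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(43)
samplemat = rng.random((3, 4))                 # 3 rows, 4 columns
col_widths = np.array([0.4, 0.3, 0.2, 0.1])    # one width per column (made up)
row_heights = np.array([0.7, 0.2, 0.1])        # one height per row (made up)

# Cell edges: a leading 0 followed by the running sum of the sizes.
x_edges = np.concatenate(([0.0], np.cumsum(col_widths)))
y_edges = np.concatenate(([0.0], np.cumsum(row_heights)))

fig, ax = plt.subplots()
mesh = ax.pcolormesh(x_edges, y_edges, samplemat)
ax.invert_yaxis()                              # put row 0 at the top
fig.colorbar(mesh, ax=ax)
```

pcolormesh accepts the 1-D edge arrays directly here because each has exactly one more entry than the matrix has columns or rows.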

How to use xlim to shape x-axis in python

I've read some questions about this, but I couldn't figure it out.
I want my CDF graph to show only the x range from 0 to 120.
The problem is that xlim raises an invalid syntax error.
Any suggestion or hint would be much appreciated, because this is the first thing I've done in Python.
Here is my code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = np.load('C:\\Users\\python\\Desktop\\abnormal.npy')
# Choose how many bins you want here
num_bins = 20
# Use the histogram function to bin the data
counts, bin_edges = np.histogram(data, bins=num_bins, normed=True)
# Now find the cdf
cdf = np.cumsum(counts)
# And finally plot the cdf
plt.plot(bin_edges[1:], cdf)
xlim([0 120])
plt.show()
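The syntax error comes from the line xlim([0 120]): list elements need a comma between them, and with import matplotlib.pyplot as plt the function has to be called as plt.xlim. Below is a sketch of a corrected version. The data is a made-up stand-in because abnormal.npy is not available here, and density=True replaces normed=True, which recent NumPy releases no longer accept.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for an interactive window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(60, 30, 1_000)     # made-up stand-in for abnormal.npy

num_bins = 20
counts, bin_edges = np.histogram(data, bins=num_bins, density=True)

# With density=True the values are densities, so multiply by the bin widths
# before accumulating; the last CDF value is then 1.
cdf = np.cumsum(counts * np.diff(bin_edges))

plt.plot(bin_edges[1:], cdf)
plt.xlim(0, 120)                     # plt.xlim([0, 120]) also works; note the comma
```

Calling plt.show() afterwards displays the plot as in the original code.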

set constant width to every bar in a bar plot

I am trying to plot a bar plot where each bin has a different length, and as a result I end up with a very ugly plot. What I would like to do is still be able to define bins of different lengths, but have all the bars plotted with the same fixed width. How can I do that? Here is what I have done so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
np.random.seed(9221999)
data = [0,2,30,40,50,10,50,40,150,70,150,10,3,70,70,90,10,2]
bins = [0,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200]
plt.hist(data, bins=bins);
EDIT
This question has been marked as a duplicate, but in fact none of the proposed links solved my problem; the 1st is a very crappy workaround and the 2nd doesn't solve the problem at all, as it sets all bars' widths to a certain number.
Here you go, with seaborn, as you please. But you have to understand that seaborn itself uses matplotlib to create plots.
AND: Please delete your other question, now it really is a duplicate.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
data = [0,2,30,40,50,10,50,40,150,70,150,10,3,70,70,90,10,2]
bins = [0,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200]
bin_middles = bins[:-1] + np.diff(bins)/2.
bar_width = 1.
m, bins = np.histogram(data, bins)
plt.bar(np.arange(len(m)) + (1-bar_width)/2., m, width=bar_width)
ax = plt.gca()
ax.set_xticks(np.arange(len(bins)))
ax.set_xticklabels(['{:.0f}'.format(i) for i in bins])
plt.show()
Personally, I think that plotting your data like this is confusing. Having non-linear (or non-log) axis scaling is usually not a good idea.
Do you want to place a bar with a fixed width at the center of each bin?
If so, try something similar to this:
import numpy as np
import matplotlib.pyplot as plt
data = [0,2,30,40,50,10,50,40,150,70,150,10,3,70,70,90,10,2]
bins = [0,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200]
counts, _ = np.histogram(data, bins)
centers = np.mean([bins[:-1], bins[1:]], axis=0)
plt.bar(centers, counts, width=5, align='center')
plt.show()
