Normalize a multiple data histogram


I have several arrays that I'm plotting a histogram of, like so:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(0,.5,1000)
y = np.random.normal(0,.5,100000)
plt.hist((x, y), density=True)  # density replaces the old normed keyword, which recent releases no longer accept
However, this normalizes each array individually, so both histograms end up with the same peak. I'm looking to normalize them to the total number of elements, so that the histogram of y is visibly taller than that of x. Is there a handy way to do this in matplotlib, or will I have to mess around in numpy? I haven't found anything about it.
Another way to put it is that if I were instead to make a cumulative plot of the two arrays, they shouldn't both top out at 1, but should add to 1.

Yes, you can compute the histogram with numpy and renormalise it.
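import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(0,.5,1000)
y = np.random.normal(0,.5,100000)
# density=True replaces the old normed=True keyword
xhist, xbins = np.histogram(x, density=True)
yhist, ybins = np.histogram(y, density=True)  # histogram of y, not x
Now apply your own normalisation. For example, if you want x to stay normalised to 1 and y to scale in proportion to its size: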
yhist *= len(y) / len(x)
Now, to plot the histogram:
def plot_histogram(data, edge_bins, **kwargs):
    # bin centres are the midpoints of consecutive edges
    bins = (edge_bins[:-1] + edge_bins[1:]) / 2
    plt.step(bins, data, where='mid', **kwargs)
plot_histogram(xhist, xbins, c='b')
plot_histogram(yhist, ybins, c='g')
plt.show()
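If the goal is the one stated in the question (both histograms together add up to 1), one option is to weight every sample by 1 / total. The following is only a sketch under that assumption; the shared `bins` array and the names `xfrac` and `yfrac` are my own choices, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.5, 1000)
y = rng.normal(0, 0.5, 100000)

total = len(x) + len(y)

# Shared bin edges so that the two histograms are directly comparable
bins = np.histogram_bin_edges(np.concatenate([x, y]), bins=10)

# Each sample contributes 1/total, so each bar is a fraction of all points
xfrac, _ = np.histogram(x, bins=bins, weights=np.full(len(x), 1.0 / total))
yfrac, _ = np.histogram(y, bins=bins, weights=np.full(len(y), 1.0 / total))

print(xfrac.sum(), yfrac.sum(), xfrac.sum() + yfrac.sum())
```

The same weights can probably be handed straight to matplotlib, along the lines of plt.hist((x, y), bins=bins, weights=(wx, wy)) with wx and wy being the two weight arrays, which should give the same bar heights.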

Related

colormap from a matrix in python

I have a 2D output matrix (say, Z) which was calculated as a function of two variables x,y.
x varies in a non-uniform manner like [1e-5,5e-5,1e-4,5e-4,1e-3,5e-3,1e-2]
y varies in a uniform manner like [300,400,500,600,700,800]
[ say, Z = np.random.rand(7,6) ]
I was trying to plot a colormap of the matrix Z by first creating a meshgrid for x,y and then using pcolormesh. Since my x values are non-uniform, I do not know how to proceed. Any input would be greatly appreciated.

No need for meshgrids; regarding the non-uniform axes: In your case a log-scale works fine:
import numpy as np
from matplotlib import pyplot as plt
x = [1e-5,5e-5,1e-4,5e-4,1e-3,5e-3,1e-2]
y = [300,400,500,600,700,800]
# either enlarge x and y by one number (right-most
# endpoint for those bins), or make Z smaller as I did
Z = np.random.rand(6,5)
fig = plt.figure()
ax = fig.gca()
ax.pcolormesh(x,y,Z.T)
ax.set_xscale("log")
plt.show()  # blocks until the window is closed, so it also works when run as a script
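If you would rather keep the full 7 x 6 matrix, the other option mentioned in the comments is to give pcolormesh one more edge than there are cells along each axis. Here is a rough sketch of how those edges could be built; `log_edges` and `linear_edges` are hypothetical helper names of mine, and placing the edges halfway between neighbouring values (in log space for x) is an assumption about where the cells should end.

```python
import numpy as np

def log_edges(centres):
    """Return len(centres) + 1 edges placed midway between centres in log10 space."""
    logc = np.log10(np.asarray(centres, dtype=float))
    mid = (logc[:-1] + logc[1:]) / 2      # interior edges
    first = 2 * logc[0] - mid[0]          # mirror the first half-width outwards
    last = 2 * logc[-1] - mid[-1]         # mirror the last half-width outwards
    return 10 ** np.concatenate(([first], mid, [last]))

def linear_edges(centres):
    """Same idea on a linear axis."""
    c = np.asarray(centres, dtype=float)
    mid = (c[:-1] + c[1:]) / 2
    return np.concatenate(([2 * c[0] - mid[0]], mid, [2 * c[-1] - mid[-1]]))

x = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2]
y = [300, 400, 500, 600, 700, 800]
Z = np.random.rand(7, 6)   # full-size matrix from the question

x_edges = log_edges(x)     # 8 edges for 7 columns
y_edges = linear_edges(y)  # 7 edges for 6 rows
```

With these, ax.pcolormesh(x_edges, y_edges, Z.T) followed by ax.set_xscale("log") should draw one cell per matrix entry, each centred on its x, y value.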

Plot and function with three variables in python

I have an equation of the following form:
sin(x)*sin(y)*sin(z)+cos(x)*sin(y)*cos(z)=0
I know how to plot a function z=f(x,y) using matplotlib, but I don’t know how to plot the equation above. I tried the following MATLAB MuPAD code:
Plot(sin(x)*sin(y)*sin(z)+cos(x)*sin(y)*cos(z),#3d)

This will be much easier if you can isolate z. Your equation is the same as sin(z)/cos(z) = -cos(x)*sin(y)/(sin(x)*sin(y)) so z = atan(-cos(x)*sin(y)/(sin(x)*sin(y))).
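As a quick sanity check of that isolation, here is a small numerical sketch. The grid ranges are arbitrary choices of mine, picked so that sin(x) never hits zero; note that sin(y) cancels, so z ends up depending on x only.

```python
import numpy as np

# Stay away from x = 0 and x = pi, where sin(x) = 0 and the division blows up
x = np.linspace(0.1, 3.0, 50)
y = np.linspace(0.1, 3.0, 40)
X, Y = np.meshgrid(x, y)

# sin(y) cancels in the ratio, leaving z = atan(-cos(x)/sin(x))
Z = np.arctan(-np.cos(X) / np.sin(X))

# Plug the result back into the original equation; it should vanish
residual = np.sin(X) * np.sin(Y) * np.sin(Z) + np.cos(X) * np.sin(Y) * np.cos(Z)
max_residual = np.abs(residual).max()
print(max_residual)
```

X, Y and Z can then be passed to ax.plot_surface on a 3D axes. Keep in mind that atan only returns one branch of the solution.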

Don't get me wrong, but I think the equation you want to plot can be reduced to a simple 2D plot.
sin(x)*sin(y)*sin(z)+cos(x)*sin(y)*cos(z) = 0
sin(y)[sin(x)*sin(z)+cos(x)*cos(z)] = 0
sin(y)*cos(x-z) = 0
Hence sin(y) = 0 or cos(x-z)=0
Hence y = n*pi (1) or x - z = (2*n + 1)*pi/2
which implies x = z + (2*n + 1)*pi/2 (2)
Case (1) gives y = n*pi, which is a constant for each integer n and does not depend on x or z. Case (2) gives, in the x-z plane, a family of parallel lines that cut the x-axis at (2*n + 1)*pi/2, one line per value of n, spaced pi apart along that axis.
Assuming y is not a multiple of pi (so that sin(y) is not zero), you can simplify the plot to a 2D plot with just x and z.
To answer your original question: you need mplot3d to make 3D plots. But as with any graphing tool, you need values or points for x, y, z (you can compute the possible points programmatically). Then you feed those points to the plot, like below.
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection="3d")
xs = [] # X values
ys = [] # Y values
zs = [] # Z values
ax.plot3D(xs, ys, zs)
plt.show()
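To fill the empty xs, ys, zs lists above, one possibility is to sample points from solution family (2). This is just a sketch: n = 0, the grid limits and the grid size are arbitrary assumptions of mine.

```python
import numpy as np

n = 0  # which member of family (2) to sample; any integer works

# y and z are free on this family, x follows from x = z + (2n + 1)*pi/2
z_grid, y_grid = np.meshgrid(np.linspace(-np.pi, np.pi, 30),
                             np.linspace(-np.pi, np.pi, 30))
x_grid = z_grid + (2 * n + 1) * np.pi / 2

# Check that the sampled points satisfy the original equation
residual = (np.sin(x_grid) * np.sin(y_grid) * np.sin(z_grid)
            + np.cos(x_grid) * np.sin(y_grid) * np.cos(z_grid))
max_residual = np.abs(residual).max()

# Flattened coordinates, ready to be fed to a 3D plot
xs, ys, zs = x_grid.ravel(), y_grid.ravel(), z_grid.ravel()
print(len(xs), max_residual)
```

Since ax.plot3D joins the points in order with a line, ax.scatter(xs, ys, zs) or ax.plot_surface(x_grid, y_grid, z_grid) is probably the better fit for points sampled this way.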

How to bin a 2D data along the x-axis with Python

I have two arrays of corresponding data (x and y) that I plot as above on a log-log plot. The data is currently too granular and I would like to bin them to get a smoother relationship. Could I get some guidance on how I can bin along the x-axis, in exponential bin sizes, so that it appears linear on the log-log scale?
For example, if the first bin is of range x = 10^0 to 10^1, I want to collect all y-values with corresponding x in that range and average them into one value for that bin. I don't think np.histogram or plt.hist quite does the trick, since they bin by counting occurrences.
Edit: For context, if it helps, the above plot is an assortativity plot that plots the in vs out degree of a certain network.

You may use scipy.stats.binned_statistic to get the mean of the data in each bin. The bins are best created via numpy.logspace. You may then plot those means, e.g. as horizontal lines spanning the bin width or as scatter points at the bin centres.
import numpy as np; np.random.seed(42)
from scipy.stats import binned_statistic
import matplotlib.pyplot as plt
x = np.logspace(0,5,300)
y = np.logspace(0,5,300)+np.random.rand(300)*1.e3
fig, ax = plt.subplots()
ax.scatter(x,y, s=9)
s, edges, _ = binned_statistic(x,y, statistic='mean', bins=np.logspace(0,5,6))
# step-style coordinates; unused below, but ax.plot(xs, ys) would draw the means as steps
ys = np.repeat(s,2)
xs = np.repeat(edges,2)[1:-1]
ax.hlines(s,edges[:-1],edges[1:], color="crimson")
for e in edges:
    ax.axvline(e, color="grey", linestyle="--")
ax.scatter(edges[:-1]+np.diff(edges)/2, s, c="limegreen", zorder=3)
ax.set_xscale("log")
ax.set_yscale("log")
plt.show()

You can achieve this with pandas. The idea is to assign each X value to an interval using np.digitize. Since you are using a log scale, it makes sense to use np.logspace to choose intervals of exponentially changing lengths. Finally, you can group X values in each interval and compute mean Y values.
import pandas as pd
import numpy as np
x_max = 10
xs = np.exp(x_max * np.random.rand(1000))
ys = np.exp(np.random.rand(1000))
df = pd.DataFrame({
    'X': xs,
    'Y': ys,
})
df['Xbins'] = np.digitize(df.X, np.logspace(0, x_max, 30, base=np.exp(1)))
df['Ymean'] = df.groupby('Xbins').Y.transform('mean')
df.plot(kind='scatter', x='X', y='Ymean')
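For completeness, the same "mean of y per exponential x bin" idea can be sketched with numpy alone. `log_binned_mean` is a hypothetical helper name of mine, and using the geometric mean of the edges as the bin centre is an assumption that suits a log axis rather than anything taken from the answers above.

```python
import numpy as np

def log_binned_mean(x, y, n_bins=5):
    """Mean of y in n_bins logarithmically spaced bins of x (x must be > 0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.logspace(np.log10(x.min()), np.log10(x.max()), n_bins + 1)
    # Digitize against the interior edges only, so indices run 0..n_bins-1
    # and the largest x lands in the last bin
    idx = np.digitize(x, edges[1:-1])
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    means = np.full(n_bins, np.nan)          # empty bins stay NaN
    np.divide(sums, counts, out=means, where=counts > 0)
    centres = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centres
    return centres, means, counts

x = np.array([1.0, 2.0, 20.0, 50.0, 1000.0])
y = np.array([1.0, 3.0, 10.0, 20.0, 7.0])
centres, means, counts = log_binned_mean(x, y, n_bins=3)
print(centres, means, counts)
```

Plotting centres against means on log-log axes gives one point per bin; bins without data come back as NaN and are simply not drawn.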

Using numpy arrays to count the number of points within the cells of a regular grid

I am working with a large number of 3D points, each with x,y,z values stored in numpy arrays.
For background, the points will always fall within a cylinder of fixed radius, and height = max z value of the points.
My objective is to split the bounding cylinder (or column if it is easier) into e.g. 1 m height strata, and then count the number of points within each cell of a regular grid (e.g. 1 m x 1 m) overlaid on each stratum.
Conceptually, the operation would be the same as overlaying a raster and counting the points intersecting each pixel.
The grid of cells can form a square or a disk, it doesn't matter.
After a lot of searching and reading, my current thinking is to use some combination of numpy.linspace and numpy.meshgrid to generate the vertices of each cell stored within an array and test each cell against each point to see if it is 'in'. This seems inefficient, especially when working with thousands of points.
The numpy / scipy suite seems well suited to the problem, but I have not found a solution yet. Any suggestions would be much appreciated.
I have included a few example points and some code to visualize the data.
# Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Load in X,Y,Z values from a sub-sample of 10 points for testing
# XY Values are scaled to a reasonable point of origin
z_vals = np.array([3.08,4.46,0.27,2.40,0.48,0.21,0.31,3.28,4.09,1.75])
x_vals = np.array([22.88,20.00,20.36,24.11,40.48,29.08,36.02,29.14,32.20,18.96])
y_vals = np.array([31.31,25.04,31.86,41.81,38.23,31.57,42.65,18.09,35.78,31.78])
# This plot is instructive to visualize the problem
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x_vals, y_vals, z_vals, c='b', marker='o')
plt.show()

I am not sure I understand exactly what you are looking for, but since every "cell" seems to have a 1 m side in all directions, couldn't you:
1. round all your values to integers (rasterize your data), probably with a floor function;
2. create a bijection from these integer coordinates to something more convenient, such as:
(64**2)*x + (64)*y + z # assuming all values are in [0,63]
(you can put z first instead if you want to focus on height more easily later);
3. compute the histogram of the "cells" (several functions from numpy/scipy can do it);
4. revert the bijection if needed (i.e. recover the "true" coordinates of each cell once the count is known).
Maybe I misunderstood, but in case it helps...
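The steps above could look roughly like this on the sample points from the question. Using np.unique for the counting step and the variable names below are my own assumptions; the key formula is the one from the answer.

```python
import numpy as np

z_vals = np.array([3.08, 4.46, 0.27, 2.40, 0.48, 0.21, 0.31, 3.28, 4.09, 1.75])
x_vals = np.array([22.88, 20.00, 20.36, 24.11, 40.48, 29.08, 36.02, 29.14, 32.20, 18.96])
y_vals = np.array([31.31, 25.04, 31.86, 41.81, 38.23, 31.57, 42.65, 18.09, 35.78, 31.78])

side = 64  # every integer coordinate is assumed to lie in [0, 63]

# Step 1: rasterize to 1 m cells
ix = np.floor(x_vals).astype(int)
iy = np.floor(y_vals).astype(int)
iz = np.floor(z_vals).astype(int)

# Step 2: one integer key per cell
keys = (side ** 2) * ix + side * iy + iz

# Step 3: count points per occupied cell
unique_keys, counts = np.unique(keys, return_counts=True)

# Step 4: revert the bijection to recover the cell coordinates
cx, rem = np.divmod(unique_keys, side ** 2)
cy, cz = np.divmod(rem, side)

for cell, c in zip(zip(cx, cy, cz), counts):
    print(cell, c)
```

Only occupied cells show up in the result, which keeps memory use low when the grid is large and sparse.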

Thanks @Baruchel. It turns out the n-dimensional histograms suggested by @DilithiumMatrix provide a fairly simple solution to the problem I posted. After some reading, here is my current solution for anyone else who faces a similar problem.
As this is my first Python/Numpy effort any improvements/suggestions, especially regarding performance, would be welcome. Thanks.
# Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Load in X,Y,Z values from a sub-sample of 10 points for testing
# XY Values are scaled to a reasonable point of origin
z_vals = np.array([3.08,4.46,0.27,2.40,0.48,0.21,0.31,3.28,4.09,1.75])
x_vals = np.array([22.88,20.00,20.36,24.11,40.48,29.08,36.02,29.14,32.20,18.96])
y_vals = np.array([31.31,25.04,31.86,41.81,38.23,31.57,42.65,18.09,35.78,31.78])
# Updated code below
# Variables needed for 2D,3D histograms
xmax, ymax, zmax = int(x_vals.max())+1, int(y_vals.max())+1, int(z_vals.max())+1
xmin, ymin, zmin = int(x_vals.min()), int(y_vals.min()), int(z_vals.min())
xrange, yrange, zrange = xmax-xmin, ymax-ymin, zmax-zmin
xedges = np.linspace(xmin, xmax, (xrange + 1), dtype=int)
yedges = np.linspace(ymin, ymax, (yrange + 1), dtype=int)
zedges = np.linspace(zmin, zmax, (zrange + 1), dtype=int)
# Make the 2D histogram
h2d, xedges, yedges = np.histogram2d(x_vals, y_vals, bins=(xedges, yedges))
# Compare the total count, so the check still holds when a cell contains more than one point
assert h2d.sum() == len(x_vals), "Unclassified points in the array"
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.imshow(h2d.transpose(), extent=extent, interpolation='none', origin='lower')
# histogram2d puts x along the first axis (rows), while imshow draws rows top to bottom,
# hence the transpose and origin='lower'
# Plot settings, not sure yet on matplotlib update/override objects
plt.grid(True, which='both')
plt.xticks(xedges)
plt.yticks(yedges)
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.plot(x_vals, y_vals, 'ro')
plt.show()
# 3-dimensional histogram with 1 x 1 x 1 m bins. Produces point counts in each 1m3 cell.
xyzstack = np.stack([x_vals,y_vals,z_vals], axis=1)
h3d, Hedges = np.histogramdd(xyzstack, bins=(xedges, yedges, zedges))
assert h3d.sum() == len(x_vals), "Unclassified points in the array"
h3d.shape # Shape of the array should be same as the edge dimensions
testzbin = np.sum(np.logical_and(z_vals >= 1, z_vals < 2)) # Slice to test with
np.sum(h3d[:,:,1]) == testzbin # Test num points in second bins
np.sum(h3d, axis=2) # Sum of all vertical points above each x,y 'pixel'
# h2d and np.sum(h3d, axis=2) should match as long as the z edges cover every point
# Remaining issue - how to get a r x c count of empty z bins.
# i.e. for each 'pixel' how many z bins contained no points?
# Possible solution is to reshape to use logical operators
count2d = h3d.reshape(xrange * yrange, zrange) # Maintain dimensions per num 3D cells defined
zerobins = (count2d == 0).sum(1)
zerobins.shape
# Get back to x,y grid with counts - ready for output as image with counts=pixel digital number
bincount_pixels = zerobins.reshape(xrange,yrange)
# Appears to work, perhaps there is a way without reshaping?
PS: If you are facing a similar problem, scikit patch extraction looks like another possible solution.
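On the open question of counting empty z bins per pixel without reshaping, summing a boolean mask along the z axis should be equivalent. Below is a condensed sketch using the same sample points; `unit_edges` is a hypothetical helper of mine that builds the 1 m edges.

```python
import numpy as np

z_vals = np.array([3.08, 4.46, 0.27, 2.40, 0.48, 0.21, 0.31, 3.28, 4.09, 1.75])
x_vals = np.array([22.88, 20.00, 20.36, 24.11, 40.48, 29.08, 36.02, 29.14, 32.20, 18.96])
y_vals = np.array([31.31, 25.04, 31.86, 41.81, 38.23, 31.57, 42.65, 18.09, 35.78, 31.78])

def unit_edges(v):
    """Integer edges from floor(min) to floor(max) + 1, i.e. 1 m wide bins."""
    return np.arange(np.floor(v.min()), np.floor(v.max()) + 2)

xedges, yedges, zedges = unit_edges(x_vals), unit_edges(y_vals), unit_edges(z_vals)

points = np.column_stack([x_vals, y_vals, z_vals])
h3d, _ = np.histogramdd(points, bins=(xedges, yedges, zedges))

# Reshape-based count of empty z bins per x,y pixel, as in the code above
empty_by_reshape = (h3d.reshape(-1, h3d.shape[2]) == 0).sum(axis=1).reshape(h3d.shape[:2])

# Same result without reshaping: sum the boolean mask along the z axis
empty_direct = (h3d == 0).sum(axis=2)

print(h3d.shape, np.array_equal(empty_by_reshape, empty_direct))
```

The axis-based version also avoids having to keep the x and y range variables around just for the reshape.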

Gridding with Python

I am trying to plot a picture like this in Python.
I have three parameters for plotting.
x:
[ 0.03570416 0.05201517 0.05418171 0.01868341 0.07116423 0.07547471]
y:
[-0.32079484 -0.53330218 -1.02866859 -0.94808545 -0.51682506 -0.26788337]
z:
[-0.32079484 -0.53330218 -1.02866859 -0.94808545 -0.51682506 -0.26788337]
So x is the x-axis and y is the y-axis, while z is the intensity of the pixel.
I came up with this code:
z = np.array(reals)
x = np.array(ra)
y = np.array(dec)
nrows, ncols = 10, 10
grid = z.reshape((nrows, ncols))
plt.imshow(grid, extent=(x.min(), x.max(), y.max(), y.min()), interpolation='nearest', cmap=cm.gist_rainbow)
plt.title('This is a phase function')
plt.xlabel('ra')
plt.ylabel('dec')
plt.show()
However I get this error:
grid = z.reshape((nrows, ncols))
ValueError: total size of new array must be unchanged
ra, dec and reals are ordinary arrays of the same size. I calculated them earlier and then created the numpy arrays from them.

The data you show is not consistent with making an image, but you could make a scatter plot with it.
The two basic types of plots for z values at (x,y) coordinate pairs are:
scatter plots, where for each (x,y) pair, a z-value is specified.
image (imshow, pcolor, pcolormesh, contour), where an x-axis with m regularly spaced values, and a y-axis with n regularly spaced values are specified, and then an array of z-values with size (m,n) is given.
Your data looks more like the former type, so I'm suggesting a scatter plot.
Here's what a scatter plot looks like (btw, your y and z values are the same, which is probably a mistake):
import numpy as np
import matplotlib.pyplot as plt
x = np.array([ 0.03570416, 0.05201517, 0.05418171, 0.01868341, 0.07116423, 0.07547471])
y = np.array([-0.32079484, -0.53330218, -1.02866859, -0.94808545, -0.51682506, -0.26788337])
z = np.array([-0.32079484, -0.53330218, -1.02866859, -0.94808545, -0.51682506, -0.26788337])
plt.scatter(x, y, c=z, s =250)
plt.show()
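As for the error itself: reshape can only rearrange the elements that are already there, so 6 values cannot become a 10 x 10 grid. If an image is still wanted, the scattered points first have to be binned onto a regular grid. Here is a sketch of one way to do that with np.histogram2d; averaging z per cell and leaving empty cells as NaN are assumptions of mine.

```python
import numpy as np

x = np.array([0.03570416, 0.05201517, 0.05418171, 0.01868341, 0.07116423, 0.07547471])
y = np.array([-0.32079484, -0.53330218, -1.02866859, -0.94808545, -0.51682506, -0.26788337])
z = np.array([-0.32079484, -0.53330218, -1.02866859, -0.94808545, -0.51682506, -0.26788337])

nrows, ncols = 10, 10

# Reproduce the error from the question: 6 values cannot fill 100 cells
try:
    z.reshape((nrows, ncols))
    reshape_ok = True
except ValueError:
    reshape_ok = False

# Bin the points instead: sum of z per cell, then number of points per cell
sums, xedges, yedges = np.histogram2d(x, y, bins=(ncols, nrows), weights=z)
counts, _, _ = np.histogram2d(x, y, bins=(xedges, yedges))

# Mean z per cell; cells without any point stay NaN
grid = np.full(sums.shape, np.nan)
np.divide(sums, counts, out=grid, where=counts > 0)

print(reshape_ok, grid.shape, int(counts.sum()))
```

The first axis of grid follows x, so something like plt.imshow(grid.T, origin='lower', extent=(xedges[0], xedges[-1], yedges[0], yedges[-1]), aspect='auto') should line it up with the axes. With only six points most cells are empty, which is why the scatter plot is the more honest picture here.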
