I have some data as numpy arrays x, y, v as shown in the code below.
This is actually dummy data for the velocity (v) of dust particles in an x-y plane.
I have binned my data into 4 bins, calculated the mean of the entries in each bin, and made a heat map.
Now what I want to do is make a histogram/distribution of v in each bin with 0 as the centre of the histogram.
I do not want to plot the mean anymore, just want to divide my data into the same bins as this code and for each bin I want to generate a histogram of the values in the bins.
How should I do it?
I think this is a way to model the spectrum of an emission line from the gas particles. Any help is appreciated! Thanks!
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
x = np.array([-10,-2,4,12,3,6,8,14,3])
y = np.array([5,5,-6,8,-20,10,2,2,8])
v = np.array([4,-6,-10,40,22,-14,20,8,-10])
x_bins = np.linspace(-20, 20, 3)
y_bins = np.linspace(-20, 20, 3)
H, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins], weights = v)
pstat = stats.binned_statistic_2d(x, y, v, statistic='mean', bins = [x_bins, y_bins])
plt.xlabel("x")
plt.ylabel("y")
plt.imshow(pstat.statistic.T, origin='lower', cmap='RdBu',
extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar().set_label('mean', rotation=270)
EDIT: Please note that my original data is huge. My arrays for x, y, v are very large and I am using a 30x30 grid, that is, not just 4 quadrants but 900 bins. I might also need to increase the number of bins. So, I want a way to automatically divide the 'v' data into the regularly spaced bins and then plot the histograms of the 'v' data in each bin.
I would iterate over the zipped x and y, check which quadrant each point falls in, and append the corresponding v to that quadrant's list. Afterwards, you can plot whatever you'd like:
x = np.array([-10,-2,4,12,3,6,8,14,3])
y = np.array([5,5,-6,8,-20,10,2,2,8])
v = np.array([4,-6,-10,40,22,-14,20,8,-10])
q1 = []
q2 = []
q3 = []
q4 = []
for i, (x1, y1) in enumerate(zip(x, y)):
    if x1 < 0 and y1 >= 0:
        q1.append(v[i])
    elif x1 >= 0 and y1 >= 0:
        q2.append(v[i])
    elif x1 >= 0 and y1 < 0:
        q3.append(v[i])
    elif x1 < 0 and y1 < 0:
        q4.append(v[i])
print(q1)
print(q2)
print(q3)
print(q4)
#[4, -6]
#[40, -14, 20, 8, -10]
#[-10, 22]
#[]
plt.hist(q1, density=True)
plt.hist(q2, density=True)
plt.hist(q3, density=True)
#q4 is empty
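For the 30x30 grid mentioned in the question's edit, hand-written quadrant tests do not scale. The following is a minimal sketch, not a definitive implementation, of one way to group v by bin for any regular grid: it assigns each point a bin index with np.digitize, sorts once, and slices out each bin. The names v_per_bin, hist_per_bin and v_edges are made up for this illustration. Note one assumption: points lying exactly on the upper edge of the grid are treated as outside here, unlike np.histogram2d.

```python
import numpy as np

x = np.array([-10, -2, 4, 12, 3, 6, 8, 14, 3])
y = np.array([5, 5, -6, 8, -20, 10, 2, 2, 8])
v = np.array([4, -6, -10, 40, 22, -14, 20, 8, -10])

# Same edges as in the question; np.linspace(-20, 20, 31) would give a 30x30 grid.
x_bins = np.linspace(-20, 20, 3)
y_bins = np.linspace(-20, 20, 3)
nx, ny = len(x_bins) - 1, len(y_bins) - 1

# 0-based bin index along each axis (-1 or n means outside the grid).
ix = np.digitize(x, x_bins) - 1
iy = np.digitize(y, y_bins) - 1
inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

# Sort the velocities by flattened bin index, then slice out each bin.
flat = ix[inside] * ny + iy[inside]
order = np.argsort(flat, kind="stable")
flat_sorted = flat[order]
v_sorted = v[inside][order]
bin_ids = np.arange(nx * ny)
starts = np.searchsorted(flat_sorted, bin_ids, side="left")
stops = np.searchsorted(flat_sorted, bin_ids, side="right")

# v_per_bin[(i, j)] holds every v whose (x, y) falls in x-bin i, y-bin j.
# Empty bins are simply left out of the dictionary.
v_per_bin = {
    (k // ny, k % ny): v_sorted[a:b]
    for k, (a, b) in enumerate(zip(starts, stops))
    if b > a
}

# Histogram edges symmetric about 0, shared by all bins.
v_max = np.abs(v).max()
v_edges = np.linspace(-v_max, v_max, 11)
hist_per_bin = {
    key: np.histogram(vals, bins=v_edges)[0] for key, vals in v_per_bin.items()
}
```

To draw the distribution for one bin, something like plt.hist(v_per_bin[(i, j)], bins=v_edges) should work; because the edges are symmetric, 0 sits at the centre of every histogram.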
Related
Good day to everyone. I was wondering if there is any way to extract a mass map and a mass density map for a scatter plot of mass distributions.
Developing the code for the mass distributions:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.ndimage import gaussian_filter
from numpy.random import rand
# Finds nran number of random points in two dimensions
def randomizer(nran):
    arr = rand(nran, 2)
    return arr
# Calculates a sort of 'density' plot. Using this from a previous StackOverflow Question: https://stackoverflow.com/questions/2369492/generate-a-heatmap-in-matplotlib-using-a-scatter-data-set
def myplot(x, y, s, bins = 1000):
    plot, xedges, yedges = np.histogram2d(x, y, bins = bins)
    plot = gaussian_filter(plot, sigma = s)
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    return plot.T, extent
Trying out an example:
arr = randomizer(1000)
plot, extent = myplot(arr[:, 0], arr[:, 1], 20)
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
ax[0].scatter(arr[:, 0], arr[:, 1])
ax[0].set_aspect('equal')
ax[0].set_xlabel('x')
ax[0].set_ylabel('y')
ax[0].set_title('Scatter Plot')
img = ax[1].imshow(plot)
ax[1].set_title('Density Plot?')
ax[1].set_aspect('equal')
ax[1].set_xlabel('x')
ax[1].set_ylabel('y')
plt.colorbar(img)
This yields a scatter plot and what I think represents a density plot (please correct me if I'm wrong). Now, suppose that each dot has a mass of 50 kg. Does the "density plot" represent a map of the total mass distribution (if that makes sense), given that the colorbar has a max value much less than 50? And, using this, how can I compute a mass density for this mass distribution? I would really appreciate it if someone could help. Thank you.
Edit: Added the website from where I got the heatmap function.
Okay, I think I've got the solution. I've been meaning to upload this for quite an amount of time. Here it goes:
# Importing packages
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from numpy.random import random
from scipy.stats import binned_statistic_2d
# Finds nran number of random points in two dimensions
def randomizer(nran):
    arr_x = []
    arr_y = []
    for i in range(nran):
        arr_x += [10 * random()] # Since random() only produces floats in [0, 1), I multiply by 10 (for illustrative purposes)
        arr_y += [10 * random()] # Since random() only produces floats in [0, 1), I multiply by 10 (for illustrative purposes)
    return arr_x, arr_y
# Computing weight array
def weights_array(weight, length):
    weights = np.array([weight] * length)
    return weights
# Computes a weighted histogram and divides it by the total grid area to get the density
def histogramizer(x_array, y_array, weights, num_pixels, Dimension):
    Range = [0, Dimension] # Assumes the weights are distributed in a square area
    grid, _, _, _ = binned_statistic_2d(x_array, y_array, weights, 'sum', bins=num_pixels, range=[Range,Range])
    area = int(np.max(x_array)) * int(np.max(y_array))
    density = grid/area
    return density
Then, actually implementing this, one finds:
arr_x, arr_y = randomizer(1000000)
weights = []
for i in range(len(arr_x)):
    weights += [50]
density = histogramizer(arr_x, arr_y, weights, [400,400], np.max(arr_x))
fig, ax = plt.subplots(figsize = (15, 5))
plt.imshow(density, extent = [0, int(np.max(arr_x)), 0, int(np.max(arr_x))]);
plt.colorbar(label = r'kg m$^{-2}$');
The result I got for this was the following plot (I know it's generally not recommended to add a photo, but I wanted to add it for sake of showing my code's output):
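As a point of comparison, here is a hedged sketch of a slightly different reading of "mass density": the mass summed in each pixel divided by the area of that pixel, rather than by the total grid area. The variable names (side, n_pix, mass_map, density_map) and the 10 m square are assumptions chosen for illustration, and the points are generated with NumPy's Generator API instead of the loop above.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)

side = 10.0            # assumed side length of the square region, in metres
n_points = 100_000     # number of point masses
mass_per_point = 50.0  # kg, as in the question
n_pix = 40             # pixels per axis

x = rng.uniform(0, side, n_points)
y = rng.uniform(0, side, n_points)
masses = np.full(n_points, mass_per_point)

# Mass map: total mass (kg) that falls in each pixel.
mass_map, x_edges, y_edges, _ = binned_statistic_2d(
    x, y, masses, statistic="sum", bins=n_pix, range=[[0, side], [0, side]]
)

# Area of every pixel (m^2), built from the bin edges so it also works
# for non-square pixels.
pixel_area = np.outer(np.diff(x_edges), np.diff(y_edges))

# Surface mass density map in kg per square metre.
density_map = mass_map / pixel_area
```

With this convention, multiplying density_map by pixel_area and summing recovers the total mass, which is a quick sanity check on the units.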
I have a challenging problem that I have not figured out yet. So, I have a bunch of data from a fluid flow simulation. I have two files: the spatial (x,y,z) data, which looks like this (note, I only care about 2D, so only the x and y values):
(-2 -1.5 0.1)
(5 -1.5 0.1)
(-2 -1.5 0.6)
(5 -1.5 0.6)
(-2 1.92708 0.1)
...
and its corresponding velocity_magnitude values, where each line corresponds to the velocity_x at the location on the same line of the spatial data file. For example, the value 0.08 is at (-2 -1.5 0.1).
0.08
0.07
0.1
0.34 ...
...
I want to make this into a heat map. I naively first just focused on the velocity data, reformatted into a 2D array, and showed that heatmap but the locations are all wrong. The problem is the spatial data is not in order, so doing it my way did not work. How do I combine both the x,y location with the actual velocity value to create a heatmap for my data?
If you are interested in rendering mean velocity on a heatmap, Matplotlib, NumPy and SciPy are the packages of interest. Let's investigate some options you have...
Data Visualisation
Trial Dataset
First we create a trial dataset:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
# Create trial dataset:
N = 10000
a = np.array([-10, -10, 0])
b = np.array([15, 15, 0])
x0 = 3*np.random.randn(N, 3) + a
x1 = 5*np.random.randn(N, 3) + b
x = np.vstack([x0, x1])
v0 = np.exp(-0.01*np.linalg.norm(x0-a, axis=1)**2)
v1 = np.exp(-0.01*np.linalg.norm(x1-b, axis=1)**2)
v = np.hstack([v0, v1])
# Render dataset:
axe = plt.axes(projection='3d')
axe.plot_trisurf(x[:,0], x[:,1], v, cmap='jet', alpha=0.5)
axe.set_xlabel("x")
axe.set_ylabel("y")
axe.set_zlabel("Speed")
axe.view_init(elev=25, azim=-45)
It looks like:
2D Hexagonal Histogram
The easiest way is probably to use Matplotlib hexbin function:
# Render hexagonal histogram:
pc = plt.hexbin(x[:,0], x[:,1], C=v, gridsize=20)
pc.axes.set_title("Heatmap")
pc.axes.set_xlabel("x")
pc.axes.set_ylabel("y")
pc.axes.set_aspect("equal")
cb = plt.colorbar(ax=pc.axes)
cb.set_label("Speed")
It renders:
2D Rectangular Histogram
You can also use numpy.histogram2d and Matplotlib imshow:
# Bin Counts:
c, *_ = np.histogram2d(x[:,0], x[:,1], bins=20)
# Bin Weight Sums:
s, xbin, ybin = np.histogram2d(x[:,0], x[:,1], bins=20, weights=v)
lims = [xbin.min(), xbin.max(), ybin.min(), ybin.max()]
# Render rectangular histogram:
iax = plt.imshow((s/c).T, extent=lims, origin='lower')
iax.axes.set_title("Heatmap")
iax.axes.set_xlabel("x")
iax.axes.set_ylabel("y")
iax.axes.set_aspect("equal")
cb = plt.colorbar(ax=iax.axes)
cb.set_label("Speed")
It outputs:
Linear Interpolation
As pointed out by @rioV8, your dataset seems to be spatially irregular. If you need to map it to a rectangular grid, you can use the multidimensional linear interpolator of SciPy.
from scipy import interpolate
# Create interpolator:
ndpol = interpolate.LinearNDInterpolator(x[:,:2], v)
# Create meshgrid:
xl = np.linspace(-20, 30, 20)
X, Y = np.meshgrid(xl, xl)
lims = [xl.min(), xl.max(), xl.min(), xl.max()]
# Interpolate over meshgrid:
V = ndpol(list(zip(X.ravel(),Y.ravel()))).reshape(X.shape)
# Render interpolated speeds:
iax = plt.imshow(V, extent=lims, origin='lower')
iax.axes.set_title("Heatmap")
iax.axes.set_xlabel("x")
iax.axes.set_ylabel("y")
iax.axes.set_aspect("equal")
cb = plt.colorbar(ax=iax.axes)
cb.set_label("Speed")
It renders:
Note: in this version ticks still need to be centered on each pixel.
Contours
Once you have a rectangular grid you can also draw Matplotlib contours:
# Render contours:
iax = plt.contour(X, Y, V)
iax.axes.set_title("Contours")
iax.axes.set_xlabel("x")
iax.axes.set_ylabel("y")
iax.axes.set_aspect("equal")
iax.axes.grid()
iax.axes.clabel(iax)
Data Manipulation
Based on the file formats you provided, it is easy to import it using pandas:
import io
import pandas as pd
with open("spatial.txt") as fh:
    file1 = io.StringIO(fh.read().replace("(", "").replace(")", ""))
x = pd.read_csv(file1, sep=" ", header=None).values
v = pd.read_csv("speed.txt", header=None).squeeze().values
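To tie the loading step back to the original question (unordered locations plus one value per line), here is a small sketch that pairs each location with its value and pivots the result into a grid. It reads from in-memory strings so it can be run as is; the first four speed values come from the question, while the fifth (0.2) is a made-up placeholder because the question elides it. Points sharing the same (x, y) but different z are averaged, which is an assumption about what a 2D view should show.

```python
import io

import numpy as np
import pandas as pd

# Stand-ins for the contents of spatial.txt and speed.txt.
spatial_text = """(-2 -1.5 0.1)
(5 -1.5 0.1)
(-2 -1.5 0.6)
(5 -1.5 0.6)
(-2 1.92708 0.1)
"""
speed_text = """0.08
0.07
0.1
0.34
0.2
"""  # the last value is hypothetical

# Strip the parentheses, then parse whitespace-separated columns.
cleaned = spatial_text.replace("(", "").replace(")", "")
df = pd.read_csv(io.StringIO(cleaned), sep=r"\s+", header=None,
                 names=["x", "y", "z"])

# Line i of the speed file belongs to line i of the spatial file.
df["v"] = pd.read_csv(io.StringIO(speed_text), header=None).squeeze("columns").to_numpy()

# Average over z, then pivot: rows are y values, columns are x values.
# Locations with no data become NaN.
grid = df.groupby(["y", "x"])["v"].mean().unstack("x")
print(grid)
```

Because the pivot sorts by coordinate, the order of lines in the files no longer matters. If the x and y values are regularly spaced, grid.to_numpy() can be passed to plt.imshow with origin='lower'; for irregular spacing, the interpolation approach above is probably a better fit.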
This is partially two questions:
How to center a (diverging) colormap around some given value?
How to do that and at the same time map indexes in data to values in colormap? (further explained below)
Some types of data, e.g. BMI score, have a natural mid-point. In matplotlib, there are several diverging colormaps. I want the center of the colormap, i.e. the "middle" of the spectrum to be on the "ideal" BMI score, independent of what distribution of BMI scores is plotted.
BMI class thresholds are: bmi_threshold = [16, 17, 18.5, 25, 30, 35].
In the code below I make a scatter plot of 300 random BMI values, with height on the x-axis and weight on the y-axis, as shown in the image below it.
In the first image, I have used np.digitize(bmi, bmi_threshold) as the c parameter of the ax.scatter() call, but then the colorbar values also fall in range(7), whereas I want the colorbar ticks to be in BMI scores (approx. 15-40). (bmi is the array of 300 random BMI scores corresponding to x and y.)
BMI thresholds are not evenly spread out, so the distance between digitized class indexes, e.g. between 2 and 3, will not be correctly represented if I merely change the tick labels in the colorbar.
The second image, which is produced by the code shown below, does not seem to be centered correctly at the "ideal" BMI score of 22. I tried to use the technique from "Make a scatter colorbar display only a subset of the vmin/vmax" to adjust the color range in the colorbar, but it doesn't work as I expected.
Further, I think I could emphasize the "center" aka "ideal" scores by "squeezing" the colors by setting low and high in cmap(np.linspace(low, high, 7)) to values outside [0, 1], e.g. [-0.5,1.5], but then I have even more trouble to center the colorbar.
What am I doing wrong, and how can I achieve this?
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib as mpl
np.random.seed(4242)
# Define BMI class thresholds
bmi_thresholds = np.array([16, 17, 18.5, 25, 30, 35])
# Range to sample BMIs from
max_bmi = max(bmi_thresholds)*0.9
min_bmi = min(bmi_thresholds)*0.3
# Convert meters into centimeters along x-axis
#mpl.ticker.FuncFormatter
def m_to_cm(m, pos):
    return f'{int(m*100)}'
# Number of samples
n = 300
# Heights in range 0.50 to 2.20 meters
x = np.linspace(0.5, 2.2, n)
# Random BMI values in range [min_bmi, max_bmi]
bmi = np.random.rand(n)*(max_bmi-min_bmi) + min_bmi
# Compute corresponding weights
y = bmi * x**2
# Prepare plot with labels, etc.
fig, ax = plt.subplots(figsize=(10,6))
ax.set_title(f'Random BMI values. $n={n}$')
ax.set_ylabel('Weight in kg')
ax.set_xlabel('Height in cm')
ax.xaxis.set_major_formatter(m_to_cm)
ax.set_ylim(min(y)*0.95, max(y)*1.05)
ax.set_xlim(min(x), max(x))
# plot bmi class regions (i.e. the "background")
for i in range(len(bmi_thresholds)+1):
    area_min = bmi_thresholds[i-1] if i > 0 else 0
    area_max = bmi_thresholds[i] if i < len(bmi_thresholds) else 10000#np.inf
    area_color = 'g' if i == 3 else 'y' if i in [2,4] else 'orange' if i in [1,5] else 'r'
    ax.fill_between(x, area_min * x**2, area_max * x**2, color=area_color, alpha=0.2, interpolate=True)
# Plot lines to emphasize regions, and additional bmi score lines (i.e. 10 and 40)
common_plot_kwargs = dict(alpha=0.8, linewidth=0.5)
for t in np.concatenate((bmi_thresholds, [10, 40])):
    style = 'g-' if t in [18.5, 25] else 'r-' if t in [10,40] else 'k-'
    ax.plot(x, t * x**2, style, **common_plot_kwargs)
# Compute offset from target_center to median of data range
target_center = 22
mid_bmi = np.median(bmi)
s = max(bmi) - min(bmi)
d = target_center - mid_bmi
# Use offset to normalize offset as to the range [0, 1]
high = 1 if d < 0 else (s-d)/s
low = 0 if d >= 0 else -d/s
# Use normalized offset to create custom cmap to centered around ideal BMI?
cmap = plt.get_cmap('PuOr')
colors = cmap(np.linspace(low, high, 7))
cmap = mpl.colors.LinearSegmentedColormap.from_list('my cmap', colors)
# plot random BMIs
c = np.digitize(bmi, bmi_thresholds)
sax = ax.scatter(x, y, s=15, marker='.', c=bmi, cmap=cmap)
cbar = fig.colorbar(sax, ticks=np.concatenate((bmi_thresholds, [22, 10, 40])))
plt.tight_layout()
You can use the matplotlib built-in function that does the same thing:
matplotlib.colors.TwoSlopeNorm
See: https://matplotlib.org/3.2.2/gallery/userdemo/colormap_normalizations_diverging.html
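A minimal sketch of how TwoSlopeNorm might be applied to the BMI scatter plot, assuming the colorbar should span 10 to 40 with its midpoint at the "ideal" score of 22. The data here is regenerated with NumPy's Generator API rather than copied from the question, so the exact points differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

rng = np.random.default_rng(4242)

n = 300
height = np.linspace(0.5, 2.2, n)  # metres
bmi = rng.uniform(12, 38, n)       # assumed BMI range for the demo
weight = bmi * height**2           # kg

# Map 10 -> 0.0, 22 -> 0.5 and 40 -> 1.0, linearly on each side of 22,
# so the middle colour of a diverging colormap lands on BMI 22.
norm = TwoSlopeNorm(vmin=10, vcenter=22, vmax=40)

fig, ax = plt.subplots(figsize=(10, 6))
sax = ax.scatter(height, weight, s=15, marker=".", c=bmi, cmap="PuOr", norm=norm)
cbar = fig.colorbar(sax, ticks=[10, 16, 17, 18.5, 22, 25, 30, 35, 40])
ax.set_xlabel("Height in m")
ax.set_ylabel("Weight in kg")
```

Since the norm works on the raw BMI values, the colorbar ticks stay in BMI units and no custom colormap slicing is needed. Call plt.show() to display the figure.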
I found a decent solution here:
http://chris35wills.github.io/matplotlib_diverging_colorbar/
They created a normalization class using this code:
import numpy as np
import matplotlib.colors as colors

class MidpointNormalize(colors.Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        colors.Normalize.__init__(self, vmin, vmax, clip)
    def __call__(self, value, clip=None):
        # I'm ignoring masked values and all kinds of edge cases to make a
        # simple example...
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y), np.isnan(value))
The class is used by doing something like this:
elev_max=3000; mid_val=0;
plt.imshow(ras, cmap=cmap, clim=(elev_min, elev_max), norm=MidpointNormalize(midpoint=mid_val,vmin=elev_min, vmax=elev_max))
plt.colorbar()
plt.show()
I have a three-dimensional array.
The first dimension has 4 elements.
The second dimension has 10 elements.
The third dimension has 5 elements.
I want to plot the contents of this array as follows.
Each element of the first dimension gets its own graph (four graphs on the page)
The values of the second dimension correspond to the y values of the graphs. (there are 10 lines on each graph)
The values of the third dimension correspond to the x values of the graphs (each of the 10 lines has 5 x values)
I'm pretty new to python, and even newer to graphing.
I figured out how to correctly load my array with the data...and I'm not even trying to get the 'four graphs on one page' aspect working.
For now I just want one graph to work correctly.
Here's what I have so far (once my array is set up, and I've correctly loaded my arrays. Right now the graph shows up, but it's blank, and the x-axis includes negative values. None of my data is negative)
for n in range(1):
    for m in range(10):
        for o in range(5):
            plt.plot(quadnumcounts[n][m][o])
            plt.xlabel("Trials")
            plt.ylabel("Frequency")
plt.show()
Any help would be really appreciated!
Edit. Further clarification. Let's say my array is loaded as follows:
myarray[0][1][0] = 22
myarray[0][1][1] = 10
myarray[0][1][2] = 15
myarray[0][1][3] = 25
myarray[0][1][4] = 13
I want there to be a line, with the y values 22, 10, 15, 25, 13, and the x values 1, 2, 3, 4, 5 (since it's 0 indexed, I can just +1 before printing the label)
Then, let's say I have
myarray[0][2][0] = 10
myarray[0][2][1] = 17
myarray[0][2][2] = 9
myarray[0][2][3] = 12
myarray[0][2][4] = 3
I want that to be another line, following the same rules as the first.
Here's how to make the 4 plots with 10 lines in each.
import matplotlib.pyplot as plt
for i, fig_data in enumerate(quadnumcounts):
    # Set current figure to the i'th subplot in the 2x2 grid
    plt.subplot(2, 2, i + 1)
    # Set axis labels for current figure
    plt.xlabel('Trials')
    plt.ylabel('Frequency')
    for line_data in fig_data:
        # Plot a single line
        xs = [i + 1 for i in range(len(line_data))]
        ys = line_data
        plt.plot(xs, ys)
# Now that we have created all plots, show the result
plt.show()
Here is an example of creating subplots of your data. You have not provided the dataset, so I used x as an angle from 0 to 360 degrees and y as trigonometric functions of x (sine and cosine).
Code example:
import numpy as np
import pylab as plt
x = np.arange(0, 361) # 0 to 360 degrees
y = []
y.append(1*np.sin(x*np.pi/180.0))
y.append(2*np.sin(x*np.pi/180.0))
y.append(1*np.cos(x*np.pi/180.0))
y.append(2*np.cos(x*np.pi/180.0))
z = [[x, y[0]], [x, y[1]], [x, y[2]], [x, y[3]]] # 3-dimensional array
# plot graphs
for count, (x_data, y_data) in enumerate(z):
    plt.subplot(2, 2, count + 1)
    plt.plot(x_data, y_data)
    plt.xlabel('Angle')
    plt.ylabel('Amplitude')
    plt.grid(True)
plt.show()
Output:
UPDATE:
Using the sample data you provided in your update, you could proceed as follows:
import numpy as np
import pylab as plt
y1 = (10, 17, 9, 12, 3)
y2 = (22, 10, 15, 25, 13)
y3 = tuple(reversed(y1)) # generated for explanation
y4 = tuple(reversed(y2)) # generated for explanation
mydata = [y1, y2, y3, y4]
# plot graphs
for count, y_data in enumerate(mydata):
    x_data = range(1, len(y_data) + 1)
    print(x_data)
    print(y_data)
    plt.subplot(2, 2, count + 1)
    plt.plot(x_data, y_data, '-*')
    plt.xlabel('Trials')
    plt.ylabel('Frequency')
    plt.grid(True)
plt.show()
Note that the dimensions are slightly different from yours. Here they are such that mydata[0][0] == 10, mydata[1][3] == 25 etc. The output is shown below:
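If the data really is a 4 x 10 x 5 array as described in the question, a NumPy-based sketch like the following might be more direct. The array counts is random placeholder data standing in for quadnumcounts, and the 2x2 layout is an assumption. Passing a 2D array to ax.plot draws one line per column, so transposing each 10 x 5 panel yields the 10 lines in a single call.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder for quadnumcounts: 4 graphs, 10 lines each, 5 points per line.
counts = rng.integers(0, 30, size=(4, 10, 5))

# x values 1..5, since the trials are 0-indexed in the array.
trials = np.arange(1, counts.shape[2] + 1)

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, panel in zip(axes.ravel(), counts):
    # panel has shape (10, 5); its transpose (5, 10) gives 10 columns,
    # and each column is drawn as a separate line.
    ax.plot(trials, panel.T)
    ax.set_xlabel("Trials")
    ax.set_ylabel("Frequency")
fig.tight_layout()
```

The original loop produced a blank graph because each plt.plot call received a single number; here every call receives a whole line's worth of values. Call plt.show() to display the figure.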
Suppose I create a histogram using scipy/numpy, so I have two arrays: one for the bin counts, and one for the bin edges. If I use the histogram to represent a probability distribution function, how can I efficiently generate random numbers from that distribution?
It's probably what np.random.choice does in @Ophion's answer, but you can construct a normalized cumulative distribution function, then choose based on a uniform random number:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)
bin_midpoints = bins[:-1] + np.diff(bins)/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]
plt.subplot(121)
plt.hist(data, 50)
plt.subplot(122)
plt.hist(random_from_cdf, 50)
plt.show()
A 2D case can be done as follows:
data = np.column_stack((np.random.normal(scale=10, size=1000),
np.random.normal(scale=20, size=1000)))
x, y = data.T
hist, x_bins, y_bins = np.histogram2d(x, y, bins=(50, 50))
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
cdf = np.cumsum(hist.ravel())
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = random_from_cdf.T
plt.subplot(121, aspect='equal')
plt.hist2d(x, y, bins=(50, 50))
plt.subplot(122, aspect='equal')
plt.hist2d(new_x, new_y, bins=(50, 50))
plt.show()
@Jaime's solution is great, but you should consider using the KDE (kernel density estimation) of the histogram. A great explanation of why it's problematic to do statistics over a histogram, and why you should use KDE instead, can be found here
I edited @Jaime's code to show how to use KDE from scipy. It looks almost the same, but better captures the distribution that generated the histogram.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
def run():
    data = np.random.normal(size=1000)
    hist, bins = np.histogram(data, bins=50)
    x_grid = np.linspace(min(data), max(data), 1000)
    kdepdf = kde(data, x_grid, bandwidth=0.1)
    random_from_kde = generate_rand_from_pdf(kdepdf, x_grid)
    bin_midpoints = bins[:-1] + np.diff(bins) / 2
    random_from_cdf = generate_rand_from_pdf(hist, bin_midpoints)
    plt.subplot(121)
    # density=True replaces the normed=True argument removed from Matplotlib
    plt.hist(data, 50, density=True, alpha=0.5, label='hist')
    plt.plot(x_grid, kdepdf, color='r', alpha=0.5, lw=3, label='kde')
    plt.legend()
    plt.subplot(122)
    plt.hist(random_from_cdf, 50, alpha=0.5, label='from hist')
    plt.hist(random_from_kde, 50, alpha=0.5, label='from kde')
    plt.legend()
    plt.show()
def kde(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scipy"""
    kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1), **kwargs)
    return kde.evaluate(x_grid)
def generate_rand_from_pdf(pdf, x_grid):
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    values = np.random.rand(1000)
    value_bins = np.searchsorted(cdf, values)
    random_from_cdf = x_grid[value_bins]
    return random_from_cdf
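When the raw data is still available (and not only the histogram), SciPy's gaussian_kde object can also draw samples directly through its resample method, which skips the grid and the manual CDF. This is a short sketch under that assumption; it uses the default bandwidth rather than the bandwidth=0.1 chosen above, and the name kde_model is made up for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Fit a Gaussian KDE to the data using SciPy's default bandwidth rule.
kde_model = gaussian_kde(data)

# resample returns an array of shape (n_dimensions, size); the data is 1D,
# so row 0 holds the new samples. The seed makes the draw repeatable.
samples = kde_model.resample(size=5000, seed=1)[0]
```

The samples are continuous values rather than grid points or bin midpoints, so a histogram of them should not show the discretisation of the grid-based approach.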
Perhaps something like this. Uses the count of the histogram as a weight and chooses values of indices based on this weight.
import numpy as np
initial=np.random.rand(1000)
values,indices=np.histogram(initial,bins=20)
values=values.astype(np.float32)
weights=values/np.sum(values)
#Below, 5 is the dimension of the returned array.
new_random=np.random.choice(indices[1:],5,p=weights)
print(new_random)
#[ 0.55141614 0.30226256 0.25243184 0.90023117 0.55141614]
I had the same problem as the OP and I would like to share my approach to this problem.
Following Jaime answer and Noam Peled answer I've built a solution for a 2D problem using a Kernel Density Estimation (KDE).
First, let's generate some random data and then calculate its Probability Density Function (PDF) from the KDE. I will use the example available in SciPy for that.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots()
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(m1, m2, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
And the plot is:
Now, we obtain random data from the PDF obtained from the KDE, which is the variable Z.
# Generate the bins for each axis
x_bins = np.linspace(xmin, xmax, Z.shape[0]+1)
y_bins = np.linspace(ymin, ymax, Z.shape[1]+1)
# Find the middle point for each bin
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
# Calculate the Cumulative Distribution Function(CDF)from the PDF
cdf = np.cumsum(Z.ravel())
cdf = cdf / cdf[-1] # Normalization
# Create random data
values = np.random.rand(10000)
# Find the data position
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
# Create the new data
new_data = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = new_data.T
And we can calculate the KDE from this new data and then plot it.
kernel = stats.gaussian_kde(new_data.T)
new_Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots()
ax.imshow(np.rot90(new_Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(new_x, new_y, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
Here is a solution, that returns datapoints that are uniformly distributed within each bin instead of the bin center:
import numpy as np

def draw_from_hist(hist, bins, nsamples = 100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples)*max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
A few things do not work well for the solutions suggested by @daniel, @arco-bast, et al.
Taking the last example
def draw_from_hist(hist, bins, nsamples = 100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples)*max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
This assumes that at least the first bin has zero content, which may or may not be true. Secondly, this assumes that the value of the PDF is at the upper bound of the bins, which it isn't - it's mostly in the centre of the bin.
Here's another solution done in two parts
def init_cdf(hist,bins):
    """Initialize CDF from histogram

    Parameters
    ----------
    hist : array-like, float of size N
        Histogram height
    bins : array-like, float of size N+1
        Histogram bin boundaries

    Returns
    -------
    cdf : array-like, float of size N+1
    """
    from numpy import concatenate, diff, cumsum
    # Calculate half bin sizes
    steps = diff(bins) / 2 # Half bin size
    # Calculate slope between bin centres
    slopes = diff(hist) / (steps[:-1]+steps[1:])
    # Find height of end points by linear interpolation
    # - First part is linear interpolation from second over first
    #   point to lowest bin edge
    # - Second part is linear interpolation left neighbor to
    #   right neighbor up to but not including last point
    # - Third part is linear interpolation from second to last point
    #   over last point to highest bin edge
    # Can probably be done more elegantly
    ends = concatenate(([hist[0] - steps[0] * slopes[0]],
                        hist[:-1] + steps[:-1] * slopes,
                        [hist[-1] + steps[-1] * slopes[-1]]))
    # Calculate cumulative sum
    cdf = cumsum(ends)
    # Subtract off lower bound and scale by upper bound
    cdf -= cdf[0]
    cdf /= cdf[-1]
    # Return the CDF
    return cdf
def sample_cdf(cdf,bins,size):
    """Sample a CDF defined at specific points.

    Linear interpolation between defined points

    Parameters
    ----------
    cdf : array-like, float, size N
        CDF evaluated at all points of bins. First and
        last point of bins are assumed to define the domain
        over which the CDF is normalized.
    bins : array-like, float, size N
        Points where the CDF is evaluated. First and last points
        are assumed to define the end-points of the CDF's domain
    size : integer, non-zero
        Number of samples to draw

    Returns
    -------
    sample : array-like, float, of size ``size``
        Random sample
    """
    from numpy import interp
    from numpy.random import random
    return interp(random(size), cdf, bins)
# Begin example code
import numpy as np
import matplotlib.pyplot as plt
# initial histogram, coarse binning
hist,bins = np.histogram(np.random.normal(size=1000),np.linspace(-2,2,21))
# Calculate CDF, make sample, and new histogram w/finer binning
cdf = init_cdf(hist,bins)
sample = sample_cdf(cdf,bins,1000)
hist2,bins2 = np.histogram(sample,np.linspace(-3,3,61))
# Calculate bin centres and widths
mx = (bins[1:]+bins[:-1])/2
dx = np.diff(bins)
mx2 = (bins2[1:]+bins2[:-1])/2
dx2 = np.diff(bins2)
# Plot, taking care to show uncertainties and so on
plt.errorbar(mx,hist/dx,np.sqrt(hist)/dx,dx/2,'.',label='original')
plt.errorbar(mx2,hist2/dx2,np.sqrt(hist2)/dx2,dx2/2,'.',label='new')
plt.legend()
Sorry, I don't know how to get this to show up in StackOverflow, so copy'n'paste and run to see the point.
I stumbled upon this question when I was looking for a way to generate a random array based on a distribution of another array. If this would be in numpy, I would call it random_like() function.
Then I realized, I have written a package Redistributor which might do this for me even though the package was created with a bit different motivation (Sklearn transformer capable of transforming data from an arbitrary distribution to an arbitrary known distribution for machine learning purposes). Of course I understand unnecessary dependencies are not desired, but at least knowing this package might be useful to you someday. The thing OP asked about is basically done under the hood here.
WARNING: under the hood, everything is done in 1D. The package also implements multidimensional wrapper, but I have not written this example using it as I find it to be too niche.
Installation:
pip install git+https://gitlab.com/paloha/redistributor
Implementation:
import numpy as np
import matplotlib.pyplot as plt
def random_like(source, bins=0, seed=None):
    from redistributor import Redistributor
    np.random.seed(seed)
    noise = np.random.uniform(source.min(), source.max(), size=source.shape)
    s = Redistributor(bins=bins, bbox=[source.min(), source.max()]).fit(source.ravel())
    s.cdf, s.ppf = s.source_cdf, s.source_ppf
    r = Redistributor(target=s, bbox=[noise.min(), noise.max()]).fit(noise.ravel())
    return r.transform(noise.ravel()).reshape(noise.shape)
source = np.random.normal(loc=0, scale=1, size=(100,100))
t = random_like(source, bins=80) # More bins more precision (0 = automatic)
# Plotting
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title(f'Distribution of source data, shape: {source.shape}')
plt.hist(source.ravel(), bins=100)
plt.subplot(122); plt.title(f'Distribution of generated data, shape: {t.shape}')
plt.hist(t.ravel(), bins=100); plt.show()
Explanation:
import numpy as np
import matplotlib.pyplot as plt
from redistributor import Redistributor
from sklearn.metrics import mean_squared_error
# We have some source array with "some unknown" distribution (e.g. an image)
# For the sake of example we just generate a random gaussian matrix
source = np.random.normal(loc=0, scale=1, size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Source data'); plt.imshow(source, origin='lower')
plt.subplot(122); plt.title('Source data hist'); plt.hist(source.ravel(), bins=100); plt.show()
# We want to generate a random matrix from the distribution of the source
# So we create a random uniformly distributed array called noise
noise = np.random.uniform(source.min(), source.max(), size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Uniform noise'); plt.imshow(noise, origin='lower')
plt.subplot(122); plt.title('Uniform noise hist'); plt.hist(noise.ravel(), bins=100); plt.show()
# Then we fit (approximate) the source distribution using Redistributor
# This step internally approximates the cdf and ppf functions.
s = Redistributor(bins=200, bbox=[source.min(), source.max()]).fit(source.ravel())
# A little naming workaround to make obj s work as a target distribution
s.cdf = s.source_cdf
s.ppf = s.source_ppf
# Here we create another Redistributor but now we use the fitted Redistributor s as a target
r = Redistributor(target=s, bbox=[noise.min(), noise.max()])
# Here we fit the Redistributor r to the noise array's distribution
r.fit(noise.ravel())
# And finally, we transform the noise into the source's distribution
t = r.transform(noise.ravel()).reshape(noise.shape)
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Transformed noise'); plt.imshow(t, origin='lower')
plt.subplot(122); plt.title('Transformed noise hist'); plt.hist(t.ravel(), bins=100); plt.show()
# Computing the difference between the two arrays
print('Mean Squared Error between source and transformed: ', mean_squared_error(source, t))
Mean Squared Error between source and transformed: 2.0574123162302143