Finding corresponding bins between two data sets

So I have two data sets which overlap in their parameter space:
I want to bin up the red set and find the standard deviation of each bin. Then for each point in the blue set, I want to find which red bin that point corresponds to and grab the standard deviation calculated for that bin.
So far, I've been using scipy.stats.binned_statistic_2d, but I'm not sure where to go from here:
import scipy.stats
import numpy as np
# given numpy recarrays red_set and blue_set with columns x, y, values
nbins = 50
red_bins = scipy.stats.binned_statistic_2d(red_set['x'],
                                           red_set['y'],
                                           red_set['values'],
                                           statistic=np.std,
                                           bins=nbins)
# reuse the red bin edges so both sets are binned identically
blue_bins = scipy.stats.binned_statistic_2d(blue_set['x'],
                                            blue_set['y'],
                                            blue_set['values'],
                                            statistic='count',
                                            bins=[red_bins[1], red_bins[2]])
Now, I don't know how to get the value of the corresponding red bin for each blue point. I know that the fourth return value of scipy.stats.binned_statistic_2d is a binnumber for each input data point, but I don't know how to translate that to the actual calculated statistic (standard deviation in this example).
I know that the blue set is getting binned exactly the same as the red (a quick plot will confirm this). It seems like it should be totally straightforward to grab the corresponding red bin, but I can't figure it out.
Let me know if I can make my question clearer

You need to specify the same range when binning both data sets, so that the bin indices correspond. I've used the lower-level NumPy function np.histogram2d here; the extension to standard deviations can be done in the same way using scipy.stats.binned_statistic_2d:
import numpy as np
import matplotlib.pyplot as plt
#Setup random data
red = np.random.randn(100,2)
blue = np.random.randn(100,2)
#plot
plt.plot(red[:,0],red[:,1],'r.')
plt.plot(blue[:,0],blue[:,1],'b.')
#Specify limits of binned data
xmin = -3.; xmax = 3.
ymin = -3.; ymax = 3.
#Bin data using np.histogram2d
rbins, xrb, yrb = np.histogram2d(red[:,0],red[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
bbins, xbb, ybb = np.histogram2d(blue[:,0],blue[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
#Check that bins correspond to the same positions in space
assert all(xrb == xbb)
assert all(yrb == ybb)
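#Obtain centers of the bins and plot the difference
xc = xrb[:-1] + 0.5 * (xrb[1:] - xrb[:-1])
yc = yrb[:-1] + 0.5 * (yrb[1:] - yrb[:-1])
#histogram2d puts x along the first axis, contourf expects x along the second, hence the transpose
plt.contourf(xc, yc, (rbins-bbins).T, alpha=0.4)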
plt.colorbar()
plt.show()
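To get from a blue point to the red standard deviation itself, one option is to locate each blue point in the red bin edges and use the resulting indices to look up the red statistic array. The following is only a minimal sketch under assumptions: it uses synthetic data in place of red_set and blue_set, and the names red_xy, red_values, blue_xy and blue_std are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the red and blue sets (assumed, not the asker's data)
rng = np.random.default_rng(0)
red_xy = rng.normal(size=(2000, 2))
red_values = rng.normal(size=2000)
blue_xy = rng.normal(size=(500, 2))

nbins = 10
limits = [[-3.0, 3.0], [-3.0, 3.0]]

# Standard deviation of the red values in each 2D bin
red = stats.binned_statistic_2d(red_xy[:, 0], red_xy[:, 1], red_values,
                                statistic='std', bins=nbins, range=limits)

# Zero-based bin index of every blue point along x and y, using the red edges
ix = np.digitize(blue_xy[:, 0], red.x_edge) - 1
iy = np.digitize(blue_xy[:, 1], red.y_edge) - 1

# Points outside the binned range have no red bin; leave them as NaN
inside = (ix >= 0) & (ix < nbins) & (iy >= 0) & (iy < nbins)
blue_std = np.full(len(blue_xy), np.nan)
blue_std[inside] = red.statistic[ix[inside], iy[inside]]

print(blue_std[:5])
```

Blue points that fall in a red bin with no red points also come out as NaN, because the statistic is undefined there.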

Related

Fast Fourier Plot in Python

I have vibration data in the time domain and want to convert it to the frequency domain with an FFT. However, the plot of the FFT only shows a big spike at zero and nothing else.
This is my vibration data: https://pastebin.com/7RK57kJW
My code:
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(3000)
a1_fft= np.fft.fft(a1, axis=0)
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, a1_fft)
My FFT Plot:
What am I doing wrong here? I am pretty sure my data is uniform, which in other cases provokes a similar problem with the FFT.

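The bins of the FFT correspond to the frequencies 0, df, 2df, 3df, ..., F-2df, F-df, where df is determined by the number of bins and F is the sampling frequency (1 cycle per sample with the default spacing of np.fft.fftfreq, which reports the upper half of the bins as negative frequencies).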
Notice the zero frequency at the beginning. This is called the DC offset. It's the mean of your data. In the data that you show, the mean is ~1.32, while the amplitude of the sine wave is around 0.04. It's not surprising that you can't see a peak that's 33x smaller than the DC term.
There are some common ways to visualize the data that help you get around this. One common method is to keep the DC offset but use a log scale, at least for the y-axis. Plot the magnitude, since a log scale cannot show complex or negative values:
plt.semilogy(freq, np.abs(a1_fft))
OR
plt.loglog(freq, np.abs(a1_fft))
Another thing you can do is zoom in on the bottom 1/33rd or so of the plot. You can do this manually, or by adjusting the span of the displayed Y-axis:
p = np.abs(a1_fft[1:]).max() * [-1.1, 1.1]
plt.ylim(p)
If you are plotting the absolute values already, use
p = np.abs(a1_fft[1:]).max() * [-0.1, 1.1]
Another method is to remove the DC offset. A more elegant way of doing this than what @J. Schmidt suggests is to simply not display the DC term:
plt.plot(freq[1:], a1_fft[1:])
Or for the positive frequencies only:
n = freq.size
plt.plot(freq[1:n//2], a1_fft[1:n//2])
The cutoff at n // 2 is only approximate. The correct cutoff depends on whether the FFT has an even or odd number of elements. For even numbers, the middle bin actually has energy from both sides of the spectrum and often gets special treatment.

The peak at 0 is the DC-gain, which is very high since you didn't normalize your data. Also, the Fourier transform is a complex number, you should plot the absolute value and phase separately. In this code I also plotted only the positive frequencies:
import numpy as np
import matplotlib.pyplot as plt
#Import data
a1 = np.loadtxt('a1.txt')
plt.plot(a1)
#Normalize a1
a1 -= np.mean(a1)
#Your code
t = np.arange(3000)
a1_fft= np.fft.fft(a1, axis=0)
freq = np.fft.fftfreq(t.shape[-1])
#Only plot positive frequencies
plt.figure()
plt.plot(freq[freq>=0], np.abs(a1_fft)[freq>=0])
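Both answers can be checked on a signal with a known spectrum. The sketch below is an illustration only: it builds a synthetic signal with the mean (1.32) and amplitude (0.04) quoted above, at an assumed frequency f0 that is not taken from the real data, removes the DC offset, and uses np.fft.rfft so that only the non-negative frequencies are returned.

```python
import numpy as np

n = 3000
t = np.arange(n)
f0 = 0.05  # assumed test frequency in cycles per sample (hypothetical)
a1 = 1.32 + 0.04 * np.sin(2 * np.pi * f0 * t)

# Remove the DC offset, then transform; rfft keeps only frequencies >= 0
spectrum = np.fft.rfft(a1 - a1.mean())
freq = np.fft.rfftfreq(n)

# Scale the magnitude so that a sine of amplitude A shows a peak of height A
amplitude = 2 * np.abs(spectrum) / n

peak_freq = freq[np.argmax(amplitude)]
peak_amp = amplitude.max()
print(peak_freq, peak_amp)
```

Because f0 * n is a whole number here, the sine falls exactly on one bin; with real data the energy usually spreads over neighbouring bins.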

Distance between two group of values in a numpy array

I have a very basic problem that is easy in theory (with fewer points and a lot of manual labour in ArcGIS), but I cannot work out how to start coding a solution (I am also new to more complicated Python coding).
I have 2 variables, 'Root zone' (RZS) and 'Tree cover' (TC). Both are arrays of 250x186 values (basically grids, with each grid cell having a specific value). The values in TC vary from 0 to 100. Each grid cell is 0.25 degrees in size (which might help in understanding the distance).
My problem: I want to calculate the distance from each grid cell whose TC value lies between 50 and 100 to the nearest grid cell whose TC value lies between 0 and 30.
Note that we are not looking at the np.nan part of TC. The white part in TC is also white in RZS.
What I want to do is create a 2-dimensional scatter plot with the X-axis denoting the 'distance of 50-100 TC from 0-30 values' and the Y-axis denoting the 'RZS of those 50-100 TC points'. The above figure might make things more clear.
I wish I could provide some code for this, but I am not even able to start on the distance part.
Please suggest how I should proceed.
Let's consider an example:
If you look at the x: 70 and y:70, one can see a lot of points with values from 0-30 of the tree cover all across the dataset. But I only want the distance from the nearest value to my point which falls between 0-30.

The following code might work, with random example data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Create some completely random data, and include an area of NaNs as well
rzs = np.random.uniform(0, 100, size=(250, 186))
tc = np.random.lognormal(3.0, size=(250, 186))
tc = np.clip(tc, 0, 100)
rzs[60:80,:] = np.nan
tc[60:80,:] = np.nan
plt.subplot(2,2,1)
plt.imshow(rzs)
plt.colorbar()
plt.subplot(2,2,2)
plt.imshow(tc)
plt.colorbar()
Now do the real work:
# Select the low- and high-valued points.
# This may result in warnings here because of NaNs;
# the NaNs are filtered out, since they compare to False
# in all the comparisons, and thus are not
# selected by 'low' and 'high'
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)
# Get the coordinates for the low- and high-valued points,
# combine and transpose them to be in the correct format
y, x = np.where(low)
low_coords = np.array([x, y]).T
y, x = np.where(high)
high_coords = np.array([x, y]).T
# We now calculate the distances between *all* high-valued points and *all* low-valued points.
# This calculation scales as O(N*M) for N high and M low points, as does the memory cost
# (of the output), so be wary when using it with large input sizes.
from scipy.spatial.distance import cdist
distances = cdist(high_coords, low_coords)
# Now find, for each high-valued point, the minimum distance along
# the axis of the low-valued coords, which here is the second axis.
mindistances = distances.min(axis=1)
# rzs[high] lists the RZS values in the same (row-major) order
# as np.where(high), so it lines up with mindistances.
minrzs = rzs[high]
plt.scatter(mindistances, minrzs)
The resulting plot looks a bit weird, since there are rather discrete distances because of the grid (1, sqrt(1^2+1^2), 2, sqrt(1^2+2^2), sqrt(2^2+2^2), 3, sqrt(1^2+3^2), ...); this is because the TC values are randomly distributed, and thus low values may end up directly adjacent to high values (and because we're looking for minimum distances, most plotted points are for these cases). The vertical distribution is because the RZS values were uniformly distributed between 0 and 100.
This is simply a result of the input example data, which is not too representative of the real data.
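For grids of this size, the full distance matrix can become very large. A possible alternative, sketched here under assumptions rather than taken from the answer, is a k-d tree (scipy.spatial.cKDTree) built on the low-valued cells and queried with the high-valued cells. The data is random again, and the conversion to degrees simply multiplies by the 0.25 degree cell size from the question, which ignores the fact that a degree of longitude shrinks with latitude.

```python
import numpy as np
from scipy.spatial import cKDTree

# Random example data with a band of NaNs (assumed, not the real grids)
rng = np.random.default_rng(0)
tc = np.clip(rng.lognormal(3.0, size=(250, 186)), 0, 100)
rzs = rng.uniform(0, 100, size=(250, 186))
tc[60:80, :] = np.nan
rzs[60:80, :] = np.nan

# NaN compares False, so NaN cells end up in neither mask
low = tc <= 30
high = tc >= 50

# (row, column) index pairs, listed in row-major order
low_coords = np.argwhere(low)
high_coords = np.argwhere(high)

# For every high cell, distance (in grid cells) to the nearest low cell
tree = cKDTree(low_coords)
dist_cells, nearest = tree.query(high_coords)

dist_deg = dist_cells * 0.25   # rough conversion using the 0.25 degree cell size
rzs_high = rzs[high]           # same row-major order as high_coords

print(dist_deg[:5], rzs_high[:5])
```

dist_deg and rzs_high can then be passed to plt.scatter in the same way as mindistances and minrzs above.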

Matplotlib: How to make a histogram with bins of equal area?

Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see the related question).

I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))
x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (called normed=True in older versions of matplotlib), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that have the same number of points.
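The equal-area claim can be checked without plotting. This is a small sketch, assuming 1000 normally distributed samples and 10 bins, that repeats the edge calculation and uses np.histogram in place of plt.hist:

```python
import numpy as np

def histedges_equalN(x, nbin):
    """Bin edges chosen so that each bin holds the same number of points."""
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

edges = histedges_equalN(x, 10)
density, _ = np.histogram(x, bins=edges, density=True)

# Area of each bar = height * width; for a density these sum to 1
areas = density * np.diff(edges)
print(areas)
```

Each of the 10 bars should come out with an area of about 0.1, since every bin holds one tenth of the points.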
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))
n, bins, patches = plt.hist(x, histedges_equalA(x, 10), density=False)
These boxes, however, are not all equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
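The jump can be reproduced with a few lines of arithmetic. In this sketch the helper bar_area and the points after -3 are made up for illustration; only the outlier at -13 and the next value at -3 come from the example above.

```python
def bar_area(points, left, right):
    """Area of one histogram bar: number of points in [left, right) times the bar width."""
    count = sum(left <= p < right for p in points)
    return count * (right - left)

# Outlier at -13, next value at -3; the remaining values are hypothetical
points = [-13.0, -3.0, -2.5, -2.0]

# Right edge just before the second point: one point, width 10
just_before = bar_area(points, -13.0, -3.0)

# Right edge just past the second point: two points, width still about 10
just_after = bar_area(points, -13.0, -3.0 + 1e-9)

print(just_before, just_after)
```

No choice of right edge between these two positions exists, so the bar area cannot take any value between 10 and 20.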

Need help weighting (scaling) each of the bins in a histogram by a different factor

I'm trying to make a histogram of the radial distribution of a circular scattering of particles, and I'm trying to scale the histogram so that the radial distribution is in particles per unit area.
Disclaimer: If you don't care about the math behind what I'm talking about, just skip over this section:
I'm splitting the radial distribution into annuli of equal width, going out from the center. So, in the center, I will have a circle of some radius, a. The area of this innermost portion will be $\pi a^{2}$.
Now if we want to know the area of the annulus going from radial distance a to 2a, we do $$ \int_{a}^{2a} 2 \pi r \ dr = 3 \pi a^{2} $$
Continuing in a similar fashion (going from 2a to 3a, 3a to 4a, etc.) we see that the areas increase as follows: $$ Areas = \pi a^{2}, 3 \pi a^{2}, 5 \pi a^{2}, 7 \pi a^{2}, ... $$
So, when I weight the histogram for the radial distribution of my scatter, going out from the center, each bin will have to be weighted so that the count of first bin is left alone, the count of the second bin is divided by 3, the count of the third bin is divided by 5, etc, etc.
So: Here's my try at the code:
import numpy as np
import matplotlib.pyplot as plt
# making random sample of 100000 points between -2.5 and 2.5
y_vec = 5*np.random.random(100000) - 2.5
z_vec = 5*np.random.random(100000) - 2.5
# blank canvasses for the y, z, and radial arrays
y_vec2 = []
z_vec2 = []
R_vec = []
# number of bins I want in the ending histogram
bns = 40
# cutting out the random samplings that aren't in a circular distribution
# and making the radial array
for i in range(0, 100000):
    if np.sqrt((y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i])) <= 2.5:
        y_vec2.append(y_vec[i])
        z_vec2.append(z_vec[i])
        R_vec.append(np.sqrt(y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i]))
# setting up the figures and plots
fig, ax = plt.subplots()
fig2, hst = plt.subplots()
# creating a weighting array for the histogram
wghts = []
i = 0
c = 1
# making the weighting array so that each of the bins will be weighted correctly
# (splitting the radial array up evenly in to groups of the size the bins will be
# and weighting them appropriately). I assumed the because the documentation says
# the "weights" array has to be the same size as the "x" initial input, that the
# weights act on each point individually...
while i < bns:
    wghts.extend((1/c)*np.ones(len(R_vec)/bns))
    c = c + 2
    i = i + 1
# Making the plots
ax.scatter(y_vec2, z_vec2)
hst.hist(R_vec, bins = bns, weights = wghts)
# plotting
plt.show()
The scatter plot looks great:
But the radial plot suggests that I got the weighting wrong. It should be constant across all annuli, but it is increasing, as though it were not weighted at all:
The erratic look of the Radial Distribution suggests to me that the weighting function in the "hist" operator weights each member of R_vec individually instead of weighting the bins.
How would I weight the bins by the factors I need to scale them by? Any help?

You are correct when you surmise that the weights weight the individual values and not the bins. This is documented:
Each value in x only contributes its associated weight towards the bin count (instead of 1).
Therefore the basic problem is that, in calculating the weights, you aren't taking account of the order of the points. You created points at random, but then you create the weights in sequence from greatest to least. This means you're not assigning the right weights to the right points.
The way you should create the weights is by directly computing each point's weight from its radius. The way you seem to want to do this is by discretizing the radius into a binned radius, then weighting inversely by that. Instead of what you're doing for the weights, try this:
R_vec = np.array(R_vec)
wghts = 1 / (2*(R_vec//(2.5/bns))+1)
This gives me the right result:
You can also get essentially the same result without doing the binning in the weighting, that is, by directly weighting each point by the reciprocal of its radius:
R_vec = np.array(R_vec)
wghts = 1 / R_vec
The advantage of doing this is that you can then plot a histogram with a different number of bins without recomputing the weights. It also makes somewhat more conceptual sense to weight each point by how far out it is in a continuous sense, not by whether it falls on one side or the other of a discrete bin boundary.
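As a rough check of the per-point weighting, the sketch below regenerates a uniform disc of points with vectorized NumPy instead of the loop, applies the binned weights, and converts the weighted counts to particles per unit area. The names yz, r, width and density are illustrative; for 100000 points on a 5 x 5 square the expected value is about 4000 per unit area.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform points on the square [-2.5, 2.5]^2, then keep those inside the disc
yz = rng.uniform(-2.5, 2.5, size=(100000, 2))
r = np.hypot(yz[:, 0], yz[:, 1])
r = r[r <= 2.5]

bns = 40
width = 2.5 / bns

# Weight each point by 1 / (2k + 1), where k is the index of its annulus
weights = 1.0 / (2 * np.floor(r / width) + 1)
hist, edges = np.histogram(r, bins=bns, range=(0, 2.5), weights=weights)

# Annulus k has area (2k + 1) * pi * width**2, so dividing the weighted
# counts by pi * width**2 gives particles per unit area
density = hist / (np.pi * width**2)
print(density.round())
```

The innermost bins hold few points and therefore scatter more around the expected value than the outer ones.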

When you want to plot something "per unit area", use area as your independent variable.
This way, you can still use a histogram if you like, but you don't have to worry about non-uniform binning or weighting.
I replaced your line:
hst.hist(R_vec, bins = bns, weights = wghts)
with:
hst.hist(np.pi*np.square(R_vec),bins=bns)

probability density function from histogram in python to fit another histogram

I have a question concerning fitting and getting random numbers.
Situation is as such:
Firstly I have a histogram from data points.
import numpy as np
"""create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu,sigma,n)
""" create normalized histogram of the data """
bins = np.linspace(0,20,21)
H, bins = np.histogram(datapoints,bins,density=True)
I would like to interpret this histogram as probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers AND also I would like to use that function to fit another histogram.
Thanks for your help

You can use a cumulative distribution function to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth cumulative distribution function is not entirely trivial. You can use interpolation, for example scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), use any other form of tail appropriate to your problem, or simply truncate the distribution.
Example:
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len( b )):
    u = random.uniform( x[0], x[-1] )
    b[i] = inverse_density_function( u )
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()
This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
P.S. You could also try to fit a specific known distribution described by a few values (which I think is what you had mentioned in the question) but the above non-parametric approach is more general-purpose.
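The same idea can be written without the Python loop. The following is a sketch under assumptions, not a replacement for the example above: it evaluates the cumulative distribution at the bin edges (starting from 0 at the left-most edge) and inverts it with np.interp, which truncates the distribution at the outer bin edges. The data uses the mu = 10, sigma = 5 values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(10, 5, size=10000)

# Normalized histogram: the bar areas sum to 1
counts, edges = np.histogram(data, bins=100, density=True)

# Cumulative distribution evaluated at the bin edges, from 0 up to 1
cdf = np.concatenate(([0.0], np.cumsum(counts * np.diff(edges))))

# Inverse transform sampling: map uniform numbers through the inverse CDF
u = rng.uniform(0, 1, size=10000)
samples = np.interp(u, cdf, edges)

print(samples.mean(), samples.std())
```

The new samples are confined to the range covered by the histogram, so the tails beyond the first and last edge are not reproduced.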
