I have a grid of model spectra, which have a constant, very high spectral resolution, and I need to down-sample them to a lower resolution, while preserving the total number of counts.
In essence, if the first 5 bins have (nominal center-of-bin) wavelengths [7.8, 7.81, 7.82, 7.83, 7.84], and the values [1.01, 1.02, 1.015, 1.014, 1.02], and my desired bins are some (non-integer) factor (say, 2.5) times as wide, I want my new spectrum to have nominal wavelengths [7.81, 7.83] and values [1.01+1.02+0.5*1.015, 0.5*1.015+1.014+1.02] (in general, though, the bins are not lined up as well, so you may get fractions of bins on either side).
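To make that concrete, here is a small sketch of the overlap-weighted rebinning I have in mind (toy numbers from the example above; I'm assuming each value covers center ± half the old spacing):

import numpy as np

old_centers = np.array([7.80, 7.81, 7.82, 7.83, 7.84])
old_values = np.array([1.01, 1.02, 1.015, 1.014, 1.02])
dx_old = 0.01

# Old bin edges, assuming each value covers [center - dx/2, center + dx/2].
old_edges = np.concatenate([old_centers - dx_old / 2, [old_centers[-1] + dx_old / 2]])

# Two new bins, each 2.5 old bins wide.
new_edges = np.array([7.795, 7.820, 7.845])

# Fraction of each old bin that falls inside each new bin (overlap / old width).
lo = np.maximum(old_edges[None, :-1], new_edges[:-1, None])
hi = np.minimum(old_edges[None, 1:], new_edges[1:, None])
weights = np.clip(hi - lo, 0, None) / dx_old

new_values = weights @ old_values
print(new_values)  # [1.01 + 1.02 + 0.5*1.015, 0.5*1.015 + 1.014 + 1.02]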
I'll call my grid spec_ssp, and it has a shape of (93, 16, 39848). Wavelength varies along axis 2, and the first two axes are other parameters. I also have the nominal (central) wavelengths for each wavelength bin (technically, they're the log of the wavelength, but that shouldn't matter), called logL_ssp, and the desired new spacing of the logL grid, dlogL_new. I can figure out the nominal logL spacing of my templates, dlogL_ssp, by calculating np.median(logL_ssp[1:] - logL_ssp[:-1]), and it's about 20% of the desired logL spacing. We'll call that fraction f.
I originally tried to use scipy.ndimage.zoom, using the aforementioned factor f, but discovered that it gives me an array that's downsampled by a factor of exactly 4. I need an exact resampling, so this won't work.
Next, I tried linearly interpolating along axis 2 using scipy.interpolate.interp1d, after setting up new bin limits, with the aim of integrating the spectra in my grid using scipy.integrate.quad between successive bin limits, effectively getting a more or less rigorous estimate of the total light in each of my new bins. However, quad doesn't play nicely with interp1d's interpolators (quad doesn't like non-float inputs). And, since I have ~1500 model spectra, the whole thing takes ages to run while iterating over all three axes (yes, I'm only making a new interpolator once per model spectrum).
Any ideas how to tackle this?
I have working code that plots a bivariate Gaussian distribution. The distribution is produced by adjusting the COV matrix to account for specific variables. Specifically, a radius is applied to every XY coordinate. The COV matrix is then adjusted by a scaling factor to expand the radius in the x-direction and contract it in the y-direction, with the direction of this given by theta. The output is expressed as a probability density function (PDF).
I have normalised the PDF values. However, I'm computing a separate PDF for each frame. As such, the maximum value changes, and hence the probability is transformed differently for each frame.
Question: Using @Prasanth's suggestion, is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Below is the function I'm currently using to normalise the PDF for a single frame.
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
Is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
It is indeed possible. In your case you probably need to rescale your arrays between two values, say -1 and 1, before plotting, so that the minimum becomes -1, the maximum becomes 1, and the intermediate values are scaled accordingly.
You could also choose 0 and 1, or whatever pair you like, as the minimum and maximum, but let's go with -1 and 1 so that the middle value is 0.
To do this, in your code replace:
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
with:
renormPDF = PDFs[0] - PDFs[1]
renormPDF -= renormPDF.min()                      # shift so the minimum is 0
normPDF = (renormPDF * 2 / renormPDF.max()) - 1   # scale to [0, 2], then shift down to [-1, 1]
These three lines ensure that normPDF.min() == -1 and normPDF.max() == 1.
Now, when plotting the animation, the axis on the right of your image does not change.
Your remaining problem is to find the maximum of PDFs[0].max() and PDFs[1].max() across all frames.
Why don't you run plotmvs on all of your planned frames first, in order to find the absolute maxima of PDFs[0] and PDFs[1], and then run your animation using those absolute maxima to normalize your plots? That way the colorbar will be the same for all frames.
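A minimal sketch of that two-pass idea (compute_pdfs and frames here are hypothetical stand-ins for whatever produces PDFs[0] and PDFs[1] per frame in your code):

import numpy as np

def compute_pdfs(frame):
    # Stand-in for whatever produces PDFs[0] and PDFs[1] for a given frame.
    rng = np.random.default_rng(frame)
    return rng.random((50, 50)), rng.random((50, 50))

frames = range(10)

# Pass 1: compute every frame's PDFs[0] - PDFs[1] and find the global extrema.
diffs = [np.subtract(*compute_pdfs(f)) for f in frames]
gmin = min(d.min() for d in diffs)
gmax = max(d.max() for d in diffs)

# Pass 2: rescale every frame to [-1, 1] with the same global constants, so the
# colour scale (and the colorbar) is identical across the whole animation.
norm_frames = [(d - gmin) * 2 / (gmax - gmin) - 1 for d in diffs]

# Plot each norm_frames[i] with vmin=-1, vmax=1.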
Following on from the post found here: 2D histogram coloured by standard deviation in each bin
I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold in Python.
Note that, in this dataset, each point has a continuous label value between 0 and 1.
For example, here is a histogram I made where the colour denotes the standard deviation of the label values of all points in each bin:
The way this was done was by using
scipy.stats.binned_statistic_2d()
(see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html)
...and setting the statistic argument to 'std'.
But is there a way to change this kind of plot so that the colouring is representative of the fraction of points in each bin with label value below 0.5 for example?
It could be that the only way to do this is by explicitly defining a grid of some kind and calculating the fractions, but I'm not sure of the best way to do that, so any help on this matter would be greatly appreciated!
Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d, and being able to return the raw data values in each bin as a multi-dimensional array, would help in quickly computing the fractions explicitly.
The fraction of elements in an array below a threshold can be calculated as
fraction = lambda a, threshold: len(a[a<threshold])/len(a)
Hence you can call
scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
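For completeness, here is a small end-to-end sketch with made-up data (x, y and labels are placeholders for your own columns):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = rng.normal(size=5000)
labels = rng.random(5000)  # continuous label values between 0 and 1

fraction = lambda a, threshold: len(a[a < threshold]) / len(a)

# Fraction of points with label < 0.5 in each bin (empty bins come out as NaN).
stat, xedges, yedges, _ = binned_statistic_2d(
    x, y, labels, statistic=lambda a: fraction(a, 0.5), bins=30)

# The statistic has x along the first axis, so transpose it for pcolormesh.
plt.pcolormesh(xedges, yedges, stat.T)
plt.colorbar(label='fraction of labels < 0.5')
plt.show()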
I have a 13x1340 matrix that I usually plot correctly without the need to specify an aspect ratio.
However, I would now like to tweak that aspect ratio so that two matrices whose 13 rows correspond to different scales are plotted as rectangles of equal length but different heights, proportional to the corresponding axis scale.
I have tried to use the get_aspect() method to obtain the numerical value that is being used, but it returns 'auto'. I have tried to guess the value and found that it is close to 4.5/(1340*180), which looks like a completely absurd value to me. I expected it to be something closer to 13/1340, but perhaps I don't quite understand how aspect ratios are calculated.
Setting the aspect ratio to 1 gives me an incredibly thin figure with the proper vertical size. As the value decreases, the figure becomes longer, until it reaches ~4.5/(1340*180). After that, it starts losing height while keeping a fixed length.
The figure size is set to 3 inches high by 7 inches wide, and the dpi is set to 300 in the savefig() method.
The get_data_ratio() method returns a value slightly larger than 13/1340, although it is clear that this value is not the aspect ratio used to construct the figure.
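For reference, a minimal sketch (made-up data and target ratio, and assuming the matrix is drawn with imshow and its default extent) of how a numeric aspect relates to the data shape: the aspect value is the on-screen size of one y data unit relative to one x data unit, not the height-to-width ratio of the box itself.

import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(13, 1340)
nrows, ncols = data.shape
target_ratio = 0.25  # desired displayed height/width of the image box (made up)

fig, ax = plt.subplots(figsize=(7, 3))
# One y data unit is drawn `aspect` times as large as one x data unit, so the
# image box ends up (nrows * aspect) / ncols = target_ratio times as tall as it is wide.
ax.imshow(data, aspect=target_ratio * ncols / nrows)
plt.show()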
Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area of each bin is equal to (or close to) some constant area A? The area should be calculated by multiplying the number of items in the bin by the width of the bin, and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see the related question here).
I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (normed=True in older Matplotlib versions), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that have the same number of points.
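As a quick check (assuming the snippet above has just run), every bar's area should come out as 1/nbin, up to ties in the data:

print(n * np.diff(bins))  # ~[0.1, 0.1, ..., 0.1] for nbin = 10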
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))

n, bins, patches = plt.hist(x, histedges_equalA(x, 10), density=False)
These boxes, however, are not all of equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
When using numpy.histogram with density=True, the function returns an array with the pdf values at each point. However, my question is: does it return pdf values at the leading edge of the bin or in the middle of the bin?
For example, if I have bins 0-1, 1-2, 2-3 etc... will it give me the pdfs at the points 0, 1, 2 etc... or at 0.5, 1.5, 2.5 etc...
Thank you!
Each normalized histogram value will give you the estimated probability density for your sample over the range spanned by its corresponding bin edges. If you had bin edges a and b then the corresponding normalized histogram value would be the probability density over the interval [a,b).
Intuitively, to estimate the density from some finite number of samples you count the number of samples that fall into each histogram bin, then divide by the bin width and by the total number of samples. For infinitely many samples and infinitely small bins this would converge on the PDF of the underlying continuous distribution.
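A quick way to convince yourself of this, with a made-up sample and the bins 0-1, 1-2, 2-3 from the question:

import numpy as np

samples = np.random.default_rng(0).normal(loc=1.5, size=10_000)
dens, edges = np.histogram(samples, bins=[0, 1, 2, 3], density=True)
counts, _ = np.histogram(samples, bins=[0, 1, 2, 3])

# dens[i] is the density over the interval [edges[i], edges[i+1]), i.e.
# counts[i] / (counted samples * bin width) -- not a point value at the
# edge or at the centre of the bin.
print(np.allclose(dens, counts / (counts.sum() * np.diff(edges))))  # True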