Following on from the post found here: 2D histogram coloured by standard deviation in each bin
I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold in Python.
Note that, in this dataset, each point has a continuous label value between 0 and 1.
For example here is a histogram I made whereby the colour denotes the standard deviation of label values of all points in each bin:
This was done using scipy.stats.binned_statistic_2d() (see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html) and setting the statistic argument to 'std'.
But is there a way to change this kind of plot so that the colouring is representative of the fraction of points in each bin with label value below 0.5 for example?
It could be that the only way to do this is by explicitly defining a grid of some kind and calculating the fractions but I'm not sure of the best way to do that so any help on this matter would be greatly appreciated!
Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d and being able to return the raw data values in each bin as a multi dimensional array would help in being able to quickly compute the fractions explicitly.
The fraction of elements in an array below a threshold can be calculated as
fraction = lambda a, threshold: len(a[a<threshold])/len(a)
Hence you can call
scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
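As a possible alternative to a callable statistic, the same fractions can be sketched with numpy.histogram2d alone, by dividing the per-bin count of points below the threshold by the per-bin count of all points. The function name fraction_below and its arguments are assumptions made for this illustration; they are not part of numpy or scipy.

```python
import numpy as np

def fraction_below(x, y, values, threshold=0.5, bins=10):
    """Per-bin fraction of points whose value is below `threshold` (hypothetical helper)."""
    x, y, values = (np.asarray(a) for a in (x, y, values))
    # Count all points per bin, and keep the edges so both histograms share them.
    total, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Count only the points whose label is below the threshold, on the same edges.
    mask = values < threshold
    below, _, _ = np.histogram2d(x[mask], y[mask], bins=[xedges, yedges])
    # Empty bins give 0/0, which is left as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        frac = below / total
    return frac, xedges, yedges

frac, xedges, yedges = fraction_below(
    [0.1, 0.1, 0.9], [0.1, 0.1, 0.9], [0.2, 0.8, 0.3], threshold=0.5, bins=2
)
```

The returned array is indexed as frac[x_bin, y_bin], so it would normally be transposed before being passed to something like matplotlib's pcolormesh together with xedges and yedges.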
Related
I have a vector of floats V with values from 0 to 1. I want to create a histogram with some bin width, say A = 0.01, and check how close the resulting histogram is to a uniform distribution, getting a single value from zero to one, where 0 means the histogram matches perfectly and 1 means it does not match at all. By correlation I mean, first of all, similarity of histogram shape.
How would one do such a thing in Python with numpy?
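You can create the histogram with np.histogram. Then you can build the reference uniform histogram by filling every bin with the mean of the observed counts, computed with np.mean. Finally, you can compare the two with a statistical measure such as the Pearson coefficient, available as scipy.stats.pearsonr. Note that pearsonr returns nan when one of its inputs is constant, which a perfectly flat reference histogram is, so a goodness-of-fit test such as scipy.stats.chisquare may be a better fit for this comparison.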
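The comparison step can also be sketched with a simple distance between the normalised histogram and a flat one. The sketch below swaps in the total variation distance, which is not from the answer above, because it is 0 for a perfectly flat histogram and approaches 1 when all values fall in one bin. The name uniformity_distance is hypothetical, and the values are assumed to lie in [0, 1].

```python
import numpy as np

def uniformity_distance(v, bin_width=0.01):
    """Total variation distance between the histogram of v and a flat histogram."""
    n_bins = int(round(1.0 / bin_width))
    counts, _ = np.histogram(v, bins=n_bins, range=(0.0, 1.0))
    # Turn counts into per-bin probabilities.
    p = counts / counts.sum()
    # Half the summed absolute difference to the uniform probability 1/n_bins.
    return 0.5 * np.abs(p - 1.0 / n_bins).sum()

flat = uniformity_distance((np.arange(100) + 0.5) / 100)  # one value per bin
peaked = uniformity_distance(np.full(50, 0.005))          # everything in the first bin
```

With n bins the largest possible value is 1 - 1/n, so for 100 bins the score tops out at 0.99 rather than exactly 1.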
I have working code that plots a bivariate gaussian distribution. The distribution is produced by adjusting the COV matrix to account for specific variables. Specifically, every XY coordinate is applied with a radius. The COV matrix is then adjusted by a scaling factor to expand the radius in x-direction and contract in y-direction. The direction of this is measured by theta. The output is expressed as a probability density function (PDF).
I have normalised the PDF values. However, I'm calling a separate PDF for each frame. As such, the maximum value changes and hence the probability will be transformed differently for each frame.
Question: Using @Prasanth's suggestion, is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Below is the function I'm currently using to normalise the PDF for a single frame.
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
Is it possible to create normalized arrays for each frame before plotting, and then plot these arrays?
Indeed it is possible. In your case you probably need to rescale your arrays between two values, say -1 and 1, before plotting, so that the minimum becomes -1, the maximum becomes 1, and the intermediate values are scaled accordingly.
You could also choose 0 and 1, or any other pair, as minimum and maximum, but let's go with -1 and 1 so that the middle value is 0.
To do this, in your code replace:
normPDF = (PDFs[0]-PDFs[1])/max(PDFs[0].max(),PDFs[1].max())
with:
renormPDF = PDFs[0]-PDFs[1]
renormPDF -= renormPDF.min()
normPDF = (renormPDF * 2 / renormPDF.max()) -1
These three lines ensure that normPDF.min() == -1 and normPDF.max() == 1.
Now when plotting the animation the axis on the right of your image does not change.
Your problem is to find the maximum values of PDFs[0].max() and PDFs[1].max() for all frames.
Why don't you run plotmvs on all your planned frames in order to find the absolute maximum for PDFs[0] and PDFs[1] and then run your animation with these absolute maxima to normalize your plots? This way, the colorbar will be the same for all frames.
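The two-pass idea above could look roughly like the following sketch. It assumes the per-frame PDFs have already been computed and collected as a list of (PDFs[0], PDFs[1]) pairs; that layout and the name normalise_all_frames are assumptions made for this illustration, not part of the original code.

```python
import numpy as np

def normalise_all_frames(frames):
    """Normalise every frame by one maximum shared across all frames.

    frames: iterable of (pdf0, pdf1) array pairs, one pair per frame (assumed layout).
    """
    frames = [(np.asarray(p0, dtype=float), np.asarray(p1, dtype=float))
              for p0, p1 in frames]
    # First pass: the largest value reached by either PDF in any frame.
    global_max = max(max(p0.max(), p1.max()) for p0, p1 in frames)
    # Second pass: same formula as the single-frame version, but with the shared maximum.
    return [(p0 - p1) / global_max for p0, p1 in frames]

frames = [
    (np.array([1.0, 2.0]), np.array([0.0, 1.0])),
    (np.array([4.0, 0.0]), np.array([0.0, 2.0])),
]
normed = normalise_all_frames(frames)
```

Because every frame is divided by the same number, values are directly comparable between frames and stay within [-1, 1].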
I am doing some analysis to calculate the value of log_10(x) which is a negative number. I am now trying to plot these values, however, since the range of the answers is very large I would like to use a logarithmic scale for this. If I simply use plt.yscale('log') I get a message telling me UserWarning: Data has no positive values, and therefore cannot be log-scaled. I also cannot supply the values of x to plt.plot as the result of log_10(x) is so large and negative that the answer of x**(log_10(x)) is simply 0.
What might be the most straightforward way of plotting this data?
You can use
plt.yscale('symlog')
to set the scale to a symmetric log scale. This means that it will scale logarithmically to both sides of 0. Only using the negative part of the symlog scale would work just fine.
Two alternatives to ImportanceOfBeingErnest's solution:
Plot -log_10(x) on a semilog y axis and set the y-label to display negative units
Plot -log_10(-log_10(x)) on a linear scale
However, in all cases (including the solution proposed by ImportanceOfBeingErnest), the interpretation is not straightforward since you are displaying or calculating the log of a log.
Finally, in order to return the value for x, you need to calculate 10**(log_10(x)) not x**(log_10(x))
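A small numerical sketch of the points above, using made-up values for log_10(x): it shows that 10**(log_10(x)) recovers x until it underflows to zero, and that negating the values yields positive numbers that a logarithmic axis can display. The array contents are illustrative assumptions only.

```python
import numpy as np

# Hypothetical results of the analysis: log_10(x) for three very small x.
log10_x = np.array([-5.0, -50.0, -500.0])

# Recovering x uses base 10, not base x; the last entry underflows to 0.0
# because 1e-500 is smaller than the smallest representable double.
x = 10.0 ** log10_x

# Negating gives strictly positive values, which a log-scaled axis accepts.
neg_log10_x = -log10_x

# Second alternative from the list above, intended for a linear axis.
double_log = -np.log10(neg_log10_x)
```

Plotting neg_log10_x with plt.yscale('log') then avoids the warning, at the cost of having to label the axis as -log_10(x).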
I have a set of discrete 2-dimensional data points. Each of these points has a measured value associated with it. I would like to get a scatter plot with points colored by their measured values. But the data points are so dense that points with different colors would overlap with each other, that may not be good for visualization. So I am thinking if I could associate the color for each point based on the coarse-grained average of measured values of some points near it. Does anyone know how to implement this in Python?
Thanks!
I did this using sklearn.neighbors.RadiusNeighborsRegressor(); the idea is to take the average of the values of the neighbors within a specific radius. Suppose the coordinates of the data points are in the list temp_coors and the values associated with these points are in coloring; then coloring can be coarse-grained in the following way:
from sklearn.neighbors import RadiusNeighborsRegressor
# weights='uniform' gives a plain average over all neighbors within the radius
r_neigh = RadiusNeighborsRegressor(radius=smoothing_radius, weights='uniform')
r_neigh.fit(temp_coors, coloring)
# predicting at the training points replaces each value by its local average
coloring = r_neigh.predict(temp_coors)
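For small data sets, the same kind of radius average can be written directly in NumPy, which makes the computation explicit. This is only a brute-force sketch: it builds the full pairwise distance matrix, so memory grows with the square of the number of points, and the name radius_average is hypothetical.

```python
import numpy as np

def radius_average(coords, values, radius):
    """Replace each value by the mean of all values within `radius` of its point."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Squared distance between every pair of points, shape (n, n).
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    # True where point j lies within the radius of point i (always includes i itself).
    within = d2 <= radius ** 2
    # Sum of neighbor values divided by the number of neighbors.
    return (within @ values) / within.sum(axis=1)

smoothed = radius_average([[0, 0], [0.1, 0], [5, 5]], [1.0, 3.0, 10.0], radius=0.5)
```

The result can be passed as the c argument of a scatter plot in place of the raw values.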
I have a grid of model spectra, which have a constant, very high spectral resolution, and I need to down-sample them to a lower resolution, while preserving the total number of counts.
In essence, if the first 5 bins have (nominal center-of-bin) wavelengths [7.8, 7.81, 7.82, 7.83, 7.84], and the values [1.01, 1.02, 1.015, 1.014, 1.02], and my desired bins are some (non-integer) factor (say, 2.5) times as wide, I want my new spectrum to have nominal wavelengths [7.81, 7.83] and values [1.01+1.02+0.5*1.015, 0.5*1.015+1.014+1.02] (in general, though, the bins are not lined up as well, so you may get fractions of bins on either side).
I'll call my grid spec_ssp, and it has a shape of (93, 16, 39848). Wavelength varies along axis 2, and the first two axes are other parameters. I also have the nominal (central) wavelengths for each wavelength bin (technically, they're the log of the wavelength, but that shouldn't matter), called logL_ssp, and the desired new spacing of the logL grid, dlogL_new. I can figure out the nominal logL spacing of my templates, dlogL_ssp, by calculating np.median(logL_ssp[1:] - logL_ssp[:-1]), and it's about 20% of the desired logL spacing. We'll call that fraction f.
I originally tried to use scipy.ndimage.zoom, using the aforementioned factor f, but discovered that it gives me an array that's downsampled by a factor of exactly 4. I need an exact resampling, so this won't work.
Next, I tried linearly interpolating along axis 2 using scipy.interpolate.interp1d, after setting up new bin limits, with the aim of integrating the spectra in my grid using scipy.integrate.quad between successive bin limits, effectively getting an estimate of the total light in each of my new bins, more or less rigorously. However, quad doesn't play nicely with interp1d's interpolators (quad doesn't like non-float inputs). And, since I have ~1500 model spectra, the whole thing takes ages to run while iterating over the first two axes (yes, I'm only making a new interpolator once per model spectrum).
Any ideas how to tackle this?
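One possible approach, not taken from the thread, is to take the cumulative sum of the counts along the wavelength axis, linearly interpolate it at the new bin edges, and difference the result. That splits each old bin proportionally between the new bins it overlaps, as in the worked example above, and it is vectorised over the leading axes. The function name rebin_conserving is hypothetical, and the sketch assumes bin edges (not centers) are available, sorted, and that the new edges lie within the old ones.

```python
import numpy as np

def rebin_conserving(old_edges, new_edges, spec):
    """Rebin `spec` along its last axis while conserving the total counts."""
    old_edges = np.asarray(old_edges, dtype=float)
    new_edges = np.asarray(new_edges, dtype=float)
    spec = np.asarray(spec, dtype=float)
    # Cumulative counts at every old edge, starting from zero at the first edge.
    zeros = np.zeros(spec.shape[:-1] + (1,))
    cum = np.concatenate([zeros, np.cumsum(spec, axis=-1)], axis=-1)
    # Index of the old bin containing each new edge, and the fractional position inside it.
    i = np.clip(np.searchsorted(old_edges, new_edges, side="right") - 1,
                0, len(old_edges) - 2)
    w = (new_edges - old_edges[i]) / (old_edges[i + 1] - old_edges[i])
    # Linear interpolation of the cumulative counts at the new edges.
    cum_new = cum[..., i] * (1.0 - w) + cum[..., i + 1] * w
    # Counts per new bin are differences of the cumulative counts.
    return np.diff(cum_new, axis=-1)

values = np.array([1.01, 1.02, 1.015, 1.014, 1.02])
rebinned = rebin_conserving(np.arange(6.0), [0.0, 2.5, 5.0], values)
grid = rebin_conserving(np.arange(6.0), [0.0, 2.5, 5.0], np.ones((2, 3, 5)))
```

For a grid the size of spec_ssp the cumulative array is as large as the input, so it may be worth processing the first axis in slices if memory is tight.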