Interpolate smaller histogram bins (Pandas / Plotly) - python

I have a histogram of probability data, where the probabilities are bucketed in bins of size 10. When I display the histogram as a heatmap in plotly using buckets that are the same size as the data, I get this:
I'd like to view the heatmap at a finer granularity, like buckets of size 1. Since the data is in buckets of size 10, it looks like this:
How do I interpolate the data to fill in values for the gaps?
One idea is to divide the bucket from 190 to 200 evenly among 190-191, 191-192, etc., but that would not accurately represent the bell-curve shape of the histogram. Since the peak is at ~200, ideally there would be more weight in 199-200 than in 190-191.

I just found distplot. I can probably use that to get a good idea of the fine-grained distribution. I'll post back with the results when I have them.
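One approach (not from the original post) would be to interpolate the cumulative counts rather than the bins themselves: fit a shape-preserving (PCHIP) interpolator to the cumulative distribution at the coarse bin edges, evaluate it at the fine edges, and difference it. That keeps the total weight fixed and gives the fine bins near the peak more weight than a flat split of each coarse bucket would. The edges and counts below are made up for illustration:

import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical coarse histogram: bin edges every 10 units, one count per bucket.
coarse_edges = np.arange(150, 251, 10)
coarse_counts = np.array([2, 5, 12, 30, 55, 60, 40, 20, 8, 3])

# Cumulative counts at the coarse edges; PCHIP keeps the interpolant monotone,
# so the differenced fine-bin counts stay non-negative.
cdf = np.concatenate([[0], np.cumsum(coarse_counts)])
cdf_interp = PchipInterpolator(coarse_edges, cdf)

# Evaluate on 1-wide bins and difference to recover per-bin counts.
fine_edges = np.arange(150, 251, 1)
fine_counts = np.diff(cdf_interp(fine_edges))  # sums to coarse_counts.sum()

fine_counts can then be fed to the heatmap in place of the original size-10 buckets.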

Related

Histogram with same frequencies

I am plotting a histogram of another set of data, but the frequencies are all 1, no matter how I change the number of bins. The data were generated from a normal distribution as follows:
import numpy.random as npr
import matplotlib.pyplot as plt

x = npr.normal(0, 2, (1, 100))  # shape (1, 100): one row with 100 columns
plt.hist(x, bins=10)
and I get the following histogram:
This happens even if I increase the number of simulations to 1000 or 10000.
How do I plot a histogram that displays the bell shape of the normal distribution?
Thanks in advance.
You are plotting one histogram for each column of your input array, i.e. 100 histograms, each containing a single value.
x = npr.normal(0, 2, (1, 100))
plt.hist(x[0], bins=10)
will do (note that I am selecting the first (and only) row of x).
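Another option, not from the original answer, is to draw the sample as a 1-D array in the first place, so plt.hist sees a single dataset of 100 values:

import numpy.random as npr
import matplotlib.pyplot as plt

# A flat sample of 100 draws produces one histogram with the expected bell shape.
x = npr.normal(0, 2, 100)
plt.hist(x, bins=10)
plt.show()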

FFT comes out as a peak at 0 Hz frequency

I am trying to perform a discrete Fourier transform, using numpy's FFT in Python, on a very large dataset (more than 900k points).
Here is the original function's graph:
And here is the transform after I plot it:
I've tried detrending the data using scipy's detrend() function and also subtracting the average from the data points, but the only difference is that the thin gap at 0 Hz disappears.
I was expecting two peaks which would result from the big spike and then the little bump.
What could be causing the transform to come out this way? I have not checked whether my time data points are evenly spaced; I assumed they are. The big spike's values go all the way to infinity, and for the purposes of applying the FFT I replaced those points with sys.maxsize. Could this somehow factor into the resulting waveform? I need to understand what is causing the result to look this way.
Here is my code for performing the transform and plotting it; channelA and time are numpy arrays:
import numpy as np
import matplotlib.pyplot as plt

# channelA holds the samples; time holds the (assumed evenly spaced) sample times
fY = np.fft.fft(channelA)
freq = np.fft.fftfreq(len(channelA), time[1] - time[0])

plt.title("Transform")
plt.xlabel("Frequency / Hz")
plt.ylabel("Magnitude")
plt.plot(freq, np.abs(fY))
plt.show()
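As a rough sketch of the preprocessing described above (replacing the non-finite samples and detrending before transforming), with made-up stand-in data for channelA and time: a huge constant like sys.maxsize may itself dominate the spectrum, so a value on the scale of the data is used here instead.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import detrend

# Made-up stand-ins for the question's arrays: a big spike plus a small bump.
time = np.linspace(0.0, 1.0, 2**14)
channelA = np.exp(-((time - 0.3) / 0.005) ** 2) + 0.1 * np.exp(-((time - 0.6) / 0.05) ** 2)
channelA[4915:4918] = np.inf  # mimic the runaway spike values

# Replace non-finite samples with the largest finite value rather than sys.maxsize,
# which would swamp every other sample in the transform.
finite = np.isfinite(channelA)
channelA = np.where(finite, channelA, channelA[finite].max())

# Remove the mean / linear trend so the 0 Hz bin no longer dwarfs everything else.
channelA = detrend(channelA)

fY = np.fft.fft(channelA)
freq = np.fft.fftfreq(len(channelA), time[1] - time[0])

plt.title("Transform (preprocessed)")
plt.xlabel("Frequency / Hz")
plt.ylabel("Magnitude")
plt.plot(np.fft.fftshift(freq), np.abs(np.fft.fftshift(fY)))
plt.show()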

How to plot linear regression between two continuous values?

I am trying to implement a machine-learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (the commercial_units column) versus sale_price_millions, I also get a weird plot like this one:
I get these weird plots even though, in the correlation matrix, sale_price correlates very well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a clear plot, with fewer points and a clean fit, like this one:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sales price, against an integer-valued variable, total_units.
The following solutions come to mind:
1. Downsample the dataset, e.g. sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th row of clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (using a random seed for reproducibility).
2. Reduce the alpha (opacity) and/or the size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
3. For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
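Putting a couple of those suggestions together on a made-up stand-in for clean_df (the question's real data is not shown, so the values below are synthetic but use the same column names):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for clean_df with the question's column names.
rng = np.random.default_rng(12345)
gross = rng.gamma(shape=2.0, scale=1.5, size=20000)          # gross_sqft_thousands
price = 0.8 * gross + rng.normal(0.0, 0.5, size=gross.size)  # sale_price_millions
clean_df = pd.DataFrame({"gross_sqft_thousands": gross,
                         "sale_price_millions": price})

# Random 10% subsample plus small, translucent points keeps the fit readable.
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands",
            data=clean_df.sample(frac=0.1, random_state=12345),
            scatter_kws={"alpha": 0.1, "s": 1})
plt.show()

For the integer-valued commercial_units plot, adding y_jitter=0.05 (or x_jitter) to the same regplot call spreads the stacked points apart.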

2D histogram colour by "label fraction" of data in each bin

Following on from the post found here: 2D histogram coloured by standard deviation in each bin
I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold in Python.
Note that, in this dataset, each point has a continuous label value between 0-1.
For example here is a histogram I made whereby the colour denotes the standard deviation of label values of all points in each bin:
This was done using scipy.stats.binned_statistic_2d() (see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html) and setting the statistic argument to 'std'.
But is there a way to change this kind of plot so that the colouring is representative of the fraction of points in each bin with label value below 0.5 for example?
It could be that the only way to do this is by explicitly defining a grid of some kind and calculating the fractions, but I'm not sure of the best way to do that, so any help on this matter would be greatly appreciated!
Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d and being able to return the raw data values in each bin as a multi-dimensional array would help in quickly computing the fractions explicitly.
The fraction of elements in an array below a threshold can be calculated as
fraction = lambda a, threshold: len(a[a<threshold])/len(a)
Hence you can call
scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
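A self-contained sketch of that approach with made-up data (the x, y, and label arrays here are placeholders); the helper guards against empty bins, which would otherwise divide by zero:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

# Made-up data: 2D positions, each with a continuous label in [0, 1].
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = rng.normal(size=5000)
labels = rng.uniform(0.0, 1.0, size=5000)

def fraction_below(a, threshold=0.5):
    # Fraction of label values below the threshold; NaN for empty bins.
    return np.nan if a.size == 0 else np.mean(a < threshold)

stat, x_edges, y_edges, _ = binned_statistic_2d(
    x, y, labels, statistic=fraction_below, bins=30)

# The statistic is indexed as [x_bin, y_bin], so transpose for pcolormesh.
plt.pcolormesh(x_edges, y_edges, stat.T)
plt.colorbar(label="fraction of labels < 0.5")
plt.show()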

Rebinning numpy array by non-integer factor

I have a grid of model spectra, which have a constant, very high spectral resolution, and I need to down-sample them to a lower resolution, while preserving the total number of counts.
In essence, if the first 5 bins have (nominal center-of-bin) wavelengths [7.8, 7.81, 7.82, 7.83, 7.84], and the values [1.01, 1.02, 1.015, 1.014, 1.02], and my desired bins are some (non-integer) factor (say, 2.5) times as wide, I want my new spectrum to have nominal wavelengths [7.81, 7.83] and values [1.01+1.02+0.5*1.015, 0.5*1.015+1.014+1.02] (in general, though, the bins are not lined up as well, so you may get fractions of bins on either side).
I'll call my grid spec_ssp, and it has a shape of (93, 16, 39848). Wavelength varies along axis 2, and the first two axes are other parameters. I also have the nominal (central) wavelengths for each wavelength bin (technically, they're the log of the wavelength, but that shouldn't matter), called logL_ssp, and the desired new spacing of the logL grid, dlogL_new. I can figure out the nominal logL spacing of my templates dlogL_ssp by calculating np.median(logL_ssp[1:] - logL_ssp[:-1]), and it's about 20% the desired logL spacing. We'll call that fraction f.
I originally tried to use scipy.ndimage.zoom, using the aforementioned factor f, but discovered that it gives me an array that's downsampled by a factor of exactly 4. I need an exact resampling, so this won't work.
Next, I tried linearly interpolating along axis 2 using scipy.interpolate.interp1d, after setting up new bin limits, with the aim of integrating the spectra in my grid using scipy.integrate.quad between successive bin limits, effectively getting a more or less rigorous estimate of the total light in each of my new bins. However, quad doesn't play nicely with interp1d's interpolators (quad doesn't like non-float inputs). And, since I have ~1500 model spectra, the whole thing takes ages to run while iterating over all three axes (yes, I'm only making a new interpolator once per model spectrum).
Any ideas how to tackle this?
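One possible approach, not from the original thread, is a flux-conserving rebin built from a cumulative sum plus np.interp: accumulate the counts at the old bin edges, interpolate the cumulative curve at the new edges, and difference it, so partial overlaps are weighted by the overlapping fraction of each old bin. The helper below is a sketch (the name is made up) and assumes uniformly spaced input bins; it reproduces the worked example from the question:

import numpy as np

def rebin_flux_conserving(old_centers, old_values, new_centers):
    # Reconstruct bin edges from the (assumed uniform) centers.
    old_width = np.median(np.diff(old_centers))
    old_edges = np.concatenate([old_centers - old_width / 2,
                                [old_centers[-1] + old_width / 2]])
    new_width = np.median(np.diff(new_centers))
    new_edges = np.concatenate([new_centers - new_width / 2,
                                [new_centers[-1] + new_width / 2]])

    # Cumulative counts at the old edges; linear interpolation between edges
    # spreads each old bin's counts uniformly across its width.
    cum = np.concatenate([[0.0], np.cumsum(old_values)])
    cum_at_new = np.interp(new_edges, old_edges, cum)

    # Counts in a new bin = difference of the cumulative curve at its edges.
    return np.diff(cum_at_new)

wavelengths = np.array([7.80, 7.81, 7.82, 7.83, 7.84])
values = np.array([1.01, 1.02, 1.015, 1.014, 1.02])
new_wavelengths = np.array([7.8075, 7.8325])  # bins 2.5x as wide
print(rebin_flux_conserving(wavelengths, values, new_wavelengths))
# [2.5375 2.5415], i.e. [1.01+1.02+0.5*1.015, 0.5*1.015+1.014+1.02]

For the full (93, 16, 39848) grid, something like np.apply_along_axis(lambda v: rebin_flux_conserving(logL_ssp, v, new_centers), 2, spec_ssp) applies it along the wavelength axis (with new_centers being the desired logL grid), though a vectorised version would run faster.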
