Histogram matching in Python

I am trying to do histogram matching of simulated data to observed precipitation data. The code below shows a simple simulated case. I computed the CDFs of both the simulated and observed data and got stuck there. I hope a clue will help me get across. Thank you in advance.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
import scipy.stats as st
sim = st.gamma(1,loc=0,scale=0.8) # Simulated
obs = st.gamma(2,loc=0,scale=0.7) # Observed
x = np.linspace(0,4,1000)
simpdf = sim.pdf(x)
obspdf = obs.pdf(x)
plt.plot(x,simpdf,label='Simulated')
plt.plot(x,obspdf,'r--',label='Observed')
plt.title('PDF of Observed and Simulated Precipitation')
plt.legend(loc='best')
plt.show()
plt.figure(1)
simcdf = sim.cdf(x)
obscdf = obs.cdf(x)
plt.plot(x,simcdf,label='Simulated')
plt.plot(x,obscdf,'r--',label='Observed')
plt.title('CDF of Observed and Simulated Precipitation')
plt.legend(loc='best')
plt.show()
# Inverse CDF
invcdf = interp1d(obscdf,x)
transfer_func = invcdf(simcdf)
plt.figure(2)
plt.plot(transfer_func,x,'g-')
plt.show()

I tried to reproduce your code, and got the following error:
ValueError: A value in x_new is above the interpolation range.
If you look at the plot of your two CDFs, it is pretty straightforward to figure out what is going on.
When you define invcdf = interp1d(obscdf, x), notice that obscdf ranges from
>>> obscdf[0]
0.0
>>> obscdf[-1]
0.977852889924409
and so invcdf can only interpolate values between those limits: beyond them we would have to extrapolate, which is not all that well defined. SciPy's default behavior is to raise an error when asked to extrapolate, which is exactly what happens when you ask for invcdf(simcdf), because
>>> simcdf[-1]
0.99326205300091452
is beyond the interpolation range.
If you read the interp1d docs you will see that this behavior can be changed by passing bounds_error=False:
invcdf = interp1d(obscdf, x, bounds_error=False)
Now the call no longer raises. Values of simcdf that fall outside the range of obscdf are returned as NaN (the default fill_value) rather than extrapolated. You also need to reverse the order of your plotting arguments to plt.plot(x, transfer_func, 'g-') to get the same as in the figure you posted.
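A minimal sketch of the corrected transfer function, assuming the same two gamma distributions as in the question. It counts how many points fall outside the interpolation range and compares the interpolated result with obs.ppf(sim.cdf(x)), which is only available here because the distributions are known analytically. The names n_nan, valid and exact are introduced for illustration.

```python
import numpy as np
import scipy.stats as st
from scipy.interpolate import interp1d

sim = st.gamma(1, loc=0, scale=0.8)  # simulated
obs = st.gamma(2, loc=0, scale=0.7)  # observed
x = np.linspace(0, 4, 1000)
simcdf = sim.cdf(x)
obscdf = obs.cdf(x)

# Inverse observed CDF; out-of-range probabilities become NaN instead of raising.
invcdf = interp1d(obscdf, x, bounds_error=False)
transfer_func = invcdf(simcdf)

# Points where simcdf exceeds obscdf[-1] cannot be mapped on this grid.
valid = ~np.isnan(transfer_func)
n_nan = np.count_nonzero(~valid)

# Analytic quantile mapping, for comparison only.
exact = obs.ppf(simcdf)
print(n_nan, np.max(np.abs(transfer_func[valid] - exact[valid])))
```

If the NaN tail matters, one option is to widen the x grid so that obscdf gets closer to 1 before building the inverse.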

Related

How to make a histogram from 30 csv files, fit it with a Gaussian function, and find the standard deviation?

I want to make a histogram from 30 csv files, and then fit a Gaussian function to see if my data is optimal. After that, I need to find the mean and standard deviation of the peaks. The data files are large, and I do not know whether I am extracting the individual columns and organizing their value ranges into bins correctly.
I know this is a bit long and has many questions; please answer as much as you want. Thank you very much!
> these are the links to the data
Below is what I have done so far (not much, actually, because I am a beginner at data visualization).
First, I import the packages. I use savgol_filter to smooth the binned counts, which seems to look better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit-conversion helper and set the figure size and slice limits.
def cm2inch(value):
    return value/2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next I load all the data in a Jupyter notebook by iterating 30 times, storing the values in two lists, "times" and "voltages".
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5, unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I need a hist function to plot the graph. Although I have a plot, I am not sure whether this is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the 3rd peak is too low, which is not what I expected. But please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code
labels = "hist"
if showGraph:
    plt.title("Datapoints Distribution over Voltage [mV]")
    plt.xlabel("Voltage [mV]")
    plt.ylabel("Data Points")
    plt.plot(hist, label=labels)
    plt.show()
2. (edited) I am not sure why my label is not displayed; could you please correct me?
3. (edited) Besides, I want to fit a Gaussian function to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
4. (edited) I realised that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, then I can find the mean value of that peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Is it a matter of finding the local maximum? If so, how should I proceed?
5. (edited) I know how to find the standard deviation of a single list; how can I apply similar logic here?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback to suggestions:
I tried to implement the Gaussian fit; below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit. I pass my 30 voltage datasets to the Gaussian mixture fit, which prints out lots of values for mu and variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I ran the code line by line. There is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
From searching the web, I gather that this error appears when a single truth value is expected but an array such as [T,T,F,F,T] is given, so the result is ambiguous.
I edited my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand it is not a boolean object. At this stage, I do not know how to proceed with the Gaussian curve fit. Can anyone suggest an alternative way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, a natural model for that is a Gaussian Mixture Model, which you can find more info about on scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset. If I wanted to fit k normal distributions, I'd use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(data.min(), data.max(), 0.05)  # scalar limits, also works for 2D arrays
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)  # Gaussian pdf centred on mu
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
As for question 5, I'm not sure which axis you'd like to compute the standard deviations along. np.std(data, axis=0) gives the standard deviation of each column, and np.std(data, axis=1) gives the standard deviation of each row.
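For the three-peak case in the question, one possible route is to flatten the 2D voltages array and fit a sum of three Gaussians to the histogram with curve_fit. The sketch below uses synthetic data as a stand-in for the real measurements, and the function three_gauss, the initial guesses in p0 and the peak positions are all assumptions made for illustration. With scikit-learn, the analogous step would be GaussianMixture(n_components=3) fitted on voltages.reshape(-1, 1).

```python
import numpy as np
from scipy.optimize import curve_fit

def three_gauss(x, a1, m1, s1, a2, m2, s2, a3, m3, s3):
    """Sum of three Gaussian peaks: amplitude, mean and sigma for each."""
    total = np.zeros_like(x, dtype=float)
    for a, m, s in ((a1, m1, s1), (a2, m2, s2), (a3, m3, s3)):
        total += a * np.exp(-(x - m) ** 2 / (2.0 * s ** 2))
    return total

# Synthetic stand-in for the (30, n) voltages array; values are made up.
rng = np.random.default_rng(0)
voltages = np.concatenate([
    rng.normal(-2.0, 0.3, 3000),
    rng.normal(0.0, 0.4, 2100),
    rng.normal(3.0, 0.5, 900),
]).reshape(30, 200)

# Flatten first: min/max and the histogram then see one long 1D sample.
samples = voltages.ravel()
xs = np.arange(samples.min(), samples.max(), 0.05)

hist, bin_edges = np.histogram(samples, bins=200, density=True)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2

# Rough initial guesses read off the histogram by eye (assumed here).
p0 = [0.5, -2.0, 0.3, 0.5, 0.0, 0.3, 0.5, 3.0, 0.3]
popt, pcov = curve_fit(three_gauss, bin_centres, hist, p0=p0)

fit_means = popt[1::3]           # mean of each peak
fit_sigmas = np.abs(popt[2::3])  # sigma enters squared, so its sign is arbitrary
print(fit_means, fit_sigmas)
```

The fitted means and sigmas then answer the mean and standard deviation questions per peak, provided the initial guesses are close enough for the fit to converge.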

Python: Histogram return wrong values for counts (EDIT: more general with example)

EDIT: I've found a general example where it doesn't work either!
I am trying to extract the data for a histogram, but different counts seem wrong. As an example code:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.rand(1000000)
bins = np.arange(0,1,0.0001)
a,b,c = plt.hist(data,bins)
This gives me a rather messy histogram, and I've saved the counts as a and the bin edges as b. Now, plotting a against b, I should expect the same histogram, right? But that's not what I get:
plt.scatter(b[0:len(b)-1],a,s=2)
which gives me a plot that doesn't match at all! Furthermore, when I try to find the maximum value of a, it gives me 144, which fits fine with the scatter plot, but not with the histogram plot.
If I count the numbers myself with the following code:
len(np.intersect1d(np.where(data>=b[np.argmax(a)]),np.where(data<b[np.argmax(a)+1])))
then it also gives me 144, in accordance with the values. So is the displayed histogram just wrong for some reason, and I should ignore it and just take the extracted data?
Old, unedited post:
For a physics course I am trying to bin my results in the following way:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit
plt.rc("font", family=["Helvetica", "Arial"])
plt.rc("axes", labelsize=18)
plt.rc("xtick", labelsize=16, top=True, direction="in")
plt.rc("ytick", labelsize=16, right=True, direction="in")
plt.rc("axes", titlesize=22)
plt.rc("legend", fontsize=16)
data_Ra = np.loadtxt('Ra226_cal2_ch001.txt',skiprows=5)
t_Ra = data_Ra[:,0]*10**-8 # time in seconds
channels_Ra = data_Ra[:,1]
channels_Ra = channels_Ra[np.where(channels_Ra>0)] # removing all the measurements at channel = 0
intervalspace = 2 #The intervals in which we count
bins=np.arange(0,4000,intervalspace)
counts, intervals , stuff = plt.hist(channels_Ra,bins)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.show()
Here, the histogram plot looks totally fine, with a max near 13000 counts. But when I then use np.max(counts), I am given about 24000, and when I try and just plot the values it gives me with:
plt.scatter(intervals[0:len(intervals)-1]+intervalspace/2,counts,s=1)
plt.xlabel('Channels')
plt.ylabel('Counts')
plt.title('Ra225')
plt.show()
it looks totally different, and I can't figure out why. I am expecting the scatter plot to resemble the histogram, and while the peaks are located at the same x-values, the heights do not match.
This problem is in other large datasets as well.
I don't think I'm allowed to drop the txt file here, so I'm not sure how much more I can show, but any help will be appreciated!
I don't know why you interpret the results in that way.
If you look at the histogram plot, you will be able to see the maximum value of the y-axis is 25,000. That means that there are some values close to 25,000. This fact can be verified in the scatter plot.
Your scatter plot shows the actual values. It would be clearer if you described what your expected plot looks like.
If you want to discard some outlier points, you should apply some filtering before plotting the data.
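To check that the returned counts are the real ones, the counts can be recomputed without any plotting. The sketch below uses np.histogram, which plt.hist calls internally, on a smaller random sample than in the question; the variable names and sample size are chosen for illustration. A plausible reason for the visual mismatch is that thousands of bars narrower than a pixel cannot all be drawn faithfully, but that is an assumption about the rendering, not something the code below proves.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)       # uniform samples in [0, 1)
bins = np.linspace(0, 1, 1001)   # 1000 equal-width bins

# Same counts that plt.hist(data, bins) would return as its first value.
counts, edges = np.histogram(data, bins)

# Recount the fullest bin by hand: left edge inclusive, right edge exclusive.
i = np.argmax(counts)
manual = np.count_nonzero((data >= edges[i]) & (data < edges[i + 1]))
print(counts[i], manual, counts.sum())
```

If the two numbers agree, the extracted counts can be trusted, and plotting them with fewer bins or as a line (for example plt.step) may give a picture that is easier to read.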

scipy.interp2d warning and different result than expected

I'm trying to convert MATLAB code to equivalent Python.
I have 3 arrays and I want to compute interp2d:
nuA = np.asarray([2.439,2.5,2.6,2.7,2.8,3.0,3.2,3.5,4.0,5.0,6.0,8.0,10,15,25])
nuB = np.asarray([0,0.1,0.2,0.3,0.5,0.7,1])
a, b = np.meshgrid(nuA, nuB)
betaTab = np.transpose(np.asarray([[0.0,2.16,1.0,1.0,1.0,1.0,1.0],[0.0,1.592,3.39,1.0,1.0,1.0,1.0],[0.0,0.759,1.8,1.0,1.0,1.0,1.0],[0.0,0.482,1.048,1.694,1.0,1.0,1.0],[0.0,0.36,0.76,1.232,2.229,1.0,1.0],[0.0,0.253,0.518,0.823,1.575,1.0,1.0],[0.0,0.203,0.41,0.632,1.244,1.906,1.0],[0.0,0.165,0.332,0.499,0.943,1.56,1.0],[0.0,0.136,0.271,0.404,0.689,1.23,2.195],[0.0,0.109,0.216,0.323,0.539,0.827,1.917],[0.0,0.096,0.19,0.284,0.472,0.693,1.759],[0.0,0.082,0.163,0.243,0.412,0.601,1.596],[0.0,0.074,0.147,0.22,0.377,0.546,1.482],[0.0,0.064,0.128,0.191,0.33,0.478,1.362],[0.0,0.056,0.112,0.167,0.285,0.428,1.274]]))
ip = scipy.interpolate.interp2d(a,b,betaTab)
When I try to run it, this warning is displayed:
/usr/local/lib/python2.7/dist-packages/scipy/interpolate/fitpack.py:981: RuntimeWarning: No more knots can be added because the additional knot would
coincide with an old one. Probable cause: s too small or too large
a weight to an inaccurate data point. (fp>s)
kx,ky=1,1 nx,ny=4,14 m=105 fp=21.576347 s=0.000000
warnings.warn(RuntimeWarning(_iermess2[ierm][0] + _mess))
I know that interp2d is different from MATLAB's interp2, and that in Python the RectBivariateSpline function is preferred. But I can't use the latter function because of the length of my data. Also, the final result of ip(xi,yi) is different from the MATLAB answer.
How can I compute interp2d without warning and compute it correctly?
Your input data seems to be quite ill-defined. Here's a surface plot of your input points:
This is not an easy problem to interpolate. Incidentally, I recently ran into problems where interp2d couldn't even interpolate a smooth data set. So I would suggest checking out scipy.interpolate.griddata instead:
import numpy as np
import scipy.interpolate as interp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
#define your data as you did in your question: a, b and betaTab
ip = interp.interp2d(a,b,betaTab) # original interpolator
aplotv = np.linspace(a.min(),a.max(),100) # to interpolate at
bplotv = np.linspace(b.min(),b.max(),100) # to interpolate at
aplot,bplot = np.meshgrid(aplotv,bplotv) # mesh to interpolate at
# actual values from interp2d:
betainterp2d = ip(aplotv,bplotv)
# actual values from griddata:
betagriddata = interp.griddata(np.array([a.ravel(),b.ravel()]).T,betaTab.ravel(),np.array([aplot.ravel(),bplot.ravel()]).T)
# ^ this probably could be written in a less messy way,
# I'll keep thinking about it
#plot results
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(aplot,bplot,betainterp2d,cmap='viridis',cstride=1,rstride=1)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(aplot,bplot,betagriddata,cmap='viridis',cstride=1,rstride=1)
Results: (left: interp2d, right: griddata)
Conclusion: use scipy.interpolate.griddata.
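Since the table in the question is defined on a rectangular grid, another option worth trying is scipy.interpolate.RegularGridInterpolator, which does multilinear interpolation and does not fit spline knots. Note also that interp2d was deprecated and later removed in recent SciPy releases. The sketch below uses a small made-up table rather than betaTab; the arrays x, y and table are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Small hypothetical grid; x spacing is deliberately non-uniform.
x = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([0.0, 0.5, 1.0])

# Table values must have shape (len(x), len(y)).
table = x[:, None] + 2.0 * y[None, :]

interp = RegularGridInterpolator((x, y), table, method="linear")

# Query points are given as rows of (x, y) pairs.
value = interp(np.array([[1.5, 0.25]]))[0]
node = interp(np.array([[2.0, 1.0]]))[0]  # a grid node reproduces the table entry
print(value, node)
```

For the data in the question, the transposed betaTab has shape (len(nuB), len(nuA)), so the call would presumably be RegularGridInterpolator((nuA, nuB), betaTab.T).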

Interpolating Data Using SciPy

I have two arrays of data that correspond to x and y values, that I would like to interpolate with a cubic spline.
I have tried to do this, but my interpolated function doesn't pass through my data points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
re = np.array([0.2,2,20,200,2000,20000],dtype = float)
cd = np.array([103,13.0,2.72,0.800,0.401,0.433],dtype = float)
plt.yscale('log')
plt.xscale('log')
plt.xlabel( "Reynold's number" )
plt.ylabel( "Drag coefficient" )
plt.plot(re,cd,'x', label='Data')
x = np.linspace(0.2,20000,200000)
f = interp1d(re,cd,kind='cubic')
plt.plot(x,f(x))
plt.legend()
plt.show()
What I end up with looks like this;
Which is clearly an awful representation of my function. What am I missing here?
Thank you.
You can get the result you probably expect (smooth spline on the log axes) by doing this:
f = interp1d(np.log(re),np.log(cd), kind='cubic')
plt.plot(x,np.exp(f(np.log(x))))
This will build the interpolation in the log space and plot it correctly. Plot your data on a linear scale to see how the cubic has to flip to get the tail on the left hand side.
The main thing you are missing is the log scaling on your axes. The spline shown is not an unreasonable result given your input data. Try drawing the plot with plt.xscale('linear') instead of plt.xscale('log'). Perhaps a cubic spline is not the best interpolation technique, at least on the raw data. A better option may be to interpolate on the log of the data instead.
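A self-contained sketch of the log-space interpolation suggested above, using the data from the question. The helper cd_interp and the geometric sampling grid are introduced here for illustration; sampling with np.geomspace puts the evaluation points evenly on a log axis, which a linear grid does not.

```python
import numpy as np
from scipy.interpolate import interp1d

re = np.array([0.2, 2, 20, 200, 2000, 20000], dtype=float)
cd = np.array([103, 13.0, 2.72, 0.800, 0.401, 0.433], dtype=float)

# Build the cubic interpolant on log(re) -> log(cd).
f_log = interp1d(np.log(re), np.log(cd), kind='cubic')

def cd_interp(r):
    """Evaluate the log-space spline and map back to linear units."""
    return np.exp(f_log(np.log(r)))

# Evenly spaced on a log axis; clip guards against rounding past the end points.
x = np.clip(np.geomspace(re[0], re[-1], 200), re[0], re[-1])
y = cd_interp(x)

# The curve passes through the original data points.
print(np.max(np.abs(cd_interp(re) - cd)))
```

Because the result is exponentiated, the interpolated drag coefficient stays positive everywhere, which a cubic on the raw data does not guarantee.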

Python/Scipy kde fit, scaling

I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)
My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotone decreasing based on a histogram, which is the first figure depicted. Below I've included my current code. Thanks in advance
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde
df[var].hist()
plt.show() # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys) # a pdf with kinks
Alternatively, is there a slick way to use
count, div = np.histogram(df[var])
and then scale the count array to apply kde() to it?
Update
Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used
df[var].hist(bins=100)
I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.
If you increase the bandwidth using the bw_method parameter, then the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of the bw_method:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()
yields
The problem was under-binning, as mentioned by cel in the comments above. Setting bins=100 in pd.DataFrame.hist(), which defaults to bins=10, made this clear.
See also:
http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
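To see what bw_method actually changes, the sketch below compares the default bandwidth factor with the scalar one from the answer, on the same sample data. A scalar bw_method is used directly as the factor that scales the data covariance; the variable names are chosen for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
xs = np.linspace(0, 8, 200)

default_kde = gaussian_kde(data)               # Scott's rule by default
smooth_kde = gaussian_kde(data, bw_method=1.5)  # scalar is used as the factor

# A larger factor means wider kernels and a flatter, smoother curve.
print(default_kde.factor, smooth_kde.factor)
print(default_kde(xs).max(), smooth_kde(xs).max())

# Either way the estimate remains a density that integrates to one.
total = smooth_kde.integrate_box_1d(-np.inf, np.inf)
print(total)
```

Smoothing removes kinks at the cost of spreading mass beyond the data range, so a boundary at zero, as with the non-negative variable in the question, may need separate handling.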
