I want to build a histogram from 30 CSV files and then fit a Gaussian function to it to see whether my data is optimal. After that, I need to find the mean and standard deviation of the peaks. The files are large, and I do not know whether I am extracting the individual columns and binning their value ranges correctly.
I know this is long and contains many questions; please answer as much as you want. Thank you very much!
> this is the links of the data
Here is what I have done so far (not much, since I am a beginner at data visualization).
First, I import the packages. I use savgol_filter to smooth the histogram, which seems to look better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit conversion helper and set the figure size and slice limits.
def cm2inch(value):
    return value/2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next I load all the data into the Jupyter notebook by iterating 30 times, using two lists, "times" and "voltages", to store the values.
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5, unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I need a histogram function to plot the graph. I do get a plot, but I am not sure whether this is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the third peak is too low, which is not what I expected. Please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code
labels = "hist"
if showGraph:
    plt.title("Datapoints Distribution over Voltage [mV]", )
    plt.xlabel("Voltage [mV]")
    plt.ylabel("Data Points")
    plt.plot(hist, label=labels)
    plt.show()
2. (edited) I am not sure why my label is not displayed. Could you please correct me?
3. (edited) I also want to fit a Gaussian function to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
4. (edited) I realised that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, then I can find the mean value of that peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Is it a matter of finding the local maximum? If so, how should I proceed?
5. (edited) I know how to find the standard deviation of a single list. If I want to apply similar logic here, how do I implement it?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback to suggestions:
I tried to implement the Gaussian fit; below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit. I pass the voltages from my 30 datasets to the Gaussian Mixture fit, which prints out lots of values for mu and the variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I process the code one by one. There is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
From what I found on the web, a.any() or a.all() is used to reduce an array to a single value; for example, with [T,T,F,F,T] there are 4 possibilities.
I edit my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand that it is not a boolean object. At this stage, I do not know how to proceed with the Gaussian curve fit. Can anyone suggest an alternative way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, a natural model for that is a Gaussian Mixture Model, which you can find more info about via scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset. To fit k normal distributions, I'd use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
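# data has shape (800, 1), so use the array methods to get scalar limits
Xs = np.arange(data.min(), data.max(), 0.05)
# normal density centred on mu, hence (Xs - mu)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)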
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
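For the three-peak case in the question, a minimal sketch could look like the following. It assumes the measurements sit in a 2-D array (30 traces by some number of samples); the synthetic `voltages` array, the peak positions, and the widths are made-up stand-ins, not the asker's data. The key step is flattening to one column before fitting, since GaussianMixture treats each row as one sample.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the measured voltages: 30 traces x 700 samples,
# drawn from three Gaussians (all numbers here are invented for illustration).
rng = np.random.default_rng(0)
true_means = np.array([-2.0, 0.0, 3.0])
true_sigmas = np.array([0.3, 0.4, 0.5])
component = rng.integers(0, 3, size=(30, 700))
voltages = rng.normal(true_means[component], true_sigmas[component])

# GaussianMixture expects shape (n_samples, n_features), so flatten the
# 2-D array into a single column instead of passing it as 30 x 700.
samples = voltages.reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(samples)

# Sort the components by mean so the peaks are reported left to right.
order = np.argsort(gmm.means_[:, 0])
means = gmm.means_[order, 0]
sigmas = np.sqrt(gmm.covariances_[order, 0, 0])  # std dev = sqrt(variance)
weights = gmm.weights_[order]

for m, s, w in zip(means, sigmas, weights):
    print(f"mean={m:.3f}  std={s:.3f}  weight={w:.3f}")

# Grid for drawing the fitted curve; .min()/.max() on the array return
# scalars, which avoids the "truth value of an array" error.
xs = np.arange(samples.min(), samples.max(), 0.05)
```

The fitted means and standard deviations then answer the "mean and standard deviation of each peak" part directly, without locating local maxima first.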
As for the standard deviation question, I'm not sure which axis you'd like to capture the standard deviations along. np.std(data, axis=0) gives the standard deviation of each column, and np.std(data, axis=1) gives it row by row.
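To make the axis convention concrete, here is a small sketch with made-up numbers. In NumPy, axis=0 reduces down the rows and so yields one value per column, while axis=1 yields one value per row; ddof=1 matches the sample standard deviation used in the question.

```python
import numpy as np

# Two rows, three columns of invented values.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 6.0, 8.0]])

col_std = np.std(data, axis=0, ddof=1)  # one value per column (3 values)
row_std = np.std(data, axis=1, ddof=1)  # one value per row (2 values)

print(col_std.shape, row_std.shape)
print(row_std)
```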
I am trying to make DFT (discrete fourier transforms) plots using pcolor in python. I have previously been using Mathematica 8.0 to do this but I find that the colorbar in mathematica 8.0 has bad one-to-one correlation with the data I try to represent. For instance, here is the data that I am plotting:
[[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]]
So the DFT matrix consists mostly of zeros and small numbers, i.e. small amounts of high-frequency energy.
When I plot this using mathematica, this is the result:
The color bar is off and I thought I'd like to plot this with python instead.
My python code (that I hijacked from here) is:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
#from pylab import pcolor, show, colorbar, xticks, yticks
from pylab import *
data = np.array([[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]], float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
pcolor(data)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
#show()
savefig('/home/mydir/foo.eps',figsize=(4,4),dpi=100)
And this python code plots as:
Now here is my question/list of questions:
I like how python plots this and would like to use this but...
How do I make all the "blue" which represents "0" go away like it does in my mathematica plot?
How do I rotate the plot to have the bright red spot in the top left corner?
The way I set the "dpi", is that correct?
Any useful references that I should use to strengthen my love for python?
I have looked through other questions on here and the user manual for numpy but found not much help.
I plan on publishing this data and it is rather important that I get all the bits and pieces right! :)
Edit:
Modified python code and resulting plot! What improvements would one suggest to this to make it publication worthy?
from numpy import corrcoef, sum, log, arange, save
from numpy.random import rand
from pylab import *
data = np.array([[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]], float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
v1 = abs(data).max()
v2 = abs(data).min()
pcolor(data, cmap="binary")
colorbar()
#xlabel("X", fontsize=12, fontweight="bold")
#ylabel("Y", fontsize=12, fontweight="bold")
xticks(arange(0.5,10.5),range(0,10),fontsize=19)
yticks(arange(0.5,10.5),range(0,10),fontsize=19)
axis([0,7,0,7])
#show()
savefig('/home/mydir/Desktop/py_dft.eps',figsize=(4,4),dpi=600)
The following will get you closer to what you want:
import numpy as np
import matplotlib.pyplot as plt
plt.pcolor(data, cmap=plt.cm.OrRd)
plt.yticks(np.arange(0.5,10.5),range(0,10))
plt.xticks(np.arange(0.5,10.5),range(0,10))
plt.colorbar()
plt.gca().invert_yaxis()
plt.gca().set_aspect('equal')
plt.show()
The list of available colormaps by default is here. You'll need one that starts out white.
If none of those suits your needs, you can generate your own; start by looking at LinearSegmentedColormap.
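As a rough sketch of the custom-colormap route, LinearSegmentedColormap.from_list can build a map that starts at white, so zero-valued cells blend into the background. The name `white_red` and the choice of end colour are illustrative assumptions.

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap

# Hypothetical colormap: white at the low end, red at the high end.
white_red = LinearSegmentedColormap.from_list("white_red", ["white", "red"])

# Calling a colormap with a value in [0, 1] returns an RGBA tuple.
low = white_red(0.0)
high = white_red(1.0)
print(low)   # RGBA of the lowest value (white)
print(high)  # RGBA of the highest value (red)
```

It can then be passed in place of a built-in map, for example plt.pcolor(data, cmap=white_red).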
Just for the record, in Mathematica 9.0:
GraphicsGrid@{{MatrixPlot[l,
ColorFunction -> (ColorData["TemperatureMap"][Rescale[#, {Min@l, Max@l}]] &),
ColorFunctionScaling -> False], BarLegend[{"TemperatureMap", {0, Max@l}}]}}
How would you create a qq-plot using Python?
Assume that you have a large set of measurements and are using some plotting function that takes XY values as input. The function should plot the quantiles of the measurements against the corresponding quantiles of some distribution (normal, uniform...).
The resulting plot then lets us evaluate whether our measurements follow the assumed distribution or not.
http://en.wikipedia.org/wiki/Quantile-quantile_plot
Both R and Matlab provide ready-made functions for this, but I am wondering what the cleanest method for implementing it in Python would be.
Update: As folks have pointed out this answer is not correct. A probplot is different from a quantile-quantile plot. Please see those comments and other answers before you make an error in interpreting or conveying your distributions' relationship.
I think that scipy.stats.probplot will do what you want. See the documentation for more detail.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Result
Using qqplot of statsmodels.api is another option:
Very basic example:
import numpy as np
import statsmodels.api as sm
import pylab
test = np.random.normal(0,1, 1000)
sm.qqplot(test, line='45')
pylab.show()
Result:
Documentation and more examples are here
If you need to do a QQ plot of one sample vs. another, statsmodels includes qqplot_2samples(). Like Ricky Robinson in a comment above, this is what I think of as a QQ plot vs a probability plot which is a sample against a theoretical distribution.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot_2samples.html
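For readers who want to see what a two-sample Q-Q comparison computes, here is a minimal NumPy-only sketch. It is not the statsmodels implementation; the helper name `qq_two_samples` and the choice of 100 evenly spaced probabilities are assumptions made for illustration.

```python
import numpy as np

def qq_two_samples(a, b, n_quantiles=100):
    """Return matching quantiles of two samples (hypothetical helper)."""
    # Evaluate both samples at the same probabilities, so samples of
    # different sizes can be compared point by point.
    probs = (np.arange(1, n_quantiles + 1) - 0.5) / n_quantiles
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=5000)
sample_b = rng.normal(0.0, 2.0, size=3000)

qa, qb = qq_two_samples(sample_a, sample_b)

# If the two samples differ only in scale, the points fall on a straight
# line whose slope is roughly the ratio of their standard deviations.
slope = np.polyfit(qa, qb, 1)[0]
print(f"slope of Q-Q points: {slope:.2f}")
```

Plotting qb against qa with any XY plotting function gives the Q-Q plot itself.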
I came up with this. Maybe you can improve it. Especially the method of generating the quantiles of the distribution seems cumbersome to me.
You could replace np.random.normal with any other distribution from np.random to compare data against other distributions.
#!/bin/python
import numpy as np
measurements = np.random.normal(loc = 20, scale = 5, size=100000)
def qq_plot(data, sample_size):
    # Column 0: sorted random subsample of the data.
    # Column 1: sorted draws from a standard normal distribution.
    qq = np.ones([sample_size, 2])
    np.random.shuffle(data)  # note: shuffles the caller's array in place
    qq[:, 0] = np.sort(data[0:sample_size])
    qq[:, 1] = np.sort(np.random.normal(size = sample_size))
    return qq
print(qq_plot(measurements, 1000))
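One way to avoid drawing random numbers for the reference distribution is to compute its quantiles exactly with the inverse CDF (the ppf method of a scipy.stats distribution). The sketch below assumes the plotting positions (i - 0.5)/n, which is one common convention among several; the helper name `qq_points` is hypothetical.

```python
import numpy as np
from scipy import stats

def qq_points(data, dist=stats.norm):
    """Return (theoretical quantiles, ordered data) for a Q-Q plot."""
    ordered = np.sort(np.asarray(data, dtype=float))
    n = ordered.size
    # Plotting positions (i - 0.5) / n; other conventions exist.
    probs = (np.arange(1, n + 1) - 0.5) / n
    return dist.ppf(probs), ordered

rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=1000)

theoretical, ordered = qq_points(measurements)

# For normal data the points lie near a line with slope close to the
# standard deviation and intercept close to the mean.
slope, intercept = np.polyfit(theoretical, ordered, 1)
print(f"slope={slope:.2f}  intercept={intercept:.2f}")
```

Passing another frozen or unfrozen scipy.stats distribution as dist, for example stats.uniform, compares the data against that distribution instead.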
To add to the confusion around Q-Q plots and probability plots in the Python and R worlds, this is what the SciPy manual says:
"probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot."
If you try out scipy.stats.probplot, you'll see that indeed it compares a dataset to a theoretical distribution. Q-Q plots, OTOH, compare two datasets (samples).
R has functions qqnorm, qqplot and qqline. From the R help (Version 3.6.3):
qqnorm is a generic function the default method of which produces a
normal QQ plot of the values in y. qqline adds a line to a
“theoretical”, by default normal, quantile-quantile plot which passes
through the probs quantiles, by default the first and third quartiles.
qqplot produces a QQ plot of two datasets.
In short, R's qqnorm offers the same functionality that scipy.stats.probplot provides with the default setting dist="norm". But the fact that they called it qqnorm and that it's supposed to "produce a normal QQ plot" may easily confuse users.
Finally, a word of warning. These plots don't replace proper statistical testing and should be used for illustrative purposes only.
It exists now in the statsmodels package:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html
You can use bokeh
from bokeh.plotting import figure, show
from scipy.stats import probplot
# pd_series is the series you want to plot
series1 = probplot(pd_series, dist="norm")
p1 = figure(title="Normal QQ-Plot", background_fill_color="#E8DDCB")
p1.scatter(series1[0][0],series1[0][1], fill_color="red")
show(p1)
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Here probplot plots the measurements against the normal distribution specified by dist="norm".
How big is your sample? Here is another option to test your data against any distribution using the OpenTURNS library. In the example below, I generate a sample x of 1,000,000 numbers from a Uniform distribution and test it against a Normal distribution.
You can replace x by your data if you reshape it as x= [[x1], [x2], .., [xn]]
import openturns as ot
x = ot.Uniform().getSample(1000000)
g = ot.VisualTest.DrawQQplot(x, ot.Normal())
g
In my Jupyter Notebook, I see:
If you are writing a script, you can do it more properly
from openturns.viewer import View
import matplotlib.pyplot as plt
View(g)
plt.show()