Plotting histograms in Python using Matplotlib or Pandas - python

I have gone through various posts on this forum, but I cannot find an answer to the behaviour I am seeing.
I have a csv file whose header has many entries, with 300 points each.
For each field (column of the csv file) I would like to plot a histogram. The x-axis contains the elements of that column and the y-axis should show the number of samples that fall inside each bin.
As I have 300 points, the total number of samples in all bins added together should be 300, so the y-axis should go from 0 to, let's say, 50 (just an example). However, the values are gigantic (400e8), which makes no sense.
sample of the table
point mydata
1 | 250.23e-9
2 | 250.123e-9
... | ...
300 | 251.34e-9
Please check my code, below. I am using pandas to open the csv and Matplotlib for the rest.
df=pd.read_csv("/home/pcardoso/raw_data/myData.csv")
# Figure parameters
figPath='/home/pcardoso/scripts/python/matplotlib/figures/'
figPrefix='hist_' # Prefix to the name of the file.
figSuffix='_something' # Suffix to the name of the file.
figString='' # Full string passed as the figure name to be saved
precision=3
num_bins = 50
columns=list(df)
for fieldName in columns:
    vectorData=df[fieldName]
    # statistical data
    mu = np.mean(vectorData) # mean of distribution
    sigma = np.std(vectorData) # standard deviation of distribution
    # Create plot instance
    fig, ax = plt.subplots()
    # Histogram
    n, bins, patches = ax.hist(vectorData, num_bins, density='True',alpha=0.75,rwidth=0.9, label=fieldName)
    ax.legend()
    # Best-fit curve
    y=mlab.normpdf(bins, mu, sigma)
    ax.plot(bins, y, '--')
    # Setting axis names, grid and title
    ax.set_xlabel(fieldName)
    ax.set_ylabel('Number of points')
    ax.set_title(fieldName + ': $\mu=$' + eng_notation(mu,precision) + ', $\sigma=$' + eng_notation(sigma,precision))
    ax.grid(True, alpha=0.2)
    fig.tight_layout() # Tweak spacing to prevent clipping of ylabel
    # Saving figure
    figString=figPrefix + fieldName +figSuffix
    fig.savefig(figPath + figString)
    plt.show()
    plt.close(fig)
In summary, I would like to know how to have the y-axis values right.
Edit: 6 July 2020
Edit 08 June 2020
I would like the density estimator to follow the plot like this:
Thanks in advance.
Best regards,
Pedro

Don't use density='True': with that option, the value displayed is the number of members in the bin divided by the total number of points and by the width of the bin. If that width is small (as in your case of rather small x-values), the values become large.
Edit:
Ok, to un-norm the normed curve, you need to multiply it with the number of points and the width of one bin. I made a more reduced example:
from numpy.random import normal
from scipy.stats import norm
import pylab
N = 300
sigma = 10.0
B = 30
def main():
    x = normal(0, sigma, N)
    h, bins, _ = pylab.hist(x, bins=B, rwidth=0.8)
    bin_width = bins[1] - bins[0]
    # expected counts per bin: pdf * number of points * bin width
    h_n = norm.pdf(bins[:-1], 0, sigma) * N * bin_width
    pylab.plot(bins[:-1], h_n)
    pylab.show()
if __name__ == "__main__":
    main()
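To make the scaling concrete for values as small as the ones in the question, here is a minimal sketch using only NumPy. The data are synthetic (300 normally distributed points around 250e-9, an assumption chosen to mimic the table in the question), and the normal pdf is written out by hand rather than taken from a library.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, synthetic data
N = 300
data = rng.normal(250e-9, 0.5e-9, N)     # same order of magnitude as in the question

counts, edges = np.histogram(data, bins=50)
width = edges[1] - edges[0]

# density=True divides each count by N * bin_width, hence the huge values
density = counts / (N * width)

# normal pdf at the bin centres, rescaled to "points per bin"
mu, sigma = data.mean(), data.std()
centres = 0.5 * (edges[:-1] + edges[1:])
pdf = np.exp(-0.5 * ((centres - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
expected_counts = pdf * N * width

print(counts.sum())     # total number of points in all bins
print(density.max())    # very large, because the bin width is tiny
```

Plotting counts as bars and expected_counts as a line over centres keeps the y-axis in "number of points", while the curve still follows the histogram.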

Related

Turning a scatter plot into a histogram in python

I need to plot a histogram from some data in another file.
At the moment I have code that plots a scatter plot and fits a Gaussian.
The x-value is whatever the number is on the corresponding line from the data file it's reading in (after the first 12 lines of other information, i.e. Line 13 is the first event), and the y-value is the number of the line multiplied by a value.
The code then plots and fits the scatter, but I need to be able to plot it as a histogram and to change the bin width / number (i.e. add bins 1, 2, 3 and 4 together to have 1/4 of the bins overall with 4 times the number of events, so I guess adding together multiple lines from the data), which is where I am stuck.
How would I go about making this into the histogram and adjust the width / numbers?
Code below, didn't know how to make it pretty. Let me know if I can make it a bit easier to read.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Define gaussian
def gaussian(x, amp, cen, wid):
    """1-d gaussian: gaussian(x, amp, cen, wid)"""
    return (amp / (sqrt(2*pi) * wid)) * exp(-(x-cen)**2 / (2*wid**2))
## Define constants
stderrThreshold = 10
minimumAmplitude = 0.1
approxcen = 780
MaestroT = 53
## Define paramaters
amps = []; ampserr = []; ts = []
folderToAnalyze = baseFolder + fileToRun + '\\'
## Generate the time array
for n in range(0, numfiles):
    ## Load text file
    x = np.linspace(0, 8191, 8192)
    fullprefix = folderToAnalyze + prefix + str(n).zfill(3)
    y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
    ## Make figure
    fig, ax = plt.subplots(figsize=(15,8))
    fig.suptitle('Coincidence Detections', fontsize=20)
    plt.xlabel('Bins', fontsize=14)
    plt.ylabel('Counts', fontsize=14)
    ## Plot data
    ax.plot(x, y, 'bo')
    ax.set_xlim(600,1000)
    ## Fit data to Gaussian
    gmodel = Model(gaussian)
    result = gmodel.fit(y, x=x, amp=8, cen=approxcen, wid=1)
    ## Plot results and save figure
    ax.plot(x, result.best_fit, 'r-', label='best fit')
    ax.legend(loc='best')
    texttoplot = result.fit_report()
    ax.text(0.02, 0.5, texttoplot, transform=ax.transAxes)
    plt.close()
    fig.savefig(fullprefix + ".png", pad_inches='0.5')
Current Output: Scatter plots, that do show the expected distribution and plot of the data (they do however have a crappy reduced chi^2, but one problem at a time)
Expected Output: Histogram plots of the same data, with the same distribution and fitting, when each event is plotted as a separate bin, and hopefully the possibility to add these bins together to reduce error bars
Errors: N/A
Data: It's basically a standard distribution over 8192 lines. Full data for 1 file is here. Also the original .Spe file, the scatter plot and the full version of the code
2020-11-23 Update From Answer Comments:
Hi, I've been trying to implement this for a little while and have come unstuck. I've tried to follow your example closely; however, I am still getting a histogram with a bin width of 1 (i.e. the bins are not added together). I also get a second blank graph in the printouts, and the reports only print out in the IDE (though I am working on that one, and reckon I will have it soon). Also, for some reason, it seems to stop after 3 out of 50 iterations of the loop.
This is the code in its current state:
This is the output I'm getting:
This is the original output:
And just in case it's useful, this is the raw data. I seem to be having trouble replicating your last 2 figures.
The ideal would just be able to alter the constant on line 30, to whatever the desired bin width is, and have it run with that bin width on that occasion.
The scatter plot, in this case, is a histogram, except with dots instead of bars.
.Spe is a bin count for each event.
x = np.linspace(0, 8191, 8192) defines the bins, and the bin width is 1.
Construct a bar plot instead of a scatter plot
ax.bar(x, y) instead of ax.plot(x, y, 'bo')
As a result of the existing data, the following plot is a histogram with a very wide distribution.
There are values ranging from 321 to 1585
ax.set_xlim(300, 1800)
The benefit of this data is, it's easy to recreate the raw distribution based on x, a bin size of 1, and y being the respective count for each x.
np.repeat can create an array with repeat elements
import numpy as np
import matplotlib.pyplot as plt
# given x and y from the loop
# set the type as int
y = y.astype(int)
x = x.astype(int)
# create the data
data = np.repeat(x, y)
# determine the range of x
x_range = range(min(data), max(data)+1)
# determine the length of x
x_len = len(x_range)
# plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
ax2.boxplot(data, vert=False)
plt.show()
Given data, you can now perform whatever analysis is required.
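A small hand-checkable case may make the np.repeat step easier to trust. The channel numbers and counts below are made up for illustration; they are not taken from the .Spe file.

```python
import numpy as np

# hypothetical spectrum: x is the channel number, y the count in that channel
x = np.arange(6)
y = np.array([0, 2, 5, 3, 1, 0])

# one entry per recorded event
data = np.repeat(x, y)

# histogramming with one unit-width bin per channel gives the counts back
edges = np.arange(x.min(), x.max() + 2)   # [0,1), [1,2), ..., [5,6]
recovered, _ = np.histogram(data, bins=edges)

print(data)        # [1 1 2 2 2 2 2 3 3 3 4]
print(recovered)   # [0 2 5 3 1 0]
```

Because the round trip is exact, any re-binning of data with np.histogram is just a regrouping of the original channel counts.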
SO: Fit gaussian to noisy data with lmfit
LMFIT Docs
Cross Validated might be a better site for diving into the model
All of the error calculation parameters come from the model result. If you calculate new x and y from np.histogram for different bin widths, that may affect the errors.
approxcen = 780 is also an input to result
# given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
# determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
# Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
# result
print(result.fit_report())
[out]:
[[Model]]
    Model(gaussian)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 314
    # data points      = 158
    # variables        = 3
    chi-square         = 397.702574
    reduced chi-square = 2.56582306
    Akaike info crit   = 151.851284
    Bayesian info crit = 161.039069
[[Variables]]
    amp:  1174.80608 +/- 37.1663147 (3.16%) (init = 8)
    cen:  775.535731 +/- 0.46232727 (0.06%) (init = 780)
    wid:  12.6563219 +/- 0.46232727 (3.65%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
    C(amp, wid) = 0.577
# plot
plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.figure(figsize=(20, 8))
plt.bar(x[:-1], y)
plt.xlim(700, 850)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.grid()
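Since the .Spe data are already counts per channel, adjacent bins can also be merged directly, without rebuilding the raw events first. The following is a sketch of that alternative, not the method used above; rebin is a hypothetical helper name, and the stand-in spectrum is synthetic.

```python
import numpy as np

def rebin(counts, width):
    """Sum every `width` adjacent bins; trailing bins that do not fill a group are dropped."""
    counts = np.asarray(counts)
    usable = (len(counts) // width) * width
    return counts[:usable].reshape(-1, width).sum(axis=1)

width = 4                      # the single constant to change per run
y = np.arange(8192)            # stand-in for one 8192-channel spectrum
merged = rebin(y, width)

# centre of each merged group of channels, e.g. channels 0..3 -> 1.5
x_merged = np.arange(len(merged)) * width + (width - 1) / 2

print(len(merged))   # 2048
print(merged[:3])    # [ 6 22 38]
```

The merged arrays can then be passed to ax.bar(x_merged, merged, width=width) and to the fit in place of x and y. Note that the fitted amplitude scales with the bin width, since each merged bin holds the sum of several channels.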
As we can see from the next code block, the error is related to the following parameters
stderrThreshold = 10
minimumAmplitude = 0.1
MaestroT = 53
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)

Python, connect two data points at the beginning and end of a series

I would really like to know how I can plot the mean of two points on a chart with Python. I have stock data with 200 data points, and I want to take the mean of the first 20 points and the mean of the last 20 points, and then plot a line connecting those two points. I do not want any of the data points between those two to be taken into account.
My entire program is as follows:
stock = web.get_data_yahoo('clh.ax', '10/01/2017', interval='d')
stock['ema']=stock['Adj Close'].ewm(span=100,min_periods=0).mean()
stock['std']=stock['Adj Close'].rolling(window = 20,min_periods=0).std()
# bollinger bands
stock['close 20 day mean'] = stock['Close'].rolling(20,min_periods=0).mean()
# upper band
stock['upper'] = stock['close 20 day mean'] + 2 * (stock['Close'].rolling(20, min_periods=0).std())
# lower band
stock['lower'] = stock['close 20 day mean'] - 2 * (stock['Close'].rolling(20, min_periods=0).std())
# end bollinger bands
fig,axes = plt.subplots(nrows=3, ncols =1, figsize=(10,6))
axes[0].plot(stock['Close'], color='red')
axes[0].plot(stock['ema'], color='blue')
axes[0].plot(stock['close 20 day mean'], color='black')
axes[0].plot(stock['upper'], color='black')
axes[0].plot(stock['lower'], color='black')
axes[1].plot(stock['Volume'],color='purple')
axes[2].plot(stock['std'], color='black')
Not 100% sure I understood the question right, but:
a) Take the mean of the first 20 points.
b) Take the mean of the last 20 points.
c) Plot a line between those two values.
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([stock["Close"].iloc[:20].mean(), stock["Close"].iloc[-20:].mean()])
This plots:
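If the line should sit on the same x-axis as the price series rather than at positions 0 and 1, each mean needs an x position as well. The sketch below assumes a plain integer x-axis and places each mean at the middle of its 20-point window; both the synthetic close array and the choice of the window midpoint are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
close = 10 + np.cumsum(rng.normal(0, 0.1, 200))   # synthetic stand-in for stock['Close']

k = 20
first_mean = close[:k].mean()
last_mean = close[-k:].mean()

# x positions: the middle of each 20-point window
x_first = (k - 1) / 2                      # 9.5
x_last = len(close) - 1 - (k - 1) / 2      # 189.5

segment_x = [x_first, x_last]
segment_y = [first_mean, last_mean]
print(segment_x)   # [9.5, 189.5]
```

With Matplotlib, plt.plot(close) followed by plt.plot(segment_x, segment_y) would draw the connecting line on top of the series. For a date-indexed DataFrame the x values would have to come from the index instead.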

Plotting data points on where they fall in a distribution

Let's say I have a large data set that I can manipulate in some sort of analysis, for example by looking at values in a probability distribution.
Now that I have this large data set, I want to compare known, actual data to it. Primarily, I want to know how many of the values in my data set have the same value or property as the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I have made is how visually the points could possibly be projected on a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.
If I were to try and retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges and then having it all set up to where I can plot it. Any idea on how to gather this data and project this onto a cumulative distribution?
The question is pretty chaotic, with lots of irrelevant information, while staying vague at the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assuming we have a distribution x, which we divide into bins. We can compute the histogram h, using numpy.histogram. The probability to find a value in each bin is then given by h/h.sum().
Having a value v=0.77, of which we want to know the probability according to the distribution, we can find out the bin in which it would belong by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
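import numpy as np; np.random.seed(0)
x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())
# side="right" returns the index of the first edge above 0.77, so the bin itself is ind-1
ind = np.searchsorted(bins, 0.77, side="right")
print(prob[ind-1]) # probability of the bin [0.7, 0.8)
This is the probability of sampling a value in the bin around 0.77. For a unit-scale Rayleigh distribution the expected probability of that bin is about 5.7%, so the printed number should be close to that.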
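A small hand-checkable case can help when reasoning about which index np.searchsorted returns relative to the histogram bins. The edges and sample values below are made up for illustration.

```python
import numpy as np

bins = np.array([0.0, 1.0, 2.0, 3.0])         # three bins: [0,1), [1,2), [2,3]
x = np.array([0.2, 0.5, 1.1, 1.5, 1.9, 2.5])
h, _ = np.histogram(x, bins=bins)              # counts per bin: [2 3 1]
prob = h / h.sum()

v = 1.3
ind = np.searchsorted(bins, v, side="right")   # 2: index of the first edge above v
bin_index = ind - 1                            # 1: the bin [1, 2) that contains v

print(prob[bin_index])                         # 0.5
```

np.digitize(v, bins) - 1 gives the same bin index and may read more naturally for this purpose.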
A different option would be to interpolate the histogram between the bin centers to find the probability.
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())
points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$'] # plain Python 3 strings; the ur'' prefix is Python 2 only
colors = ["k", "crimson" , "gold"]
labels = list("ABC")
kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize="2", mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)
for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    #plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)
ax.legend()
plt.tight_layout()
plt.show()
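For the counting part of the question, a vectorised version of the SameValue idea and a bin-free way to place a known point on the cumulative curve could look like the sketch below. This swaps the binned cumulative histogram for an empirical CDF, which is a different technique from the one used above; count_within and ecdf_at are hypothetical helper names and the sample values are made up.

```python
import numpy as np

def count_within(sim, value, uncert):
    """Number of simulated values strictly within +/- uncert of a known value."""
    sim = np.asarray(sim)
    return int(np.sum(np.abs(sim - value) < uncert))

def ecdf_at(sim, value):
    """Fraction of simulated values that are <= value (empirical CDF)."""
    sim_sorted = np.sort(sim)
    return np.searchsorted(sim_sorted, value, side="right") / len(sim_sorted)

sim = np.array([0.1, 0.4, 0.5, 0.8, 1.2, 1.5, 2.0, 2.2, 3.0, 3.5])

print(count_within(sim, 0.5, 0.2))   # 2  (the values 0.4 and 0.5)
print(ecdf_at(sim, 1.2))             # 0.5
```

A known point v can then be drawn at (v, ecdf_at(sim, v)) on a cumulative plot, so that it lies on the curve of the simulated data.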

making the y-axis of a histogram probability, python

I have plotted a histogram in Python using matplotlib, and I need the y-axis to be the probability, but I cannot find how to do this. For example, I want it to look similar to this: http://www.mathamazement.com/images/Pre-Calculus/10_Sequences-Series-and-Summation-Notation/10_07_Probability/10-coin-toss-histogram.JPG
Here is my code; I will attach my plot as well if needed.
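import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# a (the data) and bin (the bins argument) are defined elsewhere
plt.figure(figsize=(10,10))
mu = np.mean(a) # mean of distribution
sigma = np.std(a) # standard deviation of distribution
n, bins, patches = plt.hist(a, bin, density=True, facecolor='white') # density replaces the removed normed argument
y = norm.pdf(bins, mu, sigma) # replaces the removed mlab.normpdf
plt.plot(bins, y, 'r--')
print(np.sum(n*np.diff(bins))) # proves the integral over the bars is unity
plt.show()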
Just divide all your sample counts by the total number of samples. This gives the probability rather than the count.
As @SteveBarnes points out, divide the sample counts by the total number of samples to get the probabilities for each bin. To get a plot like the one you linked to, your "bins" should just be the integers from 0 to 10. A simple way to compute the histogram for a sample from a discrete distribution is np.bincount.
Here's a snippet that creates a plot like the one you linked to:
import numpy as np
import matplotlib.pyplot as plt
n = 10
num_samples = 10000
# Generate a random sample.
a = np.random.binomial(n, 0.5, size=num_samples)
# Count the occurrences in the sample.
b = np.bincount(a, minlength=n+1)
# p is the array of probabilities.
p = b / float(b.sum())
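# align='edge' makes each bar start at k - 0.5 so that it is centred on k
plt.bar(np.arange(len(b)) - 0.5, p, width=1, facecolor='white', edgecolor='black', align='edge')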
plt.xlim(-0.5, n + 0.5)
plt.xlabel("Number of heads (k)")
plt.ylabel("P(k)")
plt.show()
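For continuous data, where np.bincount does not apply, the same "divide by the number of samples" idea can be expressed with the weights argument. This is a sketch on synthetic data using np.histogram; plt.hist accepts the same weights argument.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)              # synthetic sample

# each sample contributes 1/N, so bar heights are probabilities per bin
weights = np.full(a.shape, 1.0 / a.size)
p, edges = np.histogram(a, bins=20, weights=weights)

print(p.sum())   # 1 up to floating-point rounding
```

Unlike a density-normalised histogram, whose area is 1, these bar heights themselves sum to 1, so the y-axis reads directly as P(bin).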

matplotlib reducing y axis by a factor to represent percent frequency

So I have 3 lists of fractions and I used a histogram to show how often each fraction showed up. The problem is that there are 100000 of each and I need to reduce the y values by that much to get a frequency percentage. Here is my code now:
bins = numpy.linspace(0, 1, 50)
z = np.linspace(0,1,50)
g = (lambda z: 2 * np.exp((-2)*(z**2)*(1000000000)))
w = g(z)
plt.plot(z,w)
pyplot.hist(Vrand, bins, alpha=0.5)
pyplot.hist(Vfirst, bins, alpha=0.5)
pyplot.hist(Vmin, bins, alpha=0.2)
pyplot.show()
It is in the last chunk of code that I need the y-axis divided by 100000.
Update:
When I try to divide by 100000 using np histograms, all the values are 0 except the line above.
bins = numpy.linspace(0, 1, 50)
z = np.linspace(0,1,50)
g = (lambda z: 2 * np.exp((-2)*(z**2)*(100000)))
w = g(z)
plt.plot(z,w)
hist, bins = np.histogram(Vrand, bins)
hist /= 100000.0
widths = np.diff(bins)
pyplot.bar(bins[:-1], hist, widths)
Matplotlib's hist has a "normed" parameter (called density in current Matplotlib versions) that rescales the bars so that the total area of the histogram is 1:
pyplot.hist(Vrand, bins, density=True)
Alternatively, use the weights parameter to scale it by a different coefficient.
You can also use the return value of numpy's histogram and scale it however you want (tested in Python 3.x):
hist, bins = np.histogram(Vrand, bins)
hist = hist / 100000.0 # new float array; an in-place /= cannot store floats in the integer array
widths = np.diff(bins)
pyplot.bar(bins[:-1], hist, widths)
The first two solutions are, in my opinion, better, as we should not "reinvent the wheel" and implement by hand what is already done in the library.
Firstly, I would recommend you think about your style: use either plt or pyplot, not both. You should also include your imports and some fake data in the example code to illustrate the problem.
So, the issue is that in the following example the counts are very large:
bins = np.linspace(0, 1, 50)
data = np.random.normal(0.5, 0.1, size=100000)
plt.hist(data, bins)
plt.show()
You tried to fix this by dividing the bin count by an integer:
hist, bins = np.histogram(data, bins)
hist_divided = hist/10000
The issue here is that hist is an array of ints, and dividing integers is tricky: in Python 2, / between two integers performs floor division. For example
>>> 2/3
0
>>> 3/2
1
This is what gives you a row of 0's if you pick too large a value to divide by. Instead you can divide by a float as suggested by @lejlot; notice you need to divide by 10000.0 and not 10000.
Or follow the other suggestion made by @lejlot and just use the normed argument in the call to 'hist'. This rescales the values in hist such that the area under the histogram (the sum of each value times its bin width) is 1, which is very useful when comparing distributions.
I also notice you appear to be having this issue because you're plotting a line plot on the same axes as the histogram. If this line plot is outside of the [0,1] range you will again encounter the same issue; instead of rescaling the histogram axis, you should twin the x axis so that the line gets its own y-axis.
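The integer-division pitfall described above can be reproduced with NumPy arrays in Python 3 as well. This is a sketch with made-up bin counts; the in-place case is written defensively because the behaviour depends on the array dtype.

```python
import numpy as np

hist = np.array([0, 3, 12, 40, 30, 15])   # made-up integer bin counts
total = hist.sum()                         # 100

floor_div = hist // 1000     # integer (floor) division: every entry becomes 0
fraction = hist / total      # true division in Python 3: floats that sum to 1

in_place_error = None
try:
    hist /= 1000.0           # in-place division cannot store floats in an int array
except TypeError as exc:
    in_place_error = type(exc).__name__

print(floor_div)             # [0 0 0 0 0 0]
print(fraction.sum())        # 1 up to floating-point rounding
```

Assigning the result to a new name, as in fraction = hist / total, avoids both the row of zeros and the in-place casting error.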
