I am doing a project using Python where I have two arrays of data. Let's call them pc and pnc. I am required to plot a cumulative distribution of both of these on the same graph. For pc it is supposed to be a less-than plot, i.e. at (x,y), y points in pc must have a value less than x. For pnc it is to be a more-than plot, i.e. at (x,y), y points in pnc must have a value more than x.
I have tried using the histogram function, pyplot.hist. Is there a better and easier way to do what I want? Also, it has to be plotted on a logarithmic scale on the x-axis.
You were close. Instead of plt.hist, use numpy.histogram, which gives you both the values and the bins; then you can plot the cumulative with ease:
import numpy as np
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
# evaluate the cumulative counts
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
# plot the survival function
plt.plot(base[:-1], len(data)-cumulative, c='green')
plt.show()
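To tie this back to the original question: a minimal sketch, assuming pc and pnc are 1-D arrays (the lognormal arrays below are hypothetical stand-ins), with plt.xscale('log') providing the required logarithmic x-axis:
import numpy as np
import matplotlib.pyplot as plt
# hypothetical stand-ins for the question's pc and pnc arrays
pc = np.random.lognormal(mean=3.0, sigma=1.0, size=1000)
pnc = np.random.lognormal(mean=3.5, sigma=1.0, size=1000)
# "less than" plot for pc: cumulative counts
values_pc, base_pc = np.histogram(pc, bins=40)
plt.plot(base_pc[:-1], np.cumsum(values_pc), c='blue', label='pc (less than)')
# "more than" plot for pnc: survival counts
values_pnc, base_pnc = np.histogram(pnc, bins=40)
plt.plot(base_pnc[:-1], len(pnc) - np.cumsum(values_pnc), c='green', label='pnc (more than)')
plt.xscale('log')  # logarithmic x-axis, as the question requires
plt.legend()
plt.show()
On a log x-axis, log-spaced bins (e.g. np.logspace over the data's range) usually sample the axis more evenly than the linear bins used here.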
Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy). You can instead simply sort all the x values: the index of each value is then the number of values that are smaller. This shorter and simpler solution looks like this:
import numpy as np
import matplotlib.pyplot as plt
# Some fake data:
data = np.random.randn(1000)
sorted_data = np.sort(data) # Or data.sort(), if data can be modified
# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size)) # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size)) # From the number of data points-1 to 0
plt.show()
Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.
The result is:
You can see that it is more ragged than the output of EnricoGiampieri's answer, but this one is the real cumulative distribution (instead of an approximate, fuzzier, binned version of it).
PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:
plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size+1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size+1))
There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
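A further side note (my addition, not from the original answers): plt.step defaults to where='pre'. For a "less than or equal" count, the jump should occur at each data point and then stay flat until the next one, which is what where='post' gives, with counts running from 1 to the number of data points:
plt.step(sorted_data, np.arange(1, sorted_data.size + 1), where='post')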
After a conclusive discussion with @EOL, I wanted to post my solution (upper-left subplot) using a random Gaussian sample, as a summary:
import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt
def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density
    function (PDF).
    """
    term1 = 1.0 / (sqrt(2*np.pi) * sigma)
    term2 = np.exp(-0.5 * ((x - mu)/sigma)**2)
    return term1 * term2
# Drawing sample data points
##################################################
# Random Gaussian data (mean=0, stdev=5)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
data1.sort(); data2.sort()
# note: min(data1 + data2) would add the arrays elementwise, so compare the arrays' extrema instead
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))
##################################################
fig = plt.gcf()
fig.set_size_inches(12,11)
# Cumulative distributions, stepwise:
plt.subplot(2,2,1)
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label=r'$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label=r'$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Cumulative distributions, smooth:
plt.subplot(2,2,2)
plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label=r'$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label=r'$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Probability densities at the sample points
plt.subplot(2,2,3)
pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label=r'$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
# Probability density function
plt.subplot(2,2,4)
x = np.arange(min_val, max_val, 0.05)
pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label=r'$\mu=2, \sigma=7$')
plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
plt.show()
In order to add my own contribution to the community, here I share my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:
def hist(data, bins, title, labels, range=None):
    fig = plt.figure(figsize=(15, 8))
    ax = plt.axes()
    plt.ylabel("Proportion")
    # density=True replaces the normed=True argument removed from recent matplotlib
    values, base, _ = plt.hist(data, bins=bins, density=True, alpha=0.5,
                               color="green", range=range, label="Histogram")
    ax_bis = ax.twinx()
    values = np.append(values, 0)
    ax_bis.plot(base, np.cumsum(values) / np.cumsum(values)[-1], color='darkorange',
                marker='o', linestyle='-', markersize=1, label="Cumulative Histogram")
    plt.xlabel(labels)
    plt.ylabel("Proportion")
    plt.title(title)
    ax_bis.legend()
    ax.legend()
    plt.show()
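A hypothetical call, assuming the imports from the earlier snippets (the argument values are made up for illustration):
data = np.random.randn(10000)
hist(data, bins=40, title="Gaussian sample", labels="x value")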
If anyone wonders what it looks like, here is an example (with seaborn styling activated):
Also, concerning the double grid (the white lines): I always used to struggle to get a nice double grid. Here is an interesting way to circumvent the problem: How to put grid lines from the secondary axis behind the primary plot?
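For reference, the gist of the trick from that link, as a minimal sketch assuming the ax and ax_bis from the function above: raise the primary axes above the twin and make its background transparent, so the secondary grid is drawn behind the primary plot:
ax.set_zorder(ax_bis.get_zorder() + 1)  # draw the primary axes (and its grid) on top
ax.patch.set_visible(False)  # transparent primary background keeps ax_bis visible underneath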
The simplest way to generate this graph is with seaborn:
import seaborn as sns
sns.ecdfplot(data)
Here is the documentation
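As a sketch of how this could cover the original pc/pnc question (hypothetical data; complementary=True gives the "more than" curve, and stat='count' plots counts instead of proportions):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pc = np.random.lognormal(3.0, 1.0, 1000)   # hypothetical data
pnc = np.random.lognormal(3.5, 1.0, 1000)
sns.ecdfplot(pc, stat='count', label='pc (less than)')
sns.ecdfplot(pnc, stat='count', complementary=True, label='pnc (more than)')
plt.xscale('log')
plt.legend()
plt.show()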
Related
I'm trying to reproduce the following plot from this paper:
The plot shows the mean accuracy across five runs, and the vertical bars show the min and max accuracy.
How can I add those vertical bars with specific values?
My current code:
from typing import Dict, List
import matplotlib.pyplot as plt

def plot_losses(losses: Dict[float, Dict[float, List[float]]]) -> None:
    """
    Plot the evolution of the loss regarding the sparsity level and iteration step
    Args:
        losses (Dict[float, Dict[float, List[float]]]): Dict containing the losses
            regarding the sparsity level and iteration step
    """
    plt.clf()
    plt.figure(figsize=(20, 10))
    plt.tight_layout()
    sparsity_levels = [round(sparsity_level, 2) for sparsity_level in losses.keys()]
    for sparsity_level, key in zip(sparsity_levels, losses.keys()):
        plt.plot(list(losses[key].keys()), list(losses[key].values()), '+--',
                 label=f"{100 - sparsity_level:.2f}%")
    plt.show()
Prefer plt.errorbar (over plt.plot inside the for loop of plot_losses) and use the argument yerr to add the vertical bars with min and max values.
Here is an example:
import numpy as np
import matplotlib.pyplot as plt
# Generate data
x = np.arange(10) + 1
y1 = x/20
y2 = x/25
# Generate data for pseudo-errorbars
# (flat length-10 arrays: yerr must be a scalar, a 1-D array, or a (2, N) array)
y1_err = np.array([y1[0::2]/20, y1[1::2]/7]).reshape(10)
y2_err = np.array([y2[0::2]/30, y2[1::2]/13]).reshape(10)
# Plot data
plt.errorbar(x, y1, yerr=y1_err, label="100", capsize=3, capthick=3.5)
plt.errorbar(x, y2, yerr=y2_err, label="51.3", capsize=3, capthick=3.5)
plt.legend(bbox_to_anchor=(0.95, 1.1), ncol=3)
plt.show()
This gives:
I have a bunch of minute data that have many seasonalities (day, week, month, year).
I want to make a long forecast of it using wavelet transforms.
So far I've done a Fourier transform and got an expression for predicting the daily seasonality.
So I've extracted frequencies and amplitudes for sine functions.
Seasonal trend (orange) and reconstructed seasonal trend (blue)
The blue curve is the sum of 5 sine functions with known amplitudes and frequencies. And if I fit x-values to this equation, I get y-values.
I'm also trying to do the same thing with a wavelet transform (using pywt).
import numpy as np
import matplotlib.pyplot as plt
import pywt

x = np.linspace(0, 1, num=2048)
chirp_signal = np.sin(250 * np.pi * x**2)
data = chirp_signal
fig, ax = plt.subplots(figsize=(40, 10))
ax.set_title("Original Signal: ")
ax.plot(data)
plt.show()

waveletname = 'sym5'
fig, axarr = plt.subplots(nrows=5, ncols=2, figsize=(40, 10))
for ii in range(5):
    (data, coeff_d) = pywt.dwt(data, waveletname)
    axarr[ii, 0].plot(data, 'r')
    axarr[ii, 1].plot(coeff_d, 'g')
    axarr[ii, 0].set_ylabel("Level {}".format(ii + 1), fontsize=20, rotation=90)
    axarr[ii, 0].set_yticklabels([])
    if ii == 0:
        axarr[ii, 0].set_title("Approximation coefficients", fontsize=20)
        axarr[ii, 1].set_title("Detail coefficients", fontsize=20)
    axarr[ii, 1].set_yticklabels([])
plt.tight_layout()
plt.show()
The example is from this guide.
And the question is: is it possible (and if "yes", then how) to make a function from the received coefficients, fit x-values to it, and get the corresponding y-values?
Thanks!
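A hedged note on what pywt itself offers: pywt.waverec inverts the decomposition, reconstructing y-values on the original sample grid from the coefficients. Unlike the Fourier sine-sum, it does not yield a closed-form function you can evaluate at arbitrary x. A minimal sketch (the trend extraction by zeroing detail levels is an illustration, not from the guide):
import numpy as np
import pywt
x = np.linspace(0, 1, num=2048)
signal = np.sin(250 * np.pi * x**2)
# multilevel decomposition: [cA5, cD5, cD4, cD3, cD2, cD1]
coeffs = pywt.wavedec(signal, 'sym5', level=5)
# perfect reconstruction from all coefficients
reconstructed = pywt.waverec(coeffs, 'sym5')
# keep only the coarse approximation to obtain a smoothed trend
smooth_coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
trend = pywt.waverec(smooth_coeffs, 'sym5')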
I need to plot a histogram from some data in another file.
At the moment I have code that plots a scatter plot and fits a Gaussian.
The x-value is whatever number is on the corresponding line of the data file it's reading in (after the first 12 lines of other information, i.e. line 13 is the first event), and the y-value is the number of the line multiplied by a value.
It then plots and fits the scatter, but I need to be able to plot it as a histogram, and to be able to change the bin width/number (i.e. add bins 1, 2, 3 and 4 together to have 1/4 of the bins overall with 4 times the number of events, so I guess adding together multiple lines from the data), which is where I am stuck.
How would I go about making this into a histogram and adjusting the bin width/number?
Code below; I didn't know how to make it pretty. Let me know if I can make it a bit easier to read.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Define gaussian
def gaussian(x, amp, cen, wid):
    """1-d gaussian: gaussian(x, amp, cen, wid)"""
    return (amp / (sqrt(2*pi) * wid)) * exp(-(x-cen)**2 / (2*wid**2))
## Define constants
stderrThreshold = 10
minimumAmplitude = 0.1
approxcen = 780
MaestroT = 53
## Define parameters
amps = []; ampserr = []; ts = []
# baseFolder, fileToRun, prefix and numfiles are defined elsewhere in the full script (linked below)
folderToAnalyze = baseFolder + fileToRun + '\\'
## Generate the time array
for n in range(0, numfiles):
    ## Load text file
    x = np.linspace(0, 8191, 8192)
    fullprefix = folderToAnalyze + prefix + str(n).zfill(3)
    y = loadtxt(fullprefix + ".Spe", skiprows=12, max_rows=8192)
    ## Make figure
    fig, ax = plt.subplots(figsize=(15, 8))
    fig.suptitle('Coincidence Detections', fontsize=20)
    plt.xlabel('Bins', fontsize=14)
    plt.ylabel('Counts', fontsize=14)
    ## Plot data
    ax.plot(x, y, 'bo')
    ax.set_xlim(600, 1000)
    ## Fit data to Gaussian
    gmodel = Model(gaussian)
    result = gmodel.fit(y, x=x, amp=8, cen=approxcen, wid=1)
    ## Plot results, save the figure, then close it
    ax.plot(x, result.best_fit, 'r-', label='best fit')
    ax.legend(loc='best')
    texttoplot = result.fit_report()
    ax.text(0.02, 0.5, texttoplot, transform=ax.transAxes)
    fig.savefig(fullprefix + ".png", pad_inches=0.5)
    plt.close()
Current Output: Scatter plots that do show the expected distribution and plot of the data (they do, however, have a crappy reduced chi^2, but one problem at a time)
Expected Output: Histogram plots of the same data, with the same distribution and fitting, when each event is plotted as a separate bin, and hopefully the possibility to add these bins together to reduce error bars
Errors: N/A
Data: It's basically a standard distribution over 8192 lines. Full data for 1 file is here. Also the original .Spe file, the scatter plot and the full version of the code
2020-11-23 Update From Answer Comments:
Hi, I've been trying to implement this for a little while and have come unstuck. I've tried to follow your example closely; however, I still get a histogram with a bin width of 1 (i.e. not adding together). I also get a second blank graph in the printouts, and the reports only print in the IDE (though I am working on that one, and reckon I will have it soon). Also, for some reason, it seems to stop after 3 out of 50 iterations of the loop.
This is the code in its current state:
This is the output I'm getting:
This is the original output:
And just in case it's useful, this is the raw data. I seem to be having trouble replicating your last 2 figures.
The ideal would be to just be able to alter the constant on line 30 to whatever the desired bin width is, and have it run with that bin width on that occasion.
The scatter plot, in this case, is a histogram, except with dots instead of bars.
.Spe is a bin count for each event.
x = np.linspace(0, 8191, 8192) defines the bins, and the bin width is 1.
Construct a bar plot instead of a scatter plot
ax.bar(x, y) instead of ax.plot(x, y, 'bo')
As a result of the existing data, the following plot is a histogram with a very wide distribution.
There are values ranging from 321 to 1585
ax.set_xlim(300, 1800)
The benefit of this data is that it's easy to recreate the raw distribution based on x, a bin size of 1, and y being the respective count for each x.
np.repeat can create an array with repeated elements
import numpy as np
import matplotlib.pyplot as plt
# given x and y from the loop
# set the type as int
y = y.astype(int)
x = x.astype(int)
# create the data
data = np.repeat(x, y)
# determine the range of x
x_range = range(min(data), max(data)+1)
# determine the length of x
x_len = len(x_range)
# plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
ax2.boxplot(data, vert=False)
plt.show()
Given data, you can now perform whatever analysis is required.
SO: Fit gaussian to noisy data with lmfit
LMFIT Docs
Cross Validated might be a better site for diving into the model
All of the error calculation parameters come from the model result. If you calculate new x and y from np.histogram for different bin widths, that may affect the errors.
approxcen = 780 is also an input to result
# given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
# determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
# Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
# result
print(result.fit_report())
[out]:
[[Model]]
    Model(gaussian)
[[Fit Statistics]]
    # fitting method    = leastsq
    # function evals    = 314
    # data points       = 158
    # variables         = 3
    chi-square          = 397.702574
    reduced chi-square  = 2.56582306
    Akaike info crit    = 151.851284
    Bayesian info crit  = 161.039069
[[Variables]]
    amp:  1174.80608 +/- 37.1663147 (3.16%) (init = 8)
    cen:  775.535731 +/- 0.46232727 (0.06%) (init = 780)
    wid:  12.6563219 +/- 0.46232727 (3.65%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
    C(amp, wid) = 0.577
# plot
plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.figure(figsize=(20, 8))
plt.bar(x[:-1], y)
plt.xlim(700, 850)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.grid()
As we can see from the next code block, the error is related to the following parameters
stderrThreshold = 10
minimumAmplitude = 0.1
MaestroT = 53
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)
Given that regplot calculates means in intervals and bootstraps to find confidence intervals for each bin, it seems like a waste to have to recalculate them manually for further study, so:
Question: How do I access the calculated means and confidence intervals of a regplot?
Example: This code produces a nice plot of bin means with CIs:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# just some random numbers to get started
fig, ax = plt.subplots()
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4. / 10
x_bins = np.arange(-2 + binwidth/2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
plt.show()
Result:
Regplot showing binned data w. CIs
Not that calculating the means bin by bin isn't easily doable, but the CIs are calculated using random resampling, so recomputing them won't reproduce the plotted values exactly. It would be nice to have the exact same numbers accessible as are plotted, so how do I access them? There must be some sort of get_*-method I'm overlooking.
Set-up
Setting up as in your MWE:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Random numbers for plotting
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4 / 10
x_bins = np.arange(binwidth/2 - 2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
This gives our starting point as:
Extracting the Confidence Intervals
We can extract the confidence intervals by looping over the plotted lines and extracting the minimum and maximum values (corresponding to the lower and upper CIs respectively):
ax = plt.gca()
lower = [line.get_ydata().min() for line in ax.lines]
upper = [line.get_ydata().max() for line in ax.lines]
As a sanity check we can plot these extracted points on top of our original data (shown here by red crosses):
plt.scatter(x_bins, lower, marker='x', color='C3', zorder=3)
plt.scatter(x_bins, upper, marker='x', color='C3', zorder=3)
Extracting the Means
The values of the means can be extracted from ax.collections as:
means = ax.collections[0].get_offsets()[:, 1]
Again, as a sanity check we can overlay our extracted values on the original plot:
plt.scatter(x_bins, means, color='C1', marker='x', zorder=3)
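Putting the pieces together (my consolidation, assuming the axes hold only the regplot artists), the bin centres, means, and CI bounds can be gathered into a single array for further study:
import numpy as np
means = ax.collections[0].get_offsets()[:, 1]
ci_lower = np.array([line.get_ydata().min() for line in ax.lines])
ci_upper = np.array([line.get_ydata().max() for line in ax.lines])
summary = np.column_stack([x_bins, means, ci_lower, ci_upper])  # one row per bin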
I'm trying to generate random samples from a lognormal distribution in Python, the application is for simulating network traffic. I'd like to generate samples such that:
The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)
My strategy is to use the inverse CDF (or Smirnov transform I believe):
Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.
The problem is, when I calculate the 10th and 90th percentiles at the end, I get completely the wrong numbers.
Here is my code:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)
# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)
# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)
# Generate random uniform data (named u to avoid shadowing the built-in input)
u = np.random.uniform(size=10000)
# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, u)).argmin(0)]
# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)
Which produces the output:
(223.99999999999997, 2480.0000000000009)
... and not the (100, 1000) that I would like to see. Any advice appreciated!
First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, the log-normal is defined via the base-e logarithm (aka the natural log), which means 320 = 10^2.5 = e^5.77.
Second, I would approach the problem in a different way. You need m and s to sample from a log-normal.
If you look at the wiki article above, you can see that it is a two-parameter distribution. And you have exactly two conditions:
Mode = exp(m - s*s) = 320
80% of samples in [100, 1000] => CDF(1000, m, s) - CDF(100, m, s) = 0.8
where the CDF is expressed via the error function (a pretty common function found in any library).
So: two non-linear equations for two parameters. Solve them, find m and s, and put them into any standard log-normal sampling routine.
Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it's quite trivial to do it manually):
# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from scipy.stats import lognorm
import math
target_modal_value = 320
# Define function to find roots of
def equation(s):
    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2
    # Get probability mass from the CDF at 100 and 1000; it should equal 0.8.
    # Rearrange the equation so that it equals 0, to find the root (value of s)
    return (lognorm.cdf(1000, s=s, scale=math.exp(m)) - lognorm.cdf(100, s=s, scale=math.exp(m)) - 0.8)
# Solve non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)[0]  # fsolve returns an array; take the scalar root
# From s, find m
m = math.log(target_modal_value) + s**2
print('m=' + str(m) + ', s=' + str(s))
# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')
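To actually draw the traffic samples (the original goal) and sanity-check the result, one can sample from the fitted distribution; this check is my addition, not part of the original answer:
# draw samples from the fitted log-normal and check the 80% constraint directly
samples = lognorm.rvs(s=s, scale=math.exp(m), size=100000)
print(np.mean((samples >= 100) & (samples <= 1000)))  # should be close to 0.8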