I have a data distribution that I want to fit a Poisson distribution to. My data looks like this:
I tried to fit it with:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st

mu = herd_size["COW_NUM"].mean()
ax = sns.displot(data=herd_size["COW_NUM"], kde=True)
ax.set(xlabel='Size', title='Herd size distribution & poisson distribution')
plt.plot(np.arange(0, 2000, 80),
         [st.poisson.pmf(np.arange(i, i+80), mu).sum() * len(herd_size["COW_NUM"])
          for i in np.arange(0, 2000, 80)], color='red')
# every bin contains approximately 80 observations
plt.show()
but the result is not on the same scale:
UPDATE
I tried to apply a negative binomial distribution with this code:
n = len(herd_size["COW_NUM"])
p = herd_size["COW_NUM"].mean() / (herd_size["COW_NUM"].mean() + 2)
ax = sns.displot(data=herd_size["COW_NUM"], kde=True)
ax.set(xlabel='Size', title='Herd size distribution & negative binomial distribution')
plt.plot(np.arange(0, 2000, 80),
         [st.nbinom.pmf(np.arange(i, i+80), n, p).sum() * len(herd_size["COW_NUM"])
          for i in np.arange(0, 2000, 80)], color='red')
# every bin contains approximately 80 observations
plt.show()
but I got this:
For what you need to plot, it might be easier to provide the bins for your histogram:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import poisson

herd_size = pd.DataFrame({'COW_NUM': np.random.poisson(200, 2000)})

binwidth = 10
xstart = 150
xend = 280
bins = np.arange(xstart, xend, binwidth)

o = sns.histplot(data=herd_size["COW_NUM"], kde=True, bins=bins)
Then calculate your mean and total number:
mu = herd_size["COW_NUM"].mean()
n = len(herd_size)
The expected frequency in each bin is the difference between the CDF evaluated at the bin's right and left edges:
plt.plot(bins + binwidth/2, n * (poisson.cdf(bins + binwidth, mu) - poisson.cdf(bins, mu)), color='red')
Your data are overdispersed: for a Poisson you don't expect the data to be so spread out. So you need to use a gamma or a negative binomial distribution to fit it, for example:
from scipy.stats import nbinom

# simulate overdispersed counts
herd_size = pd.DataFrame({'COW_NUM': nbinom.rvs(n=1, p=0.004, size=2000)})

binwidth = 50
xstart = 0
xend = 2000
bins = np.arange(xstart, xend, binwidth)

# method-of-moments estimates: scipy's nbinom has mean r*(1-p)/p and
# variance mean/p, so p = mean/variance and r = mean**2/(variance - mean)
Var = herd_size["COW_NUM"].var()
mu = herd_size["COW_NUM"].mean()
p = mu / Var
r = mu**2 / (Var - mu)
n = len(herd_size)

o = sns.histplot(data=herd_size["COW_NUM"], kde=True, bins=bins)
plt.plot(bins + binwidth/2,
         n * (nbinom.cdf(bins + binwidth, r, p) - nbinom.cdf(bins, r, p)),
         color='red')
Your plot is (at least approximately) correct; the problem is with modeling your data as Poisson. As lambda grows large the Poisson looks more and more like a normal distribution (see the plot from Wikipedia). A Poisson distribution has its variance equal to its mean, so with a mean of ~240 you have a standard deviation of ~15.5. The net result is that outcomes for a Poisson(240) should overwhelmingly fall between 210 and 270, which is what your red plot shows. Try fitting a different distribution to your data.
I just spotted StupidWolf's answer. Other than using a mean of 200 rather than 240, his histogram shows the same behavior described above.
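A quick numerical sanity check of that claim, as a sketch using scipy (the mean of 240 is the one assumed above):
import numpy as np
from scipy.stats import poisson

mu = 240
print(poisson.std(mu))                              # ~15.5
print(poisson.cdf(270, mu) - poisson.cdf(209, mu))  # ~0.95 of the mass lies in [210, 270]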
I am trying to fit a Gamma CDF using scipy.stats.gamma, but I do not know what exactly the a parameter is, or how the location and scale parameters are calculated. Different sources give different ways to calculate them, and it's very frustrating. I am using the code below, which does not give the correct CDF. Thanks in advance.
import numpy as np
from scipy.stats import gamma

loc = (np.mean(jan))**2 / np.var(jan)
scale = np.var(jan) / np.mean(jan)
Jancdf = gamma.cdf(jan, a, loc=loc, scale=scale)
a is the shape. What you have tried works only in the case where loc = 0. We start with two examples, both with shape (a) = 10 and scale = 5; the second, d1plus50, differs from the first by 50, and you can see the shift, which is dictated by loc:
import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt

d1 = gamma.rvs(a=10, scale=5, size=1000, random_state=99)
plt.hist(d1, bins=50, label='loc=0,shape=10,scale=5', density=True)

d1plus50 = gamma.rvs(a=10, loc=50, scale=5, size=1000, random_state=99)
plt.hist(d1plus50, bins=50, label='loc=50,shape=10,scale=5', density=True)
plt.legend(loc='upper right')
So you have 3 parameters to estimate from the data. One way is to use gamma.fit; we apply this to the distribution simulated with loc = 0:
xlin = np.linspace(0, 160, 50)
fit_shape, fit_loc, fit_scale = gamma.fit(d1)
print([fit_shape, fit_loc, fit_scale])
# [11.135335235456457, -1.9431969603988053, 4.693776771991816]

plt.hist(d1, bins=50, label='loc=0,shape=10,scale=5', density=True)
plt.plot(xlin, gamma.pdf(xlin, a=fit_shape, loc=fit_loc, scale=fit_scale))
And if we do it for the distribution simulated with a nonzero loc, you can see that loc is estimated correctly, as well as shape and scale:
fit_shape, fit_loc, fit_scale = gamma.fit(d1plus50)
print([fit_shape, fit_loc, fit_scale])
# [11.135287555530564, 48.05688649976989, 4.693789434095116]

plt.hist(d1plus50, bins=50, label='loc=50,shape=10,scale=5', density=True)
plt.plot(xlin, gamma.pdf(xlin, a=fit_shape, loc=fit_loc, scale=fit_scale))
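For completeness, here is a small sketch (reusing d1 from above) of the method-of-moments relations the question attempted, which are valid only when loc = 0, together with gamma.fit pinned to loc = 0 via floc:
import numpy as np

# method of moments with loc = 0: shape a = mean^2/var, scale = var/mean
a_mm = np.mean(d1)**2 / np.var(d1)
scale_mm = np.var(d1) / np.mean(d1)
print(a_mm, scale_mm)  # roughly 10 and 5 for this simulated data

# fit with loc held fixed at 0; should roughly agree with the values above
fit_shape0, fit_loc0, fit_scale0 = gamma.fit(d1, floc=0)
print(fit_shape0, fit_scale0)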
Problem
In a paper I am reading, the authors define a new metric and claim some advantages over previous metrics. They verify their claim on some synthetic data, which looks like the following.
The implementation of their metric is pretty straightforward. However, I am not sure how they created this kind of synthetic data.
What I Have Done
This looks Gaussian where x is only within certain intervals. I tried the following code but did not get anything similar to the plot presented in the paper.
import numpy as np

def generate_gaussian(size=1000, lb=-0.1, up=0.1):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data

np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 0.3
background_neg = base + 0.7
Now I am wondering whether the authors created these data using some special distribution (other than Gaussian) that I do not know about.
Numpy has a numpy.random.normal that draws random samples from a normal (Gaussian) distribution.
import numpy as np
import matplotlib.pyplot as plt
sigma = 0.05
s0 = np.random.normal(0.2, sigma, 5000)
s1 = np.random.normal(0.6, sigma, 5000)
plt.hist(s0, 300, density=True, color="b")
plt.hist(s1, 300, density=True, color="r")
plt.xlim(0, 1)
plt.show()
You can change the values of mu (mean) and sigma to alter the distributions:
mu = 0.55
sigma = 0.1
dist = np.random.normal(mu, sigma, 5000)
You have cut off the data at +/- 0.1. A normalised Gaussian distribution only 'looks Gaussian' over a range of approximately +/- 3 standard deviations. Try this:
import numpy as np

def generate_gaussian(size=1000, lb=-3, up=3):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data

np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 5
background_neg = base + 15
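If you really do want a Gaussian that is hard-truncated to a window, scipy.stats.truncnorm samples it directly instead of discarding draws; a minimal sketch (note that a and b are given in units of the standard deviation around loc):
import numpy as np
from scipy.stats import truncnorm

# 1000 points from a standard normal truncated to [-3, 3]
base = truncnorm.rvs(a=-3, b=3, loc=0, scale=1, size=1000)
background_pos = base + 5
background_neg = base + 15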
You can use scipy.stats.norm.
Import libraries:
>>> from scipy.stats import norm
>>> from matplotlib import pyplot
Plot:
>>> pyplot.hist(norm.rvs(loc=1, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_1')
>>> pyplot.hist(norm.rvs(loc=5, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_2')
>>> pyplot.legend()
>>> pyplot.show()
Clarification:
A normal distribution is defined by its mean (loc, the distribution center) and its standard deviation (scale, a measure of the distribution's dispersion or width). rvs generates random samples from the desired normal distribution, of size size. For example, the following code generates 4 random samples from a normal distribution (mean = 1, SD = 1).
>>> norm.rvs(loc=1, scale=1, size=4)
array([ 0.52154255, 1.40873701, 1.55959291, -0.01730568])
I am trying to plot a normal distribution curve using Python. First I did it manually by using the normal probability density function, and then I found there's an existing function pdf in scipy under the stats module. However, the results I get are quite different.
Below is the example that I tried:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
mean = 5
std_dev = 2
num_dist = 50
# Draw random samples from a normal (Gaussian) distribution
normalDist_dataset = np.random.normal(mean, std_dev, num_dist)
# Sort these values.
normalDist_dataset = sorted(normalDist_dataset)
# Create the bins and histogram
plt.figure(figsize=(15,7))
count, bins, ignored = plt.hist(normalDist_dataset, num_dist, density=True)
new_mean = np.mean(normalDist_dataset)
new_std = np.std(normalDist_dataset)
normal_curve1 = stats.norm.pdf(normalDist_dataset, new_mean, new_std)
normal_curve2 = (1/(new_std *np.sqrt(2*np.pi))) * (np.exp(-(bins - new_mean)**2 / (2 * new_std**2)))
plt.plot(normalDist_dataset, normal_curve1, linewidth=4, linestyle='dashed')
plt.plot(bins, normal_curve2, linewidth=4, color='y')
The result shows that the two curves I get are very different from each other.
My guess is that it has something to do with the bins, or that pdf behaves differently than the usual formula. I have used the same mean and standard deviation for both plots. So how do I change my code to match what stats.norm.pdf is doing?
I don't know yet which curve is correct.
The plot function simply connects the dots with line segments, and your bins do not contain enough dots to show a smooth curve. A possible solution:
....
normal_curve1 = stats.norm.pdf(normalDist_dataset, new_mean, new_std)
bins = normalDist_dataset # Add this line
normal_curve2 = (1/(new_std *np.sqrt(2*np.pi))) * (np.exp(-(bins - new_mean)**2 / (2 * new_std**2)))
....
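Another option (a sketch reusing new_mean and new_std from above) is to evaluate the pdf on a dense, evenly spaced grid rather than on the data points themselves:
x = np.linspace(min(normalDist_dataset), max(normalDist_dataset), 500)
plt.plot(x, stats.norm.pdf(x, new_mean, new_std), linewidth=2)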
I have a distribution:
This one looks pretty Gaussian, and we also can't reject the idea, with such a high p-value from the KS test.
BUT, the test distribution is actually also a generated one with a finite sample size, not the CDF itself, as you'll notice in the code. So that's kind of cheating, compared to using the CDF for a smooth Gaussian function.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test
data = np.concatenate((d1,d2))
xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))
# lets try the normal distribution first
m, s = stats.norm.fit(data) # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color="lightgrey", density=True, bins=50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it
# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))
plt.show()
But I can't figure out how to use the default one-sample KS test, that is
KS_D, KS_p = stats.kstest(data, "norm"),
as it always returns a p-value of 0, i.e. my Gaussian data must be in the wrong format.
How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid usage, or more incorrect than testing against the continuous CDF for the distribution?
"norm" uses a normal distribution that defaults to be zero-mean, with standard deviation 1 [ref]. Your data have values m and s for that, which are quite different. It is telling you they are very different from this standard reference distribution.
You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:
data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")
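Alternatively, kstest accepts the distribution's parameters through its args argument, which avoids normalizing by hand; a minimal sketch reusing the fitted m and s:
KS_D, KS_p = stats.kstest(data, "norm", args=(m, s))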
I want to add some random noise to a 100-bin signal that I am simulating in Python, to make it more realistic.
On a basic level, my first thought was to go bin by bin and just generate a random number between a certain range and add or subtract this from the signal.
I was hoping (as this is Python) that there might be a more intelligent way to do this via numpy or something. (I suppose that, ideally, a number drawn from a Gaussian distribution and added to each bin would be better.)
Thank you in advance for any replies.
I'm just at the stage of planning my code, so I don't have anything to show. I was just thinking that there might be a more sophisticated way of generating the noise.
In terms of output, if I had 10 bins with the following values:
Bin 1: 1
Bin 2: 4
Bin 3: 9
Bin 4: 16
Bin 5: 25
Bin 6: 25
Bin 7: 16
Bin 8: 9
Bin 9: 4
Bin 10: 1
I just wondered if there was a pre-defined function that could add noise to give me something like:
Bin 1: 1.13
Bin 2: 4.21
Bin 3: 8.79
Bin 4: 16.08
Bin 5: 24.97
Bin 6: 25.14
Bin 7: 16.22
Bin 8: 8.90
Bin 9: 4.02
Bin 10: 0.91
If not, I will just go bin by bin and add a number selected from a Gaussian distribution to each one.
Thank you.
It's actually a signal from a radio telescope that I am simulating. I want to be able to eventually choose the signal to noise ratio of my simulation.
You can generate a noise array and add it to your signal:
import numpy as np
noise = np.random.normal(0,1,100)
# 0 is the mean of the normal distribution you are choosing from
# 1 is the standard deviation of the normal distribution
# 100 is the number of elements you get in array noise
For those trying to make the connection between SNR and a normal random variable generated by numpy:
SNR = P_signal / P_noise    [1]

where it's important to keep in mind that P is average power. Or in dB:

SNR_dB = 10 * log10(P_signal / P_noise)    [2]
In this case, we already have a signal and we want to generate noise to give us a desired SNR.
While noise can come in different flavors depending on what you are modeling, a good start (especially for this radio telescope example) is Additive White Gaussian Noise (AWGN). As stated in the previous answers, to model AWGN you need to add a zero-mean Gaussian random variable to your original signal. The variance of that random variable will affect the average noise power.
For a Gaussian random variable X, the average power E[X^2], also known as the second moment, is

E[X^2] = mu^2 + sigma^2    [3]

So for white noise, mu = 0, and the average power is then equal to the variance sigma^2.
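A quick numerical check of [3] for the zero-mean case, as a sketch with an arbitrary sigma:
import numpy as np

sigma = 2.0
x = np.random.normal(0, sigma, 1_000_000)
print(np.mean(x ** 2))  # close to sigma**2 = 4.0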
When modeling this in Python, you can either:
1. Calculate variance based on a desired SNR and a set of existing measurements, which would work if you expect your measurements to have fairly consistent amplitude values.
2. Alternatively, you could set noise power to a known level to match something like receiver noise. Receiver noise could be measured by pointing the telescope into free space and calculating average power.
Either way, it's important to make sure that you add noise to your signal and take averages in the linear space and not in dB units.
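To see why that order of operations matters, here is a small sketch showing that averaging dB values does not give the average power:
import numpy as np

p_watts = np.array([1.0, 100.0])
print(10 * np.log10(np.mean(p_watts)))  # ~17.0 dB: convert the average power
print(np.mean(10 * np.log10(p_watts)))  # 10.0 dB: wrong, averaged in dB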
Here's some code to generate a signal and plot voltage, power in Watts, and power in dB:
# Signal Generation
# matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(1, 100, 1000)
x_volts = 10*np.sin(t/(2*np.pi))
plt.subplot(3,1,1)
plt.plot(t, x_volts)
plt.title('Signal')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
plt.show()
x_watts = x_volts ** 2
plt.subplot(3,1,2)
plt.plot(t, x_watts)
plt.title('Signal Power')
plt.ylabel('Power (W)')
plt.xlabel('Time (s)')
plt.show()
x_db = 10 * np.log10(x_watts)
plt.subplot(3,1,3)
plt.plot(t, x_db)
plt.title('Signal Power in dB')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
Here's an example for adding AWGN based on a desired SNR:
# Adding noise using target SNR
# Set a target SNR
target_snr_db = 20
# Calculate signal power and convert to dB
sig_avg_watts = np.mean(x_watts)
sig_avg_db = 10 * np.log10(sig_avg_watts)
# Calculate noise according to [2] then convert to watts
noise_avg_db = sig_avg_db - target_snr_db
noise_avg_watts = 10 ** (noise_avg_db / 10)
# Generate a sample of white noise
mean_noise = 0
noise_volts = np.random.normal(mean_noise, np.sqrt(noise_avg_watts), len(x_watts))
# Noise up the original signal
y_volts = x_volts + noise_volts
# Plot signal with noise
plt.subplot(2,1,1)
plt.plot(t, y_volts)
plt.title('Signal with noise')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
plt.show()
# Plot in dB
y_watts = y_volts ** 2
y_db = 10 * np.log10(y_watts)
plt.subplot(2,1,2)
plt.plot(t, y_db)
plt.title('Signal with noise (dB)')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
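As a sanity check (a sketch reusing the arrays above), the measured SNR should land near the 20 dB target, up to sampling noise:
noise_avg_watts_measured = np.mean(noise_volts ** 2)
print(10 * np.log10(sig_avg_watts / noise_avg_watts_measured))  # ~20 dB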
And here's an example for adding AWGN based on a known noise power:
# Adding noise using a target noise power
# Set a target channel noise power to something very noisy
target_noise_db = 10
# Convert to linear Watt units
target_noise_watts = 10 ** (target_noise_db / 10)
# Generate noise samples
mean_noise = 0
noise_volts = np.random.normal(mean_noise, np.sqrt(target_noise_watts), len(x_watts))
# Noise up the original signal (again) and plot
y_volts = x_volts + noise_volts
# Plot signal with noise
plt.subplot(2,1,1)
plt.plot(t, y_volts)
plt.title('Signal with noise')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
plt.show()
# Plot in dB
y_watts = y_volts ** 2
y_db = 10 * np.log10(y_watts)
plt.subplot(2,1,2)
plt.plot(t, y_db)
plt.title('Signal with noise (dB)')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
... And for those who - like me - are very early in their numpy learning curve,
import numpy as np
pure = np.linspace(-1, 1, 100)
noise = np.random.normal(0, 1, 100)
signal = pure + noise
For those who want to add noise to a multi-dimensional dataset loaded within a pandas dataframe or even a numpy ndarray, here's an example:
import pandas as pd
# create a sample dataset with dimension (2,2)
# in your case you need to replace this with
# clean_signal = pd.read_csv("your_data.csv")
clean_signal = pd.DataFrame([[1,2],[3,4]], columns=list('AB'), dtype=float)
print(clean_signal)
"""
print output:
A B
0 1.0 2.0
1 3.0 4.0
"""
import numpy as np
mu, sigma = 0, 0.1
# creating a noise with the same dimension as the dataset (2,2)
noise = np.random.normal(mu, sigma, [2,2])
print(noise)
"""
print output:
array([[-0.11114313, 0.25927152],
[ 0.06701506, -0.09364186]])
"""
signal = clean_signal + noise
print(signal)
"""
print output:
A B
0 0.888857 2.259272
1 3.067015 3.906358
"""
AWGN, similar to the MATLAB function:
import math
import numpy as np

def awgn(sinal):
    regsnr = 54
    sigpower = sum([math.pow(abs(sinal[i]), 2) for i in range(len(sinal))])
    sigpower = sigpower / len(sinal)
    noisepower = sigpower / (math.pow(10, regsnr / 10))
    # zero-mean white Gaussian noise scaled to the computed noise power
    noise = math.sqrt(noisepower) * np.random.normal(0, 1, size=len(sinal))
    return noise
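A minimal usage sketch, assuming the function above; note that it returns only the noise vector, so the caller adds it to the signal:
import numpy as np

t = np.linspace(0, 1, 1000)
sinal = np.sin(2 * np.pi * 5 * t)  # hypothetical test signal
noisy = sinal + awgn(sinal)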
In real life you want to simulate a signal with white noise, by adding to your signal random points drawn from a normal (Gaussian) distribution. If we are speaking about a device whose sensitivity is given in unit/SQRT(Hz), then you need to derive the standard deviation of your points from it. Here is a function "white_noise" that does this for you; the rest of the code is a demonstration and a check that it does what it should.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

def white_noise(rho, sr, n, mu=0):
    """
    parameters:
    rho - spectral noise density, unit/SQRT(Hz)
    sr - sample rate
    n - number of points
    mu - mean value, optional

    returns:
    n points of a noise signal with spectral noise density rho
    """
    sigma = rho * np.sqrt(sr/2)
    noise = np.random.normal(mu, sigma, n)
    return noise
rho = 1
sr = 1000
n = 1000
period = n/sr
time = np.linspace(0, period, n)
signal_pure = 100*np.sin(2*np.pi*13*time)
noise = white_noise(rho, sr, n)
signal_with_noise = signal_pure + noise
f, psd = signal.periodogram(signal_with_noise, sr)
print("Mean spectral noise density = ",np.sqrt(np.mean(psd[50:])), "arb.u/SQRT(Hz)")
plt.plot(time, signal_with_noise)
plt.plot(time, signal_pure)
plt.xlabel("time (s)")
plt.ylabel("signal (arb.u.)")
plt.show()
plt.semilogy(f[1:], np.sqrt(psd[1:]))
plt.xlabel("frequency (Hz)")
plt.ylabel("psd (arb.u./SQRT(Hz))")
#plt.axvline(13, ls="dashed", color="g")
plt.axhline(rho, ls="dashed", color="r")
plt.show()
Awesome answers from Akavall and Noel (that's what worked for me). Also, I saw some comments about different distributions. A solution I also tried was to run a test on my variable and find which distribution it was closer to.
numpy.random has different distributions that can be used, as shown in the numpy.random documentation.
As an example, using a different distribution (adapted from Noel's answer):
import numpy as np
pure = np.linspace(-1, 1, 100)
noise = np.random.lognormal(0, 1, 100)
signal = pure + noise
print(pure[:10])
print(signal[:10])
I hope this can help someone looking for this specific branch from the original question.
You can try this:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-5.0, 5.0, 0.1)
y = np.power(x, 2)
noise = 2 * np.random.normal(size=x.size)
ydata = y + noise

plt.plot(x, ydata, 'bo')
plt.plot(x, y, 'r')
plt.ylabel('y data')
plt.xlabel('x data')
plt.show()
Awesome answers above. I recently needed to generate simulated data, and this is what I ended up using. Sharing in case it's helpful to others as well:
import logging
import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("DataSimulator")

def generate_simulated_data(add_anomalies: bool = True, random_state: int = 42):
    rnd_state = np.random.RandomState(random_state)
    time = np.linspace(0, 200, num=2000)
    pure = 20 * np.sin(time / (2 * np.pi))

    # concatenate on the second axis; this will allow us to mix different
    # data distributions
    data = np.c_[pure]
    mu = np.mean(data)
    sd = np.std(data)
    logger.info(f"Data shape : {data.shape}. mu: {mu} with sd: {sd}")

    data_df = pd.DataFrame(data, columns=['Value'])
    data_df['Index'] = data_df.index.values

    # Adding gaussian jitter
    jitter = 0.3 * rnd_state.normal(mu, sd, size=data_df.shape[0])
    data_df['with_jitter'] = data_df['Value'] + jitter

    indexes_further_away = None
    if add_anomalies:
        # As per the 68-95-99.7 rule (also known as the empirical rule),
        # mu +- 2*sd covers 95.4% of the dataset. Since anomalies are
        # considered to be rare and typically within 5-10% of the data,
        # this filtering technique might work for us
        # (https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
        indexes_further_away = np.where(np.abs(data_df['with_jitter']) > (mu + 2 * sd))[0]
        logger.info(f"Number of points further away : {len(indexes_further_away)}. Indexes: {indexes_further_away}")

        # Generate a point uniformly and embed it into the dataset
        random = rnd_state.uniform(0, 5, 1)
        data_df.loc[indexes_further_away, 'with_jitter'] += random * data_df.loc[indexes_further_away, 'with_jitter']

    return data_df, indexes_further_away
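A minimal usage sketch for the function above:
data_df, anomaly_indexes = generate_simulated_data(add_anomalies=True)
print(data_df.head())
print(f"Anomalous indexes: {anomaly_indexes}")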