Generating synthetic data with Gaussian distribution - python

Problem
In a paper I am reading, the authors define a new metric and claim some advantages over previous metrics. They verify their claim with synthetic data, which looks like the following:
The implementation of their metric is pretty straightforward. However, I am not sure how they create this kind of synthetic data.
What I Have Done
This looks like a Gaussian where x is restricted to certain intervals. I tried the following code but did not get anything similar to the plot presented in the paper.
import numpy as np
def generate_gaussian(size=1000, lb=-0.1, up=0.1):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 0.3
background_neg = base + 0.7
Now I am wondering whether the authors created these data using some special distribution (other than Gaussian) that I do not know about.

NumPy has numpy.random.normal, which draws random samples from a normal (Gaussian) distribution.
import numpy as np
import matplotlib.pyplot as plt
sigma = 0.05
s0 = np.random.normal(0.2, sigma, 5000)
s1 = np.random.normal(0.6, sigma, 5000)
plt.hist(s0, 300, density=True, color="b")
plt.hist(s1, 300, density=True, color="r")
plt.xlim(0, 1)
plt.show()
You can change the values of mu (mean) and sigma (standard deviation) to alter the distributions:
mu = 0.55
sigma = 0.1
dist = np.random.normal(mu, sigma, 5000)

You have cut off the data at +/- 0.1. A standard (zero-mean, unit-variance) Gaussian distribution only 'looks Gaussian' if you look over a range of approximately +/- 3. Try this:
import numpy as np
def generate_gaussian(size=1000, lb=-3, up=3):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 5
background_neg = base + 15
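To see why the original +/- 0.1 window did not look Gaussian, here is a minimal sketch (variable names are illustrative and not taken from the paper). Only about 8% of standard normal samples fall inside that window, so 5000 draws yield roughly 400 points rather than the requested 1000, and within such a narrow window the density is almost flat.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1234)
draws = rng.standard_normal(5000)

# Fraction of a standard normal that lies within +/- 0.1 (theory vs. sample)
expected_fraction = norm.cdf(0.1) - norm.cdf(-0.1)
kept = draws[(draws >= -0.1) & (draws <= 0.1)]
observed_fraction = kept.size / draws.size

# Ratio of the density at the window edge to the density at the centre;
# a value close to 1 means the kept samples look nearly uniform.
flatness = norm.pdf(0.1) / norm.pdf(0.0)

print(f"expected fraction kept: {expected_fraction:.4f}")
print(f"observed fraction kept: {observed_fraction:.4f} ({kept.size} samples)")
print(f"edge/centre density ratio: {flatness:.4f}")
```

Widening the window to +/- 3, as suggested above, keeps almost all of the draws and preserves the bell shape.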

You can use scipy.stats.norm.
Import libraries:
>>> from scipy.stats import norm
>>> from matplotlib import pyplot
Plot:
>>> pyplot.hist(norm.rvs(loc=1, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_1')
>>> pyplot.hist(norm.rvs(loc=5, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_2')
>>> pyplot.legend()
>>> pyplot.show()
Clarification:
A normal distribution is defined by its mean (loc, the distribution center) and standard deviation (scale, a measure of the distribution's dispersion or width). rvs generates random samples from the desired normal distribution; the number of samples is given by size. For example, the following code generates 4 random samples from a normal distribution (mean = 1, SD = 1).
>>> norm.rvs(loc=1, scale=1, size=4)
array([ 0.52154255, 1.40873701, 1.55959291, -0.01730568])

Related

fitting Poisson distribution to data in python

I have a data distribution that I want to fit a Poisson distribution to. My data looks like this:
I tried to fit it with:
mu = herd_size["COW_NUM"].mean()
ax=sns.displot(data=herd_size["COW_NUM"], kde=True)
ax.set(xlabel='Size',title='Herd size distribution & poisson distribution')
plt.plot(np.arange(0, 2000, 80), [st.poisson.pmf(np.arange(i, i+80), mu).sum()*len(herd_size["COW_NUM"])
                                  for i in np.arange(0, 2000, 80)], color='red')
# every bin contains approximately 80 observations
plt.show()
but I get something that is not on the same scale:
UPDATE
I tried to apply a negative binomial distribution with the code:
n=len(herd_size["COW_NUM"])
p =herd_size["COW_NUM"].mean()/(herd_size["COW_NUM"].mean()+2)
ax=sns.displot(data=herd_size["COW_NUM"], kde=True)
ax.set(xlabel='Size',title='Herd size distribution & geometry distribution')
plt.plot(np.arange(0, 2000, 80), [st.nbinom.pmf(np.arange(i, i+80), n,p).sum()*len(herd_size["COW_NUM"])
                                  for i in np.arange(0, 2000, 80)], color='red')
# every bin contains approximately 80 observations
plt.show()
but I got this:
For what you need to plot, it might be easier to provide the bins when making your histogram:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import poisson
herd_size = pd.DataFrame({'COW_NUM':np.random.poisson(200,2000)})
binwidth = 10
xstart = 150
xend = 280
bins = np.arange(xstart,xend,binwidth)
o = sns.histplot(data=herd_size["COW_NUM"], kde=True,bins = bins)
Then calculate your mean and total number:
mu = herd_size["COW_NUM"].mean()
n = len(herd_size)
The expected frequency in each bin is the total count times the difference between the cdf at the right and left edges of the bin:
plt.plot(bins + binwidth/2 , n*(poisson.cdf(bins+binwidth,mu) - poisson.cdf(bins,mu)), color='red')
Your data is overdispersed: for a Poisson you would not expect the data to be so spread out. What you need to do is use a gamma or a negative binomial to fit it, for example:
from scipy.stats import nbinom
herd_size = pd.DataFrame({'COW_NUM':nbinom.rvs(n=2,p=0.1,loc=240,size=2000)})
binwidth = 50
xstart = 0
xend = 2000
bins = np.arange(xstart,xend,binwidth)
herd_size = pd.DataFrame({'COW_NUM':nbinom.rvs(n=1,p=0.004,size=2000)})
Var = herd_size["COW_NUM"].var()
mu = herd_size["COW_NUM"].mean()
p = (mu/Var)
r = mu**2 / (Var-mu)
n = len(herd_size)
o = sns.histplot(data=herd_size["COW_NUM"], kde=True,bins=bins)
plt.plot(bins + binwidth/2,
         n*(nbinom.cdf(bins+binwidth,r,p) - nbinom.cdf(bins,r,p)),
         color='red')
Your plot is (at least approximately) correct; the problem is with modeling your data as Poisson. As lambda grows large, the Poisson looks more and more like a normal distribution (see the plot on Wikipedia). A Poisson distribution has its variance equal to its mean, so with a mean of around 240 you have a standard deviation of about 15.5. The net result is that outcomes for a Poisson(240) should overwhelmingly fall between 210 and 270, which is what your red plot shows. Try fitting a different distribution to your data.
I just spotted StupidWolf's answer. Other than using a mean of 200 rather than 240, his histogram shows the same behavior described above.
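A quick way to check for the overdispersion described in both answers is to compare the sample variance with the sample mean: for Poisson data the ratio should be close to 1. The sketch below uses simulated counts (not the asker's herd sizes), and `dispersion_index` is a hypothetical helper name introduced only for this example.

```python
import numpy as np

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, above 1 if overdispersed."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(0)
poisson_counts = rng.poisson(240, size=5000)
nbinom_counts = rng.negative_binomial(1, 0.004, size=5000)

print("Poisson  variance/mean:", dispersion_index(poisson_counts))
print("NegBinom variance/mean:", dispersion_index(nbinom_counts))

# Method-of-moments estimates in the (r, p) parametrisation used above;
# they are only meaningful when the variance exceeds the mean.
mean = nbinom_counts.mean()
var = nbinom_counts.var(ddof=1)
p_hat = mean / var
r_hat = mean**2 / (var - mean)
print("estimated r, p:", r_hat, p_hat)
```

A ratio far above 1 on the real data would support switching from the Poisson to a negative binomial fit.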

Write a random number generator that, based on uniformly distributed numbers between 0 and 1, samples from a Lévy-distribution?

I'm completely new to Python. Could someone show me how can I write a random number generator which samples from the Levy Distribution? I've written the function for the distribution, but I'm confused about how to proceed further!
I want to use the random numbers generated from this distribution to simulate a 2D random walk.
I'm aware that from scipy.stats I can use the Levy class, but I want to write the sampler myself.
import numpy as np
import matplotlib.pyplot as plt
# Levy distribution
"""
f(x) = 1/(2*pi*x^3)^(1/2) * exp(-1/(2x))
"""
def levy(x):
    return 1 / np.sqrt(2*np.pi*x**3) * np.exp(-1/(2*x))
N = 50
foo = levy(N)
@pjs's code looks OK to me, but there is a discrepancy between his code and what SciPy thinks about Levy: basically, the sampling is different from the PDF.
Code, Python 3.8 Windows 10 x64
import numpy as np
from scipy.stats import levy
from scipy.stats import norm
import matplotlib.pyplot as plt
rng = np.random.default_rng(312345)
# Arguments
# u: a uniform[0,1) random number
# c: scale parameter for Levy distribution (defaults to 1)
# mu: location parameter (offset) for Levy (defaults to 0)
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (2.0 * (norm.ppf(1.0 - u))**2)
fig, ax = plt.subplots()
rnge=(0, 20.0)
x = np.linspace(rnge[0], rnge[1], 1001)
N = 200000
q = np.empty(N)
for k in range(0, N):
    u = rng.random()
    q[k] = my_levy(u)
nrm = levy.cdf(rnge[1])
ax.plot(x, levy.pdf(x)/nrm, 'r-', lw=5, alpha=0.6, label='levy pdf')
ax.hist(q, bins=100, range=rnge, density=True, alpha=0.2)
plt.show()
This produces a graph in which the histogram does not match the PDF.
UPDATE
Well, I tried using a home-made PDF: same output, same problem.
# replace levy.pdf(x) with PDF(x)
def PDF(x):
    return np.where(x <= 0.0, 0.0, 1.0 / np.sqrt(2*np.pi*x**3) * np.exp(-1./(2.*x)))
UPDATE II
After applying @pjs's corrected sampling routine, the sampling and the PDF are aligned perfectly.
Here's a straightforward implementation of the generating algorithm for the Levy distribution found on Wikipedia:
import random
from scipy.stats import norm
# Arguments
# u: a uniform[0,1) random number
# c: scale parameter for Levy distribution (defaults to 1)
# mu: location parameter (offset) for Levy (defaults to 0)
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (2 * norm.ppf(1.0 - u)**2)
# Generate a handful of samples
for _ in range(10):
    print(my_levy(random.random()))
I don't normally use Python, so please suggest improvements.
ADDENDUM
Kudos to Severin Pappadeux for the work in his response. I had already noted that a simpler answer would be to take the inverse of a squared Gaussian, but Advaita had asked for an explicit function of U ~ Uniform(0,1), so I didn't pursue that. It turns out that I should have. The Wikipedia article mentions that approach, but without the scale factor of 2 in the denominator. When I take the 2 out of the implementation of Wikipedia's generating algorithm, i.e. change the implementation to
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (norm.ppf(1.0 - u)**2)
the resulting histogram aligns beautifully with the normalized plot of the pdf. (Note - I've now also edited the incorrect Wikipedia entry to correct the formula.)
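For reference, here is a vectorised sketch of inverse-transform sampling for the Lévy distribution. It assumes the CDF F(x) = 2(1 - Φ(sqrt(c/(x - mu)))), which inverts to x = mu + c / Φ⁻¹(1 - u/2)²; the function name `levy_from_uniform` is made up for this example. Using 1 - u instead of 1 - u/2, as in the corrected code above, also yields Lévy-distributed values because squaring removes the sign of the normal quantile, but that mapping is no longer the monotone inverse CDF. The result is compared against scipy.stats.levy rather than taken on trust.

```python
import numpy as np
from scipy.stats import levy, norm

def levy_from_uniform(u, c=1.0, mu=0.0):
    """Map uniform(0, 1) numbers to Levy(mu, c) samples via the inverse CDF."""
    u = np.asarray(u, dtype=float)
    return mu + c / norm.ppf(1.0 - u / 2.0) ** 2

rng = np.random.default_rng(312345)
u = rng.random(200_000)
samples = levy_from_uniform(u)

# The Levy mean is infinite, so compare medians instead of means.
print("sample median:", np.median(samples))
print("scipy median :", levy.ppf(0.5))

# Round trip: the scipy CDF of the transformed values should return u.
check = np.array([0.1, 0.3, 0.5, 0.9])
round_trip = levy.cdf(levy_from_uniform(check))
print(round_trip)
```

Because the whole array of uniforms is transformed at once, this avoids the per-sample Python loop used in the plotting code above.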

Why are the random samples drawn from my custom distribution not following the pdf?

I have created a custom distribution by subclassing scipy's rv_continuous class. I am trying to create the energy distribution of an electron produced by beta decay, given its pdf:
Which I took from: http://hyperphysics.phy-astr.gsu.edu/hbase/Nuclear/beta2.html#c1
I define my distribution:
import numpy as np
from scipy.stats import rv_continuous
import matplotlib.pyplot as plt
class beta_decay(rv_continuous):
    def _pdf(self, x):
        return (22.48949986*np.sqrt(x**2 + 2*x*0.511)*((0.6-x)**2)*(x+0.511))
# create distribution from 0 --> Q value = 0.6
beta = beta_decay(a=0, b= 0.6)
# plot pdf
x = np.linspace(0,0.6)
plt.plot(x, beta.pdf(x))
plt.show()
# random sample the distribution and plot histogram
random = beta.rvs(size =100)
plt.hist(random)
plt.show()
Where x = KE, Q = 0.6, C = 22.48... (found by integrating the above expression between 0 --> Q and setting equal to 1 to normalize), and I disregard the Fermi function F(Z',KEe) in the above eqn.
When I graph the pdf, it looks right:
However, when I try to draw random samples from it using .rvs(), the values they take are massively peaked towards the right-hand side, not under the peak of the pdf as I'd expect:
Ultimately, my code needs to sample the distribution to get the KE of an electron released by beta decay. Why is my histogram so wrong?
I think your PDF is defined incorrectly: it is not normalized. After I normalized it and made a proper histogram, it seems to work fine.
Code (Win10 x64, Anaconda Python 3.7)
#%%
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as integrate
from scipy.stats import rv_continuous
def bd(x):
    return (22.48949986*np.sqrt(x**2 + 2*x*0.511)*((0.6-x)**2)*(x+0.511))
a = 0.0
b = 0.6
norm = integrate.quad(bd, a, b) # normalization integral
print(norm)
class beta_decay(rv_continuous):
    def _pdf(self, x):
        return bd(x)/norm[0]
# create Q distribution in the [0...0.6] interval
beta = beta_decay(a = a, b = b)
# plot pdf
x = np.linspace(a, b)
plt.plot(x, beta.pdf(x))
plt.show()
# sample from pdf
r = beta.rvs(size = 10000)
plt.hist(r, range=(a, b), density=True)
plt.show()
The resulting plots show the pdf and a histogram of the samples.
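When only `_pdf` is defined, `rv_continuous.rvs` has to integrate and invert the pdf numerically for each sample, which can be slow for large sample sizes. As an alternative (not what either poster used), here is a sketch of grid-based inverse-transform sampling. The helper names `beta_shape` and `sample_from_pdf` and the grid size are assumptions made for this example. Because the cumulative sum is renormalised on the grid, the constant in front of the pdf does not matter here.

```python
import numpy as np

def beta_shape(x):
    """Unnormalised beta-decay spectrum from the question (Q = 0.6, m_e = 0.511)."""
    return np.sqrt(x**2 + 2*x*0.511) * (0.6 - x)**2 * (x + 0.511)

def sample_from_pdf(pdf, a, b, size, rng, grid_points=2001):
    """Draw samples on [a, b] by inverting a CDF tabulated on a grid."""
    x = np.linspace(a, b, grid_points)
    y = pdf(x)
    # Trapezoidal cumulative integral, then normalise so that cdf[-1] == 1
    cdf = np.concatenate(([0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))))
    cdf /= cdf[-1]
    # Interpolate the inverse CDF at uniform random numbers
    return np.interp(rng.random(size), cdf, x)

rng = np.random.default_rng(0)
samples = sample_from_pdf(beta_shape, 0.0, 0.6, 20_000, rng)
print("sample mean kinetic energy:", samples.mean())
```

The samples can be passed to `plt.hist(samples, range=(0, 0.6), density=True)` for comparison with the normalized pdf.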

How to properly use the Kolmogorov-Smirnov test in SciPy?

I have a distribution
This one looks pretty gaussian, and we also can't reject the idea with such a high p-value from the KS test.
BUT, the test distribution is actually also a generated one with a finite sample size and not the CDF itself, as you'll notice in the code. So that's kind of cheating, compared to using the CDF for a smooth gaussian function.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
d1 = np.random.normal(loc = 3, scale = 2, size = 1000)
d2 = np.random.normal(loc = 3, scale = 0.5, size = 250) # Vary this to test
data = np.concatenate((d1,d2))
xmin, xmax = min(data), max(data)
lnspc = np.linspace(xmin, xmax, len(data))
# lets try the normal distribution first
m, s = stats.norm.fit(data) # get mean and standard deviation from fit
pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval
plt.hist(data, color = "lightgrey", density = True, bins = 50)
plt.plot(lnspc, pdf_g, color = "black", label="Gaussian") # plot it
# Test how not-gaussian our distribution is by generating a distribution from the fit
test_dist = np.random.normal(m, s, len(data))
KS_D, KS_p = stats.ks_2samp(data, test_dist)
plt.title("D = {0:.2f}, p = {1:.2f}".format(KS_D, KS_p))
plt.show()
But I can't figure out how to use the default KS test, that is
KS_D, KS_p = stats.kstest(data, "norm")
as it always returns a p-value of 0, i.e. my gaussian data must be in the wrong format.
How should I normalize my data to properly use the KS test? And is simulating the comparison distribution a valid approach, or is it less correct than testing against the continuous CDF of the distribution?
"norm" uses a normal distribution that defaults to zero mean and a standard deviation of 1. Your data have values m and s for those, which are quite different. The test is telling you that your data are very different from this standard reference distribution.
You could still use this test to check if the data look Gaussian if you first normalize (haha) your data appropriately:
data_n = (data - m) / s
KS_D, KS_p = stats.kstest(data_n, "norm")
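An equivalent way to do this without rescaling the data is to pass the fitted parameters through the `args` argument of `kstest`. The sketch below uses a simulated mixture like the one in the question. Note that estimating m and s from the same data that is then tested tends to inflate the p-value, so the result is best treated as indicative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d1 = rng.normal(loc=3, scale=2, size=1000)
d2 = rng.normal(loc=3, scale=0.5, size=250)
data = np.concatenate((d1, d2))

m, s = stats.norm.fit(data)

# Option 1: standardise the data and test against the standard normal
res_scaled = stats.kstest((data - m) / s, "norm")
# Option 2: leave the data alone and pass loc and scale to the reference CDF
res_args = stats.kstest(data, "norm", args=(m, s))
# Testing against the default N(0, 1) without either step rejects strongly
res_raw = stats.kstest(data, "norm")

print(res_scaled.statistic, res_scaled.pvalue)
print(res_args.statistic, res_args.pvalue)
print(res_raw.statistic, res_raw.pvalue)
```

Both options compare the same empirical CDF with the same fitted normal, so they give the same statistic.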

adding noise to a signal in python

I want to add some random noise to some 100 bin signal that I am simulating in Python - to make it more realistic.
On a basic level, my first thought was to go bin by bin and just generate a random number between a certain range and add or subtract this from the signal.
I was hoping (as this is Python) that there might be a more intelligent way to do this via numpy or something. (I suppose that ideally a number drawn from a Gaussian distribution and added to each bin would be better also.)
Thank you in advance of any replies.
I'm just at the stage of planning my code, so I don't have anything to show. I was just thinking that there might be a more sophisticated way of generating the noise.
In terms of output, if I had 10 bins with the following values:
Bin 1: 1
Bin 2: 4
Bin 3: 9
Bin 4: 16
Bin 5: 25
Bin 6: 25
Bin 7: 16
Bin 8: 9
Bin 9: 4
Bin 10: 1
I just wondered if there was a pre-defined function that could add noise to give me something like:
Bin 1: 1.13
Bin 2: 4.21
Bin 3: 8.79
Bin 4: 16.08
Bin 5: 24.97
Bin 6: 25.14
Bin 7: 16.22
Bin 8: 8.90
Bin 9: 4.02
Bin 10: 0.91
If not, I will just go bin-by-bin and add a number selected from a gaussian distribution to each one.
Thank you.
It's actually a signal from a radio telescope that I am simulating. I want to be able to eventually choose the signal to noise ratio of my simulation.
You can generate a noise array, and add it to your signal
import numpy as np
noise = np.random.normal(0,1,100)
# 0 is the mean of the normal distribution you are choosing from
# 1 is the standard deviation of the normal distribution
# 100 is the number of elements you get in array noise
For those trying to make the connection between SNR and a normal random variable generated by numpy:
[1] $\mathrm{SNR} = \frac{P_{signal}}{P_{noise}}$, where it's important to keep in mind that P is average power.
Or in dB:
[2] $\mathrm{SNR}_{dB} = P_{signal,dB} - P_{noise,dB}$
In this case, we already have a signal and we want to generate noise to give us a desired SNR.
While noise can come in different flavors depending on what you are modeling, a good start (especially for this radio telescope example) is Additive White Gaussian Noise (AWGN). As stated in the previous answers, to model AWGN you need to add a zero-mean gaussian random variable to your original signal. The variance of that random variable will affect the average noise power.
For a Gaussian random variable X, the average power $E[X^2]$, also known as the second moment, is
[3] $E[X^2] = \mu^2 + \sigma^2$
So for white noise, $\mu = 0$ and the average power is then equal to the variance $\sigma^2$.
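A quick numerical check of the statement that the average power (second moment) of a Gaussian sample equals the squared mean plus the variance, so that zero-mean noise has power equal to its variance. The sample size and the parameter values are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

biased = rng.normal(2.0, 3.0, n)      # mean 2, standard deviation 3
zero_mean = rng.normal(0.0, 3.0, n)   # same spread, zero mean

power_biased = np.mean(biased ** 2)        # theory: 2**2 + 3**2 = 13
power_zero_mean = np.mean(zero_mean ** 2)  # theory: 3**2 = 9

print("average power, mean 2:", power_biased)
print("average power, mean 0:", power_zero_mean)
```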
When modeling this in python, you can either
1. Calculate variance based on a desired SNR and a set of existing measurements, which would work if you expect your measurements to have fairly consistent amplitude values.
2. Alternatively, you could set noise power to a known level to match something like receiver noise. Receiver noise could be measured by pointing the telescope into free space and calculating average power.
Either way, it's important to make sure that you add noise to your signal and take averages in the linear space and not in dB units.
Here's some code to generate a signal and plot voltage, power in Watts, and power in dB:
# Signal Generation
# matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(1, 100, 1000)
x_volts = 10*np.sin(t/(2*np.pi))
plt.subplot(3,1,1)
plt.plot(t, x_volts)
plt.title('Signal')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
plt.show()
x_watts = x_volts ** 2
plt.subplot(3,1,2)
plt.plot(t, x_watts)
plt.title('Signal Power')
plt.ylabel('Power (W)')
plt.xlabel('Time (s)')
plt.show()
x_db = 10 * np.log10(x_watts)
plt.subplot(3,1,3)
plt.plot(t, x_db)
plt.title('Signal Power in dB')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
Here's an example for adding AWGN based on a desired SNR:
# Adding noise using target SNR
# Set a target SNR
target_snr_db = 20
# Calculate signal power and convert to dB
sig_avg_watts = np.mean(x_watts)
sig_avg_db = 10 * np.log10(sig_avg_watts)
# Calculate noise according to [2] then convert to watts
noise_avg_db = sig_avg_db - target_snr_db
noise_avg_watts = 10 ** (noise_avg_db / 10)
# Generate a sample of white noise
mean_noise = 0
noise_volts = np.random.normal(mean_noise, np.sqrt(noise_avg_watts), len(x_watts))
# Noise up the original signal
y_volts = x_volts + noise_volts
# Plot signal with noise
plt.subplot(2,1,1)
plt.plot(t, y_volts)
plt.title('Signal with noise')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
plt.show()
# Plot in dB
y_watts = y_volts ** 2
y_db = 10 * np.log10(y_watts)
plt.subplot(2,1,2)
plt.plot(t, y_db)
plt.title('Signal with noise (dB)')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
And here's an example for adding AWGN based on a known noise power:
# Adding noise using a target noise power
# Set a target channel noise power to something very noisy
target_noise_db = 10
# Convert to linear Watt units
target_noise_watts = 10 ** (target_noise_db / 10)
# Generate noise samples
mean_noise = 0
noise_volts = np.random.normal(mean_noise, np.sqrt(target_noise_watts), len(x_watts))
# Noise up the original signal (again) and plot
y_volts = x_volts + noise_volts
# Plot signal with noise
plt.subplot(2,1,1)
plt.plot(t, y_volts)
plt.title('Signal with noise')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
plt.show()
# Plot in dB
y_watts = y_volts ** 2
y_db = 10 * np.log10(y_watts)
plt.subplot(2,1,2)
plt.plot(t, y_db)
plt.title('Signal with noise')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
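The target-SNR steps above can be wrapped in a small helper and checked by measuring the SNR that was actually achieved. This is a sketch: `add_awgn` is a hypothetical name (not a NumPy function), and it assumes the signal power is estimated from the samples themselves.

```python
import numpy as np

def add_awgn(signal, target_snr_db, rng=None):
    """Return (noisy signal, noise) with zero-mean Gaussian noise at roughly the requested SNR."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)                      # average power, linear units
    noise_power = signal_power / 10 ** (target_snr_db / 10)  # SNR = P_signal / P_noise
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise, noise

t = np.linspace(1, 100, 100_000)
x_volts = 10 * np.sin(t / (2 * np.pi))
y_volts, noise_volts = add_awgn(x_volts, target_snr_db=20, rng=np.random.default_rng(0))

# Averages are taken in linear units and only then converted to dB
measured_snr_db = 10 * np.log10(np.mean(x_volts ** 2) / np.mean(noise_volts ** 2))
print(f"measured SNR: {measured_snr_db:.2f} dB")
```

With fewer samples the measured value scatters more around the target, because the noise power is itself estimated from a finite sample.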
... And for those who - like me - are very early in their numpy learning curve,
import numpy as np
pure = np.linspace(-1, 1, 100)
noise = np.random.normal(0, 1, 100)
signal = pure + noise
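Applied to the 10-bin example from the question, the same idea looks like this. The noise level `sigma = 0.15` is an assumed value chosen only to give deviations of a similar size to those in the question, and the seed is fixed so the result is repeatable.

```python
import numpy as np

bins = np.array([1, 4, 9, 16, 25, 25, 16, 9, 4, 1], dtype=float)

rng = np.random.default_rng(42)   # seeded so the result is reproducible
sigma = 0.15                      # assumed noise level, chosen for illustration
noisy_bins = bins + rng.normal(0.0, sigma, size=bins.shape)

for i, value in enumerate(noisy_bins, start=1):
    print(f"Bin {i}: {value:.2f}")
```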
For those who want to add noise to a multi-dimensional dataset loaded within a pandas dataframe or even a numpy ndarray, here's an example:
import pandas as pd
# create a sample dataset with dimension (2,2)
# in your case you need to replace this with
# clean_signal = pd.read_csv("your_data.csv")
clean_signal = pd.DataFrame([[1,2],[3,4]], columns=list('AB'), dtype=float)
print(clean_signal)
"""
print output:
A B
0 1.0 2.0
1 3.0 4.0
"""
import numpy as np
mu, sigma = 0, 0.1
# creating a noise with the same dimension as the dataset (2,2)
noise = np.random.normal(mu, sigma, [2,2])
print(noise)
"""
print output:
array([[-0.11114313, 0.25927152],
[ 0.06701506, -0.09364186]])
"""
signal = clean_signal + noise
print(signal)
"""
print output:
A B
0 0.888857 2.259272
1 3.067015 3.906358
"""
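AWGN similar to the MATLAB function
import math
import numpy as np
def awgn(sinal):
    regsnr = 54  # required SNR in dB
    # average signal power
    sigpower = sum([math.pow(abs(sinal[i]), 2) for i in range(len(sinal))])
    sigpower = sigpower/len(sinal)
    noisepower = sigpower/(math.pow(10, regsnr/10))
    # zero-mean Gaussian samples scaled to the target noise power
    noise = math.sqrt(noisepower)*np.random.normal(0, 1, size=len(sinal))
    # only the noise is returned; add it to the signal yourself
    return noise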
In real life you often want to simulate a signal with white noise. To do so, add random points with a normal (Gaussian) distribution to your signal. If you are modelling a device whose sensitivity is given in unit/SQRT(Hz), you need to derive the standard deviation of your points from it. Here I give a function "white_noise" that does this for you; the rest of the code is a demonstration and a check that it does what it should.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
"""
parameters:
rho - spectral noise density unit/SQRT(Hz)
sr - sample rate
n - no of points
mu - mean value, optional
returns:
n points of noise signal with spectral noise density of rho
"""
def white_noise(rho, sr, n, mu=0):
    sigma = rho * np.sqrt(sr/2)
    noise = np.random.normal(mu, sigma, n)
    return noise
rho = 1
sr = 1000
n = 1000
period = n/sr
time = np.linspace(0, period, n)
signal_pure = 100*np.sin(2*np.pi*13*time)
noise = white_noise(rho, sr, n)
signal_with_noise = signal_pure + noise
f, psd = signal.periodogram(signal_with_noise, sr)
print("Mean spectral noise density = ",np.sqrt(np.mean(psd[50:])), "arb.u/SQRT(Hz)")
plt.plot(time, signal_with_noise)
plt.plot(time, signal_pure)
plt.xlabel("time (s)")
plt.ylabel("signal (arb.u.)")
plt.show()
plt.semilogy(f[1:], np.sqrt(psd[1:]))
plt.xlabel("frequency (Hz)")
plt.ylabel("psd (arb.u./SQRT(Hz))")
#plt.axvline(13, ls="dashed", color="g")
plt.axhline(rho, ls="dashed", color="r")
plt.show()
Awesome answers from Akavall and Noel (that's what worked for me). Also, I saw some comments about different distributions. Another approach I tried was to run tests on my variable to find which distribution it was closest to.
numpy.random provides different distributions that can be used, as listed in its documentation.
As an example from a different distribution (example referenced from Noel's answer):
import numpy as np
pure = np.linspace(-1, 1, 100)
noise = np.random.lognormal(0, 1, 100)
signal = pure + noise
print(pure[:10])
print(signal[:10])
I hope this can help someone looking for this specific branch from the original question.
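One thing to keep in mind with non-Gaussian noise: unlike normal(0, 1), lognormal(0, 1) samples are always positive and have a mean of exp(mu + sigma**2 / 2), so adding them shifts the whole signal upward. A sketch of how to check this and, if an unbiased signal is wanted, subtract the theoretical mean (that centring step is an assumption about what one might want, not part of the answer above):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
noise = rng.lognormal(mu, sigma, 200_000)

# Mean of a lognormal is exp(mu + sigma**2 / 2), about 1.65 here, not 0
theoretical_mean = np.exp(mu + sigma ** 2 / 2)
print("sample mean     :", noise.mean())
print("theoretical mean:", theoretical_mean)

# Subtracting the theoretical mean gives skewed noise without a systematic offset
pure = np.linspace(-1, 1, 100)
centred_noise = rng.lognormal(mu, sigma, pure.size) - theoretical_mean
signal = pure + centred_noise
```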
You can try this:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5.0, 5.0, 0.1)
y = np.power(x,2)
noise = 2 * np.random.normal(size=x.size)
ydata = y + noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('y data')
plt.xlabel('x data')
plt.show()
Awesome answers above. I recently needed to generate simulated data, and this is what I ended up using. Sharing in case it is helpful to others as well:
import logging
import numpy as np
import pandas as pd
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("DataSimulator")
def generate_simulated_data(add_anomalies: bool = True, random_state: int = 42):
    rnd_state = np.random.RandomState(random_state)
    time = np.linspace(0, 200, num=2000)
    pure = 20*np.sin(time/(2*np.pi))
    # concatenate on the second axis; this will allow us to mix different data
    # distributions
    data = np.c_[pure]
    mu = np.mean(data)
    sd = np.std(data)
    logger.info(f"Data shape : {data.shape}. mu: {mu} with sd: {sd}")
    data_df = pd.DataFrame(data, columns=['Value'])
    data_df['Index'] = data_df.index.values
    # Adding gaussian jitter
    jitter = 0.3*rnd_state.normal(mu, sd, size=data_df.shape[0])
    data_df['with_jitter'] = data_df['Value'] + jitter
    indexes_further_away = None
    if add_anomalies:
        # As per the 68-95-99.7 rule (also known as the empirical rule), mu+-2*sd
        # covers 95.4% of the dataset.
        # Since anomalies are considered to be rare and typically within
        # 5-10% of the data, this filtering technique might work for us
        # (https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
        indexes_further_away = np.where(np.abs(data_df['with_jitter']) > (mu + 2*sd))[0]
        logger.info(f"Number of points further away : "
                    f"{len(indexes_further_away)}. Indexes: {indexes_further_away}")
        # Generate a point uniformly and embed it into the dataset
        random = rnd_state.uniform(0, 5, 1)
        data_df.loc[indexes_further_away, 'with_jitter'] += \
            random*data_df.loc[indexes_further_away, 'with_jitter']
    return data_df, indexes_further_away
