I have some noisy data that can contain anywhere from 0 to n gaussian shapes, and I am trying to implement an algorithm that takes the highest data points and fits a gaussian to them, as per the following 'scheme':
New attempt, steps:
1. Fit a spline through all data points
2. Get the first derivative of the spline function
3. Get the two data points (left/right) where f'(x) ≈ 0 around the data point with maximum intensity
4. Fit a gaussian through the data points returned from step 3
4a. Plot the gaussian (stopping at the baseline) in the PDF
5. Calculate the area under the gaussian curve
6. Calculate the area under the raw data points
7. Calculate the percentage of the total area explained by the gaussian area
I have implemented this concept using the following code (minimal working example):
#! /usr/bin/env python
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.optimize import curve_fit
from scipy.signal import argrelextrema
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data = [(9.60380153195,187214),(9.62028167623,181023),(9.63676350256,174588),(9.65324602212,169389),(9.66972824591,166921),(9.68621215187,167597),(9.70269675106,170838),(9.71918105436,175816),(9.73566703995,181552),(9.75215371878,186978),(9.76864010158,191718),(9.78512816681,194473),(9.80161692526,194169),(9.81810538757,191203),(9.83459553243,186603),(9.85108637051,180273),(9.86757691233,171996),(9.88406913682,163653),(9.90056205454,156032),(9.91705467586,149928),(9.93354897998,145410),(9.95004397733,141818),(9.96653867816,139042),(9.98303506191,137546),(9.99953213889,138724)]
data2 = [(9.60476933166,163571),(9.62125990879,156662),(9.63775225872,150535),(9.65424539203,146960),(9.67073831905,146794),(9.68723301904,149326),(9.70372850238,152616),(9.72022377931,155420),(9.73672082933,156151),(9.75321866271,154633),(9.76971628954,151549),(9.78621568961,148298),(9.80271587303,146333),(9.81921584976,146734),(9.83571759987,150351),(9.85222013334,156612),(9.86872245996,164192),(9.88522656011,171199),(9.90173144362,175697),(9.91823612015,176867),(9.93474257034,175029),(9.95124980389,171762),(9.96775683032,168449),(9.98426563055,165026)]
def gaussFunction(x, *p):
    """Gaussian profile: A * exp(-(x - mu)**2 / (2 * sigma**2))."""
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

def quantify(data):
    """Spline the data, locate the highest segment between extrema of f'(x),
    fit a gaussian to that segment and plot the result."""
    backGround = 105000 # Normally this is dynamically determined but this value is fine for testing on the provided data
    time, intensity = zip(*data)
    x_data = np.array(time)
    y_data = np.array(intensity)
    newX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))
    f = InterpolatedUnivariateSpline(x_data, y_data)
    fPrime = f.derivative()
    newY = f(newX)
    newPrimeY = fPrime(newX)
    maxm = argrelextrema(newPrimeY, np.greater)
    minm = argrelextrema(newPrimeY, np.less)
    breaks = maxm[0].tolist() + minm[0].tolist()
    maxPoint = 0
    for index, j in enumerate(breaks):
        try:
            if max(newY[breaks[index]:breaks[index+1]]) > maxPoint:
                maxPoint = max(newY[breaks[index]:breaks[index+1]])
                xData = newX[breaks[index]:breaks[index+1]]
                yData = [x - backGround for x in newY[breaks[index]:breaks[index+1]]]
        except IndexError:
            pass # the last break has no right-hand neighbour
    # Gaussian fit on main points
    newGaussX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))
    p0 = [np.max(yData), xData[np.argmax(yData)], 0.1]
    try:
        coeff, var_matrix = curve_fit(gaussFunction, xData, yData, p0)
        newGaussY = gaussFunction(newGaussX, *coeff)
        newGaussY = [x + backGround for x in newGaussY]
        # Generate plot for visual confirmation
        fig = plt.figure()
        ax = fig.add_subplot(111)
        plt.plot(x_data, y_data, 'b*')
        plt.plot((newX[0], newX[-1]), (backGround, backGround), 'red')
        plt.plot(newX, newY, color='blue', linestyle='dashed')
        plt.plot(newGaussX, newGaussY, color='green', linestyle='dashed')
        plt.title("Test")
        plt.xlabel("rt [m]")
        plt.ylabel("intensity [au]")
        plt.savefig("Test.pdf", bbox_inches="tight")
        plt.close(fig)
    except RuntimeError:
        pass # curve_fit failed to converge

# Call the test
#quantify(data)
quantify(data2)
where normally the background (red line in the pictures below) is dynamically determined, but for the sake of this example I have set it to a fixed number. The problem I have is that for some data it works really well:
Corresponding f'(x):
However, for some other data it fails horrendously:
Corresponding f'(x):
Therefore, I would like to hear some suggestions or ideas on why this happens and on potential approaches to fix it. I have included the data that is shown in the pictures (in case anyone wants to try it):
The error lay in the following bit:
breaks = maxm[0].tolist() + minm[0].tolist()
for index,j in enumerate(breaks):
The breaks list now contains both the maxima and minima, but they are not sorted by time, which for the poor fit yielded the break points 9.78, 9.62 and 9.86.
The program would then examine the data from 9.78 to 9.62 and from 9.62 to 9.86, and since 9.62 to 9.86 contained the highest-intensity data point, it produced the fit shown in the second graph.
The fix was rather simple: just sort the breaks, as follows:
breaks = maxm[0].tolist() + minm[0].tolist()
breaks = sorted(breaks)
for index,j in enumerate(breaks):
The program then yielded a fit more closely resembling what I would expect:
I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there is an equal number of elements in each bin. I tried using matplotlib's hist function, but that only lets me decide the number of bins. How do I go about plotting this? (A normal plot and hist work, but they are not what is needed.)
I have 10000 entries. Only 200 have values greater than 0, and those lie between 0.0005 and 0.2. The distribution isn't even: only one element has the value 0.2, whereas approximately 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width with an equal number of elements.
The task does not make much sense to me, but the following code does what I understood to be the thing to do.
I also think the last lines of the code are what you really want: using different bin widths to improve visualization (without targeting an equal number of samples within each bin)! I used astroML's hist with bins='blocks' (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin borders: every (N_VALUES // N_BINS)-th sorted sample becomes an edge
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, bin_edges, patches = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is what, I think, you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
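As a side note, a more compact way to get the equal-count bin borders computed above is np.quantile (a sketch assuming NumPy 1.15+, reusing prob_array and N_BINS from the code above):
import numpy as np
# N_BINS + 1 evenly spaced quantiles give N_BINS bins with (nearly) equal counts
bin_borders = np.quantile(prob_array, np.linspace(0, 1, N_BINS + 1))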
I would like to know if there is an envelope function in Python that gives the same result as this:
I have already tried an envelope function in Python, but it gives this result, which doesn't correspond to what I want.
Though you don't mention exactly what function you use, it seems like you are using two different kinds of envelopes.
The way you call envelope in MATLAB, the relevant description is:
[yupper,ylower] = envelope(x) returns the upper and lower envelopes of
the input sequence, x, as the magnitude of its analytic signal. The
analytic signal of x is found using the discrete Fourier transform as
implemented in hilbert. The function initially removes the mean of x
and adds it back after computing the envelopes. If x is a matrix, then
envelope operates independently over each column of x.
Based on this, I suppose you would be looking for a way to get the Hilbert transform in Python. An example of this can be found here:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import hilbert, chirp
duration = 1.0
fs = 400.0
samples = int(fs*duration)
t = np.arange(samples) / fs
signal = chirp(t, 20.0, t[-1], 100.0)
signal *= (1.0 + 0.5 * np.sin(2.0*np.pi*3.0*t) )
analytic_signal = hilbert(signal)
amplitude_envelope = np.abs(analytic_signal)
instantaneous_phase = np.unwrap(np.angle(analytic_signal))
instantaneous_frequency = np.diff(instantaneous_phase) / (2.0*np.pi) * fs
fig = plt.figure()
ax0 = fig.add_subplot(211)
ax0.plot(t, signal, label='signal')
ax0.plot(t, amplitude_envelope, label='envelope')
ax0.set_xlabel("time in seconds")
ax0.legend()
ax1 = fig.add_subplot(212)
ax1.plot(t[1:], instantaneous_frequency)
ax1.set_xlabel("time in seconds")
ax1.set_ylim(0.0, 120.0)
Resulting in:
Sometimes I use obspy.signal.filter.envelope(data_array), but that only gives you the upper line in your example.
ObsPy is a very useful package for dealing with seismograms.
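A minimal sketch of that approach (assuming ObsPy is installed; the sine signal is just a stand-in for your data_array):
import numpy as np
from obspy.signal.filter import envelope

data_array = np.sin(np.linspace(0, 10 * np.pi, 500))  # stand-in for your signal
upper = envelope(data_array)  # magnitude of the analytic signal: upper envelope only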
I am trying to write a simple python code for a plot of intensity vs wavelength for a given temperature, T=200K.
So far I have this...
import scipy as sp
import math
import matplotlib.pyplot as plt
import numpy as np
pi = np.pi
h = 6.626e-34
c = 3.0e+8
k = 1.38e-23
def planck(wav, T):
    a = 2.0*h*pi*c**2
    b = h*c/(wav*k*T)
    intensity = a/ ( (wav**5)*(math.e**b - 1.0) )
    return intensity
I don't know how to define the wavelength (wav) and thus produce the plot of Planck's formula. Any help would be appreciated.
Here's a basic plot. To plot using plt.plot(x, y, fmt) you need two arrays x and y of the same size, where x is the x coordinate of each point to plot and y is the y coordinate, and fmt is a string describing how to plot the numbers.
So all you need to do is create an evenly spaced array of wavelengths (an np.array which I named wavelengths). This can be done with arange(start, end, spacing) which will create an array from start to end (not inclusive) spaced at spacing apart.
Then compute the intensity using your function at each of those points in the array (which will be stored in another np.array), and then call plt.plot to plot them. Note that numpy lets you do mathematical operations on arrays quickly in a vectorized form, which is computationally efficient.
import matplotlib.pyplot as plt
import numpy as np
h = 6.626e-34
c = 3.0e+8
k = 1.38e-23
def planck(wav, T):
    a = 2.0*h*c**2
    b = h*c/(wav*k*T)
    intensity = a/ ( (wav**5) * (np.exp(b) - 1.0) )
    return intensity
# generate x-axis from 1 nm to 3 micrometers in 1 nm increments,
# starting at 1 nm to avoid wav = 0, which would result in division by zero.
wavelengths = np.arange(1e-9, 3e-6, 1e-9)
# intensity at 4000K, 5000K, 6000K, 7000K
intensity4000 = planck(wavelengths, 4000.)
intensity5000 = planck(wavelengths, 5000.)
intensity6000 = planck(wavelengths, 6000.)
intensity7000 = planck(wavelengths, 7000.)
plt.plot(wavelengths*1e9, intensity4000, 'r-')
# plot intensity4000 versus wavelength in nm as a red line
plt.plot(wavelengths*1e9, intensity5000, 'g-') # 5000K green line
plt.plot(wavelengths*1e9, intensity6000, 'b-') # 6000K blue line
plt.plot(wavelengths*1e9, intensity7000, 'k-') # 7000K black line
# show the plot
plt.show()
And you see:
You probably will want to clean up the axes labels, add a legend, plot the intensity at multiple temperatures on the same plot, among other things. Consult the relevant matplotlib documentation.
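For instance, a sketch of those cleanups, reusing the arrays from the code above (the axis labels and units are my guesses, following the formula's convention):
plt.plot(wavelengths*1e9, intensity4000, 'r-', label='4000 K')
plt.plot(wavelengths*1e9, intensity7000, 'k-', label='7000 K')
plt.xlabel('Wavelength (nm)')
plt.ylabel('Spectral radiance [W sr^-1 m^-3]')
plt.legend()
plt.show()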
You may also want to use the RADIS library, which allows you to plot the Planck function against wavelength, or against frequency / wavenumber, if needed!
from radis import sPlanck
sPlanck(wavelength_min=135, wavelength_max=3000, T=4000).plot()
sPlanck(wavelength_min=135, wavelength_max=3000, T=5000).plot(nfig='same')
sPlanck(wavelength_min=135, wavelength_max=3000, T=6000).plot(nfig='same')
sPlanck(wavelength_min=135, wavelength_max=3000, T=7000).plot(nfig='same')
Just want to point out that there seems to be an equivalent of what the OP wants to do in astropy:
https://docs.astropy.org/en/stable/api/astropy.modeling.physical_models.BlackBody.html
Unfortunately, it is not yet very clear to me how to switch between the wavelength-based and the frequency-based expressions.
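For what it's worth, here is a sketch of how evaluating BlackBody at wavelengths might look (my assumption, based on the linked docs, is that astropy converts wavelength Quantities internally; the default output is per unit frequency):
import numpy as np
from astropy import units as u
from astropy.modeling.models import BlackBody

bb = BlackBody(temperature=5000 * u.K)
wavelengths = np.linspace(100, 3000, 500) * u.nm
radiance = bb(wavelengths)  # spectral radiance, erg / (cm2 Hz s sr) by default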
I want to add some random noise to some 100 bin signal that I am simulating in Python - to make it more realistic.
On a basic level, my first thought was to go bin by bin and just generate a random number within a certain range and add or subtract it from the signal.
I was hoping (as this is python) that there might be a more intelligent way to do this via numpy or something. (I suppose that ideally a number drawn from a gaussian distribution and added to each bin would be better too.)
Thank you in advance of any replies.
I'm just at the stage of planning my code, so I don't have anything to show. I was just thinking that there might be a more sophisticated way of generating the noise.
In terms of output, if I had 10 bins with the following values:
Bin 1: 1
Bin 2: 4
Bin 3: 9
Bin 4: 16
Bin 5: 25
Bin 6: 25
Bin 7: 16
Bin 8: 9
Bin 9: 4
Bin 10: 1
I just wondered if there was a pre-defined function that could add noise to give me something like:
Bin 1: 1.13
Bin 2: 4.21
Bin 3: 8.79
Bin 4: 16.08
Bin 5: 24.97
Bin 6: 25.14
Bin 7: 16.22
Bin 8: 8.90
Bin 9: 4.02
Bin 10: 0.91
If not, I will just go bin-by-bin and add a number selected from a gaussian distribution to each one.
Thank you.
It's actually a signal from a radio telescope that I am simulating. I want to be able to eventually choose the signal to noise ratio of my simulation.
You can generate a noise array and add it to your signal:
import numpy as np
noise = np.random.normal(0,1,100)
# 0 is the mean of the normal distribution you are choosing from
# 1 is the standard deviation of the normal distribution
# 100 is the number of elements you get in array noise
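To complete the picture, applying it might look like this (a sketch; the sine is just a stand-in for your 100-bin signal):
import numpy as np

signal = np.sin(np.linspace(0, 2 * np.pi, 100))  # stand-in 100-bin signal
noise = np.random.normal(0, 1, 100)
noisy_signal = signal + noise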
For those trying to make the connection between SNR and a normal random variable generated by numpy:
SNR = P_signal / P_noise [1]
where it's important to keep in mind that P is average power.
Or in dB:
SNR_dB = 10 * log10(SNR) = P_signal,dB - P_noise,dB [2]
In this case, we already have a signal and we want to generate noise to give us a desired SNR.
While noise can come in different flavors depending on what you are modeling, a good start (especially for this radio telescope example) is Additive White Gaussian Noise (AWGN). As stated in the previous answers, to model AWGN you need to add a zero-mean gaussian random variable to your original signal. The variance of that random variable will affect the average noise power.
For a Gaussian random variable X, the average power E[X^2], also known as the second moment, is
E[X^2] = mu^2 + sigma^2 [3]
So for white noise, mu = 0, and the average power is then equal to the variance sigma^2.
When modeling this in python, you can either
1. Calculate the variance based on a desired SNR and a set of existing measurements, which works if you expect your measurements to have fairly consistent amplitude values.
2. Alternatively, set the noise power to a known level to match something like receiver noise. Receiver noise could be measured by pointing the telescope into free space and calculating the average power.
Either way, it's important to make sure that you add noise to your signal and take averages in the linear space and not in dB units.
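As a quick illustration of why averaging must happen in linear units, consider two made-up power samples (a sketch, not part of the measurement code below):
import numpy as np

x_watts = np.array([1.0, 100.0])                # 0 dB and 20 dB
mean_watts = np.mean(x_watts)                   # 50.5 W, the correct average power
wrong_db = np.mean(10 * np.log10(x_watts))      # 10 dB, i.e. 10 W: an underestimate
print(10 * np.log10(mean_watts), wrong_db)      # ~17.03 dB vs 10.0 dB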
Here's some code to generate a signal and plot voltage, power in Watts, and power in dB:
# Signal Generation
# %matplotlib inline (when running in a Jupyter notebook)
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(1, 100, 1000)
x_volts = 10*np.sin(t/(2*np.pi))
plt.subplot(3,1,1)
plt.plot(t, x_volts)
plt.title('Signal')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
x_watts = x_volts ** 2
plt.subplot(3,1,2)
plt.plot(t, x_watts)
plt.title('Signal Power')
plt.ylabel('Power (W)')
plt.xlabel('Time (s)')
x_db = 10 * np.log10(x_watts)
plt.subplot(3,1,3)
plt.plot(t, x_db)
plt.title('Signal Power in dB')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
Here's an example for adding AWGN based on a desired SNR:
# Adding noise using target SNR
# Set a target SNR
target_snr_db = 20
# Calculate signal power and convert to dB
sig_avg_watts = np.mean(x_watts)
sig_avg_db = 10 * np.log10(sig_avg_watts)
# Calculate noise according to [2] then convert to watts
noise_avg_db = sig_avg_db - target_snr_db
noise_avg_watts = 10 ** (noise_avg_db / 10)
# Generate a sample of white noise
mean_noise = 0
noise_volts = np.random.normal(mean_noise, np.sqrt(noise_avg_watts), len(x_watts))
# Noise up the original signal
y_volts = x_volts + noise_volts
# Plot signal with noise
plt.subplot(2,1,1)
plt.plot(t, y_volts)
plt.title('Signal with noise')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
# Plot in dB
y_watts = y_volts ** 2
y_db = 10 * np.log10(y_watts)
plt.subplot(2,1,2)
plt.plot(t, y_db)
plt.title('Signal with noise (dB)')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
And here's an example for adding AWGN based on a known noise power:
# Adding noise using a target noise power
# Set a target channel noise power to something very noisy
target_noise_db = 10
# Convert to linear Watt units
target_noise_watts = 10 ** (target_noise_db / 10)
# Generate noise samples
mean_noise = 0
noise_volts = np.random.normal(mean_noise, np.sqrt(target_noise_watts), len(x_watts))
# Noise up the original signal (again) and plot
y_volts = x_volts + noise_volts
# Plot signal with noise
plt.subplot(2,1,1)
plt.plot(t, y_volts)
plt.title('Signal with noise')
plt.ylabel('Voltage (V)')
plt.xlabel('Time (s)')
# Plot in dB
y_watts = y_volts ** 2
y_db = 10 * np.log10(y_watts)
plt.subplot(2,1,2)
plt.plot(t, y_db)
plt.title('Signal with noise (dB)')
plt.ylabel('Power (dB)')
plt.xlabel('Time (s)')
plt.show()
... And for those who - like me - are very early in their numpy learning curve,
import numpy as np
pure = np.linspace(-1, 1, 100)
noise = np.random.normal(0, 1, 100)
signal = pure + noise
For those who want to add noise to a multi-dimensional dataset loaded within a pandas dataframe or even a numpy ndarray, here's an example:
import pandas as pd
# create a sample dataset with dimension (2,2)
# in your case you need to replace this with
# clean_signal = pd.read_csv("your_data.csv")
clean_signal = pd.DataFrame([[1,2],[3,4]], columns=list('AB'), dtype=float)
print(clean_signal)
"""
print output:
A B
0 1.0 2.0
1 3.0 4.0
"""
import numpy as np
mu, sigma = 0, 0.1
# creating a noise with the same dimension as the dataset (2,2)
noise = np.random.normal(mu, sigma, [2,2])
print(noise)
"""
print output:
array([[-0.11114313, 0.25927152],
[ 0.06701506, -0.09364186]])
"""
signal = clean_signal + noise
print(signal)
"""
print output:
A B
0 0.888857 2.259272
1 3.067015 3.906358
"""
AWGN similar to the MATLAB function
import math
import numpy as np

def awgn(sinal):
    regsnr = 54  # target SNR in dB
    # average signal power
    sigpower = sum([math.pow(abs(sinal[i]), 2) for i in range(len(sinal))])
    sigpower = sigpower / len(sinal)
    # noise power needed for the requested SNR
    noisepower = sigpower / (math.pow(10, regsnr / 10))
    # white gaussian noise scaled to that power (the original used a uniform
    # draw here, but AWGN calls for a normal distribution)
    noise = math.sqrt(noisepower) * np.random.normal(0, 1, size=len(sinal))
    return noise
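Since the function returns only the noise, usage might look like this (a sketch with a stand-in signal):
import numpy as np

sinal = np.sin(np.linspace(0, 2 * np.pi, 1000))  # stand-in signal
noisy = sinal + awgn(sinal)                      # add the generated noise yourself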
In real life you may wish to simulate a signal with white noise. You should add to your signal random points that have a normal (Gaussian) distribution. If we speak about a device that has a sensitivity given in unit/SQRT(Hz), then you need to derive the standard deviation of your points from it. Here I give the function "white_noise" that does this for you; the rest of the code is a demonstration and a check that it does what it should.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
"""
parameters:
rhp - spectral noise density unit/SQRT(Hz)
sr - sample rate
n - no of points
mu - mean value, optional
returns:
n points of noise signal with spectral noise density of rho
"""
def white_noise(rho, sr, n, mu=0):
sigma = rho * np.sqrt(sr/2)
noise = np.random.normal(mu, sigma, n)
return noise
rho = 1
sr = 1000
n = 1000
period = n/sr
time = np.linspace(0, period, n)
signal_pure = 100*np.sin(2*np.pi*13*time)
noise = white_noise(rho, sr, n)
signal_with_noise = signal_pure + noise
f, psd = signal.periodogram(signal_with_noise, sr)
print("Mean spectral noise density = ",np.sqrt(np.mean(psd[50:])), "arb.u/SQRT(Hz)")
plt.plot(time, signal_with_noise)
plt.plot(time, signal_pure)
plt.xlabel("time (s)")
plt.ylabel("signal (arb.u.)")
plt.show()
plt.semilogy(f[1:], np.sqrt(psd[1:]))
plt.xlabel("frequency (Hz)")
plt.ylabel("psd (arb.u./SQRT(Hz))")
#plt.axvline(13, ls="dashed", color="g")
plt.axhline(rho, ls="dashed", color="r")
plt.show()
Awesome answers from Akavall and Noel (that's what worked for me). Also, I saw some comments about different distributions; a solution that I also tried was to run a test over my variable and find which distribution it was closest to.
numpy.random has different distributions that can be used, as can be seen in its documentation:
documentation numpy.random
As an example from a different distribution (example referenced from Noel's answer):
import numpy as np
pure = np.linspace(-1, 1, 100)
noise = np.random.lognormal(0, 1, 100)
signal = pure + noise
print(pure[:10])
print(signal[:10])
I hope this can help someone looking for this specific branch from the original question.
You can try this:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5.0, 5.0, 0.1)
y = np.power(x,2)
noise = 2 * np.random.normal(size=x.size)
ydata = y + noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('y data')
plt.xlabel('x data')
plt.show()
Awesome answers above. I recently had a need to generate simulated data and this is what I landed on using. Sharing in case it's helpful to others as well:
import logging
import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("DataSimulator")

def generate_simulated_data(add_anomalies: bool = True, random_state: int = 42):
    rnd_state = np.random.RandomState(random_state)
    time = np.linspace(0, 200, num=2000)
    pure = 20 * np.sin(time / (2 * np.pi))

    # Concatenate on the second axis; this allows mixing different data distributions
    data = np.c_[pure]
    mu = np.mean(data)
    sd = np.std(data)
    logger.info(f"Data shape: {data.shape}. mu: {mu} with sd: {sd}")
    data_df = pd.DataFrame(data, columns=['Value'])
    data_df['Index'] = data_df.index.values

    # Add gaussian jitter
    jitter = 0.3 * rnd_state.normal(mu, sd, size=data_df.shape[0])
    data_df['with_jitter'] = data_df['Value'] + jitter

    indexes_further_away = None
    if add_anomalies:
        # As per the 68-95-99.7 rule (also known as the empirical rule),
        # mu +- 2*sd covers 95.4% of the dataset. Since anomalies are
        # considered to be rare and typically within 5-10% of the data,
        # this filtering technique might work for us
        # (https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
        indexes_further_away = np.where(np.abs(data_df['with_jitter']) > (mu + 2 * sd))[0]
        logger.info(f"Number of points further away: {len(indexes_further_away)}. Indexes: {indexes_further_away}")
        # Generate a point uniformly (a scalar) and embed it into the dataset
        random = rnd_state.uniform(0, 5)
        data_df.loc[indexes_further_away, 'with_jitter'] += random * data_df.loc[indexes_further_away, 'with_jitter']

    return data_df, indexes_further_away
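A quick usage sketch for the function above (the plotting is my addition):
import matplotlib.pyplot as plt

data_df, anomaly_indexes = generate_simulated_data(add_anomalies=True)
data_df.plot(x='Index', y=['Value', 'with_jitter'])
plt.show()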