How to truncate a numpy/scipy exponential distribution in an efficient way?

I'm currently building a neuroscience experiment. Basically, a stimulus is presented for 3 seconds every x seconds (x = inter-trial interval). I would like x to be rather short (mean = 2.5) and unpredictable.
My idea is to draw random samples from an exponential distribution truncated at 1 (lower bound) and 10 (upper bound). I would like the resulting bounded exponential distribution to have an expected mean of 2.5. How could I do that in an efficient way?

There are two ways to do this:
The first is to generate an exponentially distributed random variable and then limit the values to the interval (1, 10).
In [14]:
import matplotlib.pyplot as plt
import scipy.stats as ss
Lambda = 2.5  # in SciPy's parameterization, scale equals the expected mean of the exponential distribution
Size = 1000
trc_ex_rv = ss.expon.rvs(scale=Lambda, size=Size)
trc_ex_rv = trc_ex_rv[(trc_ex_rv>1)&(trc_ex_rv<10)]
In [15]:
plt.hist(trc_ex_rv)
plt.xlim(0, 12)
Out[15]:
(0, 12)
In [16]:
trc_ex_rv
Out[16]:
array([...]) #a lot of numbers
Of course, the problem is that you are not going to get exactly the number of random numbers you asked for (defined by Size here).
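If you do need exactly Size samples from this approach, a simple (if wasteful) workaround is to oversample and redraw until enough values survive the filter; a minimal sketch (trunc_exp_rejection is a hypothetical helper name):
import numpy as np
import scipy.stats as ss

def trunc_exp_rejection(low, high, scale, size):
    """Rejection sampling: keep drawing until `size` values land in (low, high)."""
    out = np.empty(0)
    while out.size < size:
        batch = ss.expon.rvs(scale=scale, size=2 * size)  # oversample each round
        out = np.concatenate([out, batch[(batch > low) & (batch < high)]])
    return out[:size]
The inverse-transform method below avoids the wasted draws entirely.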
The other way to do it is to use inverse transform sampling, which gives you exactly the number of replicates specified:
In [17]:
import numpy as np
def trunc_exp_rv(low, high, scale, size):
    rnd_cdf = np.random.uniform(ss.expon.cdf(x=low, scale=scale),
                                ss.expon.cdf(x=high, scale=scale),
                                size=size)
    return ss.expon.ppf(q=rnd_cdf, scale=scale)
In [18]:
plt.hist(trunc_exp_rv(1, 10, Lambda, Size))
plt.xlim(0, 12)
Out[18]:
(0, 12)
If you want the resulting bounded distribution to have an expected mean of a given value, say 2.5, you need to solve for the scale parameter that yields that expected mean.
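For reference, the objective passed to fmin below is built from the closed-form mean of an exponential with rate L truncated to [a, b] (here a = low, b = high), which the np.diff calls evaluate as a ratio of differences:

E[X | a < X < b] = (exp(-a*L)*(a*L+1) - exp(-b*L)*(b*L+1)) / (L * (exp(-a*L) - exp(-b*L)))

fmin minimizes the squared difference between this expression and ept_mean, and solve_for_l returns the corresponding scale 1/L.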
import scipy.optimize as so

def solve_for_l(low, high, ept_mean):
    A = np.array([low, high])
    return 1 / so.fmin(lambda L: ((np.diff(np.exp(-A*L)*(A*L+1)/L) /
                                   np.diff(np.exp(-A*L))) - ept_mean)**2,
                       x0=0.5,
                       full_output=False, disp=False)

def F(low, high, ept_mean, size):
    return trunc_exp_rv(low, high,
                        solve_for_l(low, high, ept_mean),
                        size)
rv_data = F(1, 10, 2.5, int(1e5))  # size must be an integer
plt.hist(rv_data, bins=50)
plt.xlim(0, 12)
print(rv_data.mean())
Result:
2.50386617882

In addition to @CT Zhu's great answer, it appears that scipy now has a truncated exponential distribution built in, scipy.stats.truncexpon. In its standard form, the shape parameter b is the upper truncation point:
from scipy.stats import truncexpon
b = 10  # upper truncation point (standard form: support is [0, b])
r = truncexpon.rvs(b, size=1000)
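To match the original question's bounds of 1 and 10, loc shifts the lower bound and b is the interval width measured in units of scale, so the support becomes [loc, loc + b*scale]. A sketch (the scale value of about 1.53 is an assumption, roughly what solve_for_l above returns for an expected mean of 2.5):
from scipy.stats import truncexpon

low, high = 1.0, 10.0
scale = 1.53                  # assumed: approximately solves for mean 2.5
b = (high - low) / scale      # support: [loc, loc + b*scale] = [1, 10]
r = truncexpon.rvs(b, loc=low, scale=scale, size=1000)
print(r.min(), r.max(), r.mean())  # values in [1, 10], mean near 2.5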

Related

Write a random number generator that, based on uniformly distributed numbers between 0 and 1, samples from a Lévy distribution?

I'm completely new to Python. Could someone show me how I can write a random number generator that samples from the Lévy distribution? I've written the function for the distribution, but I'm confused about how to proceed further!
I want to use the random numbers generated by this distribution to simulate a 2D random walk.
I'm aware that from scipy.stats I can use the levy class, but I want to write the sampler myself.
import numpy as np
import matplotlib.pyplot as plt

# Levy distribution
"""
f(x) = 1/(2*pi*x^3)^(1/2) * exp(-1/(2x))
"""
def levy(x):
    return 1 / np.sqrt(2*np.pi*x**3) * np.exp(-1/(2*x))

N = 50
foo = levy(N)
@pjs's code looks OK to me, but there is a discrepancy between his code and what SciPy thinks about Lévy: basically, the sampling doesn't match the PDF.
Code, Python 3.8 Windows 10 x64
import numpy as np
from scipy.stats import levy
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(312345)

# Arguments
#   u: a uniform[0,1) random number
#   c: scale parameter for Levy distribution (defaults to 1)
#   mu: location parameter (offset) for Levy (defaults to 0)
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (2.0 * (norm.ppf(1.0 - u))**2)

fig, ax = plt.subplots()
rnge = (0, 20.0)
x = np.linspace(rnge[0], rnge[1], 1001)

N = 200000
q = np.empty(N)
for k in range(0, N):
    u = rng.random()
    q[k] = my_levy(u)

nrm = levy.cdf(rnge[1])
ax.plot(x, levy.pdf(x)/nrm, 'r-', lw=5, alpha=0.6, label='levy pdf')
ax.hist(q, bins=100, range=rnge, density=True, alpha=0.2)
plt.show()
This produces a graph showing the discrepancy between the histogram and the PDF (image omitted).
UPDATE
Well, I tried using a home-made PDF; same output, same problem.
# replace levy.pdf(x) with PDF(x)
def PDF(x):
    return np.where(x <= 0.0, 0.0, 1.0 / np.sqrt(2*np.pi*x**3) * np.exp(-1./(2.*x)))
UPDATE II
After applying @pjs's corrected sampling routine, the sampling and the PDF align perfectly. New graph (image omitted).
Here's a straightforward implementation of the generating algorithm for the Levy distribution found on Wikipedia:
import random
from scipy.stats import norm

# Arguments
#   u: a uniform[0,1) random number
#   c: scale parameter for Levy distribution (defaults to 1)
#   mu: location parameter (offset) for Levy (defaults to 0)
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (2 * norm.ppf(1.0 - u)**2)

# Generate a handful of samples
for _ in range(10):
    print(my_levy(random.random()))
I don't normally use Python, so please suggest improvements.
ADDENDUM
Kudos to Severin Pappadeux for the work in his response. I had already noted that a simpler answer would be to take the inverse of a squared Gaussian, but Advaita had asked for an explicit function of U ~ Uniform(0,1), so I didn't pursue that. It turns out that I should have. The Wikipedia article mentions that relationship, but without the scale factor of 2 in the denominator. When I take the 2 out of the implementation of Wikipedia's generating algorithm, i.e. change the implementation to
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (norm.ppf(1.0 - u)**2)
the resulting histogram aligns beautifully with the normalized plot of the pdf. (Note - I've now also edited the incorrect Wikipedia entry to correct the formula.)
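For completeness, the inverse-of-a-squared-Gaussian observation also gives a fully vectorized sampler without going through explicit uniforms; a sketch (it relies on the fact that 1/Z**2 is Levy(0, 1) when Z is standard normal; levy_samples is a hypothetical name):
import numpy as np

rng = np.random.default_rng(312345)

def levy_samples(size, c=1.0, mu=0.0):
    # 1/Z**2 with Z ~ N(0, 1) is Levy-distributed with location 0 and scale 1
    z = rng.standard_normal(size)
    return mu + c / z**2

q = levy_samples(200000)  # a histogram of q matches the normalized levy.pdf plots above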

Generating synthetic data with Gaussian distribution

Problem
In a paper I am reading now, the authors define a new metric and claim some advantages over previous metrics. They verify their claim with some synthetic data, which looks like the following (image omitted).
The implementation of their metric is pretty straightforward. However, I am not sure how they created this kind of synthetic data.
What I Have Done
This looks like a Gaussian where x is restricted to certain intervals. I tried the following code but did not get anything similar to the plot presented in the paper.
import numpy as np

def generate_gaussian(size=1000, lb=-0.1, up=0.1):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 0.3
background_neg = base + 0.7
Now I am wondering if the authors created these data using some special distribution (other than Gaussian) that I do not know about.
NumPy has numpy.random.normal, which draws random samples from a normal (Gaussian) distribution.
import numpy as np
import matplotlib.pyplot as plt
sigma = 0.05
s0 = np.random.normal(0.2, sigma, 5000)
s1 = np.random.normal(0.6, sigma, 5000)
plt.hist(s0, 300, density=True, color="b")
plt.hist(s1, 300, density=True, color="r")
plt.xlim(0, 1)
plt.show()
You can change the values of mu (mean) and sigma to alter the distribution:
mu = 0.55
sigma = 0.1
dist = np.random.normal(mu, sigma, 5000)
You have cut off the data at +/- 0.1. A normalised Gaussian distribution only 'looks Gaussian' if you look over a range of approximately +/- 3. Try this:
import numpy as np

def generate_gaussian(size=1000, lb=-3, up=3):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 5
background_neg = base + 15
You can use scipy.stats.norm.
import libraries
>>> from scipy.stats import norm
>>> from matplotlib import pyplot
plot
>>> pyplot.hist(norm.rvs(loc=1, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_1')
>>> pyplot.hist(norm.rvs(loc=5, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_2')
>>> pyplot.legend()
>>> pyplot.show()
Clarification:
A normal distribution is defined by its mean (loc, the center of the distribution) and its standard deviation (scale, a measure of the distribution's dispersion or width). rvs generates size random samples from the desired normal distribution. For example, the following code generates 4 random elements from a normal distribution (mean = 1, SD = 1).
>>> norm.rvs(loc=1, scale=1, size=4)
array([ 0.52154255, 1.40873701, 1.55959291, -0.01730568])

How can I generate random variables using np.random.zipf for a given range of values?

I have a given price range, and I have used random.uniform to generate random results from it. How can I introduce np.random.zipf to do the same?
I have tried the following:
a = np.random.zipf((randint(1, 6000000)), size=None)
print(a)
But it seems to provide no return value, and the code keeps running without terminating.
order_total_price_range1 = round(random.uniform(850, 560000), 5)
order_total_price_range2 = round(random.uniform(850, 560000), 5)
I expected to get max and min values from the Zipf distribution, but currently no results are returned.
While @RobinNicole is right about the Zipf distribution, you could simulate a truncated Zipf using discrete sampling, along these lines:
import numpy as np
from matplotlib import pyplot as plt

def Zipf(a: np.float64, min: np.uint64, max: np.uint64, size=None):
    """
    Generate Zipf-like random variables,
    but in inclusive [min...max] interval
    """
    if min == 0:
        raise ZeroDivisionError("min must be positive")
    v = np.arange(min, max+1)  # values to sample
    p = 1.0 / np.power(v, a)   # probabilities
    p /= np.sum(p)             # normalized
    return np.random.choice(v, size=size, replace=True, p=p)

min = np.uint64(3)
max = np.uint64(8)
q = Zipf(1.2, min, max, 10000)
print(q)

h, bins = np.histogram(q, bins=int(max - min + 1), range=(min - 0.5, max + 0.5))
print(h)
print(bins)

plt.hist(q, bins=bins)
plt.title("Zipf")
plt.show()
This will produce a graph like the following (image omitted).
You cannot tune the parameters of the Zipf law to restrict it to a given interval as you can with the uniform distribution. The reason is that the Zipf distribution is always defined on the set of all positive integers, independently of its parameters.
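A quick way to see that unbounded support in practice (a sketch; the seed and parameter are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
s = rng.zipf(2.0, size=1_000_000)
print(s.min(), s.max())  # min is 1; the max varies between runs and can be very large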

How can I fit this sinusoidal wave with my current data?

I have some data I gathered analyzing the change of acceleration with respect to time. But when I wrote the code below to fit a sinusoidal wave, this was the result. Is this because I don't have enough data, or am I doing something wrong here?
Here you can see my graphs (images omitted):
- Measurements plotted directly (no fit)
- Fit with horizontal and vertical shift (curve_fit)
- Increased data by linspace
- Manually manipulated amplitude
Edit: I increased the data size by using the linspace function and plotting it, but I am not sure why the amplitude doesn't match. Is it because there is very little data to analyze? (I was able to adjust the amplitude manually, but I don't understand why curve_fit can't do it.)
The code I am using for the fit:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit as cf  # cf was undefined in the excerpt; assumed to be curve_fit

def model(x, a, b):
    return a * np.sin(b * x)

param, param_cov = cf(model, time, z_values)  # time, z_values: the measured data
array_x = np.linspace(800, 1400, 1000)

fig = plt.figure(figsize=(9, 4))
plt.scatter(time, z_values, color="#3333cc", label="Data")
plt.plot(array_x, model(array_x, *param), label="Sin Fit")  # the original passed 4 parameters to a 2-parameter model
I'd use an FFT to get a first guess at the parameters, as this sort of thing is highly non-linear and curve_fit is unlikely to get very far otherwise. The reason for using an FFT is to get an initial idea of the frequency involved, not much more. 3Blue1Brown has a great video on FFTs if you've not seen it.
I used WebPlotDigitizer to get your data out of your plots, then pulled it into Python and made sure it looked OK with:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('sinfit2.csv')
print(df.head())
giving me:
x y
0 809.3 0.3
1 820.0 0.3
2 830.3 19.6
3 839.9 19.6
4 849.6 0.4
I started by doing a basic FFT with NumPy (SciPy has the full fftpack which is more complete, but not needed here):
import numpy as np
from numpy.fft import fft
d = fft(df.y)
plt.plot(np.abs(d)[:len(d)//2], '.')
The np.abs(d) is because you get complex numbers back containing both phase and amplitude, and [:len(d)//2] is because (for real-valued input) the output is symmetric about the midpoint, i.e. d[5] == d[-5].
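(Equivalently, np.fft.rfft computes only that non-redundant half directly for real input; a small sketch:)
# rfft returns only the non-negative-frequency coefficients for real input
d_half = np.fft.rfft(df.y)
plt.plot(np.abs(d_half), '.')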
This says the largest component was 18. I tried plotting this by hand and it looked OK:
x = np.linspace(0, np.pi * 2, len(df))
plt.plot(df.x, df.y, '.-', lw=1)
plt.plot(df.x, np.sin(x * 18) * 10 + 10)
I'm multiplying by 10 and adding 10 because the range of a sine is (-1, +1) and we need to take it to (0, 20).
Next I passed these to curve_fit with a simplified model to help it along:
from scipy.optimize import curve_fit

def model(x, a, b):
    return np.sin(x * a + b) * 10 + 10

(a, b), cov = curve_fit(model, x, df.y, [18, 0])
Again, I'm hardcoding the * 10 + 10 to get the range to match your data, which gives me a = 17.8 and b = 2.97.
Finally, I plot the function sampled at a higher frequency to make sure all is OK:
plt.plot(df.x, df.y)
plt.plot(
    np.linspace(810, 1400, 501),
    model(np.linspace(0, np.pi*2, 501), a, b)
)
giving me a plot that seems to look OK (image omitted). Note you might want to change these parameters so they fit your original x; also note that my df.x starts at 810, so I might have missed the first point.

binning data in python with scipy/numpy

Is there a more efficient way to take an average of an array in prespecified bins? For example, I have an array of numbers and an array corresponding to bin start and end positions in that array, and I want to just take the mean in those bins. I have code below that does it, but I am wondering how it can be cut down and improved. Thanks.
from scipy import *
from numpy import *

def get_bin_mean(a, b_start, b_end):
    ind_upper = nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[nonzero(a_upper < b_end)[0]]
    mean_val = mean(a_range)
    return mean_val

data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []

for n in range(0, len(bins)-1):
    b_start = bins[n]
    b_end = bins[n+1]
    binned_data.append(get_bin_mean(data, b_start, b_end))

print(binned_data)
It's probably faster and easier to use numpy.digitize():
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram():
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :)
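A quick way to check (a sketch using timeit; the numbers will vary with machine and array size):
import timeit

setup = '''
import numpy
data = numpy.random.random(100000)
bins = numpy.linspace(0, 1, 10)
'''

digitize_way = '''
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
'''

histogram_way = '''
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])
'''

print('digitize: ', timeit.timeit(digitize_way, setup=setup, number=100))
print('histogram:', timeit.timeit(histogram_way, setup=setup, number=100))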
The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.
For the same example as in the previous answers, the Scipy solution would be
import numpy as np
from scipy.stats import binned_statistic
data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]
Not sure why this thread got necroed, but here is a 2014 approved answer, which should be far faster:
import numpy as np

data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins+1, True).astype(int)  # np.int is deprecated; plain int works
counts = np.diff(slices)
mean = np.add.reduceat(data, slices[:-1]) / counts
print(mean)
The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:
import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))
This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)
I would add, also to answer the question "find mean bin values using histogram2d python", that SciPy also has a function specially designed to compute a two-dimensional binned statistic for one or more sets of data:
import numpy as np
from scipy.stats import binned_statistic_2d
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic
The function scipy.stats.binned_statistic_dd is a generalization of this function for higher-dimensional datasets.
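A minimal sketch of the d-dimensional version (the data here is hypothetical, with three coordinates per sample):
import numpy as np
from scipy.stats import binned_statistic_dd

xyz = np.random.rand(100, 3)     # sample locations in 3-D
values = np.random.rand(100)     # value attached to each sample
result = binned_statistic_dd(xyz, values, statistic='mean', bins=5)
print(result.statistic.shape)    # (5, 5, 5): mean of `values` in each 3-D bin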
Another alternative is to use ufunc.at. This method applies the desired operation in place at the specified indices.
We can get the bin position for each datapoint using the searchsorted method.
Then we can use np.add.at to increment the histogram by 1 at the position given by bin_indexes each time an index occurs in bin_indexes.
np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)
histogram = np.zeros_like(bins)
bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)
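The snippet above builds the bin counts; to get the bin means the question actually asks for, the same in-place trick can accumulate per-bin sums, which are then divided by the counts (a sketch continuing from the variables above):
sums = np.zeros_like(bins)
np.add.at(sums, bin_indexes, data)   # accumulate the sum of values per bin
with np.errstate(invalid='ignore'):  # empty bins produce nan
    bin_means = sums / histogram
print(bin_means)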
