I am trying to simulate the performance of a real-life process. The variables that have been measured historically fall within a fixed interval, so values lower or greater than those bounds are physically impossible.
To simulate the process output, each input variable's historical data was fitted to its best-fit probability distribution (using this approach: Fitting empirical distribution to theoretical ones with Scipy (Python)?).
However, when the resulting theoretical distributions are simulated n times, they do not respect the expected real-life minimum and maximum values. I am thinking of applying a try/except test on each simulation to check whether each simulated value lies within the expected interval, but I am not sure this is the best way to handle it, since the experimental mean and variance would not be achieved.
You can use a boolean mask in NumPy to regenerate only the values that are outside the required boundaries. For example:
import numpy as np

def random_with_bounds(func, size, bounds):
    x = func(size=size)
    r = (x < bounds[0]) | (x > bounds[1])   # mask of out-of-bounds values
    while r.any():
        x[r] = func(size=r.sum())           # redraw only the offending entries
        r[r] = (x[r] < bounds[0]) | (x[r] > bounds[1])
    return x
Then you can use it like:
random_with_bounds(np.random.normal, 1000, (-1, 1))
Another version, using index arrays via np.argwhere, gives slightly increased performance:
def random_with_bounds_2(func, size, bounds):
    x = func(size=size)
    r = np.argwhere((x < bounds[0]) | (x > bounds[1])).ravel()  # indices of out-of-bounds values
    while r.size > 0:
        x[r] = func(size=r.size)  # redraw only those entries
        r = r[(x[r] < bounds[0]) | (x[r] > bounds[1])]
    return x
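Since the question involves distributions fitted with scipy, here is a minimal sketch of how the helper could be fed a fitted (frozen) scipy distribution; the beta parameters and the bounds below are hypothetical stand-ins for your actual best-fit result:
import numpy as np
import scipy.stats

fitted = scipy.stats.beta(a=2.3, b=0.6)  # hypothetical best-fit distribution
samples = random_with_bounds_2(lambda size: fitted.rvs(size=size),
                               size=1000, bounds=(0.05, 0.95))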
I am trying to generate a list of 12 random weights for a stock portfolio in order to determine how the portfolio would have performed in the past given different weights assigned to each stock. The sum of the weights must of course be 1 and there is an additional restriction: each stock must have a weight between 1/24 and 1/4.
Although I am able to generate random numbers such that they all fall within the interval by using random.uniform(), as well as guarantee their sum is 1 by dividing each weighting by the sum of the weightings, I'm finding that
a) each subsequent array of weightings is very similar. I am rarely getting values for weightings that are near the upper boundary of 1/4
b) random.seed() does not seem to be working properly, whether I put it in the rand_weight() function or at the beginning of the for loop. I'm confused as to why, because I thought that generating a random seed value would make my array of weights unique for each iteration. Currently, it's cyclical, with a period of 3.
The following is my code:
import random
import numpy as np

# boundaries on weightings
n = 12
min_weight = 1 / (2 * n)
max_weight = 25 / 100

def rand_weight(e):
    random.seed()
    return e + np.random.uniform(min_weight, max_weight)

for i in range(100):
    weights = np.empty(12)
    while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
        weights = np.array(list(map(rand_weight, weights)))
        weights /= np.sum(weights)
I have already tried scattering the weights by changing min_weight and max_weight inside the for loop so that rand_weight generates newer values, but this makes the runtime really slow, because the "not" condition in the while loop takes longer to evaluate to false (the probability of all the numbers being in range decreases).
Let's start with simple facts first. If you want 12 i.i.d. numbers in the range [0.042...0.25] that sum to one, then for the mean value:
Sum(Xi) = 1
E[Sum(Xi)] = Sum(E[Xi]) = N * E[Xi] = 1
E[Xi] = 1/N = 1/12 ≈ 0.083
One corollary is that it will be hard to get numbers close to the upper boundary of the range.
And instead of sampling arbitrary values and then normalizing so the sum is 1, it is better to use a known distribution whose values sum to 1 to begin with.
So let's use the Dirichlet distribution and sample points uniformly in the simplex, which means the alpha (concentration) vector is all ones.
import numpy as np
N = 12
s = np.random.dirichlet(N*[1.0], 1)
print(np.sum(s))
Some values will be larger (or smaller) than the bounds, and you can reject those samples:
def sampleWeights(alpha, lo, hi):
    while True:
        s = np.random.dirichlet(alpha, 1)[0]
        if np.any(s > hi):
            continue  # reject
        if np.any(s < lo):
            continue  # reject
        return s      # accept
and call it like this
N = 12
alpha = N * [1.0]
q = sampleWeights(alpha, 1./24., 1./4.)
If you check, you will see that far more rejections happen at the lower bound than at the upper bound; a quick empirical check is below.
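As a sanity check (a sketch; the sample count of 100000 is an arbitrary choice), count how often a flat-Dirichlet draw violates each bound:
import numpy as np

# Fraction of flat-Dirichlet draws that violate each bound (N = 12 components).
N, lo, hi = 12, 1./24., 1./4.
s = np.random.dirichlet(N * [1.0], 100000)
print("low-bound violations: ", np.any(s < lo, axis=1).mean())   # close to 1
print("high-bound violations:", np.any(s > hi, axis=1).mean())   # much smaller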
The beauty of using the known Dirichlet distribution is that you can "concentrate" sampled values around the mean, e.g.
alpha = N*[10.0]
q = sampleWeights(alpha, 1./24., 1./4.)
will produce i.i.d. values with the same mean of 1/12 but a much smaller standard deviation, i.e. RVs far more concentrated around the mean.
And if you want non-identically distributed RVs, use different alphas:
alpha = [1.,2.,3.,4.,5.,6.,6.,5.,4.,3.,2.,1.]
q = sampleWeights(alpha, 1./24., 1./4.)
then some of the RVs will be close to the upper boundary and some close to the lower boundary. There are lots of advantages to using a known distribution.
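For intuition: the per-component mean of a Dirichlet is E[Xi] = alpha_i / sum(alpha), which a quick empirical check (a sketch) confirms:
import numpy as np

alpha = [1., 2., 3., 4., 5., 6., 6., 5., 4., 3., 2., 1.]
s = np.random.dirichlet(alpha, 100000)
# Empirical means should match alpha_i / sum(alpha) closely.
print(np.allclose(s.mean(axis=0), np.array(alpha) / sum(alpha), atol=1e-2))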
The following works. Particularly confusing to me is that np.empty(12) seemed to always return the same array (np.empty returns uninitialized memory, so it can keep handing back the same leftover buffer); once it had been initialized, it stayed the same.
This keeps each weight inside the bounds while guaranteeing the weights sum to 1.
import numpy as np
from random import random, seed

# boundaries on weightings
n = 12
min_weight = 1 / (2 * n)
max_weight = 25 / 100

seed(666)
for i in range(100):
    weights = np.zeros(n)
    while not (np.all(weights > min_weight) and np.all(weights < max_weight)):
        weights = np.array([random() for _ in range(n)])
        # rescale so that, after the min_weight shift below, the weights sum to 1
        weights /= np.sum(weights) / (1 - min_weight * n)
        weights += min_weight
    print(weights)
I'm looking to generate normally distributed random numbers between 0 and 1, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".
After modifying the normal distribution and playing around with sliders in GeoGebra, I came up with the following PDF (up to normalization): f(x) = exp(-0.5*((x - mu)/stddev)^2 / x^2) for 0 <= x <= mu, f(x) = exp(-0.5*((x - mu)/stddev)^2 / (1 - x)^2) for mu < x <= 1, and f(x) = 0 elsewhere.
Next I needed to create a method in python which would generate random samples that would be distributed according to this PDF.
Originally I thought the only way to do this was to try and derive a new equation for generating random numbers as seen in the Box-Muller proof (which I got by following along with this tutorial).
However, I thought there might be an easier way to do this by using the numpy library's np.random.choice() method.
After all, I should be able to integrate the PDF at a very small step size and get the various probabilities for said steps (approximately of course).
So with that I wrote the following script:
# Standard libs
import math

# Third party libs
import numpy as np
from alive_progress import alive_bar
from matplotlib import pyplot as plt

class RandomNumberGenerator:
    def __init__(self):
        pass

    def clamped_normal_distribution(self, mu: float,
                                    stddev: float, x: float):
        """ Computes a value from the clamped normal distribution """
        divideByZeroAvoider = 1e-5
        if x < 0 or x > 1:
            return 0
        elif x >= 0 and x <= mu:
            return math.exp(-0.5*((x - mu) / stddev)**2
                            * (1/(x**2 + divideByZeroAvoider)))
        elif x <= 1 and x > mu:
            return math.exp(-0.5*((x - mu) / stddev)**2
                            * (1/((1 - x)**2 + divideByZeroAvoider)))
        else:
            print("This shouldn't happen!: {}".format(x))
            return 0
if __name__ == '__main__':
    rng = RandomNumberGenerator()
    mu = 0.7
    stddev = 1
    stepSize = 1e-3
    x = np.linspace(stepSize, 1, int(1/stepSize) - 1)

    # Determine the total area under the curve
    samples = []
    print("Generating samples...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            samples.append(rng.clamped_normal_distribution(
                mu, stddev, i))
            bar()
    area = np.trapz(samples, dx=stepSize)
    print("Area = {}".format(area))

    # Determine the probability of x falling in a specific interval
    probabilities = []
    print("Generating probabilities...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            lead = rng.clamped_normal_distribution(mu,
                                                   stddev, i)
            lag = rng.clamped_normal_distribution(mu,
                                                  stddev, i - stepSize)
            probability = np.trapz(
                np.array([lag, lead]),
                dx=stepSize)
            # Divide by the area because this isn't a standard normal
            probabilities.append(probability / area)
            bar()
    # Should be approximately 1
    print("Probability: {}".format(sum(probabilities)))

    plt.plot(x, probabilities)
    plt.show()

    y = []
    print("Performing distribution test...")
    testSize = int(10e3)
    with alive_bar(testSize) as bar:
        for _ in range(testSize):
            randSamp = np.random.choice(samples, p=probabilities)
            y.append(randSamp)
            bar()

    plt.hist(y, 300)
    plt.show()
The first plot of the probabilities against the linearly spaced samples looks promising, giving me the following graph:
However, if we use these samples as choices with given probabilities, we get the following histogram:
I have no idea why this isn't working correctly.
I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms distributed according to the given probabilities array.
I'd really appreciate some advice/intuition if at all possible :).
It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
ADDITION BY AUTHOR:
The reason for this is that samples contains the values of the curve (i.e. the y-axis values, NOT the x-axis positions).
Thus the values with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
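In code, the fix is a one-line change in the sampling loop:
randSamp = np.random.choice(x, p=probabilities)  # sample x-positions, weighted by the PDF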
Changing this to x gives us the following graphs (for 10e3 samples):
Working as expected, very nice.
I have done some searching but I cannot seem to find a reasonable way to sample from a truncated normal distribution.
Without truncation I was doing:
samples = [np.random.normal(loc=x,scale=d) for (x,d) in zip(X,D)]
X and D being lists of floats.
Currently I am implementing truncation as such:
def truncnorm(loc, scale, bounds):
    s = np.random.normal(loc, scale)
    if s > bounds[1]:
        return bounds[1]
    elif s < bounds[0]:
        return bounds[0]
    return s
samples = [truncnorm(loc=x,scale=d,bounds=b) for (x,d,b) in zip(X,D,bounds)]
bounds being a list of tuples (min,max)
This approach feels a little awkward, so I'm wondering if there is a better way?
Returning the value of the bounds for samples outside them will result in too many samples falling exactly on the bounds, which is not representative of the actual distribution. Out-of-bounds values need to be rejected and replaced by a new sample. Such code could be:
def test_truncnorm(loc, scale, bounds):
    while True:
        s = np.random.normal(loc, scale)
        if bounds[0] <= s <= bounds[1]:
            break
    return s
This can be extremely slow given narrow bounds.
Scipy's truncnorm handles such cases more efficiently. A bit surprisingly, the bounds are expressed in terms of the standard normal, so your call would be:
s = scipy.stats.truncnorm.rvs((bounds[0]-loc)/scale, (bounds[1]-loc)/scale, loc=loc, scale=scale)
Note that scipy works much faster when making use of numpy's vectorization and broadcasting. And once you're used to the notation, it also looks simpler to write and read. All samples can be calculated in one go as:
import numpy as np
import scipy.stats

X = np.array(X)
D = np.array(D)
bounds = np.array(bounds)
samples = scipy.stats.truncnorm.rvs((bounds[:, 0] - X) / D, (bounds[:, 1] - X) / D, loc=X, scale=D)
I am trying to estimate the integral of f(x) = x*cos(x) over [0, 2*pi] using the Monte Carlo method (in Python):
I am using 1000 random points to estimate the integral. Here's my code:
import random
import numpy as np

N = 1000  # total number of points to be generated

def f(x):
    return x*np.cos(x)

## Points between the x-axis and the curve will be stored in these empty lists.
red_points_x = []
red_points_y = []
blue_points_x = []
blue_points_y = []

## The loop checks if a point is between the x-axis and the curve or not.
i = 0
while i < N:
    x = random.uniform(0, 2*np.pi)
    y = random.uniform(3.426*np.cos(3.426), 2*np.pi*np.cos(2*np.pi))
    if (0 <= x <= np.pi and 0 <= y <= f(x)) or (np.pi/2 <= x <= 3*np.pi/2 and f(x) <= y <= 0) or (3*np.pi/2 <= x <= 2*np.pi and 0 <= y <= f(x)):
        red_points_x.append(x)
        red_points_y.append(y)
    else:
        blue_points_x.append(x)
        blue_points_y.append(y)
    i += 1

area_of_rectangle = (2*np.pi)*(2*np.pi*np.cos(2*np.pi))
area = area_of_rectangle*(len(red_points_x))/N
print(area)
Output:
7.658813015245341
But that's far from 0 (the analytic solution)
Here's a visual representation of the area I am trying to plot:
Am I doing something wrong or missing something in my code? Please help, your help will be much appreciated. Thank you so much in advance.
TLDR: I believe the way you calculate the approximation is slightly wrong.
Looking at the Wikipedia definition of Monte Carlo integration, the following definition is made:
https://en.wikipedia.org/wiki/Monte_Carlo_integration#Example
V corresponds to the volume (here, a length) of the region the random points are drawn from. Since x is sampled uniformly from [0, 2*pi], V = 2*pi; the bounding rectangle's area only matters for the hit-or-miss approach, not for this mean-value estimator.
So Q_N is V divided by N, times the sum of the function evaluated at the randomly generated points. Hence:
total = 0
i = 0
while i < N:
    x = random.uniform(0, 2 * np.pi)
    total += f(x)
    i += 1

V = 2 * np.pi  # length of the integration interval
area = (V * total) / N
Averaging this estimator over 1000 runs with N=1000 (to remove the influence of the randomly generated values) yields results close to the true value of 0. As you increase N, the accuracy increases.
You are on the right track!
A couple pointers to put you on course...
Make your bounding box bigger in the y dimension to alleviate some of the confusing math. Yes, it will converge faster if you get it to "just touch" the max and min, but don't shoot for that yet. Heck, just make it -5 < y < 10 and you will have a nice (larger) box that covers the area you want to integrate. So, change your y generation to that and also change the area of your box calculation
Don't change x, you have it right 0 < x < 2*pi
When you are comparing a point to see if it is "under the curve" you do NOT need to check the x value... right? Just check whether y is between f(x) and the axis; if so, it is "red". More on this in the next point...
Also on the point above: you will need another category for the points that are BELOW the x-axis, because you will want to reduce your total by that amount (see the sketch after this list). An alternate "trick" is to shift your whole function up by some constant such that the entire integral is positive, and then reduce your total by the size of that added rectangle (constant * width)
Also, as you work on this, plot your points with matplotlib, it should be very easy the way you have your points gathered to overlay scatter plots with what you have and see if it looks visually accurate!
Comment me back w/ further q's... you got this!
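Here is a minimal sketch of the signed hit-or-miss approach described in the pointers above, using the suggested oversized box -5 < y < 10 (the sample count is an arbitrary choice):
import numpy as np

N = 100000
f = lambda x: x * np.cos(x)
x = np.random.uniform(0, 2 * np.pi, N)
y = np.random.uniform(-5, 10, N)
box_area = (2 * np.pi) * (10 - (-5))
plus = np.sum((y >= 0) & (y <= f(x)))    # under the curve, above the axis
minus = np.sum((y <= 0) & (y >= f(x)))   # above the curve, below the axis
print(box_area * (plus - minus) / N)     # approaches the true integral, 0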
Imagine a simulation experiment in which the output is n total numbers, where k of them are sampled from an exponential random variable with rate a and n-k are sampled from an exponential random variable with rate b. The constraints are that 0 < a ≤ b and 0 ≤ k ≤ n, but a, b, and k are all unknown. Further, because of details of the simulation experiment, when a << b, k ≈ 0, and when a = b, k ≈ n/2.
My goal is to estimate either a or b (I don't care about k, and I don't need to estimate both a and b: just one of the two is fine). From speculation, it seems as though estimating just b might be the easiest path (when a << b, there is pretty much nothing to use to estimate a and plenty to estimate b, and when a = b, there is still plenty to estimate b). I want to do it in Python ideally, but I am open to any free software.
My first approach was to use scipy.optimize to optimize a likelihood function where, for each number in my dataset, I compute P(X=x) for an exponential with rate a, compute the same for an exponential with rate b, and simply choose the larger of the two:
from sys import stdin
from math import exp, log
from scipy.optimize import fmin

DATA = None
DISP = False  # set to True to print fmin convergence output

def pdf(x, l):  # compute P(X=x) for an exponential rv X with rate l
    return l*exp(-1*l*x)

def logML(X, la, lb):  # compute the log-ML of data points X given two exponentials with rates la and lb where la < lb
    ml = 0.0
    for x in X:
        ml += log(max(pdf(x, la), pdf(x, lb)))
    return ml

def f(x):  # objective function to minimize
    assert DATA is not None, "DATA cannot be None"
    la, lb = x
    if la > lb:  # force la <= lb
        return float('inf')
    elif la <= 0 or lb <= 0:
        return float('inf')  # force la and lb > 0
    return -1*logML(DATA, la, lb)

if __name__ == "__main__":
    DATA = [float(x) for x in stdin.read().split()]  # read input data
    Xbar = sum(DATA)/len(DATA)  # compute mean
    x0 = [1/Xbar, 1/Xbar]  # start with la = lb = 1/mean
    result = fmin(f, x0, disp=DISP)
    print("ML Rates: la = %f and lb = %f" % tuple(result))
This unfortunately didn't work very well. For some selections of the parameters, it's within an order of magnitude, but for others, it's absurdly off. Given my problem (with its constraints) and my goal of estimating the larger parameter of the two exponentials (without caring about the smaller parameter nor the number of points that came from either), any ideas?
I posted the question in more general statistical terms on the stats Stack Exchange, and it got an answer:
https://stats.stackexchange.com/questions/291642/how-to-estimate-parameters-of-mixture-of-2-exponential-random-variables-ideally
Also, I tried the following, which worked decently well:
First, for every integer percentile (1st percentile, 2nd percentile, ..., 99th percentile), I compute an estimate of b using the closed-form quantile equation for an exponential distribution (the i-th quantile equals −ln(1 − i)/λ, so λ = −ln(1 − i)/(i-th quantile), where the i-th quantile is the (i*100)-th percentile). The result is a list where each i-th element corresponds to the b estimate from the (i+1)-th percentile.
Then I perform peak-calling on this list using the Python implementation of the Matlab peak-calling function, take the list of resulting peaks, and return the minimum. It seems to work fairly well.
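A sketch of this percentile-based estimator; scipy.signal.find_peaks is used here as a stand-in for the Matlab-style peak caller mentioned above, and the no-peak fallback is my assumption:
import numpy as np
from scipy.signal import find_peaks  # stand-in for the Matlab-style peak caller

def estimate_b(data):
    data = np.asarray(data)
    estimates = []
    for p in range(1, 100):                     # 1st through 99th percentile
        i = p / 100.0                            # quantile level
        q = np.percentile(data, p)               # empirical i-th quantile
        estimates.append(-np.log(1.0 - i) / q)   # lambda = -ln(1 - i) / quantile
    peaks, _ = find_peaks(estimates)
    # Return the smallest peak value; fall back to the max if no peaks are found.
    return min(np.asarray(estimates)[peaks]) if len(peaks) else max(estimates)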
I will implement the EM solution in the Stack Exchange post as well and see which works better.
EDIT: I implemented the EM solution, and it seems to work decently well in my simulations (n = 1000, various a and b).
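For completeness, here is a minimal EM sketch for a two-component exponential mixture, following the standard EM updates (the initialization and fixed iteration count are my assumptions, not necessarily the Stack Exchange answer's exact recipe):
import numpy as np

def em_two_exponentials(x, iters=200):
    x = np.asarray(x, dtype=float)
    a, b, w = 0.5 / x.mean(), 2.0 / x.mean(), 0.5  # crude initialization
    for _ in range(iters):
        # E-step: responsibility of component a for each point
        pa = w * a * np.exp(-a * x)
        pb = (1.0 - w) * b * np.exp(-b * x)
        r = pa / (pa + pb)
        # M-step: reweighted MLEs of the mixing weight and the two rates
        w = r.mean()
        a = r.sum() / (r * x).sum()
        b = (1.0 - r).sum() / ((1.0 - r) * x).sum()
    return min(a, b), max(a, b)  # (a, b) with a <= b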