SciPy: generating custom random variable from PMF

SciPy: generating custom random variable from PMF - python

I'm trying to generate random variables according to a certain ugly distribution, in Python. I have an explicit expression for the PMF, but it involves some products which makes it unpleasant to obtain and invert the CDF (see below code for explicit form of PMF).
In essence, I'm trying to define a random variable in Python by its PMF and then have built-in code do the hard work of sampling from the distribution. I know how to do this if the support of the RV is finite, but here the support is countably infinite.
The code I am currently trying to run as per #askewchan's advice below is:
import scipy as sp
import numpy as np
class x_gen(sp.stats.rv_discrete):
def _pmf(self,k,param):
num = np.arange(1+param, k+param, 1)
denom = np.arange(3+2*param, k+3+2*param, 1)
p = (2+param)*(np.prod(num)/np.prod(denom))
return p
pa_limit = limitrv_gen()
print pa_limit.rvs(alpha,n=1)
However, this returns the error while running:
File "limiting_sim.py", line 42, in _pmf
num = np.arange(1+param, k+param, 1)
TypeError: only length-1 arrays can be converted to Python scalars
Basically, it seems that the np.arange() list isn't working somehow inside the def _pmf() function. I'm at a loss to see why. Can anyone enlighten me here and/or point out a fix?
EDIT 1: cleared up some questions by askewchan, edits reflected above.
EDIT 2: askewchan suggested an interesting approximation using the factorial function, but I'm looking more for an exact solution such as the one that I'm trying to get work with np.arange.

You should be able to subclass rv_discrete like so:
class mydist_gen(rv_discrete):
def _pmf(self, n, param):
return yourpmf(n, param)
Then you can create a distribution instance with:
mydist = mydist_gen()
And generate samples with:
mydist.rvs(param, size=1000)
Or you can then create a frozen distribution object with:
mydistp = mydist(param)
And finally generate samples with:
mydistp.rvs(1000)
With your example, this should work, since factorial automatically broadcasts. But, it might fail for large enough alpha:
import scipy as sp
import numpy as np
from scipy.misc import factorial
class limitrv_gen(sp.stats.rv_discrete):
def _pmf(self, k, alpha):
#num = np.prod(np.arange(1+alpha, k+alpha))
num = factorial(k+alpha-1) / factorial(alpha)
#denom = np.prod(np.arange(3+2*alpha, k+3+2*alpha))
denom = factorial(k + 2 + 2*alpha) / factorial(2 + 2*alpha)
return (2+alpha) * num / denom
pa_limit = limitrv_gen()
alpha = 100
pa_limit.rvs(alpha, size=10)

Related

Statsmodels: vector_ar and IRAnalysis

I'm trying to estimate impulse response functions of a -1 standard-deviation shock to a 3-dimension VAR using statsmodels.tsa, however I'm currently having issues with setting the shock magnitude.
This gives me the IRFs for a 1 s.d. shock, the default:
import numpy as np
import statsmodels.tsa as sm
model = sm.vector_ar.var_model.VAR(endog = data)
fitted = model.fit()
shock= -1*fitted.sigma_u
irf = sm.vector_ar.irf.IRAnalysis(model = fitted)
The function IRAnalysis takes an argument P, an upper diagonal matrix that sets the shocks, I found this looking at the source code. However inputting P as shown below doesn't seem to be doing anything.
irf = statsmodels.tsa.vector_ar.irf.IRAnalysis(model = fitted, P = -np.linalg.cholesky(model.fitted_U))
I would really appreciate some help.
Thanks in advance.

I have had the same question and finally found something that works on my end.
instead of using the IRAnalysis explicitly, I found that transforming the VAR model into it's MA representation was the best way to adjust the size of the shock.
from statsmodels.tsa.vector_ar.irf import IRAnalysis
J = fitted.ma_rep(T)
J = shock*np.array(J)
This will give you the output of the irfs for T periods.
I also wanted the standard error bands on my plots, so I did something similar to that particular function as well.
G, H = fitted.irf_errband_mc(orth=False, repl=1000, steps=T, signif=0.05, seed=None, burn=100, cum=False)
Hope this helps

Python/Shogun Toolbox: Convert RealFeatures to StreamingRealFeatures

I am using the Python version of the Shogun Toolbox.
I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.
In order to use it with the LinearTimeMMD, I need to create StreamingFeatures objects from these - In this case, these would be StreamingRealFeatures, as far as I know.
My first approach was using this:
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
This however does not seem to work: The LinearTimeMMD delivers warnings and an unrealistic result (growing constantly with the number of samples) and calling gen_p.get_dim_feature_space() returns -1. Also, if I try calling gen_p.get_streamed_features(100) this results in a Memory Access Error.
I tried another approach using StreamingFileFromFeatures:
streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)
gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)
But this results in the same situation with the same described problems.
It seems that in both cases, the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed.
What am I doing wrong?
EDIT: I was asked for a small working example to show the error:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import laplace, norm
def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):
# sample from both distributions
X=norm.rvs(size=n)*np.sqrt(sigma2)+mu
Y=laplace.rvs(size=n, loc=mu, scale=b)
return X,Y
# Main Script
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)
# turn data into Shogun representation (columns vectors)
feat_p=sg.RealFeatures(X.reshape(1,len(X)))
feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())
test_features = gen_p.get_streamed_features(1)
print("success")
EDIT 2: The Output of the working example:
Dimensions: -1
Number of features: -1
Number of vectors: 1
Speicherzugriffsfehler (Speicherabzug geschrieben)
EDIT 3: Additional Code with LinearTimeMMD using the RealFeatures directly.
mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05
# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]
block_size=100
mmd.set_num_blocks_per_burst(block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)
EDIT 4: Additional code sample showing the growing mmd problem:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
def mmd(n):
X = [(1.0,i) for i in range(n)]
Y = [(2.0,i) for i in range(n)]
X = np.array(X)
Y = np.array(Y)
# turn data into Shogun representation (columns vectors)
feat_p=sg.RealFeatures(X.reshape(2, len(X)))
feat_q=sg.RealFeatures(Y.reshape(2, len(Y)))
mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(100)
mmd.set_num_samples_q(100)
alpha = 0.05
block_size=100
mmd.set_num_blocks_per_burst(block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("N =", n)
print("MMD_l[X,Y]^2=%.2f" % statistic)
print()
for n in [1000, 10000, 15000, 20000, 25000, 30000]:
mmd(n)
Output:
N = 1000
MMD_l[X,Y]^2=-12.69
N = 10000
MMD_l[X,Y]^2=-40.14
N = 15000
MMD_l[X,Y]^2=-49.16
N = 20000
MMD_l[X,Y]^2=-56.77
N = 25000
MMD_l[X,Y]^2=-63.47
N = 30000
MMD_l[X,Y]^2=-69.52

For some reason, the pythonenv in my machine is broken. So, I couldn't give a snippet in Python. But let me point to a working example in C++ which attempts to address the issues (https://gist.github.com/lambday/983830beb0afeb38b9447fd91a143e67).
I think the easiest way is to create a StreamingRealFeatures instance directly from RealFeatures instance (like you tried the first time). Check test1() and test2() methods in the gist which shows the equivalence of using RealFeatures and StreamingRealFeatures in the use-case in question. The reason you were getting weird results when streaming directly is that in order to start the streaming process we need to call the start_parser method in the StreamingRealFeatures class. We handle these technicalities internally inside MMD classes. But when trying to use it directly, we need to invoke that separately (See test3() method in my attached example).
Please note that the compute_statistic() method doesn't return MMD directly, but rather returns \frac{n_x\times n_y}{n_x+n_y}\times MMD^2 (as mentioned in the doc http://shogun.ml/api/latest/classshogun_1_1CMMD.html). With that in mind, maybe the results you are getting for varying number of samples make sense.
Hope it helps.

lmfit for exponential data returns linear function

I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using the lmfit function. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculating the uncertainty with the square root of the counts in each bin (it's an exponential model), then use the lmfit module to determine the best fit along with means and uncertainty. However, graphing the output of the model.fit() method returns this graph, where the red line is the fit (and obviously not the correct fit). Fit result output graph
I've looked online and can't find a solution to this, I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
class data():
def __init__(self,file_name):
times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ',unpack=False)[:,0])
self.times = []
for i in range(len(times_dirty)):
if times_dirty[i]<40000:
self.times.append(times_dirty[i])
self.counts = []
self.binBounds = []
self.uncertainties = []
self.means = []
def binData(self,k):
self.counts, self.binBounds = np.histogram(self.times, bins=k)
self.binBounds = self.binBounds[:-1]
def calcStats(self):
if len(self.counts)==0:
print('Run binData function first')
else:
self.uncertainties = sqrt(self.counts)
def plotData(self,fit):
plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
plt.plot(self.binBounds, fit.init_fit, 'k--')
plt.plot(self.binBounds, fit.best_fit, 'r')
plt.show()
def decay(t, N, lamb, B):
return N * lamb * exp(-lamb * t) +B
def main():
muonEvents = data('C:\Users\Colt\Downloads\muon.data')
muonEvents.binData(10)
muonEvents.calcStats()
mod = Model(decay)
result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B = 1)
muonEvents.plotData(result)
print(result.fit_report())
print (len(muonEvents.times))
if __name__ == "__main__":
main()

This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.

Just to build on James Phillips answer, I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1, and t> 100. So, if the algorithm starts at lamb=1 and varies that by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, having the variables have values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.

Solving nonlinear overdetermined system using python

I'm trying to find a good way to solve a nonlinear overdetermined system with python. I looked into optimization tools here http://docs.scipy.org/doc/scipy/reference/optimize.nonlin.html but I can't figure out how to use them. What I have so far is
#overdetermined nonlinear system that I'll be using
'''
a = cos(x)*cos(y)
b = cos(x)*sin(y)
c = -sin(y)
d = sin(z)*sin(y)*sin(x) + cos(z)*cos(y)
e = cos(x)*sin(z)
f = cos(z)*sin(x)*cos(z) + sin(z)*sin(x)
g = cos(z)*sin(x)*sin(y) - sin(z)*cos(y)
h = cos(x)*cos(z)
a-h will be random int values in the range 0-10 inclusive
'''
import math
from random import randint
import scipy.optimize
def system(p):
x, y, z = p
return(math.cos(x)*math.cos(y)-randint(0,10),
math.cos(x)*math.sin(y)-randint(0,10),
-math.sin(y)-randint(0,10),
math.sin(z)*math.sin(y)*math.sin(x)+math.cos(z)*math.cos(y)-randint(0,10),
math.cos(x)*math.sin(z)-randint(0,10),
math.cos(z)*math.sin(x)*math.cos(z)+math.sin(z)*math.sin(x)-randint(0,10),
math.cos(z)*math.sin(x)*math.sin(y)-math.sin(z)*math.cos(y)-randint(0,10),
math.cos(x)*math.cos(z)-randint(0,10))
x = scipy.optimize.broyden1(system, [1,1,1], f_tol=1e-14)
could you help me out a bit here?

If I understand you right, you want to find an approximate solution to the non-linear system of equations f(x) = b where b is the vector containing the random values b=[a,...,h].
In order to do this you will first need to remove the random values from the system function, because otherwise in each iteration the solver will try to solve a different equation system. Moreover, I think that the basic Broyden method only works for a system with as many unknowns as equations. Alternatively you could use scipy.optimize.leastsq. A possible solution looks like this:
# I am using numpy because it's more convenient for the generation of
# random numbers.
import numpy as np
from numpy.random import randint
import scipy.optimize
# Removed random right-hand side values and changed nomenclature a bit.
def f(x):
x1, x2, x3 = x
return np.asarray((math.cos(x1)*math.cos(x2),
math.cos(x1)*math.sin(x2),
-math.sin(x2),
math.sin(x3)*math.sin(x2)*math.sin(x1)+math.cos(x3)*math.cos(x2),
math.cos(x1)*math.sin(x3),
math.cos(x3)*math.sin(x1)*math.cos(x3)+math.sin(x3)*math.sin(x1),
math.cos(x3)*math.sin(x1)*math.sin(x2)-math.sin(x3)*math.cos(x2),
math.cos(x1)*math.cos(x3)))
# The second parameter is used to set the solution vector using the args
# argument of leastsq.
def system(x,b):
return (f(x)-b)
b = randint(0, 10, size=8)
x = scipy.optimize.leastsq(system, np.asarray((1,1,1)), args=b)[0]
I hope this is of help for you. However, note that it is extremely unlikely that you will find a solution, especially when you generate random integers in the interval [0,10] while the range of f is limited to [-2,2]

How to pick points under the curve?

What I'm trying to do is make a gaussian function graph. then pick random numbers anywhere in a space say y=[0,1] (because its normalized) & x=[0,200]. Then, I want it to ignore all values above the curve and only keep the values underneath it.
import numpy
import random
import math
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from math import sqrt
from numpy import zeros
from numpy import numarray
variance = input("Input variance of the star:")
mean = input("Input mean of the star:")
x=numpy.linspace(0,200,1000)
sigma = sqrt(variance)
z = max(mlab.normpdf(x,mean,sigma))
foo = (mlab.normpdf(x,mean,sigma))/z
plt.plot(x,foo)
zing = random.random()
random = random.uniform(0,200)
import random
def method2(size):
ret = set()
while len(ret) < size:
ret.add((random.random(), random.uniform(0,200)))
return ret
size = input("Input number of simulations:")
foos = set(foo)
xx = set(x)
method = method2(size)
def undercurve(xx,foos,method):
Upper = numpy.where(foos<(method))
Lower = numpy.where(foos[Upper]>(method[Upper]))
return (xx[Upper])[Lower],(foos[Upper])[Lower]
When I try to print undercurve, I get an error:
TypeError: 'set' object has no attribute '__getitem__'
and I have no idea how to fix it.
As you can all see, I'm quite new at python and programming in general, but any help is appreciated and if there are any questions I'll do my best to answer them.

The immediate cause of the error you're seeing is presumably this line (which should be identified by the full traceback -- it's generally quite helpful to post that):
Lower = numpy.where(foos[Upper]>(method[Upper]))
because the confusingly-named variable method is actually a set, as returned by your function method2. Actually, on second thought, foos is also a set, so it's probably failing on that first. Sets don't support indexing with something like the_set[index]; that's what the complaint about __getitem__ means.
I'm not entirely sure what all the parts of your code are intended to do; variable names like "foos" don't really help like that. So here's how I might do what you're trying to do:
# generate sample points
num_pts = 500
sample_xs = np.random.uniform(0, 200, size=num_pts)
sample_ys = np.random.uniform(0, 1, size=num_pts)
# define distribution
mean = 50
sigma = 10
# figure out "normalized" pdf vals at sample points
max_pdf = mlab.normpdf(mean, mean, sigma)
sample_pdf_vals = mlab.normpdf(sample_xs, mean, sigma) / max_pdf
# which ones are under the curve?
under_curve = sample_ys < sample_pdf_vals
# get pdf vals to plot
x = np.linspace(0, 200, 1000)
pdf_vals = mlab.normpdf(x, mean, sigma) / max_pdf
# plot the samples and the curve
colors = np.array(['cyan' if b else 'red' for b in under_curve])
scatter(sample_xs, sample_ys, c=colors)
plot(x, pdf_vals)
Of course, you should also realize that if you only want the points under the curve, this is equivalent to (but much less efficient than) just sampling from the normal distribution and then randomly selecting a y for each sample uniformly from 0 to the pdf value there:
sample_xs = np.random.normal(mean, sigma, size=num_pts)
max_pdf = mlab.normpdf(mean, mean, sigma)
sample_pdf_vals = mlab.normpdf(sample_xs, mean, sigma) / max_pdf
sample_ys = np.array([np.random.uniform(0, pdf_val) for pdf_val in sample_pdf_vals])

It's hard to read your code.. Anyway, you can't access a set using [], that is, foos[Upper], method[Upper], etc are all illegal. I don't see why you convert foo, x into set. In addition, for a point produced by method2, say (x0, y0), it is very likely that x0 is not present in x.
I'm not familiar with numpy, but this is what I'll do for the purpose you specified:
def undercurve(size):
result = []
for i in xrange(size):
x = random()
y = random()
if y < scipy.stats.norm(0, 200).pdf(x): # here's the 'undercurve'
result.append((x, y))
return results

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

SciPy: generating custom random variable from PMF - python

Related

Statsmodels: vector_ar and IRAnalysis

Python/Shogun Toolbox: Convert RealFeatures to StreamingRealFeatures

lmfit for exponential data returns linear function

Solving nonlinear overdetermined system using python

How to pick points under the curve?

Categories

Resources