PYMC3: How to use math.switch for high dimensional random variables - python

I am currently trying to implement change point detection using this guide: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_PyMC3.ipynb
It uses a switch statement to decide between the parameters of distributions for before and after the change point.
lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)
I am also trying to find a changepoint, but using data that is assumed to come from a multivariate distribution.
Here is my code:
tau = pm.Uniform("tau_", lower = x_data[0], upper = x_data[-1])
mus_1 = pm.Uniform("mus1", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_2 = pm.Uniform("mus2", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_ = pm.math.switch(tau > x_data, mus_1, mus_2)
I set shape to 10 because the data are assumed to come from a 10-dimensional multivariate normal distribution.
I assumed that the switch statement would assign the shape-10 random variable element-wise across x_data (7919 points).
However, I get the following error:
ValueError: Input dimension mis-match. (input[0].shape[0] = 7919, input[1].shape[0] = 10)
It seems like the switch statement only allows you to switch between one-dimensional random variables. How do I work around this?

I don't have access to the rest of your model, but I ran into this same issue and was able to develop a workaround by indexing into mus1 and mus2 inside the switch call. So, assuming you have some index array idx, the code would look as follows:
tau = pm.Uniform("tau_", lower = x_data[0], upper = x_data[-1])
mus_1 = pm.Uniform("mus1", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_2 = pm.Uniform("mus2", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_ = pm.math.switch(tau > x_data, mus_1[idx], mus_2[idx])
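Another option that avoids building an index array at all (a hedged sketch of my own, relying only on standard NumPy/Theano broadcasting rather than anything specific to the original model) is to give the comparison an explicit trailing axis, so the condition has shape (7919, 1) and broadcasts against the shape-(10,) means:
# Hedged alternative: broadcast instead of indexing. x_data[:, None] has shape
# (n_points, 1), so the switch result has shape (n_points, 10).
mus_ = pm.math.switch(tau > x_data[:, None], mus_1, mus_2)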

Related

Python - unexpected shape parameter behavior in scipy genextreme fit

I've been trying to fit the GEV distribution to some annual maximum river discharge data using SciPy's stats.genextreme, but I've found some weird behavior in the fit. Depending on how small your data are (e.g., on the order of 1e-5 vs. 1e-1), the returned shape parameter can be dramatically different. For example:
import scipy as scipy
import numpy as np
from scipy.stats import genextreme as gev
from scipy.stats import gumbel_r as gumbel
#Set up arrays of values to fit curve to
sample=np.random.rand(1,30) #Random set of decimal values
smallVals = sample*1e-5 #Scale to smaller values
#If the above is not creating different values, this instance of random numbers has:
bugArr = np.array([[0.25322987, 0.81952358, 0.94497455, 0.36295543, 0.72272746, 0.49482558,0.65674877, 0.40876558, 0.64952248, 0.23171052, 0.24645658, 0.35359126,0.27578928, 0.24820775, 0.69789187, 0.98876361, 0.22104156,0.40019593,0.0756707, 0.12342556, 0.3601186, 0.54137089,0.43477705, 0.44622486,0.75483338, 0.69766687, 0.1508741, 0.75428996, 0.93706003, 0.1191987]])
bugArr_small = bugArr*1e-5
#This array of random numbers gives the same shape parameter regardless
fineArr = np.array([[0.7449611, 0.82376693, 0.32601009, 0.18544293, 0.56779629, 0.30495415,
0.04670362, 0.88106521, 0.34013959, 0.84598841, 0.24454428, 0.57981437,
0.57129427, 0.8857514, 0.96254429, 0.64174078, 0.33048637, 0.17124045,
0.11512589, 0.31884749, 0.48975204, 0.87988863, 0.86898236, 0.83513966,
0.05858769, 0.25889509, 0.13591874, 0.89106616, 0.66471263, 0.69786708]])
fineArr_small = fineArr*1e-5
#GEV fit for both arrays - shouldn't dramatically change distribution
gev_fit = gev.fit(sample)
gevSmall_fit = gev.fit(smallVals)
gevBug = gev.fit(bugArr)
gevSmallBug = gev.fit(bugArr_small)
gevFine = gev.fit(fineArr)
gevSmallFine = gev.fit(fineArr_small)
I get the following output for the GEV parameters estimated for the bugArr/bugArr_small and fineArr/fineArr_small:
Known bug array
Random values: (0.12118250540401079, 0.36692231766996053, 0.23142400358716353)
Random values scaled: (-0.8446554391074808, 3.0751769299431084e-06, 2.620390405092363e-06)
Known fine array
Random values: (0.6745399522587823, 0.47616297212022757, 0.34117425062278584)
Random values scaled: (0.6745399522587823, 4.761629721202293e-06, 3.411742506227867e-06)
Why would the shape parameter change so dramatically when the only difference in the data is a change in scaling? I would've expected the behavior to be consistent with the fineArr results (no change in shape parameter, and appropriate scaling of the location and scale parameters). I've repeated the test in Matlab, and the results there are in line with what I expected (i.e., no change in shape parameter).
I think I know why this might be happening. It is possible to pass initial shape parameter estimates when fitting; see the documentation for scipy.stats.rv_continuous.fit, where it states "Starting value(s) for any shape-characterizing arguments (those not provided will be determined by a call to _fitstart(data)). No default value." Here is some extremely ugly but functional code using my pyeq3 statistical distribution fitter, which internally tries different starting estimates, fits each of them, and returns the parameters with the best nnlf among the fits. This example does not show the behavior you observe and gives the same shape parameters regardless of scaling. You would need to install pyeq3 with "pip3 install pyeq3" to run it. The pyeq3 code is designed for text input from a web interface on zunzun.com, so hold your nose - here is the example code:
import numpy as np
#Set up arrays of values to fit curve to
sample=np.random.rand(1,30) #Random set of decimal values
smallVals = sample*1e-5 #Scale to smaller values
#If the above is not creating different values, this instance of random numbers has:
bugArr = np.array([0.25322987, 0.81952358, 0.94497455, 0.36295543, 0.72272746, 0.49482558,0.65674877, 0.40876558, 0.64952248, 0.23171052, 0.24645658, 0.35359126,0.27578928, 0.24820775, 0.69789187, 0.98876361, 0.22104156,0.40019593,0.0756707, 0.12342556, 0.3601186, 0.54137089,0.43477705, 0.44622486,0.75483338, 0.69766687, 0.1508741, 0.75428996, 0.93706003, 0.1191987])
bugArr_small = bugArr*1e-5
#This array of random numbers gives the same shape parameter regardless
fineArr = np.array([0.7449611, 0.82376693, 0.32601009, 0.18544293, 0.56779629, 0.30495415,
0.04670362, 0.88106521, 0.34013959, 0.84598841, 0.24454428, 0.57981437,
0.57129427, 0.8857514, 0.96254429, 0.64174078, 0.33048637, 0.17124045,
0.11512589, 0.31884749, 0.48975204, 0.87988863, 0.86898236, 0.83513966,
0.05858769, 0.25889509, 0.13591874, 0.89106616, 0.66471263, 0.69786708])
fineArr_small = fineArr*1e-5
bugArr_str = ''
for i in range(len(bugArr)):
    bugArr_str += str(bugArr[i]) + '\n'
bugArr_small_str = ''
for i in range(len(bugArr_small)):
    bugArr_small_str += str(bugArr_small[i]) + '\n'
fineArr_str = ''
for i in range(len(fineArr)):
    fineArr_str += str(fineArr[i]) + '\n'
fineArr_small_str = ''
for i in range(len(fineArr_small)):
    fineArr_small_str += str(fineArr_small[i]) + '\n'
import pyeq3
simpleObject_bugArr = pyeq3.IModel.IModel()
simpleObject_bugArr._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(bugArr_str, simpleObject_bugArr, False)
solver = pyeq3.solverService()
result_bugArr = solver.SolveStatisticalDistribution('genextreme', simpleObject_bugArr.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
simpleObject_bugArr_small = pyeq3.IModel.IModel()
simpleObject_bugArr_small._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(bugArr_small_str, simpleObject_bugArr_small, False)
solver = pyeq3.solverService()
result_bugArr_small = solver.SolveStatisticalDistribution('genextreme', simpleObject_bugArr_small.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
simpleObject_fineArr = pyeq3.IModel.IModel()
simpleObject_fineArr._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(fineArr_str, simpleObject_fineArr, False)
solver = pyeq3.solverService()
result_fineArr = solver.SolveStatisticalDistribution('genextreme', simpleObject_fineArr.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
simpleObject_fineArr_small = pyeq3.IModel.IModel()
simpleObject_fineArr_small._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(fineArr_small_str, simpleObject_fineArr_small, False)
solver = pyeq3.solverService()
result_fineArr_small = solver.SolveStatisticalDistribution('genextreme', simpleObject_fineArr_small.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
print('ba',result_bugArr[1]['fittedParameters'])
print('ba_s',result_bugArr_small[1]['fittedParameters'])
print()
print('fa',result_fineArr[1]['fittedParameters'])
print('fa_s',result_fineArr_small[1]['fittedParameters'])
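A lighter-weight illustration of the same idea, using scipy directly rather than pyeq3 (a hedged sketch of my own, not part of the answer above): rv_continuous.fit accepts starting values for the shape parameters as positional arguments after the data, so you can supply your own guess instead of relying on _fitstart:
from scipy.stats import genextreme as gev
# Hedged sketch: supply a starting shape value (0.1 is an arbitrary guess here)
# so the optimizer does not depend solely on _fitstart's internal estimate.
shape_guess = 0.1
print(gev.fit(bugArr, shape_guess))
print(gev.fit(bugArr_small, shape_guess))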

Indexing and vectorizing a nested-loop

I am running a particular script that calculates the fractal dimension of the input data. While the script does run fine, it is very slow, and a look into it using cProfile showed that the function boxcount accounts for around 90% of the run time. I have had similar issues in previous questions, More efficient way to loop? and Vectorization of a nested for-loop. Looking at cProfile, the function itself does not run slowly, but it needs to be called a very large number of times by the script. I'm struggling to find a way to rewrite this to eliminate the large number of function calls. Here is the code below:
for j in range(starty, endy):
    jmin=j-half_tile
    jmax=j+half_tile+1
    # Loop over columns
    for i in range(startx, endx):
        imin=i-half_tile
        imax=i+half_tile+1
        # Extract a subset of points from the input grid, centered on the current
        # point. The size of tile is given by the current entry of the tile list.
        z = surface[imin:imax, jmin:jmax]
        # print 'Tile created. Size:', z.shape
        # Calculate fractal dimension of the tile using 3D box-counting
        fd, intercept = boxcount(z,dx,nside,cell,slice_size,box_size)
        FractalDim[i,j] = fd
        Lacunarity[i,j] = intercept
My real problem is that each pass through i, j computes imin, imax, jmin, jmax, which define a subset of the input data centered on the current point. The function of interest, boxcount, is evaluated over that imin:imax, jmin:jmax range as well. For this example, the value of half_tile is 6, and the values of starty, endy, startx, endx are 6, 271, 5, 210 respectively. The values of dx, cell, nside, slice_size, box_size are all just constants used in the boxcount function.
I have done problems similar to this, just not with the added complication of centering the slice of data around a particular point. Can this be vectorized, or improved at all?
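One thing that at least removes the index bookkeeping (a hedged sketch of my own, assuming NumPy >= 1.20 and the variable names from the question) is to build a view of all centered tiles up front with sliding_window_view; the boxcount calls themselves are unchanged:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

tile = 2 * half_tile + 1
# windows[a, b] is the tile-by-tile block surface[a:a+tile, b:b+tile], i.e. the
# subset centered on surface[a + half_tile, b + half_tile]; it is a view, so no
# data is copied.
windows = sliding_window_view(surface, (tile, tile))
for j in range(starty, endy):
    for i in range(startx, endx):
        z = windows[i - half_tile, j - half_tile]
        fd, intercept = boxcount(z, dx, nside, cell, slice_size, box_size)
        FractalDim[i, j] = fd
        Lacunarity[i, j] = intercept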
EDIT
Here is the code for the function boxcount as requested.
def boxcount(z,dx,nside,cell,slice_size,box_size):
    # fractal dimension calculation using box-counting method
    n = 5 # number of graph points for simple linear regression
    gx = [] # x coordinates of graph points
    gy = [] # y coordinates of graph points
    boxCount = np.zeros((5))
    cell_set = np.reshape(np.zeros((5*(nside**3))), (nside**3,5))
    nslice = nside**2
    # Box is centered at the mid-point of the tile. Calculate, for each point in the
    # tile, which voxel contains the point
    z0 = z[nside/2,nside/2]-dx*nside/2
    for j in range(1,13):
        for i in range(1,13):
            ij = (j-1)*12 + i
            # print 'i, j:', i, j
            delz1 = z[i-1,j-1]-z0
            delz2 = z[i-1,j]-z0
            delz3 = z[i,j-1]-z0
            delz4 = z[i,j]-z0
            delz = 0.25*(delz1+delz2+delz3+delz4)
            if delz < 0.0:
                break
            slice = ceil(delz)
            # print " delz:",delz," slice:",slice
            # Identify the voxel occupied by current point
            ijk = int(slice-1.)*nslice + (j-1)*nside + i
            for k in range(5):
                if cell_set[cell[ijk,k],k] != 1:
                    cell_set[cell[ijk,k],k] = 1
                    # Set any cells deeper than this one equal to one as well
                    # index = cell[ijk,k]
                    # for l in range(int(index),box_size[k],slice_size[k]):
                    #     cell_set[l,k] = 1
    # Count number of filled boxes for each box size
    boxCount = np.sum(cell_set,axis=0)
    # print "boxCount:", boxCount
    for ib in range(1,n+1):
        # print "ib:",ib," x(ib):",math.log(1.0/ib)," y(ib):",math.log(boxCount[ib-1])
        gx.append( math.log(1.0/ib) )
        gy.append( math.log(boxCount[ib-1]) )
    # simple linear regression
    m, b = np.polyfit(gx,gy,1)
    # print "Polyfit: Slope:", m,' Intercept:', b
    # fd = m-1
    fd = max(2.,m)
    return (fd, b)
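As a partial step, the per-point arithmetic inside boxcount can be vectorized (a hedged sketch of mine, assuming z is the 13x13 tile and nside = 12 as above); the break on negative delz and the cell_set bookkeeping would still need a loop or extra masking:
import numpy as np

z0 = z[nside // 2, nside // 2] - dx * nside / 2
# delz_grid[i-1, j-1] equals the delz computed for (i, j) in the original loops
delz_grid = 0.25 * (z[:-1, :-1] + z[:-1, 1:] + z[1:, :-1] + z[1:, 1:]) - z0
slices = np.ceil(delz_grid)  # the "slice" index for every (i, j) pair at once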

python return array from iteration

I want to plot an approximation of the number "pi" that is generated by a function of two uniformly distributed random variables. The goal is to show that with a larger sample draw the function value approaches "pi".
Here is my function for pi:
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt

def pi(n):
    x = rnd.uniform(low = -1, high = 1, size = n) #n = size of draw
    y = rnd.uniform(low = -1, high = 1, size = n)
    a = x**2 + y**2 <= 1 #1 if rand. draw is inside the unit circle, else 0
    ac = np.count_nonzero(a) #count 1's
    af = np.float(ac) #create float for precision
    pi = (af/n)*4 #compute pi estimate dependent on size of draw
    return pi
My problem:
I want to create a lineplot that plots the values from pi() dependent on n.
My first attempt was:
def pipl(n):
    for i in np.arange(1,n):
        plt.plot(np.arange(1,n), pi(i))
    print plt.show()

pipl(100)
which returns:
ValueError: x and y must have same first dimension
My second guess was to write an iterator:
def y(n):
    n = np.arange(1,n)
    for i in n:
        y = pi(i)
        print y

y(1000)
which results in:
3.13165829146
3.16064257028
3.06519558676
3.19839679359
3.13913913914
So the algorithm isn't far off; however, I need the output as a data type which matplotlib can read.
I read:
http://docs.scipy.org/doc/numpy/reference/routines.array-creation.html#routines-array-creation
and tried to implement the function like:
...
y = np.array(pi(i))
...
or
...
y = pi(i)
y = np.array(y)
...
and all the other array-creation functions available on that page. However, I can't seem to get my iterated y values into a form that matplotlib can read.
I am fairly new to python so please be considerate with my simple request. I am really stuck here and can't seem to solve this issue by myself.
Your help is really appreciated.
You can try this:
def pipl(n):
    plt.plot(np.arange(1,n), [pi(i) for i in np.arange(1,n)])
    print plt.show()

pipl(100)
which gives me this plot.
If you want to stay with your iterable approach you can use Numpy's fromiter() to collect the results to an array. Like:
def pipl(n):
    for i in np.arange(1,n):
        yield pi(i)

n = 100
plt.plot(np.arange(1,n), np.fromiter(pipl(n), dtype=float))
But I think NumPy's vectorize would be even better in this case; it makes the resulting code much more readable (to me). With this approach you don't need the pipl function anymore.
# vectorize the function pi
pi_vec = np.vectorize(pi)
# define all n's
n = np.arange(1,101)
# and plot
plt.plot(n, pi_vec(n))
A little side note: naming a function pi when it does not return the actual value of pi seems a bit risky to me.
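For example (a hedged aside of my own, not from the answer), the function name can silently shadow the constant of the same name:
from math import pi
print pi          # 3.141592653589793
def pi(n):        # this definition silently replaces the imported constant
    return 4.0    # placeholder estimator
print pi          # <function pi at 0x...>, not the constant anymore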

How to pick points under the curve?

What I'm trying to do is make a Gaussian function graph, then pick random points in a space, say y = [0, 1] (because it's normalized) and x = [0, 200]. Then I want it to ignore all values above the curve and only keep the values underneath it.
import numpy
import random
import math
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from math import sqrt
from numpy import zeros
from numpy import numarray
variance = input("Input variance of the star:")
mean = input("Input mean of the star:")
x=numpy.linspace(0,200,1000)
sigma = sqrt(variance)
z = max(mlab.normpdf(x,mean,sigma))
foo = (mlab.normpdf(x,mean,sigma))/z
plt.plot(x,foo)
zing = random.random()
random = random.uniform(0,200)
import random
def method2(size):
    ret = set()
    while len(ret) < size:
        ret.add((random.random(), random.uniform(0,200)))
    return ret
size = input("Input number of simulations:")
foos = set(foo)
xx = set(x)
method = method2(size)
def undercurve(xx,foos,method):
    Upper = numpy.where(foos<(method))
    Lower = numpy.where(foos[Upper]>(method[Upper]))
    return (xx[Upper])[Lower],(foos[Upper])[Lower]
When I try to print undercurve, I get an error:
TypeError: 'set' object has no attribute '__getitem__'
and I have no idea how to fix it.
As you can all see, I'm quite new at python and programming in general, but any help is appreciated and if there are any questions I'll do my best to answer them.
The immediate cause of the error you're seeing is presumably this line (which should be identified by the full traceback -- it's generally quite helpful to post that):
Lower = numpy.where(foos[Upper]>(method[Upper]))
because the confusingly-named variable method is actually a set, as returned by your function method2. Actually, on second thought, foos is also a set, so it's probably failing on that first. Sets don't support indexing with something like the_set[index]; that's what the complaint about __getitem__ means.
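A two-line illustration of that complaint (the exact wording of the TypeError varies between Python versions):
s = {0.1, 0.2, 0.3}
s[0]  # TypeError: sets are unordered and cannot be indexed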
I'm not entirely sure what all the parts of your code are intended to do; variable names like "foos" don't really help there. So here's how I might do what you're trying to do:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

# generate sample points
num_pts = 500
sample_xs = np.random.uniform(0, 200, size=num_pts)
sample_ys = np.random.uniform(0, 1, size=num_pts)
# define distribution
mean = 50
sigma = 10
# figure out "normalized" pdf vals at sample points
max_pdf = mlab.normpdf(mean, mean, sigma)
sample_pdf_vals = mlab.normpdf(sample_xs, mean, sigma) / max_pdf
# which ones are under the curve?
under_curve = sample_ys < sample_pdf_vals
# get pdf vals to plot
x = np.linspace(0, 200, 1000)
pdf_vals = mlab.normpdf(x, mean, sigma) / max_pdf
# plot the samples and the curve
colors = np.array(['cyan' if b else 'red' for b in under_curve])
plt.scatter(sample_xs, sample_ys, c=colors)
plt.plot(x, pdf_vals)
Of course, you should also realize that if you only want the points under the curve, this is equivalent to (but much less efficient than) just sampling from the normal distribution and then randomly selecting a y for each sample uniformly from 0 to the pdf value there:
sample_xs = np.random.normal(mean, sigma, size=num_pts)
max_pdf = mlab.normpdf(mean, mean, sigma)
sample_pdf_vals = mlab.normpdf(sample_xs, mean, sigma) / max_pdf
sample_ys = np.array([np.random.uniform(0, pdf_val) for pdf_val in sample_pdf_vals])
It's hard to read your code. Anyway, you can't access a set using []; that is, foos[Upper], method[Upper], etc. are all illegal. I don't see why you convert foo and x into sets. In addition, for a point produced by method2, say (x0, y0), it is very likely that x0 is not present in x.
I'm not familiar with numpy, but this is what I'd do for the purpose you specified:
import random
import scipy.stats

def undercurve(size):
    result = []
    for i in xrange(size):
        x = random.random()
        y = random.random()
        if y < scipy.stats.norm(0, 200).pdf(x): # here's the 'undercurve'
            result.append((x, y))
    return result

Discretization of probability array in Python

I have a numpy array (actually imported from a GIS raster map) which contains probability values of occurrence of a species, like the following example:
from numpy import random
a = random.randint(1, 20, 1200).reshape(40, 30)
b = (a * 1.0) / a.sum()
Now I want to get a discrete version of that array again. For example, if I have 100 individuals located on the area of that array (1200 cells), how are they distributed? Of course they should be distributed according to the probabilities, meaning lower values indicate a lower probability of occurrence. However, since this is statistics, there is still a chance that an individual lands in a low-probability cell. It should be possible for multiple individuals to occupy one cell...
It is like transforming a continuous distribution curve back into a histogram. Just as many different histograms can result in a given distribution curve, it should also work the other way around, so applying the algorithm I am looking for will produce different discrete values each time.
Is there any algorithm in Python which can do that? As I am not that familiar with discretization, maybe someone can help.
Use random.choice with bincount:
np.bincount(np.random.choice(b.size, 100, p=b.flat),
minlength=b.size).reshape(b.shape)
If you don't have NumPy 1.7, you can replace random.choice with:
np.searchsorted(np.cumsum(b), np.random.random(100))
giving:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
minlength=b.size).reshape(b.shape)
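A quick, self-contained check of that recipe (my own sketch, using an arbitrary 40x30 probability array like the one in the question):
import numpy as np

a = np.random.randint(1, 20, 1200).reshape(40, 30).astype(float)
b = a / a.sum()                      # probabilities over the 1200 cells
counts = np.bincount(np.random.choice(b.size, 100, p=b.ravel()),
                     minlength=b.size).reshape(b.shape)
print(counts.sum())                  # 100 individuals, possibly several per cell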
So far I think ecatmur's answer seems quite reasonable and simple.
I just want to add maybe a more "applied" example. Consider a die with 6 faces (6 numbers), where each number/result has a probability of 1/6. Displaying the die in the form of an array could look like:
b = np.array([[1,1,1],[1,1,1]])/6.0
Thus, rolling the die 100 times (n = 100) results in the following simulation:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),minlength=b.size).reshape(b.shape)
I think that can be an appropriate approach for such an application.
Thus thank you ecatmur for your help!
/Johannes
This is similar to a question I asked earlier this month.
import random
def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector

from numpy.random import multinomial
import math
def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = The sum of list values
    Distribution = can be 'uniform' for uniform distribution, 'normal' for a normal distribution ~ N(0,1) with +/- 5 sigma (default), or a list of size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary) distribution. Probabilities of each of the p different outcomes. These should sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize-1) == DistributionSize:
            Values = multinomial(ListSumValue,Distribution,size=1)
            OutputValue = Values[0]
        elif Distribution.lower() == 'uniform': #I do not recommend this!!!! I see that it is not as random (at least on my computer) as I had hoped
            UniformDistro = [1/ListSize for i in range(ListSize)]
            Values = multinomial(ListSumValue,UniformDistro,size=1)
            OutputValue = Values[0]
        elif Distribution.lower() == 'normal':
            """
            Normal Distribution Construction....It's very flexible and hideous
            Assume a +-3 sigma range. Warning, this may or may not be a suitable range for your implementation!
            If one wishes to explore a different range, then changes the LowSigma and HighSigma values
            """
            LowSigma = -3 #-3 sigma
            HighSigma = 3 #+3 sigma
            StepSize = 1/(float(ListSize) - 1)
            ZValues = [(LowSigma * (1-i*StepSize) + (i*StepSize)*HighSigma) for i in range(int(ListSize))]
            #Construction parameters for N(Mean,Variance) - Default is N(0,1)
            Mean = 0
            Var = 1
            #NormalDistro= [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
            NormalDistro = list()
            for i in range(len(ZValues)):
                if i == 0:
                    ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                    NormalDistro.append(ERFCVAL)
                elif i == len(ZValues) - 1:
                    ERFCVAL = NormalDistro[0]
                    NormalDistro.append(ERFCVAL)
                else:
                    ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                    ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
                    ERFCVAL = ERFCVAL1 - ERFCVAL2
                    NormalDistro.append(ERFCVAL)
            #print "Normal Distribution sum = %f"%sum(NormalDistro)
            Values = multinomial(ListSumValue,NormalDistro,size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
        return OutputValue
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue
ProbabilityDistribution = RandFloats(1200) #This is your probability distribution for your 1200 cell array
SizeDistribution = RandIntVec(1200, 100, Distribution=ProbabilityDistribution) #for a 1200 cell array, whose sum is 100, with the given probability distribution
The two important lines are the last two in the code above.
