I'm doing homework for a Monte Carlo course and I'm asked to find the transition matrix of a Markov chain with 6 states, namely
0, 1, 2, 3, 4, 5
such that, after a long enough period of time, the time spent in each state is proportional to the numbers
5, 10, 5, 10, 25, 60
respectively.
I see that this is the stationary vector that we get if we have the transition matrix. I have to use the Metropolis algorithm, but all the explanations and examples I find are based on the Metropolis-Hastings algorithm.
Pseudocode of the algorithm that I have:
select x
loop over repetitions t = 1, 2, ...
    select y from Nx using density Gx
    put h = min(1, f(y)/f(x))
    if U ~ U(0, 1) < h then x <- y
end loop
I'm looking for a step by step explanation how to implement the algorithm for the given problem, preferably in python!
The algorithmic approach
The standard approach to computing the stationary distribution of a Markov Chain is the solution of linear equations, e.g. as described here https://stephens999.github.io/fiveMinuteStats/stationary_distribution.html.
The solution to your problem is the same in reverse - solve the same equations, except that in your case, you have the stationary distribution but you don't have the transition probabilities/rates.
The problem with this approach, however, is that you may construct a system of linear equations with many more variables than equations, so the system is underdetermined. To get a unique answer you have to restrict the topology of the constructed Markov Chain.
Fortunately, you seem to have no constraints on the topology of the constructed Markov Chain, so you can make some compromises. What you can do is disable most transitions, i.e. give them zero probability/rate, and only enable one transition per state. This may produce some kind of a ring topology, but it should ensure that your system of linear equations has a solution.
Primitive example
Consider stationary distribution Pi = ( x = 1/3, y = 1/3, z = 1/3)
Construct your system of linear equations as
Pi(x) = 1/3 = Pr(y,x) * Pi(y)
Pi(y) = 1/3 = Pr(z,y) * Pi(z)
Pi(z) = 1/3 = Pr(x,z) * Pi(x)
In this case a solution is Pr(y,x) = Pr(z,y) = Pr(x,z) = 1 and the obtained Markov Chain just boringly loops from x to z to y and back to x with probability 1.
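Note that the number of fitting solutions may be infinite (even for the reduced system of linear equations as shown in the example). In this case the rates can be any positive value as long as they are all equal; for probabilities, any common value in (0, 1] works, provided the remaining probability of each state is assigned to staying in that state.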
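The ring idea can be sketched in a few lines of Python. A pure cycle with probability 1 on every arrow always has a uniform stationary distribution, so for a non-uniform target the sketch below assumes that each state may also stay where it is. The function name ring_chain and the choice c = min(Pi) are illustrative assumptions, not part of the answer above. Here P[i, j] is the probability of moving from state i to state j.

```python
import numpy as np

def ring_chain(weights):
    """Ring-shaped chain (with self-loops) whose stationary distribution is
    proportional to `weights`. Hypothetical helper, for illustration only."""
    pi = np.asarray(weights, dtype=float)
    pi = pi / pi.sum()
    n = len(pi)
    c = pi.min()                   # common probability flow along every arrow
    P = np.zeros((n, n))
    for i in range(n):
        nxt = (i + 1) % n          # the single enabled transition of state i
        P[i, nxt] = c / pi[i]      # so that pi[i] * P[i, nxt] == c for all i
        P[i, i] = 1.0 - P[i, nxt]  # stay put with the remaining probability
    return P

weights = [5, 10, 5, 10, 25, 60]
P = ring_chain(weights)
pi = np.array(weights, dtype=float) / sum(weights)
print(np.allclose(pi @ P, pi))     # checks the stationarity condition pi P = pi
```

Every arrow carries the same probability flow c, which is why the balance equations hold state by state.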
So, step by step solution
Construct the system of linear equations as described.
Solve the constructed system of linear equations
The solution of your constructed system of linear equations describes a Markov Chain you are looking for. Trivially reconstruct the entire transition matrix, if you want to.
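If the Metropolis algorithm from the question has to be used, its pseudocode also defines a transition matrix directly. The following is a minimal sketch under stated assumptions: the neighbourhood Nx is taken to be all other states and the proposal density Gx is uniform over them (both are assumptions, the question does not specify them). With a symmetric proposal like this, the entry for a move x -> y is G * min(1, f(y)/f(x)), and the rejected mass stays on the diagonal.

```python
import numpy as np

f = np.array([5, 10, 5, 10, 25, 60], dtype=float)  # target weights from the question
n = len(f)

P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            g = 1.0 / (n - 1)                  # assumed uniform proposal over the other states
            h = min(1.0, f[y] / f[x])          # Metropolis acceptance probability
            P[x, y] = g * h
    P[x, x] = 1.0 - P[x].sum()                 # proposal rejected: stay in x

pi = f / f.sum()
print(np.allclose(pi @ P, pi))                 # stationarity check: pi P = pi
```

Simulating the loop from the pseudocode with the same proposal should visit the states with long-run frequencies close to pi.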
Related
I am new to linear programming and am hoping to get some help in understanding how to include intercept terms in the objective for a piecewise function (see below code example).
import pulp as pl
# Pieces
pieces = [1, 2]
# Problem
prob = pl.LpProblem('Example', pl.LpMaximize)
# Decision Vars
x_vars = pl.LpVariable.dict('x', pieces, 0, None, pl.LpInteger)
y_vars = pl.LpVariable.dict('y', pieces, 0, None, pl.LpInteger)
# Objective
prob += (-500+10*x_vars[1]) + (150+9*x_vars[2]) + (2+9.1*y_vars[1]) + (4+6*y_vars[2])
# Constraints
prob += pl.lpSum(x_vars[i] for i in pieces) + pl.lpSum(y_vars[i] for i in pieces) <= 1100
prob += x_vars[1] <= 700
prob += x_vars[2] <= 700
prob += y_vars[1] <= 400
prob += y_vars[2] <= 400
# Solve
prob.solve()
# Results
for v in prob.variables():
    print(v.name, "=", pl.value(v))
The terms included in the objective function are the piecewise intercepts and coefficients obtained from univariate piecewise regression models. For example, the linear regression model for x is yhat=-500+10*x for the first piece, and yhat=150+9*x for the second piece. Likewise, for y we have yhat=2+9.1*x and yhat=4+6*x for the first and second pieces, respectively.
If I remove and/or change any of the intercept values, I arrive at the same solution. I would have thought that each intercept is required for producing the estimates in the objective function. Have I not specified the objective function properly? Or are the intercept terms not required (and therefore not taken into account) in this type of LP formulation?
I don't exactly understand what you are trying to achieve. But let me try to explain what we normally do when we talk about piecewise linear functions.
A piecewise linear function is completely determined by its breakpoints. E.g.
The input is just these points
xbar = [1,3,6,10]
ybar = [6,2,8,7]
These points you have to calculate in advance, outside the optimization model. The intercept and slope of the segments are represented in these points. Note that the intercept cannot be ignored: it would lead to a very different segment. Calculations of these breakpoints need care: without correct breakpoints, your model will not function properly.
When using such a piecewise linear function, we want to maintain a mapping between x and y (both decision variables). I.e. we always want to hold for any feasible solution:
y = f(x)
where f represents the piecewise linear function. This means that besides choosing a segment, we need to interpolate between the breakpoints (i.e. we want to trace the blue line). The formulations below essentially form the constraint y=f(x) but in such a way that it is accepted by a MIP (Mixed Integer Programming) solver.
To interpolate between the breakpoints, we can use a lot of different formulations. The simplest is to use SOS2 variables. (SOS2 stands for Special Ordered Sets of Type 2, a construct that is supported by most high-end solvers). The formulation would look like:
Here x,y, and λ are decision variables (and xbar,ybar are data, i.e. constants). k is the set of points (here: k=1,..,4).
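As a hedged illustration of what the λ variables encode: in the usual textbook form of the SOS2 formulation (assumed here, since the formulas themselves are not reproduced above), x = sum_k xbar[k]*λ[k], y = sum_k ybar[k]*λ[k], sum_k λ[k] = 1, λ[k] >= 0, and at most two adjacent λ[k] are nonzero. The sketch below does not solve an optimization model; it only computes, for a given x, the weights a solver would have to pick, using the breakpoints from above. The function name sos2_weights is hypothetical.

```python
import numpy as np

xbar = np.array([1, 3, 6, 10], dtype=float)  # breakpoints from the text
ybar = np.array([6, 2, 8, 7], dtype=float)

def sos2_weights(x):
    """Weights lam with sum(lam) == 1, at most two adjacent nonzeros,
    and lam @ xbar == x (hypothetical helper, for illustration)."""
    if not (xbar[0] <= x <= xbar[-1]):
        raise ValueError("x lies outside the breakpoint range")
    # index of the segment [xbar[s], xbar[s+1]] that contains x
    s = min(int(np.searchsorted(xbar, x, side="right")) - 1, len(xbar) - 2)
    t = (x - xbar[s]) / (xbar[s + 1] - xbar[s])  # position inside the segment
    lam = np.zeros(len(xbar))
    lam[s], lam[s + 1] = 1.0 - t, t
    return lam

lam = sos2_weights(4.5)
y = lam @ ybar   # interpolated value y = f(4.5)
```

The result agrees with np.interp(4.5, xbar, ybar), which is the same interpolation done outside of any model.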
Not all solvers and modeling tools support SOS2 variables. Here is another formulation using binary variables:
Here s is the segment index: s=1,2,3. This is sometimes called the incremental formulation.
These are just two formulations. There are many others. Note that some solvers and modeling tools have special constructs to express piecewise linear functions. But all these share the idea of providing a collection of breakpoints.
This is very different from what you did. But this is what we typically do to model piecewise linear functions in Mixed-Integer Programming models.
A well-written reference is: H. Paul Williams, Model Building in Mathematical Programming, Wiley. You are encouraged to consult this practical book: it is very good.
I have the following matrix
I have transformed this to a strictly diagonally dominant matrix and applied the Gauss-Seidel and successive over-relaxation (SOR) methods, with omega=1.1 and a tolerance of epsilon=1e-4, with the convergence formula as below
By solving this in Python manually (not using a linear algebra library) I found that both methods take the same number of iterations (6). As per my understanding, if Gauss-Seidel converges for the matrix and 1<omega<2 for SOR, then the SOR method should take fewer iterations, which is not happening.
So, is my understanding correct? Is it mandatory for the SOR method to take fewer iterations?
This is actually a question I had myself as I was trying to solve the same problem. Here I will include my results from the 6th iteration of both the GS and SOR methods and give my opinion on why this is the case. For both, the initial vector is x = (0,0,0,0). Practically speaking, we see that the L infinity norm is different for each method (see below).
For Gauss-Seidel:
The solution vector in iteration 6 is:
[[ 1.0001]
[ 2. ]
[-1. ]
[ 1. ]]
The L infinity norm in iteration 6 is: [4.1458e-05]
For SOR:
The solution vector in iteration 6 is:
[[ 1.0002]
[ 2.0001]
[-1.0001]
[ 1. ]]
The L infinity norm in iteration 6 is: [7.8879e-05]
Academically speaking "SOR can provide a convenient means to speed up both the Jacobian and Gauss-Seidel methods of solving the our linear system. The parameter ω is referred to as the relaxation parameter. Clearly for ω = 1 we restore the original equations. If ω < 1 we talk of under-relaxation, and this can be important for some systems which will not converge under normal Jacobian relaxation. If ω > 1, we have over-relaxation, with which we will be more concerned. It was discovered during the years of hand computation that convergence is faster if we go beyond the Gauss-Seidel correction. Roughly speaking, those approximations stay on the same side of the solution x. An overrelaxation factor ω moves us closer to the solution. With ω = 1, we recover Gauss-Seidel; with ω > 1, the method is known as SOR. The optimal choice of ω never exceeds 2. It is often in the neighborhood of 1.9."
For more information on the ω you can also refer to Strang, G., 2006 page 410 of the book "Linear Algebra and its applications" as well as to the paper A rapid finite difference algorithm, utilizing successive over‐relaxation to solve the Poisson–Boltzmann equation.
Based on the academic description above I believe that both of these methods take 6 iterations because 1.1 is not the optimal ω value. Changing ω to a value closer to the optimum could yield a better result, as the whole point of over-relaxation is to find this optimal ω. (My belief again is that 1.1 is not the optimal omega, and I will update you once I do the calculation.) The image is from Strang, G., 2006 "Linear algebra and its applications" 4th edition page 411.
Edit: Indeed by running a graphical representation of omega - iterations in SOR it seems that my optimal omega is in the range of 1.0300 to 1.0440, and the whole range of these omegas gives me five iterations, which is a more efficient way than pure Gauss-Seidel at omega = 1 that gives 6 iterations.
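The omega scan described above can be reproduced with a small script. The sketch below is not the code from the question: the matrix A and right-hand side b are a hypothetical strictly diagonally dominant system chosen so that the solution is (1, 2, -1, 1), and the stopping rule (L infinity norm of the change between two sweeps below the tolerance) is an assumption, since the original convergence formula is not shown.

```python
import numpy as np

# Hypothetical test system with exact solution (1, 2, -1, 1)
A = np.array([[10., -1.,  2.,  0.],
              [-1., 11., -1.,  3.],
              [ 2., -1., 10., -1.],
              [ 0.,  3., -1.,  8.]])
b = np.array([6., 25., -11., 15.])

def sor(A, b, omega=1.0, tol=1e-4, max_iter=1000):
    """SOR iteration starting from x = 0; omega = 1 is plain Gauss-Seidel.
    Returns the solution estimate and the number of sweeps used."""
    n = len(b)
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        x_old = x.copy()
        for i in range(n):
            # already-updated entries to the left, old entries to the right
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            gauss_seidel = (b[i] - sigma) / A[i, i]
            x[i] = (1.0 - omega) * x_old[i] + omega * gauss_seidel
        if np.linalg.norm(x - x_old, np.inf) < tol:  # assumed stopping rule
            return x, k
    raise RuntimeError("SOR did not converge within max_iter sweeps")

# Number of sweeps for a few relaxation parameters
for omega in (1.0, 1.05, 1.1, 1.5):
    x, sweeps = sor(A, b, omega)
    print(f"omega={omega:.2f}  sweeps={sweeps}")
```

Scanning omega on a finer grid in the same way shows where the iteration count is smallest for a given system and tolerance.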
I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
I need actually exactly a random number generator function for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo sampling methods.
The simplest, and maybe the easiest to understand, variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution in the starting point p(x) and in the new one p(xnew)
if the new point is more probable, i.e. p(xnew)/p(x) >= 1, accept the move
if the new point is less probable, randomly decide whether to accept or reject it, depending on how probable [1] the new point is
take a new step from this point and repeat the cycle
It can be shown, see e.g. Sokal [2], that points sampled with this method follow the target probability distribution p.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    # symmetric proposal: uniform step of at most delta around x
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1  # start somewhere
    for i in range(nsamples):
        trial = proposal(x)  # random neighbour from the proposal distribution
        acceptance = p(trial) / p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain in which to sample your random steps [3]. Note that the domain argument used below is not part of the toy sampler shown above.
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher but still look at low probability regions as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Too small steps might constrain you to a limited area of your distribution, too big could lead to a very inefficient exploration.
[2] Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke university.
[3] Implementation not shown so as not to add too much confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases the arguments are optional, as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and feed it values drawn from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this.
def exponential():
    # inverse CDF of Exp(1): F^-1(u) = -log(1 - u), with u in [0, 1)
    return -np.log(1.0 - np.random.uniform())
If you encounter distributions whose CDF is not easy to compute, then consider filippo's great answer.
I've implemented the mutual information formula in python using pandas and numpy
import numpy as np
import pandas as pd

def mutual_info(p):
    p_x = p.sum(axis=1)  # marginal indexed by the row labels of p
    p_y = p.sum(axis=0)  # marginal indexed by the column labels of p
    I = 0.0
    for i_y in p.index:
        for i_x in p.columns:
            I += p.loc[i_y, i_x] * np.log2(p.loc[i_y, i_x] / (p_x[i_y] * p_y[i_x]))
    return I
However, if a cell in p has zero probability, then np.log2(p.loc[i_y, i_x] / (p_x[i_y] * p_y[i_x])) is negative infinity, and the whole expression is multiplied by zero and returns NaN.
What is the right way to work around that?
For various theoretical and practical reasons (e.g., see Competitive Distribution Estimation: Why is Good-Turing Good), you might consider never using a zero probability with the log loss measure.
So, say, if you have a probability vector p, then, for some small scalar α > 0, you would use α u + (1 - α) p, where u is the uniform probability vector. Unfortunately, there are no general guidelines for choosing α, and you'll have to assess this further down the calculation.
For the Kullback-Leibler divergence, you would of course apply this to each of the inputs.
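A minimal sketch of this smoothing applied to the mutual information calculation, using plain NumPy arrays instead of a DataFrame. The helper names smooth and mutual_info_bits, the example table and the value of alpha are assumptions made for illustration; as said above, there is no general rule for choosing alpha.

```python
import numpy as np

def smooth(p, alpha=1e-6):
    """Mix the joint distribution p with the uniform distribution so that
    no cell is exactly zero (alpha is an arbitrary illustrative choice)."""
    p = np.asarray(p, dtype=float)
    uniform = np.full(p.shape, 1.0 / p.size)
    return alpha * uniform + (1.0 - alpha) * p

def mutual_info_bits(p_xy):
    """Mutual information in bits of a joint table without zero cells."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # row marginal as a column vector
    p_y = p_xy.sum(axis=0, keepdims=True)   # column marginal as a row vector
    return float(np.sum(p_xy * np.log2(p_xy / (p_x * p_y))))

p = np.array([[0.5, 0.0],
              [0.0, 0.5]])                   # contains zero-probability cells
mi = mutual_info_bits(smooth(p))             # finite, slightly below 1 bit
```

The smoothed table still sums to one, so the marginals remain valid probability vectors.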
I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23 and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference.
I'm doing something simple, similar to the following:
from numpy import vstack
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters = 2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
data elements looks something like [22, 418, 192] where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to redefine your distance measure and so use a different library. But what is more important, the concept of the mean used in k-means won't suit the cyclic dimension. Let's consider the following example:
# current cluster X, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results.
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k-means + cyclic distance measure results in a degenerate super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2},{y1,y2}
Solving this problem requires checking, for each point, whether it is better to use the plain value or to represent the point as x'=x-24 before averaging. Unfortunately, for n points this gives 2^n possibilities.
This seems like a use case for kernelized k-means, where you are actually clustering in the abstract feature space (in your case, a "tube" rolled around the time dimension) induced by a kernel (a "similarity measure", being the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
1. the reassignment step
2. the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi) which may be okay for SSQ.
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
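The sin/cos transformation from the second option can be sketched as follows. The helper name encode_hour is an assumption for illustration; the two resulting columns would replace the raw hour column before the data is passed to k-means.

```python
import numpy as np

def encode_hour(hour):
    """Map hour-of-day (0..24) onto the unit circle as (sin, cos) columns."""
    angle = np.asarray(hour, dtype=float) / 12.0 * np.pi
    return np.column_stack([np.sin(angle), np.cos(angle)])

pts = encode_hour([0, 23, 12])
d_0_23 = np.linalg.norm(pts[0] - pts[1])   # hours 0 and 23: neighbours on the circle
d_0_12 = np.linalg.norm(pts[0] - pts[2])   # hours 0 and 12: opposite sides
print(d_0_23 < d_0_12)
```

Because the other features (such as 418 and 192 in the question's example row) live on a very different scale, they would probably need rescaling before the Euclidean distance treats all dimensions comparably.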
The easiest approach, to me, is to adapt the K-means algorithm to the wraparound dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5