I am trying to generate an M by N matrix, where N is the number of multivariate normal random variables and M is the number of samples, such that each column has unit variance and the correlation between any two columns is Y.
I have tried
M = 100
N = 3
Y = 0.5
mean = (0,0,0)
cov = np.array([[0.5,0.5,0.5],[0.5,0.5,0.5],[0.5,0.5,0.5]])
np.random.multivariate_normal(mean, cov, (M,N))
It returns an np array consisting of M arrays, where each array consists of N values, and the values are all the same.
Can anyone advise how to generate an M by N matrix such that each column has unit variance and the correlation between any two columns is Y, where N is the number of standard multivariate normal random variables?
So it turns out that I am almost entirely wrong. The reason why your values are not deviating is that you have a correlation of 1 between your variables: the off-diagonal covariance of 0.5 divided by the product of the standard deviations (sqrt(0.5) * sqrt(0.5) = 0.5) gives a correlation of exactly 1. As a result, the values do not vary independently.
How can I create a random matrix, like a transition matrix, where the values in each column sum to 1, in Python?
I suggest completing this in two steps:
Create a random matrix
Normalize each column
1. Create a random matrix
Let's say you want a 3 by 3 random transition matrix:
M = np.random.rand(3, 3)
Each of M's entries will have a random value between 0 and 1.
2. Normalize M's columns
Dividing each column by its column sum will achieve what you want. This can be done in several ways, but I prefer to create an array r whose elements are the column sums of M:
r = M.sum(axis=0)
Then, divide M by r:
transition_matrix = M / r
Example output
>>> import numpy as np
>>> M = np.random.rand(3, 3)
>>> r = M.sum(axis=0)
>>> transition_matrix = M / r
>>> M
array([[0.74145687, 0.68389986, 0.37008102],
[0.81869654, 0.0394523 , 0.94880781],
[0.93057194, 0.48279246, 0.15581823]])
>>> r
array([2.49072535, 1.20614462, 1.47470706])
>>> transition_matrix
array([[0.29768713, 0.56701315, 0.25095223],
[0.32869804, 0.03270943, 0.64338731],
[0.37361483, 0.40027743, 0.10566046]])
>>> transition_matrix.sum(axis=0)
array([1., 1., 1.])
You could use a known distribution whose samples sum to one by construction, e.g. the Dirichlet distribution.
After that, the code is basically a one-liner (Python 3.8, Windows 10 x64):
import numpy as np
N = 3
# set the alpha parameters, all 1s here
a = np.ones(N)
mtx = np.random.dirichlet(a, N).transpose()
print(mtx)
and it will print something like
[[0.56634637 0.04568052 0.79105779]
[0.42542107 0.81892862 0.02465906]
[0.00823256 0.13539087 0.18428315]]
UPDATE
For the case of "sample something and normalize", the problem is that one gets values from an unknown distribution. For the Dirichlet there are closed-form expressions for the mean, standard deviation, PDF, CDF, you name it.
Even in the simple case where each Xi is sampled from U(0,1), what would be the distribution of Xi/Sum(i, Xi)?
Anything to say about its mean? Standard deviation? PDF? Other statistical properties?
You could sample from the exponential distribution and get the sum normalized to 1, but the question would be even more acute: if Xi is Exp(1), what is the distribution of Xi/Sum(i, Xi)? Its PDF? Mean? Standard deviation?
I have a problem when I generate random matrices and turn them into positive semidefinite ones. If we take a random matrix Q and multiply it by its transpose, the outcome should be positive semidefinite (no negative eigenvalues). However, when I print some random eigenvalues, I see that I have negative ones. Does my code have something wrong? Also, I want to save the maximum eigenvalues into a vector. My Q matrices are integer-valued, and from what I've seen the eigenvalues are real numbers whose complex part is always 0. However, I get a warning as well. Let me show you my code first:
import numpy as np
import torch
from numpy import linalg as lg

# here I create a random torch tensor with N matrices of n by n size
Q = torch.randint(0, 10, size=(N, n, n))
# here I initialize my vector for the max eigenvalues
max = np.zeros(N)
# here I loop over my tensor and multiply each Q matrix with its transpose
for i in range(0, N):
    Q[i] = Q[i]*Q[i].t()
    # here I find my eigenvalues and save the max into my vector max
    val, vec = lg.eig(Q[i])
    max[i] = np.amax(val)
The warning I get is
ComplexWarning: Casting complex values to real discards the imaginary part
and I print the eigenvalues from the console with the command
lg.eig(Q[0])
However, I see, for example:
(array([120.20198423+0.j, -1.93985888+0.j, 34.73787466+0.j])
which has a negative value.
Your example is a random integer matrix, but your warning message implies Q contains complex values.
As such, in order to create a positive semidefinite matrix you must multiply Q by its conjugate transpose torch.conj(Q).t() (this is equivalent to the transpose across the reals).
Also, you are computing the element-wise product by using *; for matrix multiplication use @, torch.mm, or torch.matmul.
As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement this using numpy and Python?
If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is very, very small (assuming N >= d).
Let A = [B | x], where A is an N x d matrix, B is an N x (d-1) matrix with linearly independent column vectors, and x is a column vector with N elements. The set of all x with no constraints is a space of dimension N, while the set of all x that are NOT linearly independent of the column vectors in B is their span, a subspace of dimension d-1 (every column vector of B serves as a basis vector for this set). Since d-1 < N, that subspace has measure zero, so a randomly drawn x almost surely lies outside it.
Since you are dealing with bounded, discrete numbers (machine doubles, floats, or integers), the probability of the matrix not having linearly independent columns is not exactly zero, but the more possible values each element can take, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you choose the elements at random, e.g. with np.random.rand(N, d). You can always verify after the fact that the matrix has linearly independent column vectors by computing its column echelon form (or its rank).
One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:
import numpy as np

N, d = 1000, 200
M = np.random.rand(N, 1)
r = 1  # current matrix rank
while r < d:
    t = np.random.rand(N, 1)
    if np.linalg.matrix_rank(np.hstack([M, t])) > r:
        M = np.hstack([M, t])
        r += 1
However, this process is quite slow, since it requires computing the rank of a matrix at least d times.
A faster approach would be to generate a random N x d 2d-array and check its rank:
M = np.random.rand(N, d)
r = np.linalg.matrix_rank(M)
while r < d:
    M = np.random.rand(N, d)
    r = np.linalg.matrix_rank(M)
This is likely to never enter the while loop; the check is there so that, in the unlikely event of a rank deficiency, another random 2d-array is generated.
You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring uncorrelated columns is to use the principal component scores. A brief explanation from the Wikipedia article:
Repeating this process yields an orthogonal basis in which different
individual dimensions of the data are uncorrelated. These basis
vectors are called principal components, and several related
procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

N = 50
d = 20
a = np.random.normal(0, 1, (N, d))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(np.corrcoef(np.transpose(a)), ax=ax[0], cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)), ax=ax[1], cmap="YlGnBu")
The left heatmap (raw data) shows you can still have some degree of correlation by chance (drawing from a standard normal, but with a small sample size), while the right heatmap (PC scores) is diagonal, i.e. the scores are uncorrelated.
I have a m x n rectangular matrix A for which n > m. Given the rank r <= m of A, the reduced QR decomposition yields matrix Q with m x r dimensions, and R with r x n dimensions. The columns of Q are an orthonormal basis for the range of A. R will be upper triangular but in a staircase pattern. Columns in R with a pivot correspond to independent columns in A.
When I apply the qr function from numpy.linalg (there is also a version of this function in scipy.linalg, which seems to be the same), it returns a matrix Q with m x m dimensions and R with m x n dimensions, even when the rank of matrix A is less than m. This seems to be the "full" QR decomposition, for which the columns of Q are an orthonormal basis for R^m. Is it possible to identify the independent columns of A through the R matrix returned by the qr function in numpy.linalg / scipy.linalg?
Check for diagonal elements of R that are non-zero:
import numpy as np
min_tol = 1e-9
A = np.array([[1,2,3],[4,3,2],[1,1,1]])
print("Matrix rank of: {}".format(np.linalg.matrix_rank(A)))
Q,R = np.linalg.qr(A)
indep = np.where(np.abs(R.diagonal()) > min_tol)[0]
print(A[:, indep])
print("Independent columns are: {}".format(indep))
see also here:
How to find degenerate rows/columns in a covariance matrix
Suppose I have two vectors of length 25, and I want to compute their covariance matrix. I try doing this with numpy.cov, but always end up with a 2x2 matrix.
>>> import numpy as np
>>> x=np.random.normal(size=25)
>>> y=np.random.normal(size=25)
>>> np.cov(x,y)
array([[ 0.77568388, 0.15568432],
[ 0.15568432, 0.73839014]])
Using the rowvar flag doesn't help either - I get exactly the same result.
>>> np.cov(x,y,rowvar=0)
array([[ 0.77568388, 0.15568432],
[ 0.15568432, 0.73839014]])
How can I get the 25x25 covariance matrix?
You have two vectors, not 25. The computer I'm on doesn't have python so I can't test this, but try:
z = list(zip(x, y))  # list() needed in Python 3, where zip returns an iterator
np.cov(z)
Of course.... really what you want is probably more like:
n = 100        # number of points in each vector
num_vects = 25
vals = []
for _ in range(num_vects):
    vals.append(np.random.normal(size=n))
np.cov(vals)
This takes the covariance (I think/hope) of num_vects 1xn vectors
Try this:
import numpy as np
x=np.random.normal(size=25)
y=np.random.normal(size=25)
z = np.vstack((x, y))
c = np.cov(z.T)
Covariance matrix from sample vectors
To clarify the small confusion regarding what a covariance matrix computed from two N-dimensional vectors is, there are two possibilities.
The question you have to ask yourself is whether you consider:
each vector as N realizations/samples of one single variable (for example two 3-dimensional vectors [X1,X2,X3] and [Y1,Y2,Y3], where you have 3 realizations for the variables X and Y respectively)
each vector as 1 realization for N variables (for example two 3-dimensional vectors [X1,Y1,Z1] and [X2,Y2,Z2], where you have 1 realization for the variables X,Y and Z per vector)
Since a covariance matrix is intuitively defined as a variance based on two different variables:
in the first case, you have 2 variables, N example values for each, so you end up with a 2x2 matrix where the covariances are computed thanks to N samples per variable
in the second case, you have N variables, 2 samples for each, so you end up with a NxN matrix
About the actual question, using numpy
if you consider that you have 25 variables per vector (the code below uses 3 instead of 25 to simplify the example), i.e. one realization of several variables in one vector, use rowvar=0:
import numpy

# [X1,Y1,Z1]
X_realization1 = [1,2,3]
# [X2,Y2,Z2]
X_realization2 = [2,1,8]
numpy.cov([X_realization1, X_realization2], rowvar=0)  # rowvar false, each column is a variable
Code returns, considering 3 variables:
array([[ 0.5, -0.5, 2.5],
[-0.5, 0.5, -2.5],
[ 2.5, -2.5, 12.5]])
otherwise, if you consider that one vector is 25 samples of one variable, use rowvar=1 (numpy's default):
# [X1,X2,X3]
X = [1,2,3]
# [Y1,Y2,Y3]
Y = [2,1,8]
numpy.cov([X,Y],rowvar=1) # rowvar true (default), each row is a variable
Code returns, considering 2 variables:
array([[ 1. , 3. ],
[ 3. , 14.33333333]])
Reading the documentation, as in
>>> print(np.cov.__doc__)
or looking at Numpy Covariance: Numpy treats each row of the array as a separate variable, so you have two variables and hence you get a 2 x 2 covariance matrix.
I think the previous post has the right solution, and I have the explanation :-)
I suppose what you're looking for is actually a covariance function, which is a time-lag function. I compute the autocovariance like this:
import numpy as np

def autocovariance(Xi, N, k):
    Xs = np.average(Xi)
    aCov = 0.0
    for i in np.arange(0, N - k):
        aCov = (Xi[i + k] - Xs) * (Xi[i] - Xs) + aCov
    return (1. / N) * aCov

# usage, for each lag k:
autocov[k] = autocovariance(My_wector, N, k)
You should change
np.cov(x,y, rowvar=0)
to
np.cov((x,y), rowvar=0)
What you got (2 by 2) is more useful than 25 x 25: the covariance of x and y is the off-diagonal entry of the symmetric covariance matrix.
If you insist on 25 by 25, which I think is useless, then why don't you write out the definition?
x = np.random.normal(size=25).reshape(25, 1)  # reshape to make it a 2d array
y = np.random.normal(size=25).reshape(25, 1)
cov = np.matmul(x - np.mean(x), (y - np.mean(y)).T) / len(x)
As pointed out above, you only have two vectors so you'll only get a 2x2 cov matrix.
IIRC the 2 main-diagonal terms will be sum((x - mean(x))**2) / (n-1), and similarly for y.
The 2 off-diagonal terms will be sum((x - mean(x)) * (y - mean(y))) / (n-1); n = 25 in this case.
According to the documentation, you should expect the variable vectors in columns:
If we examine N-dimensional samples, X = [x1, x2, ..., xn]^T
though later it says that each row is a variable:
Each row of m represents a variable.
so you need to input your matrix transposed:
x=np.random.normal(size=25)
y=np.random.normal(size=25)
X = np.array([x,y])
np.cov(X.T)
and according to Wikipedia (https://en.wikipedia.org/wiki/Covariance_matrix), X is a column vector of variables,
X = [X1, X2, ..., Xn]^T
COV = E[X * X^T] - μx * μx^T,  where μx = E[X]
you can implement it yourself:
# each row of X is an observation; each column is a variable
X = X - X.mean(axis=0)
h, w = X.shape
COV = X.T @ X / (h - 1)
I don't think you understand the definition of a covariance matrix.
If you need a 25 x 25 covariance matrix, you need 25 vectors, each with n data points.