Random matrix with sum of values by column = 1 in Python

Create a matrix like a transition matrix.
How can I create a random matrix whose columns each sum to 1 in Python?

I suggest completing this in two steps:
Create a random matrix
Normalize each column
1. Create random matrix
Let's say you want a 3 by 3 random transition matrix:
M = np.random.rand(3, 3)
Each of M's entries will have a random value between 0 and 1.
2. Normalize M's columns
Dividing each column by its column sum will achieve what you want. This can be done in several ways, but I prefer to create an array r whose elements are the column sums of M:
r = M.sum(axis=0)
Then, divide M by r (broadcasting divides each column element-wise by its sum):
transition_matrix = M / r
Example output
>>> import numpy as np
>>> M = np.random.rand(3, 3)
>>> r = M.sum(axis=0)
>>> transition_matrix = M / r
>>> M
array([[0.74145687, 0.68389986, 0.37008102],
       [0.81869654, 0.0394523 , 0.94880781],
       [0.93057194, 0.48279246, 0.15581823]])
>>> r
array([2.49072535, 1.20614462, 1.47470706])
>>> transition_matrix
array([[0.29768713, 0.56701315, 0.25095223],
       [0.32869804, 0.03270943, 0.64338731],
       [0.37361483, 0.40027743, 0.10566046]])
>>> transition_matrix.sum(axis=0)
array([1., 1., 1.])
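For completeness, the two steps can be fused into a single expression; this is the same approach, just condensed:
import numpy as np
M = np.random.rand(3, 3)
transition_matrix = M / M.sum(axis=0)  # broadcasting divides each column by its sum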

You could use a KNOWN distribution whose samples sum to one by default, e.g. the Dirichlet distribution.
With that, the code is basically a one-liner (Python 3.8, Windows 10 x64):
import numpy as np
N = 3
# set alphas array, 1s by default
a = np.ones(N)
mtx = np.random.dirichlet(a, N).transpose()  # each Dirichlet draw sums to 1; transpose so columns sum to 1
print(mtx)
and it will print something like
[[0.56634637 0.04568052 0.79105779]
 [0.42542107 0.81892862 0.02465906]
 [0.00823256 0.13539087 0.18428315]]
UPDATE
For the "sample something and normalize" case, the problem is that one gets values from an unknown distribution. For the Dirichlet distribution there are closed-form expressions for the mean, std.dev, PDF, CDF, you name it.
Even for the case where each Xi is sampled from U(0,1), what would the distribution of Xi/Sum(i, Xi) be?
Anything to say about the mean? Std.dev? PDF? Other statistical properties?
You could sample from an exponential distribution and get the sum normalized to 1, but
the question would be even more acute: if Xi is Exp(1), what is the distribution of Xi/Sum(i, Xi)? PDF? Mean? Std.dev?
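To back up the claim that the Dirichlet's moments are known in closed form, here is a quick sanity check; the mean and variance formulas in the comment are the standard Dirichlet ones, added for illustration:
import numpy as np
N = 3
a = np.ones(N)
a0 = a.sum()
samples = np.random.dirichlet(a, 100000)
# closed-form Dirichlet moments: mean_i = a_i/a0, var_i = a_i*(a0 - a_i)/(a0**2*(a0 + 1))
print(samples.mean(axis=0), a / a0)                            # both ~ [1/3 1/3 1/3]
print(samples.var(axis=0), a * (a0 - a) / (a0**2 * (a0 + 1)))  # both ~ 0.0556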

Related

Generate M by N matrix

I am trying to generate an M by N matrix, where N is the number of multivariate normal random variables and M is the number of samples. I want each column to have unit variance and the correlation between any two columns to be Y.
I have tried
M = 100
N = 3
Y = 0.5
mean = (0,0,0)
cov = np.array([[0.5,0.5,0.5],[0.5,0.5,0.5],[0.5,0.5,0.5]])
np.random.multivariate_normal(mean, cov, (M,N))
It returns an np array consisting of M arrays, where each ith array consists of N values, and they are all the same.
Can anyone advise how to generate an M by N matrix such that each column has unit variance and the correlation between any two columns is Y, where N is the number of standard multivariate normal random variables?
So it turns out that I am almost entirely wrong. The reason why your values are not deviating is that you have a correlation value of 1 between your variables: every entry of your covariance matrix is 0.5, so the off-diagonal covariances equal the variances, which means the correlation is 1. As a result, the values do not vary.
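A minimal sketch of the fix (not from the original answer): for unit variance, the diagonal of the covariance matrix must be 1, with Y off the diagonal; also note that passing size=M (rather than (M, N)) yields the desired M x N shape:
import numpy as np
M, N, Y = 100, 3, 0.5
cov = Y * np.ones((N, N)) + (1 - Y) * np.eye(N)  # 1 on the diagonal, Y elsewhere
samples = np.random.multivariate_normal(np.zeros(N), cov, size=M)
print(samples.shape)           # (100, 3)
print(np.corrcoef(samples.T))  # off-diagonal entries near 0.5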

Generating linearly independent columns for a matrix

As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement the same using numpy and python?
If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is vanishingly small (assuming N >= d).
Let A = [B | x], where A is an N x d matrix, B is an N x (d-1) matrix with independent column vectors, and x is a column vector with N elements. The set of all x with no constraints is a subspace of dimension N, while the set of all x that are linearly dependent on the column vectors of B is a subspace of dimension d-1 (it is the column space of B, whose columns serve as its basis vectors).
Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not being linearly independent will not be exactly zero. The more possible values each element can take, in general, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you choose the elements at random. You can always verify after the fact that the matrix has linearly independent column vectors by calculating its column-echelon form. You could generate the matrix with np.random.rand(N, d).
One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:
import numpy as np
N, d = 1000, 200
M = np.random.rand(N, 1)
r = 1  # current matrix rank
while r < d:
    t = np.random.rand(N, 1)
    if np.linalg.matrix_rank(np.hstack([M, t])) > r:
        M = np.hstack([M, t])
        r += 1
However, this process is quite slow, since it requires computing the rank of a matrix at least d times.
A faster approach would be to generate a random Nxd 2d-array and check its rank:
M = np.random.rand(N, d)
r = np.linalg.matrix_rank(M)
while r < d:
    M = np.random.rand(N, d)
    r = np.linalg.matrix_rank(M)
This is likely to never enter the while loop; the check is there so that, in the unlikely event the matrix is rank-deficient, another random 2d-array is generated.
You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring uncorrelated columns is to use the principal component scores. A brief explanation from the wiki:
Repeating this process yields an orthogonal basis in which different
individual dimensions of the data are uncorrelated. These basis
vectors are called principal components, and several related
procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

N = 50
d = 20
a = np.random.normal(0, 1, (N, d))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(np.corrcoef(np.transpose(a)), ax=ax[0], cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)), ax=ax[1], cmap="YlGnBu")
The heatmap on the left shows that you can still have some degree of correlation by chance (drawing from a standard normal, but with a small sample size), while the principal component scores on the right are uncorrelated.

Numpy - row-wise normalization

I've been working on a matrix normalization problem, stated as:
Given a matrix M, normalize its elements such that each element is divided by the corresponding column sum, if the element is not 0.
cwsums = np.sum(class_matrix,axis=1)
cwsums = np.reciprocal(cwsums.astype(np.float32))
cwsums[cwsums == np.inf] = 0
## this is the problem
final_matrix = np.multiply(final_matrix, cwsums)
I can construct a reciprocal mask, which I would like to apply across the matrix as an elementwise product, yet I can't seem to get it right. Thank you!
(Addressing the edited question.) It looks like you meant to take the column sums, i.e. axis=0:
i = 1 / class_matrix.sum(axis=0)
i[~np.isfinite(i)] = 0
class_matrix *= i
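A quick check with a hypothetical class_matrix containing an all-zero column (the errstate context just silences the expected divide-by-zero warning):
import numpy as np
class_matrix = np.array([[1., 2., 0.],
                         [3., 4., 0.]])
with np.errstate(divide='ignore'):
    i = 1 / class_matrix.sum(axis=0)
i[~np.isfinite(i)] = 0
class_matrix *= i
print(class_matrix)
# [[0.25       0.33333333 0.        ]
#  [0.75       0.66666667 0.        ]]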

Interpolate each row in matrix of x values

I want to interpolate between values in each row of a matrix (x-values), given a fixed vector of y-values. I am using Python, and essentially I need something like scipy.interpolate.interp1d, but with the x values being a matrix input. I implemented this with a loop, but I want to make the operation as fast as possible.
Edit
Below is an example of the code I am using right now; note that my real matrix has far more rows, on the order of millions:
import numpy as np
x = np.linspace(0, 1, 100).reshape(10, 10)
results = np.zeros(10)
for i in range(10):
    results[i] = np.interp(0.1, x[i], range(10))
As @Joe Kington suggested, you can use map_coordinates:
import numpy as np
import scipy.ndimage as nd
# your data - make sure is float/double
X = np.arange(100).reshape(10,10).astype(float)
# the points where you want to interpolate each row
y = np.random.rand(10) * (X.shape[1]-1)
# the rows at which you want the data interpolated -- all rows
r = np.arange(X.shape[0])
result = nd.map_coordinates(X, [r, y], order=1, mode='nearest')
The above, for the following y:
array([ 8.00091648,  0.46124587,  7.03994936,  1.26307275,  1.51068952,
        5.2981205 ,  7.43509764,  7.15198457,  5.43442468,  0.79034372])
Note, each value indicates the column position at which the value is interpolated for each row.
Gives the following result:
array([  8.00091648,  10.46124587,  27.03994936,  31.26307275,
        41.51068952,  55.2981205 ,  67.43509764,  77.15198457,
        85.43442468,  90.79034372])
which makes sense considering the nature of the arange'd data and the column positions (y) at which it is interpolated.

numpy covariance matrix

Suppose I have two vectors of length 25, and I want to compute their covariance matrix. I try doing this with numpy.cov, but always end up with a 2x2 matrix.
>>> import numpy as np
>>> x=np.random.normal(size=25)
>>> y=np.random.normal(size=25)
>>> np.cov(x,y)
array([[ 0.77568388,  0.15568432],
       [ 0.15568432,  0.73839014]])
Using the rowvar flag doesn't help either - I get exactly the same result.
>>> np.cov(x,y,rowvar=0)
array([[ 0.77568388,  0.15568432],
       [ 0.15568432,  0.73839014]])
How can I get the 25x25 covariance matrix?
You have two vectors, not 25. The computer I'm on doesn't have Python so I can't test this, but try:
z = list(zip(x, y))  # list() matters in Python 3, where zip returns an iterator
np.cov(z)
Of course.... really what you want is probably more like:
n = 100        # number of points in each vector
num_vects = 25
vals = []
for _ in range(num_vects):
    vals.append(np.random.normal(size=n))
np.cov(vals)
This takes the covariance (I think/hope) of num_vects 1xn vectors
Try this:
import numpy as np
x = np.random.normal(size=25)
y = np.random.normal(size=25)
z = np.vstack((x, y))
c = np.cov(z.T)  # z.T is 25x2; np.cov treats each of its 25 rows as a variable, giving 25x25
Covariance matrix from sample vectors
To clarify the small confusion regarding what is a covariance matrix defined using two N-dimensional vectors, there are two possibilities.
The question you have to ask yourself is whether you consider:
each vector as N realizations/samples of one single variable (for example two 3-dimensional vectors [X1,X2,X3] and [Y1,Y2,Y3], where you have 3 realizations for the variables X and Y respectively)
each vector as 1 realization for N variables (for example two 3-dimensional vectors [X1,Y1,Z1] and [X2,Y2,Z2], where you have 1 realization for the variables X,Y and Z per vector)
Since a covariance matrix is, intuitively, variance generalized to pairs of variables:
in the first case, you have 2 variables with N sample values each, so you end up with a 2x2 matrix where the covariances are computed from N samples per variable
in the second case, you have N variables with 2 samples each, so you end up with an NxN matrix
About the actual question, using numpy
if you consider that you have 25 variables per vector (the example code uses 3 instead of 25 to keep it simple), i.e. one realization of several variables per vector, use rowvar=0:
import numpy
# [X1,Y1,Z1]
X_realization1 = [1,2,3]
# [X2,Y2,Z2]
X_realization2 = [2,1,8]
numpy.cov([X_realization1, X_realization2], rowvar=0)  # rowvar false, each column is a variable
Code returns, considering 3 variables:
array([[ 0.5, -0.5,  2.5],
       [-0.5,  0.5, -2.5],
       [ 2.5, -2.5, 12.5]])
otherwise, if you consider that one vector is 25 samples for one variable, use rowvar=1 (numpy's default parameter)
# [X1,X2,X3]
X = [1,2,3]
# [Y1,Y2,Y3]
Y = [2,1,8]
numpy.cov([X,Y],rowvar=1) # rowvar true (default), each row is a variable
Code returns, considering 2 variables:
array([[ 1.        ,  3.        ],
       [ 3.        , 14.33333333]])
Reading the documentation, as in
>>> np.cov.__doc__
or looking at Numpy Covariance, numpy treats each row of the array as a separate variable, so you have two variables and hence you get a 2 x 2 covariance matrix.
I think the previous post has the right solution. I have the explanation :-)
I suppose what you're looking for is actually a covariance function, which is a function of the time lag. I do autocovariance like this:
def autocovariance(Xi, N, k):
    Xs = np.average(Xi)
    aCov = 0.0
    for i in np.arange(0, N - k):
        aCov += (Xi[i + k] - Xs) * (Xi[i] - Xs)
    return (1. / N) * aCov

# for each lag h:
autocov[h] = autocovariance(my_vector, N, h)
You should change
np.cov(x, y, rowvar=0)
to
np.cov((x, y), rowvar=0)
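The difference is easy to check (shapes shown in the comments): with two separate 1-D arguments, numpy keeps each one as a row variable, whereas the stacked 2x25 array is transposed by rowvar=0 so its 25 columns become the variables:
import numpy as np
x = np.random.normal(size=25)
y = np.random.normal(size=25)
print(np.cov(x, y, rowvar=0).shape)    # (2, 2)
print(np.cov((x, y), rowvar=0).shape)  # (25, 25)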
What you got (2 by 2) is more useful than 25x25. The covariance of X and Y is an off-diagonal entry in the symmetric covariance matrix.
If you insist on (25 by 25), which I think is useless, then why don't you write out the definition?
x = np.random.normal(size=25).reshape(25, 1)  # reshape to a 2d array
y = np.random.normal(size=25).reshape(25, 1)
cov = np.matmul(x - np.mean(x), (y - np.mean(y)).T) / len(x)  # outer product, 25x25
As pointed out above, you only have two vectors, so you'll only get a 2x2 cov matrix.
IIRC the 2 main diagonal terms will be sum((x - mean(x))**2) / (n-1), and similarly for y.
The 2 off-diagonal terms will be sum((x - mean(x)) * (y - mean(y))) / (n-1). n = 25 in this case.
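A quick numerical check of those formulas against np.cov (ddof=1 in numpy matches the n-1 denominator used above):
import numpy as np
x = np.random.normal(size=25)
y = np.random.normal(size=25)
n = len(x)
var_x = np.sum((x - x.mean())**2) / (n - 1)
var_y = np.sum((y - y.mean())**2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
print(np.allclose(np.cov(x, y), [[var_x, cov_xy], [cov_xy, var_y]]))  # True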
According to the documentation, you should expect the variable vectors in columns:
If we examine N-dimensional samples, X = [x1, x2, ..., xn]^T
though later it says each row is a variable:
Each row of m represents a variable.
So you need to input your matrix transposed:
x=np.random.normal(size=25)
y=np.random.normal(size=25)
X = np.array([x,y])
np.cov(X.T)
and according to Wikipedia: https://en.wikipedia.org/wiki/Covariance_matrix
X is a column vector of variables:
X = [X1, X2, ..., Xn]^T
COV = E[X * X^T] - μx * μx^T, where μx = E[X]
you can implement it yourself:
# rows of X are observations, columns are variables
X = X - X.mean(axis=0)
h, w = X.shape
COV = X.T @ X / (h - 1)
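A minimal check of that implementation against np.cov, assuming rows are observations:
import numpy as np
X = np.random.normal(size=(25, 2))  # 25 observations of 2 variables
Xc = X - X.mean(axis=0)
COV = Xc.T @ Xc / (X.shape[0] - 1)
print(np.allclose(COV, np.cov(X, rowvar=False)))  # True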
I don't think you understand the definition of a covariance matrix. If you need a 25 x 25 covariance matrix, you need 25 vectors, each with n data points.
