Creating a matrix of random data in Python

I am trying to create a matrix in python that is 30 × 10 and has randomly generated numbers inside of it. But my numbers in the matrix have to follow the condition:
Randomly generate 30 data points from the sine function, where each data point (x,y) has the form
x = [x^0, x^1, x^2, ..., x^10], x ∈ [0, 2π]
y = sin(x) + ε, ε ∈ N(0,0.3)
How might I be able to go about this?
Right now I only have a 1 × 10 matrix
import numpy as np

def generate_sin_data():
    x = np.random.rand()  # a single uniform value in [0, 1), not [0, 2*pi]
    y = np.sin(x)
    features = [x**0, x**1, x**2, x**3, x**4, x**5, x**6, x**7, x**8, x**9, x**10]
    return x, y, features

I'm not 100% certain I follow everything, but we can break it down. Here's how you can generate 30 random numbers between 0 and 2π:
import numpy as np
x = np.random.random(30) * 2*np.pi
Here, x is a 1D array of 30 numbers. Check this with x.shape.
Now if you add a dimension, it's easy to generate a matrix of powers up to 10 using NumPy's broadcasting feature. The question seems to ask for 11 numbers (0 to 10) not 10, so I'll do that:
X = x.reshape(-1, 1) ** np.arange(0, 11)
That reshape effectively turns x into a column vector. Now check X.shape and it's (30, 11), which is what I think you were after. Notice we use a big X for a matrix — this convention will help you keep track of things. Each column of X is the original function raised to a power from 0 to 10. (Note that each column comes from the same set of random numbers — I'm not sure if that's what you want?)
If you want y as a function of x (the vector), note that the spec says ε ∈ N(0, 0.3), i.e. Gaussian noise, so draw it with np.random.normal rather than scaling a uniform sample:
eps = np.random.normal(0, 0.3, 30)  # mean 0, scale 0.3 (as std; use sqrt(0.3) if 0.3 is the variance)
y = np.sin(x) + eps
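Putting those pieces together, a minimal end-to-end sketch (assuming Gaussian noise with scale 0.3 and the 11 polynomial features discussed above; the helper name mirrors the question's):
import numpy as np

def generate_sin_data(n=30, degree=10, noise_scale=0.3):
    # n uniform points on [0, 2*pi)
    x = np.random.random(n) * 2 * np.pi
    # (n, degree+1) design matrix of powers x**0 .. x**degree
    X = x.reshape(-1, 1) ** np.arange(degree + 1)
    # noisy sine targets
    y = np.sin(x) + np.random.normal(0, noise_scale, n)
    return X, y

X, y = generate_sin_data()
print(X.shape, y.shape)  # (30, 11) (30,)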

import numpy as np

# 30 random uniform values in [0, 2*pi)
_x = np.random.uniform(0, 2*np.pi, 30)

# 30x10 matrix of powers x**0 .. x**9 (use range(11) if you want x**0 .. x**10 as in the question body)
x = np.array([
    [v ** i for i in range(10)]
    for v in _x
])

# 30x10 matrix of N(0, 0.3) noise
eps = np.random.normal(0, 0.3, [30, 10])

# final 30x10 matrix
y = np.sin(x) + eps

Related

Generating N random unit vectors with their sum equal to 0 (Python)

I'd like to generate N random 3-dimensional vectors (uniformly) on the unit sphere, with the condition that their sum equals 0. My attempt was to generate N/2 random unit vectors and let the other half be the same vectors negated. The problem is that I'm trying to achieve as little correlation as possible, and this idea is obviously not ideal, since half of my vectors are perfectly anti-correlated with their corresponding pair.
Your problem does not really have a solution, but you can generate a set of vectors that are going to be slightly less visibly correlated than your original solution of negating them. To be precise, if you generate N / 2 vectors and negate them, then rotate the negated vectors about their sum by any angle, you can guarantee that the sum will be zero and the correlation will be a more complicated rotation than a negative identity matrix.
import numpy as np
from scipy.spatial.transform import Rotation

N = 10  # assumed even
# normally distributed points, normalized, are uniform on the sphere
v1 = np.random.normal(size=(N // 2, 3))
v1 /= np.linalg.norm(v1, axis=1, keepdims=True)
# rotate the negated half about the axis through their sum, by a random angle
axis = v1.sum(0)
angle = np.random.uniform(0, 2.0 * np.pi)
rot = Rotation.from_rotvec(angle * axis / np.linalg.norm(axis))
v2 = rot.apply(-v1)
result = np.concatenate((v1, v2), axis=0)
This assumes that N is even in all cases. The normal distribution is a fairly standard method to generate points uniformly on a sphere: https://mathworld.wolfram.com/SpherePointPicking.html.
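As a quick sanity check on the code above: every row should have unit norm, and the columns should sum to zero up to floating-point error.
# all vectors are unit length, and the total sum is ~0
assert np.allclose(np.linalg.norm(result, axis=1), 1.0)
print(result.sum(axis=0))  # ~[0, 0, 0]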
If you had some leeway from the sum being exactly zero, you could align two random sets of N / 2 vectors so that their sums point opposite each other.
In this code, I tried to generate vectors selected from a sphere by converting a theta, phi to x, y, z.
import numpy as np

def vectorize(theta, phi):
    x = np.cos(phi) * np.cos(theta)
    y = np.cos(phi) * np.sin(theta)
    z = np.sin(phi)
    return np.array([x, y, z])

theta_range = np.arange(0, 2 * np.pi, 0.01)
phi_range = np.arange(-np.pi / 2, np.pi / 2, 0.01)
TH, PI = np.meshgrid(theta_range, phi_range)
whole_map = np.vstack((TH.flatten(), PI.flatten())).T
# Number of vectors:
N = 100
# Selecting N/2 Vectors first at random
v_selected = np.random.choice(range(whole_map.shape[0]), N // 2)
vectors = np.array([vectorize(whole_map[ind][0], whole_map[ind][1]) for ind in v_selected])
# Double the count by appending the negation of each vector
vectors = np.vstack((vectors, -vectors))
print(vectors.sum(axis=0))
# array([1.94289029e-16, 1.17961196e-16, 1.11022302e-16])
# Almost zero; not exactly zero due to floating-point precision
(Scatter plot of the points generated on the sphere with radius 1 omitted.)

How to take advantage of vectorization when computing the pdf for a multivariate gaussian?

I've been spending a few hours googling about this problem and it seems I can't find any information.
I tried coding a multivariate gaussian pdf as:
def multivariate_normal(X, M, S):
    # X has shape (D, N): D dimensions, N observations
    # M is the mean vector with shape (D, 1)
    # S is the covariance matrix with shape (D, D)
    D = S.shape[0]
    S_inv = np.linalg.inv(S)
    logdet = np.log(np.linalg.det(S))
    log2pi = np.log(2*np.pi)
    devs = X - M
    # log density: -D/2*log(2*pi) - 1/2*log|S| - 1/2 * dev' S^-1 dev
    a = np.array([-D/2 * log2pi - (1/2) * logdet - (1/2) * dev.T @ S_inv @ dev
                  for dev in devs.T])
    return np.exp(a)
I've only been successful in computing the pdf through a for loop, iterating N times. If I don't, I end up with an (N, N) matrix, which is unhelpful. I've found another post here, but it is quite outdated and in MATLAB.
Is there any way to take advantage of numpy's vectorisation?
This is my first post on Stack Overflow, let me know if anything is off!
I came across this problem in a similar manner and here's how I solved it:
Variables:
X = numpy.ndarray[numpy.ndarray[float]] - m x n
MU = numpy.ndarray[numpy.ndarray[float]] - k x n
SIGMA = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] - k x n x n
k = int
Where X is my feature vector, MU is my means, SIGMA is my covariance matrix.
To vectorize, I rewrote the quadratic form using the definition of the dot product (shown here for a single component mu, sigma of shapes (n,) and (n, n)):
sigma_det = np.linalg.det(sigma)
sigma_inv = np.linalg.inv(sigma)
const = 1/((2*np.pi)**(n/2)*sigma_det**(1/2))
p = const*np.exp((-1/2)*np.sum((X-mu).dot(sigma_inv)*(X-mu), axis=1))
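For a self-contained illustration of the same idea, here is a small sketch (the names and shapes here are illustrative, not from the original post):
import numpy as np

rng = np.random.default_rng(0)
n = 3                          # dimensionality
X = rng.normal(size=(500, n))  # 500 observations, one per row
mu = np.zeros(n)
sigma = np.eye(n)

sigma_det = np.linalg.det(sigma)
sigma_inv = np.linalg.inv(sigma)
const = 1 / ((2*np.pi)**(n/2) * sigma_det**0.5)
# row-wise quadratic form: elementwise product followed by a row sum
p = const * np.exp(-0.5 * np.sum((X - mu) @ sigma_inv * (X - mu), axis=1))
print(p.shape)  # (500,)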
I have been working on this problem for the last few days and finally have come to a solution.
To do so I have added an extra dimension to the x vector, and then used the np.einsum() function for computing the Mahalanobis distance.
Example
For the following example we will use a (100 x 2) input array. That is, 100 samples of two random variables. That gives us a (1 x 2) mean vector and a (2 x 2) covariance matrix.
Generating some data:
import numpy as np

# instantiate a random number generator
rng = np.random.default_rng(100)

# define mu and sigma for the dummy sample
mu = np.array([0.5, 0.25])
covmat = np.array([[1, 0.5],
                   [0.5, 1]])

# generate a multivariate normal random sample
x = rng.multivariate_normal(mu, covmat, size=100)
And defining the pdf function:
def pdf(x, mu, covmat):
    """
    Evaluates the probability density function N(mu, covmat)
    at each row of x.
    Returns: the probabilities, shape (n, 1)
    """
    x = x[:, np.newaxis]  # insert an axis: (n, d) -> (n, 1, d)
    k = mu.shape[0]       # number of dimensions
    diff = x - mu         # deviation of x from the mean
    inv_covmat = np.linalg.inv(covmat)
    # note: det(inv_covmat)**0.5 == det(covmat)**-0.5
    term1 = (2*np.pi)**(-k/2) * np.linalg.det(inv_covmat)**0.5
    term2 = np.exp(-np.einsum('ijk, kl, ijl->ij', diff, inv_covmat, diff) / 2)
    return term1 * term2
This returns an (n, 1) array, where n is the number of samples; in this case (100, 1).
Explanation
The easiest way to think about solving the problem is just writing down the dimensions, and trying to do the linear algebra.
We need to do some kind of manipulation of three tensors with the following shapes, to get the resulting tensor:
A, B, C -> D
(100 x 1 x 2), (2 x 2), (100 x 1 x 2) -> (100 x 1)
Let the first tensor, A, have the indices ijk.
Then we want some operation on A and B that yields the shape (100 x 1 x 2).
Hence,
ijk, kl - > ijl
(100 x 1 x 2), (2 x 2) -> (100 x 1 x 2)
This leaves us with AB, C
(100 x 1 x 2), (100 x 1 x 2)
We want D to have the shape (100 x 1)
Hence:
ijl, ijl->ij
(100 x 1 x 2), (100 x 1 x 2) -> (100 x 1)
Putting the two operations together, we get:
ijk, kl, ijl->ij
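To convince yourself the einsum version is correct, you can compare it with scipy.stats.multivariate_normal (a quick check, assuming scipy is available):
from scipy.stats import multivariate_normal

ref = multivariate_normal(mean=mu, cov=covmat).pdf(x)
ours = pdf(x, mu, covmat).ravel()
print(np.allclose(ours, ref))  # True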

Create random database and convert it from numpy to pandas

I would like to create a random database. In the database I want coordinates so that in the end I can plot it; that is, each point is supposed to have an X and a Y coordinate.
I have created data for one set of points, but it is in numpy and I want it in pandas, and I keep getting errors.
This is how I have created it:
#database 1
# defining the mean
mu = 0.5
# defining the standard deviation
sigma = 0.1
# The random module uses the seed value as a base
# to generate a random number. If seed value is not
# present, it takes the system’s current time.
np.random.seed(0)
# define the x co-ordinates
X = np.random.normal(mu, sigma, (395, 1))
# define the y co-ordinates
Y = np.random.normal(mu * 2, sigma * 3, (395, 1))
index=[X,Y]
##here I get all the errors
df = pd.DataFrame({'X': X, 'Y': Y}, index=index)
The error I received:
Exception: Data must be 1-dimensional
I have also tried other methods to make it a dataframe, but they didn't work, and I believe it is something tiny that I'm missing.
My end goal is to create a dataframe from those arrays.
The way you are calling np.random.normal is creating arrays of shape (395, 1). That means that you are creating an array that contains 395 arrays of 1 element.
Example:
array([[0.67640523],
[0.54001572],
[0.5978738 ],
[0.72408932],
[0.6867558 ],
[0.40227221],..])
This is what is breaking the pd.DataFrame call. To solve it, pass the shape argument as 395 (or the 1-tuple (395,)) to create a one-dimensional array.
#database 1
# defining the mean
mu = 0.5
# defining the standard deviation
sigma = 0.1
# The random module uses the seed value as a base
# to generate a random number. If seed value is not
# present, it takes the system’s current time.
np.random.seed(0)
# define the x co-ordinates
X = np.random.normal(mu, sigma, (395))
# define the y co-ordinates
Y = np.random.normal(mu * 2, sigma * 3, (395))
index = [X, Y]
# this now works (X and Y are 1-dimensional)
df = pd.DataFrame({'X': X, 'Y': Y}, index=index)
I would also suggest you to remove the line index=[X,Y] and the index parameter while calling pd.DataFrame as it doesn't make any sense to me. You are setting as index the same values that you have at X and Y. The final code would be something like this:
#database 1
# defining the mean
mu = 0.5
# defining the standard deviation
sigma = 0.1
# The random module uses the seed value as a base
# to generate a random number. If seed value is not
# present, it takes the system’s current time.
np.random.seed(0)
# define the x co-ordinates
X = np.random.normal(mu, sigma, 395)
print(X.shape)
# define the y co-ordinates
Y = np.random.normal(mu * 2, sigma * 3, 395)
print(Y.shape)
# this now works
df = pd.DataFrame({'X': X, 'Y': Y})
You should replace
X = np.random.normal(mu, sigma, (395, 1)) with X = np.random.normal(mu, sigma, 395) and Y = np.random.normal(mu * 2, sigma * 3, (395, 1)) with Y = np.random.normal(mu * 2, sigma * 3, 395).
In this way X and Y will be 1-dimensional: in fact let's check array shapes:
np.random.normal(mu, sigma, (395, 1)).shape
# (395, 1) -> a 2-dimensional array
np.random.normal(mu, sigma, 395).shape
# (395,)   -> a 1-dimensional array
Is this the plot you want?
df = pd.DataFrame(list({'X': X, 'Y': Y}.items()))
df.explode(1).apply(lambda x: x[1][0], axis=1).plot()
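Alternatively, if you'd rather keep the (395, 1) shapes, you can flatten the arrays when building the frame; a small sketch:
import numpy as np
import pandas as pd

np.random.seed(0)
X = np.random.normal(0.5, 0.1, (395, 1))
Y = np.random.normal(1.0, 0.3, (395, 1))

# ravel() flattens (395, 1) -> (395,), which pandas accepts as a column
df = pd.DataFrame({'X': X.ravel(), 'Y': Y.ravel()})
df.plot.scatter(x='X', y='Y')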

Numpy find covariance of two 2-dimensional ndarray

I am new to numpy and am stuck at this problem.
I have two 2-dimensional numpy arrays, such as
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
I want to use numpy cov function to find covariance of these two ndarrays row wise. i.e., for above example the output array should consist of 10 elements each denoting the covariance of corresponding rows of the ndarrays. I know I can do this by traversing the rows and finding the covariance of two 1D arrays but it isn't pythonic.
Edit 1: By the covariance of the two arrays I mean the element at index [0, 1] of the np.cov output.
Edit 2: Currently this is my implementation:
s = numpy.empty((x.shape[0], 1))
for i in range(x.shape[0]):
    s[i] = numpy.cov(x[i], y[i])[0][1]
Use the definition of the covariance: E(XY) - E(X)E(Y).
import numpy as np
x = np.random.random((10, 5))
y = np.random.random((10, 5))
n = x.shape[1]
cov_bias = np.mean(x * y, axis=1) - np.mean(x, axis=1) * np.mean(y, axis=1)
cov_unbiased = cov_bias * n / (n - 1)
Note that cov_bias corresponds to the result of numpy.cov(bias=True).
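A quick check against the loop from the question, under the same setup:
# compare with the row-wise np.cov loop
s = np.array([np.cov(x[i], y[i])[0, 1] for i in range(x.shape[0])])
print(np.allclose(cov_unbiased, s))  # True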
Here's one using the definition of covariance, inspired by corr2_coeff_rowwise:
def covariance_rowwise(A, B):
    # Row-wise mean of input arrays, subtracted from the arrays themselves
    A_mA = A - A.mean(-1, keepdims=True)
    B_mB = B - B.mean(-1, keepdims=True)
    # Finally get the covariance
    N = A.shape[1]
    return np.einsum('ij,ij->i', A_mA, B_mB) / (N - 1)
Sample run -
In [66]: np.random.seed(0)
...: x = np.random.random((10, 5))
...: y = np.random.random((10, 5))
In [67]: s = np.empty((x.shape[0]))
...: for i in range(x.shape[0]):
...: s[i] = np.cov(x[i], y[i])[0][1]
In [68]: np.allclose(covariance_rowwise(x,y),s)
Out[68]: True
This works, but I'm not sure if it is faster for larger matrices x and y: the call numpy.cov(x, y) computes many entries that we then discard with numpy.diag:
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
# with loop
for (xi, yi) in zip(x, y):
print(numpy.cov(xi, yi)[0][1])
# vectorized
cov_mat = numpy.cov(x, y)
covariances = numpy.diag(cov_mat, x.shape[0])
print(covariances)
I also did some timing for square matrices of size n x n:
import time
import numpy
def run(n):
    x = numpy.random.random((n, n))
    y = numpy.random.random((n, n))

    started = time.time()
    for (xi, yi) in zip(x, y):
        numpy.cov(xi, yi)[0][1]
    needed_loop = time.time() - started

    started = time.time()
    cov_mat = numpy.cov(x, y)
    covariances = numpy.diag(cov_mat, x.shape[0])
    needed_vectorized = time.time() - started

    print(
        f"n={n:4d} needed_loop={needed_loop:.3f} s "
        f"needed_vectorized={needed_vectorized:.3f} s"
    )

for n in (100, 200, 500, 600, 700, 1000, 2000, 3000):
    run(n)
The output on my slow MacBook Air is:
n= 100 needed_loop=0.006 s needed_vectorized=0.001 s
n= 200 needed_loop=0.011 s needed_vectorized=0.003 s
n= 500 needed_loop=0.033 s needed_vectorized=0.023 s
n= 600 needed_loop=0.041 s needed_vectorized=0.039 s
n= 700 needed_loop=0.043 s needed_vectorized=0.049 s
n=1000 needed_loop=0.061 s needed_vectorized=0.130 s
n=2000 needed_loop=0.137 s needed_vectorized=0.742 s
n=3000 needed_loop=0.224 s needed_vectorized=2.264 s
So the break-even point is around n = 600.
Pick the diagonal vector of cov(x,y) and expand dims:
numpy.expand_dims(numpy.diag(numpy.cov(x,y),x.shape[0]),1)
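This produces the same (10, 1) shape as the preallocated s in the question; a quick sketch to confirm:
out = numpy.expand_dims(numpy.diag(numpy.cov(x, y), x.shape[0]), 1)
print(out.shape)  # (10, 1), matching the question's s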

numpy - evaluate function on a grid of points

What is a good way to produce a numpy array containing the values of a function evaluated on an n-dimensional grid of points?
For example, suppose I want to evaluate the function defined by
def func(x, y):
return <some function of x and y>
Suppose I want to evaluate it on a two dimensional array of points with the x values going from 0 to 4 in ten steps, and the y values going from -1 to 1 in twenty steps. What's a good way to do this in numpy?
P.S. This has been asked in various forms on StackOverflow many times, but I couldn't find a concisely stated question and answer. I posted this to provide a concise simple solution (below).
A shorter, faster and clearer answer, avoiding meshgrid:
import numpy as np

def func(x, y):
    return np.sin(y * x)

xaxis = np.linspace(0, 4, 10)
yaxis = np.linspace(-1, 1, 20)
result = func(xaxis[:, None], yaxis[None, :])
This will also be faster and lighter on memory if your function is something like x**2 + y, since then x**2 is computed on a 1D array (instead of a 2D one), and the increase in dimension only happens when the "+" is applied. With meshgrid, x**2 is computed on a 2D array in which essentially every row is the same, causing a massive increase in time.
Edit: x[:, None] turns x into a 2D array with an empty second dimension. This None is the same as using x[:, numpy.newaxis]. The same is done with y, but with an empty first dimension.
Edit: in 3 dimensions:
def func2(x, y, z):
    return np.sin(y * x) + z

xaxis = np.linspace(0, 4, 10)
yaxis = np.linspace(-1, 1, 20)
zaxis = np.linspace(0, 1, 20)
result2 = func2(xaxis[:, None, None], yaxis[None, :, None], zaxis[None, None, :])
This way you can easily extend to n dimensions, using as many None or : as you have dimensions. Each : keeps a full dimension, and each None makes an "empty" one. The next example shows how these empty dimensions work: the shape changes when you use None, showing the object is 3D, but the empty dimensions only get filled up when you multiply with an object that actually has extent in those dimensions.
In [1]: import numpy
In [2]: a = numpy.linspace(-1,1,20)
In [3]: a.shape
Out[3]: (20,)
In [4]: a[None,:,None].shape
Out[4]: (1, 20, 1)
In [5]: b = a[None,:,None] # this is a 3D array, but with the first and third dimension being "empty"
In [6]: c = a[:,None,None] # same, but last two dimensions are "empty" here
In [7]: d=b*c
In [8]: d.shape # only the last dimension is "empty" here
Out[8]: (20, 20, 1)
Edit: without needing to type the None yourself:
def ndm(*args):
    return [x[(None,)*i + (slice(None),) + (None,)*(len(args)-i-1)]
            for i, x in enumerate(args)]

x2, y2, z2 = ndm(xaxis, yaxis, zaxis)
result3 = func2(x2, y2, z2)
This way, the None-slicing creates the extra empty dimensions: the first argument you give to ndm becomes the first full dimension, the second becomes the second full dimension, and so on. It does the same as the 'hardcoded' None-typed syntax used before.
Short explanation: doing x2, y2, z2 = ndm(xaxis, yaxis, zaxis) is the same as doing
x2 = xaxis[:,None,None]
y2 = yaxis[None,:,None]
z2 = zaxis[None,None,:]
but the ndm method should also work for more dimensions, without needing to hardcode the None-slices in multiple lines as just shown. This will also work in numpy versions before 1.8, while numpy.meshgrid only supports more than 2 dimensions from numpy 1.8 onward.
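For reference, newer NumPy can build the same sparse open grids directly; a short sketch using meshgrid's sparse option:
x2, y2, z2 = np.meshgrid(xaxis, yaxis, zaxis, indexing='ij', sparse=True)
print(x2.shape, y2.shape, z2.shape)  # (10, 1, 1) (1, 20, 1) (1, 1, 20)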
import numpy as np

def func(x, y):
    return np.sin(y * x)

xaxis = np.linspace(0, 4, 10)
yaxis = np.linspace(-1, 1, 20)
x, y = np.meshgrid(xaxis, yaxis)
result = func(x, y)
I use this function to get X, Y, Z values ready for plotting:
def npmap2d(fun, xs, ys, doPrint=False):
    Z = np.empty(len(xs) * len(ys))
    i = 0
    for y in ys:
        for x in xs:
            Z[i] = fun(x, y)
            if doPrint: print([i, x, y, Z[i]])
            i += 1
    X, Y = np.meshgrid(xs, ys)
    Z.shape = X.shape
    return X, Y, Z
Usage:
def f(x, y):
    ...  # some function that can't handle numpy arrays

X, Y, Z = npmap2d(f, np.linspace(0, 0.5, 21), np.linspace(0.6, 0.4, 41))
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X, Y, Z)
The same result can be achieved using map:
xs = np.linspace(0, 4, 10)
ys = np.linspace(-1, 1, 20)
X, Y = np.meshgrid(xs, ys)
Z = np.fromiter(map(f, X.ravel(), Y.ravel()), X.dtype).reshape(X.shape)
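np.vectorize offers similar convenience for scalar-only functions (note it is essentially a loop under the hood, a convenience rather than a speed-up):
Z = np.vectorize(f)(X, Y)  # same shape as X and Y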
In the case your function actually takes a tuple of d elements, i.e. f((x1,x2,x3,...xd)) (for example the scipy.stats.multivariate_normal function), and you want to evaluate f on N^d combinations/grid of N variables, you could also do the following (2D case):
x=np.arange(-1,1,0.2) # each variable is instantiated N=10 times
y=np.arange(-1,1,0.2)
Z=f(np.dstack(np.meshgrid(x,y))) # result is an NxN (10x10) matrix, whose entries are f((xi,yj))
Here np.dstack(np.meshgrid(x, y)) creates a 10x10 "matrix" (technically a 10x10x2 numpy array) whose entries are the 2-dimensional tuples to be evaluated by f.
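For example, with scipy.stats.multivariate_normal supplying the tuple-taking f (a quick sketch):
import numpy as np
from scipy.stats import multivariate_normal

f = multivariate_normal(mean=[0, 0], cov=np.eye(2)).pdf
x = np.arange(-1, 1, 0.2)
y = np.arange(-1, 1, 0.2)
Z = f(np.dstack(np.meshgrid(x, y)))
print(Z.shape)  # (10, 10)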
My two cents:
import numpy as np

x = np.linspace(0, 4, 10)
y = np.linspace(-1, 1, 20)
# note: sparse expects a boolean, not the string 'true'
X, Y = np.meshgrid(x, y, indexing='ij', sparse=True)

def func(x, y):
    # a function of x and y
    return x*y / (x**2 + y**2 + 4)

func(X, Y)
