I want to generate a dataset with m random data points of k dimensions each, resulting in an array of shape (m, k). The points should be i.i.d. samples from a normal distribution with mean 0 and standard deviation 1. There are two ways of generating these points.
First way:
import numpy as np
# Initialize the array
Z = np.zeros((m, k))
# Generate each coordinate of each point independently of the others
for datapoint in range(m):
    z = [np.random.standard_normal() for _ in range(k)]
    Z[datapoint] = z[:]
Second way:
import numpy as np
# Directly sample the points
Z = np.random.normal(0, 1, (m, k))
What I think is that the 2nd way gives a dataset whose points are not independent of each other, while the 1st way gives an i.i.d. dataset of points. Is this the difference between the two pieces of code?
My assumption would be that standard_normal is just normal with "standard" parameters (mean=0 and std=1).
Let's test that:
import numpy as np
rng0 = np.random.default_rng(43210)
rng1 = np.random.default_rng(43210)
print(rng0.standard_normal(10))
print(rng1.normal(0, 1, 10))
which gives:
[ 0.62824213 -1.18535536 -1.18141382 -0.74127753 -0.41945915 1.02656223 -0.64935657 1.70859865 0.47731614 -1.12700957]
[ 0.62824213 -1.18535536 -1.18141382 -0.74127753 -0.41945915 1.02656223 -0.64935657 1.70859865 0.47731614 -1.12700957]
So I think that assumption was correct.
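As one more sanity check on the independence concern, here is a small sketch (the seed 12345 and the 5 x 3 shape are arbitrary choices) comparing an element-by-element loop with a single vectorized call on identically seeded generators; since both consume the generator's stream one value at a time, they should produce the same numbers, i.e. the vectorized call is just as i.i.d. as the loop:
import numpy as np
m, k = 5, 3
rng_loop = np.random.default_rng(12345)
rng_vec = np.random.default_rng(12345)
# first way: one standard normal draw per element
Z_loop = np.array([[rng_loop.standard_normal() for _ in range(k)] for _ in range(m)])
# second way: a single vectorized draw of the whole (m, k) array
Z_vec = rng_vec.normal(0, 1, (m, k))
print(np.allclose(Z_loop, Z_vec))  # expected: True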
I have a 1-d numpy array which I would like to downsample with exponential spacing. Currently, I am using signal.resample(y, downsize) for a uniform re-sample. I am not sure if there is a quick way to do this exponentially instead.
from scipy import signal
# uniform resample example
x = np.arange(100)
y = np.sin(x)
linear_resample = signal.resample(y,15)
import numpy as np
np.random.seed(73)
# a random array of integers of length 100
arr_test = np.random.randint(300, size=100)
print(arr_test)
# let's divide the index range 0 to 100 in an exponential fashion
ls = np.logspace(0.00001, 2, num=100, endpoint=False, base=10.0).astype(np.int32)
print(ls)
# sample the array
arr_samp = arr_test[ls]
print(arr_samp)
I have used log base 10. You can change to the natural base if you want.
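One caveat with the above: casting the logspace values to integers produces many repeated indices at the small end, so some elements get sampled more than once. If you want each index at most once, you can deduplicate with np.unique first; a small sketch reusing the same arr_test and ls as above:
arr_samp_unique = arr_test[np.unique(ls)]
print(arr_samp_unique)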
I have used xarray to create two different DataArrays with the same dimensions and coordinates. However, I want to add values at two different coordinates along one of these dimensions: I'm trying to add the values at coordinate 'a' to the values at coordinate 'b' along dimension 'x'. There is an easy workaround if these are the only dimensions of my array, but it is more complicated if I have more dimensions and I want to keep the normal xarray behaviour for the other dimensions. Please see the example below, which fails on the last line. I know how to manually fix this in numpy, but the beauty of xarray is that I shouldn't have to.
Does xarray allow an easy solution for this kind of operation?
import xarray as xr
import numpy as np
# create simple DataArray M and N to show what I would like to do
M = xr.DataArray([1, 2], dims="x",coords={'x':['a','b']})
N = xr.DataArray([3, 4], dims="x",coords={'x':['a','b']})
print(M.sel(x='a')+N.sel(x='b')) # this will NOT give me the value
print(M.sel(x='a').values+N.sel(x='b').values) # this will give me the value
# create a more complex DataArray M and N to show what the challenge
m = np.arange(3*2*4)
m = m.reshape(3,2,4)
n = np.arange(4*2*3)
n = n.reshape(4,2,3)
M = xr.DataArray(m, dims=['z1',"x","z2"],coords={'x':['a','b']})
N = xr.DataArray(n, dims=["z2",'x','z1'],coords={'x':['a','b']})
print(M.sel(x='a')+N.sel(x='b')) # this will NOT give me the value
print(M.sel(x='a').values+N.sel(x='b').values) # this will result in an error
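One possible approach (a sketch, not necessarily the canonical xarray way): drop the conflicting scalar 'x' coordinate at selection time with drop=True, so that the remaining dimensions are still aligned and broadcast by name:
result = M.sel(x='a', drop=True) + N.sel(x='b', drop=True)  # uses M and N from the complex example above
print(result)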
As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement the same using numpy and python?
If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is very, very small (assuming N >= d).
Let A = [B | x], where A is an N x d matrix, B is an N x (d-1) matrix with linearly independent column vectors, and x is a column vector with N elements. The set of all possible x is the full space of dimension N, while the set of all x that are linearly dependent on the column vectors of B is exactly the column space of B, a subspace of dimension d-1 (the columns of B form a basis for it). Since d-1 < N, a randomly chosen x almost surely falls outside that subspace.
Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not being linearly independent will not be exactly zero. The more possible values each element can take, in general, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you choose the elements at random, for example with np.random.rand(N, d). You can always verify after the fact that the matrix has linearly independent column vectors by computing its column-echelon form (or its rank).
One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:
import numpy as np
N, d = 1000, 200
M = np.random.rand(N, 1)
r = 1  # current matrix rank
while r < d:
    t = np.random.rand(N, 1)
    if np.linalg.matrix_rank(np.hstack([M, t])) > r:
        M = np.hstack([M, t])
        r += 1
However, this process is quite slow, since it requires computing the rank of a matrix at least d times.
A faster approach would be to generate a random Nxd 2d-array and check its rank:
M = np.random.rand(N, d)
r = np.linalg.matrix_rank(M)
while r < d:
    M = np.random.rand(N, d)
    r = np.linalg.matrix_rank(M)
This is likely to never enter the while loop; the check is there so that, in the unlikely case the matrix is rank-deficient, another random 2d-array is generated.
You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring uncorrelated columns is to use the principal component scores. A brief explanation from Wikipedia:
Repeating this process yields an orthogonal basis in which different
individual dimensions of the data are uncorrelated. These basis
vectors are called principal components, and several related
procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
N = 50
d = 20
a = np.random.normal(0, 1, (N, d))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(np.corrcoef(np.transpose(a)), ax=ax[0], cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)), ax=ax[1], cmap="YlGnBu")
The heatmap of the raw data (left) shows that you can still have some degree of correlation by chance (drawing from a standard normal, but with a small sample size).
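A quick numerical check of this (a sketch reusing a and pc_scores from the code above): the largest off-diagonal correlation of the raw columns is noticeably nonzero by chance, while for the PC scores it is at floating-point level, i.e. uncorrelated by construction:
off_diag = lambda c: np.max(np.abs(c - np.diag(np.diag(c))))
print(off_diag(np.corrcoef(a.T)))          # noticeably nonzero, just by chance
print(off_diag(np.corrcoef(pc_scores.T)))  # essentially zero (floating-point level)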
So I need to write code that accomplishes the following:
Write a Python code that produces a variable op_table that is a numpy array with three
axes, i, j, and k. Define three arrays:
x_i ranges from 0 (included) to 9 (included) in steps of 1,
y_j ranges from 10 (included) to 11 (included) in 20 equal-size steps,
z_k ranges from 10 to 10^6 in five steps (i.e. with six entries total), where z_k = 10 * z_{k-1}.
Then create the final array op_table that satisfies:
op_table[i,j,k] = sin(x_i) * y_j + z_k
My question lies in how to initially set the values. I've only seen numpy arrays created in manners such as np.array([1,2,3,4]) or np.arange(10). Also, how is this set up? Is the first column the x-axis, the second the y-axis, and so forth?
import numpy as np
import math
xi = np.linspace(0, 9, num=10)
yj = np.linspace(10, 11, 20, endpoint=True)
zk = [10, 10**2, 10**3, 10**4, 10**5, 10**6]
# allocate the output array; every entry is overwritten below
op_table = np.zeros((10, 20, 6))
for i in range(0, 10):
    for j in range(0, 20):
        for k in range(0, 6):
            op_table[i, j, k] = math.sin(xi[i]) * yj[j] + zk[k]
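For what it's worth, the same table can also be built without explicit loops by broadcasting the three 1-d arrays against each other (a sketch; np.logspace is used here for zk, and the None indexing just inserts length-1 axes for broadcasting):
import numpy as np
xi = np.linspace(0, 9, num=10)
yj = np.linspace(10, 11, 20, endpoint=True)
zk = np.logspace(1, 6, num=6)  # 10, 100, ..., 10**6
# broadcast to shape (10, 20, 6): axis i from xi, axis j from yj, axis k from zk
op_table = np.sin(xi)[:, None, None] * yj[None, :, None] + zk[None, None, :]
print(op_table.shape)  # (10, 20, 6)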
I don't personally believe in spoon-feeding answers, but it looks like you've misinterpreted the problem. The problem doesn't actually require that you generate any matrix, except by solving the second equation. NumPy happens to have a very helpful function called linspace that does almost exactly this.
import numpy as np
xi = np.linspace(0, 9, 10)  # 0 to 9 inclusive, in steps of 1
yj = np.linspace(10, 11, 20)
Other than that, this seems to be a math problem, and this should get you 80% of the way to a solution. If you need help with the math, there's another stackexchange for that.
More np.linspace docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
Math stackexchange: https://math.stackexchange.com/
What's the easiest way to get the DFT matrix for 2-d DFT in python? I could not find such function in numpy.fft. Thanks!
The easiest and most likely the fastest method would be using fft from SciPy.
import numpy as np
from scipy.fft import fft
def dftmtx(N):
    return fft(np.eye(N))
If you know an even faster way (it might be more complicated) I'd appreciate your input.
Just to make it more relevant to the main question - you can also do it with numpy:
import numpy as np
dftmtx = np.fft.fft(np.eye(N))
When I benchmarked both of them, I had the impression that the SciPy one was marginally faster, but I have not done it thoroughly and it was some time ago, so don't take my word for it.
Here's a pretty good source on FFT implementations in Python:
http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/UnderstandingTheFFT.ipynb
It is written mostly from a speed perspective, but in this case we can see that speed sometimes comes with simplicity too.
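A quick way to convince yourself that either construction really is the DFT matrix (a sketch with an arbitrary size N=8 and a random test vector): multiplying it with a vector should reproduce np.fft.fft of that vector:
import numpy as np
N = 8
dftmtx = np.fft.fft(np.eye(N))
x = np.random.rand(N)
print(np.allclose(dftmtx @ x, np.fft.fft(x)))  # expected: True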
I don't think this is built in. However, direct calculation is straightforward:
import numpy as np
def DFT_matrix(N):
    i, j = np.meshgrid(np.arange(N), np.arange(N))
    omega = np.exp(-2 * np.pi * 1J / N)
    W = np.power(omega, i * j) / np.sqrt(N)
    return W
EDIT For a 2D FFT matrix, you can use the following:
x = np.zeros((N, N)) # x is any input data with those dimensions
W = DFT_matrix(N)
dft_of_x = W.dot(x).dot(W)
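As a sanity check (a sketch; because DFT_matrix includes the 1/sqrt(N) scaling, the product picks up a factor of 1/N relative to np.fft.fft2):
N = 8
x = np.random.rand(N, N)
W = DFT_matrix(N)  # the function defined above
print(np.allclose(W.dot(x).dot(W), np.fft.fft2(x) / N))  # expected: True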
As of scipy 0.14 there is a built-in scipy.linalg.dft:
Example with 16 point DFT matrix:
>>> import scipy.linalg
>>> import numpy as np
>>> m = scipy.linalg.dft(16)
Validate the unitary property; note the matrix is unscaled, thus the product equals 16*np.eye(16):
>>> np.allclose(np.abs(np.dot( m.conj().T, m )), 16*np.eye(16))
True
For the 2D DFT matrix, it's just a matter of a tensor product, or specifically a Kronecker product in this case, as we are dealing with matrix algebra.
>>> m2 = np.kron(m, m) # 256x256 matrix, flattened from (16,16,16,16) tensor
Now we can give it a tiled visualization; this is done by rearranging each row into a square block:
>>> import matplotlib.pyplot as plt
>>> m2tiled = m2.reshape((16,)*4).transpose(0,2,1,3).reshape((256,256))
>>> plt.subplot(121)
>>> plt.imshow(np.real(m2tiled), cmap='gray', interpolation='nearest')
>>> plt.subplot(122)
>>> plt.imshow(np.imag(m2tiled), cmap='gray', interpolation='nearest')
>>> plt.show()
Result (real and imag part separately):
As you can see, they are the 2D DFT basis functions.
Link to documentation
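Continuing the session above, we can also verify the Kronecker construction numerically (a sketch): applied to a flattened 16x16 input, m2 should reproduce np.fft.fft2, with row-major ravel on both sides:
>>> X = np.random.random((16, 16))
>>> np.allclose(np.dot(m2, X.ravel()), np.fft.fft2(X).ravel())
True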
@Alex is basically correct; I add here the version I used for the 2-d DFT:
import numpy as np
def DFT_matrix_2d(N):
    i, j = np.meshgrid(np.arange(N), np.arange(N))
    A = np.multiply.outer(i.flatten(), i.flatten())
    B = np.multiply.outer(j.flatten(), j.flatten())
    omega = np.exp(-2 * np.pi * 1J / N)
    W = np.power(omega, A + B) / N
    return W
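Again, a quick sanity check (a sketch; because of the 1/N factor inside DFT_matrix_2d, the result equals np.fft.fft2 divided by N):
N = 8
X = np.random.rand(N, N)
W = DFT_matrix_2d(N)  # the function defined above
print(np.allclose(W @ X.ravel(), np.fft.fft2(X).ravel() / N))  # expected: True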
Lambda functions work too:
dftmtx = lambda N: np.fft.fft(np.eye(N))
You can call it by using dftmtx(N). Example:
In [62]: dftmtx(2)
Out[62]:
array([[ 1.+0.j, 1.+0.j],
[ 1.+0.j, -1.+0.j]])
If you wish to compute the 2D DFT as a single matrix operation, it is necessary to unravel the matrix X on which you wish to compute the DFT into a vector, as each output of the DFT has a sum over every index in the input, and a single square matrix multiplication does not have this ability. Taking care to be sure we are handling the indices correctly, I find the following works:
import numpy as np
M = 16
N = 16
X = np.random.random((M, N)) + 1j * np.random.random((M, N))
Y = np.fft.fft2(X)
W = np.zeros((M * N, M * N), dtype=complex)
# enumerate the 2D indices in the same (row-major) order that ravel() uses
hold = []
for m in range(M):
    for n in range(N):
        hold.append((m, n))
for j in range(M * N):
    for i in range(M * N):
        k, l = hold[j]
        m, n = hold[i]
        W[j, i] = np.exp(-2 * np.pi * 1j * (m * k / M + n * l / N))
np.allclose(np.dot(W, X.ravel()), Y.ravel())
True
If you wish to change the normalization to the orthogonal convention, divide W by sqrt(M*N); if you wish to have the inverse transformation, just change the sign in the exponent (and divide by M*N to match np.fft.ifft2).
This might be a little late, but there is a better alternative for creating the DFT matrix that performs faster, using NumPy's vander. Also, this implementation does not use loops (explicitly).
import numpy as np
def dft_matrix(signal):
    N = signal.shape[0]  # number of samples
    w = np.exp((-2 * np.pi * 1j) / N)  # remove the '-' for the inverse Fourier transform
    r = np.arange(N)
    w_matrix = np.vander(w ** r, increasing=True)  # faster than meshgrid
    return w_matrix
If I'm not mistaken, the main improvement is that this method generates each power from the (already calculated) previous elements.
you can read about vander in the documentation:
numpy.vander
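As with the other constructions, a quick sanity check (a sketch with an arbitrary random signal of length 16): applying the returned matrix to the signal should match np.fft.fft:
import numpy as np
x = np.random.rand(16)
W = dft_matrix(x)  # the function defined above; the matrix size is taken from the signal length
print(np.allclose(W @ x, np.fft.fft(x)))  # expected: True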