A random normally distributed matrix in numpy - python

I would like to generate a matrix M, whose elements M(i,j) are from a standard normal distribution. One trivial way of doing it is,
import numpy as np
A = [ [np.random.normal() for i in range(3)] for j in range(3) ]
A = np.array(A)
print(A)
[[-0.12409887 0.86569787 -1.62461893]
[ 0.30234536 0.47554092 -1.41780764]
[ 0.44443707 -0.76518672 -1.40276347]]
But, I was playing around with numpy and came across another "solution":
import numpy as np
import numpy.matlib as npm
A = np.random.normal(npm.zeros((3, 3)), npm.ones((3, 3)))
print(A)
[[ 1.36542538 -0.40676747 0.51832243]
[ 1.94915748 -0.86427391 -0.47288974]
[ 1.9303462 -1.26666448 -0.50629403]]
I read the documentation for numpy.random.normal, and it doesn't clarify how the function behaves when a matrix is passed instead of a single value. I suspected that the second "solution" might be drawing from a multivariate normal distribution. But this can't be true, because both input arguments have the same dimensions (the covariance should be a matrix and the mean a vector). I am not sure what the second piece of code generates.

The intended way to do what you want is
A = np.random.normal(0, 1, (3, 3))
This is the optional size parameter that tells numpy what shape you want returned (3 by 3 in this case).
Your second way works too, because the documentation states
If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.
So there is no multivariate distribution and no correlation.
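As a quick check of the two calls above, here is a minimal sketch. The seed value and the variable names a, b and c are my own choices for illustration, not part of the question. It draws a matrix both ways and also shows that array-valued loc simply means "one independent draw per broadcast element":

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the run is repeatable

# Explicit size: nine independent draws from N(0, 1), arranged 3 by 3.
a = np.random.normal(0, 1, (3, 3))

# Array-valued loc and scale: one independent draw per broadcast element.
loc = np.zeros((3, 3))
scale = np.ones((3, 3))
b = np.random.normal(loc, scale)

print(a.shape, b.shape)

# Broadcasting also allows a different mean per column.
c = np.random.normal(np.array([0.0, 10.0, 100.0]), 1.0, size=(1000, 3))
print(c.mean(axis=0))  # column means close to 0, 10 and 100
```

Each entry is sampled independently, so neither call introduces correlation between elements.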

Related

a uniform data structure that can represent an ndarray with various size along a given axis

I can use the following code to generate three dimensional array.
import numpy as np
x1 = np.random.rand(8,9,10)
In some scenarios, the data set (or array) being studied has varying length along axis 0. In other words, one subset may have shape (8, 9, 10) and another subset may have shape (7, 9, 10). All of these subsets have the same size along the second and third axes. If I still want to represent the whole data set using a single data structure, how can I achieve this?
One solution would be to use awkward-array:
https://github.com/scikit-hep/awkward-1.0
>>> import numpy as np
>>> import awkward as ak
>>> numpy_arrays = [np.random.rand(8,9,10), np.random.rand(7,9,10)]
>>> irregular_array = ak.Array(numpy_arrays)
>>> irregular_array
<Array [[[ ... ]]] type='var * 9 * 10 * int64'>

Subtract Mean from Multidimensional Numpy-Array

I'm currently learning about broadcasting in NumPy, and in the book I'm reading (Python for Data Analysis by Wes McKinney) the author gives the following example to "demean" a two-dimensional array:
import numpy as np
arr = np.random.randn(4, 3)
print(arr.mean(0))
demeaned = arr - arr.mean(0)
print(demeaned)
print(demeaned.mean(0))
Which effectively causes the array demeaned to have a mean of 0.
I had the idea to apply this to an image-like, three-dimensional array:
import numpy as np
arr = np.random.randint(0, 256, (400,400,3))
demeaned = arr - arr.mean(2)
Which of course failed, because according to the broadcasting rule, the trailing dimensions have to match, and that's not the case here:
print(arr.shape) # (400, 400, 3)
print(arr.mean(2).shape) # (400, 400)
Now, I have mostly gotten it to work by subtracting the mean from every single index in the third dimension of the array:
means = arr.mean(2)  # per-pixel mean, shape (400, 400)
demeaned = np.ones(arr.shape)
for i in range(3):
    demeaned[..., i] = arr[..., i] - means
print(demeaned.mean(2))
At this point, the returned values are very close to zero, and I think that's a precision error. Am I right, or is there another caveat that I missed?
Also, this doesn't seem to be the cleanest, most "numpy" way to achieve what I wanted. Is there a function or a principle that I can use to improve the code?
As of numpy version 1.7.0, np.mean, and several other functions, accept a tuple in their axis parameter. This means that you can perform the operation on the planes of the image all at once:
m = arr.mean(axis=(0, 1))
This mean will have shape (3,), with one element for each plane of the image.
If you want to subtract the means of each pixel individually, you have to remember that broadcasting aligns shape tuples on the right edge. That means that you need to insert an extra dimension:
n = arr.mean(axis=2)
n = n.reshape(*n.shape, 1)
Or
n = arr.mean(axis=2)[..., None]
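A minimal sketch of both variants follows. It uses keepdims=True, which the answer above does not mention but which is a standard argument of ndarray.mean and has the same effect as inserting the extra dimension by hand. The names per_channel and per_pixel are my own.

```python
import numpy as np

arr = np.random.randint(0, 256, (400, 400, 3))

# Per-channel mean: shape (3,) broadcasts against (400, 400, 3) directly.
per_channel = arr - arr.mean(axis=(0, 1))

# Per-pixel mean: keepdims=True keeps the reduced axis with length 1,
# giving shape (400, 400, 1), which broadcasts without a reshape.
per_pixel = arr - arr.mean(axis=2, keepdims=True)

# The residual means are zero only up to floating-point rounding.
print(np.abs(per_channel.mean(axis=(0, 1))).max())
print(np.abs(per_pixel.mean(axis=2)).max())
```

The tiny nonzero residuals this prints are ordinary floating-point rounding, which is consistent with the "precision error" suspected in the question.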
Try np.apply_along_axis().
np.apply_along_axis(lambda x: x - np.mean(x), 2, arr)
Output: an array of the same shape as arr, with the mean subtracted along the axis you choose (the second argument, here 2).

Numpy random functions create inconsistent shapes for given argument

I found an odd behavior of numpy's random number generators.
It seems that they do not generate consistent matrix shapes for a given argument.
It is just super annoying to spend an extra line for conversion afterward which I'd like to circumvent.
How can I tell matlib.randn directly to generate a vector of size (200,)?
import numpy as np
import numpy.matlib  # required so that np.matlib is available
A = np.zeros((200,))
B = np.matlib.randn((200,))
print(A.shape) # prints (200,)
print(B.shape) # prints (1, 200)
Use numpy.random instead of numpy.matlib:
numpy.random.randn(200).shape # prints (200,)
numpy.random.randn can create any shape, whereas numpy.matlib.randn always creates a matrix.
B is a matrix object, not an ndarray. Matrix objects are always two-dimensional, so there is no 1-D equivalent, and their use is no longer recommended, so you should use np.random.randn instead.
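The difference can be sketched as follows. To keep the example independent of numpy.matlib, np.asmatrix is used here as a stand-in for what numpy.matlib.randn returns; that substitution is my assumption, not something from the answers above.

```python
import numpy as np

v = np.random.randn(200)  # plain 1-D ndarray
print(v.shape)            # (200,)

# A matrix is always 2-D; np.asmatrix stands in here for the object
# that numpy.matlib.randn would return.
m = np.asmatrix(np.random.randn(1, 200))
print(m.shape)            # (1, 200)

# If a matrix is already in hand, convert it to an ndarray and flatten it.
flat = np.asarray(m).ravel()
print(flat.shape)         # (200,)
```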

How to generate a number of random vectors starting from a given one

I have an array of values and would like to create a matrix from that, where each row is my starting point vector multiplied by a sample from a (normal) distribution.
The number of rows of this matrix will then depend on the number of samples I want.
%pylab
my_vec = array([1,2,3])
my_rand_vec = my_vec*randn(100)
The last command does not work, because the array shapes do not match.
I could use a for loop, but I am trying to take advantage of array operations.
Try this
my_rand_vec = my_vec[None,:]*randn(100)[:,None]
For small numbers I get for example
import numpy as np
my_vec = np.array([1,2,3])
my_rand_vec = my_vec[None,:]*np.random.randn(5)[:,None]
my_rand_vec
# array([[ 0.45422416, 0.90844831, 1.36267247],
# [-0.80639766, -1.61279531, -2.41919297],
# [ 0.34203295, 0.6840659 , 1.02609885],
# [-0.55246431, -1.10492863, -1.65739294],
# [-0.83023829, -1.66047658, -2.49071486]])
Your attempt my_vec*randn(100) does not work because * is element-wise multiplication, which requires the two arrays to have identical or broadcast-compatible shapes; (3,) and (100,) are neither.
What you have to do is add an extra dimension using [None,:] and [:,None] so that numpy's broadcasting works.
As a side note, I would recommend not using pylab. Instead, import modules explicitly with import ... as (for example, import numpy as np).
It is the outer product of vectors:
my_rand_vec = numpy.outer(randn(100), my_vec)
You can pass the dimensions of the array you require to numpy.random.randn:
my_rand_vec = my_vec*np.random.randn(100,3)
To multiply each vector by the same random number, you need to add an extra axis:
my_rand_vec = my_vec*np.random.randn(100)[:,np.newaxis]
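The approaches that scale each row by a single random number are equivalent. The sketch below checks this, reusing one set of random draws so the results can be compared. The seed and the names r, by_index, by_newaxis and by_outer are mine.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only so the run is repeatable
my_vec = np.array([1, 2, 3])
r = np.random.randn(100)  # one random factor per output row

by_index = my_vec[None, :] * r[:, None]   # (1, 3) * (100, 1) -> (100, 3)
by_newaxis = my_vec * r[:, np.newaxis]    # (3,) * (100, 1) -> (100, 3)
by_outer = np.outer(r, my_vec)            # outer product, also (100, 3)

print(by_index.shape)
print(np.allclose(by_index, by_outer), np.allclose(by_newaxis, by_outer))

# Every row is my_vec times a single scalar: dividing by my_vec recovers r.
print(np.allclose(by_index / my_vec, r[:, None]))
```

By contrast, my_vec*np.random.randn(100,3) draws a separate factor for every element, so its rows are generally not scalar multiples of my_vec.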

Uniform Random Numbers

I am trying to understand what this code does. I am going through some examples about numpy and plotting and I can't figure out what u and v are. I know u is an array of two arrays each with size 10000. What does v=u.max(axis=0) do? Is the max function being invoked part of the standard python library? When I plot the histogram I get a pdf defined by 2x as opposed to a normal uniform distribution.
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt
np.random.seed(123)
u=rand.uniform(0,1,[2,10000])
v=u.max(axis=0)
plt.figure()
plt.hist(v,100,density=True,color='blue')  # density replaces the removed normed argument
plt.ylim([0,2])
plt.show()
u.max(), or equivalently np.max(u), will give you the maximum value in the array - i.e. a single value. It's the Numpy function here, not part of the standard library. You often want to find the maximum value along a particular axis/dimension and that's what is happening here.
u has shape (2,10000), and u.max(axis=0) gives you the max along axis 0, returning an array with shape (10000,). If you did u.max(axis=1) you would get an array with shape (2,).
Simple illustration/example:
>>> a = np.array([[1,2],[3,4]])
>>> a
array([[1, 2],
       [3, 4]])
>>> a.max(axis=0)
array([3, 4])
>>> a.max(axis=1)
array([2, 4])
>>> a.max()
4
In the first three lines you load different modules (libraries that the rest of the code relies upon). You load numpy, which is a numerical library; numpy.random, which generates random numbers; and matplotlib, which provides plotting functions.
the rest is described here:
np.random.seed(123)
A computer does not really generate random numbers; rather, it picks numbers from a long, deterministic sequence (for a more correct explanation of how this is done, see http://en.wikipedia.org/wiki/Random_number_generation). In essence, if you want to reproduce the work with the same random numbers, the computer needs to know where in this sequence to start. This is what this line of code does: anybody else who runs the same piece of code with the same seed ends up with the same 'random' numbers.
u=rand.uniform(0,1,[2,10000])
This generates 10000 random numbers twice that are distributed between 0 and 1. This is uniform distribution so it is equally likely to get any point between 0 and 1. (Again more information can be found here: http://en.wikipedia.org/wiki/Uniform_distribution_(continuous) ). You are creating two arrays within an array. This can be checked by doing: len(u) and len(u[0]).
v=u.max(axis=0)
The u.max? command in IPython shows you the docs. It essentially selects the maximum, and the axis argument determines along which direction the maximum is taken. Try the following:
a = np.arange(4).reshape((2,2))
np.amax(a, axis=0) # gives array([2, 3])
np.amax(a, axis=1) # gives array([1, 3])
The rest of the code sets up the histogram plot. There are 100 bins in total and the bars are colored blue. plt.ylim([0,2]) limits the y-axis to the range 0 to 2, and normed (called density in current matplotlib) rescales the bar heights so that the histogram integrates to 1, i.e. it shows a probability density rather than raw counts.
I can't clearly make out what the true purpose or application of the code was, but this is in essence what it is doing.
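On the 2x shape mentioned in the question: v is the element-wise maximum of two independent uniform samples, and for such a maximum P(v <= x) = P(u1 <= x) * P(u2 <= x) = x^2 on [0, 1], whose derivative is the density 2x. The following sketch checks this numerically without plotting. It uses np.histogram with 20 bins, which is my choice for readability and not part of the original code.

```python
import numpy as np

np.random.seed(123)
u = np.random.uniform(0, 1, [2, 10000])
v = u.max(axis=0)  # element-wise maximum of the two rows, shape (10000,)

# density=True rescales the counts so that the histogram integrates to 1.
heights, edges = np.histogram(v, bins=20, range=(0, 1), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Compare the empirical density with 2x at the bin centers.
for c, h in zip(centers, heights):
    print(f"x={c:.3f}  histogram={h:.2f}  2x={2 * c:.2f}")
```

The histogram heights should track 2x up to sampling noise, and the sample mean of v should be close to 2/3, the mean of a density 2x on [0, 1].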
