Why does numpy.random.dirichlet() not accept multidimensional arrays? - python

On the numpy page they give the example of
s = np.random.dirichlet((10, 5, 3), 20)
which is all fine and great; but what if you want to generate random samples from a 2D array of alphas?
alphas = np.random.randint(10, size=(20, 3))
If you try np.random.dirichlet(alphas), np.random.dirichlet([x for x in alphas]), or np.random.dirichlet((x for x in alphas)), it results in a
ValueError: object too deep for desired array. The only thing that seems to work is:
y = np.empty(alphas.shape)
for i in xrange(np.alen(alphas)):
    y[i] = np.random.dirichlet(alphas[i])
print y
...which is far from ideal for my code structure. Why is this the case, and can anyone think of a more "numpy-like" way of doing this?
Thanks in advance.

np.random.dirichlet is written to generate samples for a single Dirichlet distribution. That code is implemented in terms of the Gamma distribution, and the same idea can be used as the basis for a vectorized version that draws samples from many different Dirichlet distributions at once. In the following, dirichlet_sample takes an array alphas with shape (n, k), where each row is an alpha vector for a Dirichlet distribution. It returns an array, also with shape (n, k), in which each row is a sample of the corresponding distribution from alphas. When run as a script, it generates samples using both dirichlet_sample and np.random.dirichlet to verify that they produce the same samples (up to normal floating point differences).
import numpy as np

def dirichlet_sample(alphas):
    """
    Generate Dirichlet samples; each row of `alphas` is the parameter
    vector of one Dirichlet distribution.
    """
    r = np.random.standard_gamma(alphas)
    return r / r.sum(-1, keepdims=True)

if __name__ == "__main__":
    alphas = 2 ** np.random.randint(0, 4, size=(6, 3))

    np.random.seed(1234)
    d1 = dirichlet_sample(alphas)
    print "dirichlet_sample:"
    print d1

    np.random.seed(1234)
    d2 = np.empty(alphas.shape)
    for k in range(len(alphas)):
        d2[k] = np.random.dirichlet(alphas[k])
    print "np.random.dirichlet:"
    print d2

    # Compare d1 and d2:
    err = np.abs(d1 - d2).max()
    print "max difference:", err
Sample run:
dirichlet_sample:
[[ 0.38980834 0.4043844 0.20580726]
[ 0.14076375 0.26906604 0.59017021]
[ 0.64223074 0.26099934 0.09676991]
[ 0.21880145 0.33775249 0.44344606]
[ 0.39879859 0.40984454 0.19135688]
[ 0.73976425 0.21467288 0.04556287]]
np.random.dirichlet:
[[ 0.38980834 0.4043844 0.20580726]
[ 0.14076375 0.26906604 0.59017021]
[ 0.64223074 0.26099934 0.09676991]
[ 0.21880145 0.33775249 0.44344606]
[ 0.39879859 0.40984454 0.19135688]
[ 0.73976425 0.21467288 0.04556287]]
max difference: 5.55111512313e-17

I think you're looking for
y = np.array([np.random.dirichlet(x) for x in alphas])
for your list comprehension. Otherwise you're simply passing a plain Python list or generator. I imagine numpy.random.dirichlet does not accept your 2-D array of alpha values simply because it's not set up to: per the documentation, it expects a one-dimensional alpha array of length k.
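A quick sketch of that with the alphas from the question (one change: the randint lower bound is 1 here, since dirichlet needs strictly positive alphas):
import numpy as np

alphas = np.random.randint(1, 10, size=(20, 3))  # strictly positive alpha vectors

y = np.array([np.random.dirichlet(a) for a in alphas])

print(y.shape)        # (20, 3): one Dirichlet sample per row of alphas
print(y.sum(axis=1))  # each sample sums to 1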

Related

Numpy.dot dot product function for statsmodels

I am learning the statsmodels.api module to use Python for regression analysis. So I started with the simple OLS model.
In econometrics, the model is: y = Xb + e
where X is N x K, b is K x 1, and e is N x 1, so y comes out N x 1. This is perfectly fine from a linear algebra point of view.
But I followed the tutorial from Statsmodels as the following:
import numpy as np
import statsmodels.api as sm

nsample = 100                       # total obs is 100
x = np.linspace(0, 10, 100)         # using np.linspace(start, stop, number)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)  # draw numbers from a normal distribution,
                                    # default mu = 0, std.dev = 1, size set by user
# e is n x 1
# Now, we add the constant/intercept term to X
X = sm.add_constant(X)
# Now, we compute y
y = np.dot(X, beta) + e
So this generates the correct answer. But I have a question about the generation of beta = np.array([1,0.1,10]). This beta, if we use:
beta.shape
(3,)
It has shape (3,); the same goes for y and e, but not for X:
X.shape
(100,3)
e.shape
(100,)
y.shape
(100,)
So I tried initializing arrays in the following three ways:
o = np.array([1, 2, 3])
o1 = np.array([[1], [2], [3]])
o2 = np.array([[1, 2, 3]])
print(o.shape)
print(o1.shape)
print(o2.shape)
----------------
(3,)
(3, 1)
(1, 3)
If I use beta = np.array([[1], [2], [3]]), which has shape (3, 1), np.dot(X, beta) gets me a wrong answer, even though the dimensions seem to work.
If I use np.array([[1, 2, 3]]), which is a row vector, the dimensions don't match for the dot product, neither in numpy nor in linear algebra.
So I am wondering why, for an N x K times K x 1 product, we have to use a (N, K) dot (K,) instead of a (N, K) dot (K, 1). What makes np.array([1, 0.1, 10]) work with numpy.dot() while np.array([[1], [0.1], [10]]) doesn't?
Thank you very much.
Some update:
Sorry about the confusion; the Statsmodels example uses randomly generated data, so I fixed X and got the following input:
f = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
o = np.array([1, 2, 3])
o1 = np.array([[1], [2], [3]])
o2 = np.array([[1, 2, 3]])
print(o.shape)
print(o1.shape)
print(o2.shape)
print("---------")
print(np.dot(f,o))
print(np.dot(f,o1))
r1 = np.dot(f,o)
r2 = np.dot(f,o1)
type1 = type(np.dot(f,o))
type2 = type(np.dot(f,o1))
tf = type1 is type2
tf2 = type1 == type2
print(type1)
print(type2)
print(tf)
print(tf2)
-------------------------
(3,)
(3, 1)
(1, 3)
---------
[14 32 50 68 86]
[[14]
[32]
[50]
[68]
[86]]
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
True
True
Sorry again for the confusion and inconvenience, they worked fine.
python/numpy is not a matrix-based language like Matlab, Octave, or Scilab. Those languages follow the rules of matrix multiplication strictly. So
np.dot(f,o) ---------> f*o in Matlab/Octave/Scilab
np.dot(f,o1) ---------> f*o1 does not work in Matlab/Octave/Scilab
python/numpy has 'broadcasting': the rules for how arrays of different shapes combine under an operation to give a result. It's not obvious why np.dot(f, o1) should even work, but the rules define some useful results. You will have to consult the docs for that.
In python/numpy the * is not a matrix operator. You can find out what the broadcasting gives for
print(f*o)
print(f*o1)
print(f*o2)
Rather recently, python/numpy introduced the matrix multiplication operator @ (Python 3.5+). You might find out what happens with
print(f @ o)
print(f @ o1)
print(f @ o2)
Does this give some impression?
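As a quick sketch of what those expressions come back with (this only assumes the small f, o, o1, o2 arrays defined in the question; the @ operator needs Python 3.5+ and NumPy 1.10+):
import numpy as np

f  = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])  # (5, 3)
o  = np.array([1, 2, 3])        # (3,)
o1 = np.array([[1], [2], [3]])  # (3, 1)
o2 = np.array([[1, 2, 3]])      # (1, 3)

print((f @ o).shape)   # (5,)   -- 1-D result, same numbers as np.dot(f, o)
print((f @ o1).shape)  # (5, 1) -- 2-D column, same numbers as np.dot(f, o1)
# f @ o2 raises ValueError: the inner dimensions (3 and 1) do not match

print((f * o).shape)   # (5, 3) -- elementwise; o is broadcast across the rows
print((f * o2).shape)  # (5, 3) -- o2 (a row) broadcasts the same way
# f * o1 raises ValueError: shapes (5, 3) and (3, 1) cannot be broadcast together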

MATLAB to Python conversion: vectors, arrays, index elements

Good day to everyone! I'm currently converting a MATLAB project to Python 2.7. I am trying to convert the line
h = [ im(:,2:cols) zeros(rows,1) ] - [ zeros(rows,1) im(:,1:cols-1) ];
When I try to convert it
h = (np.concatenate((im[1, range(2, cols)], np.zeros((rows, 1))))
     - np.concatenate((np.zeros((rows, 1)), im[1, range(2, cols - 1)])))
IDLE returns different errors like
ValueError: all the input arrays must have same number of dimensions
I'm very new to Python and I would appreciate it if you would suggest other methods. Thank you so much! Here's the function I am trying to convert.
function [gradient, or] = canny(im, sigma, scaling, vert, horz)
xscaling = vert; yscaling = horz;
hsize = [6*sigma+1, 6*sigma+1]; % The filter size.
gaussian = fspecial('gaussian',hsize,sigma);
im = filter2(gaussian,im); % Smoothed image.
im = imresize(im, scaling, 'AntiAliasing',false);
[rows, cols] = size(im);
h = [ im(:,2:cols) zeros(rows,1) ] - [ zeros(rows,1) im(:,1:cols-1) ];
I would also like to ask about the ':' operator, which MATLAB uses mainly for indices and arrays. Is there any Python equivalent for the : operator?
The Python converted code I started:
def canny(im=None, sigma=None, scaling=None, vert=None, horz=None):
    xscaling = vert
    yscaling = horz
    hsize = (6 * sigma + 1), (6 * sigma + 1)  # The filter size.
    gaussian = gauss2D(hsize, sigma)
    im = filter2(gaussian, im)  # Smoothed image.
    print("This is im")
    print(im)
    print("This is hsize")
    print(hsize)
    print("This is scaling")
    print(scaling)
    #scaling = 0.4
    #scaling = tuple(scaling)
    im = cv2.resize(im, None, fx=scaling, fy=scaling)
    [rows, cols] = np.shape(im)
Say your data is in a list of lists. Try this:
a = [[2, 9, 4], [7, 5, 3], [6, 1, 8]]
im = np.array(a, dtype=float)
rows = 3
cols = 3
h = (np.hstack([im[:, 1:cols], np.zeros((rows, 1))])
- np.hstack([np.zeros((rows, 1)), im[:, :cols-1]]))
The equivalent of MATLAB's horzcat (that is, [A B]) is np.hstack and the equivalent of vertcat ([A; B]) is np.vstack.
Array indexing in numpy is very close to MATLAB, except that indexes start at 0 in numpy, and the range p:q means "p to q-1".
Also, the storage order of arrays is row-major by default, and you can use column-major order if you want (see this). In MATLAB, arrays are stored in column-major order. To check in Python, type for instance np.isfortran(im). If it returns true, the array has the same order as MATLAB (Fortran order), otherwise it's row-major (C order). It's important when you want to optimize loops, or when you pass an array to a C or Fortran routine.
Ideally, try to put everything in an np.array as soon as possible, and don't use lists (they take much more space and processing is much slower). There are also some quirks: for instance, 1.0 / 0.0 throws an exception, but np.float64(1.0) / np.float64(0.0) returns inf, like in MATLAB.
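To make the ':' question concrete, here is a minimal sketch of the indexing correspondence (it only assumes a small throwaway array im):
import numpy as np

im = np.arange(12, dtype=float).reshape(3, 4)  # small 3 x 4 test array
rows, cols = im.shape

# MATLAB               numpy
# im(:, 2:cols)     -> im[:, 1:cols]     all rows, columns 2..cols
# im(:, 1:cols-1)   -> im[:, :cols-1]    all rows, columns 1..cols-1
print(im[:, 1:cols])
print(im[:, :cols-1])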
Another example from the comments:
d1 = [ im(2:rows,2:cols) zeros(rows-1,1); zeros(1,cols) ] - ...
[ zeros(1,cols); zeros(rows-1,1) im(1:rows-1,1:cols-1) ];
d2 = [ zeros(1,cols); im(1:rows-1,2:cols) zeros(rows-1,1); ] - ...
[ zeros(rows-1,1) im(2:rows,1:cols-1); zeros(1,cols) ];
For this one, rather than np.vstack and np.hstack, you can use np.block.
im = np.ones((10, 15))
rows, cols = im.shape
d1 = (np.block([[im[1:rows, 1:cols], np.zeros((rows-1, 1))],
                [np.zeros((1, cols))]]) -
      np.block([[np.zeros((1, cols))],
                [np.zeros((rows-1, 1)), im[:rows-1, :cols-1]]]))

d2 = (np.block([[np.zeros((1, cols))],
                [im[:rows-1, 1:cols], np.zeros((rows-1, 1))]]) -
      np.block([[np.zeros((rows-1, 1)), im[1:rows, :cols-1]],
                [np.zeros((1, cols))]]))
With np.zeros((Nrows, 1)) you are generating a 2-D array made of Nrows rows of 1 element each. Then, with im[1, 2:cols] you are getting a 1-D array of cols-2 elements. You should replace np.zeros((rows, 1)) with np.zeros(rows).
Moreover, in the second np.concatenate, the subarray you take from im should have the same number of elements as in the first concatenate. Note that you are taking one element fewer: range(2, cols) vs range(2, cols-1).

Python: Dendrogram with Scipy doesn't work

I want to use the dendrogram of scipy.
I have the following data:
I have a list with seven different means. For example:
Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756, 31.594949722819372, 34.823881975554166, 28.36368420190157]
Each mean is calculated for a different user. For example:
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]
My aim is to display the data described above with the help of a dendrogram.
I tried the following:
Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756, 31.594949722819372, 34.823881975554166, 28.36368420190157]
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]
# Attempt with matrix
#X = np.concatenate((X, Y),)
#Z = linkage(X)
Z = linkage(Y)
# Plot the dendrogram with the results above
dendrogram(Z, leaf_rotation=45., leaf_font_size=12. , show_contracted=True)
plt.style.use("seaborn-whitegrid")
plt.title("Dendogram to find clusters")
plt.ylabel("Distance")
plt.show()
But it says:
ValueError: Length n of condensed distance matrix 'y' must be a binomial coefficient, i.e.there must be a k such that (k \choose 2)=n)!
I already tried to convert my data into a matrix. With:
# Attempt with matrix
#X = np.concatenate((X, Y),)
#Z = linkage(X)
But that doesn't work either!
Are there any suggestions?
Thanks :-)
The first argument of linkage is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. These are two very different meanings! The first is the raw data, i.e. the observations. The second format assumes that you have already computed all the distances between your observations, and you are providing these distances to linkage, not the original points.
It looks like you want the first case (raw data), with m = 1. So you must reshape the input to have shape (n, 1).
Replace this:
Z = linkage(Y)
with:
Z = linkage(np.reshape(Y, (len(Y), 1)))
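Putting the pieces together, a minimal sketch of the whole thing (the imports and the labels= argument are my additions; the data and the rest follow the question):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756,
     31.594949722819372, 34.823881975554166, 28.36368420190157]
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]

# Each mean is one observation in 1-D space, so reshape to (7, 1).
Z = linkage(np.reshape(Y, (len(Y), 1)))

dendrogram(Z, labels=X, leaf_rotation=45., leaf_font_size=12., show_contracted=True)
plt.title("Dendrogram to find clusters")
plt.ylabel("Distance")
plt.show()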
You are using 7 observations in Y (len(Y) = 7).
But as per the documentation of linkage, when you pass a one-dimensional array it is taken as a condensed distance matrix, and its length must satisfy
{n \choose 2} = len(Y)
which means
1/2 * (n - 1) * n = len(Y)
so the length of Y must be such that n comes out a valid integer. Valid lengths are 1, 3, 6, 10, 15, ...; 7 is not one of them, hence the error.
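As a quick way to check that condition, a small sketch (the helper function here is hypothetical, not part of scipy):
import numpy as np

def is_condensed_length(length):
    # n*(n-1)/2 == length  =>  n = (1 + sqrt(1 + 8*length)) / 2 must be an integer
    n = (1 + np.sqrt(1 + 8 * length)) / 2
    return n == int(n)

print([m for m in range(1, 25) if is_condensed_length(m)])  # [1, 3, 6, 10, 15, 21]
print(is_condensed_length(7))                               # False -> hence the ValueError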

Python: Replacing every imaginary value in an array with a random value

I got an
array([[ 0.01454911+0.j, 0.01392502+0.00095922j,
0.00343284+0.00036535j, 0.00094982+0.0019255j ,
0.00204887+0.0039264j , 0.00112154+0.00133549j, 0.00060697+0.j],
[ 0.02179418+0.j, 0.01010125-0.00062646j,
0.00086327+0.00495717j, 0.00204473-0.00584213j,
0.00159394-0.00678094j, 0.00121372-0.0043044j , 0.00040639+0.j]])
I need a solution that lets me replace just the imaginary components with random values generated by:
numpy.random.vonmises(mu, kappa, size=size)
The resulting array needs to be in the same form as the first one.
Loop over the numbers and just set them to a value you like. The parameters mu and kappa for the numpy.random.vonmises function need to be defined, since they are left undefined in the example below.
import numpy as np

data = np.array([[0.01454911+0.j, 0.01392502+0.00095922j,
                  0.00343284+0.00036535j, 0.00094982+0.0019255j,
                  0.00204887+0.0039264j, 0.00112154+0.00133549j, 0.00060697+0.j],
                 [0.02179418+0.j, 0.01010125-0.00062646j,
                  0.00086327+0.00495717j, 0.00204473-0.00584213j,
                  0.00159394-0.00678094j, 0.00121372-0.0043044j, 0.00040639+0.j]])

def setRandomImag(c):
    c.imag = np.random.vonmises(mu, kappa, size=size)
    return c

data = [setRandomImag(i) for i in data]
n_epochs = 2
n_freqs = 7
# shape-giving parameters for the array
data2 = np.zeros((n_epochs, n_freqs), dtype=complex)
for i in range(0, n_epochs):
    data2[i] = np.real(data[i]) + np.random.vonmises(mu, kappa) * complex(0, 1)
This gives every frequency in an epoch the same imaginary value. Not exactly what I was asking for, but it solves my problem.
Try this approach:
Store your numbers in a 2-D array of real parts and imaginary parts.
Then replace the imaginary parts with the randomly chosen numbers.
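A vectorized sketch of that idea (mu and kappa are placeholder parameters you would pick yourself; the data is a shortened version of the question's array):
import numpy as np

data = np.array([[0.01454911+0.j, 0.01392502+0.00095922j, 0.00343284+0.00036535j],
                 [0.02179418+0.j, 0.01010125-0.00062646j, 0.00086327+0.00495717j]])

mu, kappa = 0.0, 4.0  # placeholder von Mises parameters

# Keep the real parts and draw a fresh imaginary part for every element at once.
new_imag = np.random.vonmises(mu, kappa, size=data.shape)
result = data.real + 1j * new_imag

print(result.shape)  # same shape and dtype (complex) as the input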

Translating Matlab (Octave) group coloring code into python (numpy, pyplot)

I want to translate the following group coloring octave function to python and use it with pyplot.
Function input:
x - Data matrix (m x n)
a - A parameter.
index - A vector of size "m" with values in the range [0, a)
(For example, if a = 4, index can be [random.choice(range(4)) for i in range(m)].)
The values in "index" indicate the number of the group that the m-th data point belongs to.
The function should plot all the data points from x and color them in different colors (Number of different colors is "a").
The function in octave:
p = hsv(a); % This is an a x 3 matrix
colors = p(index, :); % ****This is an m x 3 matrix****
scatter(X(:,1), X(:,2), 10, colors);
I couldn't find a function like hsv in python, so I wrote it myself (I think I did..):
p = colors.hsv_to_rgb(numpy.column_stack((
numpy.linspace(0, 1, a), numpy.ones((a ,2)) )) )
But I can't figure out how to do the matrix selection p(index, :) in python (numpy).
Especially because the size of "index" is bigger than "a".
Thanks in advance for your help.
So, you want to take an m x 3 matrix of RGB values and convert each row to HSV?
import numpy as np
import colorsys
mymatrix = np.matrix([[11, 12, 13],
                      [21, 22, 23],
                      [31, 32, 33]])

def to_hsv(x):
    return colorsys.rgb_to_hsv(*x)

# Apply the to_hsv function to each matrix row.
print np.apply_along_axis(to_hsv, axis=1, arr=mymatrix)
This produces:
[[ 0.5 0. 13. ]
[ 0.5 0. 23. ]
[ 0.5 0. 33. ]]
Following up on your comment:
If I understand correctly, you have an a x 3 matrix p, and you want to repeatedly select random rows from it until you have a new matrix that is m x 3?
Ok. Let's say you have a matrix p defined as follows:
a = 5
p = np.random.randint(5, size=(a, 3))
Now, make an array of m random integers in the range 0 to a-1 (indexing starts at 0 and ends at a-1):
m = 20
index = np.random.randint(a, size=m)
Now access the right indexes and plug them into a new matrix:
p_prime = np.matrix([p[i] for i in index])
Produces a 20 x 3 matrix.
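For reference, NumPy's fancy indexing does the p(index, :) row selection directly, without a comprehension; a minimal sketch (reusing an hsv-style colormap built the way the question does):
import numpy as np
from matplotlib import colors

a, m = 4, 20
# An a x 3 colormap, as in the question: vary hue, keep saturation and value at 1.
p = colors.hsv_to_rgb(np.column_stack((np.linspace(0, 1, a), np.ones((a, 2)))))

index = np.random.randint(a, size=m)  # group number for each of the m data points
point_colors = p[index]               # fancy indexing: picks rows, gives an (m, 3) array

print(point_colors.shape)             # (20, 3)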
