Say I have a matrix of shape (2,3). I need to diagonalize each 3-element vector into a matrix of shape (3,3), for both vectors at once. That is, I need to return an array of shape (2,3,3). How can I do that with NumPy elegantly?
Given data = np.array([[1,2,3],[4,5,6]]),
I want the result [[[1,0,0],
[0,2,0],
[0,0,3]],
[[4,0,0],
[0,5,0],
[0,0,6]]]
Thanks
tl;dr, my one-liner: mydiag=np.vectorize(np.diag, signature='(n)->(n,n)')
I suppose here that by "diagonalize" you mean "applying np.diag".
Which, as a teacher of linear algebra, tickles me a bit, since "diagonalizing" has a specific meaning that is not this one (it means computing eigenvectors and eigenvalues, and from there writing M = P⁻¹ΛP, which you cannot do from the inputs you have).
So, I suppose that if the input matrix is
[[1, 2, 3],
[9, 8, 7]]
The output matrix you want is
[[[1, 0, 0],
[0, 2, 0],
[0, 0, 3]],
[[9, 0, 0],
[0, 8, 0],
[0, 0, 7]]]
If not, you can ignore this answer [Edit: in the meantime, you explained exactly that. So you may continue reading].
There are many ways to do that.
My one liner would be
mydiag=np.vectorize(np.diag, signature='(n)->(n,n)')
This builds a new function that does what you want (it interprets the input as a list of 1D arrays, calls np.diag on each of them to get a 2D array, and stacks the 2D arrays into a single numpy array, giving a 3D array).
Then, you just call mydiag(M)
One advantage of vectorize is that it uses numpy broadcasting. In other words, the loops are executed in C, not in Python. In yet other words, it is faster. Well, it is supposed to be (on small matrices, it is in fact slower than Michael's method in the comments; on large matrices, it has exactly the same speed, which is frustrating, since the einsum doc itself says that it sacrifices broadcasting).
Plus, it is a one-liner, which has no other interest than bragging on forums. But well, here we are.
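For completeness, a minimal sketch of the one-liner applied to the data from the question:
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

# vectorized np.diag: maps each length-n row to an (n, n) diagonal matrix
mydiag = np.vectorize(np.diag, signature='(n)->(n,n)')

out = mydiag(data)
print(out.shape)  # (2, 3, 3)
print(out)        # the two diagonal matrices from the question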
Here is one way with indexing:
out = np.zeros(data.shape + (data.shape[-1],), dtype=data.dtype)  # shape (2, 3, 3) for this data
x, y = np.indices(data.shape).reshape(2, -1)  # matrix index and diagonal position of every element
out[x, y, y] = data.ravel()                   # place each value on its own diagonal
output:
array([[[1, 0, 0],
[0, 2, 0],
[0, 0, 3]],
[[4, 0, 0],
[0, 5, 0],
[0, 0, 6]]])
We use array indexing to grab precisely those elements that are on the diagonal. Note that array indexing allows broadcasting between the indices, so index1 contains the index of the matrix within the stack, and index2 contains the index of the diagonal element.
index1 = np.arange(2)[:, None] # 2 is the number of arrays
index2 = np.arange(3)[None, :] # 3 is the square size of each matrix
result = np.zeros((2, 3, 3))
result[index1, index2, index2] = data
I've stated this question in graph theory terms, but that conceptualization isn't necessary.
What I'm trying to do, using Python, is produce a matrix of zeros and ones, where every row has the same number of ones and every column has the same number of ones. The number of ones per row will not be the same as the number per column when the number of rows (sending nodes) does not equal the number of columns (receiving nodes), which is something I'm allowing.
It makes sense to me to do this in numpy, but there may be other packages (like networkx?) that would help.
Here's the function I'm looking to write with the desired inputs and outputs:
n_pre = 4 # number of nodes available to send a connection
n_post = 4 # number of nodes available to receive a connection
p = 0.5 # proportion of all possible connections that exist
mat = generate_mat(n_pre, n_post, p)
print mat
The output would be, for example:
[[0, 1, 0, 1],
[1, 0, 1, 0],
[1, 1, 0, 0],
[0, 0, 1, 1]]
Notice, every column and every row has two ones in it. Aside from this constraint, the positions of the ones should be random (and vary from call to call of this function).
In graph theory terms, this means every node has an in-degree of 2 and an out-degree of 2 (50% of all possible connections, as specified with p = 0.5).
For a square matrix, what you describe is the adjacency matrix of a random k-regular directed graph, and there are known algorithms to generate such graphs. igraph implements one:
# I think this is how you call it - it's an instance method for some reason.
igraph.Graph().K_Regular(n, k, directed=True)
networkx has a function for random k-regular undirected graphs:
networkx.random_regular_graph(k, n)
For a non-square matrix, what you describe is isomorphic to a random biregular graph. I have found no convenient existing implementation for random biregular graphs, but the term should be a good starting point for searching for known algorithms.
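For the square, undirected case, a minimal sketch of turning the networkx graph into the 0/1 matrix (this assumes numpy and networkx are available; random_regular_graph requires n*k to be even):
import networkx as nx
import numpy as np

n, k = 4, 2                             # 4 nodes, each with degree 2
G = nx.random_regular_graph(k, n)
mat = nx.to_numpy_array(G, dtype=int)   # adjacency matrix: every row and column sums to k
print(mat)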
First, do the pre-work so that we have available the size of the square matrix and the population pop of each row and column. Now, initialize a matrix in which each row has pop consecutive ones, starting on the main diagonal and wrapping around. For n = 6 and pop = 3, you'd have
[[1, 1, 1, 0, 0, 0]
[0, 1, 1, 1, 0, 0]
[0, 0, 1, 1, 1, 0]
[0, 0, 0, 1, 1, 1]
[1, 0, 0, 0, 1, 1]
[1, 1, 0, 0, 0, 1]]
Now, apply your friendly neighborhood random shuffle operation to the columns, then the rows (or in the other order). There's your matrix. A shuffle of rows-only or columns-only does not change the population on either axis.
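A minimal numpy sketch of that recipe, for the square case (the names n and pop follow this answer rather than the question's generate_mat(n_pre, n_post, p) signature):
import numpy as np

def generate_mat(n, pop, rng=None):
    rng = np.random.default_rng(rng)
    # each row gets `pop` consecutive ones, starting on the diagonal and wrapping around
    mat = np.zeros((n, n), dtype=int)
    for i in range(n):
        mat[i, (i + np.arange(pop)) % n] = 1
    # shuffling whole rows and whole columns preserves the row and column sums
    mat = mat[rng.permutation(n), :]
    mat = mat[:, rng.permutation(n)]
    return mat

print(generate_mat(6, 3))  # every row and every column sums to 3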
I found out about vtkInterface, a python vtk wrapper that facilitates vtk plotting.
I am trying to run their first example, under "Initialize from Numpy Arrays" on this page: vtkInterface.PolyData. Simply running the code as is results in a gray render window with nothing in it.
Some of the other examples do work but this is exactly the thing that I need at the moment and was wondering if anybody has tried it and knows what might be wrong.
Example Code:
import numpy as np
import vtkInterface
# mesh points
vertices = np.array([[0, 0, 0],
[1, 0, 0],
[1, 1, 0],
[0, 1, 0]])
# mesh faces
faces = np.hstack([[4, 0, 1, 2, 3], # square
[3, 0, 1, 4], # triangle
[3, 1, 2, 4]]) # triangle
surf = vtkInterface.PolyData(vertices, faces)
# plot each face with a different color
surf.Plot(scalars=np.arange(3))
The example is wrong: it lacks a fifth point. For example, this will work:
vertices = np.array([[0, 0, 0],
[1, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0.5, 0.5, -1]])
Explanation: In VTK, faces are encoded in the following way:
face_j = [ n, i_0, i_1, ..., i_{n-1} ]
Here, n is the number of points per face, and i_k are the indices of the points in the vertex-array. The face is formed by connecting the points vertices[i_k] with k in range(0,n). A list of faces is created by simply concatenating the single face specifications:
np.hstack([face_0, face_1, ..., face_j, ...])
The advantage of this encoding is that the number of points used per face, n, can vary. So a mesh can consist of lines, triangles, quads, etc.
In the example, the vertex with id 4 is used in the second and third face, so vertices is required to contain at least five entries. Surprisingly, the sample doesn't crash, even though VTK would almost certainly be expected to crash when a face references a non-existent point.
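Putting the fix into the original snippet, the corrected example would look roughly like this (only the fifth vertex is new; the vtkInterface calls are unchanged from the question):
import numpy as np
import vtkInterface

# mesh points; five vertices, since the faces reference index 4
vertices = np.array([[0, 0, 0],
                     [1, 0, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0.5, 0.5, -1]])

# mesh faces, each encoded as [n, i_0, ..., i_{n-1}]
faces = np.hstack([[4, 0, 1, 2, 3],  # square
                   [3, 0, 1, 4],     # triangle
                   [3, 1, 2, 4]])    # triangle

surf = vtkInterface.PolyData(vertices, faces)
surf.Plot(scalars=np.arange(3))  # one color per face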
I have two 2D arrays (or of higher dimension), one that defines averages (M) and one that defines standard deviations (S). Is there a python library (numpy, scipy, ...?) that allows me to generate an array (X) containing samples drawn from the corresponding distributions?
In other words: each entry xij is a sample that comes from the normal distribution defined by the corresponding mean mij and standard deviation sij.
Yes, numpy can help here:
There is an np.random.normal function that accepts array-like inputs:
import numpy as np
means = np.arange(10) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
stddevs = np.ones(10) # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
samples = np.random.normal(means, stddevs)
array([-1.69515214, -0.20680708, 0.61345775, 2.98154162, 2.77888087,
7.22203785, 5.29995343, 8.52766436, 9.70005434, 9.58381479])
even if they are multidimensional:
means = np.arange(10).reshape(2,5) # make it multidimensional with shape 2, 5
stddevs = np.ones(10).reshape(2,5)
samples = np.random.normal(means, stddevs)
array([[-0.76585438, 1.22226145, 2.85554809, 2.64009423, 4.67255324],
[ 3.21658151, 4.59969355, 6.87946817, 9.14658687, 8.68465692]])
The second one has a shape of (2, 5).
In case you want different means but the same standard deviation, you can also pass one array and one scalar and still get an array with the right shape:
means = np.arange(10)
samples = np.random.normal(means, 1)
array([ 0.54018686, -0.35737881, 2.08881115, 3.08742942, 4.4426366 ,
3.6694955 , 5.27515536, 8.68300816, 8.83893819, 7.71284217])
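Tying this back to the question: with a 2D array of means M and a matching 2D array of standard deviations S, a single call gives the sample array X (the values below are made up for illustration):
import numpy as np

M = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])   # means
S = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])   # standard deviations

X = np.random.normal(M, S)        # X[i, j] ~ Normal(M[i, j], S[i, j])
print(X.shape)                    # (2, 3)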
I am looking for a Python implementation of the k-means algorithm, with examples, to cluster and cache my database of coordinates.
Update: (Eleven years after this original answer, it's probably time for an update.)
First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. Beyond the graphic, I'd suggest looking especially at the parameters that each method requires and deciding whether you can provide them (e.g., k-means requires the number of clusters, but maybe you don't know that before you start clustering).
Here are some resources:
sklearn k-means and sklearn other clustering algorithms
scipy k-means and scipy k-means2
Old answer:
Scipy's clustering implementations work well, and they include a k-means implementation.
There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
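A minimal sketch of the SciPy route mentioned above, using scipy.cluster.vq.kmeans2 on random 2D points (see the next answer for a caveat about older SciPy versions):
import numpy as np
from scipy.cluster.vq import kmeans2

points = np.random.rand(100, 2)         # random 2D points, for illustration
centroids, labels = kmeans2(points, 3)   # 3 clusters
print(centroids.shape)                   # (3, 2)
print(labels[:10])                       # cluster index of the first 10 points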
SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.
For now, I would recommend using PyCluster instead. Example usage:
>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean,
0.03 * numpy.diag([1,1]),
20)
for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound # Number of times this solution was found
1
For continuous data, k-means is very easy.
You need a list of your means, and for each data point, find the mean it is closest to and average the new data point into it. Your means will represent the recent salient clusters of points in the input data.
I do the averaging continuously, so there is no need to keep the old data to obtain the new average. Given the old average k, the next data point x, and a constant n which is the number of past data points to keep the average of, the new average is
k*(1-(1/n)) + x*(1/n)
Here is the full code in Python
from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01  # bigger numbers make the means change faster
              # must be between 0 and 1

for x in data:
    closest_k = 0
    smallest_error = float('inf')  # positive infinity
    for k in enumerate(means):
        error = abs(x - k[1])
        if error < smallest_error:
            smallest_error = error
            closest_k = k[0]
    means[closest_k] = means[closest_k]*(1-param) + x*param
You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!
(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
From Wikipedia, you could use scipy: K-means clustering and vector quantization.
Or, you could use a Python wrapper for OpenCV, ctypes-opencv.
Or you could use OpenCV's new Python interface, and its kmeans implementation.
SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as:
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1., 2.],
[ 4., 2.]])
(courtesy of SciKit Learn's documentation, linked above)
You can also use GDAL, which has many functions for working with spatial data.
Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.
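As a rough sketch of that combination (random data, just to show the pattern; assumes Pycluster and matplotlib are installed):
import numpy as np
import Pycluster
from matplotlib import pyplot

points = np.random.rand(100, 2)                        # random 2D points, for illustration
labels, error, nfound = Pycluster.kcluster(points, 3)  # 3 clusters

pyplot.scatter(points[:, 0], points[:, 1], c=labels)   # color each point by its cluster
pyplot.show()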