So I am trying to solve a problem in Python, and I am able to generate a matrix of the form:
[
[ 0, 0, 0, 1, 1, 1 ],
[ 0, 0, 1, 0, 1, 1 ],
[ 0, 1, 0, 0, 0, 1 ],
[ 1, 0, 0, 0, 1, 1 ],
[ 1, 1, 0, 1, 0, 0 ],
[ 1, 1, 1, 1, 0, 0 ],
]
This is just an array of arrays. The matrix is a square matrix whose diagonal is always going to be 0 because it essentially represents compatibility between two entities. For example, entity 0 is compatible with entities 3, 4, and 5, represented by the 1's in the first row (and, by symmetry, the first column). The matrix is mirrored along the diagonal because if entity 1 is compatible with entity 3, then entity 3 will also be compatible with entity 1.
Now what I essentially want to do is pair up the maximum number of compatible entities, without repetition. For example, if entity 0 has been paired with entity 3, then it should not also be paired with entity 4. However, if entity 3 has another compatible entity but entity 4 does not, then entity 0 should be paired with entity 4 and not entity 3.
The end goal is to find out the number of entities that would remain unpaired. I could probably brute-force a solution, but since it's more of a math problem than a programming one, I would like to know if there is an efficient way of going about it.
It appears that what you have is the adjacency matrix of an undirected graph. Graphs consist of nodes and edges between the nodes. Your problem can be summarized as finding the number of unmatched nodes in a maximum cardinality matching. There's a networkx library in Python that deals with graphs, and if you're willing to use it, the problem becomes very simple.
You first build a graph object from the adjacency matrix, then call max_weight_matching (with maxcardinality=True) to find a maximum cardinality matching. It implements Edmonds' blossom algorithm.
# pip install networkx
import networkx as nx

# `data` is the compatibility matrix from the question, as a NumPy array
G = nx.from_numpy_array(data)  # use nx.from_numpy_matrix() on networkx < 3.0
tmp = nx.max_weight_matching(G, maxcardinality=True)
This outputs the maximum cardinality matching:
{(0, 5), (1, 2), (3, 4)}
From there, if you want to find the number of nodes that are unmatched, you can find the difference between the nodes and the matched nodes:
out = len(G.nodes - {x for pair in list(tmp) for x in pair})
Output:
0
I am working on a new question. How am I supposed to develop a solution for the following:
The neighbors() Function
In a 2D array, each element has up to eight neighbors - the cells immediately to the north, east, west, south, northeast, northwest, southeast, and southwest of it. Of course, elements on the boundary of the array have fewer neighbors (entries at the four corners only have three neighbors). Your task is to write a function called neighbors() that takes a 2D array named input, a row index, and a column index as input and returns the number of neighbors of array[row][column] that are 1's.
For example, if the 2D input array is
array = [ [0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 0, 1] ]
representing the array
0  0  0  0
1  1  0  1
0  0  0  1
then here are some examples of the function being used:
>>> neighbors(array, 1, 1)
1
>>> neighbors(array, 2, 2)
3
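A minimal sketch of such a function (my own implementation, not part of the assignment text), clipping the 3x3 window at the array boundary:
def neighbors(input, row, column):
    # Count the 1's among the up-to-eight cells surrounding input[row][column].
    # The range() bounds are clipped, so corner and edge cells are handled automatically.
    count = 0
    for r in range(max(0, row - 1), min(len(input), row + 2)):
        for c in range(max(0, column - 1), min(len(input[0]), column + 2)):
            if (r, c) != (row, column) and input[r][c] == 1:
                count += 1
    return count
With the example array above, neighbors(array, 1, 1) returns 1 and neighbors(array, 2, 2) returns 3.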
For Finite-Element simulations I need higher order meshes.
For the sake of efficiency I want to use serendipity elements, i.e. elements without interior nodes.
The setOrder() function of gmsh was easy to find, but it generates Lagrangian elements.
How can I set another element type, or somehow remove the interior nodes?
In the following example of a square in 2D the generated quadrilaterals of second order have 9 nodes each, I would like to have only 8 nodes per element.
Interestingly, gmsh seems to know these element types, as they are mentioned in the documentation of the MSH file format: elm-type=10 for the 9-node quadrilateral and elm-type=16 for the 8-node one.
import gmsh

gmsh.initialize()
gmsh.model.add("mini")
dim2 = 2

# Corner points of the unit square: (x, y, z, mesh size, tag)
gmsh.model.geo.addPoint(0, 0, 0, 0.5, 1)
gmsh.model.geo.addPoint(1, 0, 0, 0.5, 2)
gmsh.model.geo.addPoint(1, 1, 0, 0.5, 3)
gmsh.model.geo.addPoint(0, 1, 0, 0.5, 4)

# Boundary edges and the surface they enclose
gmsh.model.geo.addLine(1, 2, 1)
gmsh.model.geo.addLine(2, 3, 2)
gmsh.model.geo.addLine(3, 4, 3)
gmsh.model.geo.addLine(4, 1, 4)
gmsh.model.geo.addCurveLoop([1, 2, 3, 4], 1)
gmsh.model.geo.addPlaneSurface([1], 1)

Square = gmsh.model.addPhysicalGroup(dim2, [1])
gmsh.model.setPhysicalName(dim2, Square, "Unit_Square")

gmsh.model.geo.mesh.setRecombine(dim2, 1)  # recombine triangles into quadrilaterals
gmsh.model.geo.synchronize()
gmsh.model.mesh.generate(dim2)
gmsh.model.mesh.setOrder(2)  # generates Lagrangian elements, but I want serendipity
gmsh.write("mesh_order2.msh")
gmsh.finalize()
Christophe Geuzaine answered my question; the keyword is "incomplete elements":
https://gitlab.onelab.info/gmsh/gmsh/-/issues/1272
For my example this means:
...
gmsh.model.mesh.generate(dim2)
gmsh.option.setNumber('Mesh.SecondOrderIncomplete', 1) # <-- that's it!
gmsh.model.mesh.setOrder(2)
...
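As a quick sanity check (my own addition, not from the linked issue), you can query the element types of the meshed surface afterwards; with Mesh.SecondOrderIncomplete set, the quadrilaterals should be reported as 8-node elements (type 16) rather than 9-node ones (type 10):
elem_types, elem_tags, node_tags = gmsh.model.mesh.getElements(dim2, 1)
print(elem_types)  # expect [16], i.e. 8-node serendipity quadrilaterals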
I've stated this question in graph theory terms, but that conceptualization isn't necessary.
What I'm trying to do, using Python, is produce a matrix of zeros and ones where every row has the same number of ones and every column has the same number of ones. The per-row count will not equal the per-column count when the number of rows (sending nodes) differs from the number of columns (receiving nodes), which is something I'm allowing.
It makes sense to me to do this in numpy, but there may be other packages (like networkx?) that would help.
Here's the function I'm looking to write with the desired inputs and outputs:
n_pre = 4 # number of nodes available to send a connection
n_post = 4 # number of nodes available to receive a connection
p = 0.5 # proportion of all possible connections that exist
mat = generate_mat(n_pre, n_post, p)
print(mat)
The output would be, for example:
[[0, 1, 0, 1],
[1, 0, 1, 0],
[1, 1, 0, 0],
[0, 0, 1, 1]]
Notice, every column and every row has two ones in it. Aside from this constraint, the positions of the ones should be random (and vary from call to call of this function).
In graph theory terms, this means every node has an in-degree of 2 and an out-degree of 2 (50% of all possible connections, as specified with p = 0.5).
For a square matrix, what you describe is the adjacency matrix of a random k-regular directed graph, and there are known algorithms to generate such graphs. igraph implements one:
# K_Regular is defined as a class method on Graph
igraph.Graph.K_Regular(n, k, directed=True)
networkx has a function for random k-regular undirected graphs:
networkx.random_regular_graph(k, n)
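If you want the 0/1 matrix itself rather than a graph object, you can convert the result to its adjacency matrix; here is a small sketch for the undirected networkx case (the values of d and n are just illustrative):
import networkx as nx

G = nx.random_regular_graph(d=2, n=4)   # every node gets degree 2
mat = nx.to_numpy_array(G, dtype=int)   # symmetric 0/1 adjacency matrix
print(mat)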
For a non-square matrix, what you describe is isomorphic to a random biregular graph. I have found no convenient existing implementation for random biregular graphs, but the term should be a good starting point for searching for known algorithms.
First, do the pre-work so that we have available the size n of the square matrix and the population pop of each row and column. Now, initialize a matrix in which each row has pop consecutive ones starting on the diagonal, wrapping around at the right edge. For n = 6 and pop = 3, you'd have
[[1, 1, 1, 0, 0, 0]
[0, 1, 1, 1, 0, 0]
[0, 0, 1, 1, 1, 0]
[0, 0, 0, 1, 1, 1]
[1, 0, 0, 0, 1, 1]
[1, 1, 0, 0, 0, 1]]
Now, apply your friendly neighborhood random shuffle operation to the columns, then the rows (or in the other order). There's your matrix. A shuffle of rows-only or columns-only does not change the population on either axis.
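A minimal NumPy sketch of that recipe (the function name generate_mat is taken from the question, but the pop-based signature and the RNG handling are my own):
import numpy as np

def generate_mat(n, pop, rng=None):
    rng = rng or np.random.default_rng()
    # Each row gets `pop` consecutive ones, starting on the diagonal and wrapping around.
    mat = np.zeros((n, n), dtype=int)
    for i in range(n):
        mat[i, (i + np.arange(pop)) % n] = 1
    # Permuting whole rows and then whole columns preserves every row and column sum.
    mat = mat[rng.permutation(n), :]
    return mat[:, rng.permutation(n)]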
I have data that is a matrix of z-scores. Each row has zero average. I am trying to perform a kmeans cluster analysis so as to assign each row to a cluster. To take a very simplified example, in the matrix:
[0, -1, 1, 0]
[0, -1, 1, 0]
[0, 1, -1, 0]
[1, 1, -1, -1]
[-1, -1, 1, 1]
(Except the actual z-score data would have a variance of 1 in each row.)
Python should recognize that the top two rows are in one cluster. I can do this with sklearn.cluster.KMeans. However, now I want it to detect "anticorrelation" and classify the third row together with the top two rows because it is simply the negative of them. If I tell it to find two clusters, it should find one with the top three rows and another with the bottom two, because the bottom two are also negatives of one another.
One possible approach (perhaps) is if I could use a user-defined goodness-of-fit function that defines the distance between two points r1 and r2 as the minimum of sqrt(sum((r1+r2)**2)) and sqrt(sum((r1-r2)**2)). I might then want to know whether a given row has been used as its positive or negative version in its cluster.
Thanks for any help you can give.
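In case a sketch helps: one way to approximate that idea without a custom metric is to flip the sign of each row so that its largest-magnitude entry is positive, run ordinary KMeans on the folded rows, and keep a flag recording which rows were negated. This is a heuristic of mine, not an exact implementation of the min-of-two-distances metric described above:
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_sign_folding(Z, n_clusters):
    Z = np.asarray(Z, dtype=float)
    lead = np.argmax(np.abs(Z), axis=1)           # dominant entry of each row
    flipped = Z[np.arange(len(Z)), lead] < 0      # rows whose dominant entry is negative
    folded = np.where(flipped[:, None], -Z, Z)    # negate those rows before clustering
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(folded)
    return labels, flipped
On the example matrix above, the top three rows fold onto the same vector and land in one cluster, and the bottom two in the other.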
I am looking for Python implementation of k-means algorithm with examples to cluster and cache my database of coordinates.
Update: (Eleven years after this original answer, it's probably time for an update.)
First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. I'd suggest that, beyond the graphic, you look especially at the parameters each method requires and decide whether you can provide them (e.g., k-means requires the number of clusters, but maybe you don't know that before you start clustering).
Here are some resources:
sklearn k-means and sklearn other clustering algorithms
scipy k-means and scipy k-means2
Old answer:
Scipy's clustering implementations work well, and they include a k-means implementation.
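For example, a minimal sketch with SciPy's kmeans2 (the data here is random, purely for illustration):
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

obs = whiten(np.random.rand(100, 2))   # scale each feature to unit variance
centroids, labels = kmeans2(obs, 3)    # cluster into k=3 groups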
There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.
For now, I would recommend using PyCluster instead. Example usage:
>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean,
...                                                          0.03 * numpy.diag([1, 1]),
...                                                          20)
...                        for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound # Number of times this solution was found
1
For continuous data, k-means is very easy.
You need a list of your means, and for each new data point you find the mean it's closest to and average the new data point into it. Your means will represent the recent salient clusters of points in the input data.
I do the averaging continuously, so there is no need to keep the old data around to obtain the new average. Given the old average k, the next data point x, and a constant n which is the number of past data points to average over, the new average is
k*(1 - 1/n) + x*(1/n)
Here is the full code in Python:
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01  # bigger numbers make the means change faster; must be between 0 and 1

for x in data:
    # find the mean closest to this data point...
    closest_k = 0
    smallest_error = float('inf')
    for k, mean in enumerate(means):
        error = abs(x - mean)
        if error < smallest_error:
            smallest_error = error
            closest_k = k
    # ...and nudge that mean towards the point
    means[closest_k] = means[closest_k] * (1 - param) + x * param
You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!
(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
From Wikipedia, you could use SciPy: K-means clustering and vector quantization.
Or, you could use a Python wrapper for OpenCV, ctypes-opencv.
Or you could use OpenCV's new Python interface, and its kmeans implementation.
SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as:
kmeans = KMeans(n_clusters=2, random_state=0).fit(X).
This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1., 2.],
[ 4., 2.]])
(courtesy of SciKit Learn's documentation, linked above)
You can also use GDAL, which has many, many functions to work with spatial data.
Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.
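A small sketch of that combination (random data, purely for illustration):
import numpy as np
import matplotlib.pyplot as plt
import Pycluster

points = np.random.rand(200, 2)
labels, error, nfound = Pycluster.kcluster(points, 3)   # k-means with k=3
plt.scatter(points[:, 0], points[:, 1], c=labels)       # color each point by its cluster
plt.show()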