Matrix clustering using Python

I'm working with a growing matrix of data. To make my computations faster, it looks like the best approach is to cluster it into something like this: [clusterized matrix]
My matrix shows the connections between nodes of a graph, with the weights at the intersections.
I made a graph using NetworkX and noticed it does something similar. Screenshot: [NetworkX graph]
Maybe I could use NetworkX's code to cluster it instead of growing my own code by another function?
If not, any Python way of doing it would be helpful. I've read many tutorials on hierarchical clustering, but they all seem to be about connecting points in two-dimensional space, not in a graph space with given 'distances'.
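For instance, a minimal sketch of the NetworkX route, assuming a reasonably recent NetworkX and that W holds the weight matrix as a NumPy array (names are placeholders):

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # W: square NumPy array of edge weights (the matrix described above)
    G = nx.from_numpy_array(W)  # builds a weighted graph; weights land in the "weight" attribute
    communities = greedy_modularity_communities(G, weight="weight")
    # communities is a list of node sets; reordering rows/columns by community
    # gives the block-clustered matrix shown in the screenshot
    order = [node for c in communities for node in c]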

Related

DBSCAN provided with lines as input

I am new to both machine learning and python and my goal is to experiment with route prediction through clustering.
I've just started using DBSCAN and I was able to obtain results given an array of coordinates as input to the fit procedure, e.g. [[1,1],[2,2],[3,3],...], which includes all coordinates of all routes.
However, what I really want is to provide DBSCAN with a set containing all routes/lines instead of a set containing all coordinates of all routes. Therefore, my question is whether this is possible (does it even make sense?) and if so how can I accomplish this?
Thank you for your time.
Why do you think density-based clustering is a good choice for clustering routes? What notion of density would you use here?
I'd rather try hierarchical clustering with a proper route distance.
But if you have the distance matrix anyway, you can of course just try DBSCAN on it for "free" (computing the distances will be way more expensive than DBSCAN on a distance matrix).
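For reference, a minimal sketch of that "free" route with scikit-learn, assuming D is the precomputed route-to-route distance matrix (eps and min_samples are placeholders to tune):

    from sklearn.cluster import DBSCAN

    # D: (n_routes x n_routes) matrix of pairwise route distances
    labels = DBSCAN(eps=0.5, min_samples=3, metric="precomputed").fit_predict(D)
    # labels[i] is the cluster id of route i; -1 marks noise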

Document Clustering and Visualization

I would like to test whether a set of documents has some special similarity, by looking at a graph built from each one's vector representation, shown together with a text dataset of other documents. I expect they will appear close together in the visualization.
Is the solution to use doc2vec to compute a vector for each document and plot it? Can it be done in an unsupervised way? Which Python library should I use to get those beautiful 2D and 3D representations of Word2vec?
Not sure what you're asking, but if you want a way to check whether vectors are of the same type, you could use K-Means.
K-Means makes K clusters out of a list of vectors, so if you choose a good K (not too low, so it still finds structure, but not too high, so it isn't too discriminating) it could work.
Roughly, K-Means works like this:

    init_centers(K)        # randomly pick K vectors that will be the centers of your clusters
    while not converged(): # tricky: there are many ways to check convergence; the easiest is to check whether the centers moved since the last iteration
        associate_vectors()     # associate every vector with its closest center
        recalculate_centers()   # move each center to the... well, center of its points: just the mean of all vectors in the cluster
The gif in this article (linked below) is probably clearer than my explanation, and the article itself explains it even better, even though it uses Java:
https://picoledelimao.github.io/blog/2016/03/12/multithreaded-k-means-in-java/
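For reference, a runnable NumPy sketch of that loop (variable names are my own; a real implementation should also guard against empty clusters):

    import numpy as np

    def kmeans(X, K, iters=100):
        # randomly pick K vectors from the data as the initial centers
        centers = X[np.random.choice(len(X), K, replace=False)]
        for _ in range(iters):
            # associate every vector with its closest center
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
            # move each center to the mean of the vectors in its cluster
            new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):  # converged: centers stopped moving
                break
            centers = new_centers
        return labels, centers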

What's the right algorithm for finding isolated subsets

Picture is worth a thousand words, so:
My input is the matrix on the left, and what I need to find is the sets of nodes that are at most one step away from each other (not diagonally). A node that is more than one up/down/left/right step away belongs in a separate set.
So my plan was to run a BFS from every node I find, return the set it traversed, and remove that set from the original one, iterating until done. But then I had the wild idea of looking for graph analysis tools, and I found NetworkX. Is there an easy way (an algorithm?) to achieve this without manually writing a BFS and traversing the whole matrix?
Thanks
What you are trying to do is search for "connected components", and NetworkX has a method for doing exactly that, as can be seen in the first example on its documentation page and as others have already pointed out in the comments.
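A toy example (my own numbers):

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([(0, 1), (1, 2), (5, 6)])  # two isolated groups
    sets = list(nx.connected_components(G))     # [{0, 1, 2}, {5, 6}]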
Reading your question, it seems that your nodes lie on a discrete grid and that the notion of connectedness you describe is the same one used for the pixels of an image.
Connected components algorithms are available both for graphs and for images.
If performance matters in your case, I would suggest you go for the image version of connected components.
This is because images (grids of pixels) are a specific class of graphs, so connected components algorithms that deal with grids of nodes are built knowing the topology of the graph itself (i.e. the graph is planar and the maximum vertex degree is four). A general algorithm has to work on arbitrary graphs (which may be non-planar, with multiple edges between some nodes), so it has to do more work because it cannot assume much about the properties of its input.
Since connected components can be found in linear time on graphs, I am not saying the image version will be orders of magnitude faster; there will only be a constant factor between the two.
For this reason you should also take into account which data structure holds your input data and how much time will be spent creating the input structures required by each version of the algorithm.
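For the image version, a minimal sketch using SciPy (assuming your matrix is a 0/1 NumPy array; ndimage.label uses 4-connectivity by default, i.e. no diagonals, which matches your definition):

    import numpy as np
    from scipy import ndimage

    grid = np.array([[1, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
    labeled, num = ndimage.label(grid)
    # labeled marks each connected set of nonzero cells with its own integer id;
    # num is the number of such sets (2 here)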

DBSCAN plotting Non-geometrical-Data

I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: non-geometrical objects based on hexadecimal strings.
I used a simple distance to create a distance matrix as input for DBSCAN, which produced the expected clusters.
Question: Is it possible to create a plot of these cluster results like in the demo?
I didn't find a solution by searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using python for everything (in that project) I would appreciate it to choose a solution in python.
I don't use python, so I cannot give you example code.
If your data isn't 2-dimensional, you can try to find a good 2-dimensional approximation using Multidimensional Scaling (MDS).
Essentially, it takes an input matrix (which should satisfy the triangle inequality, and ideally be derived from a Euclidean distance in some vector space, although you can often get good results even when this does not strictly hold) and tries to find the 2-dimensional data set that best preserves those distances.
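A minimal sketch with scikit-learn, assuming D is the precomputed distance matrix you fed to DBSCAN and labels its cluster results:

    import matplotlib.pyplot as plt
    from sklearn.manifold import MDS

    # D: square symmetric distance matrix; labels: DBSCAN cluster ids
    coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels)  # color points by cluster
    plt.show()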

Using Python to generate a connection/network graph

I have a text file with about 8.5 million data points in the form:
Company 87178481
Company 893489
Company 2345788
[...]
I want to use Python to create a connection graph to see what the network between companies looks like. From the above sample, two companies would share an edge if the value in the second column is the same (clarification from/for Hooked).
I've been using the NetworkX package and have been able to generate a network for a few thousand points, but it isn't making it through the full 8.5-million-point text file. I ran it and left it for about 15 hours, and when I came back the cursor in the shell was still blinking, but there was no output graph.
Is it safe to assume that it was still running? Is there a better/faster/easier approach to graph millions of points?
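For concreteness, one way to implement the edge rule described above (file name hypothetical): group companies by the value in the second column, then connect every pair within a group:

    import itertools
    from collections import defaultdict

    groups = defaultdict(set)
    with open("companies.txt") as f:          # lines like "Company 87178481"
        for line in f:
            company, value = line.split()
            groups[value].add(company)

    edges = set()
    for members in groups.values():
        edges.update(itertools.combinations(sorted(members), 2))  # all pairs sharing a value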
With millions of data points, you'll need some way of looking at the broad picture. Depending on what exactly you are looking for, if you can assign a "distance" between companies (say, the number of connections apart) you can visualize relationships (or clustering) via a dendrogram.
Scipy does clustering:
http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html#module-scipy.cluster.hierarchy
and has a function to turn them into dendrograms for visualization:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram
An example of a shortest-path distance function via NetworkX:
http://networkx.lanl.gov/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path
Ultimately you'll have to decide how you want to weight the distance between two companies (vertices) in your graph.
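Putting those pieces together, a minimal sketch (assuming you have already reduced the problem to a square company-to-company distance matrix D):

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import squareform

    # D: symmetric distance matrix with a zero diagonal
    Z = linkage(squareform(D), method="average")  # squareform -> condensed distance vector
    dendrogram(Z)
    plt.show()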
You have too many data points; if you did visualize the network, it wouldn't make any sense. You need ways to 1) reduce the number of companies by removing those that are less important / less connected, and 2) summarize the graph somehow and then visualize it.
To reduce the size of the data, it might be better to create the network independently (using your own code to create an edge list of companies). This way you can shrink your graph (by removing singletons, for example, of which there may be many).
For summarization I recommend running a clustering or community detection algorithm. This can be done very fast, even for very large networks. Use the "fastgreedy" method in the igraph package: http://igraph.sourceforge.net/doc/R/fastgreedy.community.html
(There is a faster algorithm available online as well, by Blondel et al.: http://perso.uclouvain.be/vincent.blondel/publications/08BG.pdf; I know their code is available online somewhere.)
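A sketch of the fastgreedy route with python-igraph, assuming you have written the edge list to a file (name hypothetical):

    import igraph as ig

    # edges.txt: one "companyA companyB" pair per line (NCOL format)
    g = ig.Graph.Read_Ncol("edges.txt", directed=False)
    g.simplify()                          # fastgreedy requires a simple graph
    clusters = g.community_fastgreedy().as_clustering()
    print(clusters.sizes())               # community sizes as a quick summary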
