Build clustering upon matrix lines with KMeans - Python

BIG EDIT
The original code was:
The plotting of a graph that corresponds to reading a text file with n lines. Each line contains 4 columns; the first three columns are the coordinates of an (x, y, z) point, and the fourth column is a binary variable not needed for this plotting. For every 20 lines read, a skeleton is read, this skeleton being a group of 20 (x, y, z) points or joints, each joint made from the first three columns of one line.
Example of a text file's content: a text file contains 860 lines, and 860/20 = 43, where 20 is the number of joints that make up one skeleton of (x, y, z) joints. The text file is therefore made of 43 skeletons, which together generate a movement; the text file represents a movement. I've called it an "example" because the numbers vary.
After building the code to read the skeletons' movements, I made a big 2D array that contains all the movements together; the result was a 22797x400 array, where each line is a skeleton. Therefore, there are 22797 skeletons, with 400 columns each. I've called this last 2D array final_array.
I've applied Singular Value Decomposition (SVD) to final_array and used the V matrix from the SVD (which yields U, S and V matrices) to multiply final_array by a reduced version of V (originally 400x400), resulting in a 22797x3 2D array, since the reduced version of V was 400x3. This was necessary for reasons that don't need to be mentioned here; essentially it was dimension reduction, to plot the skeletons in upcoming parts of the process.
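(For reference, a minimal sketch of that reduction step, assuming final_array is the 22797x400 array; note that numpy's SVD returns U, S and Vt, i.e. V transposed:)

import numpy as np

# np.linalg.svd returns U, S, Vt; the reduced V is the first three
# right-singular vectors.
U, S, Vt = np.linalg.svd(final_array, full_matrices=False)
V_reduced = Vt[:3, :].T             # the 400x3 reduced version of V
matrix = final_array @ V_reduced    # the 22797x3 array to be clustered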
Hence, I have a 22797x3 2D array, where each line represents a skeleton, built from the operations explained above, and I need to apply clustering to this matrix so that each line is assigned to a group, using KMeans from scikit-learn in Python. It must be a clustering with 100 groups.
What I need as a result is kmeans_labels: a list of 22797 elements indicating which of the 100 clustering groups each line (skeleton) was assigned to.
So far I've tried:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, random_state=0).fit(matrix)
But the result was the following error message:
Number of distinct clusters (68) found smaller than n_clusters (100). Possibly due to duplicate points in X.
No matter how many times I change the number of groups, the error message returns with a smaller value.
Any help?

This error means that your data matrix is mostly composed of repeated vectors.
So from your 22797 data points, there are only 68 distinct vectors, and the rest are just repetitions of those 68 values.
Try printing the matrix. I believe either you are not reading the data as you should, or you are not processing it the right way.
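A small diagnostic sketch along those lines, assuming matrix is the 22797x3 array from the question:

import numpy as np
from sklearn.cluster import KMeans

# Count how many distinct rows (skeletons) the matrix actually contains.
unique_rows = np.unique(matrix, axis=0)
print(unique_rows.shape[0], "distinct rows")

# n_clusters must not exceed the number of distinct rows.
n_clusters = min(100, unique_rows.shape[0])
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(matrix)
print(kmeans.labels_)  # one label per row, i.e. 22797 elements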

Related

Retrieving MST for geographic coordinates using scipy minimum spanning tree

I am trying to create a minimum spanning tree (MST) from geographical coordinates using scipy, but for the life of me I cannot understand how to extract information from it. The scipy documentation is not very clear and multiple searches have not provided results.
For context, in total I have around 200k data points per set, and they look like this
My final objective is to create a line vector that connects these points through the MST, more or less as they appear in the image above. But for that I need an ordered list of point indices (or coordinates) I can work with.
Most of all, I would need help understanding how to use the output of minimum_spanning_tree, but it might be that I am making mistakes along the way.
Overall steps
The steps I take are:
Create the sparse matrix with coordinate info
Provide the matrix to scipy.sparse.csgraph.minimum_spanning_tree
Do some magic to extract column values
This is the small sample test data:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

test_data = {
    "index": [0, 1, 2, 3, 4],
    "X": [35, 36, 37, 38, 38],
    "Y": [2113, 2113, 2112, 2101, 2102],
}
df = pd.DataFrame(test_data)
Step 1, create the sparse matrix:
xs = df[["X"]].values.squeeze().astype(int)
ys = df[["Y"]].values.squeeze().astype(int)
data = np.array(df.index).squeeze().astype(int)
max_dim = max(np.max(xs), np.max(ys)) + 1
dist_matr = csr_matrix((data, (xs, ys)), shape=(max_dim, max_dim))
Q1: I couldn't understand what data is in this context, as the scipy docs do not explain it in detail. Should data be the labels of the points, or the edge weights?
Step 2: calculate the minimum spanning tree
mst = minimum_spanning_tree(dist_matr)
Step 3: get an ordered list of indices (or coordinates)
As I understand it, the output of MST is a sparse graph that should look something like this (source)
Q2: However, my matrix is not 5x5 but max_value x max_value (2113 in this case), and it seems like the content of the matrix is not the edge weight. Am I getting this wrong?
I have tried to extract the connected components, but the labels don't make sense to me:
# Label connected components.
num_graphs, labels = connected_components(mst, directed=False)

# This is a snippet I found somewhere, but I have difficulty following its logic
results = [[] for i in range(max(labels) + 1)]
for idx, label in enumerate(labels):
    results[label].append(idx)
portion of the results:
As you can see, the point coordinates are grouped in an odd way, without a relationship between x and y. I have also tried depth_first_order, but aside from requiring a starting point (which I wouldn't know how to choose), it gives me equally confusing outputs.
Q4: How do I "read" the MST matrix and extract the minimum spanning tree for all points?
I am happy to explore other solutions as long as they provide a similar result and are scalable; however, I have seen concerns about NetworkX's performance on large data, and MisTree doesn't install on my setup.
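One possible sketch of the intended workflow, assuming the goal is an edge list over point indices. The key point, which also addresses Q1 and Q2, is that csgraph routines treat the matrix entries as edge weights, so the input should be an n_points x n_points matrix of pairwise distances, not a grid indexed by coordinate values:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

# Build an n_points x n_points pairwise-distance matrix for the sample data.
test_data = {"X": [35, 36, 37, 38, 38], "Y": [2113, 2113, 2112, 2101, 2102]}
coords = pd.DataFrame(test_data)[["X", "Y"]].to_numpy(dtype=float)
dist = squareform(pdist(coords))  # dense is fine for a small sample;
                                  # for ~200k points, restrict candidate
                                  # edges first (e.g. with a k-d tree)
mst = minimum_spanning_tree(csr_matrix(dist)).tocoo()

# mst[i, j] holds the weight of a kept edge between points i and j,
# so the COO row/col pairs are the spanning-tree edges over point indices.
edges = list(zip(mst.row, mst.col))
print(edges)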

Plot a matrix as a single point in space

I have a dataset of drugs represented as graphs, each of which is described by three non-square matrices:
edge index (A), a 2xe matrix, where e is the number of bonds in the molecule; the first row indicates the node (atom) from which the edge (bond) starts, and the second row the node where the edge arrives;
node feature matrix (X), an nx9 matrix, where n is the number of atoms in the molecule and 9 is the number of features used to describe them (e.g. atomic number, charge, hybridization);
edge feature matrix (E), a 4xe matrix, where e is the number of bonds in the molecule and 4 is the number of features used to describe them (e.g. type of bond, geometry).
I would like to plot these data in a Cartesian space to see whether clusters form based on their activity label. I thought that if I can reduce each matrix to a single point in space for each graph, I will have three x, y, z coordinates, and then it will be very easy to plot the points. Does this make sense in your opinion? How could I go about turning a matrix into a single point using Python? Finally, I leave you with an example of the graph I would like to create.
Thank you all!
Assuming:
The nodes in a drug's graph represent features that every drug has to different extents, including zero.
The structure of a drug's graph models the extent to which every feature applies to that drug.
There is an algorithm to calculate, from a drug's graph, the 'extent' (a number) to which each feature applies to the drug.
Then:
Construct a table where each row models a drug and each column a feature. Each cell then contains the "extent" to which the column's feature applies to the row's drug.
Apply the K-Means algorithm to the table.
The challenge is, of course, the algorithm that calculates from a drug's graph the 'extent' (a number) of each feature.
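A minimal sketch of the table idea, with made-up extent values (rows are drugs, columns are features; the numbers are purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical "extent" table: 3 drugs x 3 features.
extent_table = np.array([
    [0.0, 1.5, 0.2],
    [0.1, 1.4, 0.0],
    [2.3, 0.0, 0.9],
])
kmeans = KMeans(n_clusters=2, random_state=0).fit(extent_table)
print(kmeans.labels_)  # cluster assignment per drug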
IMHO the first step is to enter your data into a graph theory library. I see you are using Python; Python folks generally use a library called networkx. Are you familiar with this library?
Personally, I much prefer to work with C++ (it gives the performance required for large problem sets). Recently, I added a SMILES parser to my C++ graph library, which makes the following approach possible:
Convert the SMILES representation of each drug to its graph representation
Calculate the graph edit distance (GED, https://en.wikipedia.org/wiki/Graph_edit_distance) between every pair of drugs
LOOP GEDMAX from 1 to 10
    Add a connection between two drugs if the GED is less than GEDMAX. This forms a new graph we can call "GEDgraph"
    Find the components (clusters of drugs all reachable from each other in the GEDgraph)
SELECT "best" set of components

Block Artifact Grid of an Image

I have been reading this paper, where they have used Error Level Analysis (ELA) and Block Artifact Grid (BAG).
I have found a way to do the ELA using skimage from here.
I want a final output like the BAG part shown in the diagram below [look at the BAG output specifically].
But I haven't found a single implementation of BAG in Python. The steps are:
1. Divide the image into 8 x 8 blocks. Take the DCT of the blocks (using an 8 x 8 DCT matrix and matrix multiplication).
2. Make a histogram of the color-quantized DCT values for each of the 64 locations of the blocks (where the number of blocks, and hence the number of values in each histogram, is equal to the number that can fit into the image).
3. Take the Fast Fourier Transform (FFT) of the histogram of each of the 64 frequencies to get the periodicity, then the power spectrum to get peaks.
4. Calculate the number of local minimums of the extrema. This is the estimated Q value.
5. Get a Q estimate for at least 32 Q values, and use it to calculate the block artifact (error in the Q value) for each image block. Output an error map of the image.
What I have tried
I can figure out individual pieces, like taking the DCT and calculating the FFT, but I am unable to put everything together, especially the Q estimate for at least 32 Q values, which does not make sense to me. Thanks in advance.
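As a starting point only, here is a sketch of step 1 (the 8 x 8 block DCT), assuming a grayscale image as a 2D numpy array; the histogram, FFT, and Q-estimation steps from the paper are not reproduced here:

import numpy as np
from scipy.fftpack import dct

def block_dct(img):
    # Crop the image to a multiple of 8 in each dimension.
    h, w = img.shape
    h, w = h - h % 8, w - w % 8
    img = img[:h, :w].astype(float)
    # Reshape into a grid of 8x8 blocks, then apply a 2D DCT-II to each
    # block, as in JPEG compression.
    blocks = img.reshape(h // 8, 8, w // 8, 8).swapaxes(1, 2)
    return dct(dct(blocks, axis=-1, norm="ortho"), axis=-2, norm="ortho")

coeffs = block_dct(np.random.rand(64, 64))
print(coeffs.shape)  # (8, 8, 8, 8): block grid, then 8x8 DCT coefficients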

Simulate speakers around a sphere using superposition - speed improvements needed

Note: Drastic speed improvements since posting, see edits at bottom.
I have some working code, but it over-utilizes loops and I'm pretty sure there should be a faster way of doing it. The size of the output array ends up being pretty large, so when I try to make other arrays the same size as the output, I run out of memory rather quickly.
I am simulating many speakers placed around a sphere, all pointing toward the center. I have a simulation of a single speaker and I would like to leverage it using the principle of superposition. Basically, I want to sum up rotated copies of the single-transducer simulation to get my final result.
I have an axisymmetric simulation of acoustic pressure data in cylindrical coordinates ("polar_coord_r", "polar_coord_z"). The pressure field from the simulation is unique at each R and Z value and completely described by a 2D array ("P_real_RZ").
I want to sum together multiple rotated copies of this pressure field onto a 3D Cartesian output array. Each copy is rotated to a different location on the sphere. Currently, I am specifying the rotation with an x, y, z point because it allows me to do vector math (spherical coordinates wouldn't allow me to do this as elegantly). The output array is rather large (770 × 770 × 804).
I have some working code to get the output from a single copy of the speaker ("transducer"). It takes about 12 seconds for each slice, so it would take over two hours to add each new speaker! I want to have a dozen or so copies of the speaker, so this would take far too long.
The code takes a slice with constant X and computes the R and Z positions at each location in that slice. "r_distance" is a 2D array containing the radial distance from a line passing through the origin and a point ("point"). Similarly, "z_distance" is a 2D array containing the distance along that same line.
To get the pressure for the slice, I find the indices of the closest matching "polar_coord_r" and "polar_coord_z" to the computed R and Z distances. I use these indices to look up the pressure value (from P_real_RZ) to place at each position in the output.
Some definitions:
xx, yy, and zz are 1D arrays describing the slices through the output volume
XXX, YYY, and ZZZ are 3D arrays produced by numpy.meshgrid
point is a point which defines the direction in which the speaker is rotated; basically it's just a position vector of the speaker's center
P_real_RZ is a 2D array which specifies the real pressure at each unique R and Z value
polar_coord_r and polar_coord_z are 1D arrays which define the unique values of R and Z on which P_real_RZ is defined
current_transducer (only one so far is represented in this code) holds the pressure values computed for the current point
output is the result from summing many speakers/transducers together
Any suggestions to speed up this code is greatly appreciated.
Working loop:
for i, x in enumerate(xx):
    # Create a unit vector from the origin to the point
    vector = normalize(point)
    # Create a slice of the Cartesian space with constant X
    xyz_slice = np.array([x * np.ones_like(XXX[i, :, :]), YYY[i, :, :], ZZZ[i, :, :]])
    # Project the position vector of each point of the slice onto the unit vector
    projection = np.array(list(map(np.dot, xyz_slice, vector)))
    # Norm the projection, which gives the Z distance along the line passing through the point
    # z_distance = np.apply_along_axis(np.linalg.norm, 0, projection)  # this was the slow bit
    z_distance = np.linalg.norm(projection, axis=0)
    # Use vector math to determine the distance from the line:
    # each point in the XYZ slice is the sum of the vector along the line and the
    # vector away from the line (radial vector). By extension, the position of the
    # xyz point minus its projection onto the unit vector gives the radial vector;
    # norm it to get the R value everywhere in the slice.
    # r_distance = np.apply_along_axis(np.linalg.norm, 0, xyz_slice - projection)  # this was the slow bit
    r_distance = np.linalg.norm(xyz_slice - projection, axis=0)
    # Map the pressure data to each point in the slice using the R and Z distances
    # with the RZ pressure slice. Look for a more efficient way to do this perhaps;
    # currently takes about 12 seconds per slice.
    r_indices = r_map_v(r_distance)  # 1.3 seconds by itself
    z_indices = z_map_v(z_distance)
    r_indices[r_indices > 384] = 384  # find and clamp indices above the max for r_distance
    z_indices[z_indices > 803] = 803  # find and clamp indices above the max for z_distance
    current_transducer[i, :, :] = P_real_RZ[z_indices, r_indices]
# Sum the mapped pressure data into the output.
output += current_transducer
I have also tried to work with the simulation data in the form of a 3D Cartesian array, that is, the pressure data from the simulation for all x, y, and z values, the same size as the output. I can rotate this 3D array in one direction (not the two rotations needed for speakers arranged on a sphere). This takes up far too much memory and is still painfully slow; I end up getting memory errors with this approach.
Edit: I found a slightly simpler way to do it, but it is still slow. I've updated the code above so that there are no longer nested loops.
I ran a line profiler and the slowest lines by far were the two containing np.apply_along_axis(). I'm afraid I might have to rethink how I do this completely.
Final Edit: I initially had a nested loop which I assumed to be the issue. I don't know what made me think I needed to use apply_along_axis with linalg.norm. In any case that was the issue.
I haven't looked for all the ways you could optimize this code, but this issue jumped out at me: "I ran a line profiler and the slowest lines by far were the two containing np.apply_along_axis()". np.linalg.norm accepts an axis argument, so you can replace the line
z_distance = np.apply_along_axis(np.linalg.norm, 0, projection)
with
z_distance = np.linalg.norm(projection, axis=0)
(and likewise for the other use of np.apply_along_axis and np.linalg.norm).
That should improve the performance a bit.
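Beyond that, one further speed-up worth trying is to replace the per-element r_map_v / z_map_v lookups with a single vectorized np.searchsorted call. A sketch, assuming polar_coord_r and polar_coord_z are sorted 1D arrays:

import numpy as np

def nearest_indices(grid, values):
    # Index of the nearest grid point for every element of `values`.
    idx = np.searchsorted(grid, values)
    idx = np.clip(idx, 1, len(grid) - 1)
    left, right = grid[idx - 1], grid[idx]
    # Step back one index wherever the left neighbour is closer.
    idx -= (values - left < right - values).astype(int)
    return idx

# Hypothetical usage with the arrays from the question:
# r_indices = nearest_indices(polar_coord_r, r_distance)
# z_indices = nearest_indices(polar_coord_z, z_distance)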

Numpy - Finding matches across multiple co-ordinates - Revisited

I'm using somoclu to produce an emergent Self-Organising Map of some data. Once I have the BMUs (Best Matching Units) I'm performing a Delaunay Triangulation on the co-ordinates of the BMUs in order to find each BMU's neighbours in the SOM.
Using the information kindly provided here, I have come up with the following line of Python, which seems awfully messy. Can it be shortened or otherwise tidied up?
points = np.unique(np.array(som.bmus), axis=0)
# Tidy up the line below?
bmu_idxs = np.argwhere((som.bmus[:,None] == points).all(axis=2))[:,1]
points and som.bmus are both two-column int32 numpy arrays where each row is a pair of coordinates. points contains a sorted list of unique points, and the objective is to find the index in that list of each row of som.bmus. The output bmu_idxs from the above is therefore a numpy array with the same number of elements as som.bmus has rows.
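One tidier equivalent, assuming som.bmus is the (n, 2) int array described above, is to let np.unique return the inverse mapping directly, which removes the argwhere comparison entirely:

import numpy as np

# points[bmu_idxs] reconstructs som.bmus row for row.
points, bmu_idxs = np.unique(np.array(som.bmus), axis=0, return_inverse=True)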
