I am trying to coarse-grain a large network into a smaller network using predefined node labels, say:
large_network = np.random.rand(100, 100)
labels = [1, 1, 1, 1,
          5, 5, 5, 5, 5, 5, 5, 5,
          0, 0, 0, 0, 0, ...]  # [1x100]
For example, we have 10 regions, each containing a few nodes.
This is something like the membership list used in networkx's community detection algorithms, which tells which community each node belongs to, except that here I define it manually. I then need to calculate the new, reduced adjacency matrix, say [10x10].
The average weight of the edges between regions A and B, w_{AB} = mean(edges(A, B)), determines the weight of the edge between these two regions.
One way is to loop over the edges of each node and, whenever the two endpoints of an edge fall in the membership lists of two regions, add its weight to the running sum for that pair of regions (roughly sketched below).
Am I doing this right?
Is there a better, more straightforward method?
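For concreteness, here is roughly what I have in mind, looping over region pairs rather than over individual edges (a minimal sketch; the random labels are just a stand-in for my manual membership list):

import numpy as np

N, M = 100, 10
large_network = np.random.rand(N, N)
labels = np.random.randint(0, M, size=N)  # stand-in for the manual membership list

small_network = np.zeros((M, M))
for a in range(M):
    for b in range(M):
        # average weight of all edges whose endpoints lie in regions a and b
        small_network[a, b] = large_network[np.ix_(labels == a, labels == b)].mean()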
You could use coo_matrix from scipy.sparse to do the job for you. The nice thing is that this approach readily extends to sparse network representations (see the sketch after the code below).
import numpy as np
from scipy.sparse import coo_matrix

# set parameters
N = 100  # number of nodes
M = 10   # number of types

# initialise random network and random node labels
weights = np.random.rand(N, N)  # a.k.a. "large_network"
labels = np.random.randint(0, M, size=N)

# get sum of weights by connection type
indices = np.tile(labels, (N, 1))  # create N x N matrix of labels
numerator = coo_matrix((weights.ravel(), (indices.ravel(), indices.transpose().ravel())), shape=(M, M)).todense()

# count number of weights by connection type
adjacency = (weights > 0.).astype(int)
denominator = coo_matrix((adjacency.ravel(), (indices.ravel(), indices.transpose().ravel())), shape=(M, M)).todense()

# normalise sum of weights by counts
small_network = numerator / denominator
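As mentioned above, the same pattern carries over to sparse inputs, since only the stored edges need to be binned. A minimal sketch, using a randomly generated sparse matrix purely as an illustrative stand-in for a sparse network:

from scipy.sparse import random as sparse_random

# hypothetical sparse input: only the non-zero edges are stored
sparse_weights = sparse_random(N, N, density=0.05, format="coo")

row_labels = labels[sparse_weights.row]  # region of each edge's source node
col_labels = labels[sparse_weights.col]  # region of each edge's target node

numerator = coo_matrix((sparse_weights.data, (row_labels, col_labels)), shape=(M, M)).todense()
counts = coo_matrix((np.ones_like(sparse_weights.data), (row_labels, col_labels)), shape=(M, M)).todense()

small_network_sparse = numerator / counts  # mean weight per region pair (NaN where a pair has no stored edges)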
Let's say I have some DataArray:
import numpy as np
import xarray as xr

da = xr.DataArray(
    data=np.random.random((25, 25)),
    dims=["x", "y"],
    coords=dict(
        x=np.arange(25),
        y=np.arange(25),
    ),
)
I want to downsample this array to 5x5 chunks. I can do this with the coarsen function:
da_coarse = da.coarsen(x=5,y=5).mean()
As I understand it, this is basically splitting the DataArray into 5x5 "chunks" and averaging each chunk into one value. However, what I want to do is take a weighted average of this 5x5 group, so the center points are weighted more heavily in the final mean than the edge points.
I can create a gaussian kernel with weights like this:
def gkern(kernlen=21, sig=3):
    """Returns a 2D Gaussian kernel."""
    import scipy.stats as st
    x = np.linspace(-(kernlen/2)/sig, (kernlen/2)/sig, kernlen+1)
    kern1d = np.diff(st.norm.cdf(x))
    kern2d = np.outer(kern1d, kern1d)
    return kern2d / kern2d.sum()

window = gkern(5)
Where window is now a 5x5 array with the desired weights for each point. However, I am unsure how to implement this window/kernel when doing the averaging in the coarsen function.
What is the best way to do this?
One way to do this is through DataArrayCoarsen.construct, which makes it easier to operate on the individual windows:
weights = xr.DataArray(gkern(5), dims=["x_window", "y_window"])

windowed_da = da.coarsen(x=5, y=5).construct(
    x=("x_coarse", "x_window"),
    y=("y_coarse", "y_window"),
)

coarsened = (weights * windowed_da).sum(["x_window", "y_window"]) / weights.sum()
windowed_da is the original DataArray, but reshaped into individual windows of the size specified in the coarsen step.
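As a quick, purely illustrative sanity check: with uniform weights the weighted coarsening should reproduce the plain .mean() result.

# uniform weights over each 5x5 window
uniform = xr.DataArray(np.full((5, 5), 1 / 25), dims=["x_window", "y_window"])
check = (uniform * windowed_da).sum(["x_window", "y_window"]) / uniform.sum()
np.testing.assert_allclose(
    check.transpose("x_coarse", "y_coarse").values,
    da.coarsen(x=5, y=5).mean().values,
)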
I'm working on an assignment where I am tasked to implement PCA in Python for an online course. Unfortunately, when I try to run a comparison (provided by the course) between my implementation and SKLearn's, my results appear to differ too greatly.
After many hours of review, I am still unsure where it is going wrong. If someone could take a look and determine what step I have coded or interpreted incorrectly, I would greatly appreciate it.
def normalize(X):
    """
    Normalize the given dataset X to have zero mean.
    Args:
        X: ndarray, dataset of shape (N,D)
    Returns:
        (Xbar, mean): tuple of ndarray, Xbar is the normalized dataset
        with mean 0; mean is the sample mean of the dataset.
    Note:
        You will encounter dimensions where the standard deviation is zero.
        For those ones, the process of normalization results in normalized data with NaN entries.
        We can handle this by setting the std = 1 for those dimensions when doing normalization.
    """
    # YOUR CODE HERE
    ### Uncomment and modify the code below
    mu = np.mean(X, axis=0)   # Setting axis = 0 will compute means column-wise. Setting it to 1 will compute the mean across rows.
    std = np.std(X, axis=0)   # Computing the std dev column-wise using axis = 0.
    std_filled = std.copy()
    std_filled[std == 0] = 1
    # Compute the normalized data as Xbar
    Xbar = (X - mu) / std_filled
    return Xbar, mu  # std_filled
def eig(S):
    """
    Compute the eigenvalues and corresponding unit eigenvectors for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix
    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors
    Note:
        the eigvals and eigvecs should be sorted in descending
        order of the eigenvalues
    """
    # YOUR CODE HERE
    # Uncomment and modify the code below
    # Compute the eigenvalues and eigenvectors
    # You can use library routines in `np.linalg.*` https://numpy.org/doc/stable/reference/routines.linalg.html for this
    eigvals, eigvecs = np.linalg.eig(S)
    # The eigenvalues and eigenvectors need to be sorted in descending order according to the eigenvalues
    # We will use `np.argsort` (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) to find a permutation of the indices
    # of eigvals that will sort eigvals in ascending order, and then reverse it via [::-1] to get descending order
    sort_indices = np.argsort(eigvals)[::-1]
    # Notice that we are sorting the columns (not rows) of eigvecs since the columns represent the eigenvectors.
    return eigvals[sort_indices], eigvecs[:, sort_indices]
def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by the columns of `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace
    Returns:
        P: the projection matrix
    """
    # YOUR CODE HERE
    P = B @ (np.linalg.inv(B.T @ B)) @ B.T
    return P
def select_components(eig_vals, eig_vecs, num_components):
    """
    Selects the n components desired for projecting the data upon.
    Args:
        eig_vals: The eigenvalues sorted in descending order of magnitude.
        eig_vecs: The eigenvectors sorted in order relative to that of the eigenvalues.
        num_components: the number of principal components to use.
    Returns:
        The principal values and principal components to keep for projecting the data.
    """
    principal_vals, principal_components = eig_vals[:num_components], eig_vecs[:, :num_components]
    return principal_vals, principal_components
def PCA(X, num_components):
    """
    Projects normalized data onto the 'n' desired principal components.
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        the reconstructed data, the sample mean of X, the principal values
        and the principal components
    """
    # Normalize to have mean 0 and variance 1.
    Z, mean_vec = normalize(X)
    # Calculate the covariance matrix
    S = np.cov(Z, rowvar=False, bias=True)  # rowvar=False treats columns as variables; bias=True normalizes by N instead of N-1
    # Calculate the (unit) eigenvectors and eigenvalues of S, sorted in descending order of the magnitude of the eigenvalues.
    eig_vals, eig_vecs = eig(S)
    # Keep only the n largest principal components of the sorted unit eigenvectors.
    principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
    # Compute the projection matrix using only the n largest principal components, where n = num_components.
    # P = projection_matrix(eig_vecs[:, :num_components])
    P = projection_matrix(principal_components)
    # Reconstruct the data by using the projection matrix to project the data onto the principal component vectors we've kept
    X_reconst = (P @ X.T).T
    return X_reconst, mean_vec, principal_vals, principal_components
And here is the test case I'm supposed to pass:
random = np.random.RandomState(0)
X = random.randn(10, 5)

from sklearn.decomposition import PCA as SKPCA

for num_component in range(1, 4):
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver="full")
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(X))
    reconst, _, _, _ = PCA(X, num_component)
    # The difference in the result should be very small (< 10^-20)
    print(
        "difference in reconstruction for num_components = {}: {}".format(
            num_component, np.square(reconst - sklearn_reconst).sum()
        )
    )
    np.testing.assert_allclose(reconst, sklearn_reconst)
As far as I can tell, there are a few things wrong with your code.
Your projection matrix is wrong.
If the matrix of eigenvectors of your covariance matrix is B, with dimension D x M, where M is the number of components you select and D is the dimension of the original data, then the projection matrix is just B @ B.T.
In the standard implementation of PCA, we typically do not scale the data by the inverse of the standard deviation. You seem to be attempting an approximation of whitened PCA (ZCA), but even then it looks wrong.
As a quick test, compute the normalized data without dividing by the standard deviation, and when you compute the covariance matrix, set bias=False.
You should also subtract the mean from the data before multiplying it by the projection operator, and add it back afterwards, i.e.,
X_reconst = (P @ (X - mean_vec).T).T + mean_vec.
PCA is essentially just a change of basis, followed by discarding the coordinates corresponding to directions with low variance. The eigenvectors of the covariance matrix correspond to the new orthogonal basis, and the eigenvalues tell you the variance of the data along the direction of the corresponding eigenvectors. P = B @ B.T is just a change of basis to the new (truncated) basis B, followed by a change back to the original basis.
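Putting those fixes together, a minimal sketch of the corrected pipeline might look like the following (illustrative only; the helper name pca_reconstruct is just for this sketch, not your course's reference solution):

import numpy as np

def pca_reconstruct(X, num_components):
    # Center the data only; no scaling by the standard deviation.
    mean_vec = X.mean(axis=0)
    Z = X - mean_vec
    # Covariance with the default N-1 normalisation (bias=False).
    S = np.cov(Z, rowvar=False, bias=False)
    # eigh is appropriate for a symmetric matrix and returns real eigenpairs.
    eig_vals, eig_vecs = np.linalg.eigh(S)
    order = np.argsort(eig_vals)[::-1]
    B = eig_vecs[:, order[:num_components]]  # top eigenvectors as columns
    P = B @ B.T                              # projection onto their span
    # Project the centered data and add the mean back.
    return (P @ Z.T).T + mean_vec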
Edit
I'm curious to know which online course teaches people to implement PCA this way.
I am performing a binary classification of a partially labeled dataset. I have a reliable estimate of its 1's, but not of its 0's.
From sklearn KMeans documentation:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
I would like to pass an ndarray, but I only have 1 reliable centroid, not 2.
Is there a way to maximize the entropy between the first K-1 centroids and the Kth? Alternatively, is there a way to manually initialize K-1 centroids and use k-means++ for the remaining one?
=======================================================
Related questions:
This seeks to define K centroids with n-1 features. (I want to define k-1 centroids with n features).
Here is a description of what I want, but it was interpreted as a bug by one of the developers and is "easily implement[able]".
I'm reasonably confident this works as intended, but please correct me if you spot an error (cobbled together from GeeksforGeeks):
import sys
import numpy as np
import pandas as pd

def distance(p1, p2):
    # squared Euclidean distance
    return np.sum((p1 - p2)**2)

def find_remaining_centroid(data, known_centroids, k=1):
    '''
    Initializes the remaining centroids for K-means++
    inputs:
        data - Numpy array containing the feature space
        known_centroids - Numpy array containing the location of one or multiple known centroids
        k - number of remaining centroids to be found
    '''
    n_points = data.shape[0]

    # Perform casting if necessary
    if isinstance(data, pd.DataFrame):
        data = np.array(data)

    # Initialize centroids list from the known centroids
    if known_centroids.ndim > 1:
        centroids = [cent for cent in known_centroids]
    else:
        centroids = [np.array(known_centroids)]

    # Compute the remaining k centroids
    for c_id in range(k):
        ## distance of each data point from its nearest centroid found so far
        dist = np.empty(n_points)
        for i in range(n_points):
            point = data[i, :]
            d = sys.maxsize
            ## compute distance of 'point' from each of the previously
            ## selected centroids and keep the minimum distance
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist[i] = d
        ## select the data point with maximum distance as the next centroid
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)

    return centroids[-k:]
Its usage:
# For finding a third centroid:
third_centroid = find_remaining_centroid(X_train, np.array([presence_seed, absence_seed]), k = 1)
# For finding the second centroid:
second_centroid = find_remaining_centroid(X_train, presence_seed, k = 1)
Where presence_seed and absence_seed are known centroid locations.
I want to compute the distance from a set of N 3D-points to a set of M 3D-centers and store the results in a NxM matrix (where column i is the distance from all points to center i)
Example:
data = np.random.rand(100,3) # 100 toy 3D points
centers = np.random.rand(20,3) # 20 toy 3D points
For computing the distance between all points and a single center we can use "broadcasting", so we avoid looping through all points:
i = 0 # first center
np.sqrt(np.sum(np.power(data - centers[i,:], 2),1)) # Euclidean distance
Now we can put this code in a loop that iterates over all centers:
distances = np.zeros((data.shape[0], centers.shape[0]))
for i in range(centers.shape[0]):
    distances[:, i] = np.sqrt(np.sum(np.power(data - centers[i, :], 2), 1))
However this is clearly an operation that could be parallelized and improved.
I'm wondering if there is a better way of doing this (maybe some multi-dimensional broadcasting or some library).
This is a very common problem in clustering and classification, where you want to get the distances from your data to a set of classes, so I think there should be an efficient implementation for this.
What's the best way of doing this?
Broadcast all the way:
import numpy as np
data = np.random.rand(100,3)
centers = np.random.rand(20,3)
distances = np.sqrt(np.sum(np.power(data[:, None, :] - centers[None, :, :], 2), axis=-1))
print(distances.shape)
# (100, 20)
If you just want the nearest center for each data point, and you have a lot of data points (more than several hundred thousand samples), you should probably store the centers in a k-d tree and query it with your data points (scipy.spatial.KDTree).
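For completeness, a minimal sketch of that k-d tree route, using cKDTree (the faster C implementation of KDTree):

from scipy.spatial import cKDTree

tree = cKDTree(centers)                              # build the tree on the (few) centers
dist_to_nearest, nearest_center = tree.query(data)   # nearest center index for every data point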
I am using SciPy's hierarchical agglomerative clustering methods to cluster an m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code:
Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method = "average", metric = "euclidean")
T = hierarchy.fcluster(Z, 100, criterion = "maxclust")
I am taking my matrix of features, computing the Euclidean distance between them, and then passing them on to the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters.
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
A possible solution is a function that returns a codebook with the centroids, like kmeans in scipy.cluster.vq does. The only things you need are the partition vector with the flat cluster assignments (part) and the original observations X:
def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment

    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features
    part : array, (n)
        Partition vector. p[n]=c is the cluster assigned to observation n

    Returns
    -------
    codebook : array, (k, d)
        Returns a k x d codebook with k centroids
    """
    codebook = []
    for i in range(part.min(), part.max() + 1):
        codebook.append(X[part == i].mean(0))
    return np.vstack(codebook)
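For the clustering above, this can be called directly with the flat cluster labels (assuming features is the original observation matrix):

centroids = to_codebook(features, T)  # one centroid (row) per flat cluster in T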
You can do something like this (D=number of dimensions):
# Sum the vectors in each cluster
lens = {}       # will contain the observation counts for each cluster
centroids = {}  # will contain the centroids of each cluster
for idx, clno in enumerate(T):
    centroids.setdefault(clno, np.zeros(D))
    centroids[clno] += features[idx, :]
    lens.setdefault(clno, 0)
    lens[clno] += 1
# Divide by number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
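If an array is more convenient than a dictionary, the centroids can be stacked in cluster-number order, for example:

# Illustrative: collect the per-cluster centroids into a (k, D) array, ordered by cluster number
centroid_matrix = np.vstack([centroids[clno] for clno in sorted(centroids)])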