I'm working on graph analysis. I want to compute an N by N similarity matrix that contains the Adamic Adar similarity between every two vertices. To give an overview of Adamic Adar let me start with this introduction:
Given the adjacency matrix A of an undirected graph G, CN is the set of all common neighbors of two vertices x and y. A common neighbor of two vertices is a node that both vertices have an edge/link to, i.e. both vertices will have a 1 in A for the corresponding common-neighbor node. k_n is the degree of node n.
Adamic-Adar is defined as the following:
A(x, y) = sum over n in CN(x, y) of 1 / log(k_n)
My attempt to compute it is to fetch both rows of the x and y nodes from A and sum them, then look for the elements that have 2 as the value, get their degrees, and apply the equation. However, computing that takes a really long time. I tried it with a graph that contains 1032 vertices, let it run for about 7 minutes, and then cancelled the computation. So my question: is there a better algorithm to compute it?
Here's my code in python:
def aa(graph):
    """
    Calculates the Adamic-Adar index.
    """
    N = graph.num_vertices()
    A = gts.adjacency(graph)
    S = np.zeros((N,N))
    degrees = get_degrees_dic(graph)
    for i in xrange(N):
        A_i = A[i]
        for j in xrange(N):
            if j != i:
                A_j = A[j]
                intersection = A_i + A_j
                common_ns_degs = list()
                for index in xrange(N):
                    if intersection[index] == 2:
                        cn_deg = degrees[index]
                        common_ns_degs.append(1.0/np.log10(cn_deg))
                S[i,j] = np.sum(common_ns_degs)
    return S
Since you're using numpy, you can really cut down on your need to iterate for every operation in the algorithm. My numpy and vectorization fu aren't the greatest, but the code below runs in around 2.5 s on a graph with ~13,000 nodes:
def adar_adamic(adj_mat):
    """Computes the Adamic-Adar similarity matrix for an adjacency matrix"""
    Adar_Adamic = np.zeros(adj_mat.shape)
    for row in adj_mat:
        AdjList = row.nonzero()[0]  # column indices with nonzero values
        k_deg = len(AdjList)
        if k_deg < 2:
            continue  # a node with fewer than 2 neighbours is never a common neighbour
        d = 1.0/np.log(k_deg)  # this node's weight as a common neighbour (1/log k per the definition)
        # add this node's weight to the entry of every pair of its neighbours
        for i in xrange(len(AdjList)):
            for j in xrange(len(AdjList)):
                if AdjList[i] != AdjList[j]:
                    cell = (AdjList[i], AdjList[j])
                    Adar_Adamic[cell] = Adar_Adamic[cell] + d
    return Adar_Adamic
Unlike MBo's answer, this does build the full, symmetric matrix, but the inefficiency (for me) was tolerable, given the execution time.
I believe you are using a rather slow approach. It would be better to reverse it:
- initialize the AA (Adamic-Adar) matrix with zeros
- for every node k get its degree k_deg
- calc d = 1.0/log(k_deg) (why log10 in your code - is it important or not?)
- add d to all AA[i,j], where i, j are all pairs of 1s in the kth row of the adjacency matrix
Edit:
- for sparse graphs it is useful to extract the positions of all 1s in the kth row into a list, to reach O(V*(V+E)) complexity instead of O(V^3)
AA = np.zeros((N,N))
for k = 0 to N - 1 do
    AdjList = []
    for j = 0 to N - 1 do
        if A[k, j] = 1 then
            AdjList.Add(j)
    k_deg = AdjList.Length
    d = 1 / log(k_deg)
    for j = 0 to AdjList.Length - 2 do
        for i = j + 1 to AdjList.Length - 1 do
            AA[AdjList[i], AdjList[j]] = AA[AdjList[i], AdjList[j]] + d
//half of matrix filled, it is symmetric for undirected graph
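A direct Python translation of that pseudocode might look like this (my sketch; it assumes A is a dense 0/1 NumPy array and uses the natural log for the 1/log(k) weight):
import numpy as np

def adamic_adar_mbo(A):
    """Fill the lower triangle of the AA matrix following the pseudocode above."""
    N = A.shape[0]
    AA = np.zeros((N, N))
    for k in range(N):
        adj = np.flatnonzero(A[k])      # neighbours of node k
        if len(adj) < 2:                # node k can never be a common neighbour
            continue
        d = 1.0 / np.log(len(adj))      # weight contributed by node k
        for j in range(len(adj) - 1):
            for i in range(j + 1, len(adj)):
                AA[adj[i], adj[j]] += d # lower triangle only; the matrix is symmetric
    return AA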
I don't see a way of reducing the time complexity, but it can be vectorized:
degrees = A.sum(axis=0)
weights = 1.0/np.log10(degrees)   # Adamic-Adar weight of every node
adamic_adar = (A*weights).dot(A.T)
With A a regular Numpy array. It seems you're using graph_tool.spectral.adjacency and thus A would be a sparse matrix. In that case the code would be:
from scipy.sparse import csr_matrix
degrees = A.sum(axis=0)
weights = csr_matrix(1.0/np.log10(degrees))
adamic_adar = A.multiply(weights) * A.T
This is much faster than using Python loops. A small warning though: with this approach you really need to make sure that the values on the main diagonal (of A and adamic_adar) are what you expect them to be. Also, A must not contain weights, but only zeros and ones, and note that a node of degree 1 produces an infinite weight (1/log10(1)), which will poison the result, so handle such nodes separately.
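As a small sanity check of the dense version (my own example graph, where every node has degree at least 2; the diagonal is cleared explicitly):
import numpy as np

# a tiny undirected graph with edges 0-1, 0-2, 0-3, 1-2, 2-3
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

degrees = A.sum(axis=0)
weights = 1.0 / np.log10(degrees)
adamic_adar = (A * weights).dot(A.T)
np.fill_diagonal(adamic_adar, 0)   # each node's self-similarity is usually discarded
# e.g. adamic_adar[0, 1] equals 1/log10(3), the weight of their single common neighbour, node 2
print(adamic_adar)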
I believe there must be a function for node similarity (Adamic-Adar included) in python-igraph as well, like the one defined in R's igraph.
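If I remember correctly, python-igraph calls this measure the "inverse log-weighted similarity"; a quick sketch (treat the exact method name as an assumption and check the docs of your igraph version):
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=1000, p=0.01)
# similarity_inverse_log_weighted() weights each common neighbour by 1/log(degree),
# i.e. the Adamic-Adar style weighting; it returns a full N x N list of lists
sim = g.similarity_inverse_log_weighted()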
Related
I am looping through a large number of H x W matrices. I cannot store them all in memory. I need to produce N matrices from them. For example, the element of the first of the N matrices in position (i, j) will be the largest among all elements in position (i, j) of all processed matrices. For the second of the N matrices, the second-largest elements will be taken, and so on.
Example: let N = 2. Then the first matrix holds the element-wise maxima over all processed matrices, and the second matrix holds the element-wise second-largest values.
How to do such an operation inside a loop so as not to store all matrices in memory?
The comments suggested using the np.partition function. I replaced numpy with cupy, which runs on the GPU, and also added a buffer to sort less frequently.
import cupy as np

buf = ...  # as much as fits into the GPU
largests = np.zeros((buf + N, h, w))
for i in range(num):
    val = ...  # the next H x W matrix from the stream
    largests[i % buf] = val
    if i % buf == buf - 1:
        largests.partition(range(buf, buf + N), axis=0)
largests.partition(range(buf, buf + N), axis=0)  # let's not forget the tail
res = largests[:-(N + 1):-1]
The solution does not work very quickly, but I have come to terms with this speed.
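For reference, here is a CPU-only NumPy sketch of the same buffered np.partition idea, with made-up sizes (every name here is illustrative, not from the original post):
import numpy as np

H, W = 64, 64    # matrix size (illustrative)
N = 2            # how many "largest" matrices to keep
buf = 16         # how many incoming matrices to buffer between partitions
num = 1000       # length of the stream

# buf slots for incoming matrices plus N slots that always hold the current top N;
# initialize with -inf so that empty slots never win, even for negative data
largests = np.full((buf + N, H, W), -np.inf)

for i in range(num):
    val = np.random.randn(H, W)   # stand-in for the next streamed matrix
    largests[i % buf] = val
    if i % buf == buf - 1:
        # move the N largest values per position (i, j) into the last N slices
        largests.partition(range(buf, buf + N), axis=0)
largests.partition(range(buf, buf + N), axis=0)   # flush whatever is left in the buffer

res = largests[:-(N + 1):-1]   # res[0] = element-wise maximum, res[1] = second largest, ...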
You are given a grid with n rows and m columns, where two cells are called adjacent if they share a common side.
Let two adjacent cells be a and b. Since they are adjacent, you may go both from a to b and from b to a.
In terms of graph theory, if the grid is modelled as a directed graph, then there exists a directed edge from a to b and also from b to a whenever cells a and b are adjacent.
You are asked to find the number of directed edges in the graph.
Input Format
The first line of the input contains a single integer T, denoting the number of test cases.
Then T lines follow, each containing two space-separated integers n and m, the dimensions of the grid.
Sample Input 0
1
3 3
Sample Output 0
24
Explanation 0
Number of the directed edges is 24.
Is this approach correct? My code passes the sample test case but fails for others.
def compute(m,n):
    arr = [[0 for x in range(n)] for y in range(m)]
    arr[0][0]=2
    arr[n-1][m-1]=2
    arr[0][m-1]=2
    arr[n-1][0]=2
    for i in range (1,m-1):
        arr[i][0]=3
        arr[i][n-1]=3
    for j in range (1,n-1):
        arr[0][j]=3
        arr[n-1][j]=3
    for i in range (1,n-2):
        for j in range (1,m-2):
            arr[i][j]=4
    return sum(sum(arr,[])) +4
Please explain the correct approach for this problem. Thanks in advance.
For a grid of n rows and m columns: the number of shared sides within any row is m-1, and within any column it is n-1. Every shared side contributes two directed edges to the graph of adjacent cells.
Therefore the number of edges for an n*m grid is:
def compute(n, m):
    return n * (m - 1) * 2 + m * (n - 1) * 2
Or, simplified even further:
def compute(n, m):
    return 4 * n * m - 2 * n - 2 * m
Your algorithm goes and fills in the individual edges to sum them at the end, which is far more complicated than it needs to be for this problem without additional constraints.
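If you do want to keep the per-cell counting idea from the question, a sketch with the row/column indices kept consistent (my own illustration) could look like this:
def compute_by_counting(n, m):
    """Count the neighbours of every cell in a grid with n rows and m columns."""
    neighbours = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i > 0:
                neighbours[i][j] += 1   # neighbour above
            if i < n - 1:
                neighbours[i][j] += 1   # neighbour below
            if j > 0:
                neighbours[i][j] += 1   # neighbour to the left
            if j < m - 1:
                neighbours[i][j] += 1   # neighbour to the right
    return sum(map(sum, neighbours))    # each adjacency counted once per direction

# compute_by_counting(3, 3) == 24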
I think you can solve this with dynamic programming.
number of edges in m*n = number of edges in (m-1)*n + to_be_calculated
The amount of to_be_calculated is simply 2*n + 2*(n-1): the new column adds n horizontal adjacencies to the previous column and n-1 vertical adjacencies within itself, each worth two directed edges. After you have finished with the columns and reached m == 1, you can reduce n to 1 in the same way.
def compute(n,m,sum):
    if n == 1 and m == 1:
        return sum
    elif m == 1:
        return compute(1, 1, sum + 2*(n-1))
    else:
        return compute(n, m-1, sum + 2*n + 2*(n-1))

compute(5,5,0)  # for a 5x5 table
You can find a formula to compute the number of edges in the graph as follows:
Suppose we have a grid with dimensions n and m. For each cell we need to count the number of neighbor cells. Then, the summation of such numbers is the number of edges.
Case 1) such a grid has 4 corner cells, each with 2 neighbors; total neighbors, case 1: 4*2 = 8
Case 2) such a grid has 2(n+m-2)-4 cells on its sides, excluding the corners, each with 3 neighbors; total neighbors, case 2: (2(n+m-2)-4)*3
Case 3) such a grid has nm-(2(n+m-2)-4)-4 inner cells, each with 4 neighbors; total neighbors, case 3: (nm-(2(n+m-2)-4)-4)*4
Total number of edges = Case 1 + Case 2 + Case 3 = 8 + (2(n+m-2)-4)*3 + (nm-(2(n+m-2)-4)-4)*4 = 4nm - 2(n+m)
The figure below displays all the cases:
So you can use the code below to compute the number of edges:
def compute_total_edges(m,n):
    return 4*n*m - 2*(n+m)

print(compute_total_edges(3,3))
I'm creating N_MC paths of simulated stock prices S with n points in each path, excluding the initial point. The algorithm to do so is recursive on the previous value of the stock price, for a given path. Here's what I have now:
import numpy as np
import time
N_MC = 1000
n = 10000
S = np.zeros((N_MC, n+1))
S0 = 1.0
S[:, 0] = S0
start_time_normals = time.clock()
Z = np.exp(np.random.normal(size=(N_MC, n)))
print "generate normals time = ", time.clock() - start_time_normals
start_time_prices = time.clock()
for i in xrange(N_MC):
    for j in xrange(1, n+1):
        S[i, j] = S[i, j-1]*Z[i, j-1]
print "prices time = ", time.clock() - start_time_prices
The times were:
generate normals time = 1.07
prices time = 9.98
Is there a much more efficient way to generate the arrays S, perhaps using Numpy's routines? It would be nice if the normal random variables Z could be generated more quickly, too, but I'm not as hopeful.
It's not necessary to loop over 'paths', because they're independent of each other. So, you can remove the outer loop for i in xrange(N_MC) and just operate on entire columns of S and Z.
For accelerating the recursive computation, let's just consider a single 'path'. Say z is a vector containing the random values at each timestep (all known ahead of time). s is a vector that should contain the output at each timestep. s0 is the initial output at time zero. j is time.
Your code defines the output recursively:
s[j] = s[j-1]*z[j-1]
Let's expand this:
s[1] = s[0]*z[0]
s[2] = s[1]*z[1]
     = s[0]*z[0]*z[1]
s[3] = s[2]*z[2]
     = s[0]*z[0]*z[1]*z[2]
s[4] = s[3]*z[3]
     = s[0]*z[0]*z[1]*z[2]*z[3]
Each output s[j] is given by s[0] times the product of the random values from 0 to j-1. You can calculate cumulative products like this using numpy.cumprod(), which should be much more efficient than looping:
s = np.concatenate(([s0], s0 * np.cumprod(z[0:-1])))
You can use the axis parameter for operating along one dimension of a matrix (e.g. for doing this in parallel across 'paths').
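Applied to the arrays from the question (a sketch; S and Z keep their shapes (N_MC, n+1) and (N_MC, n)), the whole double loop collapses to:
import numpy as np

N_MC, n, S0 = 1000, 10000, 1.0
Z = np.exp(np.random.normal(size=(N_MC, n)))

S = np.empty((N_MC, n + 1))
S[:, 0] = S0
# cumulative product along each path replaces the recursive inner loop
S[:, 1:] = S0 * np.cumprod(Z, axis=1)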
Consider points Y given in increasing order in [0, T). We are to consider these points as lying on a circle of circumference T. Now consider points X, also from [0, T) and also lying on a circle of circumference T.
We say the distance between X and Y is the sum of the absolute distances between each point in X and its closest point in Y, recalling that both sets are considered to lie on a circle. Write this distance as Delta(X, Y).
I am trying to find a quick way of approximating the distribution of this distance over all possible rotations of X. I currently do this by Monte Carlo simulation. First, here is my code to make some fake data.
import random
import numpy as np
from bisect import bisect_left
def simul(rate, T):
    time = np.random.exponential(rate)
    times = [0]
    newtime = times[-1]+time
    while (newtime < T):
        times.append(newtime)
        newtime = newtime+np.random.exponential(rate)
    return times[1:]
Now the code to find the distance between the two point sets on the circle.
def takeClosest(myList, myNumber, T):
    """
    Assumes myList is sorted. Returns closest value to myNumber in a circle of circumference T.
    If two numbers are equally close, return the smallest number.
    """
    pos = bisect_left(myList, myNumber)
    if (pos == 0 and myList[pos] != myNumber):
        before = myList[pos - 1] - T
        after = myList[0]
    elif (pos == len(myList)):
        before = myList[pos-1]
        after = myList[0] + T
    else:
        before = myList[pos - 1]
        after = myList[pos]
    if after - myNumber < myNumber - before:
        return after
    else:
        return before

def circle_dist(timesY, timesX):
    dist = 0
    for t in timesX:
        closest_number = takeClosest(timesY, t, T)
        dist += np.abs(closest_number - t)
    return dist
Now the main code to make the data and to try 100 different random rotations.
T = 50000
timesX = simul(1, T)
timesY = simul(10, T)
dists=[]
iters = 100
for i in xrange(iters):
    offset = np.random.randint(0,T)
    timesX = [(t+offset) % T for t in timesX]
    dists.append(circle_dist(timesY, timesX))
We can now print out any statistics we like of the distances. I am particularly interested in the variance.
print "Variance is ", np.var(dists)
Unfortunately I need to do this a lot, and it currently takes around 16 seconds. I find it a little surprising that it is so slow. Any suggestions for how to speed it up gratefully received.
Edit 1. Reduced the number of iterations to 100 (the previous value didn't correspond to my timings correctly). This now takes around 16 seconds on my computer.
Edit 2. Fixed bug in takeClosest
EDIT: I've just noticed that performance optimization is a little premature, because the expression closest_number - t is not a valid implementation of any definition of a distance on a "circle" - that is only a distance on an open-ended line
sample test case (pseudocode):
T = 10
X = [1, 2]
Y = [9]
dist(X, Y) = dist(1, 9) + dist(2, 9)
dist_on_line = 8 + 7 = 15
dist_on_circle = 2 + 3 = 5
Note that the definition of the circle [0,10) implies that dist(0, 10) is not defined, but in the limit it approaches 0: lim(dist(0, t), t->10) = 0
A correct implementation of a distance on a circle would be:
dist_of_t = min(t - closest_number_before_t,
closes_number_after_t - t,
T - t + closes_number_before_t,
T - closest_number_after_t + t)
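A minimal Python version of that check (my sketch, using the equivalent identity min(d, T - d)):
def circle_point_dist(t, y, T):
    """Distance between two points t and y on a circle of circumference T."""
    d = abs(t - y) % T          # distance measured along the line, wrapped into [0, T)
    return min(d, T - d)        # take the shorter way around the circle

# circle_point_dist(1, 9, T=10) == 2
# circle_point_dist(2, 9, T=10) == 3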
Original answer:
You could rotate and iterate over timesY instead of timesX, since that array is an order of magnitude smaller - doing a bisect_left into timesX is negligible (O(log n)) compared to iterating over all the elements (O(n)).
But IMHO the real slowdown is because of Python's dynamic typing (each of the ~50000 items in timesX has to be checked for type compatibility every time you compare it to some other value) => converting timesX and timesY to numpy arrays should help, and if that is not enough, CPU acceleration (cython, numba, ...) is the thing you need.
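For example, here is a vectorized sketch of the distance computation with np.searchsorted that also uses the proper circular distance from the edit above (my own illustration; T is passed explicitly):
import numpy as np

def circle_dist_vec(timesY, timesX, T):
    """Sum of circular distances from each point in timesX to its nearest point in timesY."""
    Y = np.sort(np.asarray(timesY))
    X = np.asarray(timesX)
    pos = np.searchsorted(Y, X)            # index of the first Y >= x, for every x at once
    before = Y[(pos - 1) % len(Y)]         # wraps to the last point when pos == 0
    after = Y[pos % len(Y)]                # wraps to the first point when pos == len(Y)
    d_before = np.abs(X - before)
    d_after = np.abs(after - X)
    # take the shorter way around the circle for both candidate neighbours
    d = np.minimum(np.minimum(d_before, T - d_before),
                   np.minimum(d_after, T - d_after))
    return d.sum()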
The function circle_dist can be replaced by a one-liner. So you can plug it into your outer for i loop:
sum(abs(takeClosest(timesY, t, T) - t) for t in timesX)
Furthermore, you should always - if possible - allocate arrays like dists in one step and avoid appending elements many thousand times.
But, unfortunately, both improvements only save a few percent of computing time.
Edit 1: Replacing np.abs(...) with abs(...) decreases computing time by 50 % on my machine (on a reduced data set)!
Edit 2: Updated the one-liner according to Aprillion's comment.
Suppose we want to compute C=A*B for given sparse matrices A,B but are interested in a very small subset of entries of C, represented by a list of index pairs:
rows=[i1, i2, i3 ... ]
cols=[j1, j2, j3 ... ]
Both A and B are quite large (say 50Kx50K), but very sparse (<1% of entries is non-zero).
How can we compute this subset of the multiplication?
Here's a naive implementation that works really slow:
def naive(A, B, rows, cols):
    N = len(rows)
    vals = []
    for n in xrange(N):
        v = A.getrow(rows[n]) * B.getcol(cols[n])
        vals.append(v[0, 0])
    R = sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))), shape=(A.shape[0], B.shape[1]), dtype=np.float64)
    return R
Even for small matrices this is quite bad:
import scipy.sparse as sps
import numpy as np
D = 1000
A = np.random.randn(D, D)
A[np.abs(A) > 0.1] = 0
A = sps.csr_matrix(A)
B = np.random.randn(D, D)
B[np.abs(B) > 0.1] = 0
B = sps.csr_matrix(B)
X = np.random.randn(D, D)
X[np.abs(X) > 0.1] = 0
X[X != 0] = 1
X = sps.csr_matrix(X)
rows, cols = X.nonzero()
naive(A, B, rows, cols)
On my machine, naive() finishes after 1 minute, and most of the effort is spent on structuring the rows/cols (in getrow(), getcol()).
Of course, converting this (very small) example to dense matrices, the computation takes about 100ms:
A0 = np.array(A.todense())
B0 = np.array(B.todense())
X0 = np.array(X.todense())
A0.dot(B0) * X0
Any thoughts on how to efficiently compute such matrix multiplication?
Note: This question is almost identical to the following question:
Subset of a matrix multiplication, fast, and sparse
However, there A and B are full matrices, and one of the dimensions is very low (say, 10); the proposed solutions seem to benefit from both of these facts.
The format of your sparse matrices is important here. You always need a row from A and a column from B. So, store A as csr and B as csc to get rid of the getrow/getcol overhead. Unfortunately, this is only a small part of the story.
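As a quick sketch of just that change (my illustration, reusing the naive routine from the question):
import numpy as np
import scipy.sparse as sps

def naive_csr_csc(A, B, rows, cols):
    """Same as naive(), but with A in CSR and B in CSC so row/column slicing is cheap."""
    A = sps.csr_matrix(A)   # fast row slicing
    B = sps.csc_matrix(B)   # fast column slicing
    vals = [(A[rows[n], :] * B[:, cols[n]])[0, 0] for n in range(len(rows))]
    return sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))),
                          shape=(A.shape[0], B.shape[1]), dtype=np.float64)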
The best solution depends a lot on the structure of your sparse matrix (a lot of sparse columns/rows, etc.), but you might try one based on dictionaries and sets. For matrix A, the following are kept for each row:
a set with all non-zero column indices on that row
a dictionary with the non-zero indices as keys and the corresponding non-zero values as values
For matrix B similar dicts and sets are kept for each column.
To calculate element (M, N) in the multiplication result, row M of A is multiplied with column N of B. The multiplication:
find the set intersection of the non-zero sets
calculate the sum of multiplications of the non-zero elements (i.e. the intersection above)
In most cases this should be very fast, as in a sparse matrix the set intersection is usually very small.
Some code:
class rowarray():
    def __init__(self, arr):
        self.rows = []
        for row in arr:
            nonzeros = np.nonzero(row)[0]
            nzvalues = { i: row[i] for i in nonzeros }
            self.rows.append((set(nonzeros), nzvalues))

    def __getitem__(self, key):
        return self.rows[key]

    def __len__(self):
        return len(self.rows)

class colarray(rowarray):
    def __init__(self, arr):
        rowarray.__init__(self, arr.T)
def maybe_less_naive(A, B, rows, cols):
    N = len(rows)
    vals = []
    for n in xrange(N):
        nz1, v1 = A[rows[n]]
        nz2, v2 = B[cols[n]]
        # set of common non-zeros
        nz = nz1.intersection(nz2)
        # sum of products over the common non-zeros
        vals.append(sum([v1[i]*v2[i] for i in nz]))
    R = sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))), shape=(len(A), len(B)), dtype=np.float64)
    return R
D = 1000
Ap = np.random.randn(D, D)
Ap[np.abs(Ap) > 0.1] = 0
A = rowarray(Ap)
Bp = np.random.randn(D, D)
Bp[np.abs(Bp) > 0.1] = 0
B = colarray(Bp)
X = np.random.randn(D, D)
X[np.abs(X) > 0.1] = 0
X[X != 0] = 1
X = sps.csr_matrix(X)
rows, cols = X.nonzero()
maybe_less_naive(A, B, rows, cols)
This is a bit more efficient: the multiplication takes approximately 2 seconds for the test (80,000 elements). The results seem to be essentially the same.
A few comments on the performance.
There are two operations performed for each output element:
set intersection
multiplication
The complexity of the set intersection should be O(min(m,n)), where m and n are the numbers of non-zeros in each operand. This is independent of the size of the matrix; only the average number of non-zeros per row/column matters.
The number of multiplications (and dict lookups) depends on the number of non-zeros found in the intersection above.
If both matrices have randomly distributed non-zeros with probability (density) p, and the row/column length is n, then:
set intersection: O(np)
dictionary lookup, multiplication: O(np^2)
This shows that with really sparse matrices finding the intersections is the critical point. This can also be verified by profiling; most of the time is spent calculating the intersections.
In real-world terms, we seem to spend around 20 µs per row/column pair of 80 non-zeros. This is not blindingly fast, and the code can certainly be made faster. Cython may be one solution, but this may be one of the problems where Python is not the best possible tool. A simple linear matching (a merge-sort-type algorithm) over sorted integers should be at least an order of magnitude faster when written in C.
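Staying in Python/NumPy, one way to approximate that merge-of-sorted-indices idea is to work directly on the CSR/CSC index arrays (a sketch of mine, not part of the original answer; it assumes A is stored as CSR and B as CSC):
import numpy as np

def dot_entry(A_csr, B_csc, i, j):
    """C[i, j] for sparse A (CSR) and B (CSC) via intersection of their index arrays."""
    a_idx = A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i + 1]]   # column indices of row i
    a_val = A_csr.data[A_csr.indptr[i]:A_csr.indptr[i + 1]]
    b_idx = B_csc.indices[B_csc.indptr[j]:B_csc.indptr[j + 1]]   # row indices of column j
    b_val = B_csc.data[B_csc.indptr[j]:B_csc.indptr[j + 1]]
    # indices within a row/column are unique, so the intersection gives the common positions
    common, ia, ib = np.intersect1d(a_idx, b_idx, assume_unique=True, return_indices=True)
    return np.dot(a_val[ia], b_val[ib])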
One important thing to note is that the algorithm can be run in parallel for several elements at a time. There is no need to settle for a single thread, as the calculations are independent as long as each thread handles one output point.