I have a weighted graph data structure used in a machine learning algorithm that requires frequent alterations (insertions and deletions of both vertices and edges). I am currently using an adjacency matrix implemented as a numpy 2D array with entries
G[i, j] = w(i, j) if (i, j) is an edge else 0
This works well for graphs with |V| < 1,500 vertices, but the search, insert, and delete operations get really slow beyond that.
Since I am using a vectorized optimization of the graph embedding based on the weights, I need to use numpy arrays, so using lists is not feasible in this case.
Are there any efficient graph implementations written in Python that I can use for storing and operating on such graphs?
As mentioned in the question, it is very hard to beat the performance of an adjacency list when the graph is sparse. An adjacency matrix will always waste a lot of space for a sparse graph, so you will probably have to find an alternative to using numpy arrays for all operations.
Some of the possible solutions to your problem may be:
Use an adjacency list structure for the other operations and convert it to a 2D numpy array when necessary (this may not be efficient).
Use a sparse matrix: with a sparse matrix you can still do matrix operations without converting back and forth (a minimal sketch follows this list). You can read more about them in this blog post. Note that you will have to replace some of the numpy operations with their scipy.sparse equivalents in your code if you opt for this solution.
Try using the NetworkX library, which is one of the best out there for handling graph data structures.
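For the sparse-matrix option, here is a minimal sketch (sizes and weights are made up for illustration): keep the weights in a format that supports cheap incremental updates, and convert to CSR only when you need vectorized arithmetic.

import numpy as np
from scipy import sparse

n = 2000  # number of vertices (illustrative)
G = sparse.dok_matrix((n, n), dtype=np.float64)  # dictionary-of-keys: cheap per-entry updates

G[0, 1] = 0.5        # insert / update an edge weight
G[1, 2] = 1.25
G[0, 1] = 0.0        # "delete" an edge by zeroing it out

G_csr = G.tocsr()    # convert for fast vectorized operations
weighted_degree = np.asarray(G_csr.sum(axis=1)).ravel()

The conversion to CSR is cheap relative to dense allocation, so you can keep the mutable structure as the source of truth and rebuild the CSR view only when the embedding step runs.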
Related
I have to generate a matrix (a propagator, in physics) by ordered multiplication of many other matrices. Each matrix is about size (30, 30), with real entries (floats), but not symmetric. The number of matrices to multiply varies between 1e3 and 1e5. Each matrix is only slightly different from the previous one, but they do not commute (and at the end I need the product of all of these non-commutative multiplications). Each matrix corresponds to a certain time slice, so I know how to generate each of them independently, wherever it appears in the multiplication sequence. In the end I have to produce many such matrix propagators, so any performance enhancement is welcome.
What is the fastest way to implement such a matrix multiplication in Python?
In particular -
How should I structure it? Are there fast axes and so on, or preferable dimensions for the rows/columns of the matrices?
Assuming memory is not a problem, is it better to allocate and build all matrices before multiplying, or to generate each one per time step? Should I store each matrix in a dedicated variable before multiplication, or generate it when needed and multiply directly?
Are there cumulative effects from function-call overhead when generating the matrices?
Since I know how to build each matrix, should this be parallelized? For example, create batch sequences from the start and from the end of the sequence, multiply them in parallel, and finally multiply the partial results in the proper order?
Is it preferable to use a module other than numpy? Can Numba be useful, or some other efficient way to compile to C in place, or optimized external libraries? (Please give a reference if so; I don't have experience with that.)
Thanks in advance.
I don't think the matrix multiplication itself would take much time, so I would do it in a single loop. The assembling is probably the costly part here.
If you have bigger matrices, a map-reduce approach could be helpful: split the sequence of matrices into contiguous chunks, multiply each chunk, and then multiply the resulting partial products in the same order.
Numpy is perfectly fine for problems like this, as it is quite optimized (and is partly written in C).
Just measure how much time the matrix multiplication takes and how much the assembling takes. The result should indicate where you need to optimize.
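A minimal sketch of that measurement, with a placeholder build_matrix(t) standing in for however you actually assemble each time-slice matrix:

import time
import numpy as np

def build_matrix(t):
    # placeholder for the real assembly of the propagator slice at time t
    return np.eye(30) + 1e-3 * np.random.rand(30, 30)

n_steps = 10_000
t0 = time.perf_counter()
slices = [build_matrix(t) for t in range(n_steps)]   # assembly
t1 = time.perf_counter()

prod = np.eye(30)
for M in slices:                                     # ordered, non-commutative product
    prod = prod @ M
t2 = time.perf_counter()

print(f"assembly: {t1 - t0:.3f} s, multiplication: {t2 - t1:.3f} s")

If the multiplication line dominates, chunking and parallelizing as described above can help; if the assembly dominates, that is where Numba or a C extension would pay off.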
I'm working on a growing matrix of data. I found that the best way to make my computations faster is probably to cluster it into blocks, somewhat like this: Clusterized matrix
My matrix shows connections between nodes on a graph with their weights on the intersections.
I made a graph using NetworkX and noticed it does something similar. Screenshot: NX's Graph
Maybe I could use NetworkX's code to cluster it, instead of growing my code by yet another function?
If not, then any Python way of doing it would be helpful. I have read many tutorials on hierarchical clustering, but they all seem to be about connecting points in a two-dimensional space, not about a graph with given 'distances'.
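One possible approach (a sketch only, not necessarily what NetworkX does internally): treat the weight matrix as a similarity matrix, convert it to a condensed distance matrix, run scipy's hierarchical clustering on it, and reorder the rows and columns by the resulting leaf order so that strongly connected nodes end up in blocks.

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# W: symmetric weight matrix of the graph (random example for illustration)
rng = np.random.default_rng(0)
W = rng.random((20, 20))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

# convert similarities to distances (one of several possible conventions)
D = W.max() - W
np.fill_diagonal(D, 0.0)

# hierarchical clustering on the precomputed, condensed distance matrix
Z = hierarchy.linkage(squareform(D, checks=False), method="average")
order = hierarchy.leaves_list(Z)

# reorder the matrix so related nodes sit next to each other
W_clustered = W[np.ix_(order, order)]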
I have been experimenting with MPI in Python through mpi4py, and it's working amazingly well to parallelize some code I developed. However, I rely heavily on Numpy for matrix manipulation, and I have a question regarding the usage of Numpy with MPI.
Let's take, for instance, the dot function in Numpy. Say I have two huge matrices A and B and I want to compute their matrix product A*B:
numpy.dot(A, B)
I would like to know how to spread this function call over the whole cluster. I could chunk B (column-wise) into smaller matrices, spread the matrix products over the cluster nodes, and then regroup the results, but that seems like a crude workaround. Is there a better solution?
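For reference, a minimal sketch of exactly that column-chunking workaround using mpi4py's pickle-based scatter/gather (assuming A and B fit in memory on the root rank; this is only a sketch, not a substitute for a proper distributed linear algebra library):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    A = np.random.rand(1000, 1000)
    B = np.random.rand(1000, 1000)
    chunks = np.array_split(B, size, axis=1)   # split B column-wise
else:
    A = None
    chunks = None

A = comm.bcast(A, root=0)                      # every rank needs the full A
B_local = comm.scatter(chunks, root=0)         # each rank gets a block of columns
C_local = A @ B_local                          # local part of the product

parts = comm.gather(C_local, root=0)
if rank == 0:
    C = np.hstack(parts)                       # reassemble the full product on the root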
I want to create a 2D matrix in Python where the number of rows and columns is equal, around 231,000. Most of the cell entries would be zero.
Some [i][j] entries would be non-zero.
The reason for creating this matrix is to apply SVD and get the [U S V] matrices with a rank of, say, 30.
Can anyone give me an idea of how to implement this with the proper libraries? I tried a pandas DataFrame but it gives a MemoryError.
I have also seen scipy.sparse matrices but couldn't figure out how they would be applied to find the SVD.
I think this is a duplicate question, but I'll answer this anyways.
There are several libraries in Python aimed at computing partial SVDs of very sparse matrices.
My personal preference is scipy.sparse.linalg.svds, an ARPACK implementation of iterative partial SVD calculation.
You can also try the function sparsesvd.sparsesvd, which uses the SVDLIBC implementation, or scipy.linalg.svd, which uses the (dense) LAPACK implementation.
To convert your table to a format these algorithms can use, you will need scipy.sparse, which provides the csc_matrix class.
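For example, a minimal sketch assuming your nonzero entries are available as row, column, and value arrays (the arrays below are random and purely illustrative):

import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

n = 231000                          # square matrix, n x n
rng = np.random.default_rng(0)
nnz = 1000                          # illustrative number of nonzero entries
rows = rng.integers(0, n, nnz)
cols = rng.integers(0, n, nnz)
vals = rng.random(nnz)

M = csc_matrix((vals, (rows, cols)), shape=(n, n))

U, s, Vt = svds(M, k=30)            # rank-30 partial SVD (the 30 largest singular values)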
Use the above links to help you out. There are a lot of resources already here on stack overflow and many more on the internet.
I'm trying to decompose signals into components (matrix factorization) on a large sparse matrix in Python using the sklearn library.
I used scipy's scipy.sparse.csc_matrix to construct my matrix of data. However, I'm unable to perform analyses such as factor analysis or independent component analysis. The only things I can do are use TruncatedSVD or scipy's scipy.sparse.linalg.svds and perform PCA.
Does anyone know any work-arounds to doing ICA or FA on a sparse matrix in python? Any help would be much appreciated! Thanks.
Given:
M = UΣV^t
The drawback of SVD is that the matrices U and V^t are dense. It doesn't really matter that the input matrix is sparse; U and V^t will be dense. Also, the computational complexity of SVD is O(n^2*m) or O(m^2*n), where n is the number of rows and m the number of columns of the input matrix M, depending on which of the two is larger.
It is worth mentioning that SVD gives you the optimal low-rank solution; if you can live with a somewhat larger loss, measured by the Frobenius norm, you might want to consider the CUR algorithm. It scales to larger datasets with O(n*m).
M ≈ CUR
where C and R are now SPARSE matrices.
If you want to look at a Python implementation, take a look at pymf. But be a bit careful with that exact implementation, since at this point in time there seems to be an open issue with it.
Even if the input matrix is sparse, the output will not be a sparse matrix. If the system cannot handle a dense matrix, it will not be able to handle the results either.
It is usually best practice to use a coo_matrix to build the matrix and then convert it with .tocsc() for manipulation.
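A tiny sketch of that construction pattern (the positions, values, and shape are illustrative):

import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 2, 3])
cols = np.array([1, 2, 0])
vals = np.array([4.0, 5.0, 7.0])

M = coo_matrix((vals, (rows, cols)), shape=(1000, 1000))   # cheap to build incrementally
M_csc = M.tocsc()                                          # efficient for column slicing and arithmetic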