I want to create a scipy csr_matrix to represent a sparse graph. For each edge, along with the weight, I want to define several "attributes." One of the attributes will be accessed directly after the weight; the others are less important and will be accessed infrequently. What is the best/most efficient way to do this? Do scipy matrices accept non-numbers as entries?
Example:
Say I had a sparse matrix m represented by arrays (values are weights):
[[ 0.49850701 0.27 0. ]
[ 0.22479665 33. 3. ]
[ 0. 10. 0. ]]
I want to store additional data for each entry (x, y). For example, let's say I wanted to store "lifespan" and "name" attributes.
The lifespan attributes would be in a matrix like this:
[[ 8. 10. 0. ]
[ 0.9 4. 3.2 ]
[ 0. 4.3 0. ]]
The name attributes would be in a matrix like this:
[[ 'Bob' 'Kevin' 0. ]
[ 'Gary' 'Joe' 'Sally' ]
[ 0. 'Ralph' 0. ]]
What is the best way to do this? Should I have a separate matrix for each attribute; a matrix for the weight and lifespan plus a dict keyed by coordinate tuple for all other attributes; a matrix with list or dict entries; or a matrix for the weight and a dict for everything else?
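For what it's worth, a minimal sketch of the parallel-matrices-plus-dict option (the variable names are illustrative): scipy's sparse matrices are built for numeric data, so strings are best kept outside the matrix, while a numeric attribute that shares the weights' sparsity pattern can reuse the same indices/indptr arrays.
import numpy as np
from scipy.sparse import csr_matrix

# Weights as a CSR matrix.
weights = csr_matrix(np.array([[0.49850701, 0.27, 0.0],
                               [0.22479665, 33.0, 3.0],
                               [0.0, 10.0, 0.0]]))

# A numeric attribute with the same sparsity pattern can share the
# indices/indptr arrays, so lookups stay as cheap as for the weights.
lifespan = csr_matrix((np.array([8.0, 10.0, 0.9, 4.0, 3.2, 4.3]),
                       weights.indices, weights.indptr),
                      shape=weights.shape)

# Infrequently accessed, non-numeric attributes go in a dict keyed by (row, col).
names = {(0, 0): 'Bob', (0, 1): 'Kevin', (1, 0): 'Gary',
         (1, 1): 'Joe', (1, 2): 'Sally', (2, 1): 'Ralph'}

print(weights[1, 1], lifespan[1, 1], names[(1, 1)])  # 33.0 4.0 Joe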
I need to solve a problem where I have a plane in known 3D space (a court), and I know the coordinates of its corners (and other line intersections). I then have a 2D image of the court that I can digitise to find the pixel coordinates.
I understand that I can relate the two via homography (and have successfully done this in MATLAB), and calculate new 3D points based on an (x,y) location in the 2D image. The 3D z-coordinate is always 0.
My question is how can I perform this in Python?
court_coordinates
array([[-3.05 , -6.705, 0. ],
[ 3.05 , -6.705, 0. ],
[-3.05 , 0. , 0. ],
[ 3.05 , 0. , 0. ],
[-3.05 , 6.705, 0. ],
[ 3.05 , 6.705, 0. ]], dtype=float32)
pixel_coordinates
array([[ 257.4123305 , 694.90208136],
[1016.79422895, 694.90208136],
[ 338.1439609 , 505.68732261],
[ 936.06259855, 510.73304951],
[ 393.6469568 , 397.20419426],
[ 885.60532955, 402.24992116]])
New 2D point:
[(635.8418479974771, 689.8563544623148)]
This should give a 3D coordinate of (0,-6.705,0)
So far I have tried using cv2.findHomography, inputting my two sets of coordinates. Then I have used the inverse of this matrix, multiplied with the new 2D pixel coordinates; however, this doesn't return a realistic 3D coordinate.
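For reference, a minimal sketch of how this might be done with the arrays above (names taken from the question): build the homography from the court's (x, y) coordinates, since z is always 0, then map the pixel point back through the inverse. cv2.perspectiveTransform performs the divide-by-w normalization that a raw matrix multiply leaves out, which is the usual reason the inverse mapping looks unrealistic.
import cv2
import numpy as np

# Homography from 2D court coordinates (drop the constant z = 0) to pixels.
H, _ = cv2.findHomography(court_coordinates[:, :2], pixel_coordinates)

# Map the pixel point back to court space; perspectiveTransform expects
# shape (N, 1, 2) and normalizes the homogeneous result.
pixel = np.array([[[635.8418479974771, 689.8563544623148]]])
court_xy = cv2.perspectiveTransform(pixel, np.linalg.inv(H))
print(court_xy)  # expected roughly [[[0., -6.705]]]; the 3D point is (x, y, 0.)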
Thanks
From the code
rotation = cv2.getRotationMatrix2D((0, 0), 47.65, 1.0)
I got a rotation transform matrix like:
[[ 0.67365771 0.7390435 0. ]
[-0.7390435 0.67365771 0. ]]
Since rotation is a special case of the affine transform, I think this is a valid affine transform matrix; am I right?
Since the affine transform is a special case of the perspective transform, I also think this matrix will be a valid perspective transform matrix if I make some modifications to it.
So I tried to add one more row to make its shape 3 x 3.
newrow = numpy.array([numpy.array([1, 1, 1])]) # [[1 1 1]]
rotation3 = numpy.append(rotation, newrow, axis=0)
print(rotation3) outputs:
[[ 0.67365771 0.7390435 0. ]
[-0.7390435 0.67365771 0. ]
[ 1. 1. 1. ]]
But rotation3 does not seem to work properly as a perspective matrix; here is how I tested it:
rotated_points = cv2.perspectiveTransform(points, rotation3)
rotated_points does not look like a rotation of points.
Is [1, 1, 1] the correct third row? Should I also change rows 1 and 2, and how can I do it?
Basically you are right: the affine transform is a special case of the perspective transform.
Applying the 3x3 identity matrix as a perspective transform leaves the input unchanged:
[1,0,0]
[0,1,0]
[0,0,1]
So if you want to grow an affine transformation matrix into a perspective one, you want to append the last row of this identity matrix.
Your example would look like:
[ 0.67365771 0.7390435 0. ]
[-0.7390435 0.67365771 0. ]
[ 0. 0. 1. ]
Applying the above perspective matrix has the same effect as applying the affine transform:
[ 0.67365771 0.7390435 0. ]
[-0.7390435 0.67365771 0. ]
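A quick sketch of the corrected construction (the test points are illustrative):
import numpy as np
import cv2

rotation = cv2.getRotationMatrix2D((0, 0), 47.65, 1.0)  # 2x3 affine matrix

# Append the last row of the 3x3 identity, [0, 0, 1], not [1, 1, 1].
rotation3 = np.vstack([rotation, [0.0, 0.0, 1.0]])

points = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])  # shape (N, 1, 2)
print(cv2.perspectiveTransform(points, rotation3))  # now a pure rotation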
--> have a look at:
affine transformation (Wikipedia)
OpenCV: generate identity matrix
identity matrix (Wikipedia)
tl;dr
How do I use pySpark to compare the similarity of rows?
I have a numpy array in which I would like to compare each row's similarity to the others.
print (pdArray)
#[[ 0. 1. 0. ..., 0. 0. 0.]
# [ 0. 0. 3. ..., 0. 0. 0.]
# [ 0. 0. 0. ..., 0. 0. 7.]
# ...,
# [ 5. 0. 0. ..., 0. 1. 0.]
# [ 0. 6. 0. ..., 0. 0. 3.]
# [ 0. 0. 0. ..., 2. 0. 0.]]
pyspark.__version__
# '2.2.0'
Using scikit-learn I can compute the cosine similarities as follows...
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pdArray)
similarities.shape
# (475, 475)
print(similarities)
array([[ 1.00000000e+00, 1.52204908e-03, 8.71545594e-02, ...,
3.97681174e-04, 7.02593036e-04, 9.90472253e-04],
[ 1.52204908e-03, 1.00000000e+00, 3.96760121e-04, ...,
4.04724413e-03, 3.65324300e-03, 5.63519735e-04],
[ 8.71545594e-02, 3.96760121e-04, 1.00000000e+00, ...,
2.62367141e-04, 1.87878869e-03, 8.63876439e-06],
...,
[ 3.97681174e-04, 4.04724413e-03, 2.62367141e-04, ...,
1.00000000e+00, 8.05217639e-01, 2.69724702e-03],
[ 7.02593036e-04, 3.65324300e-03, 1.87878869e-03, ...,
8.05217639e-01, 1.00000000e+00, 3.00229809e-03],
[ 9.90472253e-04, 5.63519735e-04, 8.63876439e-06, ...,
2.69724702e-03, 3.00229809e-03, 1.00000000e+00]])
As I am looking to expand to much larger sets than my original (475-row) matrix, I want to use Spark via pySpark:
from pyspark.mllib.linalg.distributed import RowMatrix
#load data into spark
tempSpark = sc.parallelize(pdArray)
mat = RowMatrix(tempSpark)
# Calculate exact similarities
exact = mat.columnSimilarities()
exact.entries.first()
# MatrixEntry(128, 211, 0.004969676943490767)
# Now when I get the data out I do the following...
# Convert to a RowMatrix.
rowMat = exact.toRowMatrix()
t_3 = rowMat.rows.collect()
a_3 = np.array([(x.toArray()) for x in t_3])
a_3.shape
# (488, 749)
As you can see, the shape of the data is (a) no longer square (which it should be) and (b) has dimensions which do not match the original number of rows. It does match (in part) the number of features in each row (len(pdArray[0]) = 749), but I don't know where the 488 is coming from.
The presence of 749 makes me think I need to transpose my data first. Is that correct?
Finally, if this is the case why are the dimensions not (749, 749) ?
First, the columnSimilarities method only returns the off-diagonal entries of the upper triangular portion of the similarity matrix. In the absence of the 1's along the diagonal, you may have all-zero rows in the resulting similarity matrix.
Second, a pyspark RowMatrix doesn't have meaningful row indices. So essentially, when converting from a CoordinateMatrix to a RowMatrix, the i value in each MatrixEntry is mapped to whatever is convenient (probably some incrementing index). What is likely happening is that the all-zero rows are simply being ignored, and the matrix is being squished vertically when you convert it to a RowMatrix.
It probably makes sense to inspect the dimension of the similarity matrix immediately after computation with the columnSimilarities method. You can do this by using the numRows() and the numCols() methods.
print(exact.numRows(),exact.numCols())
Other than that, it does sound like you need to transpose your matrix to get the correct vector similarities. Furthermore, if there is some reason that you need this in a RowMatrix-like form, you could try using an IndexedRowMatrix which does have meaningful row indices and would preserve the row index from the original CoordinateMatrix upon conversion.
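A sketch of the suggested transpose fix (reusing sc and pdArray from the question): columnSimilarities() compares columns, so transposing first makes each original row a column, and checking the dimensions right after computation confirms the result is 475 x 475.
from pyspark.mllib.linalg.distributed import RowMatrix

# Rows of pdArray become columns, so columnSimilarities() now compares
# the original rows to each other.
mat = RowMatrix(sc.parallelize(pdArray.T))
exact = mat.columnSimilarities()
print(exact.numRows(), exact.numCols())  # should now be 475 475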
So it is easier for me to think about vectors as column vectors when I need to do some linear algebra. Thus I prefer shapes like (n,1).
Is there significant memory usage difference between shapes (n,) and (n,1)?
What is preferred way?
And how to reshape (n,) vector into (n,1) vector. Somehow b.reshape((n,1)) doesn't do the trick.
a = np.random.random((10,1))
b = np.ones((10,))
b.reshape((10,1))
print(a)
print(b)
[[ 0.76336295]
[ 0.71643237]
[ 0.37312894]
[ 0.33668241]
[ 0.55551975]
[ 0.20055153]
[ 0.01636735]
[ 0.5724694 ]
[ 0.96887004]
[ 0.58609882]]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
A simpler way, using Python's syntactic sugar, is
b.reshape(-1, 1)
where NumPy automatically computes the correct dimension in place of the -1.
ndarray.reshape() returns a new view, or a copy (depending on whether the new shape is compatible with the existing memory layout). It does not modify the array in place.
b.reshape((10, 1))
as such is effectively a no-op, since the created view/copy is not assigned to anything. The "fix" is simple:
b_new = b.reshape((10, 1))
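For example, continuing the snippet from the question:
b_new = b.reshape((10, 1))
print(b_new.shape)  # (10, 1)
print(b.shape)      # (10,) -- the original is untouched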
The amount of memory used should not differ at all between the two shapes. NumPy arrays use the concept of strides, so the shapes (10,) and (10, 1) can both use the same buffer; only the strides, the jumps taken to reach the next row or column, change.
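An illustrative check of that claim (not from the original post):
import numpy as np

b = np.ones(10)
c = b.reshape(-1, 1)
print(b.nbytes, c.nbytes)      # 80 80 -- same buffer size for float64
print(b.strides, c.strides)    # (8,) (8, 8) -- only the strides differ
print(np.shares_memory(b, c))  # True -- the reshape is a view here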
Let's say I do some calculation and I get a 3-by-3 matrix each time in a loop. Assume that each time, I want to save such a matrix in a column of a bigger matrix whose number of rows equals 9 (the total number of elements in the smaller matrix). First I reshape the smaller matrix and then try to save it into one column of the big matrix. A simple code for only one column looks something like this:
import numpy as np
Big = np.zeros((9,3))
Small = np.random.rand(3,3)
Big[:,0]= np.reshape(Small,(9,1))
print(Big)
But python throws me the following error:
Big[:,0]= np.reshape(Small,(9,1))
ValueError: could not broadcast input array from shape (9,1) into shape (9)
I also tried to use flatten, but that didn't work either. Is there any way to create a shape-(9,) array from the small matrix, or any other way to handle this error?
Your help is greatly appreciated!
try:
import numpy as np
Big = np.zeros((9,3))
Small = np.random.rand(3,3)
Big[:,0]= np.reshape(Small,(9,))
print(Big)
or:
import numpy as np
Big = np.zeros((9,3))
Small = np.random.rand(3,3)
Big[:,0]= Small.reshape((9,1))
print(Big)
or:
import numpy as np
Big = np.zeros((9,3))
Small = np.random.rand(3,3)
Big[:,[0]]= np.reshape(Small,(9,1))
print(Big)
Each of these gets me:
[[ 0.81527817 0. 0. ]
[ 0.4018887 0. 0. ]
[ 0.55423212 0. 0. ]
[ 0.18543227 0. 0. ]
[ 0.3069444 0. 0. ]
[ 0.72315677 0. 0. ]
[ 0.81592963 0. 0. ]
[ 0.63026719 0. 0. ]
[ 0.22529578 0. 0. ]]
Explanation
The shape of Big[:, 0], which you are trying to assign to, is (9,): one-dimensional. The shape you are trying to assign from is (9, 1): two-dimensional. You need to reconcile this either by making the two-dimensional array one-dimensional, turning np.reshape(Small, (9, 1)) into np.reshape(Small, (9,)), or by making the one-dimensional target two-dimensional, turning Big[:, 0] into Big[:, [0]]. The exception is the second snippet, Big[:, 0] = Small.reshape((9, 1)); in that case numpy apparently squeezes the trailing singleton dimension itself (behavior that may depend on your numpy version).
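An alternative sketch that sidesteps the shape question entirely (not from the original answer): flatten the small matrix to one dimension before assigning.
import numpy as np

Big = np.zeros((9, 3))
Small = np.random.rand(3, 3)

# ravel() returns a 1-D view with shape (9,), which matches Big[:, 0] exactly.
Big[:, 0] = Small.ravel()
print(Big)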