Accessing element in a python numpy.matrix - python

I seem to be stuck with something seemingly trivial: I need to access elements in a numpy.matrix. But the matrix doesn't behave as I expect:
>>> mymatrix
matrix([[0.02700243, 0. , 0. , ..., 0. , 0. ,
0. ]])
>>> type(mymatrix)
<class 'numpy.matrix'>
>>> mymatrix.shape
(1, 10000)
>>> mymatrix[0]
matrix([[0.02700243, 0. , 0. , ..., 0. , 0. ,
0. ]])
>>> mymatrix[0][0]
matrix([[0.02700243, 0. , 0. , ..., 0. , 0. ,
0. ]])
>>> mymatrix[0][0][0]
matrix([[0.02700243, 0. , 0. , ..., 0. , 0. ,
0. ]])
That is, whether I take the matrix itself, its [0] element, or [0][0] or [0][0][0], I always get the same object. How is that possible?

According to NumPy Manual:
A matrix is a specialized 2-D array that retains its 2-D nature
through operations
And:
It is no longer recommended to use this class, even for linear
algebra. Instead use regular arrays. The class may be removed in the
future.
Maybe you could consider using a regular array instead. You can return your matrix as an array using:
mymatrix.A
mymatrix.A[0]
mymatrix.A[0][0]

You need to transpose your matrix to index the first element in a matrix with this shape.
Try:
mymatrix.T[0]
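To make the behaviour concrete, here is a minimal sketch. The small 1 x 4 matrix and its values are made up as a stand-in for the 1 x 10000 matrix in the question. It shows that single-index access on a numpy.matrix stays 2-D, while passing the row and column together returns the scalar element.

```python
import numpy as np

# Made-up stand-in for the 1 x 10000 matrix from the question.
mymatrix = np.matrix([[0.027, 0.0, 0.5, 0.0]])

# Indexing with a single integer keeps the result 2-D,
# so mymatrix[0] is again a 1 x 4 matrix (and so is mymatrix[0][0]).
row = mymatrix[0]
row_again = mymatrix[0][0]

# Giving row and column in one index expression returns the scalar.
first = mymatrix[0, 0]
third = mymatrix[0, 2]

# A plain ndarray view supports the usual chained indexing.
arr = np.asarray(mymatrix)
third_from_array = arr[0][2]
```

If only scalars are needed, mymatrix[0, j] avoids both the transpose and the conversion to an array.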

Related

Question on discrete convolution with python

I am struggling to understand why the np.convolve method returns an N+M-1 set. I would appreciate your help.
Suppose I have two discrete probability distributions with values of [1,2] and [10,12] and probabilities of [.5,0.2] and [.5,0.4] respectively.
Using numpy's convolve function I get:
>>In[]: np.convolve([.5,0.2],[.5,0.4])
>>Out[]: array([0.25, 0.3 , 0.08])
However I don't understand why the resulting probability distribution only has 3 datapoints. To my understanding the sum of my input variables can have the following values: [11,12,13,14] so I would expect 4 datapoints to reflect the probabilities of each of these occurrences.
What am I missing?
I have managed to find the answer to my own question after understanding convolution a bit better. Posting it here for anyone wondering:
Effectively, the convolution of the two "signals" (probability functions) in my example above is not set up correctly, because nothing in it reflects that the events [1,2] of the first distribution and [10,12] of the second do not coincide.
Simply taking np.convolve([.5,0.2],[.5,0.4]) assumes that both probability vectors refer to the same, evenly spaced events (e.g. [1,2] and [1,2]).
The correct approach is to align the two series on a common X axis, x \in [1,12], as below:
>>In[]: vector1 = [.5,0.2, 0,0,0,0,0,0,0,0,0,0]
>>In[]: vector2 = [0,0,0,0,0,0,0,0,0,.5, 0,0.4]
>>In[]: np.convolve(vector1, vector2)
>>Out[]: array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.25, 0.1 ,
0.2 , 0.08, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. ])
which gives the correct values for 11,12,13,14
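The same alignment can be sketched without padding both vectors to a shared axis by hand. The helper name to_dense is hypothetical, not a NumPy function. It places each probability at the offset of its value from the smallest value, and the sums are recovered by adding the two offsets back.

```python
import numpy as np

values1, probs1 = [1, 2], [0.5, 0.2]
values2, probs2 = [10, 12], [0.5, 0.4]

def to_dense(values, probs):
    """Hypothetical helper: put each probability at index (value - min(values))."""
    lowest = min(values)
    dense = np.zeros(max(values) - lowest + 1)
    for value, prob in zip(values, probs):
        dense[value - lowest] += prob
    return lowest, dense

low1, dense1 = to_dense(values1, probs1)  # probabilities for values 1, 2
low2, dense2 = to_dense(values2, probs2)  # probabilities for values 10, 11, 12

# Convolving the evenly spaced vectors gives one entry per possible sum.
conv = np.convolve(dense1, dense2)

# Entry k of conv belongs to the sum low1 + low2 + k.
sums = np.arange(len(conv)) + low1 + low2
distribution = dict(zip(sums.tolist(), conv.tolist()))
```

The gap at 11 in the second distribution is what the original three-point result was missing: np.convolve only sees positions, so unevenly spaced values need explicit zeros.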

Why are 3D numpy arrays printed the way they are (How are they ordered)?

I am trying to wrap my head around 3D arrays (or multi-dimensional arrays in general), but it's blowing my brains a bit. Especially the way in which 3D numpy arrays are printed is counter-intuitive to me. This question is similar but it is more about the differences between programming languages, and I still do not fully get it. Let me try to explain.
Say I want to create a 3D array with 3 rows (length), 5 columns (width) and a depth of 2. So a 3x5x2 matrix.
I do the following:
import numpy as np
a = np.zeros(30).reshape(3, 5, 2)
To me, a logical way to print this would be like this:
[[[0. 0. 0. 0. 0.] #We can still see three rows from top to bottom
[0. 0. 0. 0. 0.]] #We can still see five columns from left to right
[[0. 0. 0. 0. 0.] #Depth values are shown underneath each other
[0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]]
However, when I print this array it prints like this:
[[[0. 0.] #We can still see three rows from top to bottom,
[0. 0.] #However columns now also appear from top to bottom instead of from left to right
[0. 0.] #Depth values are now shown from left to right
[0. 0.]
[0. 0.]]
[[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]]
[[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]]]
It is unobvious to me why the array would be printed in this way. Maybe it is just me (Maybe my spatial reasoning is lacking here), or is there a specific reason why NumPy arrays are printed like this?
Synthesizing the comments into a proper answer:
First, take a look at np.zeros(10).reshape(5, 2). That's 5 rows of 2 columns, not 2 rows of 5 columns. Adding 3 at the front means 3 planes of 5 rows and 2 columns. What you're missing is that your new dimension is at the front, not the end. In mathematics, extra dimensions are usually added at the end (like extending (x,y) with a z to get (x,y,z)). In computer science, however, a new array dimension is typically added at the front. This reflects the way arrays are typically stored in memory, in row-major order, where the last index varies fastest.
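A short sketch of that ordering, using np.arange instead of zeros so that individual elements can be told apart. The transpose at the end is one possible way to get the layout the question expected, not the only one.

```python
import numpy as np

# 30 distinct values make the memory order visible.
a = np.arange(30).reshape(3, 5, 2)

# The first axis selects one of 3 blocks; each block is 5 rows by 2 columns.
block_shape = a[0].shape

# Row-major order: the last index moves fastest through memory.
step_last = a[0, 0, 1] - a[0, 0, 0]    # next element along the last axis
step_middle = a[0, 1, 0] - a[0, 0, 0]  # skips one row of 2 elements
step_first = a[1, 0, 0] - a[0, 0, 0]   # skips one block of 5 * 2 elements

# Swapping the last two axes gives 3 blocks of 2 rows by 5 columns,
# which is the printed layout the question had in mind.
b = a.transpose(0, 2, 1)
```

Printing b shows three blocks of two five-element rows each, while a itself is unchanged.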

Problems with pySpark Columnsimilarities

tl;dr
How do I use pySpark to compare the similarity of rows?
I have a numpy array where I would like to compare the similarities of each row to one another
print (pdArray)
#[[ 0. 1. 0. ..., 0. 0. 0.]
# [ 0. 0. 3. ..., 0. 0. 0.]
# [ 0. 0. 0. ..., 0. 0. 7.]
# ...,
# [ 5. 0. 0. ..., 0. 1. 0.]
# [ 0. 6. 0. ..., 0. 0. 3.]
# [ 0. 0. 0. ..., 2. 0. 0.]]
Using scikit-learn I can compute cosine similarities as follows:
pyspark.__version__
# '2.2.0'
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pdArray)
similarities.shape
# (475, 475)
print(similarities)
array([[ 1.00000000e+00, 1.52204908e-03, 8.71545594e-02, ...,
3.97681174e-04, 7.02593036e-04, 9.90472253e-04],
[ 1.52204908e-03, 1.00000000e+00, 3.96760121e-04, ...,
4.04724413e-03, 3.65324300e-03, 5.63519735e-04],
[ 8.71545594e-02, 3.96760121e-04, 1.00000000e+00, ...,
2.62367141e-04, 1.87878869e-03, 8.63876439e-06],
...,
[ 3.97681174e-04, 4.04724413e-03, 2.62367141e-04, ...,
1.00000000e+00, 8.05217639e-01, 2.69724702e-03],
[ 7.02593036e-04, 3.65324300e-03, 1.87878869e-03, ...,
8.05217639e-01, 1.00000000e+00, 3.00229809e-03],
[ 9.90472253e-04, 5.63519735e-04, 8.63876439e-06, ...,
2.69724702e-03, 3.00229809e-03, 1.00000000e+00]])
As I am looking to expand to much larger sets than my original (475 row) matrix I am looking at using Spark via pySpark
from pyspark.mllib.linalg.distributed import RowMatrix
#load data into spark
tempSpark = sc.parallelize(pdArray)
mat = RowMatrix(tempSpark)
# Calculate exact similarities
exact = mat.columnSimilarities()
exact.entries.first()
# MatrixEntry(128, 211, 0.004969676943490767)
# Now when I get the data out I do the following...
# Convert to a RowMatrix.
rowMat = approx.toRowMatrix()
t_3 = rowMat.rows.collect()
a_3 = np.array([(x.toArray()) for x in t_3])
a_3.shape
# (488, 749)
As you can see, the shape of the data a) is no longer square (which it should be) and b) has dimensions which do not match the original number of rows. It does match, in part, the number of features in each row (len(pdArray[0]) = 749), but I don't know where the 488 is coming from.
The presence of 749 makes me think I need to transpose my data first. Is that correct?
Finally, if this is the case why are the dimensions not (749, 749) ?
First, the columnSimilarities method only returns the off diagonal entries of the upper triangular portion of the similarity matrix. With the absence of the 1's along the diagonal, you may have 0's for entire rows in the resulting similarity matrix.
Second, a pyspark RowMatrix doesn't have meaningful row indices. So essentially when converting from a CoordinateMatrix to a RowMatrix, the i value in the MatrixEntry is being mapped to whatever is convenient (probably some incrementing index). So what is likely happening is the rows that have all 0's are simply being ignored and the matrix is being squished vertically when you convert it to a RowMatrix.
It probably makes sense to inspect the dimension of the similarity matrix immediately after computation with the columnSimilarities method. You can do this by using the numRows() and the numCols() methods.
print(exact.numRows(),exact.numCols())
Other than that, it does sound like you need to transpose your matrix to get the correct vector similarities. Furthermore, if there is some reason that you need this in a RowMatrix-like form, you could try using an IndexedRowMatrix which does have meaningful row indices and would preserve the row index from the original CoordinateMatrix upon conversion.
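The effect of the transpose can be checked on a small dense array without Spark. This is only a NumPy sketch on made-up random data; column_cosine_similarities is a hypothetical helper that mimics what a column-wise similarity computes, not the pySpark implementation.

```python
import numpy as np

def column_cosine_similarities(x):
    """Hypothetical dense helper: cosine similarity between all pairs of columns."""
    norms = np.linalg.norm(x, axis=0)
    unit = x / norms  # assumes no column is entirely zero
    return unit.T @ unit

rng = np.random.default_rng(0)
data = rng.random((4, 6))  # made-up data: 4 rows (items), 6 features

# Column similarities of the data compare features: shape (6, 6).
by_features = column_cosine_similarities(data)

# Column similarities of the transpose compare the original rows: shape (4, 4).
by_rows = column_cosine_similarities(data.T)

# Only the off-diagonal upper triangle, as (i, j, value) entries,
# which is the form the answer describes for columnSimilarities.
i, j = np.triu_indices(by_rows.shape[0], k=1)
entries = list(zip(i.tolist(), j.tolist(), by_rows[i, j].tolist()))
```

With 4 rows this leaves 6 entries and no diagonal, so any conversion that drops the (i, j) indices can no longer reconstruct the square matrix.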

Is there a sparse version of tf.multiply?

Does TensorFlow have a sparse element-wise multiplication?
I.e. A sparse version of tf.multiply()
I only found tf.sparse_tensor_dense_matmul(), but it's not element wise multiplication.
The function you might be looking for is: __mul__
Additional details from official documentation:
The output locations corresponding to the implicitly zero elements in the sparse tensor will be zero (i.e., will not take up storage space), regardless of the contents of the dense tensor (even if it's +/-INF and that INF*0 == NaN).
Limitation: this Op only broadcasts the dense side to the sparse side, but not the other direction.
Example:
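import numpy as np
import tensorflow as tf  # this example uses the TensorFlow 1.x API
sess = tf.InteractiveSession()  # the .eval() calls below need a default session
sp_mat = tf.SparseTensor([[0,0],[0,2],[1,2],[2,1]], np.ones(4), [3,3])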
const1 = tf.constant([[1,2,3],[4,5,6],[7,8,9]], dtype=tf.float64)
const2 = tf.constant(np.array([1,2,3]),dtype=tf.float64)
elementwise_result = sp_mat.__mul__(const1)
broadcast_result = sp_mat.__mul__(const2)
print("Sparse Matrix:\n",tf.sparse_tensor_to_dense(sp_mat).eval())
print("\n\nElementwise:\n",tf.sparse_tensor_to_dense(elementwise_result).eval())
print("\n\nBroadcast:\n",tf.sparse_tensor_to_dense(broadcast_result).eval())
Output:
Sparse Matrix:
[[ 1. 0. 1.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
Elementwise:
[[ 1. 0. 3.]
[ 0. 0. 6.]
[ 0. 8. 0.]]
Broadcast:
[[ 1. 0. 3.]
[ 0. 0. 3.]
[ 0. 2. 0.]]

How do I change column type in Python from int to object for sklearn?

I am really new to Python and scikit-learn (sklearn) and I am trying to load a dataset which consists of 7 attribute columns and 1 column with the classification (class/data target). One attribute holds the values [1,2,3,4,5], which actually mark a stage of something, making it nominal rather than numeric. But of course Python recognizes it as numerical data (int64), when in fact I want it to be treated as nominal data (object). How do I change the column type to nominal?
I have done the following.
print(data.dtypes)
data["col_name"]=data["col_name"].astype(object)
print(data.dtypes)
In the first print, data["col_name"] is still recognized as int64; after the astype line, it has changed to object. But it doesn't make any difference to the data: when I try to use matplotlib to create a histogram, it still treats both X and Y as numbers instead of objects.
Also I have read about the One Hot Encoding and Label Encoding on the documentation, but I figured they are not what I need in my case. I wonder if I have misunderstood something or maybe there's another solution.
Thanks
The sklearn documentation is thorough, and the Preprocessing section on encoding categorical features covers this case.
Regarding keeping categorical features represented as an array of integers, i.e. [1,2,3,4,5], it says:
Such integer representation can not be used directly with scikit-learn
estimators, as these expect continuous input, and would interpret the
categories as being ordered, which is often not desired (i.e. the set
of browsers was ordered arbitrarily). One possibility to convert
categorical features to features that can be used with scikit-learn
estimators is to use a one-of-K or one-hot encoding, which is
implemented in OneHotEncoder. This estimator transforms each
categorical feature with m possible values into m binary features,
with only one active.
So what you can do is convert your column into 5 new columns (5 in this case, since you have 5 possible values) using one-hot encoding.
Here is some working code. The input is a column of categorical values [1,2,3,4,5]; the output is a matrix with 5 columns, one for each of the 5 possible choices:
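from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# fit() returns the fitted encoder; the sklearn version used here echoed it as:
# OneHotEncoder(categorical_features='all', dtype='numpy.float64', handle_unknown='error', n_values='auto', sparse=True)
enc.fit([[1],[2],[3],[4],[5]])
# transform() returns a sparse matrix by default, so convert it before printing
print(enc.transform([[1],[2],[3],[4],[5]]).toarray())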
Output:
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
Say your categorical parameters were in this order: [1,3,2,5,4,3,2,1,3,4,2]. You would get this output:
[[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 1. 0. 0. 0.]]
So this 1 column will convert into 5 columns.
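For intuition, the same matrix can be reproduced with plain NumPy. This is only a sketch of what one-hot encoding does to this column, not how OneHotEncoder is implemented; the variable names are made up for the example.

```python
import numpy as np

categories = np.array([1, 2, 3, 4, 5])
column = np.array([1, 3, 2, 5, 4, 3, 2, 1, 3, 4, 2])

# Compare every value with every category: entry (i, j) is 1.0
# exactly when column[i] equals categories[j], otherwise 0.0.
one_hot = (column[:, None] == categories[None, :]).astype(float)
```

Each row contains a single 1.0, in the position of that row's category, which matches the second output above.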
Alternatively, cast the column to strings:
print(data.dtypes)
data["col_name"]=data["col_name"].astype(str)
print(data.dtypes)
