How to append a vector to a matrix in python - python

I want to append a vector to a matrix in python. I tried append or concatenate methods but I didn't get the answer. I was previously working with Matlab and there I used this:
m = zeros(10, 4) % define my matrix, 10x4
v = ones(10, 1) % my vector, 10x1
c = [m,v] % so simple! the result is: 10x5 (the vector added as the last column)
How can I do that in python using numpy?

You're looking for np.c_ and np.r_. (Think "column stack" and "row stack", which also exist as functions, but with MATLAB-style range generation.)
Also see np.concatenate, np.vstack, np.hstack, np.dstack, np.column_stack, etc. (np.row_stack is an alias of np.vstack and is deprecated as of NumPy 2.0.)
For example:
import numpy as np
m = np.zeros((10, 4))
v = np.ones((10, 1))
c = np.c_[m, v]
Yields:
array([[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.]])
For these inputs, this is equivalent to np.hstack([m, v]) or np.column_stack([m, v]).
If you're not coming from matlab, hstack and column_stack probably seem much more readable and descriptive. (And they're arguably better in this case for that reason.)
However, np.c_ and np.r_ have additional functionality that folks coming from matlab tend to expect. For example:
In [7]: np.r_[1:5, 2]
Out[7]: array([1, 2, 3, 4, 2])
Or:
In [8]: np.c_[m, 0:10]
Out[8]:
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 2.],
[ 0., 0., 0., 0., 3.],
[ 0., 0., 0., 0., 4.],
[ 0., 0., 0., 0., 5.],
[ 0., 0., 0., 0., 6.],
[ 0., 0., 0., 0., 7.],
[ 0., 0., 0., 0., 8.],
[ 0., 0., 0., 0., 9.]])
At any rate, for matlab folks, it's handy to know about np.r_ and np.c_ in addition to vstack, hstack, etc.
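As a quick check of the equivalences mentioned above, here is a minimal sketch that builds the same 10x5 result three ways. It also tries a flat, 1-D vector (v_flat is a name introduced here purely for illustration, not from the question). In that case np.c_ and np.column_stack treat the vector as a column, while np.hstack raises a ValueError because the inputs have different numbers of dimensions.

```python
import numpy as np

m = np.zeros((10, 4))
v = np.ones((10, 1))          # 2-D column vector, as in the question

a = np.c_[m, v]
b = np.hstack([m, v])
c = np.column_stack([m, v])
same = np.array_equal(a, b) and np.array_equal(a, c)

# Hypothetical variation: a flat, 1-D vector of length 10.
v_flat = np.ones(10)
d = np.c_[m, v_flat]                 # 1-D input is treated as a column
e = np.column_stack([m, v_flat])     # same here

try:
    np.hstack([m, v_flat])           # 2-D and 1-D inputs do not line up
    hstack_failed = False
except ValueError:
    hstack_failed = True

print(a.shape, same, d.shape, e.shape, hstack_failed)
```

If your vector is 1-D, np.column_stack is probably the least surprising choice.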

In numpy it is similar:
>>> m=np.zeros((10,4))
>>> m
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>> v=np.ones((10,1))
>>> v
array([[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.]])
>>> np.c_[m,v]
array([[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.]])

Related

Indexing numpy matrix

So let's say I have a (4,10) array initialized to zeros, and I have an input array in the form [2,7,0,3]. The input array will modify the zeros matrix to look like this:
[[0,0,1,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,1,0,0],
[1,0,0,0,0,0,0,0,0,0],
[0,0,0,1,0,0,0,0,0,0]]
I know I can do that by looping over the input array and indexing the matrix with something like matrix[i][input_target[i]], but I tried to do it without a loop, with something like:
matrix[:, input_target] = 1
but that fills the selected columns with 1 in every row.
Apparently the way to do it is:
matrix[range(input_target.shape[0]), input_target] = 1
The question is why this works, while using the colon does not.
Thanks!
You only wish to update one column for each row. Therefore, with advanced indexing you must explicitly provide those row identifiers:
A = np.zeros((4, 10))
A[np.arange(A.shape[0]), [2, 7, 0, 3]] = 1
Result:
array([[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]])
Using a colon for the row indexer will tell NumPy to update all rows for the specified columns:
A[:, [2, 7, 0, 3]] = 1
array([[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.]])
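To make the difference concrete, here is a small sketch (the variable names rows, cols, A and B are chosen here for illustration). With two index arrays, NumPy pairs them element by element, so only the positions (0, 2), (1, 7), (2, 0), (3, 3) are written. With a colon, every row is combined with every listed column.

```python
import numpy as np

cols = np.array([2, 7, 0, 3])
rows = np.arange(len(cols))       # [0, 1, 2, 3]

# Paired indexing: one (row, col) pair per element.
A = np.zeros((4, 10))
A[rows, cols] = 1
pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]

# Colon indexing: all rows x the listed columns.
B = np.zeros((4, 10))
B[:, cols] = 1

print(pairs)
print(int(A.sum()), int(B.sum()))
```

A ends up with 4 ones (one per row) and B with 16 (four columns set in each of the four rows).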

Extracting one-hot vector from text

In pandas or numpy, I can do the following to get one-hot vectors:
>>> import numpy as np
>>> import pandas as pd
>>> x = [0,2,1,4,3]
>>> pd.get_dummies(x).values
array([[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.]])
>>> np.eye(len(set(x)))[x]
array([[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.]])
From text, with gensim, I can do:
>>> from gensim.corpora import Dictionary
>>> sent1 = 'this is a foo bar sentence .'.split()
>>> sent2 = 'this is another foo bar sentence .'.split()
>>> texts = [sent1, sent2]
>>> vocab = Dictionary(texts)
>>> texts_idx = [[vocab.token2id[word] for word in sent] for sent in texts]
>>> texts_idx
[[3, 4, 0, 6, 1, 2, 5], [3, 4, 7, 6, 1, 2, 5]]
Then I have to apply the same pd.get_dummies or np.eye step to get the one-hot vectors, but with pd.get_dummies one dimension is missing: I have 8 unique words, yet the one-hot vectors have length 7:
>>> [pd.get_dummies(sent).values for sent in texts_idx]
[array([[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0.]]), array([[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.]])]
It seems like it's doing one-hot vector individually as it iterates through each sentence, instead of using the global vocabulary.
Using np.eye, I do get the right vectors:
>>> [np.eye(len(vocab))[sent] for sent in texts_idx]
[array([[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]]), array([[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]])]
Also, currently, I have to do several things from using gensim.corpora.Dictionary to converting the words to their ids then getting the one-hot vector.
Are there other ways to achieve the same one-hot vector from texts?
There are various packages that will do all the steps in a single function such as http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
Alternatively, if you already have your vocabulary and the text indexes for each sentence, you can create a one-hot encoding by preallocating an array and using advanced indexing. In the following, text_idx is a list of integers and vocab is a list relating integer indexes to words.
import numpy as np
vocab_size = len(vocab)
text_length = len(text_idx)
one_hot = np.zeros((vocab_size, text_length))
one_hot[text_idx, np.arange(text_length)] = 1
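A self-contained sketch of the preallocate-and-index idea, without gensim, might look like the following. The names word2id and one_hot, and the choice of assigning ids in sorted order, are assumptions made here for illustration, so the ids will not match the ones gensim produced in the question. Note also that this version puts one token per row (shape text_length x vocab_size), which is the transpose of the layout in the snippet above and matches the np.eye output in the question.

```python
import numpy as np

sent1 = 'this is a foo bar sentence .'.split()
sent2 = 'this is another foo bar sentence .'.split()
texts = [sent1, sent2]

# Global vocabulary shared by all sentences; ids follow sorted order
# (an arbitrary choice for this sketch).
vocab = sorted({word for sent in texts for word in sent})
word2id = {word: i for i, word in enumerate(vocab)}

def one_hot(sent):
    """Return a (len(sent), len(vocab)) array with one 1 per row."""
    text_idx = [word2id[w] for w in sent]
    out = np.zeros((len(text_idx), len(vocab)))
    out[np.arange(len(text_idx)), text_idx] = 1
    return out

encoded = [one_hot(sent) for sent in texts]
print(len(vocab), [e.shape for e in encoded])
```

Because the vocabulary is built once over all sentences, every vector has the full vocabulary length, which avoids the missing-dimension problem seen with a per-sentence pd.get_dummies call.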
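To create a one-hot vector, you first need a vocabulary of unique words from the text:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
vocab = sorted(set(vocab))  # unique words in a fixed order, so .index() works
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vocab)
one_hot_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
doc = "dog"  # assumes "dog" is in the vocabulary
index = vocab.index(doc)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
one_hot_vector = one_hot_encoder.fit_transform(integer_encoded)[index]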
The 7th value is the "." (dot) in your sentences: it is separated from the preceding word by a space, so split() counts it as a word.

How to set first column to a constant value of an empty np.zeros numPy matrix? [duplicate]

I'm working on setting some boundary conditions for a water table model, and I am able to set the entire first row to a constant value, but not the entire first column. I am using np.zeros((11, 1001)) to make an empty matrix. Does anyone know why I am successful at defining the first row, but not the first column? I have noted the line in question below.
import numpy as np
x = range(0, 110, 10)
time = range(0, 5005, 5)
xSize = len(x)
timeSize = len(time)
dx = 10
dt = 5
Sy = 0.1
k = 0.002
head = np.zeros((11, 1001))
head[0:][0] = 16 # sets the first row to 16
head[0][0:] = 16 # DOESN'T set the first column to 16
for t in time:
    for i in x[1:len(x)-1]:
        head[t+1][i] = head[t][i] + ((dt*k)/(2*Sy)) * (((head[t][i-1]**2) - (2*head[t][i]**2) + (head[t][i+1]**2)) / (dx**2))
All you have to do is change
head[0][0:] = 16
to
head[:, 0] = 16
If you want to change the first row you can just do:
head[0, :] = 16
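One way to see why the original attempt never touched the first column: chained indexing is applied left to right. head[0:] is the whole array, so head[0:][0] is the first row, and head[0][0:] is also the first row (all of it). A minimal sketch on a small array (the 3x4 shape and the names chained_a, chained_b, first_col are chosen here for illustration):

```python
import numpy as np

head = np.zeros((3, 4))
head[0:][0] = 16      # head[0:] is the whole array, so this is head[0] = 16
chained_a = head.copy()

head = np.zeros((3, 4))
head[0][0:] = 16      # head[0] is the first row; [0:] selects all of it
chained_b = head.copy()

head = np.zeros((3, 4))
head[:, 0] = 16       # all rows, column 0: this is the first column
first_col = head.copy()

print(chained_a)
print(first_col)
```

Both chained forms fill the first row, which is why only the single tuple-style index head[:, 0] reaches the column.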
EDIT:
Just in case you also wonder how you can change an arbitrary amount of values in an arbitrary row/column:
myArray = np.zeros((6, 6))
Now we set three values in row 2 to 16:
myArray[2, 1:4] = 16.
array([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 16., 16., 16., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.]])
The same works for columns:
myArray[2:5, 4] = -4.
array([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 16., 16., 16., -4., 0.],
[ 0., 0., 0., 0., -4., 0.],
[ 0., 0., 0., 0., -4., 0.],
[ 0., 0., 0., 0., 0., 0.]])
And if you then also want to change certain values in e.g. two different rows at the same time you can do it like this:
myArray[[0, 5], 0:3] = 10.
array([[ 10., 10., 10., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 16., 16., 16., -4., 0.],
[ 0., 0., 0., 0., -4., 0.],
[ 0., 0., 0., 0., -4., 0.],
[ 10., 10., 10., 0., 0., 0.]])
You can use a similar syntax.
In [12]: head = np.zeros((11,101))
In [13]: head
Out[13]:
array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
In [14]: head[:,0] = 42.0
In [15]: head
Out[15]:
array([[ 42., 0., 0., ..., 0., 0., 0.],
[ 42., 0., 0., ..., 0., 0., 0.],
[ 42., 0., 0., ..., 0., 0., 0.],
...,
[ 42., 0., 0., ..., 0., 0., 0.],
[ 42., 0., 0., ..., 0., 0., 0.],
[ 42., 0., 0., ..., 0., 0., 0.]])

Filling edges of a 2D list in python

I have been using the code
BOARD = [[0 for i in range(GRIDWIDTH)] for j in range(GRIDWIDTH)]
to create a 2 dimensional 'grid' for part of my pygame project.
I would like to know how I can fill in the edges of the grid with a different number, so I have all zeros in the centre of the grid, surrounded by ones on the edges.
Thanks!
BOARD = [[1] * GRIDWIDTH] + [
[1] + [0]*(GRIDWIDTH-2) + [1] for j in range(GRIDWIDTH-2)
] + [[1] * GRIDWIDTH]
I would suggest using numpy here:
>>> import numpy as np
>>> grid = np.zeros(shape=(10,10))
>>> grid[[0,-1],:] , grid[:,[0,-1]] = 1, 1
>>> grid
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
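Another option, sketched here under the assumption that GRIDWIDTH is at least 2, is to build the zero interior and let np.pad add a one-cell border of ones. The value 6 for GRIDWIDTH and the names inner, board and board_as_lists are chosen here for illustration only.

```python
import numpy as np

GRIDWIDTH = 6  # example size for this sketch

# Interior of zeros, then a border of width 1 filled with ones.
inner = np.zeros((GRIDWIDTH - 2, GRIDWIDTH - 2), dtype=int)
board = np.pad(inner, pad_width=1, mode='constant', constant_values=1)

# Convert back to a list of lists if the rest of the code expects that.
board_as_lists = board.tolist()
print(board)
```

The tolist() call is only needed if the pygame code indexes the board as BOARD[row][col] on plain lists.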

Latent Semantic Analysis in Python discrepancy

I'm trying to follow the Wikipedia Article on latent semantic indexing in Python using the following code:
documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.]])
u,s,vt = linalg.svd(documentTermMatrix, full_matrices=False)
sigma = diag(s)
## remove extra dimensions...
numberOfDimensions = 4
for i in range(4, len(sigma) -1):
sigma[i][i] = 0
queryVector = array([[ 0.], # same as first column in documentTermMatrix
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 1.],
[ 0.],
[ 0.],
[ 1.]])
How the math says it should work:
dtMatrixToQueryAgainst = dot(u, dot(s,vt))
queryVector = dot(inv(s), dot(transpose(u), queryVector))
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:,0]
# gives 'matrices are not aligned' error. should be 1 because they're the same
What does work, with math that looks incorrect (from here):
dtMatrixToQueryAgainst = dot(s, vt)
queryVector = dot(transpose(u), queryVector)
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainsst[:,0])
# gives 1, which is correct
Why does the second route work, and the first not, when everything I can find about the math of LSA shows the first as correct? I feel like I'm missing something obvious...
There are several inconsistencies in your code that cause errors before your point of confusion. This makes it difficult to understand exactly what you tried and why you are confused (clearly you did not run the code as it is pasted, or it would have thrown an exception earlier).
That said, if I follow your intent correctly, your first approach is nearly correct. Consider the following code:
from numpy import array, diag, dot, linalg
from numpy.linalg import inv
documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.]])
numDimensions = 4
u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)
u = u[:, :numDimensions]
sigma = diag(s)[:numDimensions, :numDimensions]
vt = vt[:numDimensions, :]
lowRankDocumentTermMatrix = dot(u, dot(sigma, vt))
queryVector = documentTermMatrix[:, 0]
lowDimensionalQuery = dot(inv(sigma), dot(u.T, queryVector))
print(lowDimensionalQuery)
print(vt[:, 0])
You should see that lowDimensionalQuery and vt[:,0] are equal. Think of vt as a representation of the documents in a low-dimensional subspace. First we map our query into that subspace to get lowDimensionalQuery, and then we compare it with the corresponding column of vt. Your mistake was trying to compare the transformed query to the document vector from lowRankDocumentTermMatrix, which lives in the original space. Since the transformed query has fewer elements than the "reconstructed" document, Python complained.
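The equality claimed above can be checked numerically. The following is a sketch that repeats the same steps with explicit np. prefixes and compares the projected query to the first column of the truncated vt. The names U_k, S_k, Vt_k, q_low, doc_low and cos are introduced here for illustration, and cosine similarity is computed inline since the question's cosineDistance helper was not shown.

```python
import numpy as np

A = np.array([[0., 1., 0., 1., 1., 0., 1.],
              [0., 1., 1., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0., 1., 1.],
              [0., 0., 0., 1., 0., 0., 0.],
              [0., 1., 1., 0., 0., 0., 0.],
              [1., 0., 0., 1., 0., 0., 0.],
              [0., 0., 0., 0., 1., 1., 0.],
              [0., 0., 1., 1., 0., 0., 0.],
              [1., 0., 0., 1., 0., 0., 0.]])

k = 4
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]             # first k left singular vectors
S_k = np.diag(s[:k])       # k x k diagonal matrix of singular values
Vt_k = Vt[:k, :]           # documents in the k-dimensional space

query = A[:, 0]            # same as the first document
q_low = np.linalg.inv(S_k) @ U_k.T @ query   # map the query into that space
doc_low = Vt_k[:, 0]                         # first document in that space

# Cosine similarity between the projected query and the projected document.
cos = q_low @ doc_low / (np.linalg.norm(q_low) * np.linalg.norm(doc_low))
print(np.allclose(q_low, doc_low), cos)
```

Both vectors have k elements here, so the comparison is well defined, unlike comparing against a column of the reconstructed 9x7 matrix.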
