Extracting one-hot vector from text

Extracting one-hot vector from text - python

In pandas or numpy, I can do the following to get one-hot vectors:
>>> import numpy as np
>>> import pandas as pd
>>> x = [0,2,1,4,3]
>>> pd.get_dummies(x).values
array([[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.]])
>>> np.eye(len(set(x)))[x]
array([[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.]])
From text, with gensim, I can do:
>>> from gensim.corpora import Dictionary
>>> sent1 = 'this is a foo bar sentence .'.split()
>>> sent2 = 'this is another foo bar sentence .'.split()
>>> texts = [sent1, sent2]
>>> vocab = Dictionary(texts)
>>> [[vocab.token2id[word] for word in sent] for sent in texts]
[[3, 4, 0, 6, 1, 2, 5], [3, 4, 7, 6, 1, 2, 5]]
Then I'll have to do the same pd.get_dummies or np.eyes to get the one-hot vector but I get an error where there's one dimension missing from my one-hot vector I have 8 unique words but the one-hot vector lengths are only 7:
>>> [pd.get_dummies(sent).values for sent in texts_idx]
[array([[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0.]]), array([[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.]])]
It seems like it's doing one-hot vector individually as it iterates through each sentence, instead of using the global vocabulary.
Using np.eye, I do get the right vectors:
>>> [np.eye(len(vocab))[sent] for sent in texts_idx]
[array([[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]]), array([[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]])]
Also, currently, I have to do several things from using gensim.corpora.Dictionary to converting the words to their ids then getting the one-hot vector.
Are there other ways to achieve the same one-hot vector from texts?

There are various packages that will do all the steps in a single function such as http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
Alternatively, if you have your vocabulary and text indexes for each sentence already, you can create a one-hot encoding by preallocating and using smart indexing. In the following text_idx is a list of integers and vocab is a list relating integers indexes to words.
import numpy as np
vocab_size = len(vocab)
text_length = len(text_idx)
one_hot = np.zeros(([vocab_size, text_length])
one_hot[text_idx, np.arange(text_length)] = 1

to create one_hot_vector, you need to create unique vocabulary from text
vocab=set(vocab)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vocab)
one_hot_encoder = OneHotEncoder(sparse=False)
doc = "dog"
index=vocab.index(doc)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
one_hot_encoder=one_hot_encoder.fit_transform(integer_encoded)[index]

The 7th value is the "."(Dot) in your sentences separated by a " "(space) and split() counts it as a word !!

Related

Trying to compare different sized one-hot-encoded lists

I have run an autoencoder model, and returned a dictionary with each output and it's label, using FashionMNIST. My goal is to print 10 images only for the dress and coat class (class labels 3 and 4). I have one-hot-encoded the labels such that the dress class appears as [0.,0,.0,1.,0.,0.,0.,0.,0.]. My dictionary output is:
print(pa). #dictionary is called pa
{'output': array([[1.5346111e-04, 2.3307074e-04, 2.8705355e-04, ..., 1.9890528e-04,
1.8257453e-04, 2.0764180e-04],
[1.9767908e-03, 1.5839143e-03, 1.7811939e-03, ..., 1.7838757e-03,
1.4038634e-03, 2.3405524e-03],
[5.8998094e-06, 6.9388111e-06, 5.8752844e-06, ..., 5.1715115e-06,
4.4670110e-06, 1.2018012e-05],
...,
[2.1034568e-05, 3.0344427e-05, 7.0048365e-05, ..., 9.4724113e-05,
8.9003828e-05, 4.1828611e-05],
[2.7930623e-06, 3.0393956e-06, 4.5835086e-06, ..., 3.8765144e-04,
3.6324131e-05, 5.6411723e-06],
[1.2453397e-04, 1.1948447e-04, 2.0121646e-04, ..., 1.0773790e-03,
2.9582143e-04, 1.7229551e-04]], dtype=float32),
'label': array([[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)}
I am trying to run a for loop, where if the pa['label'] is equal to a certain one-hot-encoded array, I plot the corresponding pa['output'].
for i in range(len(pa['label'])):
if pa['label'][i] == np.array([0.,0.,0.,1.,0.,0.,0.,0.,0.]):
print(pa['lable'][i])
# plt.imshow(pa['output'][i].reshape(28,28))
# plt.show()
However, I get a warning(?):
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
I have also tried making a list of arrays of the one-hot-encoded arrays i want to plot and trying to compare my dictionary label to this array (different sized arrays):
clothing_array = np.array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
for i in range(len(pa['label'])):
if (pa['label'][i] == clothing_array[i]).any():
plt.imshow(pa['output'][i].reshape(28,28))
plt.show()
However, it plots a picture of a tshirt, a bag, and then i get the error
IndexError: index 2 is out of bounds for axis 0 with size 2
Which i understand since clothing_array only has two indices. But obviously this code is wrong since I want to print ONLY dress and coat. I don't know why it's printing these images and i don't know how to fix it. Any help or clarifying questions are more than welcome.
Here are the first ten arrays of my dictionary labels:
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

I will post an example here.
Here we have two arrays for you x is the label array and y the clothing . You can get in z the ones that are identical (the indexes). Finally by using the matching_indexes you can collect the onces you want from output and plot them
x = np.array([[1., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.]])
y = np.array([[1.,0.,0.,0.,0.,0.,0.]])
z= np.multiply(x,y)
matching_indexes = np.where(z.any(axis=1))[0]

How does dim argument of "Tensor.scatter_" method in PyTorch work?

Could anyone teach me why the below code uses dim=1 in the scatter_ method? The meaning of the attached codes is for one-hot encoding. I tried to read the PyTorch document example and thought I should use dim=0 for the desired result. However, the result has shown that dim=1 is correct instead.
>>> target = torch.tensor([3, 5, 0, 2, 7, 5])
>>> target
tensor([3, 5, 0, 2, 7, 5])
>>> onehot = torch.zeros(target.shape[0], 8)
>>> onehot.scatter_(1, target.unsqueeze(1), 1.0)
tensor([[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 1., 0., 0.]])

You are applying scatter on a zero tensor onehot shaped (len(target), 8) on dim=1 using target as input and 1. as value. This will have the following effect on onehot:
onehot[i][target[i][j]] = 1.
This means for every row in target it will look at the unique value since j is always equal to 1 and use it to index the 2nd axis of onehot. In other words, for every row, it takes the value from target to position the 1. among the columns of onehot.
Step by step illustration would be:
>>> for i in range(len(target)):
... k = target[i] # k, depends on values of target i.e. dim=1
... onehot[i, k] = 1
... print(onehot)
tensor([[0., 0., 0., 1., 0., 0., 0., 0.], # i=0; k=3
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.], # i=1; k=5
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0.], # i=2; k=0
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0.], # i=3; k=2
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1.], # i=4; k=7
[0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 1., 0., 0.]]) # i=5; k=5
Notice that onehot.scatter_(0, target.unsqueeze(1), 1.0) would have produced:
onehot[target[i][j]][j] = 1.
Which is a valid operation only if you initialize onehot the other way around:
>>> onehot = torch.zeros(8, len(target))
>>> onehot.scatter_(0, target.unsqueeze(1), 1.)
tensor([[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.]])
And you get the transpose of the other matrix.

How can I plot hysteresis in matplotlib?

I am trying to plot a the development of a pitchfork bifurcation over time. The relationship between x and y starts off approximately linear, but ends up being a sigmoidal S shape. The final relationship is not a function; there are multiple y values for some values of x.
Matplotlib does nice wire frames for surface plots, but these surface plots don't seem to be able to handle non-functions.
Is there another way of plotting just the surface of this relationship? (If possible I don't want a solid shape.)
At the moment my data is in zero arrays where 1s indicate an approximation to the location of the surface.
I've included a very small sample data set, and sample code that will plot of their location. How do I 'join the dots'?
My actual data sets are larger (500x200x200) and varied, so I need to develop a flexible system.
This is what the final figure might look like:
From reading mplot3d documentation here it seems that I may need to convert my data to 2D arrays.
If this is the case please could you provide a method for this, and if possible please tell me what these arrays represent.
I greatly appreciate any comment/suggestions that will advance this.
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
sample_data = np.array([
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]
] )
XS, YS, ZS = [],[],[]
for g in xrange(np.shape(sample_data)[0]):
for row in xrange(np.shape(sample_data)[1]):
for col in xrange(np.shape(sample_data)[2]):
if sample_data[g][row][col] == 1:
XS.append(g)
YS.append(col)
ZS.append(row)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(XS, YS, ZS)
plt.show()

As suggested by mrcl, to do this in matplotlib you can use trisurf.
However, you have to provide your own triangles as Delaunay won't work on the 2d projection of your points.
To build the triangulation, I suggest to build a parametric representation of your surfece (in terms of s, t) and triangulate in the space (s, t).
It will give something like this
Exemple based on your code below (as your data is very coarse, I added a bit of interpolation):
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
from matplotlib import cm
sample_data = np.array([
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]
] )
XS, YS, ZS = [],[],[]
for g in xrange(np.shape(sample_data)[0]):
for row in xrange(np.shape(sample_data)[1]):
for col in xrange(np.shape(sample_data)[2]):
if sample_data[g][row][col] == 1:
XS.append(g)
YS.append(col)
ZS.append(row)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(XS, YS, ZS)
XS = np.asarray(XS)
YS = np.asarray(YS)
ZS = np.asarray(ZS)
def re_ordinate(x, y):
ord = np.arange(np.shape(x)[0])
iter = True
itermax = 10
n_iter = 0
while iter and n_iter < itermax:
n_iter += 1
dist1 = (x[0:-2] - x[1:-1])**2 + (y[0:-2] - y[1:-1])**2
dist2 = (x[0:-2] - x[2:])**2 + (y[0:-2] - y[2:])**2
swap = np.argwhere(dist2 < dist1)
for s in swap:
s += 1
t = x[s]
x[s] = x[s+1]
x[s+1] = t
t = y[s]
y[s] = y[s+1]
y[s+1] = t
t = ord[s]
ord[s] = ord[s+1]
ord[s+1] = t
return ord / float(np.size(ord, 0))
# Building parametrisation of the surface
s = np.zeros(np.shape(XS)[0])
t = np.zeros(np.shape(XS)[0])
begin = 0
end = 0
for g in xrange(np.shape(sample_data)[0]):
cut = np.argwhere(XS==g).flatten()
begin = end
end += np.size(cut, 0)
X_loc = XS[cut]
Y_loc = YS[cut]
Z_loc = ZS[cut]
s[begin: end] = g / float(np.size(sample_data, 0))
t[begin: end] = re_ordinate(Y_loc, Z_loc)
#ax.plot(X_loc, Y_loc, Z_loc, color="grey")
triangles = mtri.Triangulation(s, t).triangles
refiner = mtri.UniformTriRefiner(mtri.Triangulation(s, t))
subdiv = 2
_, x_refi = refiner.refine_field(XS, subdiv=subdiv)
_, y_refi = refiner.refine_field(YS, subdiv=subdiv)
triang_param, z_refi = refiner.refine_field(ZS, subdiv=subdiv)
#triang_param = refiner.refine_triangulation()#mtri.Triangulation(XS, YS, triangles)
#print triang_param.triangles
triang = mtri.Triangulation(x_refi, y_refi, triang_param.triangles)
ax.plot_trisurf(triang, z_refi, cmap=cm.jet, lw=0.)
plt.show()

You can use
ax.plot_trisurf(XS, YS, ZS)
instead of
ax.scartter(XS, YS, ZS)
But as tcaswell has commented, mayavi will give you better performance.
Cheers

How to append a vector to a matrix in python

I want to append a vector to a matrix in python. I tried append or concatenate methods but I didn't get the answer. I was previously working with Matlab and there I used this:
m = zeros(10, 4) % define my matrix, 10x4
v = ones(10, 1) % my vecto, 10x1
c = [m,v] % so simple! the result is: 10x5 (the vector added as the last column)
How can I do that in python using numpy?

You're looking for np.r_ and np.c_. (Think "column stack" and "row stack" (which are also functions) but with matlab-style range generations.)
Also see np.concatenate, np.vstack, np.hstack, np.dstack, np.row_stack, np.column_stack etc.
For example:
import numpy as np
m = np.zeros((10, 4))
v = np.ones((10, 1))
c = np.c_[m, v]
Yields:
array([[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.]])
This is also equivalent to np.hstack([m, v]) or np.column_stack([m, v])
If you're not coming from matlab, hstack and column_stack probably seem much more readable and descriptive. (And they're arguably better in this case for that reason.)
However, np.c_ and np.r_ have additional functionality that folks coming from matlab tend to expect. For example:
In [7]: np.r_[1:5, 2]
Out[7]: array([1, 2, 3, 4, 2])
Or:
In [8]: np.c_[m, 0:10]
Out[8]:
array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 2.],
[ 0., 0., 0., 0., 3.],
[ 0., 0., 0., 0., 4.],
[ 0., 0., 0., 0., 5.],
[ 0., 0., 0., 0., 6.],
[ 0., 0., 0., 0., 7.],
[ 0., 0., 0., 0., 8.],
[ 0., 0., 0., 0., 9.]])
At any rate, for matlab folks, it's handy to know about np.r_ and np.c_ in addition to vstack, hstack, etc.

In numpy it is similar:
>>> m=np.zeros((10,4))
>>> m
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>> v=np.ones((10,1))
>>> v
array([[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.]])
>>> np.c_[m,v]
array([[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 1.]])

Latent Semantic Analysis in Python discrepancy

I'm trying to follow the Wikipedia Article on latent semantic indexing in Python using the following code:
documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.]])
u,s,vt = linalg.svd(documentTermMatrix, full_matrices=False)
sigma = diag(s)
## remove extra dimensions...
numberOfDimensions = 4
for i in range(4, len(sigma) -1):
sigma[i][i] = 0
queryVector = array([[ 0.], # same as first column in documentTermMatrix
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 1.],
[ 0.],
[ 0.],
[ 1.]])
How the math says it should work:
dtMatrixToQueryAgainst = dot(u, dot(s,vt))
queryVector = dot(inv(s), dot(transpose(u), queryVector))
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:,0]
# gives 'matrices are not aligned' error. should be 1 because they're the same
What does work, with math that looks incorrect: ( from here)
dtMatrixToQueryAgainst = dot(s, vt)
queryVector = dot(transpose(u), queryVector)
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainsst[:,0])
# gives 1, which is correct
Why does route work, and the first not, when everything I can find about the math of LSA shows the first as correct? I feel like I'm missing something obvious...

There are several inconsistencies in your code that cause errors before your point of confusion. This makes it difficult to understand exactly what you tried and why you are confused (clearly you did not run the code as it is pasted, or it would have thrown an exception earlier).
That said, if I follow your intent correctly, your first approach is nearly correct. Consider the following code:
documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.]])
numDimensions = 4
u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)
u = u[:, :numDimensions]
sigma = diag(s)[:numDimensions, :numDimensions]
vt = vt[:numDimensions, :]
lowRankDocumentTermMatrix = dot(u, dot(sigma, vt))
queryVector = documentTermMatrix[:, 0]
lowDimensionalQuery = dot(inv(sigma), dot(u.T, queryVector))
lowDimensionalQuery
vt[:,0]
You should see that lowDimensionalQuery and vt[:,0] are equal. Think of vt as a representation of the documents in a low-dimensional subspace. First we map our query into that subspace to get lowDimensionalQuery, and then we compare it with the corresponding column of vt. Your mistake was trying to compare the transformed query to the document vector from lowRankDocumentTermMatrix, which lives in the original space. Since the transformed query has fewer elements than the "reconstructed" document, Python complained.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting one-hot vector from text - python

The 7th value is the "."(Dot) in your sentences separated by a " "(space) and split() counts it as a word !!

Related

Trying to compare different sized one-hot-encoded lists

How does dim argument of "Tensor.scatter_" method in PyTorch work?

How can I plot hysteresis in matplotlib?

How to append a vector to a matrix in python

Latent Semantic Analysis in Python discrepancy

Categories

Resources