Confused about X in GaussianHMM.fit([X])

With this code:
X = numpy.array(range(0,5))
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
I get
tuple index out of range
self.n_features = obs[0].shape[1]
So what are you supposed to pass .fit() exactly? The hidden states AND emissions in a tuple? If so in what order? The documentation is less than helpful.
I noticed it likes being passed tuples as this does not give an error:
X = numpy.column_stack([range(0,5),range(0,5)])
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
Edit:
Let me clarify a bit, the documentation indicates that the ordinality of the array must be:
List of array-like observation sequences (shape (n_i, n_features)).
This would almost indicate that you pass a tuple for each sample that indicates in a binary fashion which observations are present. However their example indicates otherwise:
# pack diff and volume for training
X = np.column_stack([diff, volume])
hence the confusion

It would appear the GaussianHMM function is for multivariate-emission-only HMM problems, hence the requirement to have >1 emission vectors. When the documentation refers to 'n_features' they are not referring to the number of ways emissions can express themselves but the number of orthogonal emission vectors.
Hence, "features" (the orthogonal emission vectors) are not to be confused with "symbols" which, in sklearn's parlance (which is likely shared with the greater hmm community for all I know), refer to what actual unique values the system is capable of emitting.
For univariate emission-vector problems, use MultinomialHMM.
Hope that clarifies things for anyone else who wants to use this stuff without becoming the world's foremost authority on HMMs :)
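The traceback in the question points at obs[0].shape[1], which fails for a 1-D array because its shape tuple has only one entry. Below is a minimal NumPy-only sketch of that difference, assuming a 2-D (n_samples, n_features) array is all the shape check needs. It does not call fit(), so whether a single-feature model trains well is not verified here.

```python
import numpy as np

# 1-D array, as in the question: shape is (5,), so shape[1] does not exist.
X_1d = np.array(range(0, 5))
try:
    X_1d.shape[1]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# Reshaping to a single column gives shape (5, 1): five samples, one feature.
X_col = X_1d.reshape(-1, 1)

# Stacking two columns, as in the working example, gives shape (5, 2).
X_two = np.column_stack([range(0, 5), range(0, 5)])

print(error_message)
print(X_1d.shape, X_col.shape, X_two.shape)
```

The names X_1d, X_col and X_two are illustrative only and do not come from the original code.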

I realize this is an old thread but the problem in the example code is still there. I believe the example is now at this link and still giving the same error:
tuple index out of range
self.n_features = obs[0].shape[1]
The offending line of code is:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit(X)
Which should be:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit([X])

Getting the value at the intersection of a row and column in a pytorch tensor matrix

I am new to PyTorch and want to get the value at a given index of a matrix. The matrix, psfm_s, has been initialized with
psfm_s=Var(torch.randn(12,20),requires_grad=True)
For example, I would like to get the number in the first row (out of 12 rows) and the first column (out of 20 columns).
I have tried something like
index=torch.tensor([0,0])
num_at_index=psfm_s[index]
to get the desired number, but that just gives me a tensor with a bunch of numbers in it, and I'm not really sure what this method does.
I just want the one number at the desired index, how can I go about doing this if it's even possible? Thanks for the help!

To reproduce the described code in full (for future reference, please provide a [mcve] in your question), and taking the already correct solution from @jodag in the comments, consider this code snippet:
from torch.autograd import Variable
import torch
psfm_s = Variable(torch.randn(12,20), requires_grad=True)
single_value = psfm_s[0,0].item()
print(single_value) # prints a single random number from your tensor
For some background, the official docs for item() say:
Returns the value of this tensor as a standard Python number. This
only works for tensors with one element. For other cases, see tolist().
This operation is not differentiable.
Consequently, getting a complete row (or column), would look like this:
from torch.autograd import Variable
import torch
psfm_s = Variable(torch.randn(12,20), requires_grad=True)
single_row_tensor = psfm_s[0,:]
single_row_list = single_row_tensor.tolist()
single_row_numpy_1 = single_row_tensor.data.numpy()
single_row_numpy_2 = single_row_tensor.detach().numpy()
# the following doesn't work, because the tensor requires grad (it raises a RuntimeError):
# single_row_fail = single_row_tensor.numpy()
If you want a NumPy array, be careful not to call .numpy() on the tensor directly, as this conflicts with the gradient history of the Variable. Use either .data.numpy() or .detach().numpy().
There seems to be some discussion as to which one is preferred, but both should work for your case.
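The question also asks what psfm_s[index] does when index is a tensor holding [0, 0]. Indexing with an integer array selects whole rows, so [0, 0] returns row 0 twice rather than one element. Here is a small sketch using NumPy as a stand-in (an assumption made to keep the example light on dependencies; a PyTorch tensor indexed by an integer tensor selects rows in the same way):

```python
import numpy as np

matrix = np.arange(12 * 20).reshape(12, 20)  # 12 rows, 20 columns

index = np.array([0, 0])
rows = matrix[index]    # picks row 0 twice -> shape (2, 20)
single = matrix[0, 0]   # one row index and one column index -> a scalar

print(rows.shape)
print(single)
```

This is why the attempt in the question returned "a bunch of numbers": it asked for two full rows instead of one row index and one column index.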

scikit-learn output interpretation

When using sklearn, I sometimes have issues correctly assigning the output to the right label. When calling different methods on the result of a fit, sklearn only returns numpy arrays with no further labeling. For example, fitting a simple LDA that is trying to classify into two different groups will give me this output.
result = sklearn_lda.fit(X_train, y_train)
print("Prior probabilities are: \n", result.priors_)
print("Group means are: \n", result.means_)
Output
Prior probabilities are:
[0.49198397 0.50801603]
Group means are:
[[ 0.04279022 0.03389409]
[-0.03954635 -0.03132544]]
How do I know which prior probability is associated with which class label? Same with the group means. For coefficients I know that sklearn outputs them in the same order as they are put in. In this case I am a little confused.

Use result.classes_ to get the array of classes seen by the model.
All other attributes will be in the order of this array.
This array is normally in sorted order. So if you have classes A and B, then the order will be:
['A', 'B']
Please see the documentation for available attributes.
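To make the ordering concrete, here is a sketch that pairs each label with its prior. It uses plain NumPy rather than a fitted estimator: np.unique returns the sorted distinct labels, which is assumed here to mirror the order of result.classes_, and the priors are simple class frequencies. The labels and variable names are hypothetical.

```python
import numpy as np

# Hypothetical training labels; "B" deliberately appears first.
y_train = np.array(["B", "A", "B", "A", "B"])

# Sorted distinct labels and their counts, in matching order.
classes, counts = np.unique(y_train, return_counts=True)
priors = counts / len(y_train)

# Pair each label with the value at the same position.
prior_by_class = {str(label): float(p) for label, p in zip(classes, priors)}
print(prior_by_class)
```

With a fitted model, the same pairing would be dict(zip(result.classes_, result.priors_)).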

Multiple issues with axes while implementing a Seq2Seq with attention in CNTK

I'm trying to implement a Seq2Seq model with attention in CNTK, something very similar to CNTK Tutorial 204. However, several small differences lead to various issues and error messages, which I don't understand. There are many questions here, which are probably interconnected and all stem from some single thing I don't understand.
Note (in case it's important). My input data comes from MinibatchSourceFromData, created from NumPy arrays that fit in RAM, I don't store it in a CTF.
ins = C.sequence.input_variable(input_dim, name="in", sequence_axis=inAxis)
y = C.sequence.input_variable(label_dim, name="y", sequence_axis=outAxis)
Thus, the shapes are [#, *](input_dim) and [#, *](label_dim).
Question 1: When I run the CNTK 204 Tutorial and dump its graph into a .dot file using cntk.logging.plot, I see that its input shapes are [#](-2,). How is this possible?
Where did the sequence axis (*) disappear?
How can a dimension be negative?
Question 2: In the same tutorial, we have attention_axis = -3. I don't understand this. In my model there are 2 dynamic axes and 1 static axis, so the "third to last" axis would be #, the batch axis. But attention definitely shouldn't be computed over the batch axis.
I hoped that looking at the actual axes in the tutorial code would help me understand this, but the [#](-2,) issue above made this even more confusing.
Setting attention_axis to -2 gives the following error:
RuntimeError: Times: The left operand 'Placeholder('stab_result', [#, outAxis], [128])'
rank (1) must be >= #axes (2) being reduced over.
during creation of the training-time model:
def train_model(m):
    @C.Function
    def model(ins: InputSequence[Tensor[input_dim]],
              labels: OutputSequence[Tensor[label_dim]]):
        past_labels = Delay(initial_state=C.Constant(seq_start_encoding))(labels)
        return m(ins, past_labels) #<<<<<<<<<<<<<< HERE
    return model
where stab_result is a Stabilizer right before the final Dense layer in the decoder. I can see in the dot-file that there are spurious trailing dimensions of size 1 that appear in the middle of the AttentionModel implementation.
Setting attention_axis to -1 gives the following error:
RuntimeError: Binary elementwise operation ElementTimes: Left operand 'Output('Block346442_Output_0', [#, outAxis], [64])'
shape '[64]' is not compatible with right operand
'Output('attention_weights', [#, outAxis], [200])' shape '[200]'.
where 64 is my attention_dim and 200 is my attention_span. As I understand, the elementwise * inside the attention model definitely shouldn't be conflating these two together, therefore -1 is definitely not the right axis here.
Question 3: Is my understanding above correct? What should be the right axis and why is it causing one of the two exceptions above?
Thanks for the explanations!

First, some good news: A couple of things have been fixed in the AttentionModel in the latest master (will be generally available with CNTK 2.2 in a few days):
You don't need to specify an attention_span or an attention_axis. If you don't specify them and leave them at their default values, the attention is computed over the whole sequence. In fact these arguments have been deprecated.
If you do the above, the 204 notebook runs 2x faster, so the 204 notebook does not use these arguments anymore.
A bug has been fixed in the AttentionModel and it now faithfully implements the Bahdanau et al. paper.
Regarding your questions:
The dimension is not negative. We use certain negative numbers in various places to mean certain things: -1 is a dimension that will be inferred once based on the first minibatch, -2 is I think the shape of a placeholder, and -3 is a dimension that will be inferred with each minibatch (such as when you feed variable sized images to convolutions). I think if you print the graph after the first minibatch, you should see all shapes are concrete.
attention_axis is an implementation detail that should have been hidden. Basically attention_axis=-3 will create a shape of (1, 1, 200), attention_axis=-4 will create a shape of (1, 1, 1, 200) and so on. In general anything more than -3 is not guaranteed to work and anything less than -3 just adds more 1s without any clear benefit. The good news of course is that you can just ignore this argument in the latest master.
TL;DR: If you are in master (or starting with CNTK 2.2 in a few days) replace AttentionModel(attention_dim, attention_span=200, attention_axis=-3) with
AttentionModel(attention_dim). It is faster and does not contain confusing arguments. Starting from CNTK 2.2 the original API is deprecated.

How to create two dimensional set objects under pyomo.environ module

I tried to create an LP model using pyomo.environ. However, I'm having a hard time creating sets. For my problem, I have to create two sets: one from a bunch of nodes, and the other from several arcs between nodes. I create a network using NetworkX to store my nodes and arcs.
The node data is saved as (Longitude, Latitude) tuples. The arcs are saved as (nodeA, nodeB), where nodeA and nodeB are both coordinate tuples.
So, a node is something like:
(-97.97516252657978, 30.342243012086083)
And, an arc is something like:
((-97.97516252657978, 30.342243012086083),
(-97.976196300350608, 30.34247219922803))
The way I tried to create the sets is as follows:
import pyomo.environ as pe
# create a model m
m = pe.ConcreteModel()
# network is an object I created by Networkx module
m.node_set = pe.Set(initialize= self.network.nodes())
m.arc_set = pe.Set(initialize= self.network.edges())
However, I kept getting an error message on arc_set.
ValueError: The value=(-97.97516252657978, 30.342243012086083,
-97.976196300350608, 30.34247219922803) does not have dimension=2,
which is needed for set=arc_set
I found it weird that my arc_set somehow turned into one tuple instead of two. I then tried converting my nodes and arcs into strings but still got the error.
Could somebody give me a hint, or show me how to fix this bug?
Thanks!

Underneath the hood, Pyomo "flattens" all indexing sets. That is, it removes nested tuples so that each set member is a single tuple of scalar values. This is generally consistent with other algebraic modeling languages, and helps to make sure that we can consistently (and correctly) retrieve component members regardless of how the user attempted to query them.
In your case, Pyomo will want each member of the arc set as a single 4-member tuple. There is a utility in PyUtilib that you can use to flatten your tuples when constructing the set:
from pyutilib.misc import flatten
m.arc_set = pe.Set(initialize=(tuple(flatten(x)) for x in self.network.edges()))
You can also perform some error checking, in this case to make sure that all edges start and end at known nodes:
from pyutilib.misc import flatten
m.node_set = pe.Set(initialize=self.network.nodes())
m.arc_set = pe.Set(
    within=m.node_set*m.node_set,
    initialize=(tuple(flatten(x)) for x in self.network.edges()))
This is particularly important for models like this where you are using floating point numbers as indices, and subtle round-off errors can produce indices that are nearly the same but not mathematically equal.
There has been some discussion among the developers to support both structured and flattened indices, but we have not quite reached consensus on how to best support it in a backwards compatible manner.
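To see what flattening does to an arc, here is a dependency-free sketch. flatten_arc is a hypothetical helper written for illustration, not part of Pyomo or PyUtilib. It only shows how a pair of coordinate tuples becomes the single 4-member tuple described above.

```python
def flatten_arc(item):
    """Recursively flatten nested tuples into one flat tuple of scalars."""
    flat = []
    for element in item:
        if isinstance(element, tuple):
            flat.extend(flatten_arc(element))
        else:
            flat.append(element)
    return tuple(flat)


node_a = (-97.97516252657978, 30.342243012086083)
node_b = (-97.976196300350608, 30.34247219922803)
arc = (node_a, node_b)

flat_arc = flatten_arc(arc)
print(len(arc), len(flat_arc))  # 2 nested members vs 4 scalar members
```

The flattened tuple matches the value reported in the ValueError from the question, which is the form the set ends up storing.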

What exactly does the "returned value" in langid.py mean?

Besides the correct language ID, langid.py returns a certain value: "The value returned is a score for the language. It is not a probability estimate, as it is not normalized by the document probability since this is unnecessary for classification."
But what does the value mean?

I'm actually the author of langid.py. Unfortunately, I've only just spotted this question now, almost a year after it was asked. I've tidied up the handling of the normalization since this question was asked, so all the README examples have been updated to show actual probabilities.
The value that you see there (and that you can still get by turning normalization off) is the un-normalized log-probability of the document. Because log/exp are monotonic, we don't actually need to compute the probability to decide the most likely class. The actual value of this log-prob is not actually of any use to the user. I should probably have never included it, and I may remove its output in the future.
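The point about monotonicity can be illustrated with a small sketch. The scores below are made-up log-scores, not langid.py output, and normalize_log_scores is a hypothetical helper. It applies the usual log-sum-exp normalization to turn unnormalized log-probabilities into probabilities, and shows that the winning class is the same before and after.

```python
import math


def normalize_log_scores(log_scores):
    """Convert unnormalized log-probabilities into probabilities that sum to 1."""
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


# Hypothetical unnormalized log-scores for three languages.
log_scores = {"en": -55.1, "de": -60.3, "fr": -58.7}

probs = normalize_log_scores(log_scores)
best_by_log = max(log_scores, key=log_scores.get)
best_by_prob = max(probs, key=probs.get)
print(best_by_log, best_by_prob)
```

Because the ranking is unchanged, the raw log-score is enough to pick a language, while the normalized value is the one that can be read as a probability.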

I think this is the important chunk of langid.py code:
def nb_classify(fv):
    # compute the log-factorial of each element of the vector
    logfv = logfac(fv).astype(float)
    # compute the probability of the document given each class
    pdc = np.dot(fv,nb_ptc) - logfv.sum()
    # compute the probability of the document in each class
    pd = pdc + nb_pc
    # select the most likely class
    cl = np.argmax(pd)
    # turn the pd into a probability distribution
    pd /= pd.sum()
    return cl, pd[cl]
It looks to me that the author is calculating something like the multinomial log-posterior of the data for each of the possible languages. logfv calculates the logarithm of the denominator of the PMF (x_1!...x_k!). np.dot(fv,nb_ptc) calculates the logarithm of the p_1^x_1...p_k^x_k term. So, pdc looks like the list of language-conditional log-likelihoods (except that it's missing the n! term). nb_pc looks like the (log) prior probabilities, so pd would be the log-posteriors. The normalization line, pd /= pd.sum(), confuses me, since one usually normalizes probability-like values (not log-probability values). Also, the examples in the documentation (('en', -55.106250761034801)) don't look like they've been normalized. Maybe they were generated before the normalization line was added?
Anyway, the short answer is that this value, pd[cl], is a confidence score. My understanding based on the current code is that they should be values between 0 and 1/97 (since there are 97 languages), with a smaller value indicating higher confidence.

Looks like a value that tells you how certain the engine is that it guessed the correct language for the document. I think generally the closer to 0 the number, the more sure it is, but you should be able to test that by mixing languages together and passing them in to see what values you get out. It allows you to fine tune your program when using langid depending upon what you consider 'close enough' to count as a match.
