Problem:
I'm using a GloVe pre-trained model and its vectors to retrain my model on a specific domain, say #cars. After training I want to find similar words within my domain, but I get words not in my domain corpus; I believe they come from GloVe's vectors.
model_2.most_similar(positive=['spacious'], topn=10)
[('bedrooms', 0.6275501251220703),
('roomy', 0.6149100065231323),
('luxurious', 0.6105825901031494),
('rooms', 0.5935696363449097),
('furnished', 0.5897485613822937),
('cramped', 0.5892841219902039),
('courtyard', 0.5721820592880249),
('bathrooms', 0.5618442893028259),
('opulent', 0.5592212677001953),
('expansive', 0.555268406867981)]
Here I expect something like leg-room, the car's spacious features mentioned in the domain's corpus. How can we exclude the GloVe vectors while still getting similar vectors?
Thanks
There may not be enough info in a simple set of generic word-vectors to filter neighbors by domain-of-use.
You could try using a mixed weighting: combine the similarities to 'spacious' and to 'cars', and return the top results on that combined score – it might help a little.
Supplying more than one positive word to the most_similar() method might approximate this. If you're sure of some major sources of interference/overlap, you might even be able to use negative word examples, similar to how word2vec finds candidate answers for analogies (though this might also suppress useful results that are legitimately related to both domains, like 'roomy'). For example:
candidates = vec_model.most_similar(positive=['spacious', 'car'],
                                    negative=['house'])
(Instead of using single words like 'car' or 'house' you could also try using vectors combined from many words that define a domain.)
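For instance, a rough sketch of that combined-vector idea (here vec_model is assumed to be a gensim KeyedVectors instance, and the word lists are stand-ins you'd choose for your own domains):

import numpy as np

# Assumed placeholder word lists that loosely define each domain
car_words = ['car', 'vehicle', 'sedan', 'trunk']
house_words = ['house', 'apartment', 'kitchen']

# Average several word-vectors into one rough 'domain' vector
car_vec = np.mean([vec_model[w] for w in car_words], axis=0)
house_vec = np.mean([vec_model[w] for w in house_words], axis=0)

# most_similar() accepts raw vectors alongside words, so the combined
# domain vectors can be mixed into the positive/negative examples
candidates = vec_model.most_similar(positive=['spacious', car_vec],
                                    negative=[house_vec], topn=10)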
But a sharp distinction sounds like a research project, rather than something easily possible with off-the-shelf libraries/vectors – and may require more sophisticated approaches and datasets.
You could also try using a set of vectors trained only on a dataset of text from the domain of interest – thus ensuring the vocabulary, and senses, of words are all in that domain.
You cannot exclude the words from an already-trained model. I don't know which framework you're working in, but I'll give you an example in Keras, as it's simple to understand the intent.
What you could do is use an Embedding layer, populate it with GloVe's "knowledge", and then resume training with your corpus, so that the layer learns the words and fits them to your specific domain. You can read more about it in the Keras blog.
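For example, a minimal sketch of that setup (word_index and glove_vectors are assumed helpers: a word-to-integer mapping built from your corpus, and a word-to-array dict parsed from a GloVe file):

import numpy as np
from keras.layers import Embedding

embedding_dim = 100  # must match the GloVe file you loaded

# Build a matrix whose row i holds the GloVe vector for word i
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:  # words missing from GloVe stay all-zeros
        embedding_matrix[i] = vector

embedding_layer = Embedding(len(word_index) + 1,
                            embedding_dim,
                            weights=[embedding_matrix],
                            trainable=True)  # keep training, so GloVe adapts to your domain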
Related
I'm doing an LDA topic model on a medium-sized corpus using gensim in Python.
We already know roughly some of the topics we're expecting. In particular, we know that a particular topic definitely exists within the corpus and we want the model to find that topic for us so that we can extract the elements of the corpus that fall under that topic.
Is there a way of manually setting the initial conditions of one of your topics in gensim to give the model a shove in the 'right' direction?
The idea would be to take a handful of known examples of the target topics and set the probabilities of each word to its frequency within the known examples. Or something in the neighborhood of that idea.
Thanks in advance for your help!
As LDA is traditionally an unsupervised method, it's more common to let it tell you what topics it finds by its rules, then see which (if any) of those match your preconceptions.
Gensim has no way to pre-seed an LDA model/session with biases towards finding/defining certain topics.
You might use your conceptions of a topic that "should" exist, or certain documents that "should" be together, to tune your choice of other parameters to ensure final results best meet that goal, or to postprocess the LDA results with labeling/combinations to match your desired groupings.
But also, if one topic is of preeminent importance, or has your best set of labeled training examples, you may want to consider training a binary classifier to predict whether documents are in that topic or not. Or, as your set of preferred topics with labeled examples grows, a multi-label classifier to assign documents to topics.
Classifiers are the more appropriate tool when you want a system to deduce known categories, though of course hybrid approaches can also be useful. For example, LDA runs may help suggest new categories, and the outputs of an LDA run could be added as features to assist downstream supervised classifiers. Or documents decorated with extra tokens from supervised classification could be analyzed by downstream LDA.
(In fact, simply decorating documents that are in a desired known category with an extra synthetic token representing that category might be an interesting way to bias an LDA toward reflecting those categories, but you'd want a rigorous evaluation process for deciding whether such a hack was overall improving your true end goals or not.)
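A minimal sketch of that decoration hack (labeled_docs is an assumed list of (tokens, label-or-None) pairs; the synthetic token just needs to be a string that can't collide with real vocabulary):

def decorate(tokens, known_label=None):
    # Append a synthetic token to documents with a known category
    if known_label is not None:
        return tokens + ['_LABEL_%s_' % known_label]
    return tokens

decorated_corpus = [decorate(tokens, label) for tokens, label in labeled_docs]
# ...then build the gensim Dictionary/BOW corpus and run LdaModel as usual,
# checking whether topics now align better with the known categories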
I'm working on a word2vec model in order to analyze a corpus of newspaper articles.
I have a CSV which contains some newspaper fields like title, journal, and the content of the article.
I know how to train my model in order to get most similar words and their context.
However, I want to do a sentiment analysis on that. I found some resources for doing that, but in all the test or train dataframes in the examples, there is already a sentiment column (0 or 1). Do you know if it's possible to classify texts by sentiment automatically? I mean, assign 0 or 1 to each text. I searched but didn't find any references about that in the word2vec or doc2vec documentation...
Thanks in advance!
Both Word2Vec & Doc2Vec are just ways to turn words or lists-of-words into 'dense' vectors. Alone, they won't tell you sentiment.
When you have a text and want to deduce which categories it belongs to, that's called 'text classification'. Specifically, if you have just two categories (like 'positive-sentiment' vs 'negative-sentiment', or 'spam' vs 'not-spam'), that's called 'binary classification'.
The output of a Word2Vec or Doc2Vec model might be helpful in that task, but mainly as input to some other chosen 'classifier' algorithm. And, such algorithms require some 'labeled examples' of each kind of text - where you supply the right answer – in order to work. So, you will likely have to go through your corpus of newspaper articles & mark a bunch of them with the answer you want.
You should start by working through some examples that use scikit-learn, the most popular Python library with text-classification tools, even without any Word2Vec or Doc2Vec features, at first. For example, in its docs is an intro:
"Working With Text Data"
Only after you've set up some basic code using generic preprocess/feature-extraction/training/evaluation steps, and reviewed some actual results, should you then consider if adding some features based on Word2Vec or Doc2Vec might help.
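As a starting point, a bare-bones sketch in the spirit of that tutorial – it assumes you've hand-labeled some articles yourself, so texts is a list of raw article strings and labels a matching list of 0/1 sentiment marks:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)

clf = Pipeline([('tfidf', TfidfVectorizer()),      # generic bag-of-words features
                ('model', LogisticRegression())])  # binary classifier
clf.fit(X_train, y_train)
print('held-out accuracy:', clf.score(X_test, y_test))

Once such a baseline works and you can evaluate it, you could swap the TfidfVectorizer features for (or combine them with) Word2Vec/Doc2Vec-based features and compare.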
I have training data as two columns
1.'Sentences'
2.'Relevant_text' (text in this column is a subset of text in the column 'Sentences')
I tried training an RNN with LSTM directly, treating 'Sentences' as input and 'Relevant_text' as output, but the results were disappointing.
I want to know how to approach this type of problem? Does this kind of problem have a name? Which models should I explore?
If the target text is the subset of the input text, then, I believe, this problem can be solved as a tagging problem: make your neural network for each word predict whether it is "relevant" or not.
On the one hand, the problem of taking a text and selecting the subset that best reflects its meaning is called extractive summarization, and has lots of solutions, from the well-known unsupervised TextRank algorithm to complex BERT-based neural models.
On the other hand, technically your problem is just binary token-wise classification: you label each token (word or other symbol) of your input text as "relevant" or not, and train any neural network architecture that is good for tagging on this data. Specifically, I would look into architectures for POS tagging, because they are very well studied. Typically, it is a BiLSTM, maybe with a CRF head. More modern models are based on pretrained contextual word embeddings, such as BERT (maybe you won't even need to fine-tune them – just use them as a feature extractor, and add a BiLSTM on top). If you want a more lightweight model, you can consider a CNN over pretrained and fixed word embeddings.
One final parameter you should spend time tuning is the threshold for classifying a word as relevant – maybe the default of 0.5 is not the best choice. Maybe, instead of keeping all the tokens with probability-of-being-important higher than 0.5, you would like to keep the top k tokens, where k is fixed or is some percentage of the whole text.
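To make the tagging framing concrete, here is a rough Keras sketch (all sizes are placeholder assumptions; real choices depend on your data):

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

vocab_size, max_len = 20000, 100  # assumed corpus parameters

model = Sequential([
    Embedding(vocab_size, 128, input_length=max_len),
    Bidirectional(LSTM(64, return_sequences=True)),   # one output per token
    TimeDistributed(Dense(1, activation='sigmoid')),  # P(token is relevant)
])
model.compile(loss='binary_crossentropy', optimizer='adam')

# At prediction time, threshold the per-token probabilities, or keep top-k:
# probs = model.predict(x)[0, :, 0]
# top_k_indices = probs.argsort()[-k:]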
Of course, more specific recommendations would be dataset-specific, so if you could share your dataset, it would be a great help.
I'm currently learning gensim's Doc2Vec model in Python 3.6 to see similarity between sentences.
I created a model, but it returns KeyError: "word 'WORD' not in vocabulary" when I input a word that obviously exists in the training dataset to find a similar word/sentence.
Does it automatically skip some words not very important to define sentences? or is that simply a bug or something?
I'd appreciate any way to cover all the words appearing in the dataset. Thanks.
If a word you expected to be learned in the model isn't in the model, the most likely causes are:
it wasn't really there, in the version the model saw, perhaps because your tokenization/preprocessing is broken. Enable logging at INFO level, and examine your corpus as presented to the model, to ensure it's tokenized as intended
it wasn't part of the surviving vocabulary after the 1st vocabulary-survey of the corpus. The default min_count=5 discards words with fewer than 5 occurrences, as such words both fail to get good vectors for themselves, and effectively serve as 'noise' interfering with the improvement of other vectors.
You can set min_count=1 to retain all words, but it's more likely to hurt than help your overall vector quality. Word2Vec & Doc2Vec require large, varied corpuses – if you want a good vector for a word, find more diverse examples of its usage in an expanded corpus.
(Also note: one of the simple & fast Doc2Vec modes, that's also often a top-performer, especially on shorter texts, is plain PV-DBOW mode: dm=0. This mode will allocate/randomly-initialize word-vectors, but then ignores them for training, only training the doc-vectors. If you use that mode, you can still request word-vectors from the model at the end – but they'll just be random nonsense.)
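A quick sketch of both checks (documents is assumed to be your iterable of TaggedDocument objects; the API shown is gensim 3.x):

import logging
from gensim.models import Doc2Vec

logging.basicConfig(level=logging.INFO)  # surfaces the vocabulary-survey counts

model = Doc2Vec(documents, min_count=5)  # words seen fewer than 5 times are dropped

print('WORD' in model.wv.vocab)  # check whether the word survived the survey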
I am using gensim's doc2vec implementation and I have a few thousand documents tagged with four labels.
yield TaggedDocument(text_tokens, [labels])
I'm training a Doc2Vec model with a list of these TaggedDocuments. However, I'm not sure how to infer the tag for a document that was not seen during training. I see that there is an infer_vector method which returns the embedding vector. But how can I get the most likely label from that?
An idea would be to infer the vectors for every label that I have and then calculate the cosine similarity between these vectors and the vector for the new document I want to classify. Is this the way to go? If so, how can I get the vectors for each of my four labels?
The infer_vector() method will train up a doc-vector for a new text, which should be a list of tokens that were preprocessed just like the training texts.
And, as you've noted, model.docvecs['my_tag'] will get the pre-trained doc-vector for one of the tags that was known during training.
Checking the similarity of a new vector, against the vectors for all known-tags, is a reasonable baseline way to see what existing tags a new document is similar-to. The closest tag, or closest few tags, might be reasonable labels for an unknown document, as a sort of 'nearest-neighbor' approach.
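A sketch of that nearest-tag baseline (gensim 3.x API; new_tokens is the preprocessed token list of the unseen document):

inferred = model.infer_vector(new_tokens)

# Rank all known doc-tags by cosine similarity to the inferred vector
print(model.docvecs.most_similar(positive=[inferred], topn=4))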
But, note that the original/usual Doc2Vec approach is to give each document a unique ID, and let each ID-tag get its own vector. And then, perhaps, use those vectors with known-labels to train some other classifier that maps vectors to labels. (This might work better in some cases, if the "areas of the doc-vector space" that humans associate with a particular label aren't neat radiuses around a single centroid point for each label.)
Your approach of using, or adding, known-labels as doc-tags can often help. But also note that if you're only using 4 unique tags across thousands of documents, that's functionally very similar to just training the model with 4 giant documents – which may not be good at positioning those 4 vectors in a large-dimensional space (>4 dimensions), because there's not so much of the variety/subtle-contrasts that are needed to nudge the trained vectors into useful arrangements. (Typical published Doc2Vec work uses tens-of-thousands to millions of unique docs and doc-tags.)
I found the solution:
model.docvecs['my_tag']
gives me the vector for a given tag. Easy