Gensim topic modelling with suggested initial inputs? - python

I'm running an LDA topic model on a medium-sized corpus using gensim in Python.
We already know roughly some of the topics to expect. In particular, we know that one topic definitely exists within the corpus, and we want the model to find it for us so that we can extract the elements of the corpus that fall under that topic.
Is there a way of manually setting the initial conditions of one of your topics in gensim to give the model a shove in the 'right' direction?
The idea would be to take a handful of known examples of the target topic and set the probability of each word to its frequency within those known examples. Or something in the neighborhood of that idea.
Thanks in advance for your help!

As LDA is traditionally an unsupervised method, it's more common to let it tell you what topics it finds by its rules, then see which (if any) of those match your preconceptions.
Gensim has no way to pre-seed an LDA model/session with biases towards finding/defining certain topics.
You might use your conception of a topic that "should" exist, or of certain documents that "should" be grouped together, to tune your choice of other parameters so that the final results best meet that goal, or to postprocess the LDA results with labeling/combination to match your desired groupings.
But also, if one topic is of preeminent importance, or has your best set of labeled training examples, you may want to consider training a binary classifier to predict whether documents are in that topic or not. Or, as your set of preferred topics with labeled examples grows, a multi-label classifier to assign documents to topics.
Classifiers are the more appropriate tool when you want a system to deduce known categories, though of course hybrid approaches can also be useful. For example, LDA runs may help suggest new categories, and the outputs of an LDA run could be added as features to assist downstream supervised classifiers. Or documents decorated with extra tokens from supervised classification could be analyzed by downstream LDA.
(In fact, simply decorating documents that are in a desired known category with an extra synthetic token representing that category might be an interesting way to bias an LDA toward reflecting those categories, but you'd want a rigorous evaluation process for deciding whether such a hack actually improves your true end goals.)
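A minimal sketch of that decoration hack, assuming gensim's standard LdaModel API; the toy documents and the '_TOPIC_target' synthetic token are hypothetical:
from gensim import corpora
from gensim.models import LdaModel

# Toy documents; '_TOPIC_target' is a made-up synthetic token added only to
# documents already known to belong to the desired category.
docs = [
    ['engine', 'wheel', 'mileage', '_TOPIC_target'],
    ['engine', 'trunk', 'spacious', '_TOPIC_target'],
    ['election', 'policy', 'vote'],
    ['market', 'economy', 'policy'],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# The synthetic token's co-occurrences nudge one topic toward the known
# category; whether that helps must be checked against your real end goals.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)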

Related

NLP data preparation and sorting for text-classification task

I've read a lot of tutorials on the web and topics on Stack Overflow, but one question is still foggy to me. Considering just the stage of collecting data for multi-label training, which of the following ways is better, and are both of them acceptable and effective?
Try to find 'pure' one-labeled examples at any cost.
Allow every example to be multi-labeled.
For instance, I have articles about war, politics, economics, and culture. Usually politics is tied to economics, war is connected to politics, economics issues may appear in culture articles, etc. I can assign strictly one main theme to each example and drop the uncertain ones, or assign 2-3 topics.
I'm going to train using spaCy; the volume of data will be about 5-10 thousand examples per topic.
I'd be grateful for any explanation and/or a link to some relevant discussion.
You can try the OneVsAll / OneVsRest strategy. It lets you have it both ways: predict exactly one category, or several, without having to strictly assign a single label (see the sketch after the quoted docs below). From the scikit-learn documentation:
Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and one classifier only, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.
This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
Link to docs:
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
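A minimal sketch of that strategy in scikit-learn, on a hypothetical toy dataset (a spaCy pipeline could supply features instead of TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy multi-label articles; real training would use thousands per topic.
texts = [
    'parliament passed the defense budget',
    'stocks fell after the election results',
    'the festival opened with a film about the war',
]
labels = [['politics', 'war'], ['economics', 'politics'], ['culture', 'war']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # indicator matrix: cell [i, j] = 1 if sample i has label j

clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, Y)

pred = clf.predict(['a new film about the war economy'])
print(mlb.inverse_transform(pred))  # zero, one, or several labels per document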

Gensim pretrained model similarity

Problem:
I'm using the GloVe pre-trained vectors to retrain my model on a specific domain, say #cars. After training, I want to find similar words within my domain, but I get words that aren't in my domain corpus; I believe they come from GloVe's vectors.
model_2.most_similar(positive=['spacious'], topn=10)
[('bedrooms', 0.6275501251220703),
('roomy', 0.6149100065231323),
('luxurious', 0.6105825901031494),
('rooms', 0.5935696363449097),
('furnished', 0.5897485613822937),
('cramped', 0.5892841219902039),
('courtyard', 0.5721820592880249),
('bathrooms', 0.5618442893028259),
('opulent', 0.5592212677001953),
('expansive', 0.555268406867981)]
Here I expect something like 'leg-room', or other spacious-car features mentioned in the domain's corpus. How can we exclude the GloVe-only words while still getting similar vectors?
Thanks
There may not be enough info in a simple set of generic word-vectors to filter neighbors by domain-of-use.
You could try using a mixed-weighting: combine the similarities to 'spacious', and to 'cars', and return the top results in that combination – and it might help a little.
Supplying more than one positive word to the most_similar() method might approximate this. If you're sure of some major sources of interference/overlap, you might even be able to use negative word examples, similar to how word2vec finds candidate answers for analogies (though this might also suppress useful results that are legitimately related to both domains, like 'roomy'). For example:
candidates = vec_model.most_similar(positive=['spacious', 'car'],
                                    negative=['house'])
(Instead of using single words like 'car' or 'house' you could also try using vectors combined from many words that define a domain.)
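For instance, a rough sketch of that combination idea, averaging a few hand-picked in-domain words into one domain vector (gensim's most_similar accepts raw vectors alongside words; the word choices are illustrative):
import numpy as np

# vec_model: the same word-vector model as in the snippet above.
domain_words = ['car', 'engine', 'sedan', 'dashboard']
domain_vec = np.mean([vec_model[w] for w in domain_words], axis=0)

# Mix the target word with the whole-domain vector when ranking neighbors.
candidates = vec_model.most_similar(positive=['spacious', domain_vec], topn=10)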
But a sharp distinction sounds like a research project rather than something easily possible with off-the-shelf libraries/vectors – and may require more sophisticated approaches and datasets.
You could also try using a set of vectors trained only on a dataset of text from the domain of interest – thus ensuring the vocabulary, and senses, of words are all in that domain.
You cannot exclude words from an already-trained model. I don't know which framework you're working in, but I'll give an example in Keras, as it makes the intention simple to understand.
What you can do is use an Embedding layer, populate it with GloVe "knowledge", and then resume training on your corpus so that the layer learns the words and fits them to your specific domain. You can read more about it in the Keras blog.
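A minimal Keras sketch of that approach; the sizes are placeholders, and the random matrix stands in for the real GloVe weights you'd assemble from the downloaded vector file:
import numpy as np
import tensorflow as tf

vocab_size, dim = 10000, 100  # hypothetical vocabulary and vector sizes
# Placeholder; in practice fill each row with the GloVe vector for that word.
embedding_matrix = np.random.rand(vocab_size, dim).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=True),  # trainable=True lets domain training adjust the GloVe weights
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit(domain_token_ids, domain_labels, epochs=5)  # your #cars corpus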

How find the most decisive sentences or words in a document via Doc2Vec?

I've trained a Doc2Vec model in order to do a simple binary classification task, but I would also love to see which words or sentences weigh more in terms of contributing to the meaning of a given text. So far I've had no luck finding anything relevant or helpful. Any ideas how I could implement this feature? Should I switch from Doc2Vec to more conventional methods like tf-idf?
You are asking about model interpretability. Some ways I have seen this explored:
Depending on your classifier, the parameters of the model may tell you what it is looking at. For example, in attention-based models, what the model attends to is telling.
Tools like LIME and Anchor are useful for any black-box model, and will probably work in this case. The documentation for both shows how to use them with text data.
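For example, a minimal LIME sketch; the TF-IDF classifier here is a stand-in for whatever model sits on top of your Doc2Vec vectors (LIME only needs a function mapping raw texts to class probabilities):
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in binary classifier on toy data; substitute your own model, as long
# as it can return probabilities for a list of raw texts.
texts = ['good movie great plot', 'bad movie awful plot',
         'great acting good story', 'awful acting bad story']
labels = [1, 0, 1, 0]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

explainer = LimeTextExplainer(class_names=['negative', 'positive'])
exp = explainer.explain_instance('good plot but awful acting',
                                 clf.predict_proba, num_features=5)
print(exp.as_list())  # words paired with their weight toward the prediction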

Looking to cluster short descriptions of reports. Should I use Word2Vec or Doc2Vec

So, I have close to 2000 reports and each report has an associated short description of the problem. My goal is to cluster all of these so that we can find distinct trends within these reports.
One of the features I'd like to use is some sort of contextual text vector. I've used Word2Vec and think it would be a good option, but I also saw Doc2Vec and I'm not quite sure which would be better for this use case.
Any feedback would be greatly appreciated.
They're very similar, so just as you'd rigorously tune a single approach's parameters to improve results, you should try them both and compare the results.
Your dataset sounds tiny compared to what either needs to induce good vectors – Word2Vec is best trained on corpuses of many millions to billions of words, while Doc2Vec's published results rely on tens-of-thousands to millions of documents.
If composing some summary-vector-of-the-document from word-vectors, you could potentially leverage word-vectors that are reused from elsewhere, but that will work best if the vectors' original training corpus is similar in vocabulary/domain-language-usage to your corpus. For example, don't expect vectors trained on formal news writing to work well with, or even cover the same vocabulary as, informal tweets, or vice-versa.
If you had a larger similar-text corpus of documents to train a Doc2Vec model, you could potentially train a good model on the full set of documents, but then just use your small subset, or re-infer vectors for your small subset, and get better results than a model that was only trained on your subset.
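A minimal gensim sketch of that train-large, use-small idea; the larger corpus here is a hypothetical stand-in:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical larger corpus of similar short texts; your ~2000 report
# descriptions would be a subset, or get their vectors inferred afterwards.
big_corpus = [['server', 'crashed', 'after', 'nightly', 'update'],
              ['scheduled', 'maintenance', 'caused', 'brief', 'outage']] * 1000

tagged = [TaggedDocument(words, [i]) for i, words in enumerate(big_corpus)]
model = Doc2Vec(tagged, vector_size=50, epochs=20, min_count=2)

# Re-infer a vector for one of your own short descriptions, then cluster those.
vec = model.infer_vector(['server', 'outage', 'after', 'maintenance'])
print(vec[:5])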
Strictly for clustering, and with your current small corpus of short texts, if you have good word-vectors from elsewhere, it may be worth looking at the "Word Mover's Distance" method of calculating pairwise document-to-document similarity. It can be expensive to calculate on larger docs and large document-sets, but might support clustering well.
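A hedged sketch of pairwise WMD via gensim's KeyedVectors (the vector-file path is hypothetical, and wmdistance needs an extra optimal-transport dependency installed):
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors downloaded separately.
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                       binary=True)

doc1 = 'server crashed after the nightly update'.split()
doc2 = 'system failure following scheduled maintenance'.split()

# Lower distance means more similar. It is costly per pair, so for ~2000
# short texts precompute the pairwise matrix once and feed it to a clusterer.
print(wv.wmdistance(doc1, doc2))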

NLTK - Multi-labeled Classification

I am using NLTK to classify documents, each having one label, with there being 10 types of documents.
For text extraction I clean the text (punctuation removal, HTML-tag removal, lowercasing) and remove nltk.corpus.stopwords as well as my own collection of stopwords.
For my document feature I look across all 50k documents and gather the top 2k words by frequency (frequency_words), then for each document identify which words in the document are also in the global frequency_words.
I then pass each document in as a hashmap of {word: boolean} to nltk.NaiveBayesClassifier(...). I have a 20:80 test-to-training ratio with regard to the total number of documents.
The issues I am having:
Is this classifier from NLTK suitable for multi-labelled data? All the examples I have seen are about 2-class classification, such as whether something is declared positive or negative.
The documents should contain a set of key skills; unfortunately I haven't got a corpus identifying where those skills lie. So I have taken this approach on the understanding that a raw word count per document would not be a good feature extractor - is this correct? Each document has been written by a different individual, so I need to leave room for individual variation between documents. I am aware of scikit-learn's multinomial Naive Bayes, which deals with word counts.
Is there an alternative library I should be using, or variation of this algorithm?
Thanks!
Terminology: documents are to be classified into 10 different classes, which makes this a multi-class classification problem. If, along with that, you want to assign multiple labels per document, you can call it multi-class multi-label classification.
For the issues you are facing:
nltk.NaiveBayesClassifier() is an out-of-the-box multi-class classifier, so yes, you can use it to solve this problem. As for multi-labelled data: if your labels are a,b,c,d,e,f,g,h,i,j, then you would encode label 'b' of a particular document as '0,1,0,0,0,0,0,0,0,0'.
Feature extraction is the hardest part of classification (machine learning). I recommend looking into different algorithms to understand them and to select the one that best suits your data (without looking at your data, it is tough to recommend a particular algorithm/implementation).
There are many different libraries out there for classification. I personally used scikit-learn, and I can say it was a good out-of-the-box classifier.
Note: using scikit-learn, I was able to achieve results within a week, even though the data set was huge and there were other setbacks.
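A minimal sketch of the NLTK flow described in the question, with hypothetical toy feature dicts:
import nltk

# {word: boolean} feature dicts as described above; toy data.
train_set = [
    ({'python': True, 'sql': True}, 'developer'),
    ({'nursing': True, 'patient': True}, 'healthcare'),
    ({'python': True, 'statistics': True}, 'analyst'),
]
test_set = [({'sql': True, 'python': True}, 'developer')]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)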
