How to apply nltk to categorize questions - python

I have a list of questions in a text file, extracted from an online website. I am new to NLTK (in Python) and am going through the initial chapters of the NLTK book (http://shop.oreilly.com/product/9780596516499.do). Can anybody help me categorize my questions under different headings?
I don't know the headings of the questions in advance. So how do I create headings and then categorize the questions under them?

Your task consists of document clustering, where each question is a document, and cluster labeling, where each label designates a topic.
Note that if your questions are short and/or hard to separate, e.g. they belong to similar categories, the resulting quality won't be very high.
Take a look at this simple recipe for document clustering, and at the related questions first and second.
As a baseline for labels, try the max tf-idf words taken from each cluster's words or from the cluster centroids, as in the sketch below.
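Here is a minimal sketch of that pipeline using scikit-learn; the questions, cluster count, and number of top terms are all placeholders you would tune for your data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical questions; in practice, read them from your text file.
questions = [
    "How do I read a file in Python?",
    "What is the best way to open a file?",
    "How do I train a neural network?",
    "Which optimizer should I use for deep learning?",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(questions)

km = KMeans(n_clusters=2, random_state=0, n_init=10)
km.fit(X)

# Baseline labels: the highest-weighted tf-idf terms in each cluster centroid.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(f"cluster {i}: {[terms[t] for t in top]}")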

Related

How to compare similarities between paragraphs NLP

I've been experimenting with NLP, using the Doc2Vec model.
My objective is a suggested-question feature for a forum. For example, if a user types a question, its vector is compared to the vectors of questions already asked. So far this has worked OK for comparing one question to another asked question.
However, I would like to extend this to comparing the body of the question. For example, just like on Stack Overflow, right now I'm writing the description of my question.
I understand that doc2vec represents sentences through paragraph ids. So for the question example I described first, each sentence would get a unique paragraph id. However, with paragraphs, i.e. the body of the question, sentences would share an id with the other sentences that are part of the same paragraph.
para = 'This is a sentence. This is another sentence'
[['This','is','a','sentence',tag=[1]], ['This','is','another','sentence',tag=[1]]]
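I believe in Gensim's actual API that pseudocode would correspond to TaggedDocument objects, with both sentences sharing the same tag:

from gensim.models.doc2vec import TaggedDocument

# Both sentences of the paragraph carry the same tag, as described above.
para = [
    TaggedDocument(words=['This', 'is', 'a', 'sentence'], tags=[1]),
    TaggedDocument(words=['This', 'is', 'another', 'sentence'], tags=[1]),
]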
I'm wondering how to go about doing this. How can I input a corpus like so:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
Here I would expect a very high similarity between the two. Is similarity calculated sentence to sentence, rather than paragraph to paragraph? I.e.
cosine_sim(['It is a nice day today'], ['It is a lovely day today.'])
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is this: using the above sentences, say the vectors are like so
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
# Average out these vectors
para = [0.5,0.2,0.3...n]
and then use this vector to compare to another paragraph produced by the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library, based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternative ways to turn a text into a vector, and sometimes other approaches, including the very simple one of averaging word-vectors together, also get called 'Doc2Vec'.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same way as the training texts were, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
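For example, a minimal comparison with a trained model (the tokenization here is illustrative; use the same preprocessing as in training):

import numpy as np

vec_a = d2v_model.infer_vector("it is a nice day today".split())
vec_b = d2v_model.infer_vector("it is a lovely day today".split())

# Cosine similarity between the two inferred document vectors.
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))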
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.
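As a rough end-to-end sketch of the question-plus-body approach (the posts, parameters, and preprocessing here are placeholder assumptions, not a tuned recipe):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical (question, body) pairs keyed by post id.
posts = {
    "q1": ("How do I compare paragraphs?",
           "I want to measure similarity between two blocks of text."),
    "q2": ("What is Doc2Vec?",
           "Looking for an explanation of the Paragraph Vector algorithm."),
}

# One pseudotext per post: question and body run together into one text.
corpus = [
    TaggedDocument(words=(q + " " + body).lower().split(), tags=[post_id])
    for post_id, (q, body) in posts.items()
]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Find the stored posts most similar to a new question.
new_vec = model.infer_vector("how to compare two paragraphs".lower().split())
print(model.dv.most_similar([new_vec], topn=2))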

Documents in training data belong to a particular topic in LDA

I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all the related questions asked earlier.
So if anyone has any suggestions on how to do this, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic to which my new document belongs, but how will I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has suggestions for algorithms other than LDA, please tell me.
You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in close proximity in your documents occur in close proximity in the vector space. You can use pre-trained word embeddings; learn more about word2vec here. Doc2vec extends this idea to map each document to a vector of dimension N. After this, you can use any distance measure to find the documents most similar to an input document.
Word Mover's Distance - this is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find out more about it here.
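For instance, a small sketch with Gensim and pre-trained embeddings (note that wmdistance additionally needs the POT package, or pyemd in older Gensim versions, installed):

import gensim.downloader as api

# Pre-trained GloVe vectors; any word-embedding KeyedVectors will do.
wv = api.load("glove-wiki-gigaword-50")

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

# Lower distance means more similar documents.
print(wv.wmdistance(doc1, doc2))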
Also, make sure you apply appropriate text cleaning before applying the algorithms: steps like case normalization, stopword removal, and punctuation removal. What's needed really depends on your dataset. Find out more here.
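A minimal cleaning step with NLTK might look like this (requires nltk.download('punkt') and nltk.download('stopwords')):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))

def clean(text):
    # Case normalization, then stopword and punctuation removal.
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t not in STOP and t not in string.punctuation]

print(clean("This is a Random comment, with SOME noise!"))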
I hope this was helpful...

Which clustering method is the standard way to go for text analytics?

Assume you have a lot of text sentences which may (or may not) be similar to one another. Now you want to cluster the similar sentences and find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated or better algorithms? The data is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do it. Unfortunately, tf-idf alone won't be able to "detect" semantics or project semantically similar texts near one another in the vector space. Instead of tf-idf, however, you can use word embeddings, such as word2vec or GloVe - there is a lot of information about them on the net, just google it. A clustering sketch using averaged embeddings follows below.

Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model: it treats each document as a mixture of a small number of topics, with each word's presence attributable to one of the document's topics (see the Wikipedia link). So, using a topic model, you can also do some kind of grouping and assign similar texts (with a similar topic) to groups. I recommend reading about topic models, since they are common for problems connected with text clustering.
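Here is the sketch of embedding-based clustering (pre-trained GloVe vectors via Gensim's downloader, averaged per sentence; the sentences and cluster count are illustrative):

import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("glove-wiki-gigaword-50")

def sentence_vector(sentence):
    # Average the vectors of in-vocabulary words; a crude but useful baseline.
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

sentences = [
    "the cat sat on the mat",
    "a kitten rests on the rug",
    "stocks fell sharply on monday",
    "the market dropped at the start of the week",
]
X = np.vstack([sentence_vector(s) for s in sentences])
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels)  # semantically similar sentences should share a label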
I hope my answer was helpful.
In my view, you can use LDA (latent Dirichlet allocation). It is more flexible than other clustering techniques because its alpha and beta vectors can adjust the contribution of each topic in a document and of each word in a topic. This can help if the documents are not of similar length or quality.
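As a small illustration of those priors in Gensim (toy data; 'auto' lets Gensim learn asymmetric alpha/eta priors from the corpus itself):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["price", "cheap", "value"],
        ["crash", "bug", "slow"],
        ["price", "bug", "update"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# alpha/eta are the per-document-topic and per-topic-word priors.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", random_state=0)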

LDA: Assign more than one topic to a document

I'm new to LDA and am doing some experiments with Python + LDA and some sample datasets.
I have already got some very interesting results, and now I have a question that I couldn't find an answer to so far.
Since I worked with customer reviews/ratings of a certain app, the documents contain different topics (e.g. one review talks about the app's performance, price, and functionality). So to my understanding there are three topics within one document.
My question: is LDA capable of assigning more than one topic to one document?
Thank you for your answer!
Yes, this is the whole idea of LDA: it searches for a topic distribution inside each document. So it will assign several topics to a document (unless the document is too short).
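You can see this directly in Gensim (toy documents mixing 'performance' and 'price' vocabulary; get_document_topics returns several (topic, probability) pairs per document):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["app", "slow", "crash", "lag"],
        ["price", "expensive", "subscription"],
        ["slow", "app", "price", "expensive"]]  # mixes both topics
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

for bow in corpus:
    # Each document gets a distribution over topics, not a single topic.
    print(lda.get_document_topics(bow, minimum_probability=0.1))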

Feature extraction

Suppose I have been given a data set with the headers:
id, query, product_title, product_description, brand, color, relevance.
Only id and relevance are in numeric format, while all the others consist of words and numbers. Relevance is the relevancy or ranking of a product with respect to a given query. For example: query = "abc" and product_title = "product_x" --> relevance = "2.3".
In the training set all these fields are filled, but in the test set relevance is not given, and I have to predict it using some machine learning algorithm. I am having trouble determining which features I should use for such a problem. For example, I could use TF-IDF here. What other features can I obtain from such data sets?
Moreover, if you can recommend any books/resources specifically on the topic of feature extraction, that would be great. I always struggle with this phase. Thanks in advance.
I think there is no book that will give you the answers you need, as feature extraction is the phase that relates most directly to the problem being solved and the data at hand; the only general tip you will find is to create features that describe the data you have. In the past I worked on a problem similar to yours, and some features I used were:
Number of query words in product title.
Number of query words in product description.
n-gram counts
tf-idf
Cosine similarity
All of this after some preprocessing, like converting all text to upper (or lower) case, stemming, and standard dictionary normalization. A rough sketch of a few of these features follows below.
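For example (a hypothetical row from a dataset like the one described; word-overlap counts plus a tf-idf cosine via scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def overlap(query, text):
    # Number of distinct query words that appear in the other field.
    return len(set(query.lower().split()) & set(text.lower().split()))

query = "red running shoes"
title = "Brand X red shoes for running"
description = "Lightweight shoes in red, ideal for running and jogging."

tfidf = TfidfVectorizer().fit([query, title, description])
q_vec, t_vec = tfidf.transform([query]), tfidf.transform([title])

features = {
    "query_words_in_title": overlap(query, title),
    "query_words_in_description": overlap(query, description),
    "query_title_cosine": cosine_similarity(q_vec, t_vec)[0, 0],
}
print(features)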
Again, this depends on the problem and the data, and you will not find a direct answer; it's like posting the question: "I need to develop a product-selling system, how do I do it? Is there any book?" You will find books on programming and software engineering, but you will not find a book on developing your specific system; you'll have to use general knowledge and creativity to craft your solution.
