I am trying to create a recommendation system for similar articles. I have a list of reference articles, and I want the new articles that I acquire from a certain API to be similar to those reference articles.
One way I could do this is to merge all of those reference articles into one big article, run cosine similarity against it, and get a list of articles that are similar to the merged reference article. Is there any other way I could implement cosine similarity?
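For reference, here is a minimal sketch of the merged-reference approach I have in mind, assuming scikit-learn and plain-text articles (the sample texts and variable names are placeholders):

```python
# Rough sketch of the merged-reference idea; article texts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference_articles = ["reference article text 1", "reference article text 2"]
new_articles = ["new article text A", "new article text B"]

vectorizer = TfidfVectorizer(stop_words="english")
# Fit the vocabulary on all texts so both sets share one vector space.
vectorizer.fit(reference_articles + new_articles)

merged_reference = " ".join(reference_articles)
ref_vec = vectorizer.transform([merged_reference])
new_vecs = vectorizer.transform(new_articles)

scores = cosine_similarity(new_vecs, ref_vec).ravel()
ranking = scores.argsort()[::-1]  # indices of the most similar new articles first
```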
Thanks
Have you looked into an NLP technique called Topic Modelling?
The following notebook from priya-dwivedi uses LDA (Latent Dirichlet Allocation):
https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb
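For example, a minimal gensim LDA sketch along the lines of that notebook (the token lists and topic count are placeholders):

```python
# Minimal gensim LDA sketch; documents are assumed to be lists of tokens.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["game", "performance", "update", "lag"],
    ["price", "subscription", "refund", "payment"],
    ["interface", "design", "feature", "usability"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```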
Related
Assume you have a lot of text sentences which may or may not be similar to one another. Now you want to cluster the similar sentences and find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated or better-suited algorithms? The data is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do it. Unfortunately, TF-IDF alone won't be able to "detect" semantics or project semantically similar texts near one another in the vector space. Instead of TF-IDF you can use word embeddings, such as word2vec or GloVe; there is a lot of information about them online.

Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model that treats each document as a mixture of a small number of topics, where each word's presence is attributable to one of the document's topics (see the Wikipedia link). So, using a topic model, you can also do this kind of grouping and assign similar texts (texts with similar topics) to groups. I recommend reading about topic models, since they are the more common choice for this kind of text-clustering problem.
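As a starting point, here is a rough K-means-over-TF-IDF sketch with scikit-learn (the sentences and cluster count are placeholders):

```python
# Rough sketch: cluster sentences by K-means over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "the delivery was late",
    "shipping took too long",
    "great customer support",
    "support answered quickly",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

print(km.labels_)           # cluster assignment per sentence
print(km.cluster_centers_)  # centroids in TF-IDF space
```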
I hope my answer was helpful.
In my view, you can use LDA (latent Dirichlet allocation). It is more flexible than other clustering techniques because its alpha and beta priors can adjust the contribution of each topic in a document and of each word in a topic. That can help if the documents are not of similar length or quality.
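As a hedged sketch, this is how those priors are exposed in gensim's LdaModel, where eta plays the role of beta (the tiny corpus below is a placeholder):

```python
# Sketch only: alpha is the per-document topic prior, eta corresponds to the
# "beta" word-in-topic prior; "auto" lets gensim learn them from the data.
from gensim import corpora
from gensim.models import LdaModel

docs = [["price", "too", "high"], ["crash", "after", "update"], ["nice", "clean", "design"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    alpha="auto",  # learn an asymmetric document-topic prior
    eta="auto",    # learn the topic-word prior
    passes=20,
)
```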
I'm new to LDA and doing some experiments with Python + LDA on some sample datasets.
I have already got some very interesting results, and now a question came up that I couldn't find an answer to so far.
Since I worked with customer reviews/ratings of a certain app, the documents contain different topics (e.g. one review talks about the app's performance, price, and functionality). So, as I understand it, I have three topics within one document.
My question: Is LDA capable to assign more than one topic to one document?
Thank you for your answer!
Yes, that is the whole idea of LDA: it searches for a topic distribution inside each document. So it will assign several topics to a document (unless the document is too short).
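For example, with a trained gensim LdaModel (assuming `lda` and `dictionary` come from the usual Dictionary / doc2bow / LdaModel pipeline, as in the sketches above), each document comes back with a distribution over several topics:

```python
# Illustration only: a single review touching on stability and price gets
# weight on more than one topic rather than a single label.
bow = dictionary.doc2bow(["app", "crashes", "often", "and", "price", "is", "high"])
topic_mixture = lda.get_document_topics(bow, minimum_probability=0.05)
print(topic_mixture)  # e.g. [(0, 0.61), (2, 0.33)] -> two topics in one document
```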
Currently I am working on a project to cluster 2 million text memos. My objective is to create a standard for these memos (when I say memo, I mean a text containing the description of something). To do so, I first want to cluster similar memos (gathering those that probably have the same meaning) and then create a label for each cluster or group.
Since I am new to NLP, I want to know how to proceed and what references/materials and similar previous projects exist.
I bet this is a classic problem in NLP and that many projects have been done on the subject.
I can work with R and Python
Finding the hidden topics in unstructured data like text, such that they accurately represent the documents, is called Topic Modelling.
Gensim is a great library with which you can find memos that share a topic. It has LSA and LDA algorithms implemented in Python. The difference between LSA and LDA lies in their implementation: LSA in gensim is an online learning algorithm, which means that if the nature of the data changes, it will reorient itself.
topicmodels is the R package that implements LDA. Here is a quick tutorial on LDA.
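In Python, a hedged sketch of the gensim route for grouping similar memos (the memos and the number of topics are placeholders; real memos would need proper cleaning and tokenisation):

```python
# Sketch: group memos by topic with gensim's LSI model plus a similarity index.
from gensim import corpora, models, similarities

memos = [
    "invoice not received for order",
    "missing invoice for last order",
    "password reset does not work",
]
tokenized = [m.lower().split() for m in memos]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(t) for t in tokenized]

tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]], num_features=2)

# Similarity of the first memo to every memo in LSI space.
print(list(index[lsi[tfidf[corpus[0]]]]))
```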
Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a bunch of documents. I'm using the Python gensim package and having two problems:
I printed out the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat", meaning even the most frequent word has only about 1% probability...
Most of the topics are similar, meaning the most frequent words of the topics overlap a lot and the topics share almost the same set of high-frequency words...
I guess the problem is probably due to my documents: they all belong to a specific category; for example, they are all documents introducing different online games. Will LDA still work in my case? Since the documents themselves are quite similar, a model based on "bag of words" may not be a good approach to try.
Could anyone give me some suggestions? Thank you!
I've found NMF to perform better when a corpus is smaller and more focused around a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers...
"Another advantage that is particularly useful for the appli- cation
presented in this paper is that NMF is capable of identifying niche
topics that tend to be under-reported in traditional LDA approaches" (p.6)
Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF
Unfortunately, Gensim doesn't have an implementation of NMF, but it is in Scikit-Learn. To work effectively, you need to feed NMF TF-IDF weighted word vectors rather than the raw frequency counts you use with LDA.
If you're used to Gensim and have preprocessed everything that way, gensim has some utilities to convert a corpus to Scikit-compatible structures. However, I think it would actually be simpler to just use Scikit-Learn throughout. There is a good example of using NMF here.
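A small sketch of that Scikit-Learn NMF route, with TF-IDF weighting as described above (the documents and the number of topics are placeholders):

```python
# Sketch: NMF topics from TF-IDF weighted vectors with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the game lags after the latest update",
    "servers are slow and matches keep dropping",
    "the new skin prices are far too high",
    "microtransactions ruin the in-game economy",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(tfidf)  # document-topic weights
# get_feature_names_out() in recent scikit-learn (get_feature_names() in older versions)
terms = vectorizer.get_feature_names_out()

for topic_idx, component in enumerate(nmf.components_):
    top_terms = [terms[i] for i in component.argsort()[-5:][::-1]]
    print(topic_idx, top_terms)
```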
I have a set of URLs retrieved for a person. I want to try to classify each URL as being about that person (his/her LinkedIn profile, blog, or a news article mentioning the person) or not about that person.
I am trying a rudimentary approach where I tokenize each webpage, compare it to all the others to see how many words (excluding stop words) each pair of documents has in common, and then take the most similar webpages to be positive matches.
I am wondering whether there is a machine learning approach I can take that would make this task easier and more accurate. Essentially, I want to compare the content (tokenized into words) of two webpages and determine a score for how similar they are.
If you are familiar with Python, this NLTK classifier module should help you greatly:
http://www.nltk.org/api/nltk.classify.html#module-nltk.classify
For unsupervised clustering you can use this:
http://www.nltk.org/api/nltk.cluster.html#module-nltk.cluster
If you are simply looking for similarity scores then the metrics module should be useful:
http://www.nltk.org/api/nltk.metrics.html#module-nltk.metrics
NLTK (the Natural Language Toolkit) has the answer; just browse through its modules to find what you want, and don't implement it by hand.
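For the similarity-score route, a tiny illustration with nltk.metrics (Jaccard distance over token sets; the token sets are placeholders and stop-word removal is assumed to have been done already):

```python
# Illustration: Jaccard similarity between the token sets of two webpages.
from nltk.metrics.distance import jaccard_distance

page_a = {"python", "developer", "linkedin", "profile", "london"}
page_b = {"python", "developer", "blog", "machine", "learning"}

similarity = 1 - jaccard_distance(page_a, page_b)
print(similarity)  # fraction of shared tokens relative to the union
```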