Clustering text data based on sentiment? - python

I am scraping reviews off Amazon with the intent to perform sentiment analysis and classify them into positive, negative and neutral. The data I get is text and unlabeled.
My approach to this problem would be as follows:
1.) Label the data using clustering algorithms like DBSCAN, HDBSCAN or k-means. The number of clusters would obviously be 3.
2.) Train a classification algorithm on the labelled data.
Now, I have never performed clustering on text data, but I am familiar with the basics of clustering. So my questions are:
Is my approach correct?
Are there any articles/blogs/tutorials I can follow for text-based clustering, since I am fairly new to this?

I have never done such an experiment, but as far as I know, the most challenging part of this work is transforming the sentences or documents into fixed-length vectors (mapping them into a semantic space). I highly suggest using a sentiment analysis pipeline from the huggingface library for embedding the sentences (that way you might exploit some supervision). There are other options as well:
Using the sentence-transformers library (straightforward and still good).
Using BoW (the simplest way, but hard to get what you want).
Using TF-IDF (still simple, but it may well do the job).
After you reach this point (every review ==> fixed-length vector) you can apply whatever clustering you want and inspect the results.
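To make the embed-then-cluster idea concrete, here is a minimal sketch (not part of the original answer) using the sentence-transformers route; the model name and sample reviews are just placeholders.

# A minimal sketch, assuming the sentence-transformers and scikit-learn packages.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

reviews = [
    "Great battery life, totally worth the price.",
    "Stopped working after two days, very disappointed.",
    "It does the job, nothing special.",
]

# Map every review to a fixed-length vector (the "semantic space" step).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(reviews)

# Cluster the vectors into 3 groups.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)
print(labels)

Note that three clusters found this way will not necessarily line up with positive/negative/neutral; you still have to inspect each cluster to see what it actually captures.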

Related

Which clustering method is the standard way to go for text analytics?

Assume you have a lot of text sentences which may (or may not) have similarities. Now you want to cluster the similar sentences and find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated or better algorithms? The data structure is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do it. Unfortunately, using only TF-IDF won't be able to "detect" semantics or to project semantically similar texts near one another in the space. However, instead of TF-IDF, you can use word embeddings such as word2vec or GloVe - there is a lot of information about them on the net, just google it. Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model that treats each document as a mixture of a small number of topics, with each word's presence attributable to one of the document's topics (see the Wikipedia link). So, using a topic model, you can also do a kind of grouping and assign similar texts (with a similar topic) to groups. I recommend that you read about topic models, since they are more common for such text-clustering problems.
I hope my answer was helpful.
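As a concrete starting point, here is a minimal sketch of the K-means-with-TF-IDF baseline discussed above; the sample sentences and the choice of two clusters are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "The stock market dropped sharply today.",
    "Investors are worried about falling share prices.",
    "The new phone has an amazing camera.",
    "I love the camera on this phone.",
]

# Turn each sentence into a TF-IDF vector, then cluster.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
print(kmeans.fit_predict(X))          # cluster assignment per sentence
print(kmeans.cluster_centers_.shape)  # centroids in TF-IDF space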
In my view, you can use LDA (latent Dirichlet allocation). It is more flexible than other clustering techniques because its alpha and beta priors can adjust the contribution of each topic in a document and of each word in a topic. It can help if the documents are not of similar length or quality.
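A minimal LDA sketch with scikit-learn, where doc_topic_prior and topic_word_prior play the role of the alpha and beta priors mentioned above; the documents and parameter values are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "guitar drums jazz concert stage",
    "rap beats album lyrics studio",
    "goal match striker defender league",
]

counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,          # number of topics
    doc_topic_prior=0.5,     # alpha: per-document topic mixture prior
    topic_word_prior=0.01,   # beta: per-topic word distribution prior
    random_state=0,
)
doc_topics = lda.fit_transform(counts)   # each row: topic mixture of a document
print(doc_topics.argmax(axis=1))         # "cluster" = most probable topic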

NER(Named Entity Recognition) Similarity between sentences in documents

I have been using spaCy to find the NER of sentences. My problem is that I have to calculate the NER similarity between sentences of two different documents. Is there any formula or package available in Python for this?
TIA
I believe you are asking, how similar are two named entities?
This is not so trivial, since we have to define what "similar" means.
If we use a naive bag-of-words approach, two entities are more similar when more of their tokens are identical.
If we put the entity tokens into sets, the calculation would just be the Jaccard coefficient.
Sim(ent1, ent2) = |ent1 ∩ ent2| / |ent1 ∪ ent2|
Which in python would be:
ent1 = set(map(str, spacy_entity1))
ent2 = set(map(str, spacy_entity2))
similarity = len(ent1 & ent2) / len(ent1 | ent2)
where spacy_entity1 and spacy_entity2 are entities extracted by spaCy.
We then just create the entity sets ent1 and ent2 from the strings that represent their tokens.
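For completeness, here is a self-contained version of that snippet (not part of the original answer); the example sentences are placeholders and it assumes the small English model is installed (python -m spacy download en_core_web_sm) and that spaCy actually finds an entity in each sentence.

import spacy

nlp = spacy.load("en_core_web_sm")

# Take the first entity from each document (assumes at least one is found).
spacy_entity1 = nlp("She works at New York University.").ents[0]
spacy_entity2 = nlp("He visited the University of New York campus.").ents[0]

# Each entity becomes the set of its token strings.
ent1 = set(map(str, spacy_entity1))
ent2 = set(map(str, spacy_entity2))

# Jaccard coefficient over the token sets.
similarity = len(ent1 & ent2) / len(ent1 | ent2)
print(ent1, ent2, similarity)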
I've been tackling the same problem, and here's how I'm solving it.
Since you've given no context about your problem, I'll keep the solution as general as possible.
Links to tools:
spaCy: spaCy similarity
Flair NLP: Flair NLP zero-shot/few-shot
Roam Research for research articles and their approach: Roam Research home page
DISCLAIMER: THIS IS A TRIAL EXPERIMENT, NOT THE SOLUTION
Steps:
Try a large language model with a similarity coefficient (like GPT-3 or TARS) on your dataset.
(Instead of similarity, you could also use zero-shot/few-shot classification and then check the accuracy manually, or compute it if you have labelled data.) A rough sketch of the zero-shot step appears after this answer.
I then grab the verbs (assuming you have a corpus and not single-word inputs) and calculate their similarity, attaching the x-n and x+n words (excluding stopwords according to your domain), where x is the position of the verb.
This step mainly gives more context to the large language model so it is not biased by its own large training corpus (this is experimental).
Finally, I grab all the named entities and, hopefully, their labels (example - India: Country/Location) and ask the same LLM (large language model), after constructing its prompt, how many of the entities fall into the same buckets/categories (perhaps by comparing probabilities).
Even if you don't reproduce these steps, what you must understand is that these tools give you atomic information about your raw input; you have to apply mathematical functions to compare the pieces and build the algorithm.
In my case I'm averaging all three similarity indices from the steps above and ensuring that all classification is multi-label classification.
And to validate any of this: human pattern matching, maybe.
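As a rough illustration of the zero-shot step mentioned above, here is a sketch using Flair's TARS model; the sentence and candidate labels are placeholders, and the call names follow Flair's TARS tutorial, so they may differ between Flair versions.

from flair.models import TARSClassifier
from flair.data import Sentence

# Loading "tars-base" downloads a pretrained model the first time it runs.
tars = TARSClassifier.load("tars-base")

sentence = Sentence("India hosted the summit in New Delhi.")
tars.predict_zero_shot(sentence, ["Country", "Person", "Organization"])

# Each predicted label comes with a confidence score.
print(sentence.labels)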
You probably need http://uima.apache.org/d/uimacpp-2.4.0/docs/Python.html/ plus a CoNLL-U parser attached to it: https://universaldependencies.org/format.html. With this approach, NERs are matched against a dictionary in the UIMA pipeline. You need to develop some proprietary NER search/match algorithms (in Python or in another supported language).

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do it with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then label those clusters. But, as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on which techniques I should follow? I appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs, and the accuracy may vary, because of the well-known fact and observation:
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of the data and how you are extracting it; I am assuming you're just getting general tweets which can be about anything.
What you can do is generate a dictionary for each class you have
(e.g. Music => pop, jazz, rap, instruments, ...)
which will contain words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start by extracting:
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go and see these NLP lexical semantics slides; they will surely clear up some of the concepts.
Once you have dictionaries for each class, cross-compare them with the tweets you have. The tweet with the most similarity to a class (you can rank tweets by the number of occurrences of words from these dictionaries) gets labeled with that class. This will give you labeled tweets like the others; a rough sketch of this dictionary-based scoring follows.
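A rough sketch of this dictionary-building and scoring idea with NLTK's WordNet interface; the seed words, class names and the tweet are placeholders, and meronyms/holonyms are omitted for brevity.

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def expand(seed_words):
    # Collect synonyms, hyponyms and hypernyms for a set of seed words.
    vocab = set(seed_words)
    for word in seed_words:
        for syn in wn.synsets(word):
            vocab.update(lemma.name().lower() for lemma in syn.lemmas())
            for related in syn.hyponyms() + syn.hypernyms():
                vocab.update(lemma.name().lower() for lemma in related.lemmas())
    return vocab

class_dicts = {
    "music": expand(["music", "jazz", "pop", "instrument"]),
    "sports": expand(["sport", "football", "match", "player"]),
}

tweet = "what a match, that last minute goal was unreal"
tokens = set(tweet.lower().split())
scores = {label: len(tokens & vocab) for label, vocab in class_dicts.items()}
print(max(scores, key=scores.get), scores)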
Now the question is the accuracy! That depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label a set of tweets this way and use cosine similarity to identify the remaining tweets. This will help with the optimization part. But then again, it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage it.
Actually, this looks like a typical use case for semi-supervised learning. There are plenty of methods you could use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data, as @Shoaib suggested, but then you will have to come up with a heuristic for dealing with clusters with mixed labeling. Furthermore, obviously, solving an optimization problem not related to the task (labeling) will not be as good as actually using that knowledge.
You can use clustering for that task. For that, you have to label some examples for each class first. Then, using these labeled examples, you can easily identify the class of each cluster.
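A small sketch of that idea: cluster everything, then name each cluster by the few hand-labeled examples that land in it. The tweets, labels and cluster count are placeholders.

import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = [
    "great goal in the last minute of the match",
    "the striker scored twice tonight",
    "new study links sleep to heart health",
    "doctors recommend more exercise and less sugar",
]
# Only a couple of tweets are hand-labeled; "unknown" marks unlabeled ones.
labels = np.array(["sports", "unknown", "health", "unknown"], dtype=object)

X = TfidfVectorizer().fit_transform(tweets)
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Name each cluster by majority vote of the labeled tweets it contains.
for c in set(clusters):
    known = [labels[i] for i in range(len(tweets))
             if clusters[i] == c and labels[i] != "unknown"]
    name = Counter(known).most_common(1)[0][0] if known else "unlabeled"
    print(f"cluster {c} -> {name}")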

Python NLTK difference between a sentiment and an incident

Hi, I want to implement a system which can identify whether a given sentence is an incident or a sentiment.
I was going through Python NLTK and found out that there is a way to find the positivity or negativity of a sentence.
Found the ref link: ref link
I want to achieve something like:
"My new Phone is not as good as I expected" should be treated as a sentiment,
and "Camera of my phone is not working" should be considered an incident.
I had the idea of making my own clusters for training my system to detect this, but I am not getting a desired solution. Is there a built-in way to do that, or any idea on how to approach a solution for this?
Thanks in advance for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.
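A minimal sketch of that classifier setup with NLTK, assuming you have a small hand-labeled corpus; the sentences, labels and POS-tag features are placeholders, not a tuned feature set.

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

train = [
    ("My new phone is not as good as I expected", "sentiment"),
    ("The screen looks gorgeous", "sentiment"),
    ("Camera of my phone is not working", "incident"),
    ("The battery stopped charging yesterday", "incident"),
]

def features(sentence):
    # Use the POS tags present in the sentence as simple boolean features.
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    return {f"has({tag})": True for _, tag in tags}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)
print(classifier.classify(features("The charger is not working")))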

Document Clustering in python using SciKit

I recently started working on Document clustering using SciKit module in python. However I am having a hard time understanding the basics of document clustering.
What I know?
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then fed to the algorithm.
There are many algorithms like k-means, neural networks and hierarchical clustering to accomplish this.
My data:
I am experimenting with LinkedIn data; each document is a LinkedIn profile summary. I would like to see if similar job documents get clustered together.
Current challenges:
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there any proper way to handle this high-dimensional data?
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
I've never worked with document clustering before, so if you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.
I went through the code on the scikit-learn web page, but it contains too many technical terms that I do not understand. If you have any code with good explanations or comments, please share. Thanks in advance.
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there any proper way to handle this high-dimensional data?
My first suggestion is that you don't, unless you absolutely have to, due to memory or execution-time problems.
If you must handle it, you should use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
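As a small illustration of taming the dimensionality: plain PCA does not accept a sparse TF-IDF matrix, so this sketch uses TruncatedSVD (latent semantic analysis) as a closely related alternative; the documents and sizes are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

profiles = [
    "software engineer with python and machine learning experience",
    "data scientist focused on nlp and deep learning",
    "sales manager with a background in retail",
]

# Cap the vocabulary up front, then project to a low-dimensional space.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X = tfidf.fit_transform(profiles)

svd = TruncatedSVD(n_components=2, random_state=0)  # more like ~100 in practice
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)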
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
Another one that does not is hierarchical clustering, implemented in scipy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
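For the scipy hierarchical-clustering route mentioned above, here is a brief sketch where you cut the dendrogram by a distance threshold instead of fixing the number of clusters up front; the documents and the threshold are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "python developer, machine learning, pandas",
    "data engineer, spark, python",
    "account manager, sales, negotiation",
]

X = TfidfVectorizer().fit_transform(docs).toarray()  # dense array for scipy

Z = linkage(X, method="ward")                       # build the dendrogram
labels = fcluster(Z, t=1.0, criterion="distance")   # cut at a distance threshold
print(labels)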
I've never worked with document clustering before; if you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.
Scikit-learn has a lot of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans; others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents to a vector space model which is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search engine things etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.
For the large matrix after the TF-IDF transformation, consider using a sparse matrix.
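For what it's worth, scikit-learn's TfidfVectorizer already returns a scipy sparse matrix, so only the non-zero entries are stored; a tiny illustration (the texts are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

X = TfidfVectorizer().fit_transform(["some profile text", "another summary text"])
print(X.format)        # 'csr' - compressed sparse row storage
print(X.nnz, X.shape)  # stored non-zeros vs. full matrix size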
You could try different values of k. I am not an expert in unsupervised clustering algorithms, but I bet that with such algorithms and different parameters, you could also end up with a varied number of clusters.
This link might be useful; it provides a good amount of explanation for k-means clustering with a visual output: http://brandonrose.org/clustering
