Python NLTK - Identifying text interest / topic

I am attempting to build a model that will attempt to identify the interest category / topic of supplied text. For example:
"Enjoyed playing a game of football earlier."
would resolve to a top level category like:
"Sport".
I'm not sure what the correct terminology is for what I am trying to achieve here so Google hasn't turned up any libraries that may be able to help. With that in mind, my approach would be something like:
Extract features from text. Use tagging to classify each feature / identify names / places. I would probably use NLTK for this, or Topia.
Run a Naive Bayes classifier for each interest category ("Sport", "Video Games", "Politics" etc.) and get a relevancy % for each category.
Identify which category has the highest relevancy % and categorise the text accordingly.
My approach would likely involve having individual corpora for each interest category and I'm sure the accuracy would be fairly miserable - I understand it will never be that accurate.
I'm generally looking for some advice on the viability of what I am trying to accomplish, but the crux of my question: a) is my approach correct? b) are there any libraries / resources that may be of assistance?

You seem to know a lot of the right terminology. Try searching for "document classification." That is the general problem you are trying to solve. A classifier trained on a representative corpus will be more accurate than you think.
(a) There is no one correct approach. The approach you outline will work, however.
(b) Scikit-learn is a wonderful library for this sort of work.
There is plenty of other information about this topic online, including tutorials:
This Naive Bayesian Classifier on github probably already does most of what you want to accomplish.
This NLTK tutorial explains the topic in depth.
If you really want to get into it, I am sure a Google Scholar search will turn up thousands of academic articles in computer science and linguistics about exactly this topic.
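For instance, a minimal document-classification sketch along those lines with scikit-learn might look like the following (the categories and training snippets are made up for illustration, not a tuned setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; in practice you would use a real corpus per category.
train_texts = [
    "Enjoyed playing a game of football earlier.",
    "The striker scored twice in the second half.",
    "The new console launch had record sales.",
    "I finished the final boss last night.",
    "The senate passed the budget bill today.",
    "The election campaign starts next month.",
]
train_labels = ["Sport", "Sport", "Video Games", "Video Games", "Politics", "Politics"]

# TF-IDF features plus a multinomial Naive Bayes classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

text = "Watched the match and then played a bit of FIFA all evening."
print(model.predict([text])[0])     # predicted category
print(model.predict_proba([text]))  # per-category probabilities
```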

You should check out Latent Dirichlet Allocation; it will give you categories without labels. As always, Edwin Chen's blog is a good start.
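A bare-bones sketch of what LDA topic discovery looks like (using scikit-learn's implementation here, which is my assumption; gensim is another common choice, and the toy documents are invented):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents; replace with your own unlabeled corpus.
docs = [
    "Enjoyed playing a game of football earlier",
    "The team won the championship match on Saturday",
    "Parliament debated the new tax policy",
    "The election results were announced today",
]

# LDA works on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per discovered topic; naming the topics is still up to you.
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```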

Related

Using Topic Modelling or another NLP approach, is it possible to define words that go into topics/categories for better defined topic model?

I have a problem where I am using topic modelling, taking into consideration LDA & LSA approaches, but have found that some of the topics are not being defined as accurately as I would like. Is it possible to assign words to topics to help the machine learn better and more easily? If not, what techniques could I use instead to counter this problem?
As previously explained, I have tried LDA and LSA techniques for topic modelling and found LDA to be the most accurate, giving a coherence score of 0.46, and I have redefined the topic names. However, the words in the topics do not reflect the topic names, and this requires tuning of the model.
I have researched other NLP solutions such as keyword extraction and named entity recognition (NER) but do not think they are suitable for my problem.
I want to have 2 levels of categorization if possible, where level 1 is an overview and level 2 is more detailed. The example below is a loosely summarized customer feedback example:
Level 1
Training
Communication
Technology
Products & Services
Other
Level 2
Internal
External
Resolution Good
Resolution Bad
Unclear feedback
Ideally this is the format I would like the topic modelling output to produce, but I am unsure if this is viable.
Realistically, working on the weighting of the text would work. Example:
'Great training from the company' would be categorized as Training (Level 1) and Resolution Good (Level 2). The words being picked up here are 'great' and 'training', as they outweigh the other words in terms of categorization.
Happy to provide further information if required.
As you understand, topic modelling is generally an unsupervised technique, so I can hardly imagine you will solve your complex problem (2 levels of classification) just using this approach. Perhaps topic modelling could be a first step, which could help you in a subsequent supervised approach.
In any case, if you want to try to provide some words in order to guide the topic modelling task, there are at least two libraries to take a look at:
GuidedLDA (a bit old, but maybe coherent with your approach)
BERTopic (a breath of fresh air on topic modeling, also implements semi-supervised techniques)
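If it helps, a rough sketch of the BERTopic route might look like the following; this assumes BERTopic's documented fit_transform interface, where partial labels can be passed via y with -1 marking unlabeled documents, so treat the details as approximate:

```python
from bertopic import BERTopic

docs = [...]            # your customer feedback texts (BERTopic needs a reasonably large corpus)
partial_labels = [...]  # e.g. 0 = Training, 1 = Communication, ..., -1 = unlabeled

topic_model = BERTopic(min_topic_size=10)

# Semi-supervised fit: labeled documents nudge topics toward your
# predefined categories, while unlabeled ones (-1) are assigned freely.
topics, probs = topic_model.fit_transform(docs, y=partial_labels)

print(topic_model.get_topic_info())  # overview of the discovered topics
print(topic_model.get_topic(0))      # top words for topic 0
```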
Please share your updates on this task.
It seems that it is not possible to get multiple levels to answer my question; however, a way around this is to run the topic modelling approach twice to get 2 different levels. This requires more supervision in terms of defining the topic outputs and what you are trying to capture in each topic.
The approach I found useful after extensive research was CorEx: https://github.com/gregversteeg/corex_topic
It allows you to define the number of topics yourself and, more importantly, the words you want in each topic. I found that this answered my query with a more supervised approach.
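For anyone landing here later, a small anchored-CorEx sketch along those lines (the toy feedback strings and anchor words are invented for illustration; the API follows the corex_topic README, so treat it as approximate):

```python
from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer

# Toy feedback; replace with your own corpus.
docs = [
    "Great training from the company",
    "Poor communication about the recent changes",
    "The new system is slow and unreliable",
    "Helpful services but the training was rushed",
]

# CorEx expects a (sparse) binary doc-word matrix.
vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())

# Anchor words steer specific topics toward the categories you care about.
anchors = [["training"], ["communication"], ["system", "slow"]]

topic_model = ct.Corex(n_hidden=3, seed=1)
topic_model.fit(X, words=words, anchors=anchors, anchor_strength=3)

for n, topic in enumerate(topic_model.get_topics()):
    print(n, [t[0] for t in topic])  # top words per topic
```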

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with Spacy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed (see the NLTK sketch below)
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
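For step 3 of the pipeline above, NLTK's collocation finders are one way to surface frequent bi- and tri-grams across the whole corpus; a rough sketch, with invented feedback strings standing in for the real data:

```python
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# Toy feedback; in practice, flatten all ~138k records into one token stream.
feedback = [
    "stuck in the hov lane during the detour",
    "the hov lane saved me a lot of detour time",
    "the carpool lane took me way out of my way",
]
tokens = [tok for text in feedback for tok in text.lower().split()]

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Rank n-grams by PMI, ignoring anything seen fewer than 2 times.
bi_finder = BigramCollocationFinder.from_words(tokens)
bi_finder.apply_freq_filter(2)
print(bi_finder.nbest(bigram_measures.pmi, 10))

tri_finder = TrigramCollocationFinder.from_words(tokens)
tri_finder.apply_freq_filter(2)
print(tri_finder.nbest(trigram_measures.pmi, 10))
```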
Bingo: state-of-the-art results for your problem!
It's called zero-shot learning.
State-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
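For concreteness, a minimal zero-shot sketch with the Hugging Face transformers pipeline (the candidate labels below are invented; bart-large-mnli is a common default model for this pipeline):

```python
from transformers import pipeline

# Zero-shot classification: no annotated training data, just candidate labels.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

feedback = "The carpool lane detour added twenty minutes to my trip."
labels = ["routing", "traffic", "app usability", "pricing"]

result = classifier(feedback, candidate_labels=labels)
print(result["labels"])  # candidate labels sorted by score
print(result["scores"])
```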
The VADER tool is great for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Pay close attention to your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision: spaCy's rule-based match engine and components not only help you find the terms and sentences you are searching for, but also let you access the tokens inside a text and their relationships, compared with regular expressions.
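As a sketch of the "respect known n-grams" step with spaCy's PhraseMatcher and retokenizer (assumes the en_core_web_sm model is installed; the sample sentence is invented):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Known multi-word terms that should survive tokenization as single tokens.
known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(term) for term in known_ngrams])

doc = nlp("Stuck in the hov lane again, the detour time was ridiculous.")

# Merge each matched span into a single token.
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([token.text for token in doc])
```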

Recommendation for ML algorithm to differentiate between ban or allowance

I am new to Machine Learning and wanted to see if any of you could recommend an algorithm I could apply to a project I'm doing. Basically I want to scrape popular housing websites and look at their descriptions to see if they allow/disallow something, for example pets. The problem is that a simple search for pets leads to contradictory results: 'pets allowed' and 'no additional cost for pets' or 'no pets' or 'I don't accept at this time'. As seen from these examples, negative keywords like 'no' are often used to indicate pets are allowed, whereas positive keywords like 'accept' are used to indicate a ban. As such, I was wondering if there was any algorithm I could use (preferably in Python) to differentiate between the two. (Note: I can't run training data to generate an algorithm myself as the thing I am actually looking for is quite niche).
Thank you very much for your help!!
The keyword you're looking for is "document classification". This is a document classification problem. You start with documents (i.e. webpages) and you want to classify them as "allows pets" or "doesn't allow pets" (or whatever). There are a lot of good tutorials out there for performing document classification but a full explanation is beyond the scope of a StackOverflow answer.
You won't be able to do this for your particular niche case without providing at least some training data, but you could gather, say, 30 example websites, extract their text, manually add labels ("does fit my niche" vs "doesn't fit my niche"), and then run them through a standard document classification system and see if that gets you the accuracy you want. Also, in order for this to work with a small amount of training data (like your 30 documents), you'll need to start from a pretrained model.
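To make that concrete, here is one hedged sketch of the "small labeled set on top of a pretrained model" idea, using sentence-transformers embeddings plus a linear classifier (the library choice, model name, and listing snippets are all assumptions for illustration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# A small hand-labeled set, along the lines suggested above (toy examples).
texts = [
    "Pets allowed, no additional cost for pets.",
    "Sorry, no pets. I don't accept them at this time.",
    "Cats and small dogs welcome with a deposit.",
    "Strictly no animals in the apartment.",
]
labels = [1, 0, 1, 0]  # 1 = allows pets, 0 = doesn't

# Pretrained sentence embeddings do the heavy lifting, so a handful of
# labeled examples can be enough for a first pass.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

new_listing = "No pets of any kind are permitted."
print(clf.predict(encoder.encode([new_listing])))
```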
Good luck!

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then labelling those clusters. But as I said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow? I'd appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs and the accuracy rate may vary, because of the well-known fact and observation:
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of the data and how you are extracting it. I am assuming you're just getting general tweets, which can be about anything.
One thing you can do is generate a dictionary for each class you have
(i.e. Music => pop, jazz, rap, instruments ...)
which will contain words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP lexical semantics slides; they will surely clear up some of the concepts.
Once you have dictionaries for each class, cross-compare them with the tweets you have got. Label each tweet with the class it is most similar to (you can rank the classes according to the occurrences of words from these dictionaries). This will make your tweets labeled like the others.
Now the question is accuracy, but that depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label a set of tweets this way and use cosine similarity to cross-identify other tweets. This will help with the optimization part. But then again it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage it.
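A rough sketch of that dictionary-building idea with NLTK's WordNet interface (the seed words and scoring rule are made up; meronyms and holonyms could be added the same way, and this requires nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def build_class_dictionary(seed_words):
    """Expand a few seed words into a set of related terms via WordNet."""
    terms = set(seed_words)
    for word in seed_words:
        for synset in wn.synsets(word):
            terms.update(lemma.name().lower() for lemma in synset.lemmas())
            for related in synset.hyponyms() + synset.hypernyms():
                terms.update(lemma.name().lower() for lemma in related.lemmas())
    return terms

class_dictionaries = {
    "Music": build_class_dictionary(["music", "jazz", "rap", "instrument"]),
    "Sports": build_class_dictionary(["sport", "football", "cricket"]),
}

def label_tweet(tweet):
    tokens = set(tweet.lower().split())
    # Score each class by how many dictionary words appear in the tweet.
    scores = {cls: len(tokens & words) for cls, words in class_dictionaries.items()}
    return max(scores, key=scores.get)

print(label_tweet("Listening to some jazz on my new guitar amp"))
```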
Actually this seems like a typical use case for semi-supervised learning. There are plenty of methods of use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic approach for dealing with clusters with mixed labeling. Furthermore, solving an optimization problem not related to the task (labeling) will obviously not be as good as actually using this knowledge.
You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.
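A small sketch of that cluster-then-label idea (toy tweets, TF-IDF plus k-means, and a simple majority vote over the few labeled examples in each cluster; all of it illustrative rather than a recommended setup):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "great match last night, what a goal",
    "the team signed a new striker",
    "new study on heart health and diet",
    "doctors recommend more exercise and sleep",
]
# Only a couple of tweets are labeled; the rest (None) are not.
labels = ["sports", None, "health", None]

X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Name each cluster after the labels of the few labeled tweets it contains.
cluster_names = {}
for cluster_id in set(clusters):
    member_labels = [l for l, c in zip(labels, clusters) if c == cluster_id and l]
    cluster_names[cluster_id] = (
        max(set(member_labels), key=member_labels.count) if member_labels else "unknown"
    )

print([cluster_names[c] for c in clusters])
```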

Python NLTK difference between a sentiment and an incident

Hi, I want to implement a system which can identify whether a given sentence is an incident or a sentiment.
I was going through Python NLTK and found out that there is a way to determine the positivity or negativity of a sentence.
I found this reference link: ref link
I want to achieve something like:
'My new Phone is not as good as I expected' should be treated as a sentiment,
and 'Camera of my phone is not working' should be considered an incident.
I had the idea of making my own clusters to train my system to detect this, but I am not getting a desired solution. Is there a built-in way to find this, or any ideas on how to approach a solution for the same?
Thanks in advance for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.
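As a sketch of what that might look like with NLTK's POS tagger and built-in Naive Bayes classifier (the tiny corpus and feature set are only illustrative; it needs the punkt and averaged_perceptron_tagger resources downloaded):

```python
import nltk
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

def sentence_features(sentence):
    """Simple feature set: the words present, plus the POS tags of any verbs."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    features = {f"contains({word})": True for word, _ in tagged}
    features["verb_tags"] = tuple(tag for _, tag in tagged if tag.startswith("VB"))
    return features

# Tiny hand-made corpus; in practice you would need far more examples.
train = [
    ("My new phone is not as good as I expected", "sentiment"),
    ("I love the battery life on this phone", "sentiment"),
    ("Camera of my phone is not working", "incident"),
    ("The screen cracked after one day", "incident"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(sentence_features(text), label) for text, label in train]
)

print(classifier.classify(sentence_features("The speaker stopped working today")))
```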
