Topic modeling for short texts - Python

I have a CSV file (about 1.3 GB), and one column contains short texts. I want to do topic modeling for all the short texts, but I don't know what to do. I have tried LDA, but my data is too large and I cannot run it. To be precise, I want to achieve this: the input is a short text and the output is one or more exact categories (sport, politics, entertainment, and so on). Can someone give me some ideas? Code would be better! Thanks!
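One thing worth noting: the memory blocker with LDA can often be avoided by streaming the corpus instead of loading it whole. Below is a minimal sketch using gensim's LdaModel together with pandas chunked reading; the file path data.csv and the column name short_text are placeholders for your own.

```python
import pandas as pd
from gensim import corpora, models
from gensim.utils import simple_preprocess

CSV_PATH = "data.csv"    # placeholder path to the 1.3 GB file
TEXT_COL = "short_text"  # placeholder name of the short-text column

def iter_docs():
    """Stream tokenised documents chunk by chunk instead of loading the whole file."""
    for chunk in pd.read_csv(CSV_PATH, usecols=[TEXT_COL], chunksize=10_000):
        for text in chunk[TEXT_COL].dropna():
            yield simple_preprocess(text)

# One pass over the file builds the vocabulary; LdaModel makes its own pass to train.
dictionary = corpora.Dictionary(iter_docs())
dictionary.filter_extremes(no_below=5, no_above=0.5)

class BowCorpus:
    """Re-iterable bag-of-words view of the streamed documents."""
    def __iter__(self):
        for tokens in iter_docs():
            yield dictionary.doc2bow(tokens)

lda = models.LdaModel(BowCorpus(), id2word=dictionary, num_topics=10)
print(lda.print_topics())
```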

Related

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with Spacy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
1. Import a list of known n-grams into the tokenizer.
2. Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token).
3. Identify the most common bi- and tri-grams in the corpus that I missed.
4. Re-tokenize using the found n-grams.
5. Split by rating (>=4 and <=3).
6. Find the most common topics for each split of the data in the corpus.
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
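For steps 3 and 4 of the pipeline above (detecting frequent bi- and tri-grams programmatically and re-tokenizing with them), gensim's Phrases model is one commonly used option. A minimal sketch on a toy corpus; the documents and the very low min_count/threshold are illustrative only and should be raised for ~138k real records:

```python
from gensim.models.phrases import Phrases, Phraser

# Toy corpus: each document is already a list of tokens.
docs = [
    ["hov", "lane", "was", "blocked", "today"],
    ["avoid", "the", "hov", "lane", "at", "rush", "hour"],
    ["the", "detour", "time", "was", "too", "long"],
    ["detour", "time", "doubled", "my", "commute"],
]

# First pass joins frequent word pairs ("hov lane" -> "hov_lane").
bigram = Phraser(Phrases(docs, min_count=1, threshold=1))
# A second pass over the bigrammed corpus would pick up tri-grams on a larger corpus.
trigram = Phraser(Phrases(bigram[docs], min_count=1, threshold=1))

for doc in docs:
    print(trigram[bigram[doc]])  # e.g. ['hov_lane', 'was', 'blocked', 'today']
```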
Bingo! There are state-of-the-art results for your problem.
It's called zero-shot learning:
state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
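For reference, the approach described in that blog comes down to a few lines with the Hugging Face transformers library; facebook/bart-large-mnli is the NLI model the blog demonstrates with, and the example text and candidate labels below are placeholders:

```python
from transformers import pipeline

# NLI-based zero-shot classification; no annotated training data needed.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The team clinched the championship with a last-minute goal.",
    candidate_labels=["sport", "politics", "entertainment"],
)
print(result["labels"][0])  # highest-scoring category
```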
The VADER tool is well suited to sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Pay close attention to your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense in these use cases.
Using spaCy would be a good decision: spaCy's rule-based match engines and components not only help you find the terms and sentences you are searching for, but also give you access to the tokens inside a text and their relationships, which regular expressions cannot do.
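To illustrate that point, spaCy's PhraseMatcher plus the retokenizer can keep known n-grams together as single tokens, which covers the "HOV lane should be a single token" requirement. A minimal sketch, assuming the en_core_web_sm model is installed:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

# Known n-grams taken from the question.
known_ngrams = ["HOV lane", "carpool lane", "detour time"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(p) for p in known_ngrams])

doc = nlp("The hov lane was closed, so the detour time doubled.")

# Merge each matched span into a single token.
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])
# ['The', 'hov lane', 'was', 'closed', ',', 'so', 'the', 'detour time', 'doubled', '.']
```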

Grouped Text classification

I have thousands of groups of paragraphs, and I need to classify these paragraphs. The problem is that I need to classify each paragraph based on the other paragraphs in its group! For example, a paragraph on its own may belong to class A, but according to the other paragraphs in the group it belongs to class B.
I have tested lots of traditional and deep approaches (in fields like text classification, IR, text understanding, sentiment classification, and so on), but they couldn't classify correctly.
I was wondering if anybody has worked in this area and could give me some suggestions. Any suggestions are appreciated. Thank you.
Update 1:
Actually, we are looking for manual sentences/paragraphs for some fields, so we first need to recognize whether a sentence/paragraph is a manual or not; second, we need to classify it into its field, and we can recognize its field only based on the previous or next sentences/paragraphs.
To classify the paragraphs as manual/non-manual, we have developed some promising approaches, but the problem comes up when we have to recognize the field from the previous or next sentences/paragraphs. But which one? We don't know which other sentence will contain the answer!
Update 2:
We cannot use the whole text of a group as input because the groups are too big (sometimes tens of thousands of words) and contain other classes as well, so the machine can't learn properly, which makes the accuracy drop sharply.
Here is a picture that may help in understanding the problem:
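One way to approximate the neighbour dependence described in the updates, without feeding the whole group to the model, is to classify each paragraph together with a fixed window of its neighbours. A hypothetical sketch with scikit-learn; the groups, labels, and window size are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def with_context(group, window=1):
    """Join each paragraph with up to `window` neighbours on each side."""
    joined = []
    for i in range(len(group)):
        lo, hi = max(0, i - window), min(len(group), i + window + 1)
        joined.append(" ".join(group[lo:hi]))
    return joined

# Placeholder training data: groups of paragraphs with per-paragraph labels.
groups = [
    ["Install the unit on a flat wall.", "Use the supplied screws."],
    ["The weather was pleasant that day.", "We went hiking in the hills."],
]
labels = [["manual", "manual"], ["other", "other"]]

X = [p for g in groups for p in with_context(g)]
y = [l for ls in labels for l in ls]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict(with_context(["Tighten the bolts.", "Check the seal."])))
```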

Python NLTK: difference between a sentiment and an incident

Hi, I want to implement a system that can identify whether a given sentence is an incident or a sentiment.
I was going through Python NLTK and found out that there is a way to find the positivity or negativity of a sentence.
I found this reference: ref link
I want to achieve something like this:
"My new phone is not as good as I expected" should be treated as a sentiment,
and "Camera of my phone is not working" should be considered an incident.
I had the idea of making my own clusters to train my system to detect this, but I am not getting the desired results. Is there a built-in way to do that, or any idea on how to approach a solution?
Thanks in advance for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data; it's easy to "overtrain" your classifier so that it does well on the training data by using characteristics that coincidentally correlate with the category but will not recur in novel data.
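To make that concrete, here is a minimal sketch of such a classifier in NLTK, using the words plus the POS tags of the verbs as features; the tiny hand-labelled corpus is a placeholder for a real one:

```python
import nltk

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Placeholder hand-labelled corpus of (sentence, category) pairs.
train_sents = [
    ("My new phone is not as good as I expected", "sentiment"),
    ("The battery life is disappointing", "sentiment"),
    ("Camera of my phone is not working", "incident"),
    ("The screen cracked after one day", "incident"),
]

def features(sentence):
    """Bag of words plus the POS tags of the verbs."""
    tokens = nltk.word_tokenize(sentence.lower())
    feats = {"contains(%s)" % w: True for w in tokens}
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("VB"):
            # Verb forms often separate event reports from opinions.
            feats["verb(%s)" % tag] = True
    return feats

train_set = [(features(s), label) for s, label in train_sents]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("The speaker stopped working yesterday")))
```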

How to apply nltk to categorize questions

I have a list of questions in a text file, extracted from an online website. I am new to NLTK (in Python) and am going through the initial chapters of the NLTK book ( http://shop.oreilly.com/product/9780596516499.do ). Can anybody help me with categorizing my questions under different headings?
I don't know the headings of the questions in advance. So how do I create headings and then categorize the questions afterwards?
Your task consists of document clustering, where each question is a document, and cluster labeling, where label designates topic.
Note that if your questions are short and/or hard to separate (e.g., they belong to similar categories), the quality will not be very high.
Take a look at a simple recipe for document clustering, and at the related questions (first and second).
As a baseline for labels, try the highest tf-idf words from each cluster's words or from its centroid.
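A minimal sketch of that baseline with scikit-learn: cluster the questions with k-means over tf-idf vectors, then label each cluster with the highest-weighted terms in its centroid. The questions and the cluster count below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder questions; load yours from the text file instead.
questions = [
    "How do I reset my password?",
    "How can I change my account email address?",
    "Why is my payment failing?",
    "What payment methods do you accept?",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(questions)

km = KMeans(n_clusters=2, random_state=0, n_init=10)  # cluster count chosen by hand
km.fit(X)

# Heading baseline: the highest-weighted terms in each centroid.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print("cluster %d: %s" % (i, [terms[t] for t in top]))
```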

Guess tags of a paragraph programmatically using python

I've been trying to read about NLP in general, and NLTK in particular, to use with Python. I don't know for sure whether what I am looking for exists out there, or whether I perhaps need to develop it.
I have a program that collects text from different files; the text is extremely random and talks about different things. Each file contains one paragraph, or three at most; my program opens the files and stores them in a table.
My question is: can I guess tags for what each paragraph is about? If anyone knows of an existing technology or approach, I would really appreciate it.
Thanks,
Your task is called "document classification", and the nltk book has a whole chapter on it. I'd start with that.
It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (select the N most important words or phrases in the text)?
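If topic extraction is the route you choose, a simple baseline is to take each paragraph's highest tf-idf terms as candidate tags. A minimal sketch; the paragraphs below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder paragraphs; in practice, read them from your table.
paragraphs = [
    "The stock market fell sharply after the central bank raised interest rates.",
    "The team won the cup final after a dramatic penalty shootout.",
    "The new phone ships with a faster chip and a brighter display.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(paragraphs).toarray()
terms = vectorizer.get_feature_names_out()

for i, row in enumerate(X):
    top = row.argsort()[::-1][:3]  # three highest-weighted terms as tags
    print("paragraph %d tags: %s" % (i, [terms[t] for t in top]))
```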
You should train a classifier. The easiest one to develop (and you don't really need to develop it, as NLTK provides one) is naive Bayes. The catch is that you'll need to manually classify a corpus of observations and then have the program guess which tag best fits a given paragraph (needless to say, the bigger the training corpus, the more precise your classifier will be; IMHO you can reach 80-85% correctness). Take a look at the docs.
