Grouped Text classification - python

I have thousands groups of paragraphs and I need to classify these paragraphs. The problem is that I need to classify each paragraph based on other paragraphs in the group! For example, a paragraph individually maybe belongs to class A but according to other paragraph in the group it belongs to class B.
I have tested lots of traditional and deep approaches( in fields like text classification, IR, text understanding, sentiment classification and so on) but those couldn't classify correctly.
I was wondering if anybody has worked in this area and could give me some suggestion. Any suggestions are appreciated. Thank you.
Update 1:
Actually we are looking for manual sentences/paragraph for some fields, so we first need to recognize if a sentence/paragraph is a manual or not second we need to classify it to it's fields and we can recognize its field only based on previous or next sentences/paragraphs.
To classify the paragraphs to manual/no-manual we have developed some promising approaches but the problem come up when we should recognize the field according to previous or next sentences/paragraphs, but which one?? we don't know the answer would be in any other sentences!!.
Update 2:
We can not use whole text of group as input because those are too big (sometimes tens of thousands of words) and contain some other classes and machine can't learn properly which lead to the drop the accuracy sharply.
Here is a picture that maybe help to better understanding the problem:

Related

How to compare similarities between paragraphs NLP

I've been experimenting with NLP, and use the Doc2Vec model.
The aim of my objective, is a forum suggested question feature. For example, If a user types a question it will compare the vector to other questions already asked. So far this has worked ok in the sense of comparing a question to another asked question.
However, I would like to extend this to comparing the body of the question. For example, just like stackoverflow, I'm writing a the description to my question.
I understand that doc2vec represents sentences through paragraph ids. So for my question example I spoke about first, each sentence will be a unique paragraph id. However, with paraphs i.e the body to the question, sentences will have the same id as other sentences apart of the same paragraph.
para = 'This is a sentence. This is another sentence'
[['This','is','a','sentence',tag=[1]], ['This','is','another','sentence',tag=[1]]
I'm wondering how to go about doing this. How can i input a corpus like so:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
In which I would expect a very close similarity between the two. Does similarity get calculated by sentence to sentence, rather then paragraph to paragraph? i.e.
cosine_sim(['It is a nice day today'],['It is a lovely day today.]
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is using the above sentences, say the vectors are like so
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
#Avergae out these vectors
para = [0.5,0.2,0.3...n]
and using this vector compare to another paragraph using the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library - based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternate ways to turn a text into a vector, and sometimes other ways, including the very-simple approach of averaging word-vectors together, gets called 'Doc2Vec' also.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same was as those during training, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that's repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.

Exclude values existing in a list that contains words like

I have a list of merchant category:
[
'General Contractors–Residential and Commercial',
'Air Conditioning, Heating and Plumbing Contractors',
'Electrical Contractors',
....,
'Insulation, Masonry, Plastering, Stonework and Tile Setting Contractors'
]
I want to exclude merchants from my dataframe if df['merchant_category'].str.contains() any of such merchant categories.
However, I cannot guarantee that the value in my dataframe has the long name as in the list of merchant category. It could be that my dataframe value is just air conditioning.
As such, df = df[~df['merchant_category'].isin(list_of_merchant_category)] will not work.
If you can collect a long list of positive examples (categories you definitely want to keep), & negative examples (categories you definitely want to exclude), you could try to train a text classifier on that data.
It would then be able to look at new texts and make a reasonable guess as to whether you want them included or excluded, based on their similarity to your examples.
So, as you're working in Python, I suggest you look for online tutorials and examples of "binary text classification" using Scikit-Learn.
While there's a bewildering variety of possible approaches to both representing/vectorizing your text, and then learning to make classifications from those vectors, you may have success with some very simple ones commonly used in intro examples. For example, you could represent your textual categories with bag-of-words and/or character-n-gram (word-fragments) representations. Then try NaiveBayes or SVC classifiers (and others if you need to experiment for possibly-bettr results).
Some of these will even report a sort of 'confidence' in their predictions - so you could potentially accept the strong predictions, but highlight the weak predictions for human review. When a human then looks at, an definitively rules on, a new 'category' string – because it was highlighted an iffy prediction, or noticed as an error, you can then improve the overall system by:
adding that to the known set that are automatically included/excluded based on an exact literal comparison
re-training the system, so that it has a better chance at getting other new similar strings correct
(I know this is a very high-level answer, but once you've worked though some attempts based on other intro tutorials, and hit issues with your data, you'll be able to ask more specific questions here on SO to get over any specific issues.)

find short text's model

I have a CSV file(like 1.3GB) and a column is about the short text. I want to do topic modeling for all short texts. But I don't know what should I do? I have tried the LDA method, but my data is too large I can not run it. By the way, I want to achieve this: the input is a short text and the output is some Exact categories or one category. (Sport, politics, entertainment, and so on)Can someone give me some idea? Code will be better! Thanks!

Python NLTK difference between a sentiment and an incident

Hi i want to implement a system which can identify whether the given sentence is an incident or a sentiment.
I was going through python NLTK and found out that there is a way to find out positivity or negativity of a sentense.
Found out the ref link: ref link
I want to achieve like
My new Phone is not as good as I expected should be treated as sentiment
and Camera of my phone is not working should be considered as incident.
I gave a Idea of making my own clusters for training my system for finding out such but not getting a desired solution is there a built-in way to find that or any idea on how can be approach for solution of same.
Advance thanks for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.

Guess tags of a paragraph programmatically using python

I've trying to read about NLP in general and nltk in specific to use with python. I don't know for sure if what am looking for exists out there, or if I perhaps need to develop it.
I have a program that collect text from different files, the text is extremely random and talks about different things. Each file contains a paragraph or 3 maximum, my program opens the files and store them into a table.
My question is, can i guess tags of what the paragraph is about? if anyone knows of an existing technology or approach, I would really appreciate it.
Thanks,
Your task is called "document classification", and the nltk book has a whole chapter on it. I'd start with that.
It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (select the N most important words or phrases in the text)?
You should train a classifier, the easiest one to develop (and you don't really need to develop it as NLTK provides one) is the naive baesian. The problem is that you'll need to classify manually a corpus of observations and then have the program guess what tag best fits a given paragraph (needless to say that the bigger the training corpus the more precise will be your classifier, IMHO you can reach a 80-85% of correctness). Take a look at the docs.

Categories