Text Classification - Label Pre Process [closed] - python

I have a data set of 1M+ observations of customer interactions with a call center. The text is free text written by the representative taking the call. The text is not well formatted, nor is it close to being grammatically correct (a lot of shorthand). None of the free text is labeled, as I do not know what labels to provide.
Given the size of the data, would a random sample of the data (to give a high level of confidence) be a reasonable first step in determining what labels to create? Is it possible to avoid manually labeling 400+ random observations from the data, or is there no other method to pre-process the data in order to determine a good set of labels to use for classification?
Appreciate any help on the issue.

Text Pre-Processing:
Convert all text to lower case, tokenize into unigrams, remove all stop words, and use a stemmer to normalize each token to its base word.
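A minimal sketch of that pipeline with NLTK (the preprocess function and variable names are my own, for illustration):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # lower-case and split into unigram tokens
    tokens = word_tokenize(text.lower())
    # drop stop words and non-alphabetic tokens, stem the rest
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]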
There are 2 approaches I can think of for classifying the documents, i.e. the free text you spoke about. Each free text is a document:
1) Supervised classification: Take some time and randomly pick a few sample documents and assign them a category. Do this until you have multiple documents per category and all the categories you want to predict are covered.
Next, create a Tf-Idf matrix from this text. Select the top K features (tune the value of K to get the best results). Alternatively, you can use SVD to reduce the number of features by combining correlated features into one. Please bear in mind that you can also use other features, like the department of the customer service executive and many others, as predictors. Now train a machine learning model and test it out.
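A hedged scikit-learn sketch of that supervised route (texts and labels are placeholders for your hand-labeled sample; the value of k and the choice of classifier are assumptions to tune):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# texts: your labeled documents, labels: the categories assigned by hand
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=1000),         # the "top K features" step; tune k
    LogisticRegression(max_iter=1000)  # any classifier works here
)
pipeline.fit(texts, labels)
print(pipeline.predict(["cust called re refund not recvd"]))
Swapping SelectKBest for TruncatedSVD(n_components=300) from sklearn.decomposition gives the SVD variant mentioned above.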
2) Unsupervised learning: If you know how many categories your output variable has, you can use that number as the number of clusters you want to create. Use the Tf-Idf vectors from the above technique and create k clusters. Randomly pick a few documents from each cluster and decide which category the documents belong to. Suppose you picked 5 documents and noticed that they belong to the category "Wanting Refund"; label all documents in this cluster "Wanting Refund". Do this for all the remaining clusters.
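A rough scikit-learn sketch of this clustering route (texts is a placeholder for your documents; the number of clusters is an assumption):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)        # texts: all your documents

k = 10                                     # your assumed number of categories
km = KMeans(n_clusters=k, random_state=42)
cluster_ids = km.fit_predict(X)

# inspect a few documents per cluster to decide each cluster's label by hand
for cid in range(k):
    docs_in_cluster = [d for d, c in zip(texts, cluster_ids) if c == cid]
    print(cid, docs_in_cluster[:3])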
The advantage of unsupervised learning is that it saves you the pain of pre-classification and data preparation, but be warned: the accuracy might not be as good as supervised learning's.
The two methods explained are an abstract overview of what can be done. Now that you have an idea, read up more on the topics and use a tool like RapidMiner to achieve your task much faster.

Manual annotation is a good option since you have a very good idea of an ideal document corresponding to your label.
However, given the large dataset size, I would recommend that you fit an LDA to the documents and look at the topics generated; this will give you a good idea of labels that you can use for text classification.
You can also eventually use LDA for text classification by finding representative documents for your labels and then finding the closest documents to them by a similarity metric (say cosine).
Alternatively, once you have an idea of the labels, you can also assign them without any manual intervention using LDA, but then you will be restricted to unsupervised learning.
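A minimal gensim sketch of fitting LDA to surface candidate labels (tokenized_docs is assumed to be your preprocessed documents as lists of tokens; 20 topics is an arbitrary starting point):
from gensim import corpora
from gensim.models import LdaModel

# tokenized_docs: documents already preprocessed into lists of tokens
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)
# eyeball the top words of each topic to decide on candidate labels
for topic_id, words in lda.print_topics(num_words=10):
    print(topic_id, words)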
Hope this helps!
P.S. - Be sure to remove all the stopwords and use a stemmer to club together words of a similar kind, e.g. (managing, manage, management), at the pre-processing stage.

Related

How to use ML for text classification in Python? [closed]

I have two columns of data that are roughly 25k rows long. The first column contains a list of income statement line items and was created from OCR, so there are lots of errors in there. For example, there might be 20 line items for 'Income', but they might show as 'I ncome' or 'Imcome' or '...Incom', etc.
The second column contains a list of classifications that have been hand-coded so that line items can be categorized. For example, 'Miscellaneous Fees', 'Application Fees', 'Insurance Fees' would all be classified as 'Other Income'.
I'd like to train a model using my existing dataset to predict that 'I ncome' should be placed in the 'Income' category, 'Mscelaneous Fees' should be placed in the 'Other Income' category and so on.
My experience with ML is limited to the examples I've worked on in classes that all use continuous variables in the data sets, so I have practically zero experience working with text classification. I could convert the text categories to numerical values, but wouldn't be able to do so with the line items so I don't know that it would help me.
Can I accomplish this with sklearn? Pytorch? Tensorflow? Spark?
Really appreciate if someone can point me in the right direction!
First you have to correct the words, because TensorFlow and PyTorch pre-trained models work on properly formatted words. For this you can use pyspellchecker or autocorrect in Python, for instance.
After that you will have to prepare the data (try nltk or spacy): normalize lower/upper case, remove punctuation and special characters, and maybe apply stemming or lemmatization. Then tokenize the phrases with nltk.word_tokenize.
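A small sketch of that correction step with pyspellchecker and nltk (the clean function is my own illustration; results on heavily garbled OCR text will vary):
from spellchecker import SpellChecker
from nltk.tokenize import word_tokenize

spell = SpellChecker()

def clean(line_item):
    tokens = word_tokenize(line_item.lower())
    # replace each misspelled token with the most likely correction
    return ' '.join(spell.correction(t) or t for t in tokens)

print(clean('Imcome'))   # ideally -> 'income'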
Only after that can you map the first column to embeddings, i.e. vectors that represent each word/sentence.
For the embeddings, try this option, as it is one of the fastest (choose the language in TF Hub):
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the custom TF ops the preprocessing model needs

# the preprocessing model must match the BERT encoder (multi_cased with multi_cased)
BERT_MODEL = "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4"
PREPROCESS_MODEL = "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3"

sentences = [
    "Here We Go Then, You And I is a 1999 album by Norwegian pop artist Morten Abel. It was Abel's second CD as a solo artist.",
    "If it rains, it pours.",
    "The quick brown fox jumps over the lazy dog."]

preprocess = hub.load(PREPROCESS_MODEL)
bert = hub.load(BERT_MODEL)
inputs = preprocess(sentences)
outputs = bert(inputs)
Then you will map X (the first column, now many columns of embeddings) to Y (the second column - the classes). In this step you can use whatever classification algorithm you want: logistic regression, naive Bayes, SVM, decision trees, random forests, gradient boosting or even a neural network.
Ah, remember also to turn your second column of classes into numeric codes, e.g. with dataframe['column_2'].astype('category').cat.codes.
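A hedged end-to-end sketch of that mapping step, reusing the outputs dict from the BERT snippet above (df is a placeholder for your dataframe; the classifier and split are arbitrary choices):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = outputs['pooled_output'].numpy()              # one embedding row per line item
y = df['column_2'].astype('category').cat.codes   # classes as integer codes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))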

How to compare similarities between paragraphs NLP

I've been experimenting with NLP, using the Doc2Vec model.
My objective is a forum 'suggested question' feature: if a user types a question, it will compare the vector to other questions already asked. So far this has worked OK for comparing one question to another.
However, I would like to extend this to comparing the body of the question. For example, just like on Stack Overflow, I'm writing a description for my question.
I understand that doc2vec represents sentences through paragraph ids. So for the question example I spoke about first, each sentence will have a unique paragraph id. However, with paragraphs, i.e. the body of the question, sentences will have the same id as the other sentences that are part of the same paragraph.
para = 'This is a sentence. This is another sentence'
[['This', 'is', 'a', 'sentence', tag=[1]], ['This', 'is', 'another', 'sentence', tag=[1]]]
I'm wondering how to go about doing this. How can i input a corpus like so:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
In which I would expect a very close similarity between the two. Does similarity get calculated sentence to sentence, rather than paragraph to paragraph? i.e.
cosine_sim('It is a nice day today', 'It is a lovely day today.')
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is using the above sentences, say the vectors are like so
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
# Average out these vectors
para = [0.5,0.2,0.3...n]
and using this vector compare to another paragraph using the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library - based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternate ways to turn a text into a vector, and sometimes other approaches, including the very simple one of averaging word-vectors together, also get called 'Doc2Vec'.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same way as those during training, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.
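For concreteness, a minimal gensim sketch of the whole flow (forum_texts and the whitespace tokenization are placeholder assumptions; real preprocessing should match whatever you use consistently at training and inference time):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one text per question (question + body run together), tagged with its own id
docs = [TaggedDocument(words=text.lower().split(), tags=[i])
        for i, text in enumerate(forum_texts)]

model = Doc2Vec(docs, vector_size=100, epochs=20, min_count=2)

# infer vectors for two new paragraphs and compare them
v1 = model.infer_vector('it is a nice day today i wish i was outside in the sun'.split())
v2 = model.infer_vector('it is a lovely day today the sun is shining outside today'.split())
print(model.wv.cosine_similarities(v1, [v2]))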

Is it possible to let a neural network classify entities based on classified documents? [closed]

I tagged a dataset of texts with independent categories. When running a CNN classifier in Keras, I receive an accuracy of > 90%.
My texts are customer reviews, e.g. "I really liked the camera of this phone." Classes are e.g. "phone camera", "memory" etc.
What I am searching for is whether I can tag the sentences with the categories that appear in them while the classifier marks the entities that indicate the class. Or more specifically: how can I extract those parts of the input sentence that made a CNN network in Keras opt (i.e. classify) for 1, 2 or more categories?
My pipeline (in general) for a similar task.
I don't use NNs to solve the whole task
First, I don't use NNs directly to label separate entities like "camera", "screen" etc. There are some good approaches which might be useful, like pointer networks or just attention, but they just didn't work in my case.
I guess these architectures don't work well because there is a lot of noise, e.g. "I'm so glad I bought this TV" and the like, in my dataset. Approx. 75% overall, and the rest of the data is not so clean either.
Because of this, I do some additional actions:
Split sentences into chunks (sometimes they contain the desired entities)
Label these chunks by hand as "non-useful" (e.g. "I'm so happy/so upset" etc.) or useful: "good camera", "bad phone" etc.
Train classifier(s) to classify this data.
Details about a pipeline
How to "recognize" entities
I just used regexps and part-of-speech tags to split my data. But I work with a Russian-language dataset, and there's no good free syntax parser / library for Russian. If you work with English or another language well represented in the spacy or nltk libraries, you can use them for parsing to separate entities. Also, English grammar is quite strict in contrast to Russian, which probably makes your task easier.
Anyway, try to start with regexes and parsing.
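For English, a tiny spaCy sketch of a cheap first pass at candidate entities via noun chunks (the review sentence is made up):
import spacy

nlp = spacy.load('en_core_web_sm')  # needs: python -m spacy download en_core_web_sm

doc = nlp("I really liked the camera of this phone, but the battery dies fast.")
# noun chunks are a rough first cut at candidate entities
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['I', 'the camera', 'this phone', 'the battery']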
Vocabularies with keywords for topics like "camera", "battery", ... are very helpful, too.
Another approach to recognizing entities is topic modelling - PLSA/LDA (gensim rocks), but it's hard to tune, IMO, because there is a lot of noise in the texts. You'll get a lot of topics like {"happy", "glad", "bought", "family", ...} and so on - but you can try topic modelling anyway.
Also, you can create a dataset with entity labels for each text and train an NN with attention, so you can recognize entities by high attention weights, but creating such a dataset is very tedious.
Create a dataset and train NNs
I start to create a dataset only when I've got acceptable quality of "named entities" - because if you change this foundational part later, you'll probably have to throw the dataset away and start from scratch again.
Better to decide which labels you will use once and then not change them - it's a critical part of the work.
Training NNs on such data is probably the easiest part of the work - just use any good classifier, as for whole texts. Even simpler, non-NN classifiers might be useful - use blending, bagging etc.
Possible troubles
There's a trap - some reviews / features are not so obvious to an NN classifier, or even to a human, like "loud sound" or "gets very hot". Often they are context-dependent. So I got a little help from our team to mark the dataset - each entry was labeled by a group of humans to get better quality. I also use context labels - the category of a product - adding context for each entity: "loud sound" carries opposite sentiment for an audio system and for a washing machine, and the model can learn this. In most cases category labels are easily accessible through databases / web parsing.
Hope it helps, also I hope someone knows a better approach.

How to improve performance of LDA (latent dirichlet allocation) in sci-kit learn?

I am running LDA on health-related data. Specifically I have ~500 documents that contain interviews that last around 5-7 pages. While I cannot really go into the details of the data or results due to preserving data integrity/confidentiality, I will describe the results and go through the procedure to give a better idea of what I am doing and where I can improve.
For the results, I chose 20 topics and outputted 10 words per topic. Although 20 was somewhat arbitrary and I did not have a clear idea of a good amount of topics, that seemed like a good amount given the size of the data and that they are all health-specific. However, the results highlighted two issues: 1) it is unclear what the topics were since the words within each topic did not necessarily go together or tell a story and 2) many of the words among the various topics overlapped, and there were a few words that showed up in most topics.
In terms of what I did, I first preprocessed the text. I converted everything to lowercase, removed punctuation, removed unnecessary codings specific to the set of documents at hand. I then tokenized the documents, lemmatized the words, and performed tf-idf. I used sklearn's tf-idf capabilities and within tf-idf initialization, I specified a customized list of stopwords to be removed (which added to nltk's set of stopwords). I also set max_df to 0.9 (unclear what a good number is, I just played around with different values), min_df to 2, and max_features to 5000. I tried both tf-idf and bag of words (count vectorizer), but I found tf-idf to give slightly clearer and more distinct topics while analyzing the LDA output. After this was done, I then ran an LDA model. I set the number of topics to be 20 and the number of iterations to 5.
From my understanding, each decision I made above may have contributed to the LDA model's ability to identify clear, meaningful topics. I know that text processing plays a huge role in LDA performance, and the better job I do there, the more insightful the LDA will be.
Is there anything glaringly wrong, or something I missed? Do you have any suggested values/explorations for any of the parameters I described above?
How detailed, nit-picky should I be when filtering out potential domain-specific stopwords?
How do I determine a good number of topics and iterations during the LDA step?
How can I go about validating performance, other than qualitatively comparing output?
I appreciate all insights and input. I am completely new to the area of topic modeling and while I have read some articles, I have a lot to learn! Thank you!
How do I determine a good number of topics and iterations during the LDA step?
This is the most difficult question in clustering algorithms like LDA. There is a metric that can help determine the best number of clusters: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_tutorial.ipynb
In my experience, optimizing this metric by tuning the number of topics, iterations or other hyper-parameters won't necessarily give you interpretable topics.
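A hedged gensim sketch of sweeping the topic count against that coherence metric (tokenized_docs is a placeholder for your preprocessed documents as lists of tokens; the candidate values of k are arbitrary):
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(d) for d in tokenized_docs]

for k in (5, 10, 15, 20, 25):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence='c_v')
    print(k, cm.get_coherence())   # higher coherence suggests a better k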
How can I go about validating performance, other than qualitatively comparing output?
Again, you may use the above metric to validate the performance, but I also found visualization of the topics useful: http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
This not only gives you topic histograms but also shows how far apart the topics are, which again may help with finding the optimal number of topics.
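A minimal pyLDAvis sketch, assuming a gensim lda model, corpus and dictionary like those above (the gensim_models module path applies to recent pyLDAvis versions; older ones used pyLDAvis.gensim):
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_topics.html')   # open in a browser to explore the topics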
In my studies I was not using scikit, but rather gensim.

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and label those clusters. But as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow? I appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs and the accuracy rate may vary, because of the well-known fact and observation that
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of the data and how you are extracting it; I am assuming you're just getting general tweets which can be about anything.
What you can do is generate a dictionary for each class you have
(i.e. Music => pop, jazz, rap, instruments ...)
which will contain words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start by extracting:
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP lexical semantics slides; they will surely clear up some of the concepts.
Once you have dictionaries for each class, cross-compare them with the tweets you have. A tweet can be labeled with the class it has the most similarity to (you can rank tweets according to the occurrences of words from these dictionaries). This will make your tweets labeled like the others, as sketched below.
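A rough NLTK/WordNet sketch of building those class dictionaries and ranking tweets by overlap (the seed words and helper names are made up for illustration):
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

def expand(seed_words):
    # collect synonyms, hypernyms and hyponyms for each seed word
    terms = set(seed_words)
    for word in seed_words:
        for syn in wn.synsets(word):
            terms.update(l.name().lower() for l in syn.lemmas())
            for rel in syn.hypernyms() + syn.hyponyms():
                terms.update(l.name().lower() for l in rel.lemmas())
    return terms

dictionaries = {'music':  expand(['music', 'pop', 'jazz', 'rap']),
                'sports': expand(['sport', 'football', 'cricket'])}

def label_tweet(tweet):
    tokens = set(tweet.lower().split())
    # rank classes by how many dictionary words the tweet contains
    return max(dictionaries, key=lambda c: len(tokens & dictionaries[c]))

print(label_tweet('loved the jazz concert last night'))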
Now the question is accuracy! But that depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label some set of tweets this way and use cosine similarity to cross-identify other tweets. This will help with the optimization part. But then again it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage that.
Actually this seems like a typical use case for semi-supervised learning. There are plenty of methods of use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic approach for dealing with clusters with mixed labeling. Furthermore - obviously, solving an optimization problem not related to the task (labeling) will not be as good as actually using this knowledge.
You can use clustering for this task. For that, you have to label some examples for each class first. Then, using these labeled examples, you can identify the class of each cluster easily.
