Ways of obtaining a similarity metric between two full text documents? - python

So imagine I have three text documents, for example (let 3 randomly generated texts).
Document 1:
"Whole every miles as tiled at seven or. Wished he entire esteem mr oh by. Possible bed you pleasure civility boy elegance ham. He prevent request by if in pleased. Picture too and concern has was comfort. Ten difficult resembled eagerness nor. Same park bore on be...."
Document 2:
"Style too own civil out along. Perfectly offending attempted add arranging age gentleman concluded. Get who uncommonly our expression ten increasing considered occasional travelling. Ever read tell year give may men call its. Piqued son turned fat income played end wicket..."
If I want to obtain in python (using libraries) a metric on how similar these 2 documents are to a third one (in other words, which one of the 2 documents is more similar to a third one) , what would be the best way to proceed?
edit: I have observed other questions that they answer by comparing individual sentences to other sentences, but I am not interested on that, as I want to compare a full text (consisting on related sentences) against another full text, and obtaining a number (which for example may be bigger than another comparison obtained with a different document which is less similar to the target one)

There is no simple answer to this question. As similarities will perform better or worse depending on the particular task you want to perform.
Having said that, you do have a couple of options regarding comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to perform full document similarity. How to aggregate this? will also depend on your particular task. A simple, but often well-performing approach is to compute the average sentence similarities of the 2 (or more) documents.
Other useful links for this topics include:
Introduction to Information Retrieval (free book)
Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)

You could try the Simphile NLP text similarity library (disclosure: I'm the author). It offers several language agnostic methods: JaccardSimilarity, CompressionSimilarity, EuclidianSimilarity. Each has their advantages, but all work well on full document comparison:
Install:
pip install simphile
This example shows Jaccard, but is exactly the same with Euclidian or Compression:
from simphile import jaccard_similarity
text_a = "I love dogs"
text_b = "I love cats"
print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")

Related

definite vs indefinite article usage corrector

I'm writing a program that corrects 'a/an' vs 'the' article usage . I've been able to detect case of plurality ( article is always 'the' when the corresponding noun is plural ) .
I'm stumped on how to solve this issue for singular nouns. Without context, both "an apple" and the "apple" are correct. How would I approach such cases ?
I don't think this is something you will be able to get 100% accuracy on, but it seems to me that one of the most important cues is previous mention. If no apple has been mentioned before, then it is a little odd to say 'the apple'.
A very cheap (and less accurate) approach is to literally check for a token 'apple' in the preceding context and use that as a feature, possibly in conjunction with many other features, such as:
position in text (definiteness becomes likelier as the text progresses)
grammatical function via a dependency parse (grammatical subjects more likely to be definite)
phrase length (definite mentions are typically shorter, fewer adjectives)
etc. etc.
A better but more complex approach would be to insert "the" and then use a coreference resolution component to attempt to find a previous mention. Although automatic coreference resolution is not perfect, it is the best way to determine if there is a previous mention using NLP, and most systems will also attempt to resolve non-trivial cases, such as "John has Malaria ... the disease", which a simple string lookup will miss, as well as distiguishing non-co-referring mentions: a red apple ... != a green apple.
Finally, there is a large amount of nouns which can appear with an article despite not being mentioned previously, including names ("the Olympic Games"), generics ("the common ant"), contextually inferable words ("pass the salt") and uniquely identifiable ("the sun"). All of these could be learned from a training corpus, but that would probably require a separate classifier.
Hope this helps!

Use NLTK to find reasons within text

For my project at work I am tasked with going through a bunch of user generated text, and in some of that text are reasons for cancelling their internet service, as well as how often that reason is occurring. It could be they are moving, just don't like it, or bad service, etc.
While this may not necessarily be a Python question, I am wondering if there is some way I can use NLTK or Textblob in some way to determine reasons for cancellation. I highly doubt there is anything automated for such a specialized task and I realize that I may have to build a neural net, but any suggestions on how to tackle this problem would be appreciated.
This is what I have thought about so far:
1) Use stemming and tokenization and tally up most frequent words. Easy method, not that accurate.
2) n-grams. Computationally intensive, but may hold some promise.
3) POS tagging and chunking, maybe find words which follow conjunctions such as "because".
4) Go through all text fields manually and keep a note of reasons for cancellation. Not efficient, defeats the whole purpose of finding some algorithm.
5) NN, have absolutely no idea, and I have no idea if it is feasible.
I would really appreciate any advice on this.
Don't worry if this answer is too general or you can't understand
something - this is academic stuff and needs some basic preparations.
Feel free to contact me with questions, if you want (ask for my mail
in comment or smth, we'll figure something out).
I think that this question is more suited for CrossValidated.
Anyway, first thing that you need to do is to create some training set. You need to find as many documents with reasons as you can and annotate them, marking phrases specifying reason. The more documents the better.
If you're gonna work with user reports - use example reports, so that training data and real data will come from the same source.
This is how you'll build some kind of corpus for your processing.
Then you have to specify what features you'll need. This may be POS tag, n-gram feature, lemma/stem, etc. This needs experimentation and some practice. Here I'd use some n-gram features (probably 2-gram or 3-gram) and maybe some knowledge basing on some Wordnet.
Last step is building you chunker or annotator. It is a component that will take your training set, analyse it and learn what should it mark.
You'll meet something called "semantic gap" - this term describes situation when your program "learned" something else than you wanted (it's a simplification). For example, you may use such a set of features, that your chunker will learn finding "I don't " phrases instead of reason phrases. It is really dependent on your training set and feature set.
If that happens, you should try changing your feature set, and after a while - try working on training set, as it may be not representative.
How to build such chunker? For your case I'd use HMM (Hidden Markov Model) or - even better - CRF (Conditional Random Field). These two are statistical methods commonly used for stream annotation, and you text is basically a stream of tokens. Another approach could be using any "standard" classifier (from Naive Bayes, through some decision tress, NN to SVM) and using it on every n-gram in text.
Of course choosing feature set is highly dependent on chosen method, so read some about them and choose wisely.
PS. This is oversimplified answer, missing many important things about training set preparation, choosing features, preprocessing your corpora, finding sources for them, etc. This is not walk-through - these are basic steps that you should explore yourself.
PPS. Not sure, but NLTK may have some CRF or HMM implementation. If not, I can recommend scikit-learn for Markov and CRFP++ for CRF. Look out - the latter is powerful, but is a b*tch to install and to use from Java or python.
==EDIT==
Shortly about features:
First, what kinds of features can we imagine?
lemma/stem - you find stems or lemmas for each word in your corpus, choose the most important (usually those will have the highest frequency, or at least you'll start there) and then represent each word/n-gram as binary vector, stating whether represented word or sequence after stemming/lemmatization contains that feature lemma/stem
n-grams - similiar to above, but instead of single words you choose most important sequences of length n. "n-gram" means "sequence of length n", so e.g. bigrams (2-grams) for "I sat on a bench" will be: "I sat", "sat on", "on a", "a bench".
skipgrams - similiar to n-grams, but contains "gaps" in original sentence. For example, biskipgrams with gap size 3 for "Quick brown fox jumped over something" (sorry, I can't remember this phrase right now :P ) will be: ["Quick", "over"], ["brown", "something"]. In general, n-skipgrams with gap size m are obtained by getting a word, skipping m, getting a word, etc unless you have n words.
POS tags - I've always mistaken them with "positional" tags, but this is acronym for "Part Of Speech". It is useful when you need to find phrases that have common grammatical structure, not common words.
Of course you can combine them - for example use skipgrams of lemmas, or POS tags of lemmas, or even *-grams (choose your favourite :P) of POS-tags of lemmas.
What would be the sense of using POS tag of lemma? That would describe part of speech of basic form of word, so it would simplify your feature to facts like "this is a noun" instead of "this is plural female noun".
Remember that choosing features is one of the most important parts of the whole process (the other is data preparation, but that deserves the whole semester of courses, and feature selection can be handled in 3-4 lectures, so I'm trying to put basics here).
You need some kind of intuition while "hunting" for chunks - for example, if I wanted to find all expressions about colors, I'd probably try using 2- or 3-grams of words, represented as binary vector described whether such n-gram contains most popular color names and modifiers (like "light", "dark", etc) and POS tag. Even if you'd miss some colors (say, "magenta") you could find them in text if your method (I'd go with CRF again, this is wonderful tool for this kind of tasks) generalized learned knowledge enough.
While FilipMalczak's answer states the state-of-the-art method to solve your problem, a simpler solution (or maybe a preliminary first step) would be to do simple document clustering. This, done right, should cluster together responses that contain similar reasons. Also for this, you don't need any training data. The following article would be a good place to start: http://brandonrose.org/clustering

NLP with Python - how to build a corpus, which classifier to use?

I’m trying to figure out which direction to take my Python NLP project in, and I’d be very grateful to the SO community for any advice.
Problem:
Let’s say I have 100 .txt files that contain the minutes of 100 meetings held by a decisionmaking body. I also have 100 .txt files of corresponding meeting outcomes, which contain the resolutions passed by this body. The outcomes fall into one of seven categories – 1 – take no action, 2 – take soft action, 3 – take stronger action, 4 – take strongest action, 5 – cancel soft action previously taken, 6 – cancel stronger action previously taken, 7 – cancel strongest action previously taken. Alternatively, this can be presented on a scale from -3 to +3, with 0 signifying no action, +1 signifying soft action, -1 signifying cancellation of soft action previously taken, and so on.
Based on the text of the inputs, I’m interested in predicting which of these seven outcomes will occur.
I’m thinking of treating this as a form of sentiment analysis, since the decision to take a certain kind of action is basically a sentiment. However, all the sentiment analysis examples I’ve found have focused on positive/negative dichotomies, sometimes adding in neutral sentiment as a category. I haven’t found any examples with more than 3 possible classifications of outcomes – not sure whether this is because I haven’t looked in the right places, because it just isn’t really an approach of interest for whatever reason, or because this approach is a silly idea for some reason of which I’m not yet quite sure.
Question 1. Should be I approaching this as a form of sentiment analysis, or is there some other approach that would work better? Should I instead treat this as a kind of categorization matter, similar to classifying news articles by topic and training the model to recognize the "topic" (outcome)?
Corpus:
I understand that I will need to build a corpus for training/test data, and it looks like I have two immediately evident options:
1 – hand-code a CSV file for training data that would contain some key phrases from each input text and list the value of the corresponding outcome on a 7-point scale, similar to what’s been done here: http://help.sentiment140.com/for-students
2 – use the approach Pang and Lee used (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and put each of my .txt files of inputs into one of seven folders based on outcomes, since the outcomes (what kind of action was taken) are known based on historical data.
The downside to the first option is that it would be very subjective – I would determine which keywords/phrases I think are the most important to include, and I may not necessarily be the best arbiter. The downside to the second option is that it might have less predictive power because the texts are pretty long, contain lots of extraneous words/phrases, and are often stylistically similar (policy speeches tend to use policy words). I looked at Pang and Lee’s data, though, and it seems like that may not be a huge problem, since the reviews they’re using are also not very varied in terms of style. I’m leaning towards the Pang and Lee approach, but I’m not sure if it would even work with more than two types of outcomes.
Question 2. Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?
Question 3. Given all of the above, which classifier should I be using? I’m thinking maximum entropy would work best; I’ve also looked into random forests, but I have no experience with the latter and really have no idea what I’m doing (yet) when it comes to them.
Thank you very much in advance :)
Question 1 - The most straightforward way to think of this is as a text classification task (sentiment analysis is one kind of text classification task, but by no means the only one).
Alternatively, as you point out, you could consider your data as existing on a continuum ranging from -3 (cancel strongest action previously taken) to +3 (take strongest action), with 0 (take no action) in the middle. In this case you could treat the outcome as a continuous variable with a natural ordering. If so, then you could treat this as a regression problem rather than a classification problem. It's hard to know whether this is a sensible thing to do without knowing more about the data. If you suspect you will have a number of words/phrases that will be very probable at one end of the scale (-3) and very improbable at the other (+3), or vice versa, then regression may make sense. On the other hand, if the relevant words/phrases are associated with strong emotion and are likely to appear at either end of the scale but not in the middle, then you may be better off treating it as classification. It also depends on how you want to evaluate your results. If your algorithm predicts that a document is a -2 and it's actually a -3, will it be penalized less than if it had predicted +3? If so, it might be better to treat this as a regression task.
Question 2. "Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?"
Note that the set of documents (the .txt files of meeting minutes and corresponding outcomes) is your corpus -- the typical thing to do is randomly select 20% or so to be set aside as test data and use the remaining 80% as training data. The two general options you consider above are options for selecting the set of features that your classification or regression algorithm should attend to.
You correctly identify the upsides and downsides of the two most obvious approaches for coming up with features (hand-picking your own vs. Pang & Lee's approach of just using unigrams (words) as phrases).
Personally I'd also lean towards this latter approach, given that it's notoriously hard for humans to predict which phrases will be useful for classification--although there's no reason why you couldn't combine the two, having your initial set of features include all words plus whatever phrases you think might be particularly relevant. As you point out, there will be a lot of extraneous words, so it may help to throw out words that are very infrequent, or that don't differ enough in frequency between classes to provide any discriminative power. Approaches for reducing an initial set of features are known as "feature selection" techniques - one common method is mentioned here. Or see this paper for a more comprehensive list.
You could also consider features like the percent of high-valence words, high-arousal words, or high-dominance words, using the dataset here (click Supplementary Material and download the zip).
Depending on how much effort you want to put into this project, another common thing to do is to try a whole bunch of approaches and see which works best. Of course, you can't test which approach works best using data in the test set--that would be cheating and would run the risk of overfitting to the test data. But you can set aside a small part of your training set as 'validation data' (i.e. a mini-test set that you use for testing different approaches). Given that you don't have that much training data (80 documents or so), you could consider using cross validation.
Question 3 - The best way is probably to try different approaches and pick whatever works best in cross-validation. But if I had to pick one or two, I personally have found that k-nearest neighbor classification (with low k) or SVMs often work well for this kind of thing. A reasonable approach might be
having your initial features be all unigrams (words) + phrases that
you think might be predictive after you look at some training data;
applying a feature selection technique to trim down your feature set;
applying any
algorithm that can deal with high-dimensional/text features, such as those in http://www.csc.kth.se/utbildning/kth/kurser/DD2475/ir10/forelasningar/Lecture9_4.pdf (lots of good tips in that pdf!), or those that achieved decent performance in the Pang & Lee paper.
Other possibilities are discussed in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf . Often the specific algorithm matters less than the features that go into it. Frankly it sounds like a very difficult sort of classification task, so it's possible that nothing will work very well.
If you decide to treat it as a regression rather than a classification task, you could go with k nearest neighbors regression ( http://www.saedsayad.com/k_nearest_neighbors_reg.htm ) or ridge regression.
Random forests often do not work well with large numbers of dependent features (words), though they may work well if you end up deciding to go with a smaller number of features (for example, a set of words/phrases you manually select, plus % of high-valence words and % of high-arousal words).

Clustering Company Curriculum Vitae (CV's) in python (Clustering pieces of text)

I am trying to classify (cluster) our companies Curriculum Vitae (CVs). There is about 100 CV's in total. The idea is to find similar people based on their CV content. I have already transformed the word docs into text files and read all of the candidates into a python dictionary with the format:
cvdict = { 'name1' : "cv text", 'name2', : 'cv text', ... }
I have also remove most punctuation, lowercased it, removed numbers etc., and removed words with a length less than x (4)
My questions:
Is clustering the correct approach? If not, which Machine Learning algorithm would be a suitable initial focus for this task.
Any pointers as to some python code i can use to transverse this dictionary and 'cluster' the content. Based on the clustering of the content it should output the ‘keys’=candidate names as clustered groups.
So from what I understood you want to see potential groups/clusters in the set of CVs.
the idea of cvdict is great, but you also need to convert all texts to numbers ! you are half way through. so think about matrix/excel sheet/table. where you have the profile of each employee in each line.
name1,cv_text1
name2,cv_text2
name3,cv_text3 ...
Yes, as you can guess, the length of cv_text can vary. Some people have a lengthy resume some other not ! which words can categorize the company employee. Some how we need to make them all equal size; Also, not all words are informative, you need to think which words can capture your idea; In Machine Learning they call it "Feature" vector or matrix. So my suggestion would be drive a set of words and mark if the person has mentioned that word in his skill.
managment marketing customers statistics programming
name1 1 1 0 0 0
name2 0 0 0 1 1
name3 0 0 1 1 0
or instead of a 0/1 matrix you can put how many times that word was mentioned in the resume.
again you can just extract all possible words from all resumes. NLTK is an awesome module for doing text analysis and it has some built-in function for you to polish you text. have a look at the first half of this slide.
Then you can use any kind of clustering method, for example hierarchical https://code.activestate.com/recipes/578834-hierarchical-clustering-heatmap-python/
there are already packages for doing such analysis; either in scipy or scikit and I am sure for each you can find a tons of examples. The key step is the one you are already working on; representing your data as a matrix.
Couple more hints to earlier comment:
I would not throw away words less than 4 characters long. Instead I would use a stop list of common words. You don't want to throw away things like C++ or C#
One good technique of building a matrix above is to use TF-IDF metric. What it is is essentially a measure of how frequently some word occurs in a particular document vs. how frequently it occurs in the entire collection. So things like 'the' are very common so they will be downgraded very quickly. If only 5 people in your company now C++ this will boost up the metric for this word a lot.
You might want to consider to use a stemmer like a 'porter algorithm'. This algorithm will combine words like 'statistics' and 'statistical'.
Most machine learning algorithms have a problem with very wide matrices. Unfortunately, your resume base is only 100 documents which is considered quite low vs how many potential terms you will have. The reason these techniques work for google and NSA is because human languages tend to have tens of thousands words in active use vs billions of documents they have to index. For your task I would try to shrink you dataset to no more than 30-40 columns. Be very aggressive on throwing away the common words.
Unfortunately the biggest weakness of most of the clustering techniques is that you have to set a number of clusters in advance. A common approach that people use is to set up some type of measure of how good your clusters are and start running the clustering algorithm first with very few clusters and keep increasing until your metrics starts to drop off. Look up Andrew Ng machine learning course on the interwebs. He explains these technique very well.
Of course hierarchical clustering is not affected by the point 5.
Instead of clustering you can try building decision tree. Although not super accurate, decision trees have a great advantage to visualize the built model. By looking at the three you can easily see the reason where built the way they are.
Besides scipy and scikit, which are very good. Take a look at Orange Toolbox. It has a lot of good algorithms with good visualization tools. They way you program it is just by connecting boxes with arrows. After you got satisfied with your model you can easily dump it out to the run as a script.
Hope this helps.

Defining the context of a word - Python

I think this is an interesting question, at least for me.
I have a list of words, let's say:
photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet
and I have a list of contexts:
Programming
World news
Technology
Web Design
I need to try and match words with the appropriate context/contexts if possible.
Maybe discovering word relationships in some way.
Any ideas?
Help would be much appreciated!
This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.
I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.
Where do these words come from? Do they come from real texts. If they are then it is a classic data mining problem. What you need to do is to your set of documents into the matrix where rows represent which document the word came from and the columns represent the words in the documents.
For example if you have two documents like this:
D1: Need to find meaning.
D2: Need to separate Apples from oranges
you matrix will look like this:
Need to find meaning Apples Oranges Separate From
D1: 1 1 1 1 0 0 0 0
D2: 1 1 0 0 1 1 1 1
This is called term by document matrix
Having collected this statistics you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have your tasks should be soomewhat easier. K-Means is very slow algorithm, so you can try to optimize it using techniques such as SVD
I just found this a couple days ago: ConceptNet
It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.
If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.
The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this you need to figure out how fine-grained the concepts should be. WordNet, as it has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes which differentiate, e.g. (going up the hierarchy) human from animal, animates from plants, substances from solids, concrete from abstract things, etc.
Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea I just came up with, but there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (i.e. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do this last step (choosing the best of several given categories) may or may not be difficult.
See here for a list of other ontologies / knowledge bases you could use.

Categories