NER(Named Entity Recognition) Similarity between sentences in documents

I have been using spaCy to find the named entities in sentences. My problem is that I have to calculate the NER similarity between sentences of two different documents. Is there a formula or package available in Python for this?
TIA

I believe you are asking, how similar are two named entities?
This is not so trivial, since we have to define what "similar" means.
If we use a naive bag-of-words approach, two entities are more similar when more of their tokens are identical.
If we put the entity tokens into sets, the calculation is just the Jaccard coefficient:
Sim(ent1, ent2) = |ent1 ∩ ent2| / |ent1 ∪ ent2|
In Python this would be:
ent1 = set(map(str, spacy_entity1))
ent2 = set(map(str, spacy_entity2))
similarity = len(ent1 & ent2) / len(ent1 | ent2)
Here spacy_entity1 and spacy_entity2 are entities extracted by spaCy.
Each entity set ent is simply the set of the strings of that entity's tokens.
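To go from a pair of entities to a pair of sentences, the per-entity scores still have to be combined somehow. The sketch below is one possible way, not an established formula: it averages, over the entities of the first sentence, the best Jaccard score against any entity of the second. The function names and the best-match aggregation are assumptions made for illustration; the entities are plain token lists, which with spaCy could be built as `[[t.text for t in ent] for ent in doc.ents]`.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient of two token collections (0.0 if both are empty)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def best_match_similarity(entities_a, entities_b):
    """Average, over the entities in A, of the best Jaccard score against B.

    This aggregation is an illustrative assumption, not a standard measure.
    Note that it is not symmetric: swapping A and B can change the result.
    """
    if not entities_a or not entities_b:
        return 0.0
    best_scores = [max(jaccard(a, b) for b in entities_b) for a in entities_a]
    return sum(best_scores) / len(best_scores)


# Entities as token lists, one list per entity (made-up example data).
sentence_1_entities = [["New", "York", "City"], ["Barack", "Obama"]]
sentence_2_entities = [["New", "York"], ["Obama"]]

score = best_match_similarity(sentence_1_entities, sentence_2_entities)
print(round(score, 4))
```

Because the score is asymmetric, averaging it with the score computed in the other direction is a simple way to make it symmetric if that matters for the task.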

I've been tackling the same problem, and here's how I'm solving it.
Since you've given no context about your problem, I'll keep the solution as general as possible.
Links to tools:
spaCy: spaCy Similarity
Flair NLP: Flair NLP zero-shot / few-shot
Roam Research, for research articles and their approaches: Roam Research home page
DISCLAIMER: THIS IS A TRIAL EXPERIMENT, NOT THE SOLUTION.
Steps:
Try a large language model with a similarity coefficient (like GPT-3 or TARS) on your dataset.
(Instead of similarity, you could also use zero-shot/few-shot classification and then check the accuracy manually, or compute it if you have labelled data.)
I then grab the verbs (assuming you have a corpus and not single-word inputs) and calculate their similarity, attaching the n words before and after each verb (positions x-n to x+n, if x is the position of the verb) and excluding stopwords according to your domain.
This step mainly lets you give more context to the large language model so that it is not biased by its large training corpus (this is experimental).
Finally, I grab all the named entities and, hopefully, their labels (example: India: Country/Location), and ask the same LLM (large language model), once its prompt has been constructed, how many of the same buckets/categories the entities fall into (a probability comparison, maybe).
Even if you don't reproduce these steps, what you must understand is that these tools give you atomic information about your raw input. You have to apply mathematical functions to compare that information and build the algorithm.
In my case I'm averaging the three similarity indices from the steps above and ensuring that all classification is multi-label classification.
To validate any of this, human pattern matching, maybe.

You probably need http://uima.apache.org/d/uimacpp-2.4.0/docs/Python.html/ plus a CoNLL-U parser attached to it (https://universaldependencies.org/format.html). With this approach, named entities are matched against a dictionary in the UIMA pipeline. You will need to develop your own NER search/match algorithms (in Python or another supported language).

Related

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words (no stop word removal yet) 75% in German, 15% in English and the rest in different languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.  
To achieve this goal I looked into several methods on feature extraction for document similarity, especially the word embedding methods have impressed me because they are context aware in contrast to simple TF-IDF feature extraction and the calculation of cosine similarity. 
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of those methods yet. I know for sure that my documents are too long for BERT, but there are FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec, even though there are no pre-trained models, or only old ones, which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Do I achieve good results if I train the model on English / German Wikipedia?

You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225K German documents and 45K English documents are each plausibly large enough to use Doc2Vec, as they match or exceed the corpus sizes of some published results. So you wouldn't necessarily need to train on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)

Sentences embedding using word2vec

I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets with billions of words, but my dataset is nowhere near that size (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
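from gensim.models import Word2Vec
# Word2Vec expects an iterable of token lists, so split each sentence on whitespace
vocab = [sentence.split() for sentence in df['Sentences']]
# gensim >= 4.0 calls this parameter vector_size (it was size in gensim 3.x)
model = Word2Vec(sentences=vocab, vector_size=100, window=10, min_count=3, workers=4, sg=0)
# Word2Vec has no vectorize() method, so df['Sentences'].apply(model.vectorize) fails;
# per-word vectors are available through model.wv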
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use Principal Component Analysis algorithm to visualise embeddings in 3-dimensional space. I know how to do for each individual word, but I do not know how to do it in case of sentences.

Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
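A minimal sketch of that averaging approach, in plain Python so it runs without any library. The tiny 2-dimensional `word_vectors` dictionary is made up for illustration; in practice the vectors would come from a trained or pre-trained model, and the function names here are hypothetical.

```python
import math


def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    # zip(*known) iterates over dimensions, so each column is one coordinate
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-d vectors, purely for illustration.
word_vectors = {
    "hawaii": [1.0, 0.0],
    "honolulu": [0.9, 0.1],
    "travel": [0.0, 1.0],
    "expenses": [-1.0, 0.0],
}

vec_a = average_vector(["hawaii", "travel"], word_vectors)
vec_b = average_vector(["honolulu", "travel"], word_vectors)
vec_c = average_vector(["expenses"], word_vectors)

print(cosine(vec_a, vec_b), cosine(vec_a, vec_c))
```

The resulting pairwise similarities can be collected into a matrix and handed to whichever dimensionality-reduction or plotting tool is preferred.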
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention compared to other words, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
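The context-summary idea can be sketched as follows. This is only one reading of it, under stated assumptions: the summary is taken to be the plain average of the vectors of the up-to-N words on each side of the target, the 2-dimensional `word_vectors` are made up, and the helper names are hypothetical.

```python
def context_windows(tokens, target, n):
    """Return, for each occurrence of target, the up-to-n tokens on either side."""
    windows = []
    for i, token in enumerate(tokens):
        if token == target:
            windows.append(tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n])
    return windows


def mean_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


# Made-up 2-d vectors, purely for illustration.
word_vectors = {
    "domestic": [1.0, 0.0],
    "and": [0.0, 0.0],
    "their": [0.0, 1.0],
    "expenses": [0.0, 1.0],
}

sentences = [
    "hawaii makes a move to boost domestic travel and support local tourism".split(),
    "pay for half of their travel expenses".split(),
]

# One summary vector per occurrence of "travel", built only from its neighbours.
context_vectors = [
    mean_vector(window, word_vectors)
    for tokens in sentences
    for window in context_windows(tokens, "travel", n=1)
]
print(context_vectors)
```

With enough occurrences, these per-occurrence vectors are what would be clustered or projected to look for distinct usage neighborhoods.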
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVectors classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance

If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces worse results than a dedicated sentence embedding algorithm. If your dataset is not huge, you can't train a new embedding space on your own data, which forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
Once you have vector representations for your sentences, you can go two ways:
Create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
Run t-SNE on the vector representations. This will create a 2D projection of the sentences that aims to keep nearby sentences close together. It often presents such data more clearly than PCA. Then you can easily find the neighbors of a given sentence.
You can learn more from this and this

Interesting take on the word2vec model. You can use t-SNE on the vectors to reduce the dimensionality to 3 and visualise them using any plotting library, such as matplotlib or dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will need a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.

Documents in training data belongs to a particular topic in LDA

I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all related questions asked earlier.
So if anyone has any suggestions on how to do this, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic my new document belongs to, but how do I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has any suggestions of algorithms other than LDA, please tell me.

You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in close proximity in your documents end up in close proximity in the vector space. You can use pre-trained word embeddings. Learn more on word2vec here. Doc2vec lets you map each document to a vector of dimension N. After this, you can use any distance measure to find the documents most similar to an input document.
Word Mover's Distance - This is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find more about it here
Also, make sure you apply appropriate text cleaning before applying the algorithms: steps like case normalization, stopword removal, punctuation removal, etc. It really depends on your dataset. Find out more here
I hope this was helpful...
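A minimal sketch of the cleaning steps named above (case normalization, punctuation removal, stopword removal), using only the standard library. The stopword set is a tiny hypothetical placeholder; a real project would use a fuller list suited to its own data and domain.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def clean(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in stopwords]


print(clean("The camera of my phone is NOT working!"))
```

Note that stripping punctuation this way also removes apostrophes and hyphens, which may or may not be what a given dataset needs.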

Python NLTK difference between a sentiment and an incident

Hi, I want to implement a system which can identify whether a given sentence is an incident or a sentiment.
I was going through Python NLTK and found that there is a way to determine the positivity or negativity of a sentence.
Here is the reference I found: ref link
What I want to achieve is:
"My new Phone is not as good as I expected" should be treated as a sentiment,
and "Camera of my phone is not working" should be considered an incident.
I had the idea of building my own clusters to train my system for this, but I am not getting the desired result. Is there a built-in way to do this, or any idea on how to approach a solution?
Thanks in advance for your time.

If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.
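To make the train-then-test-on-unseen-data workflow concrete, here is a rough sketch. It deliberately swaps in a hand-rolled multinomial naive Bayes with add-one smoothing on bag-of-words features instead of an NLTK classifier, so that it runs with the standard library alone. The sentences and labels are made up, the function names are hypothetical, and a real corpus would need far more data and better features.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_sentences):
    """Collect per-label word counts from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for sentence, label in labelled_sentences:
        tokens = sentence.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(sentence, model):
    """Return the label with the highest log-probability for the sentence."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # log prior plus log likelihoods with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in sentence.lower().split():
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data.
train = [
    ("my new phone is not as good as i expected", "sentiment"),
    ("i am disappointed with the battery life", "sentiment"),
    ("the screen looks great", "sentiment"),
    ("camera of my phone is not working", "incident"),
    ("the charger stopped working yesterday", "incident"),
    ("the screen is broken and not responding", "incident"),
]
# Held-out sentences that were not seen during training.
held_out = [
    ("the speaker is not working", "incident"),
    ("the phone looks good", "sentiment"),
]

model = train_naive_bayes(train)
accuracy = sum(classify(s, model) == label for s, label in held_out) / len(held_out)
print(accuracy)
```

Accuracy on two held-out sentences says nothing statistically; the point is only the shape of the workflow, where evaluation never touches the training sentences.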

Feature space reduction for tag prediction

I am writing an ML module (Python) to predict tags for a Stack Overflow question (title + body). My corpus is around 5 million questions, with title, body and tags for each. I'm splitting this 3:2 for training and testing. I'm plagued by the curse of dimensionality.
Work Done
Pre-processing: markup removal, stopword removal, special character removal and a few bits and pieces. Store into MySQL. This almost halves the size of the test data.
ngram association: for each unigram and bigram in the title and the body of each question, I maintain a list of the associated tags. Store into redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. Ex.
"continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
Note: There are 2 problems here: a) Not all unigrams and bigrams are important and, b) not all tags associated with a ngram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be haphazardly removed. The number of tags associated with a given ngram easily runs into the thousands - most of them unrelated and irrelevant.
tfidf: to aid in selecting which ngrams to keep, I calculated the tfidf score for the entire corpus for each unigram and bigram and stored the corresponding idf values with associated tags. Ex.
"continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
The tfidf scores are stored in a document × feature sparse.csr_matrix, and I'm not sure how I can leverage that at the moment. (It is generated by fit_transform().)
Questions
How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA but the examples always talk about a set of documents and a vocabulary. I'm not sure where the tags from my set can come in. Also, the way my data is stored (redis + sparse matrix), it is difficult to use an already implemented module (sklearn, nltk etc) for this task.
Once the feature set is reduced, the way I have planned to use it is as follows:
Preprocess the test data.
Find the unigrams and bigrams.
For the ones stored in redis, find the corresponding best-k tags
Apply some kind of weight for the title and body text
Apart from this, I might also search for exact known tag matches in the document. For example, if "ruby-on-rails" occurs in the title/body, then there is a high probability that it's also a relevant tag.
Also, for tags predicted with a high probability, I might leverage a tag graph (an undirected graph in which tags that frequently occur together are joined by weighted edges) to predict more tags.
Are there any suggestions on how to improve upon this? Can a classifier come in handy?
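The lookup-and-weight plan described in the steps above could be sketched roughly like this. Everything specific is an assumption made for illustration: a plain dict stands in for the redis store, each n-gram contributes its tags in proportion to their relative frequency scaled by the stored `__idf__` value, and the title/body weights (2.0 and 1.0) are arbitrary placeholders rather than tuned values.

```python
from collections import defaultdict


def ngrams(tokens):
    """Return the unigrams and bigrams of a token list as strings."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def predict_tags(title_tokens, body_tokens, ngram_tags, k=3,
                 title_weight=2.0, body_weight=1.0):
    """Score tags from stored n-gram/tag frequencies and return the best k."""
    scores = defaultdict(float)
    for tokens, weight in ((title_tokens, title_weight), (body_tokens, body_weight)):
        for gram in ngrams(tokens):
            entry = ngram_tags.get(gram)
            if entry is None:
                continue  # n-gram not in the store
            idf = entry.get("__idf__", 1.0)
            tag_counts = {t: c for t, c in entry.items() if t != "__idf__"}
            total = sum(tag_counts.values())
            for tag, count in tag_counts.items():
                scores[tag] += weight * idf * count / total
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:k]]


# Toy stand-in for the redis store, in the format shown in the question.
ngram_tags = {
    "continuous integration": {"ci": 42, "jenkins": 15, "windows": 1, "__idf__": 7.2123},
    "jenkins": {"jenkins": 30, "ci": 5, "__idf__": 5.0},
}

top = predict_tags(["continuous", "integration"],
                   ["jenkins", "build", "fails"],
                   ngram_tags, k=2)
print(top)
```

Normalizing each n-gram's tag counts before summing keeps a single very frequent n-gram from dominating, which is one way to soften the long tail of low-frequency, irrelevant tags mentioned earlier.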
Footnote
I've a 16-core, 16GB RAM machine. The redis-server (which I'll move to a different machine) is stored in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using ipython clusters.

Use the public API of Dandelion; this is a demo.
It extracts concepts from a text, so in order to reduce dimensionality you could use those concepts instead of the bag-of-words paradigm.

A baseline statistical approach would treat this as a classification problem. Features are bags-of-words processed by a maximum entropy classifier like Mallet http://mallet.cs.umass.edu/classification.php. Maxent (aka logistic regression) is good at handling large feature spaces. Take the probability associated with each tag (i.e., the class labels) and choose some decision threshold that gives you a precision/recall tradeoff that works for your project. Some of the Mallet documentation even mentions topic classification, which is very similar to what you are trying to do.
The open questions are how well Mallet handles the size of your data (which isn't that big) and whether this particular tool is a non-starter with the technology stack you mentioned. You might be able to train offline (dump the redis database to a text file in Mallet's feature format) and run the Mallet-learned model in Python. Evaluating a maxent model is simple. If you want to stay in Python and have this be more automated, there are Python-based maxent implementations in NLTK and probably in scikit-learn. This approach is not at all state-of-the-art, but it'll work okay and be a decent baseline with which to compare more complicated methods.
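The thresholding step can be illustrated on its own, independent of which classifier produced the probabilities. In this sketch the per-tag probabilities and the gold tags are made up, and the helper names are hypothetical; it only shows how moving the threshold trades precision against recall.

```python
def tags_above_threshold(tag_probs, threshold):
    """Keep every tag whose predicted probability reaches the threshold."""
    return {tag for tag, prob in tag_probs.items() if prob >= threshold}


def precision_recall(predicted, gold):
    """Precision and recall of a predicted tag set against the gold tag set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up classifier output for one question, and its true tags.
tag_probs = {"python": 0.91, "django": 0.62, "regex": 0.18, "java": 0.05}
gold = {"python", "django", "orm"}

for threshold in (0.1, 0.5, 0.95):
    predicted = tags_above_threshold(tag_probs, threshold)
    print(threshold, sorted(predicted), precision_recall(predicted, gold))
```

In practice the threshold would be chosen by sweeping it over a held-out set and picking the point on the precision/recall curve that suits the project.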
