How can I calculate the string similarity (semantic meaning) between two strings?
For example, if I have two strings like "Display" and "Screen", the similarity should be close to 100%.
If I have "Display" and "Color", the similarity should be close to 0%.
I'm writing my script in Python. Is there a library or framework that does this kind of thing? Alternatively, can someone suggest a good approach?
Based on your examples, I think you are looking for semantic similarity. You can compute this with WordNet, for instance, but you will have to specify that you are working with nouns and possibly iterate over the different meanings of each word. The link shows two examples that calculate the similarity according to various implementations.
Most implementations are, however, computationally expensive: they use a large amount of text to calculate how often two words occur close to each other, etc.
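To make the WordNet idea concrete, here is a minimal sketch of path-based similarity over an "is-a" hierarchy, scored as 1 / (1 + shortest path length), which is how WordNet's path similarity is defined. The tiny hierarchy, the concept names and the function names are all invented for illustration; they are not WordNet data and not NLTK's API (NLTK exposes the real thing through nltk.corpus.wordnet).

```python
# Toy word -> concept map. Words that share a concept are treated as synonyms.
# All entries are invented for illustration, not taken from WordNet.
CONCEPT_OF = {
    "display": "screen_device",
    "screen": "screen_device",
    "keyboard": "input_device",
    "color": "visual_property",
}

# Toy concept -> parent concept ("is-a") links, also invented.
PARENT_OF = {
    "screen_device": "output_device",
    "output_device": "device",
    "input_device": "device",
    "device": "entity",
    "visual_property": "attribute",
    "attribute": "entity",
}


def ancestors(concept):
    """Map the concept and each of its ancestors to its distance from the concept."""
    distance = {}
    steps = 0
    while concept is not None:
        distance[concept] = steps
        concept = PARENT_OF.get(concept)
        steps += 1
    return distance


def path_similarity(word_a, word_b):
    """Return 1 / (1 + length of the shortest path between the two concepts)."""
    up_a = ancestors(CONCEPT_OF[word_a])
    up_b = ancestors(CONCEPT_OF[word_b])
    common = set(up_a) & set(up_b)
    if not common:
        return 0.0
    shortest = min(up_a[c] + up_b[c] for c in common)
    return 1.0 / (1.0 + shortest)


print(path_similarity("display", "screen"))
print(path_similarity("display", "color"))
```

In this toy hierarchy "display" and "screen" share a concept and score 1.0, while "display" and "color" only meet at the root and score much lower. With real WordNet data a word usually has several senses, which is why you would iterate over them and keep the best score.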
What you're trying to solve is an NLP problem, which can be a hassle if you're not familiar with the field. The most popular library out there is NLTK, which has a lot of AI tools. A quick Google search for what you're looking for yields the logic of semantics: http://www.nltk.org/book/ch10.html
This is a computationally heavy process, since it involves loading a dictionary of the entire English language. If you have a small subset of examples, you might be better off creating a mapping yourself.
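If the vocabulary really is small, the hand-made mapping could look like the sketch below. The synonym groups and the function names are hypothetical; you would replace them with the terms from your own data.

```python
# Hypothetical hand-made synonym groups; extend these with your own terms.
SYNONYM_GROUPS = [
    {"display", "screen", "monitor"},
    {"color", "colour", "hue"},
]


def build_lookup(groups):
    """Map every word to the index of the group it belongs to."""
    lookup = {}
    for group_id, words in enumerate(groups):
        for word in words:
            lookup[word] = group_id
    return lookup


LOOKUP = build_lookup(SYNONYM_GROUPS)


def similarity(a, b):
    """Return 1.0 for identical words or words in the same group, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    group_a, group_b = LOOKUP.get(a), LOOKUP.get(b)
    return 1.0 if group_a is not None and group_a == group_b else 0.0


print(similarity("Display", "Screen"))  # same group
print(similarity("Display", "Color"))   # different groups
```

This gives only all-or-nothing scores, but it is cheap, predictable and easy to audit, which may be enough for a fixed list of terms.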
I am not good at NLP, but I think the Levenshtein distance algorithm can help you solve this problem, because I use it to calculate the similarity between two strings, and the performance is not bad. Note that it compares spelling, not meaning.
Below is my C++ code (click the link); maybe you can translate it to Python. I will post the Python code later.
If you understand dynamic programming, I think you can understand it.
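For reference, a minimal Python sketch of Levenshtein distance using dynamic programming. This is not the C++ code mentioned above, just a standard two-row formulation; the ratio helper and its normalisation by the longer string's length are my own choice for turning the distance into a 0 to 1 score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """Scale the distance to a similarity between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(ratio("display", "screen"))
```

Because the score depends only on characters, "display" and "screen" come out as very dissimilar here even though they mean nearly the same thing, so this fits typo matching better than the semantic similarity asked about in the question.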
Have a look at the following libraries:
https://simlibrary.wordpress.com/
http://sourceforge.net/projects/semantics/
http://radimrehurek.com/gensim/
Check out word2vec as implemented in the Gensim library. One of its features is to compute word similarity.
https://radimrehurek.com/gensim/models/word2vec.html
More details and demos can be found here.
I believe this is the state of the art right now.
As another user suggested, the Gensim library can do this using the word2vec technique. The example below is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download word vectors pre-trained on Google News and load them:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. From your examples:
word_vectors.similarity('display', 'color')
# 0.3068566
word_vectors.similarity('display', 'screen')
# 0.32314363
Compare the resulting numbers and you will see that the words display and screen are more similar than display and color, though only slightly (about 0.32 versus 0.31).
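The similarity method returns the cosine similarity between the two word vectors. A minimal sketch of that calculation, using invented 3-dimensional vectors in place of the real 300-dimensional Google News ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors for illustration only; real embeddings are learned from text.
toy_vectors = {
    "display": [0.9, 0.1, 0.3],
    "screen": [0.8, 0.2, 0.4],
    "color": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["display"], toy_vectors["screen"]))
print(cosine_similarity(toy_vectors["display"], toy_vectors["color"]))
```

Cosine similarity ranges from -1 to 1 and is not a percentage, so it is usually more useful to rank candidate pairs against each other than to expect values near 100% or 0%.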
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
1. Import a list of known n-grams into the tokenizer
2. Process each string into a tokenized document, removing punctuation, stopwords, etc, while respecting the known n-grams during tokenization (ie, "HOV lane" should be a single noun token)
3. Identify the most common bi- and tri-grams in the corpus that I missed
4. Re-tokenize using the found n-grams
5. Split by rating (>=4 and <=3)
6. Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo! There are state-of-the-art results for your problem.
It's called zero-shot learning:
state-of-the-art NLP models for text classification without annotated data.
For code and details, read this blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you or if you need any other help.
The VADER tool works well for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense in these use cases.
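The corpus-level n-gram step of the proposed workflow can be sketched with the standard library alone. This is a minimal sketch, assuming whitespace tokenization and no stopword handling; the feedback records, the min_count threshold and the function names are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(documents, n, min_count=2):
    """Count n-grams across the whole corpus, keeping those seen min_count times or more."""
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


# Invented (rating, text) records standing in for the real feedback.
feedback = [
    (5, "the carpool lane saved me time"),
    (4, "carpool lane was fast today"),
    (2, "detour time was too long"),
    (1, "too long and out of my way"),
    (3, "the detour time was out of my way"),
]

# Split by rating first, then count within each split.
positive = [text for rating, text in feedback if rating >= 4]
negative = [text for rating, text in feedback if rating <= 3]

top_positive_bigrams = count_ngrams(positive, 2).most_common(3)
top_negative_trigrams = count_ngrams(negative, 3).most_common(3)
print(top_positive_bigrams)
print(top_negative_trigrams)
```

Raw counts favour frequent filler phrases, so on 138k records you would probably want stopword filtering or a collocation score on top of this. Gensim's Phrases model is one existing tool for detecting collocations across a whole corpus.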
Using spaCy would be a good decision: its rule-based matching engines and components not only help you find the terms and phrases you are searching for but, unlike regular expressions, also give you access to the tokens inside a text and their relationships.
I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word embedding model) and average them to get the document vector. I end up having a dictionary (say, my_dict) that maps each document in my collection to its vector.
I want to feed this dictionary to gensim and for each document, get other documents in 'my_dict' that are closer to it. How could I do that?
You might want to consider rephrasing your question (from the title, you are looking for word similarity; from the description, I gather you want document similarity) and adding a little more detail in the description. Without more detailed info about what you want and what you have tried, it is difficult to help you achieve what you want, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without knowing what you want gensim to do. gensim is quite powerful and offers lots of different functionality.
Assuming your dictionary is already in gensim format, you can load it like this:
from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')
There, now you can use it with gensim and run analyses and models to your heart's content. For similarities between words, you can play around with ready-made methods such as most_similar on a trained word2vec model's vectors (for example, model.wv.most_similar('word_one')).
For document similarity with a trained LDA model, see this stackoverflow question.
For a more detailed explanation, see this gensim tutorial, which uses cosine similarity as a measure of similarity between documents.
gensim also has a bunch of ready-made functionality that does not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim. I would recommend looking at the documentation and examples.
Also, to avoid a bunch of pitfalls: is there a specific reason to average the vectors yourself (or to average them at all)? You do not need to do this (gensim has a few more sophisticated methods that map documents to vectors for you, like models.doc2vec), and you might lose valuable information.
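If you do keep the averaged vectors, the lookup described in the question (for each document, find the closest other documents in my_dict) can be sketched in plain Python. This is a minimal sketch and not gensim's API: the document ids and 3-dimensional vectors are invented, and the function names are my own.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def most_similar_docs(doc_id, doc_vectors, topn=2):
    """Return the topn (other_id, cosine) pairs closest to doc_id, best first."""
    unit = {key: normalize(vec) for key, vec in doc_vectors.items()}
    query = unit[doc_id]
    # Score every other document against the query (brute-force scan).
    scored = (
        (sum(q * x for q, x in zip(query, vec)), other)
        for other, vec in unit.items()
        if other != doc_id
    )
    return [(other, score) for score, other in heapq.nlargest(topn, scored)]


# Invented document vectors standing in for the averaged word embeddings.
my_dict = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}

print(most_similar_docs("doc_a", my_dict, topn=2))
```

This compares the query against every other document, which is fine for a modest collection. For large collections you would normally stack the vectors into a matrix and use a vectorised or indexed search instead.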
I wonder how similarity search works in gensim. How are the different shards created, and does sharding improve performance when looking only for the top-N similar documents? More generally, is there documentation about the internal structures of gensim?
The documentation of the internals of gensim is the full source code:
https://github.com/RaRe-Technologies/gensim
With high-dimensional data like this, finding the exact top-N most-similar vectors generally requires an exhaustive search against all candidates. That is, there's no simple sharding that allows most vectors to be ignored as too-far away and still gives precise results.
There are approximate indexing techniques, like ANNOY, that can speed searches... but they tend to miss some of the true top-N results. Gensim includes a demo notebook of using ANNOY-indexing with gensim's word2vec support. (It should be possible to do something similar with other text-vectors, like the bag-of-words representations in the tutorial you link.)
When I use word2vec.word2vec(train="corpus.txt"), how does it parse words out of the file?
Could somebody give me an example or related resources? Thanks in advance.
There are several resources on how to do this. One possible way to use the word2vec technique with gensim is here or on git.
The main idea behind word2vec is that it lets you handle words as vectors, which is very convenient for computation.
Suppose you have a text with many words. If you build a model using only those words, you may get misleading results later, because their positions in the multi-dimensional space will be poorly estimated. If you instead use vectors from existing pre-trained word2vec models, such as Google's, the words will be better distributed in that space.
Once you have a model, you can easily calculate similarity and so on to extract meaning from the text. That part is application logic and depends on your intention.
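On the parsing question itself, here is a minimal sketch of the usual convention for word2vec training files: plain text, one sentence per line, tokens separated by whitespace. This is the layout gensim's LineSentence reader expects. Whether the word2vec.word2vec(train=...) call in the question behaves exactly this way is an assumption, and the function names and sample text below are invented.

```python
import os
import tempfile


def iter_sentences(path):
    """Yield each non-empty line of the file as a list of whitespace-separated tokens."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()  # splits on any run of spaces, tabs or newlines
            if tokens:
                yield tokens


def demo():
    """Write a tiny sample corpus to a temporary file and parse it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "corpus.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write("the quick brown fox\n\njumps over   the lazy dog\n")
        return list(iter_sentences(path))


print(demo())
```

Under this convention, punctuation and casing are not handled for you: "Fox," and "fox" would be different tokens, so any cleaning has to happen before the file is written.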
I am wondering if there is a way to use NLP (specifically the nltk module in python) to find similarities between the subjects within sentences. The problem is that the texts refer back to subjects within a separate sentence, and don't specifically refer to them by name (E.g. www.legaltips.org/Alabama/alabama_code/2-2-30.aspx). Any ideas or experience with this would be super helpful.
The short answer to your question is yes. :)
It sounds like the problem you are trying to solve is what we call anaphora or co-reference resolution in NLP - although that only refers to tracking the same referent through different sentences. You can try getting started here: http://nlp.stanford.edu/software/dcoref.shtml
If you simply want to find similarities, that is a different problem entirely. You should let people know what kind of similarities you are talking about (semantic, syntactic, etc.), and then you can get an answer (if that is your problem).