I am working on natural language processing in German, in which I need to categorize words according to their meaning. E.g. 'Communication', 'Social skills' and 'Interpersonal skills' belong to 'Communication skills', and so forth.
Basically, the words need to be sorted based on how similar their meaning is to a given set of standard words.
I have tried Levenshtein distance, edit distance and an open-source fuzzy string matching technique, but the results are not satisfying.
The best results come from using longest common subsequence on the list of words, but I want to match the words based on their underlying meaning.
What you are looking for is "semantic similarity". One possible option is to use spaCy or another NLP framework. You would want to explore word2vec algorithms to help with your task.
Semantic Similarity with Spacy
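A minimal sketch of the idea, assuming word vectors are already available. The three-dimensional vectors and the `categorize` helper below are made up for illustration; in practice the vectors would come from a trained model (for example spaCy's German vectors or a word2vec model), and the category names are only examples.

```python
import math

# Toy 3-dimensional "embeddings": hypothetical values, not from a real model.
VECTORS = {
    "communication skills": [0.9, 0.1, 0.0],
    "programming skills": [0.0, 0.2, 0.9],
    "social skills": [0.8, 0.3, 0.1],
    "interpersonal skills": [0.85, 0.2, 0.05],
    "python": [0.1, 0.1, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def categorize(word, categories):
    """Return the category whose vector is closest to the word's vector."""
    return max(categories, key=lambda c: cosine(VECTORS[word], VECTORS[c]))

CATEGORIES = ["communication skills", "programming skills"]
for word in ["social skills", "interpersonal skills", "python"]:
    print(word, "->", categorize(word, CATEGORIES))
```

Unlike edit distance, this compares meaning rather than spelling, so 'Social skills' can land next to 'Communication skills' even though the strings share few characters.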
I am trying to find an effective way to measure similarity between a short phrase and a large corpus. For example, suppose my corpus is the book Moby Dick. This book has tens of thousands of words.
In addition to that, I have a few short phrases. For example:
phrase1 = "Call me Ishmael" # This is the first sentence in the book exactly.
phrase2 = "Call me Isabel" # Like the previous one, with a few letters changed in the third word.
phrase3 = "Call me Is mael" # A similar sentence, but with one word split in two.
In addition, I have of course many other sentences that are not similar to sentences from the book.
I wonder what a generic and effective way is to identify input sentences that have a similar sentence in the book.
What have I tried to do that seems less appropriate to me?
I split all the input sentences into 3/4/5/6 n-grams.
I split all the corpus sentences into 3/4/5/6 n-grams.
Then I tried to find an approximate match (with FuzzySet) between all possible combinations of corpus n-grams and input n-grams. (The combinations are required to catch even cases where words have been split or merged.)
It is quite clear to me that this method is very wasteful and probably also not the most accurate for my needs. I would love to understand how best to do it.
You can use corpus-based spell correction followed by FuzzySet. For spell correction, you can use a Python implementation of the SymSpell algorithm. Here you can find the list of repositories implementing SymSpell. Use SymSpell with the compounding configuration.
Use a speller trained on the corpus to spell-correct the short sentences.
Use fuzzyset/fuzzywuzzy to find a similarity score between the spell-corrected sentence and each sentence in the corpus.
Define a threshold by experiment; if the similarity is above the threshold, call it a match.
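The scoring and threshold steps above can be sketched as follows. This is only an illustration: it uses the standard library's `difflib.SequenceMatcher` in place of fuzzyset/fuzzywuzzy, it leaves out the SymSpell correction step, and the two corpus sentences, the helper names and the threshold of 0.8 are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Stand-in for the sentences of the book (assumed, for illustration only).
CORPUS_SENTENCES = [
    "Call me Ishmael.",
    "It is a way I have of driving off the spleen.",
]

def best_match(phrase, sentences):
    """Return (score, sentence) for the corpus sentence most similar to phrase."""
    def score(sentence):
        return SequenceMatcher(None, phrase.lower(), sentence.lower()).ratio()
    best = max(sentences, key=score)
    return score(best), best

def is_match(phrase, sentences, threshold=0.8):
    """Call it a match if the best similarity score reaches the threshold."""
    score, _ = best_match(phrase, sentences)
    return score >= threshold

for phrase in ["Call me Ishmael", "Call me Isabel", "Call me Is mael"]:
    print(phrase, best_match(phrase, CORPUS_SENTENCES))
```

Comparing whole sentences character by character copes with split or merged words without enumerating n-gram combinations, although scanning every corpus sentence for every phrase can still be slow on a large book.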
I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.
Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.
Data: The data is in German and contains a lot of technical jargon.
My text corpus is in German and I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided to use/not use the following preprocessing steps:
no stopword removal
no lemmatization
replace all expressions with numbers by NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
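The NUMBER and RARE replacements in the list above could look roughly like this. It is a sketch under assumptions: the text is already tokenized on whitespace, a "number" is any token made of digits with optional `.` or `,` separators, and "rare" means fewer than `min_count` occurrences. The helper names and example sentences are made up.

```python
import re
from collections import Counter

# Assumed definition of a number token: digits with optional . or , separators.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def replace_numbers(tokens):
    """Replace purely numeric tokens by the placeholder NUMBER."""
    return ["NUMBER" if NUMBER_RE.fullmatch(t) else t for t in tokens]

def replace_rare(sentences, min_count=2):
    """Replace tokens seen fewer than min_count times in the corpus by RARE."""
    counts = Counter(t for s in sentences for t in s)
    return [[t if counts[t] >= min_count else "RARE" for t in s] for s in sentences]

sentences = [
    "Der Druck beträgt 3,5 bar".split(),
    "Der Druck sinkt auf 2 bar".split(),
]
numbered = [replace_numbers(s) for s in sentences]
processed = replace_rare(numbered, min_count=2)
print(processed)
```

Numbers are replaced first so that individual values do not each count as a rare word.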
However, I am not sure whether to convert the corpus to lowercase. When searching the web I found different opinions. Although lowercasing is quite common, it will cause my model to wrongly predict the capitalization of nouns, sentence beginnings etc.
I also found the idea to convert only the words at the beginning of a sentence to lower-case on the following Stanford page.
What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?
Thanks a lot for any suggestions and experiences!
I think for your particular use case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context. You probably won't need to predict sentence beginnings in your use case, and if a noun is predicted you can capitalize it later. Consider the alternative (assuming your corpus is in English): your model might treat a capitalized word at the beginning of a sentence differently from the same word appearing later in the sentence without a capital letter, which might reduce accuracy. So I think lowercasing is the better trade-off. I did a project on a question-answering system, and converting the text to lowercase was a good trade-off there.
Edit: Since your corpus is in German, it would be better to retain the capitalization, since it is an important aspect of the German language.
If it is of any help, spaCy supports German. You can use it to train your model.
In general, tRuEcasIng helps.
Truecasing is the process of restoring case information to badly-cased or noncased text.
See
How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
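To illustrate the idea, here is a small frequency-based sketch: for every word it learns the casing seen most often in a training corpus and applies that casing to lowercased text. It is an assumption-laden toy, not a description of how the tools linked above work internally; the function names and training sentences are invented.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Learn the most frequent surface form of each word from tokenized sentences."""
    forms = defaultdict(Counter)
    for tokens in sentences:
        # Skip the first token: it is capitalized regardless of word class.
        for token in tokens[1:]:
            forms[token.lower()][token] += 1
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}

def truecase(tokens, model):
    """Restore the learned casing; unknown tokens are left unchanged."""
    return [model.get(t.lower(), t) for t in tokens]

training = [
    "Der Bericht beschreibt die Messung".split(),
    "Heute beginnt die Messung".split(),
    "Wir lesen den Bericht".split(),
]
model = train_truecaser(training)
restored = truecase("die messung steht im bericht".split(), model)
print(restored)
```

With a model like this, the language model could be trained on lowercased text and the capitalization of nouns restored after prediction; capitalizing the first word of each sentence would remain a separate step.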
Definitely convert the majority of the words to lowercase, but consider the following cases:
Acronyms, e.g. MIT: if you lowercase it to mit, which is a word in German, you'll be in trouble
Initials e.g. J. A. Snow
Enumerations e.g. (I),(II),(III),APPENDIX A
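One simple heuristic covering the cases above is to lowercase a token only if its letters are not all uppercase. This is a rough sketch with a hypothetical helper, not a complete solution: it also keeps all-caps headings as they are, and it lowercases surnames such as Snow.

```python
def selective_lower(token):
    """Lowercase a token unless all of its letters are uppercase."""
    letters = "".join(ch for ch in token if ch.isalpha())
    if letters and letters.isupper():
        return token  # acronym, initial or enumeration marker: keep as is
    return token.lower()

tokens = ["Das", "MIT", "und", "J.", "A.", "Snow", "(II)", "APPENDIX", "A"]
lowered = [selective_lower(t) for t in tokens]
print(lowered)
```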
I would also advise against the <RARE> token: what percentage of your corpus is <RARE>, and what about unknown words?
Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further.
Thus some sort of lemmatization and tokenization is needed.
I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you Matthew and Ines).
Given the sentence "the quick brown fox jumped over the lazy dog", I would like to get a score of how frequent each word is from an NLTK corpus (whichever corpus is most generic/comprehensive).
EDIT:
This question relates to another question, "python nltk keyword extraction from sentence", where #adi92 suggested using idf to calculate the 'rareness' of a word. I would like to see what this would look like in practice. The broader problem here is: how do you calculate the rareness of a word's use in the English language? I appreciate that this is a hard problem to solve, but nonetheless NLTK idf (with something like the Brown or Reuters corpus?) might get us part of the way there.
If you want to know word frequencies you need a table of word frequencies. Words have different frequencies depending on text genre, so the best frequency table might be based on a domain-specific corpus.
If you're just messing around, it's easy enough to pick a corpus at random and count the words: use <corpus>.words() and NLTK's FreqDist, and/or see the NLTK book for details.
But for serious use, don't bother counting words yourself: if you're not interested in a specific domain, grab a large word frequency table. There are gazillions out there (it's evidently the first thing a corpus creator thinks of), and the largest is probably the "1-gram" tables compiled by Google. You can download them at http://books.google.com/ngrams/datasets
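For the "just messing around" case, counting words and computing idf can be sketched as below. To stay self-contained, the example uses `collections.Counter` and three tiny made-up documents rather than an NLTK corpus; the `idf` helper is hypothetical, and real scores would depend entirely on the corpus or frequency table chosen.

```python
import math
from collections import Counter

# Toy corpus of three tokenized "documents" (assumed, for illustration only).
documents = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the dog slept".split(),
    "the fox ran".split(),
]

# Relative frequency of each word across the whole toy corpus.
counts = Counter(word for doc in documents for word in doc)
total = sum(counts.values())
rel_freq = {word: n / total for word, n in counts.items()}

def idf(word, docs):
    """Inverse document frequency: log(N / number of documents containing word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing) if containing else float("inf")

sentence = "the quick brown fox jumped over the lazy dog".split()
scores = {word: idf(word, documents) for word in sentence}
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(word, round(score, 3))
```

A higher idf means the word occurs in fewer documents, which is one rough proxy for 'rareness'.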
I'm trying to write a program to evaluate semantic similarity between texts. I have already compared n-gram frequencies between texts (a lexical measure). I wanted something a bit less shallow than this, and I figured that looking at similarity in sentence construction would be one way to evaluate text similarity.
However, all I can figure out how to do is to count the POS tags (for example, 4 nouns per text, 2 verbs, etc.). This is then similar to just counting n-grams (and actually works less well than the n-grams).
from collections import Counter
import nltk
postags = nltk.pos_tag(tokens)
self.pos_freq_dist = Counter(tag for word, tag in postags)
for pos, freq in self.pos_freq_dist.items():
    # normalise POS frequency by token count (float division on Python 2 and 3)
    self.pos_freq_dist_relative[pos] = freq / float(self.token_count)
Lots of people (Pearsons, ETS Research, IBM, academics, etc.) use Parts-of-Speech for deeper measures, but no one says how they have done it. How can Parts-of-Speech be used for a 'deeper' measure of semantic text similarity?
A more sophisticated tagger is required such as http://phpir.com/part-of-speech-tagging/.
You will need to write algorithms and create word banks to determine the meaning or intention of sentences. Semantic analysis is artificial intelligence.
Nouns and capitalized nouns will be the subjects of the content. Adjectives will give some hint as to the polarity of the content. You can also look at vagueness, clarity, power, weakness and the types of words used. The possibilities are endless.
Take a look at chapter 6 of the NLTK Book. It should give you plenty of ideas for features you can use to classify text.
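As one possible feature that goes beyond raw tag counts, you could compare sequences of tags, which capture a little of how sentences are constructed. The sketch below is only a suggestion under assumptions, not something the answers above prescribe: the input is already tagged in the (word, tag) format that nltk.pos_tag returns, the tags are hand-assigned Penn Treebank tags, and the helper names are invented.

```python
import math
from collections import Counter

def pos_bigrams(tagged):
    """Count adjacent tag pairs in a list of (word, tag) tuples."""
    tags = [tag for _, tag in tagged]
    return Counter(zip(tags, tags[1:]))

def cosine(c1, c2):
    """Cosine similarity between two Counters treated as sparse vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hand-tagged toy texts (assumed tags, for illustration only).
text_a = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]
text_b = [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
text_c = [("run", "VB"), ("quickly", "RB")]

print(cosine(pos_bigrams(text_a), pos_bigrams(text_b)))
print(cosine(pos_bigrams(text_a), pos_bigrams(text_c)))
```

Texts a and b share no words but have the same construction, so their tag bigrams coincide; this measures structural rather than semantic similarity and would need to be combined with lexical features.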
How can you determine the semantic similarity between two texts in python using WordNet?
The obvious preprocessing would be removing stop words and stemming, but then what?
The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400 word) texts, that are natural language documents, with words that are not in any particular order or structure (other than those imposed by English grammar). So, which words would you compare between texts? How would you do this in python?
One thing that you can do is:
Kill the stop words
Find as many words as possible that have maximal intersections of synonyms and antonyms with those of other words in the same doc. Let's call these "the important words"
Check to see if the set of the important words of each document is the same. The closer they are together, the more semantically similar your documents.
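A heavily simplified sketch of the first and last steps: remove stop words, then measure how much the remaining word sets overlap using Jaccard similarity. It skips the synonym/antonym step entirely (that would need WordNet lookups), so every non-stop word counts as "important"; the stop-word list and helper names are assumptions.

```python
# Tiny stop-word list, assumed for the example; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and", "in"}

def content_words(text):
    """Lowercase, split on whitespace and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

doc1 = "The cat is sleeping on the sofa"
doc2 = "A cat is sleeping in the garden"
similarity = jaccard(content_words(doc1), content_words(doc2))
print(similarity)
```

Here the two documents share "cat" and "sleeping" out of four distinct content words, giving 0.5; identical word sets would give 1.0.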
There is another way. Compute sentence trees out of the sentences in each doc. Then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for class. So the code is extremely hacky, to say the least).
Hope this helps