python nltk keyword extraction from sentence - python

"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*Note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags).

I don't think there's any perfect answer to this question because there isn't a gold set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to very precisely and completely unambiguously describe exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence along with their POS tags, you answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe trying to find the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
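As a minimal sketch of that heuristic (not the answerer's code): assume your corpus is available as a list of token lists, here called tokenized_docs, and you have the POS-tagged output from the question; the helper names compute_idf and rarest_by_pos are made up for illustration.

import math
from collections import Counter

def compute_idf(tokenized_docs):
    # map every word to log(N / document frequency): rarer words score higher
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(word.lower() for word in doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def rarest_by_pos(pos_list, idf, prefixes=("NN", "VB")):
    # keep the highest-idf token for each coarse POS class (nouns, verbs)
    best = {}
    for token, tag in pos_list:
        for prefix in prefixes:
            if tag.startswith(prefix):
                score = idf.get(token.lower(), 0.0)
                if prefix not in best or score > best[prefix][1]:
                    best[prefix] = (token, score)
    return [token for token, _ in best.values()]

idf = compute_idf(tokenized_docs)  # tokenized_docs: your corpus as lists of tokens (assumed)
pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(rarest_by_pos(pos_list, idf))  # with a reasonable corpus: ['lawyers', 'kill'] or similar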
Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (using, say, the Stanford parser) and identifying patterns within those trees that tell you where the important parts of a sentence tend to sit, etc.

One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
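For instance, applied to the POS list from the question (a quick illustration, not part of the original answer), this keeps exactly the two keywords:

pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(filter_stop_words(pos_list))
# [['kill', 'VB'], ['lawyers', 'NNS']]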

In your case, you can simply use the RAKE package for Python (thanks to Fabian) to get what you need:
>>> import RAKE
>>> path = "your_stoplist_path"  # your path to a stop word list file
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
The path can be, for example, this file.
But in general, you are better off using the NLTK package for NLP tasks.

Related

python word decomposition into subwords: e.g. motorbike -> motor, bike

I have a list of words like [bike, motorbike, copyright].
Now I want to check if the word consists of subwords which are also standalone words. That means the output of my algorithm should be something like: [bike, motor, motorbike, copy, right, copyright].
I already know how to check if a word is an English word:
import enchant

english_words = []
arr = ["bike", "motorbike", "copyright", "apfel"]
d_brit = enchant.Dict("en_GB")
for word in arr:
    if d_brit.check(word):
        english_words.append(word)
I also found an algorithm which decomposes the word in all possible ways: Splitting a word into all possible 'subwords' - All possible combinations
Unfortunately, splitting the word like this and then checking whether each piece is an English word simply takes too long, because my dataset is far too large.
Can anyone help?
The nested for loops used in the code are extremely slow in Python. As performance seems to be the main issue, I would recommend looking for available Python packages to do parts of the job, building your own extension module (e.g. using Cython), or not using Python at all.
Some alternatives to splitting the word like this:
Search for dictionary words that match the start of str; if such a word is a prefix of str, check whether the rest is also a word in the dataset (see the sketch below).
Split str into two portions whose lengths make sense given the length distribution of the dataset, i.e. the most common word lengths, then search for matches with basic comparison (just a wild idea).
These are a few quick ideas for faster algorithms I can think of. But if these are not quick enough, then BernieD is right.
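A minimal sketch of the first idea, assuming your dictionary is already loaded into a Python set called word_set (the helper name split_into_subwords is made up for illustration). Set membership checks are O(1), so each word needs at most len(word) - 1 lookups instead of testing every possible split against the spell checker:

word_set = {"bike", "motor", "motorbike", "copy", "right", "copyright"}  # your dictionary words

def split_into_subwords(word):
    # return the first split of word into two dictionary words, or None
    for i in range(1, len(word)):
        prefix, rest = word[:i], word[i:]
        if prefix in word_set and rest in word_set:
            return prefix, rest
    return None

found = set()
for word in ["bike", "motorbike", "copyright"]:
    found.add(word)
    parts = split_into_subwords(word)
    if parts:
        found.update(parts)
# found == {'bike', 'motor', 'motorbike', 'copy', 'right', 'copyright'}

This only does two-way splits; for deeper decompositions you could call split_into_subwords recursively on the parts.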

Match and Group similar words that are related to each other (relevant) in a list

This is not just about grouping words by string similarity but also by meaning. Say that I have the following list:
func = ['Police Man','Police Officer','Police','Admin','Administrator','Dietitian','Food specialist','Administrative Assistant','Economist','Economy Consultant']
I want to find words with similar meaning and function. I tried fuzzywuzzy but it does not achieve what I want:
from fuzzywuzzy import fuzz

for i in func:
    for n in func:
        print(i, ":", n)
        print(fuzz.ratio(i, n))
This is part of the fuzzy-matching output, and it does not do the job:
Dietitian : Dietitian
100
Dietitian : Food specialist
25
I believe I should use the nltk library or stemming? What is the best approach to find relevant words and functions in a list?
I believe I should use ... stemming?
You definitely don't want to use stemming. Stemming will only take words to their roots, so stem("running") = "run". It doesn't do anything based on meaning, so stem("sprinting") = "sprint" != "run". :(
I believe I should use nltk ...
WordNet will let you search for sets of synonyms called "synsets" and you can access it through nltk or even through a web interface. It's not great at compound words, though. :( It's mostly just individual words.
So, you can look up "officer" and "policeman" and see that they have an overlapping meaning. Of course, "officer" also has OTHER meanings; how close do words have to be to qualify for your search? E.g. if "Food Specialist" is the same as "Dietician", is "Food Specialist" also the same as "Chef"?
If WordNet does seem like a useful tool, check out their Python API. You'd want something like
from nltk.corpus import wordnet as wn

def share_synset(word_a, word_b):
    common = [synset for synset in wn.synsets(word_a) if synset in wn.synsets(word_b)]
    return len(common) > 0
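A quick illustration of how that check might be applied to the list in the question (not part of the original answer); in WordNet, "officer" is itself listed as a lemma of the policeman synset, so that pair should come back True:

print(share_synset("officer", "policeman"))    # True: they share a synset
print(share_synset("dietician", "economist"))  # False: no overlapping synset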

categorize/ get hypernym type word using wordnet in python

In my project I have to find the category/hypernym type of a specific word.
For example, if I type Sushi/lion, the output will show food/animal. The main concept is to categorize the word. So, how can I get this using nltk and WordNet in Python?
I am unsure if your goal is achievable with an out-of-the-box solution since the abstraction level needed is quite high. In terms of nltk/wordnet, you are looking for the hypernym (supertype/superordinate) of a word. For example, the hypernym of "sushi" might be "seafood" on a first level, whereas "apple" might be just a "fruit". Probably you will have to go through several levels of hypernyms to arrive at your desired output. As a starting point to get the hypernyms, you can use this code (see All synonyms for word in python?):
from nltk.corpus import wordnet as wn
from itertools import chain

for i, j in enumerate(wn.synsets('apple')):
    print('Meaning', i, 'NLTK ID', j.name())
    print('Definition:', j.definition())
    print('Hypernyms:', ', '.join(list(chain(*[l.lemma_names() for l in j.hypernyms()]))))
Notice also that one single word can have different meanings with different hypernyms, which further complicates your task.
EDIT
Actually, there is an out-of-the-box solution to this problem called lowest_common_hypernyms:
wn.synset('apple.n.01').lowest_common_hypernyms(wn.synset('sushi.n.01'))
While this function is pretty nice, it does not necessarily return the most obvious solution. Here, it returns [Synset('matter.n.03')].
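If you do need a coarse category such as food or animal, one option is to walk up the hypernym chain yourself and stop at whichever level suits you. A minimal sketch, assuming you only look at each word's first noun synset and that the category names you care about ("food", "animal") actually appear somewhere on its hypernym path; the function name coarse_category is made up for illustration:

from nltk.corpus import wordnet as wn

def coarse_category(word, categories=("food", "animal")):
    # return the first category name found on the hypernym path of the word's first noun synset
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return None
    # hypernym_paths() lists paths from the root ('entity') down to the synset itself
    for path in synsets[0].hypernym_paths():
        for synset in path:
            for lemma in synset.lemma_names():
                if lemma in categories:
                    return lemma
    return None

print(coarse_category("sushi"))  # expected: 'food'
print(coarse_category("lion"))   # expected: 'animal'

As noted above, words with several meanings will need some disambiguation before this gives sensible results.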

Performing Stemming outputs gibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My Problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

stemmer = stem.PorterStemmer()  # the original snippet never defined 'stemmer'; Porter matches the output below
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))

# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
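For instance, NLTK's WordNet lemmatizer could be used roughly like this (a quick sketch, not part of the original answer). Note that it maps words to WordNet base forms, so you get "forget" rather than "forgot", and it needs a part-of-speech hint to handle verbs properly:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    # pos="v" treats the word as a verb; the default pos is noun
    print(lemmatizer.lemmatize(word, pos="v"))
# typically prints: forget, forget, there's, myself, remuneration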

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? I'm not asking for a completed answer, just how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute the relative probabilities of certain words being next to each other, getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words (see the sketch below).
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
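To make the statistical approach a bit more concrete, here is a minimal sketch of a unigram-only version (pairs, triples and the automaton are left out). It assumes you have word counts from some corpus in a dict called word_freq; the toy counts below are invented purely for illustration. Dynamic programming picks the segmentation whose words have the highest combined log-probability:

import math

word_freq = {"like": 500, "we": 900, "said": 400, "we'll": 80,
             "do": 700, "what": 650, "can": 600}  # assumed: counts from a large corpus
total = sum(word_freq.values())

def word_logprob(word):
    # unseen words get a heavy penalty instead of probability zero
    return math.log(word_freq.get(word, 0.5) / total)

def segment(text, max_word_len=15):
    # best[i] = (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_word_len), i):
            score, words = best[j]
            piece = text[j:i]
            candidates.append((score + word_logprob(piece), words + [piece]))
        best.append(max(candidates))
    return best[-1][1]

print(segment("likewesaid"))        # e.g. ['like', 'we', 'said']
print(segment("we'lldowhatwecan"))  # e.g. ["we'll", 'do', 'what', 'we', 'can']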
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave... It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged type of solution.
textstring = '"likewesaid, we\'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan\'tdoit-alone. Yougottaworktoo."'

indiv_characters = list(textstring)  # splits the string into individual characters

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against an English dictionary here, e.g. via an API
    # that returns True/False for whether it exists as an entry
    in_english_dict = is_english_word(teststring)  # placeholder for that lookup
    if in_english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# at the end, just assemble a sentence from the pieces of sequential_indiv_word_list
# by putting a space between each word
There are some more issues to be worked out. For example, if the growing string never finds a match, this would obviously not work, as it would just keep adding more characters; however, since your demo string has some spaces, you could have it recognise those too and automatically start over at each of them.
Also, you need to account for punctuation, with conditionals like
if cur_char == ',' or cur_char == '.':
    # do action to start a new "word" automatically