iterate through nltk dictionaries - python

I'd like to know whether it's possible to iterate through some of the available nltk dictionaries, e.g. a Spanish dictionary. I'd like to find certain words matching some requirements.
Let's say I got this list ["tv", "tb", "tp", "dv", "db", "dp"], the algorithm would give me words like ["tapa", "tubo", "tuba", ...]. As you can see, if you get rid of the vowels in those words they'll be in the initial list:
tapa => tp
tubo => tb
tuba => tb
Anyway, I just want to know whether it's possible to iterate through Spanish words in the nltk dictionaries, and how; that's pretty much it.

The nltk has plenty of Spanish language resources, but I'm not aware of a dictionary. So I'll leave the choice of wordlist up to you, and go on from there.
In general, the nltk represents wordlists as corpus readers with the usual method words() for the individual words. So here's how you could find words matching your template in the English wordlist:
import nltk

templates = set(["tv", "tb", "tp", "dv", "db", "dp"])
for w in nltk.corpus.words.words("en"):
    # strip the vowels and check whether what is left is in `templates`
    if "".join(c for c in w.lower() if c not in "aeiou") in templates:
        print(w)
I notice there's a Spanish stopwords list; here's how you would iterate over it:
for w in nltk.corpus.stopwords.words("spanish"):
...
You could also create your own "wordlist" from a Spanish-language corpus. I used the scare quotes because the best data structure for this purpose is a set. In python, iterating over a set or dict will give you its keys:
mywords = set(w.lower() for w in nltk.corpus.conll2002.words("esp.train"))
for w in mywords:
...
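Putting the two ideas together, here is a minimal sketch (assuming the conll2002 corpus has been downloaded via nltk.download) that applies the vowel-stripping check to a Spanish wordlist built from that corpus:
spanish_vowel_filter.py:
import nltk

templates = {"tv", "tb", "tp", "dv", "db", "dp"}

def skeleton(word):
    # drop the vowels (including accented ones) to get the consonant "template"
    return "".join(c for c in word.lower() if c not in "aeiouáéíóúü")

spanish_words = set(w.lower() for w in nltk.corpus.conll2002.words("esp.train"))
matches = sorted(w for w in spanish_words if skeleton(w) in templates)
print(matches)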

Related

parsing emails to identify keywords

I'm looking to parse through a list of email text to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check to see if words from a keywords list are in any of these sentences in the list, using regex. I wouldn't want 'informations' to be captured, only 'information'.
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
sentence.split(' ')
Ultimately, I would like to filter down the current list to only the elements that contain the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any(word in sentence[0] for word in keywords), sentences))
print(results_lambda)
[['more information in this one']]
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']
result = [x for x in lists if any(k in x[0] for k in keywords)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all of the keywords are present in it:
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    # the intersection of both sets should equal the keyword set
    return len(set(sentence.split(" ")) & keyword_set) == n

# filtered is the final list of sentences which satisfy your condition
filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]

# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
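One way to read that suggestion, as a rough sketch (the helper name and Counter-based bookkeeping are mine, not from the original answer): require each keyword to occur in the sentence at least as many times as it appears in the keyword list.
from collections import Counter

keyword_counts = Counter(keywords)   # e.g. duplicates in `keywords` are preserved as counts

def allKeywdsPresentWithCounts(sentence):
    # each keyword must occur at least as often as it appears in the keyword list
    word_counts = Counter(sentence.split(" "))
    return all(word_counts[k] >= c for k, c in keyword_counts.items())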
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() [maybe rename it to anyKeywdsPresent]:
def allKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to just use the regex \W (non-word character) class and split your sentence on such occurrences.
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work? Simple: we iterate over each of the sentences (lowercasing them for case-insensitivity), then split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing fast comparisons (a set is a hash-based container and intersections based on hashes are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we do the set intersection against our keywords - if anything is returned, the two sets have at least one word in common, which means that the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace split only), it's far from perfect and not really suitable for all languages. If you're serious about word processing take a look at some of the NLP modules available for Python, such as the nltk.
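For instance, here is a minimal sketch of the same intersection check using nltk.word_tokenize instead of the \W+ split; this assumes nltk and its 'punkt' tokenizer data are installed, and reuses the example data from above:
import nltk  # requires: nltk.download('punkt')

keywords = {'information', 'boxes', 'porcupine'}
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]

# tokenize each sentence properly, then intersect with the keyword set
result = [s for s in sentences
          if set(nltk.word_tokenize(s[0].lower())) & keywords]
# [['more information in this one']]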

Performance - Fastest way to compare 2 large lists of strings in Python

I have two Python lists, one of which contains about 13000 disallowed phrases, and one which contains about 10000 sentences.
phrases = [
"phrase1",
"phrase2",
"phrase with spaces",
# ...
]
sentences = [
"sentence",
"some sentences are longer",
"some sentences can be really really ... really long, about 1000 characters.",
# ...
]
I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list; if it does, I want to put ** around the phrase and add the modified sentence to another list. I also need to do this in the fastest possible way.
This is what I have so far:
import re

newlist = []
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)
So far this approach takes about 60 seconds to complete.
I tried using multiprocessing (each sentence's for loop was mapped separately) however this yielded even slower results. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about separating the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.
I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.
So essentially, what would be the fastest possible way to perform this check?
Build your regex once, sorting by longest phrase first so you wrap the **s around the longest matching phrases rather than the shortest, perform the substitution, and filter out those where no substitution was made, e.g.:
phrases = [
"phrase1",
"phrase2",
"phrase with spaces",
'can be really really',
'characters',
'some sentences'
# ...
]
sentences = [
"sentence",
"some sentences are longer",
"some sentences can be really really ... really long, about 1000 characters.",
# ...
]
import re

# Build the regex string required
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Build list of paired new sentences and old to filter out where not the same
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]
This gives you a result of:
['**some sentences** are longer',
'**some sentences** **can be really really** ... really long, about 1000 **characters**.']
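Since the question is about speed, it may also help to compile the pattern once instead of passing the pattern string to re.sub on every call; the sketch below (reusing the phrases and sentences lists from above) also adds re.IGNORECASE, which restores the case-insensitive matching from the question's original loop:
import re

# compile once, longest phrases first, case-insensitive
pattern = re.compile(
    '({})'.format('|'.join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))),
    re.IGNORECASE)

# keep only sentences where a substitution actually happened
results = [new for old, new in ((s, pattern.sub(r'**\1**', s)) for s in sentences)
           if old != new]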
What about a set comprehension?
found = {'**' + p + '**' for s in sentences for p in phrases if p in s}
You could try updating (by reduction) the phrases list, if you don't mind altering it:
found = []
p = phrases[:]  # shallow copy for modification
for s in sentences:
    for i in range(len(phrases)):
        phrase = phrases[i]
        if phrase in s:
            p.remove(phrase)
            found.append('**' + phrase + '**')
    phrases = p[:]
Basically, each iteration reduces the phrases container: we iterate through the current container and check each phrase against the sentence.
When a phrase matches, we remove it from the copied list; then, once we have checked all the current phrases, we update the container with the reduced subset (those that haven't been seen yet). We do this because we only need to see each phrase once, so checking it again (even though it may occur in another sentence) is unnecessary.

How to un-stem a word in Python?

I want to know if there is any way I can un-stem them to a normal form.
The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these (eat, eaten, ate, eating, etc.) will count towards eat, and hence I used stemming.
But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate Wu-Palmer Similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least in this code they won't: Check if two words are related to each other
How should I do it? Is there a way to un-stem a word?
I think an OK approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.
A possible implementation could be something like this:
import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()
A stemmer to use. Here is a text to work with:
complete_text = ''' cats catlike catty cat
stemmer stemming stemmed stem
fishing fished fisher fish
argue argued argues arguing argus argu
argument arguments argument '''
Create a list with the different words:
my_list = []
try:
    aux = complete_text.decode().split()   # Python 2 strings
except AttributeError:
    aux = complete_text.split()            # Python 3 strings have no decode()
for i in aux:
    if i not in my_list:
        my_list.append(i.lower())
my_list
with output:
['cats',
'catlike',
'catty',
'cat',
'stemmer',
'stemming',
'stemmed',
'stem',
'fishing',
'fished',
'fisher',
'fish',
'argue',
'argued',
'argues',
'arguing',
'argus',
'argu',
'argument',
'arguments']
And now create the dictionary:
aux = pd.DataFrame(my_list, columns =['word'] )
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict
The output is:
{'argu': 'argue, argued, argues, arguing, argus, argu',
'argument': 'argument, arguments',
'cat': 'cats, cat',
'catlik': 'catlike',
'catti': 'catty',
'fish': 'fishing, fished, fish',
'fisher': 'fisher',
'stem': 'stemming, stemmed, stem',
'stemmer': 'stemmer'}
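For example, to recover the original forms of a newly seen inflection, stem it and look it up in the mapping (the values below come from the dictionary printed above):
stemmer.stem('arguing')   # 'argu'
my_dict['argu']           # 'argue, argued, argues, arguing, argus, argu'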
Companion notebook here.
No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.
You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA
On this page of the project, there are explanations of how to do the Inverse Stemming. To sum things up, it works as follows.
First, you will stem some documents, here short (French language) strings with their stop words removed for example:
['sup chat march trottoir',
'sup chat aiment ronron',
'chat ronron',
'sup chien aboi',
'deux sup chien',
'combien chien train aboi']
Then the trick is to have kept, for each stemmed word, a count of the original words that produced it:
{'aboi': {'aboie': 1, 'aboyer': 1},
'aiment': {'aiment': 1},
'chat': {'chat': 1, 'chats': 2},
'chien': {'chien': 1, 'chiens': 2},
'combien': {'Combien': 1},
'deux': {'Deux': 1},
'march': {'marche': 1},
'ronron': {'ronronner': 1, 'ronrons': 1},
'sup': {'super': 4},
'train': {'train': 1},
'trottoir': {'trottoir': 1}}
Finally, you can guess how to implement this yourself: simply take the original word with the highest count for each stemmed word. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py
Improvements could be made by ditching the non-top reverse words (by using a heap for example) which would yield just one dict in the end instead of a dict of dicts.
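As a rough sketch of that idea (this is not the project's code; the stemmer choice, word list, and variable names are mine for illustration), using NLTK's Snowball stemmer:
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["eat", "eats", "eaten", "eating", "ate", "argue", "argued", "argues"]

# for every stem, count how often each original surface form occurred
counts = defaultdict(Counter)
for w in words:
    counts[stemmer.stem(w)][w] += 1

# "un-stem" by mapping each stem back to its most frequent surface form
unstem = {stem: forms.most_common(1)[0][0] for stem, forms in counts.items()}
print(unstem)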
I suspect what you really mean by stem is "tense". As in you want the different tense of each word to each count towards the "base form" of the verb.
check out the pattern package
pip install pattern
Then use the en.lemma function to return a verb's base form.
import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
Theoretically, the only way to un-stem is if, prior to stemming, you kept a dictionary of terms or a mapping of some kind and carried this mapping through the rest of your computations. The mapping should capture the position of each unstemmed token, so that when you need to un-stem a token, knowing the original position of its stemmed form lets you trace back and recover the original unstemmed representation.
For the Bag of Words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.
But again, theoretically I believe it could work. I haven't seen it in any implementation, though.

is it possible to create an if statement dynamically using list elements and OR?

I'm trying to adapt the code I wrote below to work with a dynamic list of required values rather than with a string, as it works at present:
required_word = "duck"
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
["Hello", "duck"]]
sentences_not_containing_required_words = []
for sentence in sentences:
    if required_word not in sentence:
        sentences_not_containing_required_words.append(sentence)
print sentences_not_containing_required_words
Say for example I had two required words (only one of which is actually required), I could do this:
required_words = ["dog", "fox"]
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
["Hello", "duck"]]
sentences_not_containing_required_words = []
for sentence in sentences:
    if (required_words[0] not in sentence) or (required_words[1] not in sentence):
        sentences_not_containing_required_words.append(sentence)
print sentences_not_containing_required_words
>>> [['Hello', 'duck']]
However, what I need is for someone to steer me towards a method of dealing with a list that will vary in size (number of items), where the if statement is satisfied if any of the list's items are not in the list named 'sentence'. Being quite new to Python, I'm stumped and don't know how better to phrase the question. Do I need to come up with a different approach?
Thanks in advance!
(Note that the real code will do something more complicated than printing sentences_not_containing_required_words.)
You can construct this list pretty easily with a combination of a list comprehension and the any() built-in function:
non_matches = [s for s in sentences if not any(w in s for w in required_words)]
This will iterate over the list sentences while constructing a new list, and only include sentences where none of the words from required_words are present.
If you are going to end up with longer lists of sentences, you may consider using a generator expression instead to minimize memory footprint:
non_matches = (s for s in sentences if not any(w in s for w in required_words))
for s in non_matches:
    # do stuff
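For instance, with the example data from the question, the comprehension above keeps only the sentence containing none of the required words:
required_words = ["dog", "fox"]
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
             ["Hello", "duck"]]

non_matches = [s for s in sentences if not any(w in s for w in required_words)]
print(non_matches)  # [['Hello', 'duck']]
Note that not any(...) keeps sentences in which none of the required words appear; if you instead want sentences missing at least one required word (which is what the chained or of not in checks in the question does), use not all(w in s for w in required_words).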

All synonyms for word in python? [duplicate]

This question already has answers here:
How to get synonyms from nltk WordNet Python
(8 answers)
Closed 7 years ago.
The code to get the synonyms of a word in Python is, say:
from nltk.corpus import wordnet
dog = wordnet.synset('dog.n.01')
print dog.lemma_names
>>['dog', 'domestic_dog', 'Canis_familiaris']
However dog.n.02 gives different words. For an arbitrary word I can't know how many senses there may be. How can I return all of the synonyms for a word?
Using wn.synset('dog.n.1').lemma_names is the correct way to access the synonyms of a sense. It's because a word has many senses and it's more appropriate to list synonyms of a particular meaning/sense. To enumerate words with similar meanings, possibly you can also look at the hyponyms.
Sadly, the size of Wordnet is very limited, so there are very few lemma_names available for each sense.
Using Wordnet as a dictionary/thesaurus is not very apt per se, because it was developed as an inventory of senses/meanings rather than an inventory of words. However, you can access a particular sense and several (not a lot of) words related to that sense. One can use Wordnet as a:
Dictionary: given a word, what are the different meanings of the word?
from nltk.corpus import wordnet as wn

for i, j in enumerate(wn.synsets('dog')):
    print "Meaning", i, "NLTK ID:", j.name
    print "Definition:", j.definition
Thesaurus: given a word, what are the different words for each meaning of the word?
for i, j in enumerate(wn.synsets('dog')):
    print "Meaning", i, "NLTK ID:", j.name
    print "Definition:", j.definition
    print "Synonyms:", ", ".join(j.lemma_names)
    print
Ontology: given a word, what are the hyponyms (i.e. sub-types) and hypernyms (i.e. super-types).
from itertools import chain

for i, j in enumerate(wn.synsets('dog')):
    print "Meaning", i, "NLTK ID:", j.name
    print "Hypernyms:", ", ".join(list(chain(*[l.lemma_names for l in j.hypernyms()])))
    print "Hyponyms:", ", ".join(list(chain(*[l.lemma_names for l in j.hyponyms()])))
    print
[Ontology Output]
Meaning 0 NLTK ID: dog.n.01
Hypernyms: domestic_animal, domesticated_animal, canine, canid
Hyponyms: puppy, Great_Pyrenees, basenji, Newfoundland, Newfoundland_dog, lapdog, poodle, poodle_dog, Leonberg, toy_dog, toy, spitz, pooch, doggie, doggy, barker, bow-wow, cur, mongrel, mutt, Mexican_hairless, hunting_dog, working_dog, dalmatian, coach_dog, carriage_dog, pug, pug-dog, corgi, Welsh_corgi, griffon, Brussels_griffon, Belgian_griffon
Meaning 1 NLTK ID: frump.n.01
Hypernyms: unpleasant_woman, disagreeable_woman
Hyponyms:
Meaning 2 NLTK ID: dog.n.03
Hypernyms: chap, fellow, feller, fella, lad, gent, blighter, cuss, bloke
Hyponyms:
Meaning 3 NLTK ID: cad.n.01
Hypernyms: villain, scoundrel
Hyponyms: perisher
Note this other answer:
>>> wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
Keep in mind that in your code you were trying to get the lemmas, but that's one level too deep for what you want. The synset level is about meaning, while the lemma level gives you words. In other words:
In WordNet (and I’m speaking of English WordNet here, though I think
the ones in other languages are similarly organized) a lemma has
senses. Specifically, a lemma (that is, a base word form that is
indexed in WordNet) has exactly as many senses as the number of
synsets that it participates in. Conversely, and as you say, synsets
contain one or more lemmas, which means that multiple lemmas (words)
can represent the same sense, or meaning.
Also have a look at the NLTK's WordNet HOWTO for a few more ways of exploring around a meaning or a word.
The documentation suggests
wordnet.synsets('dog')
to get all synsets for dog.
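Putting that together, here is a minimal sketch that collects the lemma names of every synset of a word into one set; note that in recent NLTK versions lemma_names() is a method, rather than a property as in the Python 2 examples above:
from nltk.corpus import wordnet as wn

def all_synonyms(word):
    # union of the lemma names of every sense (synset) of the word
    return {lemma for synset in wn.synsets(word) for lemma in synset.lemma_names()}

print(all_synonyms('dog'))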
