parsing emails to identify keywords

parsing emails to identify keywords - python

I'm looking to parse through a list of email text to identify keywords. lets say I have this following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check to see if words from a keywords list are in any of these sentences in the list, using regex. I wouldn't want informations to be captured, only information
keywords = ['information', 'boxes', 'porcupine']
was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
sentence.split(' ')
ultimately would like to filter down current list to elements that only have the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]

Here is a one-liner to solve your problem. I find using lambda syntax is easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))
print(results_lambda)
[['more information in this one']]

This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
filter = ['filter', 'one']
result = list(set([x for s in filter for x in lists if s in x[0]]))
print(result)
result:
[['let us filter!'], ['more than one word filter'], ['here is one sentence']]
hope this helps!

Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all words are present in the list:
One way is:
keyword_set = set(keywords)
n = len(keyword_set) # number of keywords
def allKeywdsPresent(sentence):
return len(set(sentence.split(" ")) & keyword_set) == n # the intersection of both sets should equal the keyword set
filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence)]
# filtered is the final set of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() [Maybe rename if to anyKeywdsPresent]:
def allKeywdsPresent(sentence):
return any(word in keyword_set for word in sentence.split())

If you want to match only whole words and not just substrings you'll have to account for all word separators (whitespace, puctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof way is to just use the regex \W (non-word character) classifier and split your sentence on such occurences..
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'} # note we're using a set here!
WORD = re.compile(r"\W+") # a simple regex to split sentences into words
# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing fast comparisons (set is a hash sequence and intersections based on hashes are extremely fast) and, as a bonus, this also gets rid duplicate words.
Fnally, we do the set intersection against our keywords - if anything is returned these two sets have at least one word in common, which means that the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace split only), it's far from perfect and not really suitable for all languages. If you're serious about word processing take a look at some of the NLP modules available for Python, such as the nltk.

Related

python to search separated words

I am trying to extract separated multi words from a python list with two different list as a query string. My sentences list is
lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ','and I take the regions down here','The poorest are down']
lst_verb = ['take','go','wake']
lst_prep = ['down','up','in']
import re
output=[]
item = 'down'
p = re.compile(r'(?:\w+\s+){1,20}'+item)
for i in lst:
output.append(p.findall(i))
for item in output:
print(item)
with this i am able to extract word from the list, However I am only want to extract separated multiwords, i.e it should extract the word from the list "and I take the regions down here".
furthermore, I want to use the word from lst_verb and lst_prep as query string.
for example
re.findall(r \lst_verb+'*.\b'+ \lst_prep)
Thank you for your answer.

You can use regex like
(?is)^(?=.*\b(take)\b)(?=.*?\b(go)\b)(?=.*\b(where)\b)(?=.*\b(wake)\b).*
To match Multiple words
like this your example
use functions to create regex string from the verbs and prep.
hope this helps

Performance - Fastest way to compare 2 large lists of strings in Python

I have to Python lists, one of which contains about 13000 disallowed phrases, and one which contains about 10000 sentences.
phrases = [
"phrase1",
"phrase2",
"phrase with spaces",
# ...
]
sentences = [
"sentence",
"some sentences are longer",
"some sentences can be really really ... really long, about 1000 characters.",
# ...
]
I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list, if it does I want to put ** around the phrase and add it to another list. I also need to do this in the fastest possible way.
This is what I have so far:
import re
for sentence in sentences:
for phrase in phrases:
if phrase in sentence.lower():
iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
newsentence = iphrase.sub("**"+phrase+"**", sentence)
newlist.append(newsentence)
So far this approach takes about 60 seconds to complete.
I tried using multiprocessing (each sentence's for loop was mapped separately) however this yielded even slower results. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about separating the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.
I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.
So essentially, what would be the fastest possible way to perform this check?

Build your regex once, sorting by longest phrase so you encompass the **s around the longest matching phrases rather than the shortest, perform the substitution and filter out those that have no substitution made, eg:
phrases = [
"phrase1",
"phrase2",
"phrase with spaces",
'can be really really',
'characters',
'some sentences'
# ...
]
sentences = [
"sentence",
"some sentences are longer",
"some sentences can be really really ... really long, about 1000 characters.",
# ...
]
# Build the regex string required
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Build list of paired new sentences and old to filter out where not the same
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]
Gives you a results of:
['**some sentences** are longer',
'**some sentences** **can be really really** ... really long, about 1000 **characters**.']

What about set comprehension?
found = {'**' + p + '**' for s in sentences for p in phrases if p in s}
You could try update (by reduction) the phrases list if you don't mind altering it:
found = []
p = phrases[:] # shallow copy for modification
for s in sentences:
for i in range(len(phrases)):
phrase = phrases[i]
if phrase in s:
p.remove(phrase)
found.append('**'+ phrase + '**')
phrases = p[:]
Basically each iteration reduces the phrases container. We iterate through the latest container until we find a phrase that is in at least one sentence.
We remove it from the copied list then once we checked the latest phrases, we update the container with the reduced subset of phrases (those that haven't been seen yet). We do this since we only need to see a phrase at least once, so checking again (although it may exist in another sentence) is unnecessary.

In Python, how to check if words in a string are keys in a dictionary?

For a class I am talking the twitter sentiment analysis problem. I have looked at the other questions on the site and they don't help for my particular issue.
I am given a string that is one tweet with its letters changed so that they are all in lowercase. For example,
'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
as well as a dictionary of words where the key is the word and the value is the value for the sentiment for that word. To be more specific, a key can be a single word (such as 'hello'), more than one word separated by a space (such as 'yellow hornet'), or a hyphenated compound word (such as '2-dimensional'), or a number (such as '365').
I need to find the sentiment of the tweet by adding the sentiments for every eligible word and dividing by the number of eligible words (by eligible word, I mean word that is in the dictionary). I'm not sure what's the best way to go about checking if a tweet has a word in the dictionary.
I tried using the "key in string" convention with looping through all the keys, but this was problematic because there are a lot of keys and word-in-words would be counted (e.g. eradicate counts cat, ate, era, etc. as well)
I then tried using .split(' ') and looping through the elements of the resultant list but I ran into problems because of punctuation and keys which are two words.
Anyone have any ideas on how I can more suitably tackle this?
For example: using the example above, still : -0.625, love : 0.625, every other word is not in the dictionary. so this should return (-0.625 + 0.625)/2 = 0.

The whole point of dictionaries is that they are quick at looking things up:
for word in instring.split():
if wordsdict.has_key(word):
print word
You would probably do better at getting rid of punctuation, etc, (thank-you Soke), by using regular expressions rather than split, e.g.
for word in re.findall(r'[\w]', instring):
if wordsdict.get(word) is not None:
print word
Of course you will have to have some maximum length of word groupings, possibly generated with a single run through of the dictionary and then take your pairs, triples, etc. and also check them.

you can use nltk its very powerfull what you want to do, it can be done by split too:
>>> import string
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625}
>>> words = nltk.word_tokenize(a)
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(', '#', 'tel', 'aviv', 'kosher', 'pizza', ')', 'http', ':', '//t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
using split:
>>> words = a.split()
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(#', 'tel', 'aviv', 'kosher', 'pizza)', 'http://t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
my_dict.get(key,default), so get will return value if key is found in dictionary else it will return default. In this case '0'
check this example: you asked for place
>>> import string
>>> my_dict = {'still' : -0.625, 'love' : 0.625,'place':1}
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> words = nltk.word_tokenize(a)
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.5

going by length of the dictionary key might be one solution.
For example, you have the dict as:
Sentimentdict = {"habit":5, "bad habit":-1}
the sentence might be:
s1="He has good habit"
s2="He has bad habit"
s1 should be getting good sentiment compare to s2. Now, you can do this:
for w in sorted(Sentimentdict.keys(), key=lambda x: len(x)):
if w in s1:
remove the word and do your sentiment calculation

How to find collocations in text, python

How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
python has built-in func bigrams that returns word pairs.
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur more often based on the frequency of individual words. Any ideas how to put it in the code?

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
... for sent in nltk.sent_tokenize(sentences.lower()):
... for word in nltk.word_tokenize(sent):
... yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list

Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
from itertools import izip
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)
count = {}
for bigram in izip(words, words_iter):
count[bigram] = count.get(bigram, 0) + 1
print sorted(((c, b) for b, c in count.iteritems()), reverse=True)
(words_iter is introduced to avoid copying the whole list of words as you would do in izip(words, words[1:])

import itertools
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)
freq=Counter(zip(words,nextword))
print(freq)

A collocation is a sequence of tokens that are better treated as a single token when parsing e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?

Agree with Tim McNamara on using nltk and problems with the unicode. However, I like the text class a lot - there is a hack that you can use to get the collocations as list , i discovered it looking at the source code . Apparently whenever you invoke the collocations method it saves it as a class variable!
import nltk
def tokenize(sentences):
for sent in nltk.sent_tokenize(sentences.lower()):
for word in nltk.word_tokenize(sent):
yield word
text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
enjoy !

Project Gutenberg Python problem?

I am trying to process various texts by regex and NLTK of python -which is at http://www.nltk.org/book-. I am trying to create a random text generator and I am having a hard time with a problem. First, here is my algorithm:
Enter a sentence as input -this is called trigger string-
Get longest word in trigger string
Search all Project Gutenberg database for sentences that contain this word -regardless of uppercase lowercase-
Return the longest sentence that has the word I spoke about in step 3
Append the sentence in Step 1 and Step4 together
Repeat the process. Note that I have to get the longest word in second sentence and continue like that and so on-
So far I have been able to do this for first two sentences but I cannot perform a case insensitive search. Entire sentence database of Project Gutenberg is available via gutenberg.sents() function but regex - case insensitive search is practically impossible since the gutenberg.sents() outputs the sentences in books as following -in a list of list format-:
EXAMPLE: all the sentences of shakespeare's macbeth is called by typing
import nltk
from nltk.corpus import gutenberg
gutenberg.sents('shakespeare-macbeth.txt')
into the python shell command line and output is:
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
['Actus', 'Primus', '.'], .......]
with [The Tragedie of Macbeth by William Shakespare, 1603] and Actus Primus. being the first two sentences.
How can I find the word I'm looking for regardless of it being uppercase/lowercase ? I'm desperately in need of help since I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.

Given a list L of words, and a target word t,
any(t.lower()==w.lower() for w in L)
tells you whether L has word t in a case-insensitive way. It's faster, of course, to do
lt = t.lower()
any(lt==w.lower() for w in L)
since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.
Given a list of lists lol, the longest sub-list including t can be found by
longest = max((L for L in lol if any(lt==w.lower() for w in L)), key=len)
If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.

How about using the built-in function: str.lower()¶
Return a copy of the string converted to lowercase.
Then just compare the strings.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.