Checking superset of list in given order - python

I have a list of tuples in format (float,string) sorted in descending order.
print sent_scores
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.'),
(0.078586381821416265,'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
(0.072031446647399661, '- Emergency personnel help victims.')]
If two entries in the list share four identical words in a row, I want to remove the tuple with the lesser score. The new list should also preserve order.
The output for the example above would be:
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.')]
This will certainly first involve tokenizing the words, which can be done with the code below:
from nltk.tokenize import TreebankWordTokenizer

def tokenize_words(text):
    tokens = TreebankWordTokenizer().tokenize(text)
    contractions = ["n't", "'ll", "'m", "'s"]
    fix = []
    for i in range(len(tokens)):
        for c in contractions:
            if tokens[i] == c: fix.append(i)
    fix_offset = 0
    for fix_id in fix:
        idx = fix_id - 1 - fix_offset
        tokens[idx] = tokens[idx] + tokens[idx + 1]
        del tokens[idx + 1]
        fix_offset += 1
    return tokens

tokenized_sents = [tokenize_words(sentence) for score, sentence in sent_scores]
I earlier tried converting the words of each sentence into groups of 4 held in a set and then using issuperset against the other sentences, but that doesn't check contiguity.

I suggest taking sequences of 4 tokens in a row from your tokenized list and making a set of those 4-token tuples. By using Python's itertools module, this can be done rather elegantly:
import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
print zip(i1, i2, i3, i4)
Output of the above code (nicely formatted for you):
[('The', 'quick', 'brown', 'fox'),
('quick', 'brown', 'fox', 'jumps'),
('brown', 'fox', 'jumps', 'over'),
('fox', 'jumps', 'over', 'the'),
('jumps', 'over', 'the', 'lazy'),
('over', 'the', 'lazy', 'dog')]
Actually, even more elegant would be:
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)
Same output as before.
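Since my_list here is an in-memory list rather than a lazy iterable, plain slicing gives the same result; here's a Python 3 sketch of that variant:

```python
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# zip() stops at the shortest slice, so this yields every run of 4 tokens.
ngrams = list(zip(*(my_list[i:] for i in range(4))))
print(ngrams)
```

Note that slicing copies the list four times, which islice avoids on large inputs.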
Now that you have your list of four consecutive tokens (as 4-tuples) for each list, you can stick those tuples in a set and check whether the same 4-tuple appears in two different sets:
my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))
other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))
print set1.intersection(set2)  # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"
third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))
print set1.intersection(set3)  # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"
NOTE: If you're using Python 3, just replace all the print "Something" statements with print("Something"): in Python 3, print became a function rather than a statement. But if you're using NLTK, I suspect you're using Python 2.
IMPORTANT NOTE: Any itertools.islice objects you create will iterate through their original list once, and then become "exhausted" (they've returned all their data, so putting them in a second for loop will produce nothing, and the for loop just won't do anything). If you want to iterate through the same list multiple times, create multiple iterators (as you see I did in my examples).
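To see the exhaustion behavior concretely, here is a minimal Python 3 demonstration:

```python
import itertools

it = itertools.islice(['a', 'b', 'c'], 0, None)
print(list(it))  # first pass consumes the iterator: ['a', 'b', 'c']
print(list(it))  # second pass finds it exhausted: []
```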
Update: Here's how to eliminate the lesser-scoring words. First, replace this line:
tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]
with:
tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]
Now what you have is a list of (score, sentence) tuples, where sentence is a token list. Then we'll construct a list called scores_and_sentences_and_sets: a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is a set of four-word tuples like in the example above):
scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]
That one-liner may be a bit too clever, actually, so let's unpack it to be a bit more readable:
scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)
Go ahead and experiment with those two code snippets, and you'll find that they do exactly the same thing.
Okay, so now we have a list of (score, sentence, set_of_four_word_groups) tuples. We'll go through the list in order and build up a result list consisting of ONLY the sentences we want to keep. Since the list is already sorted in descending order, things are a little easier: at any point we only have to compare the current item against the items already "accepted". If any accepted item overlaps with the current one, we don't even need to look at the scores, because the accepted item came earlier in the list and therefore must have the higher score.
So here's some code that should do what you want:
accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)

print accepted_items  # Prints a whole bunch of tuples
sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only  # Prints just the sentences
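Putting the whole answer together, here is a self-contained Python 3 sketch of the dedup step. It is a sketch under assumptions: it substitutes a plain str.split() for the NLTK tokenizer (so punctuation handling differs), and the function names are mine:

```python
import itertools

def four_grams(tokens):
    # Set of every run of 4 consecutive tokens.
    return set(zip(*[itertools.islice(tokens, x, None) for x in range(4)]))

def dedupe(sent_scores):
    # sent_scores is assumed to be sorted by score, descending.
    accepted = []  # list of (score, sentence, grams)
    for score, sentence in sent_scores:
        grams = four_grams(sentence.split())
        if not any(grams & seen for _, _, seen in accepted):
            accepted.append((score, sentence, grams))
    return [(score, sentence) for score, sentence, _ in accepted]

sent_scores = [
    (0.105, 'Deadly stampede in Shanghai - Emergency personnel help victims.'),
    (0.078, 'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
    (0.072, '- Emergency personnel help victims.'),
]
print(dedupe(sent_scores))  # keeps only the highest-scoring sentence
```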

Related

How to make index read entire list?

I'm working on a project for my Python course and I'm still pretty new to coding in general. I'm having issues with one of the snippets of my code. I'm trying to make Python find every instance of the word "the" (or any input word, really; it doesn't matter) and return the word immediately after it. I am able to make it return the word after "the", but it stops after one instance when I need it to scan the entire list.
Here is my code:
the_list = ['the']
animal_list = ['the', 'cat', 'the', 'dog', 'the', 'axolotl']
for the_list in animal_list:
    nextword = animal_list[animal_list.index("the") + 1]
    continue
print(nextword)
All I'm returning is cat whereas dog and axolotl should pop up as well. I tried using a for loop and a continue in order to make the code go through the same process for dog and axolotl, but it didn't work.
I am not clear on what you are asking for, but I think what you want is to get the animals that are in animal_list; assuming that the word 'the' is at the even indices, you can use this:
animals = [animal for animal in animal_list if animal != 'the']
Since you are a beginner: the previous code uses a list comprehension, which is a Pythonic way to build a list without an explicit for loop. The equivalent code using a for loop is:
animals = []
for animal in animal_list:
    if animal != 'the':
        animals.append(animal)
index only gets the first instance.
The typical pythonic way is to use a list comprehension:
[animal_list[i+1] for i,val in enumerate(animal_list) if val=='the']
list.index will only find the first occurrence, however you can specify a start and stop value to skip over other indexes.
Now we also need to use a try/except block because list.index will raise a ValueError in the case that it doesn't find a match.
animal_list = ['the', 'cat', 'the', 'dog', 'the', 'axolotl']
match = 'the'
i = 0
while True:
    try:
        i = animal_list.index(match, i) + 1  # start search at index i
    except ValueError:
        break
    # can remove this check if certain that your list won't end with 'the'
    # otherwise could raise IndexError
    if i < len(animal_list):
        print(animal_list[i])
However, in case you don't have to use list.index, I would suggest the following instead. (Again, you can remove the check if the list won't end with 'the'.)
for i, item in enumerate(animal_list):
    if item == match and i + 1 < len(animal_list):
        print(animal_list[i + 1])
Or, more compactly, use a list comprehension, which will output a list of all the items after 'the'.
animals = [animal_list[i + 1] for i, v in enumerate(animal_list)
           if v == match and i + 1 < len(animal_list)]
print(animals)
Note: The use of continue is not correct. continue is used when you want to end the current iteration of the loop and move on to the next. For example
for i in range(5):
    print(i)
    if i == 2:
        continue
    print(i)

# Output
0
0
1
1
2    # Notice only '2' is printed once
3
3
4
4
One approach is to zip the list to a shifted version of itself:
keyword = 'the'
animal_list = ['the', 'cat', 'the', 'dog', 'the', 'axolotl']
zipped = zip(animal_list, animal_list[1:])
# zipped contains [('the', 'cat'), ('cat', 'the'), ('the', 'dog'), etc.]
found_words = [after for before, after in zipped if before == keyword]
This will deal with a list that ends in 'the' without raising an error (the final 'the' will simply be ignored).
the_word = 'the'
animal_list = ['the', 'cat', 'the', 'dog', 'the', 'axolotl']
# Iterate through animal_list by index, so it is easy to get the next element when we find the_word
for i in range(len(animal_list) - 1):
    if animal_list[i] == the_word:  # if the current word == the word we want to find
        print(animal_list[i + 1])   # print the next word
We don't want to check the last element in animal_list; that is why I subtract 1 from the length of animal_list. That way i will take the values 0, 1, 2, 3, 4.
Try this:
the_list = ['the']
animal_list = ['the', 'cat', 'the', 'dog', 'the', 'axolotl']
for i in range(len(animal_list)):
    if animal_list[i] in the_list:
        nextword = animal_list[i + 1]
        print(nextword)
This is a very un-Pythonic way of doing this... but perhaps it'll help you understand indexes:
animal_list = ['the', 'cat', 'the', 'dog', 'the', 'axolotl']
index = 0
for x in animal_list:
    if x == "the":
        print(animal_list[index + 1])
    index += 1

Python :: Intersecting lists of strings gone wrong

I am trying to intersect lists of sentences divided into strings:
from itertools import chain
from nltk.corpus import gutenberg

user = ['The', 'Macbeth', 'Tragedie']  # this list
plays = []
hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')
plays.append(hamlet)
plays.append(macbeth)
plays.append(caesar)
shakespeare = list(chain.from_iterable(plays))  # with this list
'shakespeare' prints as follows:
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ['Scoena', 'Prima', '.'], ['Enter', 'Barnardo', 'and', 'Francisco', 'two', 'Centinels', '.']...['FINIS', '.'], ['THE', 'TRAGEDIE', 'OF', 'IVLIVS', 'CaeSAR', '.']]
bestCount = 0
for sent in shakespeare:
    currentCount = len(set(user).intersection(sent))
    if currentCount > bestCount:
        bestCount = currentCount
        answer = ' '.join(sent)
return ''.join(answer).lower(), bestCount
The return value, however, shows the intersection has gone wrong; that is, a sentence from "hamlet" matches even though I'm searching for "Macbeth":
('the tragedie of hamlet , prince of denmarke .', 3)
Where is the bug?
Doesn't sound like you should be using sets here. The most obvious problem is that you care about the number of occurrences of a word in a sentence (which starts life as a list), and by converting to a set you collapse all repeated words down to one occurrence, losing that information.
I would suggest rather converting each sentence's members into lowercase, like so:
mapped = map(str.lower, sentence) # may want list(map(...)) if on Py3
Initialize a dict of counts like this:
In [6]: counts = {word.lower(): 0 for word in user}
In [7]: counts
Out[7]: {'macbeth': 0, 'the': 0, 'tragedie': 0}
Then as you loop over the sentences, you can do something like this:
In [8]: for word in counts:
   ...:     counts[word] = max(counts[word], mapped.count(word))
   ...:
In [9]: counts
Out[9]: {'macbeth': 0, 'the': 1, 'tragedie': 1}
I just used one example sentence, but you get the idea. At the end you'll have the maximum number of times the user's word appeared in a sentence. You can make the data structure a little more complex or use an if statement test if you want to also keep the sentence it occurred the most times in.
Good luck!
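To make the suggestion concrete, here is a runnable Python 3 sketch of the counting idea. The sentence data is stand-in (loading the Gutenberg corpus is out of scope here), and the variable names are mine:

```python
user = ['The', 'Macbeth', 'Tragedie']

# Stand-in token lists; in the question these come from gutenberg.sents(...).
sentences = [
    ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'],
    ['The', 'Tragedie', 'of', 'Macbeth', '.'],
    ['Enter', 'Barnardo', 'and', 'Francisco', '.'],
]

targets = {word.lower() for word in user}
best_sentence, best_count = None, -1
for sent in sentences:
    mapped = [w.lower() for w in sent]              # lowercase so 'The' matches 'the'
    count = sum(mapped.count(t) for t in targets)   # counting keeps repeats, unlike a set
    if count > best_count:
        best_sentence, best_count = sent, count
print(' '.join(best_sentence).lower(), best_count)
```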

How to eliminate duplicate list entries in Python while preserving case-sensitivity?

I'm looking for a way to remove duplicate entries from a Python list, but with a twist: the final list has to be case-sensitive, with a preference for uppercase words.
For example, between cup and Cup I only need to keep Cup, not cup. Unlike other common solutions which suggest using lower() first, I'd prefer to maintain each string's case; in particular I'd prefer keeping the one with the uppercase letter over the lowercase one.
Again, I am trying to turn this list:
['Hello', 'hello', 'world', 'world', 'poland', 'Poland']
into this:
['Hello', 'world', 'Poland']
How should I do that?
Thanks in advance.
This does not preserve the order of words, but it does produce a list of "unique" words with a preference for capitalized ones.
In [34]: words = ['Hello', 'hello', 'world', 'world', 'poland', 'Poland', ]
In [35]: wordset = set(words)
In [36]: [item for item in wordset if item.istitle() or item.title() not in wordset]
Out[36]: ['world', 'Poland', 'Hello']
If you wish to preserve the order as they appear in words, then you could use a collections.OrderedDict:
In [43]: wordset = collections.OrderedDict()
In [44]: wordset = collections.OrderedDict.fromkeys(words)
In [46]: [item for item in wordset if item.istitle() or item.title() not in wordset]
Out[46]: ['Hello', 'world', 'Poland']
Using set to track seen words:
def uniq(words):
    seen = set()
    for word in words:
        l = word.lower()  # Use `word.casefold()` if possible. (3.3+)
        if l in seen:
            continue
        seen.add(l)
        yield word
Usage:
>>> list(uniq(['Hello', 'hello', 'world', 'world', 'Poland', 'poland']))
['Hello', 'world', 'Poland']
UPDATE
The previous version does not take care of the preference for uppercase over lowercase. In the updated version I use min, as @TheSoundDefense did.
import collections

def uniq(words):
    seen = collections.OrderedDict()  # Use {} if the order is not important.
    for word in words:
        l = word.lower()  # Use `word.casefold()` if possible (3.3+)
        seen[l] = min(word, seen.get(l, word))
    return seen.values()
Since an uppercase letter is "smaller" than a lowercase letter in a comparison, I think you can do this:
orig_list = ["Hello", "hello", "world", "world", "Poland", "poland"]
unique_list = []
for word in orig_list:
    for i in range(len(unique_list)):
        if unique_list[i].lower() == word.lower():
            unique_list[i] = min(word, unique_list[i])
            break
    else:
        unique_list.append(word)
The min will have a preference for words with uppercase letters earlier on.
Some better answers here, but hopefully this is something simple, different and useful. This code satisfies the conditions of your test (sequential pairs of matching words) but would fail on anything more complicated, such as non-sequential pairs, non-pairs or non-strings. For anything more complicated I'd take a different approach.
p1 = ['Hello', 'hello', 'world', 'world', 'Poland', 'poland']
p2 = ['hello', 'Hello', 'world', 'world', 'Poland', 'Poland']

def pref_upper(p):
    q = []
    a = 0
    b = 1
    for x in range(len(p) / 2):
        if p[a][0].isupper() and p[b][0].isupper():
            q.append(p[a])
        if p[a][0].isupper() and p[b][0].islower():
            q.append(p[a])
        if p[a][0].islower() and p[b][0].isupper():
            q.append(p[b])
        if p[a][0].islower() and p[b][0].islower():
            q.append(p[b])
        a += 2
        b += 2
    return q

print pref_upper(p1)
print pref_upper(p2)
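On Python 3.7+ the OrderedDict in the update above can be replaced with a plain dict, since insertion order is now guaranteed; a minimal sketch (the function name is mine):

```python
def uniq_prefer_upper(words):
    seen = {}  # plain dicts preserve insertion order on Python 3.7+
    for word in words:
        key = word.casefold()
        # Uppercase letters sort before lowercase, so min() prefers 'Cup' to 'cup'.
        seen[key] = min(word, seen.get(key, word))
    return list(seen.values())

print(uniq_prefer_upper(['Hello', 'hello', 'world', 'world', 'poland', 'Poland']))
# ['Hello', 'world', 'Poland']
```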

Return items around all instances of an item in a list

Say I have a list...
['a','brown','cat','runs','another','cat','jumps','up','the','hill']
...and I want to go through that list and return all instances of a specific item, as well as the 2 items before and the 2 items after it. Exactly like this, if I am searching for 'cat':
[('a','brown','cat','runs','another'),('runs','another','cat','jumps','up')]
The order of the returned list of tuples is irrelevant. Ideally the code would handle instances where the word is the first or last in the list, and an efficient, compact piece of code would of course be better.
Thanks again everybody, I am just getting my feet wet in Python and everybody here has been a huge help!
Without error checking:
words = ['a', 'brown', 'cat', 'runs', 'another', 'cat', 'jumps', 'up', 'the', 'hill']
the_word = 'cat'
seqs = []
for i, word in enumerate(words):
    if word == the_word:
        seqs.append(tuple(words[i-2:i+3]))
print seqs  # Prints: [('a', 'brown', 'cat', 'runs', 'another'), ('runs', 'another', 'cat', 'jumps', 'up')]
A recursive solution:
def context(ls, s):
    if s not in ls: return []
    i = ls.index(s)
    return [tuple(ls[i-2:i+3])] + context(ls[i+1:], s)

ls = ['a', 'brown', 'cat', 'runs', 'another', 'cat', 'jumps', 'up', 'the', 'hill']
print context(ls, 'cat')
Gives:
[('a','brown','cat','runs','another'),('runs','another','cat','jumps','up')]
With error checking:
def grep(in_list, word):
    out_list = []
    for i, val in enumerate(in_list):
        if val == word:
            lower = i-2 if i-2 > 0 else 0
            upper = i+3 if i+3 < len(in_list) else len(in_list)
            out_list.append(tuple(in_list[lower:upper]))
    return out_list
in_list = ['a', 'brown', 'cat', 'runs', 'another', 'cat', 'jumps', 'up', 'the', 'hill']
grep(in_list, "cat")
# output: [('a', 'brown', 'cat', 'runs', 'another'), ('runs', 'another', 'cat', 'jumps', 'up')]
grep(in_list, "the")
# output: [('jumps', 'up', 'the', 'hill')]
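The same idea can be written as a single Python 3 comprehension; the max() guard handles a match near the front of the list (the function and parameter names here are mine):

```python
def around(words, target, n=2):
    # max() prevents a negative slice start when the match is within n of the front;
    # the upper bound needs no guard because slicing clamps at the end of the list.
    return [tuple(words[max(i - n, 0):i + n + 1])
            for i, w in enumerate(words) if w == target]

words = ['a', 'brown', 'cat', 'runs', 'another', 'cat', 'jumps', 'up', 'the', 'hill']
print(around(words, 'cat'))
```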

Extract words surrounding a search word

I have this script that does a word search in text. The search goes pretty good and results work as expected. What I'm trying to achieve is extract n words close to the match. For example:
The world is a small place, we should try to take care of it.
Suppose I'm looking for place and I need to extract the 3 words on the right and the 3 words on the left. In this case they would be:
left -> [is, a, small]
right -> [we, should, try]
What is the best approach to do this?
Thanks!
import re

def search(text, n):
    '''Searches for text, and retrieves the n words either side of it, which are returned separately.'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]
This allows you to specify how many words either side you want to capture. It works by constructing the regular expression dynamically. With
t = "The world is a small place, we should try to take care of it."
search(t,3)
(('is', 'a', 'small'), ('we', 'should', 'try'))
While regex would work, I think it's overkill for this problem. You're better off with a generator expression and a simple loop:
sentence = 'The world is a small place, we should try to take care of it.'.split()
indices = (i for i, word in enumerate(sentence) if word == "place")
neighbors = []
for ind in indices:
    neighbors.append(sentence[ind-3:ind] + sentence[ind+1:ind+4])
Note that if the word you're looking for appears multiple times consecutively in the sentence, this algorithm will include the consecutive occurrences as neighbors. Also, since this splits on whitespace only, punctuation stays attached, so 'place,' will not match 'place'.
For example:
In [29]: neighbors = []
In [30]: sentence = 'The world is a small place place place, we should try to take care of it.'.split()
In [31]: sentence
Out[31]:
['The',
'world',
'is',
'a',
'small',
'place',
'place',
'place,',
'we',
'should',
'try',
'to',
'take',
'care',
'of',
'it.']
In [32]: indices = [i for i,word in enumerate(sentence) if word == 'place']
In [33]: for ind in indices:
   ....:     neighbors.append(sentence[ind-3:ind]+sentence[ind+1:ind+4])
In [34]: neighbors
Out[34]:
[['is', 'a', 'small', 'place', 'place,', 'we'],
['a', 'small', 'place', 'place,', 'we', 'should']]
import re

s = 'The world is a small place, we should try to take care of it.'
m = re.search(r'((?:\w+\W+){,3})(place)\W+((?:\w+\W+){,3})', s)
if m:
    l = [x.strip().split() for x in m.groups()]
    left, right = l[0], l[2]
    print left, right
Output
['is', 'a', 'small'] ['we', 'should', 'try']
If you search for The, it yields:
[] ['world', 'is', 'a']
Handling the scenario where the search keyword appears multiple times: for example, below is an input text where the search keyword place appears 3 times.
The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible
Here is the function
import re

def extract_surround_words(text, keyword, n):
    '''
    text : input text
    keyword : the search keyword we are looking for
    n : number of words around the keyword
    '''
    # extract all the words from the text
    words = re.findall(r'\w+', text)
    # iterate through all the words
    for index, word in enumerate(words):
        # check if the search keyword matches
        if word == keyword:
            # fetch the left-side words
            left_side_words = words[index-n : index]
            # fetch the right-side words
            right_side_words = words[index+1 : index+n+1]
            print(left_side_words, right_side_words)
Calling the function
text = 'The world is a small place, we should try to take care of this small place by planting trees in every place wherever is possible'
keyword = "place"
n = 3
extract_surround_words(text, keyword, n)
output :
['is', 'a', 'small'] ['we', 'should', 'try']
['of', 'this', 'small'] ['by', 'planting', 'trees']
['trees', 'in', 'every'] ['wherever', 'is', 'possible']
Find all of the words:
import re
sentence = 'The world is a small place, we should try to take care of it.'
words = re.findall(r'\w+', sentence)
Get the index of the word that you're looking for:
index = words.index('place')
And then use slicing to find the other ones:
left = words[index - 3:index]
right = words[index + 1:index + 4]
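That last approach only finds the first occurrence, since list.index stops there. A Python 3 sketch extending it to every occurrence (the helper name is mine):

```python
import re

def neighbors(sentence, target, n=3):
    words = re.findall(r'\w+', sentence)
    # One (left, right) pair per occurrence; slices clamp at the ends of the list.
    return [(words[max(i - n, 0):i], words[i + 1:i + n + 1])
            for i, w in enumerate(words) if w == target]

sentence = 'The world is a small place, we should try to take care of it.'
print(neighbors(sentence, 'place'))
# [(['is', 'a', 'small'], ['we', 'should', 'try'])]
```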
