Remove strings containing words from list, without duplicate strings - python

I'm trying to get my code to extract sentences from a file that contain certain words. I have the code shown below:
import re

f = open('RedCircle.txt', 'r')
text = ' '.join(f.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

def finding(q):
    for item in sentences:
        if item.lower().find(q.lower()) != -1:
            list.append(item)
    for sentence in list:
        outfile.write(sentence+'\r\n')

finding('cats')
finding('apples')
finding('doggs')
But this will of course give me (in the outfile) the same sentence three times if the sentence is:
'I saw doggs and cats eating apples'
Is there a way to easily remove these duplicates, or make the code so that there will not be any duplicates in the file?

There are a few options in Python that you can leverage to remove duplicate elements (in this case, the elements are sentences):
Using set
Using itertools.groupby
Using collections.OrderedDict as an ordered set, if order is important
All you need to do is collect the results in a single collection and use one of the options above to create your own recipe for removing duplicates.
Also, instead of dumping the result to the file after each search, defer writing until all duplicates have been removed.
A Few Suggested Changes
Using Sets
Convert your function to a generator:
def finding(q):
    return (item for item in sentences
            if item.lower().find(q.lower()) != -1)
Chain the results of each search:
from itertools import chain

chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs'])
Pass the result to a set:
set(chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs']))
Using Decorators
def uniq(fn):
    uniq_elems = set()
    def handler(*args, **kwargs):
        uniq_elems.update(fn(*args, **kwargs))
        return uniq_elems
    return handler

@uniq
def finding(q):
    return (item for item in sentences
            if item.lower().find(q.lower()) != -1)
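With the decorator applied, every call folds its matches into the shared set, so the return value of the last call holds all unique sentences. A hypothetical usage, reusing the search terms from the question:
finding('cats')
finding('apples')
unique_sentences = finding('doggs')  # the accumulated set of unique matches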
If Order is Important
Change the decorator to use collections.OrderedDict:
from collections import OrderedDict

def uniq(fn):
    uniq_elems = OrderedDict()
    def handler(*args, **kwargs):
        uniq_elems.update(uniq_elems.fromkeys(fn(*args, **kwargs)))
        return uniq_elems.keys()
    return handler
Note
Refrain from giving variables the same names as Python built-ins (like naming a variable list, which shadows the built-in list type)
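Putting the set-based pieces together, a minimal end-to-end sketch might look like the following (the sample sentences and the output filename are placeholders for illustration):
from itertools import chain

sentences = ['I saw doggs and cats eating apples',
             'No animals here']  # placeholder data

def finding(q):
    return (item for item in sentences
            if item.lower().find(q.lower()) != -1)

# collect everything first, deduplicate, then write once
hits = set(chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs']))
with open('outfile.txt', 'w') as outfile:
    for sentence in hits:
        outfile.write(sentence + '\r\n')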

First, does the order matter?
Second, should duplicates appear if they're actually duplicated in the original text file?
If no to the first and yes to the second:
If you rewrite the function to take a list of search strings and iterate over that (so that it checks the current sentence against each of the words you're after), then you can break out of the loop as soon as you find a match (see the sketch at the end of this answer).
If yes to the first and yes to the second:
Before adding an item to the list, check whether it's already there. Specifically, keep a note of which sentences from the original text file you have already passed and which one you will see next; that way you don't have to check the whole list, only a single item.
A set as Abhijit suggests would work if you answer no to the first question and yes to the second.
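For example, the first case (order doesn't matter, genuine duplicates from the file are kept) could be sketched like this, reusing sentences and outfile from the question:
terms = ['cats', 'apples', 'doggs']
for item in sentences:
    for q in terms:
        if q.lower() in item.lower():
            outfile.write(item + '\r\n')
            break  # one match is enough; don't write the sentence once per term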


How to understand the flaw in my simple three part python code?

My Python exercise in 'classes' is as follows:
You have been recruited by your friend, a linguistics enthusiast, to create a utility tool that can perform analysis on a given piece of text. Complete the class "analyzedText" with the following methods:
Constructor (__init__) - This method should take the argument text, make it lowercase, and remove all punctuation. Assume only the following punctuation is used: period (.), exclamation mark (!), comma (,), and question mark (?). Assign this newly formatted text to a new attribute called fmtText.
freqAll - This method should create and return a dictionary of all unique words in the text along with the number of times they occur in the text. Each key in the dictionary should be a unique word appearing in the text and the associated value should be the number of times it occurs. Create this dictionary from the fmtText attribute.
This was my code:
class analysedText(object):
    def __init__(self, text):
        formattedText = text.replace('.',' ').replace(',',' ').replace('!',' ').replace('?',' ')
        formattedText = formattedText.lower()
        self.fmtText = formattedText

    def freqAll(self):
        wordList = self.fmtText.split(' ')
        wordDict = {}
        for word in set(wordList):
            wordDict[word] = wordList(word)
        return wordDict
I get errors on both of these and I can't seem to figure it out after a lot of little adjustments. I suspect the issue in the first part is when I try to assign a value to the newly formatted text, but I cannot think of a workable solution. As for the second part, I am at a complete loss - I was wrongly confident my answer was correct, but it failed when I ran it through the classroom's code cell to test it.
On the assumption that by 'errors' you mean a TypeError, this is caused by the line wordDict[word] = wordList(word).
wordList is a list, and by putting parentheses after it you're telling Python that you want to call that list as a function, which it cannot do.
According to your task, you instead want to find the occurrences of words in the list, which you can achieve with the .count() method. This method returns the total number of occurrences of an element in a list.
With this modification (assuming you want wordDict to be a dictionary with the word as the key and the occurrence count as the value), your freqAll function would look something like this:
def freqAll(self):
    wordList = self.fmtText.split()
    wordDict = {}
    for word in set(wordList):
        wordDict[word] = wordList.count(word)  # number of times the string word appears as an element in wordList
    return wordDict
You could also achieve this same task with collections.Counter (which means you have to import it from the standard library's collections module).
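A brief hedged sketch of that alternative (the sample text below stands in for the fmtText attribute):
from collections import Counter

fmtText = "the cat saw the dog"  # placeholder for self.fmtText
wordDict = dict(Counter(fmtText.split()))
print(wordDict)  # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}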

Find the word from the list given and replace the words so found

My question is pretty simple, but I haven't been able to find a proper solution.
Given below is my program:
given_list = ["Terms","I","want","to","remove","from","input_string"]
input_string = input("Enter String:")
if any(x in input_string for x in given_list):
#Find the detected word
#Not in bool format
a = input_string.replace(detected_word,"")
print("Some Task",a)
Here, given_list contains the terms I want to exclude from the input_string.
Now, the problem I am facing is that any() produces a bool result, but I need the word detected by any() so that I can replace it with a blank and then perform some task.
Edit: the any() function is not required at all; look for useful solutions below.
Iterate over given_list and replace each term:
for i in given_list:
    input_string = input_string.replace(i, "")
print("Some Task", input_string)
No need to detect at all:
for w in given_list:
    input_string = input_string.replace(w, "")
str.replace will not do anything if the word is not there, and the substring test needed for the detection has to scan the string anyway.
The problem with finding each word and replacing it is that Python will have to iterate over the whole string repeatedly. Another problem is that you will find substrings where you don't want to. For example, "to" is in the exclude list, so you'd end up changing "tomato" into "ma".
It seems that you want to replace whole words. Parsing is a whole new subject, but let's simplify: I'm just going to assume everything is lowercase with no punctuation, although that can be improved later. Let's use input_string.split() to iterate over whole words.
We want to replace some words with nothing, so let's just iterate over input_string and filter out the words we don't want, using the builtin function of the same name.
exclude_list = ["terms","i","want","to","remove","from","input_string"]
input_string = "one terms two i three want to remove"
keepers = filter(lambda w: w not in exclude_list, input_string.lower().split())
output_string = ' '.join(keepers)
print (output_string)
one two three
Note that we create an iterator that allows us to go through the whole input string just once. And instead of replacing words, we just basically skip the ones we don't want by having the iterator not return them.
Since filter requires a function for the boolean check on whether to include or exclude each word, we had to define one. I used lambda syntax to do that. You could replace the lambda with a named function:
def keep(word):
    return word not in exclude_list

keepers = filter(keep, input_string.split())
To answer your question about any(), use an assignment expression (Python 3.8+):
if any((word := x) in input_string for x in given_list):
    # match captured in the variable word
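A runnable version of that idea, with hypothetical sample input:
given_list = ["Terms", "I", "want", "to", "remove", "from", "input_string"]
input_string = "I want this to work"
if any((word := x) in input_string for x in given_list):
    # any() stops at the first hit, so word holds the first detected term
    print("Some Task", input_string.replace(word, ""))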

Searching for duplicates and remove them

sometimes I have a string like this
string = "Hett, Agva,"
and sometimes I will have duplicates in it.
string = "Hett, Agva, Delf, Agva, Hett,"
how can I check if my string has duplicates and then if it does remove them?
UPDATE:
So in the second string I need to remove Agva, and Hett, because each of them appears twice in the string.
Iterate over the parts (words) and add each part to a list of parts, and to a set of seen parts, if it is not already in that set. Finally, reconstruct the string:
seen = set()
parts = []
for part in string.split(','):
    if part.strip() not in seen:
        seen.add(part.strip())
        parts.append(part)
no_dups = ','.join(parts)
(note that I had to add some calls to .strip(), as there are spaces at the start of some of the words, which this method removes)
which gives:
'Hett, Agva, Delf,'
Why use a set?
Querying whether an element is in a set is O(1) on average, since set elements are stored by hash, which makes lookup constant time. Lookup in a list, on the other hand, is O(n), as Python must iterate over the list until the element is found. This makes a set much more efficient for this task: for each new word you can instantly check whether you have seen it before, whereas with a list of seen elements you'd have to iterate over it each time, which takes much longer for a large list.
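A quick, rough way to see the difference yourself (absolute timings will vary by machine):
import timeit

setup = "xs = list(range(100000)); s = set(xs)"
print(timeit.timeit("99999 in xs", setup=setup, number=1000))  # linear scan of the list
print(timeit.timeit("99999 in s", setup=setup, number=1000))   # constant-time hash lookup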
Oh, and to just check whether there are duplicates, test whether the length of the split list is the same as the length of the set of that list (building the set removes the duplicates but loses the order).
I.e.
def has_dups(string):
    parts = string.split(',')
    return len(parts) != len(set(parts))
which works as expected:
>>> has_dups('Hett, Agva,')
False
>>> has_dups('Hett, Agva, Delf, Agva, Hett,')
True
You can use toolz.unique, or equivalently the unique_everseen recipe in the itertools docs, or equivalently @JoeIddon's explicit solution.
Here's the solution using 3rd party toolz:
x = "Hett, Agva, Delf, Agva, Hett,"
from toolz import unique
res = ', '.join(filter(None, unique(x.replace(' ', '').split(','))))
print(res)
'Hett, Agva, Delf'
I've removed whitespace and used filter to clean up a trailing , which may not be required.
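If you'd rather avoid the third-party dependency, a simplified version of the unique_everseen recipe from the itertools docs looks roughly like this:
def unique_everseen(iterable, key=None):
    # Yield unique elements in order of first appearance.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element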
If you will receive the string in only this format, then you can do the following:
import numpy as np

string_words = string.split(',')
uniq_words = np.unique(string_words)
string = ""
for word in uniq_words:
    string += word + ", "
string = string[:-1]
What this code does: it splits the words into a list, finds the unique items, and then merges them back into a string like before.
If the order of the words is important, then you can make a list of the words in the string and then iterate over it to build a new list of unique words:
string = "Hett, Agva, Delf, Agva, Hett,"
words_list = string.split()
unique_words = []
for w in words_list:
    if w not in unique_words:
        unique_words.append(w)
new_string = ' '.join(unique_words)
print(new_string)
Output:
'Hett, Agva, Delf,'
Quick and easy approach:
', '.join(
    set(
        filter(None, [i.strip() for i in string.split(',')])
    )
)
Hope it helps. Please feel free to ask if anything is not clear :)

How to make python check EACH value

I am working on this function and I want to Return a list of the elements of L that end with the specified token in the order they appear in the original list.
def has_last_token(s, word):
    """ (list of str, str) -> list of str

    Return a list of the elements of L that end with the specified token in the order they appear in the original list.

    >>> has_last_token(['one,fat,black,cat', 'one,tiny,red,fish', 'two,thin,blue,fish'], 'fish')
    ['one,tiny,red,fish', 'two,thin,blue,fish']
    """
    for ch in s:
        ch = ch.replace(',' , ' ')
        if word in ch:
            return ch
So I know that when I run the code and test out the example I provided, it checks through
'one,fat,black,cat'
and sees that the word is not in it and then continues to check the next value which is
'one,tiny,red,fish'
Here it recognizes the word fish and outputs it. But the code doesn't check the last input, which is also valid. How can I make it check all values rather than stopping at the first valid output?
expected output
>>> has_last_token(['one,fat,black,cat', 'one,tiny,red,fish', 'two,thin,blue,fish'], 'fish')
>>> ['one,tiny,red,fish', 'two,thin,blue,fish']
I'll try to answer your question by altering your code and your logic as little as I can, in case you understand the answer better this way.
If you return ch, you'll immediately terminate the function.
One way to accomplish what you want is to simply declare a list before your loop and then append the items you want to that list accordingly. The return value would be that list, like this:
def has_last_token(s, word):
    result = []
    for ch in s:
        if ch.endswith(word):  # this will check only the string's tail
            result.append(ch)
    return result
PS: That ch.replace() is unnecessary according to the function's docstring
You are returning the first match and this exits the function. You want to either yield from the loop (creating a generator) or build a list and return that. I would just use endswith in a list comprehension. I'd also rename things to make it clear what's what.
def has_last_token(words_list, token):
    return [words for words in words_list if words.endswith(token)]
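The generator variant mentioned above would simply yield matches instead of building the list:
def has_last_token(words_list, token):
    for words in words_list:
        if words.endswith(token):
            yield words  # lazily produce each matching element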
Another way is to use rsplit to split the last token from the rest of the string. If you pass the second argument as 1 (you could use the named argument maxsplit in Python 3, but Python 2 doesn't accept it as a keyword), it stops after one split, which is all we need here.
You can then use filter rather than an explicit loop to keep only those strings that have word as their final token.
def has_last_token(L, word):
    return filter(lambda s: s.rsplit(',', 1)[-1] == word, L)

result = has_last_token(['one,fat,black,cat',
                         'one,tiny,red,fish',
                         'two,thin,blue,fish',
                         'two,thin,bluefish',
                         'nocommas'], 'fish')
for res in result:
    print(res)
Output:
one,tiny,red,fish
two,thin,blue,fish

Python- Remove all words that contain other words in a list

I have a list populated with words from a dictionary. I want to find a way to remove all words that contain other words, considering only root words that appear at the beginning of the target word.
For example, the word "rodeo" would be removed from the list because it contains the English-valid word "rode." "Typewriter" would be removed because it contains the English-valid word "type." However, the word "snicker" is still valid even if it contains the word "nick" because "nick" is in the middle and not at the beginning of the word.
I was thinking something like this:
for line in wordlist:
    if line.find(...)
but I want that "if" statement to run through every single word in the list, checking whether it is found, and if so remove it from the list so that only root words remain. Do I have to create a copy of wordlist to traverse?
So you have two lists: the list of words you want to check and possibly remove, and a list of valid words. If you like, you can use the same list for both purposes, but I'll assume you have two lists.
For speed, you should turn your list of valid words into a set. Then you can very quickly check whether any particular word is in that set. Then take each word and check each of its prefixes against the valid-words set. Since "a" and "I" are valid words in English, will you remove all valid words starting with 'a', or will you have a rule that sets a minimum length for the prefix?
I am using the file /usr/share/dict/words from my Ubuntu install. This file has all sorts of odd things in it; for example, it seems to contain every letter by itself as a word. Thus "k" is in there, "q", "z", etc. None of these are words as far as I know, but they are probably in there for some technical reason. Anyway, I decided to simply exclude anything shorter than three letters from my valid words list.
Here is what I came up with:
# build the set of valid words from /usr/share/dict/words
wfile = "/usr/share/dict/words"
valid = set(line.strip() for line in open(wfile) if len(line.strip()) >= 3)
lst = ["ark", "booze", "kite", "live", "rodeo"]
def subwords(word):
for i in range(len(word) - 1, 0, -1):
w = word[:i]
yield w
newlst = []
for word in lst:
# uncomment these for debugging to make sure it works
# print "subwords", [w for w in subwords(word)]
# print "valid subwords", [w for w in subwords(word) if w in valid]
if not any(w in valid for w in subwords(word)):
newlst.append(word)
print(newlst)
If you are a fan of one-liners, you can do away with the for loop and use a list comprehension:
newlst = [word for word in lst if not any(w in valid for w in subwords(word))]
I think that's more terse than it should be, and I like being able to put in the print statements to debug.
Hmm, come to think of it, it's not too terse if you just add another function:
def keep(word):
    return not any(w in valid for w in subwords(word))

newlst = [word for word in lst if keep(word)]
Python can be easy to read and understand if you make functions like this, and give them good names.
I'm assuming that you only have one list from which you want to remove any elements that have prefixes in that same list.
# Important assumption here... wordlist is sorted
base = wordlist[0]  # consider the first word in the list
for word in wordlist:  # loop through the entire list, checking whether
    if not word.startswith(base):  # the word we're considering starts with the base
        print(base)  # if not... we have a new base; print the current
        base = word  # one and move on to this new one
    # else: word starts with base, so don't output it
    # and go on to the next item in the list
print(base)  # finish by printing the last base
EDIT: Added some comments to make the logic more obvious
I find jkerian's answer to be the best (assuming only one list) and I would like to explain why.
Here is my version of the code (as a function):
wordlist = ["a","arc","arcane","apple","car","carpenter","cat","zebra"];
def root_words(wordlist):
result = []
base = wordlist[0]
for word in wordlist:
if not word.startswith(base):
result.append(base)
base=word
result.append(base)
return result;
print root_words(wordlist);
As long as the word list is sorted (you could do this in the function if you wanted to), this will get the result in a single pass. This is because when you sort the list, all words made up of another word in the list come directly after that root word; e.g. anything that falls between "arc" and "arcane" in your particular list will also be eliminated because of the root word "arc".
You can use the built-in filter function with a lambda for this. I think it'll make your life a lot easier.
words = ['rode', 'nick']  # this is the list of all the words that you have.
                          # I'm using 'rode' and 'nick' as they're in your example
listOfWordsToTry = ['rodeo', 'snicker']

def validate(w):
    for word in words:
        if w.startswith(word):
            return False
    return True

wordsThatDontStartWithValidEnglishWords = \
    list(filter(lambda x: validate(x), listOfWordsToTry))
This should work for your purposes, unless I misunderstand your question.
Hope this helps
I wrote an answer that assumes two lists, the list to be pruned and the list of valid words. In the discussion around my answer, I commented that maybe a trie solution would be good.
What the heck, I went ahead and wrote it.
You can read about a trie here:
http://en.wikipedia.org/wiki/Trie
For my Python solution, I basically used dictionaries. A key is a sequence of symbols, and each symbol goes into a dict, with another Trie instance as the data. A second dictionary stores "terminal" symbols, which mark the end of a "word" in the Trie. For this example, the "words" are actually words, but in principle the words could be any sequence of hashable Python objects.
The Wikipedia example shows a trie where the keys are letters, but a key can be more than a single letter; it can be a sequence of multiple letters. For simplicity, my code uses only a single symbol at a time as a key.
If you add both the word "cat" and the word "catch" to the trie, then there will be nodes for 'c', 'a', and 't' (and also the second 'c' in "catch"). At the node level for 'a', the dictionary of "terminals" will have 't' in it (thus completing the coding for "cat"), and likewise at the deeper node level of the second 'c' the dictionary of terminals will have 'h' in it (completing "catch"). So, adding "catch" after "cat" just means one additional node and one more entry in the terminals dictionary. The trie structure makes a very efficient way to store and index a really large list of words.
def _pad(n):
    return " " * n

class Trie(object):
    def __init__(self):
        self.t = {}  # dict mapping symbols to sub-tries
        self.w = {}  # dict listing terminal symbols at this level

    def add(self, word):
        if 0 == len(word):
            return
        cur = self
        for ch in word[:-1]:  # add all symbols but terminal
            if ch not in cur.t:
                cur.t[ch] = Trie()
            cur = cur.t[ch]
        ch = word[-1]
        cur.w[ch] = True  # add terminal

    def prefix_match(self, word):
        if 0 == len(word):
            return False
        cur = self
        for ch in word[:-1]:  # check all symbols but the last one
            # If you check the last one, you are not checking a prefix,
            # you are checking whether the whole word is in the trie.
            if ch in cur.w:
                return True
            if ch not in cur.t:
                return False
            cur = cur.t[ch]  # walk down the trie to the next level
        return False

    def debug_str(self, nest, s=None):
        "print trie in a convenient nested format"
        lst = []
        s_term = "".join(ch for ch in self.w)
        if 0 == nest:
            lst.append(object.__str__(self))
            lst.append("--top--: " + s_term)
        else:
            tup = (_pad(nest), s, s_term)
            lst.append("%s%s: %s" % tup)
        for ch, d in self.t.items():
            lst.append(d.debug_str(nest+1, ch))
        return "\n".join(lst)

    def __str__(self):
        return self.debug_str(0)
t = Trie()

# Build the valid set from /usr/share/dict/words, which has every letter of
# the alphabet as words! Only take 2-letter words and longer.
wfile = "/usr/share/dict/words"
for line in open(wfile):
    word = line.strip()
    if len(word) >= 2:
        t.add(word)

# add valid 1-letter English words
t.add("a")
t.add("I")

lst = ["ark", "booze", "kite", "live", "rodeo"]
# "ark" starts with "a"
# "booze" starts with "boo"
# "kite" starts with "kit"
# "live" is good: "l", "li", "liv" are not words
# "rodeo" starts with "rode"

newlst = [w for w in lst if not t.prefix_match(w)]
print(newlst)  # prints: ['live']
I don't want to provide an exact solution, but I think there are two key functions in Python that will help you greatly here.
The first, which jkerian mentioned: str.startswith() http://docs.python.org/library/stdtypes.html#str.startswith
The second: filter() http://docs.python.org/library/functions.html#filter
With filter, you could write a conditional function that will check to see if a word is the base of another word and return true if so.
For each word in the list, you would need to iterate over all of the other words and evaluate the conditional using filter, which could return the proper subset of root words.
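To make that hint concrete without giving away a polished solution, one rough sketch (with made-up sample words, treating the single list as both the source and the valid-word list) might combine the two like this:
words = ["rode", "rodeo", "type", "typewriter", "nick", "snicker"]

def is_root(word):
    # keep a word only if no *other* word in the list is a prefix of it
    return not any(word.startswith(other) and word != other for other in words)

print(list(filter(is_root, words)))  # ['rode', 'type', 'nick', 'snicker']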
I only had one list - and I wanted to remove any word from it that was a prefix of another.
Here is a solution that should run in O(N log N) time and O(M) space, where M is the size of the returned list. The runtime is dominated by the sorting.
l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
If the list is sorted then the item at index N is a prefix if it begins the item at index N+1.
At the end it appends the last item of the original sorted list, since by definition it is not a prefix.
Handling it last also allows us to iterate over an arbitrary number of indexes w/o going out of range.
If you have the banned list hardcoded in another list:
banned = tuple(banned_prefixes)
removed_prefixes = [i for i in your_list if not i.startswith(banned)]
This relies on the fact that startswith accepts a tuple. It probably runs in something close to N * M, where N is the number of elements in the list and M is the number of elements in banned. Python could conceivably be doing some smart things to make it a bit quicker. If you are like the OP and want to disregard case, you will need .lower() calls in places.
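For instance, with hypothetical data:
your_list = ["rodeo", "snicker", "typewriter", "live"]
banned = tuple(["rode", "type"])
print([i for i in your_list if not i.startswith(banned)])  # ['snicker', 'live']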
