How can I find duplicate phrases in a list? - python

There is a list like this:
my_list = ['beautiful moments','moments beautiful']
Ignore the grammar; the point is that the two strings are about the same thing.
How can I detect that these phrases are duplicates WITHOUT splitting and sorting each phrase?

You can take advantage of frozensets here: they are hashable (so they can be stored in a set, where membership testing is O(1) on average), and two of them compare equal whenever they hold the same items, in any order.
We iterate over the list, split each phrase into words, and build a frozenset from them. A set named unique records the frozensets seen so far, and a phrase is kept only if its frozenset is not already there. Because a set drops repeated words, "hi hi" is reduced to {"hi"}, which is why it is not treated as a duplicate of "hi bye".
my_list = ["beautiful moments", "moments beautiful", "hi bye", "hi hi", "bye hi"]
unique = set()  # frozensets of words seen so far
result = []     # first phrase kept for each distinct word set
for i in my_list:
    f = frozenset(i.split())
    if f not in unique:
        unique.add(f)
        result.append(i)
print(result)
Output:
['beautiful moments', 'hi bye', 'hi hi']
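If repeated words should matter (so that "hi hi" and "hi" count as different phrases), one possible variation is to key on word counts instead of a plain word set. This is only a sketch: the function name dedupe_phrases is made up for illustration, and it assumes phrases are split on whitespace.

```python
from collections import Counter


def dedupe_phrases(phrases):
    """Keep the first phrase for each distinct multiset of words."""
    seen = set()
    result = []
    for phrase in phrases:
        # (word, count) pairs are hashable, so the frozenset can go in a set
        key = frozenset(Counter(phrase.split()).items())
        if key not in seen:
            seen.add(key)
            result.append(phrase)
    return result


print(dedupe_phrases(["hi hi", "hi", "beautiful moments", "moments beautiful"]))
# ['hi hi', 'hi', 'beautiful moments']
```

With the plain frozenset approach, "hi hi" and "hi" would collapse into one entry; here they stay separate while reordered phrases are still merged.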

Related

Scrape data and sort it using Python 2.7 and Selenium

I'm trying to scrape data from a website using Selenium and Python 2.7. Here is the element that holds the data I want to scrape:
<textarea>let, either, and, have, rather, because, your, with, other, that, neither, since, however, its, will, some, own, than, should, wants, they, got, may, what, least, else, cannot, like, whom, which, who, why, his, these, been, had, the, all, likely, their, must, our</textarea>
I need to put all of those words into a list and sort it. This is my progress so far:
wordlist = []
data = browser.find_element_by_tag_name("textarea")
words = data.get_attribute()
wordlist.append(words)
print words
print wordlist.sort()
Any help or clue would be useful.

Note that wordlist.sort() doesn't return a list; it sorts the existing list in place and returns None, so you need to do
wordlist.sort()
print wordlist
or try the code below to get the required output:
data = browser.find_element_by_tag_name("textarea")
words = data.get_attribute('value')
sorted_list = sorted(words.split(', '))
print sorted_list
# ['all', 'and', 'because', 'been', 'cannot', 'either', 'else', 'got', 'had', 'have', 'his', 'however', 'its', 'least', 'let', 'like', 'likely', 'may', 'must', 'neither', 'other', 'our', 'own', 'rather', 'should', 'since', 'some', 'than', 'that', 'the', 'their', 'these', 'they', 'wants', 'what', 'which', 'who', 'whom', 'why', 'will', 'with', 'your']
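The difference between list.sort() and sorted() can be checked without a browser. The sketch below uses a shortened, made-up stand-in for the textarea value (raw_value is an assumption, not data fetched by Selenium).

```python
# Shortened stand-in for the string returned by get_attribute('value')
raw_value = "let, either, and, have, rather"

words = raw_value.split(", ")    # split on comma + space, so no commas remain
sorted_copy = sorted(words)      # returns a new sorted list, words is untouched
in_place_result = words.sort()   # sorts words itself and returns None

print(sorted_copy)      # ['and', 'either', 'have', 'let', 'rather']
print(in_place_result)  # None
print(words)            # ['and', 'either', 'have', 'let', 'rather']
```

Printing the return value of sort() is what produced None in the question; printing the list after sorting, or using sorted(), avoids that.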

I was able to recreate your issue using the following code:
words = ["hello", "world", "abc", "def"]
wordlist = []
wordlist.append(words)
print(words)
print(wordlist.sort())
This outputs:
['hello', 'world', 'abc', 'def']
None
Which I believe is the issue you are having.
To fix it I did two things:
1) I replaced wordlist.append(words) with wordlist = words.copy(), which copies the list instead of appending the whole list as a single element (list.copy() needs Python 3.3 or later; on Python 2.7, list(words) does the same), and 2) I moved wordlist.sort() out of the print call, because sort() sorts in place and returns None.
So, the complete updated example is:
words = ["hello", "world", "abc", "def"]
wordlist = []
wordlist = words.copy()
wordlist.sort()
print(words)
print(wordlist)
Which now outputs the sorted list (as you required):
['hello', 'world', 'abc', 'def']
['abc', 'def', 'hello', 'world']

Find words in text, very slow solution

I have to write a function that is given a text of concatenated words without spaces and a list containing both words that appear in the text and words that do not.
The function must return a tuple containing a new list with only the words that are in the text, in order of appearance, and the word that appears most often in the text. If several words are tied for the highest count, the function picks the first of them in alphabetical order (for example, with "b"=3, "c"=3, "a"=1 it picks "b").
I also have to modify the original list so that it keeps only the words that are not in the text, without changing their order.
For example, if I have
list=["tree","page","law","gel","sand"]
text="pagelawtreetreepagepagetree"
then the tuple will be
(["page","tree","law"], "page")
and list will become
list=["gel","sand"]
Now I have written this function, but it's incredibly slow. Can someone help?
ls=list
def es3(ls,text):
    d=[]
    v={}
    while text:
        for item in ls:
            if item in text[:len(item)]:
                text=text[len(item):]
                if item not in d:
                    d+=[item]
                    v[item]=1
                else:
                    v[item]+=1
            if text=="":
                p=sorted(v.items())
                f=max(p, key=lambda k: k[1])
                M=(d,f[0])
                for b in d:
                    if b in ls:
                        ls.remove(b)
                return (M)

In Python, strings are immutable: whenever you modify one, a new object is created. Repeatedly slicing the text therefore copies the remainder on every step, which costs time and memory, so it is usually better to collect the pieces in a list instead.
By building a list of all k-length slices of the text, where k ranges over the distinct lengths of the words you are looking for (3 and 4 for your list), you get every candidate substring, which you can then filter against your word set and count:
# all substrings of length 3 and 4 as a list - needs 48 lookups to reduce it to what matters
['pag', 'page', 'age', 'agel', 'gel', 'gela', 'ela', 'elaw', 'law', 'lawt', 'awt',
'awtr', 'wtr', 'wtre', 'tre', 'tree', 'ree', 'reet', 'eet', 'eetr', 'etr', 'etre',
'tre', 'tree', 'ree', 'reep', 'eep', 'eepa', 'epa', 'epag', 'pag', 'page', 'age',
'agep', 'gep', 'gepa', 'epa', 'epag', 'pag', 'page', 'age', 'aget', 'get', 'getr',
'etr', 'etre', 'tre', 'tree']
Using a set for "is A in B" checks makes the code faster as well: sets have O(1) average lookup, while a list lookup takes longer the more elements it holds (worst case: n). So you drop every slice in the k-length list that does not match any of the words you are looking for (e.g. 'etre'):
# what is left after the list comprehension, including the filter, is done
['page', 'gel', 'law', 'tree', 'tree', 'page', 'page', 'tree']
For counting iterables I use collections.Counter, a specialized dictionary that counts things. Its most_common() method returns (key, count) tuples sorted from most to least frequent, which I format into a return value that matches your question.
One version that solves it while respecting overlapping matches:
from collections import Counter
def findWordsInText(words,text):
    words = set(words) # set should be faster for lookup
    lens = set(len(w) for w in words)
    # get all slices of len 3+4 (in this case) from text; the +1 keeps the last
    # start index, and the length check skips slices truncated at the end
    splitted = [text[i:i+ll] for i in range(len(text)-min(lens)+1) for ll in lens
                if i+ll <= len(text) and text[i:i+ll] in words] # only keep what is in words
    # count them
    counted = Counter(splitted)
    # set-difference
    not_in = words-set(splitted)
    # format as requested: list of words ordered by count, most common word
    most_common = counted.most_common()
    ordered_in = ( [w for w,_ in most_common], most_common[0][0] )
    return list(not_in), ordered_in
words = ["tree","page","law","gel","sand"]
text = "pagelawtreetreepagepagetree"
not_in, found = findWordsInText(words,text)
print(not_in)
print(found)
Output:
['sand']
(['page', 'tree', 'gel', 'law'], 'page')
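The version above counts overlapping matches (which is why 'gel' shows up) and orders words by frequency. If the text is strictly a concatenation of list words, a left-to-right scan may be closer to what the question describes. The sketch below is one possible take, not the answer's method: the name find_words is invented here, and it assumes no list word is a prefix of another, since the scan takes the first word that matches at each position.

```python
from collections import Counter


def find_words(words, text):
    """Scan text left to right, consuming one known word at a time."""
    # group the words by length so each position needs only a few slices
    by_length = {}
    for w in words:
        by_length.setdefault(len(w), set()).add(w)

    counts = Counter()
    order = []  # words in order of first appearance
    i = 0
    while i < len(text):
        for length, candidates in by_length.items():
            piece = text[i:i + length]
            if piece in candidates:
                if piece not in counts:
                    order.append(piece)
                counts[piece] += 1
                i += length
                break
        else:
            i += 1  # no known word starts here, move on one character

    # highest count wins, ties are broken alphabetically
    best = min(counts, key=lambda w: (-counts[w], w)) if counts else None
    # shrink the caller's list in place, keeping the original order
    words[:] = [w for w in words if w not in counts]
    return order, best


words = ["tree", "page", "law", "gel", "sand"]
result = find_words(words, "pagelawtreetreepagepagetree")
print(result)  # (['page', 'law', 'tree'], 'page')
print(words)   # ['gel', 'sand']
```

Each position costs one slice and one set lookup per distinct word length, and the text itself is never rebuilt.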

Is it possible to create an if statement dynamically using list elements and OR?

I'm trying to adapt the code I wrote below to work with a dynamic list of required values rather than with a string, as it works at present:
required_word = "duck"
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
             ["Hello", "duck"]]
sentences_not_containing_required_words = []
for sentence in sentences:
    if required_word not in sentence:
        sentences_not_containing_required_words.append(sentence)
print sentences_not_containing_required_words
Say, for example, I had two required words (only one of which is actually required); I could do this:
required_words = ["dog", "fox"]
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
             ["Hello", "duck"]]
sentences_not_containing_required_words = []
for sentence in sentences:
    if (required_words[0] not in sentence) or (required_words[1] not in sentence):
        sentences_not_containing_required_words.append(sentence)
print sentences_not_containing_required_words
>>> [['Hello', 'duck']]
What I need is a way to handle a list of required words that varies in size (number of items), satisfying the if statement when any of the list's items is not in the list named 'sentence'. Being quite new to Python, I'm stumped and don't know how to phrase the question better. Do I need to come up with a different approach?
Thanks in advance!
(Note that the real code will do something more complicated than printing sentences_not_containing_required_words.)

You can construct this list pretty easily with a combination of a list comprehension and the any() built-in function:
non_matches = [s for s in sentences if not any(w in s for w in required_words)]
This will iterate over the list sentences while constructing a new list, and only include sentences where none of the words from required_words are present.
If you are going to end up with longer lists of sentences, you may consider using a generator expression instead to minimize memory footprint:
non_matches = (s for s in sentences if not any(w in s for w in required_words))
for s in non_matches:
    # do stuff
    pass
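The question can be read two ways, and any() and all() express both. The sketch below adds a third, made-up sentence to the sample data to show where "none of the required words are present" and "at least one required word is missing" differ.

```python
required_words = ["dog", "fox"]
sentences = [
    ["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
    ["Hello", "duck"],
    ["a", "lazy", "dog"],  # extra sentence added for illustration
]

# Sentences containing none of the required words
missing_all = [s for s in sentences if not any(w in s for w in required_words)]

# Sentences missing at least one required word; this is the generalised form
# of "(required_words[0] not in s) or (required_words[1] not in s)"
missing_some = [s for s in sentences if not all(w in s for w in required_words)]

print(missing_all)   # [['Hello', 'duck']]
print(missing_some)  # [['Hello', 'duck'], ['a', 'lazy', 'dog']]
```

Both forms work for a required_words list of any length, so no if statement has to be built dynamically.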

How to create a list by splitting the elements of a list in Python?

Let's say I have:
sentences = ['The girls are gorgeous', "I'm mexican"]
And I want to obtain:
words = ['The','girls','are','gorgeous', "I'm", 'mexican']
I tried:
words = [w.split(' ') for w in sentences]
but did not get the expected result.
Will this work with Counter(words)? I need to obtain the word frequencies.

Try it like this:
sentences = ["The girls are gorgeous", "I'm mexican"]
words = [word for sentence in sentences for word in sentence.split(' ')]

Your method didn't work because split returns a list, so your code creates a nested list. You need to flatten it before using it with Counter, and there are many ways to do that.
from itertools import chain
from collections import Counter
Counter(chain.from_iterable(words))
would be the best way to flatten the nested list and find the frequencies. But you can also use a generator expression, like this:
sentences = ['The girls are gorgeous', "I'm mexican"]
from collections import Counter
print Counter(item for items in sentences for item in items.split())
# Counter({'mexican': 1, 'girls': 1, 'are': 1, 'gorgeous': 1, "I'm": 1, 'The': 1})
This takes each sentence, splits it to get the list of words, iterates over those words, and flattens the nested structure.
If you want to find the top 10 words, you can use the Counter.most_common method, like this:
Counter(item for items in sentences for item in items.split()).most_common(10)
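For completeness, here is the chain.from_iterable route written out end to end in Python 3 syntax. The third sentence is made up so that one word repeats and most_common has something to rank.

```python
from collections import Counter
from itertools import chain

sentences = ["The girls are gorgeous", "I'm mexican", "The end"]

# One list of words per sentence (the nested list from the question)
nested = [sentence.split() for sentence in sentences]

# chain.from_iterable walks the inner lists one after another
flat_words = list(chain.from_iterable(nested))
counts = Counter(flat_words)

print(flat_words)
# ['The', 'girls', 'are', 'gorgeous', "I'm", 'mexican', 'The', 'end']
print(counts.most_common(1))
# [('The', 2)]
```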

Try this:
words = ' '.join(sentences).split()

Python: how to delete an object from a list based on its data member

So I made the class below and stored instances of it in a list.
class Word:
    def __init__(self,word,definition,synonyms):
        self.w=word
        self.defi=definition
        self.synonyms=synonyms
I'm sure I can loop through the list and check every instance, but I'm trying to do it using the list's remove method.
x is a list of Word objects
word="hi"
x.remove(Word if Word.w==word)
This gave me an error. Is there a similar way to do it?
EDIT
I was trying to simplify my question but apparently my intentions weren't clear.
I have a dictionary whose keys are the 2 last letters of words (that users add) and whose values are lists of the words with those 2 last characters. Example:
w1=Word("lengthen","make something taller","some synonym")
w2=Word("woken","let someone wake up","some synonym")
w3=Word("fax","machine used for communication","some synonym")
w4=Word("wax","chemical substance","some synonym")
{'en':(w1,w2),'ax':(w3,w4)}
I am trying to define a function delete that takes the dictionary and a word (a STRING) and deletes the Word OBJECT whose w attribute is that word.
def delete(dictionary,word):
    if word[-2:] in dictionary:
        x=dictionary[word[-2:]]
        if(x.count(word)!=0):
            x.remove(Word if Word.w==word)

You would generally use a list comprehension to build a new list without matches, otherwise you end up trying to modify a list while iterating over it:
x = [word for word in x if word.w != "hi"]
Note the lowercase word; using Word as the loop variable would shadow the class itself.
If altering the list in place is crucial, you can use slice assignment:
x[:] = [word for word in x if word.w != "hi"]
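The difference between the two forms only shows up when something else holds a reference to the same list. A small sketch, using a cut-down Word class with just the w attribute (an assumption made to keep the example short):

```python
class Word:
    def __init__(self, word):
        self.w = word


x = [Word("hi"), Word("bye")]
alias = x  # second name for the same list object

# Slice assignment replaces the contents, so alias sees the change
x[:] = [item for item in x if item.w != "hi"]
remaining_via_alias = [item.w for item in alias]

# Plain assignment would bind x to a new list and leave alias untouched
y = [Word("hi"), Word("bye")]
alias_y = y
y = [item for item in y if item.w != "hi"]
remaining_via_alias_y = [item.w for item in alias_y]

print(remaining_via_alias)    # ['bye']
print(remaining_via_alias_y)  # ['hi', 'bye']
```

This matters for the dictionary in the question: the list stored as a dictionary value is only updated if it is modified in place or reassigned through the dictionary key.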

I think this will do what you want:
class Word:
    def __init__(self,word,definition,synonyms):
        self.w=word
        self.defi=definition
        self.synonyms=synonyms
    def __repr__(self): # added to facilitate printing of tuples of Word objects
        return 'Word({}, {}, {})'.format(self.w, self.defi, self.synonyms)
w1=Word("lengthen", "make something taller", "some synonym")
w2=Word("woken", "let someone wake up", "some synonym")
w3=Word("fax", "machine used for communication", "some synonym")
w4=Word("wax", "chemical substance", "some synonym")
dictionary = {'en': (w1, w2), 'ax': (w3, w4)}
def delete(dictionary, word):
    suffix = word[-2:]
    if suffix in dictionary:
        dictionary[suffix] = tuple(word_obj for word_obj in dictionary[suffix]
                                   if word_obj.w != word)
delete(dictionary, 'woken')
print(dictionary)
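The question describes the dictionary values as lists, although the example uses tuples. If they really are lists, a variant that edits them in place might look like the sketch below. The name delete_in_place, the boolean return value, and the choice to drop a key once its list is empty are all assumptions made for illustration.

```python
class Word:
    def __init__(self, word, definition, synonyms):
        self.w = word
        self.defi = definition
        self.synonyms = synonyms


def delete_in_place(dictionary, word):
    """Remove every Word whose w equals word; return True if any was removed."""
    suffix = word[-2:]
    bucket = dictionary.get(suffix)
    if bucket is None:
        return False
    remaining = [obj for obj in bucket if obj.w != word]
    removed = len(remaining) != len(bucket)
    if remaining:
        bucket[:] = remaining   # keep the same list object
    else:
        del dictionary[suffix]  # nothing left under this suffix
    return removed


index = {
    "en": [Word("lengthen", "make something taller", "some synonym"),
           Word("woken", "let someone wake up", "some synonym")],
    "ax": [Word("fax", "machine used for communication", "some synonym")],
}

print(delete_in_place(index, "woken"))  # True
print([obj.w for obj in index["en"]])   # ['lengthen']
print(delete_in_place(index, "fax"))    # True
print("ax" in index)                    # False
```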
