I have to write a function that takes a text of concatenated words without spaces and a list containing both words that appear in the text and words that do not.
The function must return a tuple containing a new list with only the words that are in the text, in order of appearance, together with the word that appears most often in the text. If two words are tied for the most appearances, the function chooses one alphabetically (if the words appear like "b"=3, "c"=3, "a"=1, it will choose "b").
It also has to modify the original list so that it only contains the words that are not in the text, without changing their order.
For example, if I have
list=["tree","page","law","gel","sand"]
text="pagelawtreetreepagepagetree"
then the tuple will be
(["page","tree","law"], "page")
and list will become
list=["gel","sand"]
Now, I have written this function, but it's incredibly slow. Can someone help?
ls = list
def es3(ls, text):
    d = []
    v = {}
    while text:
        for item in ls:
            if item in text[:len(item)]:
                text = text[len(item):]
                if item not in d:
                    d += [item]
                    v[item] = 1
                else:
                    v[item] += 1
    if text == "":
        p = sorted(v.items())
        f = max(p, key=lambda k: k[1])
        M = (d, f[0])
        for b in d:
            if b in ls:
                ls.remove(b)
    return M
In Python, strings are immutable - if you modify one you create a new object. Object creation is time- and memory-inefficient, so almost all of the time it is better to work with lists instead.
By creating a list of all possible k-length slices of text - k being the (unique) lengths of the words you look for (3 and 4 in your list) - you generate every split you could count, and then filter out the ones that are not in your word set:
# all 3- and 4-length substrings as a list - needs 48 lookups to clean it up to what's important
['pag', 'page', 'age', 'agel', 'gel', 'gela', 'ela', 'elaw', 'law', 'lawt', 'awt',
'awtr', 'wtr', 'wtre', 'tre', 'tree', 'ree', 'reet', 'eet', 'eetr', 'etr', 'etre',
'tre', 'tree', 'ree', 'reep', 'eep', 'eepa', 'epa', 'epag', 'pag', 'page', 'age',
'agep', 'gep', 'gepa', 'epa', 'epag', 'pag', 'page', 'age', 'aget', 'get', 'getr',
'etr', 'etre', 'tre', 'tree']
Using a set for "is A in B" checks makes the code faster as well - sets have O(1) lookup, while lists take longer the more elements they contain (worst case: n). So you eliminate all entries from the k-length parts list that do not match any of the words you are looking for (e.g. 'etre'):
# what's left after the list comprehension with the filter applied
['page', 'gel', 'law', 'tree', 'tree', 'page', 'page', 'tree']
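As an aside, the difference between list and set lookups is easy to see with a rough micro-benchmark (illustrative only - absolute numbers depend on your machine):
import timeit

# membership test near the end of a 10,000-element list vs. the same data as a set
setup = "words = ['word%d' % i for i in range(10000)]; wordset = set(words)"
print(timeit.timeit("'word9999' in words", setup=setup, number=1000))    # linear scan
print(timeit.timeit("'word9999' in wordset", setup=setup, number=1000))  # hash lookup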
For counting iterables I use collections.Counter - a specialized dictionary that counts things. Its most_common() method returns (key, count) tuples sorted with the most frequent first, which I then format into a return value that matches your OP.
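For illustration, here is most_common() applied to the filtered list from above (ties keep first-encountered order in CPython):
from collections import Counter

counts = Counter(['page', 'gel', 'law', 'tree', 'tree', 'page', 'page', 'tree'])
print(counts.most_common())       # [('page', 3), ('tree', 3), ('gel', 1), ('law', 1)]
print(counts.most_common(1)[0])   # ('page', 3) - the single most frequent entry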
One version that solves it while respecting overlapping matches:
from collections import Counter

def findWordsInText(words, text):
    words = set(words)  # set should be faster for lookup
    lens = set(len(w) for w in words)
    # get all splits of len 3+4 (in this case) from text
    # (+1 so the last possible substring is not skipped)
    splitted = [text[i:i+ll] for i in range(len(text) - min(lens) + 1) for ll in lens
                if text[i:i+ll] in words]  # only keep what's in words
    # count them
    counted = Counter(splitted)
    # set-difference
    not_in = words - set(splitted)
    # format as requested: list of words in order, most common word
    most_common = counted.most_common()
    ordered_in = ([w for w, _ in most_common], most_common[0][0])
    return list(not_in), ordered_in
words = ["tree","page","law","gel","sand"]
text = "pagelawtreetreepagepagetree"
not_in, found = findWordsInText(words,text)
print(not_in)
print(found)
Output:
['sand']
(['page', 'tree', 'gel', 'law'], 'page')
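If you also need to mutate the original list in place, as the question asked, rather than just getting the not_in list back, a minimal sketch on top of the function above could be:
words = ["tree", "page", "law", "gel", "sand"]
not_in, found = findWordsInText(words, "pagelawtreetreepagepagetree")
words[:] = [w for w in words if w in not_in]  # slice assignment keeps the list object and its order
print(words)  # ['sand'] - note 'gel' was counted here as an overlapping substring of "pagelaw"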
Related
There is a list like this:
my_list = ['beautiful moments','moments beautiful']
Don't mind the grammar; the main idea is that those two strings are about the same thing.
The question is: how do I detect that those phrases are duplicates WITHOUT splitting and sorting each phrase?
You can take advantage of frozensets here because they are hashable (so they can be added to a set, and membership testing for sets is O(1)) and because two sets compare equal if they contain the same items in any order.
Basically we iterate through the items of the list, split each one, and make a frozenset out of it. We keep a set of the unique frozensets seen so far and check whether each item's frozenset is already present.
my_list = ["beautiful moments", "moments beautiful", "hi bye", "hi hi", "bye hi"]
unique = set()
result = []
for i in my_list:
f = frozenset(i.split())
if f not in unique:
unique.add(f)
result.append(i)
print(result)
Output:
['beautiful moments', 'hi bye', 'hi hi']
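To see why this works, note that two frozensets built from the same words compare (and hash) equal regardless of word order:
a = frozenset("beautiful moments".split())
b = frozenset("moments beautiful".split())
print(a == b)              # True - same words, different order
print(hash(a) == hash(b))  # True - equal frozensets hash the same, so `f not in unique` works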
I'm trying to use the list of shortened words to select & retrieve the corresponding full word identified by its initial sequence of characters:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']
Trying this regex match with a single shortened word:
import re
shortword = 'deve'
retrieved = filter(lambda i: re.match(r'{}'.format(shortword), i), fullwords)
print(*retrieved)
returns
developing
So the regex match works but the question is how to adapt the code to iterate through the shortwords list and retrieve the full words?
EDIT: The solution needs to preserve the order from the shortwords list.
Maybe using a dictionary
# Using a dictionary
test = 'appe is a deve arm'
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Building the dictionary
d = {}
for i in range(len(shortwords)):
    d[shortwords[i]] = fullwords[i]

# apply dictionary to test
res = " ".join(d.get(s, s) for s in test.split())

# print test data after dictionary mapping
print(res)
Here is one way to do it:
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']
# Dict comprehension
words = {short:full for short, full in zip(shortwords, fullwords)}
#Solving problem
keys = ['deve','arm','pony']
values = [words[key] for key in keys]
print(values)
This is a classic key-value problem. Use a dictionary for it, or consider pandas in the long term.
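If you do go the pandas route, a minimal sketch (assuming pandas is installed) could use a Series indexed by the short forms as a lookup table:
import pandas as pd

shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

lookup = pd.Series(fullwords, index=shortwords)
print(lookup[['deve', 'arm', 'pony']].tolist())  # ['developing', 'armour', 'pony']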
Your question text seems to indicate that you're looking for your shortwords at the start of each word. That should be easy then:
matched_words = [word for word in fullwords if any(word.startswith(shortword) for shortword in shortwords)]
If you'd like to regex this for some reason (it's unlikely to be faster), you could do that with a large alternation:
import re

regex_alternation = '|'.join(re.escape(shortword) for shortword in shortwords)
matched_words = [word for word in fullwords if re.match(rf"^{regex_alternation}", word)]
Alternately if your shortwords are always four characters, you could just slice the first four off:
shortwords = set(shortwords) # sets have O(1) lookups so this will save
# a significant amount of time if either shortwords
# or longwords is long
matched_words = [word for word in fullwords if word[:4] in shortwords]
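If the output order has to follow the shortwords list instead (per the EDIT), here is a small sketch under the same assumption - that each short form is just the first (at most four) characters of its full word and that those prefixes are unique:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']

prefix_map = {}
for word in fullwords:
    prefix_map.setdefault(word[:4], word)   # first full word wins for each 4-character prefix

ordered_matches = [prefix_map[s] for s in shortwords if s in prefix_map]
print(ordered_matches)  # ['appearance', 'kid', 'developing', 'colored', 'armour']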
This snippet has the functionality I wanted. It builds a regular expression pattern at each loop iteration in order to accommodate varying word-length parameters, and it maintains the original order of the wordroots list. In essence it looks at each word in wordroots and fills out the full word from the dataset. This is useful when working with the bip-0039 word list, which contains words of 3-8 characters that are uniquely identifiable by their initial 4 characters. Recovery phrases are built by randomly selecting a sequence of words from the bip-0039 list, and order is important. Observed security practice is often to abbreviate each word to a maximum of its four initial characters. Here is code which rebuilds a recovery phrase from its abbreviation:
import re

wordroots = ['sun', 'sunk', 'sunn', 'suns']
dataset = ['sun', 'sunk', 'sunny', 'sunshine']

retrieved = []
for root in wordroots:
    # (exact match) or ((exact match at beginning of word when root is 4 or more characters) else (exact match))
    pattern = r"(^" + root + "$|" + ("^" + root + "[a-zA-Z]+)" if len(root) >= 4 else "^" + root + "$)")
    retrieved.extend(filter(lambda i: re.match(pattern, i), dataset))
print(*retrieved)
Output:
sun sunk sunny sunshine
I'm looking to parse through a list of email text to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check whether words from a keywords list are in any of these sentences in the list, using regex. I wouldn't want "informations" to be captured, only "information".
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately, I would like to filter the current list down to the elements that contain only the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))
print(results_lambda)
[['more information in this one']]
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']  # renamed from `filter` so the built-in isn't shadowed
# the sub-lists are not hashable, so set() can't be used to de-duplicate them;
# iterating over `lists` once keeps the original order and avoids duplicates anyway
result = [x for x in lists if any(s in x[0] for s in keywords)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all words are present in the list:
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    # the intersection of both sets should equal the keyword set
    return len(set(sentence.split(" ")) & keyword_set) == n

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final set of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() [maybe rename it to anyKeywdsPresent]:
def allKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
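With that any()-based check (and keywords = ['information', 'boxes'] from the example), both of the outputs asked for in the question fall out directly:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
print(boolean_array)   # [False, True, False]

parsed_list = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
print(parsed_list)     # [['more information in this one']]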
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to just use the regex \W (non-word character) classifier and split your sentence on such occurrences.
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing fast comparisons (sets are hash-based, and intersections computed on hashes are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we do the set intersection against our keywords - if anything is returned, the two sets have at least one word in common, which means the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace-only split), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as nltk.
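If you do reach for an NLP module, a minimal sketch with nltk's tokenizer (assumes nltk is installed and its 'punkt' tokenizer data has been downloaded):
import nltk

sentence = 'more information in this one.'
tokens = set(nltk.word_tokenize(sentence.lower()))
print(tokens & {'information', 'boxes', 'porcupine'})   # {'information'}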
I'm working on the problem of printing all matching pairs of semordnilaps from a given alphabetically sorted list of words (or phrases) (assumed to be in lower case).
A semordnilap is defined as a word (or phrase) which spells a different word (or phrase) backwards. So 'top' ('pot' read backwards), 'avid' ('diva' read backwards), and 'animal' ('lamina' read backwards) are semordnilaps, as is 'semordnilap' itself because it's 'palindromes' read backwards, whereas 'tot', 'peep', 'radar' are palindromes (words which read the same backwards) but not semordnilaps. In this context a pair of words 'word1' and 'word2' match if 'word1' is 'word2' read backwards (and vice versa).
If the length of the input list is N then the solution will obviously have complexity O(N(N-1)/2) because there are N(N-1)/2 different pairs that can be constructed. Also, if the list is alphabetically sorted then it seems in the worst case all N(N-1)/2 pairs must be examined to find all the matching pairs.
I was wondering if there is a more efficient way of doing this, than the straightforward way. Here is my code, currently.
import io

def semordnilaps_in_text_file(file_path):

    def pairup(alist):
        for elem1 in range(len(alist)):
            for elem2 in range(elem1 + 1, len(alist)):
                yield (alist[elem1], alist[elem2])

    def word_list(file_path):
        thelist = []
        with io.open(file_path, 'r', encoding='utf-8') as file:
            for line in file:
                thelist.append(line.strip())
        return thelist

    for word1, word2 in pairup(word_list(file_path)):
        if word1[::-1] == word2:
            print '{} {}'.format(word1, word2)
I tried this function with a list of (all lowercase) English words found here (containing 109583 words), and after several minutes it managed to print the following 21 pairs before I interrupted it.
abut tuba
ac ca
ados soda
agar raga
ah ha
ajar raja
al la
am ma
an na
animal lamina
ante etna
ape epa
are era
ares sera
as sa
assam massa
ate eta
avid diva
aw wa
bad dab
bag gab
You just need to keep track of the words you've seen.
def pairup(alist):
    seen = set()
    for word in alist:
        if word not in seen:
            # Haven't seen this one yet
            if word[::-1] in seen:
                # But we've seen its reverse, so we found a pair
                yield (word, word[::-1])
            # Now we've seen it
            seen.add(word)
Subtlety: Adding the newly found word to seen at the end avoids triggering the yield if a palindrome is encountered. Conversely, if you also want to detect palindromes, add the word to seen before checking whether the reflection is already there.
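For completeness, a sketch of that palindrome-inclusive variant (the only change is adding the word to seen before the check; the name is just for illustration):
def pairup_including_palindromes(alist):
    seen = set()
    for word in alist:
        if word not in seen:
            seen.add(word)               # add first ...
            if word[::-1] in seen:       # ... so a palindrome now matches itself
                yield (word, word[::-1])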
Also: it is unnecessary to read the words into a list to use that function. You could just provide it with an iterable, such as a generator expression:
for word, drow in pairup(line.strip().lower()
                         for line in io.open(filepath, 'r')):
    print('{} {}'.format(word, drow))
One thing that you can do is preprocess the words using a hash table. A word and its reverse contain exactly the same letters, so just make a dictionary mapping each sorted letter combination to the words made of those letters, like this:
opt => [pot, top, opt]
Then you just iterate over the lists and repeat your slower method. This works because it still uses your O(N^2) algorithm, but makes N much, much smaller by only comparing things that have the potential to be semordnilaps. You could use the same idea based only on length, where all words of length three go in one bucket. That would look like this:
3 => [pot, top, opt, cat, act, tac, art, tar, hop, ...]
However, this would be much slower than having the key depending on the word composition, because using only length you'd be comparing top, pot and opt to all other three-letter words.
Here's some code that found 281 semordnilaps in under one second on my laptop:
#!/usr/bin/python

import collections

def xform(word):
    return ''.join(sorted(list(word)))

wordmap = collections.defaultdict(lambda: [])

for line in open('wordsEn.txt', 'r'):
    word = line.rstrip()
    key = xform(word)
    wordmap[key].append(word)

for key, words in wordmap.iteritems():
    for index1 in xrange(len(words)):
        for index2 in xrange(index1 + 1, len(words)):
            word1 = words[index1]
            word2 = words[index2]
            if word1[::-1] == word2:
                print word1, ' ', word2
Results are available from here.
It's probably worth noting that sorting the list of words doesn't really help you, because the semordnilaps are going to be scattered throughout the list.
You can use a dictionary here to access the words in O(1).
words = open('words.txt', 'r')
word_dict = {}  # dictionary to store all the words
for word in words:
    word = word.strip('\n')
    if word != word[::-1]:  # remove the palindromic words
        word_dict[word] = ''

for word in word_dict.keys():
    try:
        word_dict[word] = word[::-1]
        # delete the semordnilaps from dictionary
        del word_dict[word[::-1]]
    except KeyError:
        # if word has no semordnilap then remove it from dictionary
        del word_dict[word]

# word_dict is the desired dictionary
print word_dict, "\nTotal words: \n", len(word_dict)
I have used del to remove the unwanted words from the dictionary, thereby reducing the time complexity, and exception handling to access the words in O(1).
Hope it helps.
I've got a piece of code here that checks anagrams of a long list of words. I'm trying to find out how to search through every word in my long word list to find other anagrams that can match this word. Some words should have more than one anagram in my word list, yet I'm unable to find a solution to join the anagrams found in my list.
set(['biennials', 'fawn', 'unsupportable', 'jinrikishas', 'nunnery', 'deferment', 'surlinesss', 'sonja', 'bioko', 'devon'] etc...
Since I've been using sets, the set never reads to the end, and it returns only the shortest words. I know there should be more. I've been trying to iterate over my key over my whole words set so I can find all the ones that are anagrams to my key.
anagrams_found = {'diss': 'sids', 'abels': 'basel', 'adens': 'sedna', 'clot': 'colt', 'bellow': 'bowell', 'cds': 'dcs', 'doss': 'sods', 'als': 'las', 'abes': 'base', 'fir': 'fri', 'blot': 'bolt', 'ads': 'das', 'elm': 'mel', 'hops': 'shop', 'achoo': 'ochoa'... and more}
I was wondering where my code has been cutting off short. It should be finding a lot more anagrams from my Linux word dictionary. Can anyone see what's wrong with my piece of code? Simply put, the program first iterates through every word I have, then checks if the set contains my key. This appends the key to my dictionary, for later words that also match the same key. If there is already a key that I've added an anagram for, I update my dictionary by concatenating the old dict value with the new word (anagram).
anagram_list = dict()
words = set(words)
anagrams_found = []
for word in words:
    key = "".join(sorted([w for w in word]))
    if (key in words) and (key != word):
        anagrams_found.append(word)
        for name, anagram in anagram_list.iteritems():
            if anagram_list[name] == key:
                anagram = " ".join([anagram], anagram_found)
                anagram_list.update({key: anagram})
        anagram_list[key] = word
return anagram_list
All in all, this program is maybe not efficient. Can someone explain the shortcomings of my code?
anagram_dict = {}  # You could also use defaultdict(list) here

for w in words:
    key = "".join(sorted(w))
    if key in anagram_dict:
        anagram_dict[key].append(w)
    else:
        anagram_dict[key] = [w]
Now, entries whose list has only one item aren't anagrams, so:
anagram_list = []
for v in anagram_dict.itervalues():  # itervalues(), not iteritems(): we want the word lists, not (key, list) pairs
    if len(v) > 1:
        anagram_list += v
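For reference, here is a compact sketch of the defaultdict(list) variant mentioned in the comment above, together with the final filtering step (using .values() so it runs on both Python 2 and 3; `anagram_groups` is just an illustrative name):
from collections import defaultdict

anagram_dict = defaultdict(list)
for w in words:
    anagram_dict["".join(sorted(w))].append(w)

# keep only the groups that actually contain more than one word
anagram_groups = [group for group in anagram_dict.values() if len(group) > 1]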