I've got a piece of code that checks for anagrams in a long list of words. I'm trying to search through every word in my word list and find the other words that are anagrams of it. Some words should have more than one anagram in my list, yet I'm unable to find a way to join the anagrams that are found.
set(['biennials', 'fawn', 'unsupportable', 'jinrikishas', 'nunnery', 'deferment', 'surlinesss', 'sonja', 'bioko', 'devon', ...])
Since I've been using sets, the scan never reads to the end and returns only the shortest words, and I know there should be more. I've been trying to iterate with my key over my whole words set so I can find all the words that are anagrams of my key.
anagrams_found = {'diss': 'sids', 'abels': 'basel', 'adens': 'sedna', 'clot': 'colt', 'bellow': 'bowell', 'cds': 'dcs', 'doss': 'sods', 'als': 'las', 'abes': 'base', 'fir': 'fri', 'blot': 'bolt', 'ads': 'das', 'elm': 'mel', 'hops': 'shop', 'achoo': 'ochoa', ...}
I was wondering where my code has been cutting off short. It should be finding a lot more anagrams from my Linux word dictionary. Can anyone see what's wrong with my piece of code? Simply put, the program first iterates through every word I have, then checks whether the set contains my key. It appends the key to my dictionary so that later words can also match the same key. If there is already a key that I've added an anagram for, I update my dictionary by concatenating the old dict value with the new word (anagram).
anagram_list = dict()
words = set(words)
anagrams_found = []
for word in words:
    key = "".join(sorted([w for w in word]))
    if (key in words) and (key != word):
        anagrams_found.append(word)
        for name, anagram in anagram_list.iteritems():
            if anagram_list[name] == key:
                anagram = " ".join([anagram], anagram_found)
                anagram_list.update({key: anagram})
        anagram_list[key] = word
return anagram_list
All in all, this program is maybe not efficient. Can someone explain the shortcomings of my code?
anagram_dict = {}  # You could also use defaultdict(list) here
for w in words:
    key = "".join(sorted(w))
    if key in anagram_dict:
        anagram_dict[key].append(w)
    else:
        anagram_dict[key] = [w]
Now, entries that only have one item in their list aren't anagrams, so:
anagram_list = []
for v in anagram_dict.values():  # iterate over the value lists, not (key, value) pairs
    if len(v) > 1:
        anagram_list += v
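Putting the two steps together, a self-contained sketch (using a small hypothetical word list for illustration) might look like this:

```python
from collections import defaultdict

def find_anagrams(words):
    """Group words by their sorted-letter key; groups of 2+ are anagram sets."""
    anagram_dict = defaultdict(list)
    for w in words:
        anagram_dict["".join(sorted(w))].append(w)
    # Keep only the groups that contain more than one word
    return [group for group in anagram_dict.values() if len(group) > 1]

words = ["clot", "colt", "blot", "bolt", "fawn", "shop", "hops"]
print(find_anagrams(words))  # → [['clot', 'colt'], ['blot', 'bolt'], ['shop', 'hops']]
```

Grouping by the sorted-letter key makes the whole job a single O(N) pass over the word list, which avoids the per-word rescans in the original code.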
Related
I'm trying to use the list of shortened words to select & retrieve the corresponding full word identified by its initial sequence of characters:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']
Trying this regex match with a single shortened word:
import re
shortword = 'deve'
retrieved = list(filter(lambda i: re.match(r'{}'.format(shortword), i), fullwords))
print(*retrieved)
returns
developing
So the regex match works but the question is how to adapt the code to iterate through the shortwords list and retrieve the full words?
EDIT: The solution needs to preserve the order from the shortwords list.
Maybe using a dictionary
# Using a dictionary
test = 'appe is a deve arm'
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Build the dictionary
d = {}
for i in range(len(shortwords)):
    d[shortwords[i]] = fullwords[i]

# Apply the dictionary to the test string
res = " ".join(d.get(s, s) for s in test.split())

# Print the test data after the dictionary mapping
print(res)  # appearance is a developing armour
Here is one way to do it:
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']
# Dict comprehension
words = {short:full for short, full in zip(shortwords, fullwords)}
#Solving problem
keys = ['deve','arm','pony']
values = [words[key] for key in keys]
print(values)
This is a classic key-value problem. Use a dictionary for it, or consider pandas in the long term.
Your question text seems to indicate that you're looking for your shortwords at the start of each word. That should be easy then:
matched_words = [word for word in fullwords if any(word.startswith(shortword) for shortword in shortwords)]
If you'd like to regex this for some reason (it's unlikely to be faster), you could do that with a large alternation:
regex_alternation = '|'.join(re.escape(shortword) for shortword in shortwords)
matched_words = [word for word in fullwords if re.match(rf"^(?:{regex_alternation})", word)]
Alternately if your shortwords are always four characters, you could just slice the first four off:
shortwords = set(shortwords) # sets have O(1) lookups so this will save
# a significant amount of time if either shortwords
# or longwords is long
matched_words = [word for word in fullwords if word[:4] in shortwords]
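Note that the comprehensions above preserve the order of fullwords. If the result must instead follow the order of shortwords (per the question's edit), inverting the loops so the outer iteration runs over shortwords does the trick; a minimal sketch:

```python
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement',
             'kid', 'pony', 'treasure']

# Outer loop over shortwords, so results come out in shortwords order
retrieved = [full for short in shortwords
             for full in fullwords if full.startswith(short)]
print(retrieved)  # → ['appearance', 'kid', 'developing', 'colored', 'armour']
```

If a prefix matched several full words they would all appear, grouped under that prefix; with the four-character BIP-style prefixes discussed below, each prefix identifies exactly one word.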
This snippet has the functionality I wanted. It builds a regular expression pattern at each loop iteration in order to accommodate varying word-length parameters, and it maintains the original order of the wordroots list. In essence, it looks at each word in wordroots and fills out the full word from the dataset.

This is useful when working with the BIP-0039 word list, which contains words of 3-8 characters in length that are uniquely identifiable by their initial 4 characters. Recovery phrases are built by randomly selecting a sequence of words from the BIP-0039 list; order is important. Observed security practice is often to abbreviate each word to a maximum of its four initial characters. Here is code which would rebuild a recovery phrase from its abbreviation:
import re

wordroots = ['sun', 'sunk', 'sunn', 'suns']
dataset = ['sun', 'sunk', 'sunny', 'sunshine']

retrieved = []
for root in wordroots:
    # (exact match) or (match at beginning of word when root is 4+ characters, else exact match)
    pattern = r"(^" + root + "$|" + ("^" + root + "[a-zA-Z]+)" if len(root) >= 4 else "^" + root + "$)")
    retrieved.extend(filter(lambda i: re.match(pattern, i), dataset))
print(*retrieved)
Output:
sun sunk sunny sunshine
I have a python dictionary of words where I want to attach a single synonym to each of the items in the dictionary using Wordnet's synsets function. My current code picks up several synonyms but I only want to be able to narrow it down to just one word (i.e. the first in the set). Additionally, if there isn't an available synonym, I would like for it to just use the word as is in the dictionary.
dic = {w: [] for w in words if w not in stop_words}
for k, v in dic.items():
    for synset in wordnet.synsets(k):
        for lemma in synset.lemmas():
            v.append(lemma.name())
I'm fairly new at this so any help would be greatly appreciated. Thanks in advance.
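One way to narrow each entry to a single synonym, with the word itself as a fallback, is to take only the first element of the synonym list. A minimal sketch, using a hypothetical `get_synonyms` function as a stand-in for the WordNet lookup (in the real code, `wordnet.synsets(k)` and its lemmas would supply the synonym list):

```python
# Stand-in for the WordNet lookup: maps a word to its synonym list (hypothetical data)
def get_synonyms(word):
    synonyms = {'happy': ['felicitous', 'glad'], 'sad': ['unhappy']}
    return synonyms.get(word, [])

words = ['happy', 'sad', 'table']

# Take the first synonym if one exists, otherwise keep the word itself
dic = {w: (get_synonyms(w)[0] if get_synonyms(w) else w) for w in words}
print(dic)  # → {'happy': 'felicitous', 'sad': 'unhappy', 'table': 'table'}
```

With real WordNet data, the equivalent of `get_synonyms(w)` would be something like the flattened lemma names of `wordnet.synsets(w)`; taking index 0 picks the first one the library returns.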
I'm working on some kind of NLP task. I compare a dataframe of articles with input words. The main goal is to classify the text if a bunch of the words are found.
I've tried to extract the values of the dictionary, convert them into a list, and then apply stemming to it. The problem is that later I'll run another process to split and compare according to the keys. I think it's more practical to work directly on the dictionary.
search = {'Tecnology': ['computer', 'digital', 'sistem'], 'Economy': ['bank', 'money']}

words_list = list()
for key in search.keys():
    words_list.append(search[key])
search_values = [val for sublist in words_list for val in sublist]
search_values_stem = [stemmer.stem(word) for word in search_values]
I expect a stemmed dictionary to compare directly with the column of stemmed articles.
If I understood your question correctly, you are looking to apply stemming to the values of your dictionary (and not the keys), and in addition the values in your dictionary are all lists of strings.
The following code should do that:
def stemList(l):
    return [stemmer.stem(word) for word in l]

# your initial dictionary is called search (as in your example code)
# the following creates a new dictionary where stemming has been applied to the values
stemmedSearch = {}
for key in search:
    stemmedSearch[key] = stemList(search[key])
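The same transformation also fits in a single dict comprehension. A runnable sketch, with a trivial suffix-stripping function standing in for `stemmer.stem` (the original code takes its stemmer from NLTK):

```python
# Stand-in stemmer: strips a few common English suffixes (illustrative only)
def stem(word):
    for suffix in ('ing', 'al', 'er', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

search = {'Tecnology': ['computer', 'digital', 'sistem'], 'Economy': ['bank', 'money']}

# Stem every value list, leaving the keys untouched
stemmed = {key: [stem(w) for w in values] for key, values in search.items()}
print(stemmed)  # → {'Tecnology': ['comput', 'digit', 'sistem'], 'Economy': ['bank', 'money']}
```

Because the comprehension builds a new dictionary, the original `search` stays available for the later split-and-compare step the question mentions.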
I have to write a function that is given a text of concatenated words without spaces and a list containing both words that appear and words that do not appear in that text.
I have to create a tuple that contains a new list including only the words that are in the text, in order of appearance, together with the word that appears most often in the text. If two words appear the most number of times, the function will choose between them alphabetically (if the words appear like "b"=3, "c"=3, "a"=1, then it will choose "b").
Also I have to modify the original list so that it only includes the words that are not in the text without modifying its order.
For example if I have a
list=["tree","page","law","gel","sand"]
text="pagelawtreetreepagepagetree"
then the tuple will be
(["page","tree","law"], "page")
and list will become
list=["gel","sand"]
Now I have done this function but it's incredibly slow, can someone help?
ls = list
def es3(ls, text):
    d = []
    v = {}
    while text:
        for item in ls:
            if item in text[:len(item)]:
                text = text[len(item):]
                if item not in d:
                    d += [item]
                    v[item] = 1
                else:
                    v[item] += 1
    if text == "":
        p = sorted(v.items())
        f = max(p, key=lambda k: k[1])
        M = (d, f[0])
    for b in d:
        if b in ls:
            ls.remove(b)
    return M
In Python strings are immutable: if you modify them you create new objects. Object creation is time/memory inefficient, and almost all of the time it is better to use lists instead.
By creating a list of all possible k-length parts of the text, k being the (unique) lengths of the words you look for (3 and 4 in your list), you create all the splits that you could count, and filter out those that are not in your word set:
# all 3+4 length substrings as list - needs 48 lookups to clean it up to whats important
['pag', 'page', 'age', 'agel', 'gel', 'gela', 'ela', 'elaw', 'law', 'lawt', 'awt',
'awtr', 'wtr', 'wtre', 'tre', 'tree', 'ree', 'reet', 'eet', 'eetr', 'etr', 'etre',
'tre', 'tree', 'ree', 'reep', 'eep', 'eepa', 'epa', 'epag', 'pag', 'page', 'age',
'agep', 'gep', 'gepa', 'epa', 'epag', 'pag', 'page', 'age', 'aget', 'get', 'getr',
'etr', 'etre', 'tre', 'tree']
Using a set for "is A in B" checks makes the code faster as well: sets have O(1) lookup, while lists take longer the more elements are in them (worst case: n). So you eliminate from the k-length parts list all substrings that do not match any of the words you look for (e.g. 'eetr'):
# whats left after the list-comprehension including the filter criteria is done
['page', 'gel', 'law', 'tree', 'tree', 'page', 'page', 'tree']
For counting iterables I use collections.Counter, a specialized dictionary that counts things. Its most_common() method returns (key, count) tuples sorted by highest count first, which I format into a return value that matches your OP.
Here is one version that solves it while respecting overlapping results:
from collections import Counter

def findWordsInText(words, text):
    words = set(words)  # set should be faster for lookup
    lens = set(len(w) for w in words)
    # get all splits of len 3+4 (in this case) from text
    splitted = [text[i:i+ll] for i in range(len(text) - min(lens) + 1) for ll in lens
                if text[i:i+ll] in words]  # only keep what's in words
    # count them
    counted = Counter(splitted)
    # set-difference: words never found in the text
    not_in = words - set(splitted)
    # format as requested: list of words in order, most common word
    most_common = counted.most_common()
    ordered_in = ([w for w, _ in most_common], most_common[0][0])
    return list(not_in), ordered_in
words = ["tree","page","law","gel","sand"]
text = "pagelawtreetreepagepagetree"
not_in, found = findWordsInText(words,text)
print(not_in)
print(found)
Output:
['sand']
(['page', 'tree', 'gel', 'law'], 'page')
I'm working on the problem of printing all matching pairs of semordnilaps from a given alphabetically sorted list of words (or phrases) (assumed to be in lower case).
A semordnilap is defined as a word (or phrase) which spells a different word (or phrase) backwards. So 'top' ('pot' read backwards), 'avid' ('diva' read backwards), and 'animal' ('lamina' read backwards) are semordnilaps, as is 'semordnilap' itself because it's 'palindromes' read backwards, whereas 'tot', 'peep', 'radar' are palindromes (words which read the same backwards) but not semordnilaps. In this context a pair of words 'word1' and 'word2' match if 'word1' is 'word2' read backwards (and vice versa).
If the length of the input list is N, then the solution will obviously have complexity O(N(N-1)/2), because there are N(N-1)/2 different pairs that can be constructed. Also, if the list is alphabetically sorted, then it seems that in the worst case all N(N-1)/2 pairs must be examined to find all the matching pairs.
I was wondering if there is a more efficient way of doing this, than the straightforward way. Here is my code, currently.
import io

def semordnilaps_in_text_file(file_path):
    def pairup(alist):
        for elem1 in range(len(alist)):
            for elem2 in range(elem1 + 1, len(alist)):
                yield (alist[elem1], alist[elem2])
    def word_list(file_path):
        thelist = []
        with io.open(file_path, 'r', encoding='utf-8') as file:
            for line in file:
                thelist.append(line.strip())
        return thelist
    for word1, word2 in pairup(word_list(file_path)):
        if word1[::-1] == word2:
            print('{} {}'.format(word1, word2))
I tried this function with a list of (all lowercase) English words found here (containing 109,583 words), and after several minutes it managed to print the following 21 pairs before I interrupted it.
abut tuba
ac ca
ados soda
agar raga
ah ha
ajar raja
al la
am ma
an na
animal lamina
ante etna
ape epa
are era
ares sera
as sa
assam massa
ate eta
avid diva
aw wa
bad dab
bag gab
You just need to keep track of the words you've seen.
def pairup(alist):
    seen = set()
    for word in alist:
        if word not in seen:
            # Haven't seen this one yet
            if word[::-1] in seen:
                # But we've seen its reverse, so we found a pair
                yield (word, word[::-1])
            # Now we've seen it
            seen.add(word)
Subtlety: Adding the newly found word to seen at the end avoids triggering the yield if a palindrome is encountered. Conversely, if you also want to detect palindromes, add the word to seen before checking whether the reflection is already there.
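That palindrome-inclusive variant would look like this (a minimal sketch, with the duplicate guard moved before the `seen.add` so repeated input words still yield only once):

```python
def pairup_with_palindromes(alist):
    seen = set()
    for word in alist:
        if word in seen:
            continue              # skip duplicate input words
        seen.add(word)            # add BEFORE checking, so a palindrome matches itself
        if word[::-1] in seen:
            yield (word, word[::-1])

# Palindromes now pair with themselves; semordnilaps pair as before
print(list(pairup_with_palindromes(['abut', 'tot', 'tuba', 'radar'])))
# → [('tot', 'tot'), ('tuba', 'abut'), ('radar', 'radar')]
```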
Also: it is unnecessary to read the words into a list to use that function. You could just provide it with an iterable, such as a list comprehension:
for word, drow in pairup(line.strip().lower()
                         for line in io.open(filepath, 'r')):
    print('{} {}'.format(word, drow))
One thing that you can do is preprocess the words using a hash table. Semordnilap pairs have to contain the same letters, so just make a dictionary mapping sorted letters to words, like this:
opt => [pot, top, opt]
Then you just iterate over the lists and repeat your slower method. This works because it still uses your O(N^2) algorithm, but makes N much, much smaller by only comparing things that have the potential to be semordnilaps. You could use the same idea based only on length, where all words of length three go in one bucket. That would look like this:
3 => [pot, top, opt, cat, act, tac, art, tar, hop, ...]
However, this would be much slower than keying on the word's letter composition, because using only length you'd be comparing top, pot, and opt to all other three-letter words.
Here's some code that found 281 semordnilaps in under one second on my laptop:
#!/usr/bin/python
import collections

def xform(word):
    return ''.join(sorted(word))

wordmap = collections.defaultdict(list)
for line in open('wordsEn.txt', 'r'):
    word = line.rstrip()
    key = xform(word)
    wordmap[key].append(word)

for key, words in wordmap.items():
    for index1 in range(len(words)):
        for index2 in range(index1 + 1, len(words)):
            word1 = words[index1]
            word2 = words[index2]
            if word1[::-1] == word2:
                print(word1, ' ', word2)
Results are available from here.
It's probably worth noting that sorting the list of words doesn't really help you, because the semordnilaps are going to be scattered throughout the list.
You can use a dictionary here to access the words in O(1).
words = open('words.txt', 'r')
word_dict = {}  # dictionary to store all the words
for word in words:
    word = word.strip('\n')
    if word != word[::-1]:  # remove the palindromic words
        word_dict[word] = ''

for word in list(word_dict.keys()):  # snapshot the keys, since we delete during iteration
    try:
        word_dict[word] = word[::-1]
        # delete the semordnilap's partner from the dictionary
        del word_dict[word[::-1]]
    except KeyError:
        # if word has no semordnilap then remove it from the dictionary
        del word_dict[word]

# word_dict is the desired dictionary
print(word_dict, "\nTotal words:", len(word_dict))
I have used del to remove the unwanted words from the dictionary, thereby reducing the time complexity, and exception handling to access the words in O(1).
Hope it helps.