Finding anagrams of a specific word in a list - python

So, the problem is:
Given an array of m words and 1 other word, find all anagrams of that word in the array and print them.
I've successfully coded this one, but it seems rather slow: I check the length first, then compare with sorted() in a for loop, add the found anagrams to a new list, and finally print that list with another for loop. Do you have any faster algorithm? :)

I think that counting characters and comparing the counts will be faster, but I'm not sure. Just check it ;)
defaultdict will be helpful.
from collections import defaultdict as dd
def char_counter(word: str) -> dict:
    result = dd(int)
    for c in word:
        result[c] += 1
    return result
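A minimal sketch of how the counts might then be compared against a target word, reusing the char_counter() defined above (the target and words values are made-up examples):
target = "listen"
words = ["enlist", "google", "inlets", "banana", "silent"]  # made-up example data

target_counts = char_counter(target)  # uses char_counter() defined above
anagrams = [w for w in words
            if len(w) == len(target) and char_counter(w) == target_counts]
print(anagrams)  # ['enlist', 'inlets', 'silent']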

Related

Anagram generator from input in Python malfunction

I tried to code a simple generator of a list of anagrams from an input, but after I rewrote the code it only gives me two outputs. Here's the code:
import random
item=input("Name? ")
a=''
b=''
oo=0
while oo<=(len(item)*len(item)):
    a=''.join([str(y) for y in random.sample(item, len(item))]) #this line was found on this site
    b=''.join([str(w) for w in random.sample(item, len(item))]) #because in no way i had success in doing it by myself
    j=[]
    j.append(a) #During the loop it should add the anagrams generated
    j.append(b) #everytime the loop repeats itself
    oo=oo+1
j=list(set(j)) #To cancel duplicates
h=len(j)
f=0
while f<=(h-1):
    print(j[f])
But the output it gives is only one anagram, repeated forever.
As far as I can see, you don't increment f at the end.
Do it like this instead:
for item in j:
    print(item)
The other thing is, you overwrite j in every loop. Are you sure you wanted it like that?
There were several problems with your loop construct, including reinitializing your results every time. Try a simpler approach where things are already the type they want to be rather than constantly converting. And not everything you want to do requires a loop:
import random
item = input("Name? ")
length = len(item)
anagrams = set()
for repetitions in range(length**2):
    anagrams.add(''.join(random.sample(item, length)))
print("\n".join(anagrams))
However, these anagrams are not exhaustive (the random nature of this means some will be missed). And they're not really anagrams, as there's no dictionary to help generate actual words, just random letters.
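If exhaustiveness matters more than speed, itertools.permutations can enumerate every distinct arrangement of the letters instead of sampling randomly; a minimal sketch (still just letter arrangements, not dictionary words):
from itertools import permutations

item = input("Name? ")
# Every distinct arrangement of the letters; the set removes duplicates.
# Beware: the number of permutations grows factorially with len(item).
arrangements = {''.join(p) for p in permutations(item)}
print("\n".join(sorted(arrangements)))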

How can I get the value of multiple elements while searching a dict efficiently? (python)

How can I search any given txt file for anagrams and display the anagrams for every word in that file?
So far I can read the file, extract every single word and alphabetically sort each one. I've tried making two dicts: one containing the actual words in the text file as keys and the alphabetically sorted versions of the words as values, and another dict, set up the same way, for the dictionary file I have.
Using both these dictionaries I've been unable to find an efficient way to get the following output for every word in the input list:
'eerst': steer reste trees
If I loop through all the words in the given list and, inside each loop, loop over the dictionary, looking for and recording the anagrams, it takes too much time and is very inefficient. If I try the following:
for x in input_list:
    if x in dictionary:
        print dictionary[x]
I only get the first anagram of every word and nothing else.
If that made any sense, any suggestions would be immensely helpful.
I'm not sure if what I'm thinking of is what you're currently doing in your code, but I can't think of anything better:
from collections import defaultdict

words = 'dog god steer reste trees dog fred steer'.split() # or words from a file
unique_words = set(words)

anagram_dict = defaultdict(list)
for word in unique_words:
    key = "".join(sorted(word))
    anagram_dict[key].append(word)

for anagram_list in anagram_dict.values():
    if len(anagram_list) > 1:
        print(*anagram_list)
This will print (in arbitrary order):
god dog
steer trees reste
If you wanted to get the dictionary key value, you could make the final loop be over the items rather than the values of anagram_dict (and you could print out words that don't have any anagrams like 'fred' in the example above, if you wanted). Note that thanks to the set, duplicates of words are not sorted multiple times.
Running time should be O(M + U*N*log(N)) where M is the number of words, U is the number of unique ones and N is their average length. Unless you're sorting an organic chemistry textbook or something else that has lots of long words, it should be pretty close to linear in the length of the input.
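For instance, a sketch of the items() variant mentioned above, reusing the anagram_dict built in the previous snippet; it reproduces the 'eerst': steer reste trees style of output the question asked for:
for key, anagram_list in anagram_dict.items():
    if len(anagram_list) > 1:
        # key is the sorted-letters string, e.g. 'eerst' for steer/reste/trees
        print("'{}': {}".format(key, " ".join(anagram_list)))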
Here is another way to get anagrams using itertools.groupby
from itertools import groupby

words = list_of_words
for k, g in groupby(sorted(words, key=sorted), key=sorted):
    g = list(g)
    if len(g) > 1:
        print(g)
The big-O complexity isn't quite as good as the usual dictionary-of-lists approach, but it's still fairly efficient, and it sounds funny when you read it out loud.

Python: how to check if an item is in a list efficiently?

I have a list of strings (word-like), and, while I am parsing a text, I need to check if a word belongs to the group of words in my current list.
However, my input is pretty big (about 600 million lines), and checking if an element belongs to a list is an O(n) operation according to the Python documentation.
My code is something like:
words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)
As it takes too much time (days, actually), I wanted to improve the part that takes most of the time. I had a look at Python collections and, more precisely, at deque. However, it only gives O(1) access time to the head and the tail of a list, not to the middle.
Does someone have an idea about how to do this in a better way?
You might consider a trie or a DAWG or a database. There are several Python implementations of the same.
Here are some relative timings of a set vs. a list for you to consider:
import timeit
import random

with open('/usr/share/dict/words','r') as di: # UNIX 250k unique word list
    all_words_set = {line.strip() for line in di}

all_words_list = list(all_words_set) # slightly faster if this list is sorted...

test_list = [random.choice(all_words_list) for i in range(10000)]
test_set = set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print "list:", timeit.Timer(list_f).timeit(1), "secs"
print "set:", timeit.Timer(set_f).timeit(1), "secs"
print "mixed:", timeit.Timer(mix_f).timeit(1), "secs"
Prints:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
I.e., matching a set of 10,000 words against a set of 250,000 words is 17,085X faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392X faster than an unsorted list alone.
For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.
Conclusion: Use better data structures for 600 million lines of text!
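Applied to the code in the question, the smallest change along these lines might look like the sketch below (my_list and line are the question's own names; the sample values are made up):
my_list = ["dog", "cat", "bird"]                # placeholder word list
line = ["the", "dog", "chased", "the", "cat"]   # placeholder tokenized line

my_set = set(my_list)  # build the set once: O(len(my_list))
words_in_line = [word for word in line if word in my_set]  # each lookup is O(1) on average
print(words_in_line)   # ['dog', 'cat']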
I'm not clear on why you chose a list in the first place, but here are some alternatives:
Using a set() is likely a good idea. This is very fast; it's unordered, but sometimes that's exactly what's needed.
If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort:
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter:
http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
Depending on what you're doing, a trie might also be very good.
This uses a list comprehension:
words_in_line = [word for word in line if word in my_list]
which would be more efficient than the code you posted, though how much more for your huge data set is hard to know.
There are two improvements you can make here.
Back your word list with a hashtable. This will afford you O(1) performance when you are checking if a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
Use a more appropriate structure for your matching-word collection.
If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to that of lists.
If you don't need all the matches in memory at once, consider using a generator. A generator is used to iterate over matched values according to the logic you specify, but it only stores part of the resulting list in memory at a time. It may offer improved performance if you are experiencing I/O bottlenecks.
Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).
from itertools import chain

d = set(['a','b','c']) # Load our dictionary
f = open('c:\\input.txt','r')

# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)

# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)

for matched_word in matching_words_generator:
    # Do something with matched_word
    print matched_word

# We're reading the file during the above loop, so don't close it too early
f.close()
input.txt
a b dog cat
c dog poop
maybe b cat
dog
Output
a
b
c
b

list of lists of anagram in python [duplicate]

If my input is a list like this:
words = ['cat','act','wer','erw']
I want to make a list of lists of anagrams like this -
[['cat','act'],['wer','erw']]
I have tried to do something like this:
[[w1 for w in words if w!=w1 and sorted(w1)==sorted(w)] for w1 in words]
but it doesn't work. The output was:
[['cat'], ['act'], ['wer'], ['erw']]
In addition, I don't want to use any import (except string). What is the mistake?
Be aware that your original method is actually O(#words²) time and thus will not work on large datasets of perhaps more than 10000 words.
groupby one-liner:
One of the weirdest (yet most elegant) use cases I've ever seen for itertools.groupby:
>>> [list(v) for k,v in groupby(sorted(words,key=sorted),sorted)]
[['cat', 'act'], ['wer', 'erw']]
defaultdict three-liner:
Using collections.defaultdict, you can do:
from collections import defaultdict

anagrams = defaultdict(list)
for w in words:
    anagrams[tuple(sorted(w))].append(w)
If doing it your original way without any imports, you can emulate collections.defaultdict as follows:
anagrams = {}
for w in words:
    key = tuple(sorted(w))
    anagrams.setdefault(key, []).append(w)
example:
>>> anagrams
{('e', 'r', 'w'): ['wer', 'erw'], ('a', 'c', 't'): ['cat', 'act']}
(Also written up in whi's answer.)
map-reduce:
This problem is also the poster child for map-reduce, where the reduction key you use is the sorted letters (or more efficiently, a hash). This will allow you to massively parallelize the problem.
If we assume the length of words is bounded, the groupby solution is O(#words log(#words)), while the hash solution is expected O(#words). In the unlikely event that word lengths are unbounded, sorting (O(length log(length)) per word) is less efficient than using an order-agnostic hash of the letters (O(length) per word). Sadly, collections.Counter is not hashable, so you'd have to write your own.
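Should you want to try that, a minimal sketch of an order-agnostic, hashable key is a frozenset of the letter counts (the words list is the question's example):
from collections import Counter, defaultdict

words = ['cat', 'act', 'wer', 'erw']

anagrams = defaultdict(list)
for w in words:
    # A frozenset of (letter, count) pairs is hashable, ignores letter order,
    # and is built in O(len(w)) time rather than O(len(w) log len(w)) for sorting.
    key = frozenset(Counter(w).items())
    anagrams[key].append(w)

print(list(anagrams.values()))  # [['cat', 'act'], ['wer', 'erw']] (order may vary)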
words = ['cat','act','wer','erw']

dic = {}
for w in words:
    k = ''.join(sorted(w))
    dic.setdefault(k, [])
    dic[k].append(w)

print dic.values()
This performs better: O(n).
You can find various solutions to anagrams of a single word at a time by googling. It is likely that there is a more efficient solver around than the obvious "search through all the words I know and see if they have the same letters".
Once you have one, you can put it into a function:
def anagrams(word):
    "return a list of all known anagrams of *word*"
Once you have that, generalising it to a list of words is trivial:
[anagrams(word) for word in words]
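A hedged sketch of what such a function could look like, using the sorted-letters index from the other answers and a made-up KNOWN_WORDS list standing in for a real solver (no imports needed):
KNOWN_WORDS = ['cat', 'act', 'tac', 'wer', 'erw', 'dog']  # made-up stand-in for a real word list

# Build the index once: sorted letters -> list of known words with those letters
_index = {}
for known in KNOWN_WORDS:
    _index.setdefault(''.join(sorted(known)), []).append(known)

def anagrams(word):
    "return a list of all known anagrams of *word*"
    return [w for w in _index.get(''.join(sorted(word)), []) if w != word]

print(anagrams('cat'))  # ['act', 'tac']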
This one should do the trick in the style you prefer:
[[w, w1] for w1 in words for w in words if w!=w1 and sorted(w1)==sorted(w)][::2]

Fastest way in Python to find a 'startswith' substring in a long sorted list of strings

I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.
I am writing an implementation of the game Ghost for MIT's Introduction to Programming, assignment 5.
As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").
Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:
def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)
I previously had:
wordlist = [w for w in wordlist if w.startswith(word_fragment)]
(from here) to narrow the list down to the list of valid words that start with that fragment and consider it a loss if wordlist is empty. The reason that I took this approach was that I (incorrectly, see below) thought that this would save time, as subsequent lookups would only have to search a smaller list.
It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:
newlist = []
for w in wordlist:
    if w[:len(word_fragment)] > word_fragment:
        # Take advantage of the fact that the list is sorted
        break
    if w.startswith(word_fragment):
        newlist.append(w)
return newlist
but that is about the same speed, which I thought may be because list comprehensions run as compiled code?
I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?
Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.
UPDATE:
I have since tested the below suggestions with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:
The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows:
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452
The overhead of creating a second list etc clearly took more time than searching the longer list. In hindsight, this was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst case scenario.
Thanks all for some excellent suggestions, and well done Peter for the best answer!!!
Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient since it doesn't necessarily force it to build the full list once it finds a match:
def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)
Note that the lack of square brackets is important for this to work.
However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:
from bisect import bisect_left

def word_exists(wordlist, word_fragment):
    try:
        return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
    except IndexError:
        return False # word_fragment is greater than all entries in wordlist
bisect_left runs in O(log(n)) so is going to be considerably faster for a large wordlist.
Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.
You're right that you can do this more efficiently given that the list is sorted.
I'm building off of Peter's answer, which returns a single element. I see that you want all the words that start with a given prefix. Here's how you do that:
from bisect import bisect_left

wordlist[bisect_left(wordlist, word_fragment):
         bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1))]
This returns the slice from your original sorted list.
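For instance, with a small made-up sorted wordlist, the two bisect calls bracket exactly the words sharing the prefix (note the chr(ord(...) + 1) trick assumes the fragment's last character is not the maximum possible character):
from bisect import bisect_left

wordlist = sorted(['apple', 'apply', 'apt', 'banana', 'band'])  # made-up example
word_fragment = 'app'

lo = bisect_left(wordlist, word_fragment)
# Bump the fragment's last character to get an exclusive upper bound for the prefix range
hi = bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1))
print(wordlist[lo:hi])  # ['apple', 'apply']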
As Peter suggested, I would use the bisect module, especially if you're reading from a large file of words.
If you really need speed, you could make a daemon (How do you create a daemon in Python?) that has a pre-processed data structure suited for the task.
I suggest you could use tries:
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries
There are many algorithms and data structures to index and search strings inside a text; some of them are included in the standard libraries, but not all of them. The trie data structure is a good example of one that isn't.
Let word be a single string and let dictionary be a large set of words. If we have a dictionary, and we need to know if a single word is inside the dictionary, the tries are a data structure that can help us. But you may be asking yourself, "Why use tries if sets and hash tables can do the same?" There are two main reasons:
The tries can insert and find strings in O(L) time (where L represents the length of a single word). This is much faster than a set, but it is a bit faster than a hash table.
The set and the hash tables can only find words in a dictionary that match exactly the single word that we are finding; the trie allows us to find words that have a single character different, a prefix in common, a character missing, etc.
The tries can be useful in TopCoder problems, but also have a great amount of applications in software engineering. For example, consider a web browser. Do you know how the web browser can autocomplete your text or show you many possibilities of the text that you could be writing? Yes, with the trie you can do it very fast. Do you know how an orthographic corrector can check that every word that you type is in a dictionary? Again, a trie. You can also use a trie for suggested corrections of the words that are present in the text but not in the dictionary.
An example would be:
start={'a':nodea,'b':nodeb,'c':nodec...}
nodea={'a':nodeaa,'b':nodeab,'c':nodeac...}
nodeb={'a':nodeba,'b':nodebb,'c':nodebc...}
etc..
Then if you want all the words starting with ab, you would just traverse start['a']['b'] and that would be all the words you want.
To build it, you could iterate through your wordlist and, for each word, iterate through the characters, adding a new defaultdict where required.
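A minimal sketch of that idea, using nested defaultdicts and a made-up word list; words_with_prefix() is a hypothetical helper that walks the subtree under a prefix:
from collections import defaultdict

def make_node():
    # Each node is a dict from character to child node, created on demand
    return defaultdict(make_node)

END = '$'  # hypothetical marker meaning "a complete word ends at this node"
words = ['ab', 'abc', 'bc', 'bcf']  # made-up example word list

root = make_node()
for word in words:
    node = root
    for ch in word:
        node = node[ch]
    node[END] = word

def words_with_prefix(prefix):
    # Hypothetical helper: walk down to the prefix node, then collect every word below it
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return found

print(words_with_prefix('ab'))  # ['ab', 'abc'] (order may vary)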
In the case of binary search (assuming wordlist is sorted), I'm thinking of something like this:
wordlist = "ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m"
fragment = "bc"
a, m, b = 0, 0, len(wordlist)-1
iterations = 0
while True:
if (a + b) / 2 == m: break # endless loop = nothing found
m = (a + b) / 2
iterations += 1
if wordlist[m].startswith(fragment): break # found word
if wordlist[m] > fragment >= wordlist[a]: a, b = a, m
elif wordlist[b] >= fragment >= wordlist[m]: a, b = m, b
if wordlist[m].startswith(fragment):
print wordlist[m], iterations
else:
print "Not found", iterations
It will find one matched word, or none. You will then have to look to the left and right of it to find other matched words. My algorithm might be incorrect; it's just a rough version of my thoughts.
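A hedged sketch of that left/right expansion, assuming the search above has landed on some matching index m (here m=3 for the same wordlist):
def expand_matches(wordlist, fragment, m):
    # Collect all words around index m that start with fragment;
    # assumes wordlist is sorted and wordlist[m] is a known match.
    lo = m
    while lo > 0 and wordlist[lo - 1].startswith(fragment):
        lo -= 1
    hi = m
    while hi + 1 < len(wordlist) and wordlist[hi + 1].startswith(fragment):
        hi += 1
    return list(wordlist[lo:hi + 1])

wordlist = ("ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m")
print(expand_matches(wordlist, "bc", 3))  # ['bc', 'bcf', 'bct']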
Here's my fastest way to narrow the list wordlist down to a list of valid words starting with a given fragment:
sect() is a generator function that uses Peter's excellent idea of employing bisect, plus the islice() function:
from bisect import bisect_left
from itertools import islice
from time import clock

A, B = [], []
iterations = 5
repetition = 10

with open('words.txt') as f:
    wordlist = f.read().split()
wordlist.sort()
print 'wordlist[0:10]==', wordlist[0:10]

def sect(wordlist, word_fragment):
    lgth = len(word_fragment)
    for w in islice(wordlist, bisect_left(wordlist, word_fragment), None):
        if w[0:lgth] == word_fragment:
            yield w
        else:
            break

def hooloo(wordlist, word_fragment):
    usque = len(word_fragment)
    for w in wordlist:
        if w[:usque] > word_fragment:
            break
        if w.startswith(word_fragment):
            yield w

for rep in xrange(repetition):
    te = clock()
    for i in xrange(iterations):
        newlistA = list(sect(wordlist, 'VEST'))
    A.append(clock() - te)

    te = clock()
    for i in xrange(iterations):
        newlistB = list(hooloo(wordlist, 'VEST'))
    B.append(clock() - te)

print '\niterations =', iterations, ' number of tries:', repetition, '\n'
print newlistA, '\n', min(A), '\n'
print newlistB, '\n', min(B), '\n'
result
wordlist[0:10]== ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']
iterations = 5 number of tries: 30
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.0286089433154
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.415578236899
sect() is 14.5 times faster than hooloo().
PS: I know timeit exists, but here, for such a result, clock() is fully sufficient.
Doing binary search in the list is not going to guarantee you anything. I am not sure how that would work either.
You have a list which is ordered; that is good news. The algorithmic complexity of both your cases is O(n), which is not bad: you just have to iterate through the whole wordlist once.
But in the second case, the performance (engineering performance) should be better, because you break as soon as you find that the remaining cases will not apply. Try a list where the 1st element is a match and the remaining 38000 - 1 elements do not match; you will see the second beats the first.
