If my input is a list like this:
words = ['cat','act','wer','erw']
I want to make a list of lists of anagrams like this -
[['cat','act'],['wer','erw']]
I have tried to do something like this:
[[w1 for w in words if w!=w1 and sorted(w1)==sorted(w)] for w1 in words]
but it doesn't work. The output was :
[['cat'], ['act'], ['wer'], ['erw']]
In addition, I don't want to use any import (except string). What is the mistake?
Be aware that your original method is actually O(#words²) time and thus will not scale to large datasets of more than perhaps 10,000 words.
groupby one-liner:
One of the most elegantly weird use cases I've ever seen for itertools.groupby:
>>> from itertools import groupby
>>> [list(v) for k,v in groupby(sorted(words,key=sorted),sorted)]
[['cat', 'act'], ['wer', 'erw']]
defaultdict three-liner:
Using collections.defaultdict, you can do:
from collections import defaultdict

anagrams = defaultdict(list)
for w in words:
    anagrams[tuple(sorted(w))].append(w)
If doing it your original way, without any imports, you can emulate collections.defaultdict as follows:
anagrams = {}
for w in words:
    key = tuple(sorted(w))
    anagrams.setdefault(key,[]).append(w)
example:
>>> anagrams
{('e', 'r', 'w'): ['wer', 'erw'], ('a', 'c', 't'): ['cat', 'act']}
(Also written up in whi's answer.)
map-reduce:
This problem is also the poster child for map-reduce, where the reduction key you use is the sorted letters (or more efficiently, a hash). This will allow you to massively parallelize the problem.
If we assume the length of words is bounded, the groupby solution is O(#words log(#words)), while the hash solution is expected O(#words). In the unlikely event that word lengths are unbounded, sorting (O(length log(length)) per word) is less efficient than using an order-agnostic hash of the letters (O(length) per word). Sadly, collections.Counter is not hashable, so you'd have to write your own.
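A minimal sketch of that order-agnostic key idea, using a frozenset of letter counts as the (hashable) reduction key rather than a hand-rolled hash; the names here are illustrative only:

from collections import Counter, defaultdict

words = ['cat', 'act', 'wer', 'erw']

groups = defaultdict(list)
for w in words:
    # a frozenset of (letter, count) pairs is hashable and order-agnostic,
    # so anagrams map to the same key without sorting each word
    groups[frozenset(Counter(w).items())].append(w)

print(list(groups.values()))  # [['cat', 'act'], ['wer', 'erw']]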
words = ['cat','act','wer','erw']
dic={}
for w in words:
    k=''.join(sorted(w))
    dic.setdefault(k,[])
    dic[k].append(w)
print dic.values()
This performs better: roughly O(n) in the number of words (assuming bounded word length).
You can find various solutions to anagrams of a single word at a time by googling. It is likely that there is a more efficient solver around than the obvious "search through all the words I know and see if they have the same letters".
Once you have one, you can put it into a function:
def anagrams(word):
"return a list of all known anagrams of *word*"
Once you have that, generalising it to a list of words is trivial:
[anagrams(word) for word in words]
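As a rough sketch only (word_list here is a hypothetical list of known words), anagrams() could be backed by a dict keyed on each word's sorted letters, built once up front:

from collections import defaultdict

word_list = ['cat', 'act', 'wer', 'erw', 'dog']  # hypothetical known-word list

_by_key = defaultdict(list)
for w in word_list:
    _by_key[''.join(sorted(w))].append(w)

def anagrams(word):
    "return a list of all known anagrams of *word*"
    return [w for w in _by_key[''.join(sorted(word))] if w != word]

print([anagrams(word) for word in ['cat', 'wer']])  # [['act'], ['erw']]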
This one should do the trick, in the style you prefer:
[[w, w1] for w1 in words for w in words if w!=w1 and sorted(w1)==sorted(w)][::2]
So, the problem is:
Given an array of m words and 1 other word, find all anagrams of that word in the array and print them.
I've successfully coded this one, but it seems rather slow (I've been using sorted() with a for loop, plus checking the length first). Found anagrams were added to a new array, and the list of anagrams was then printed with another for loop.
Do y'all have any faster algorithm? :)
I think that counting characters and comparing the counts will be faster, but I'm not sure. Just check it ;)
defaultdict will be helpful.
from collections import defaultdict as dd

def char_counter(word: str) -> dict:
    result = dd(int)
    for c in word:
        result[c] += 1
    return result
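A hedged sketch of how char_counter could then be used to pull anagrams out of a hypothetical candidates list, comparing counts instead of sorting:

def find_anagrams(word, candidates):
    # two words are anagrams exactly when their character counts match
    target = char_counter(word)
    return [c for c in candidates
            if c != word and len(c) == len(word) and char_counter(c) == target]

print(find_anagrams('steer', ['reste', 'trees', 'street']))  # ['reste', 'trees']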
I want to improve the performance of my code. I tried few ways following advice before, but the speed of my code is still slow. What can I do instead of trying the way I tried?
My code is here:
matched_word = []
for w in word_list:
    for str_ in dictionary:
        if str_ == w:
            matched_word.append(str_)
There are some points of reference here:
First, the length of word_list is 160,000, and the length of dictionary is about 200,000.
Second, I cannot use a set of word_list, because I want matched_word to be a list that keeps the duplicated words from word_list.
Third, the following code is still slow.
import collections
matched_word = collections.deque()
for w in dictionary:
    if w in word_list:
        matched_word.append(w)
Fourth, the following code is also still slow.
matched_word = [w for w in word_list if w in dictionary]
Thanks for your help.
(Thanks to everyone who advised before, too.)
You don't need to iterate over the dictionary; just check if w is a key. You are turning what should be an O(1) lookup into an O(n) scan.
matched_word = [w for w in word_list if w in dictionary]
I can not use a set of word_list because I want to make a list (= matched_word) including duplicated words (= the elements of word_list).
Due to how lists are implemented in Python, .append might take a relatively long time. A set is not an option due to the above requirement, but there is a structure in the Python standard library specially developed to allow rapid inserting at the end, namely collections.deque from the built-in collections module. Example usage:
import collections
matched_word = collections.deque()
for w in ["A","B","C","A","B"]:
    matched_word.append(w)
matched_word_list = list(matched_word)
print(matched_word_list)
output
['A', 'B', 'C', 'A', 'B']
I have list of strings like this:
words = ['hello', 'world', 'name', '1', '2018']
I am looking for the fastest way (Python 3.6) to detect a year "word" in the list. For example, "2018" is a year; "1" is not. Let's define the acceptable year range as 2000-2020.
Possible solution
Check if the word is a number ('2018'.isdigit()), then convert it to int and check whether it is in the valid range.
What is the fastest way to do this in Python?
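For reference, the approach sketched above could look like this:

words = ['hello', 'world', 'name', '1', '2018']

# keep only all-digit words whose numeric value is in the accepted range
years = [w for w in words if w.isdigit() and 2000 <= int(w) <= 2020]
print(years)  # ['2018']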
You can build a set of your valid years (as strings). Then loop through each of the words you want to test to check if it is a valid year:
words = ['hello', 'world', 'name', '1', '2018']
valid_years = {str(x) for x in range(2000,2021)}
for word in words:
    if word in valid_years:
        print(word)
As Martijn Pieters mentioned in the comments, sets are the fastest solution for membership testing, with O(1) complexity:
Sets let you test for membership in O(1) time; using a list has a linear O(length_of_list) cost.
EDIT:
As you can see in the comments, there are a lot of different ways of generating the set of valid_years; as long as your data structure is a set, you will have the fastest way of doing what you want.
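For instance (equivalent to the comprehension above), the same set could also be built with map:

valid_years = set(map(str, range(2000, 2021)))  # {'2000', '2001', ..., '2020'}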
You can read more here:
List comprehension
Sets
Complexities for different Python data structures (so you can understand which data structures in Python are quicker for specific operations)
Concatenate the list into one string with a special separator character, then use a regex to search.
For example:
import re

word_tmp = " ".join(words)
re.search(r"\b20(?:[01]\d|20)\b", word_tmp)  # matches years 2000-2020
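If you want every year-like word rather than just the first hit, re.findall on the same joined string works the same way:

print(re.findall(r"\b20(?:[01]\d|20)\b", word_tmp))  # ['2018'] for the example words above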
How can I search any given txt file for anagrams and display the anagrams for every word in that file?
So far I can read the file, extract every single word, and alphabetically sort every single word. I've tried making two dicts: one dict containing the actual words in the text file as keys and the alphabetically sorted versions of the words as values, and another dict of the dictionary file I have that is set up the same way.
Using both these dictionaries I've been unable to find an efficient way to get the following output for every word in the input list:
'eerst': steer reste trees
If I try to loop through all the words in the given list, and inside each loop, loop inside the dictionary, looking and recording the anagrams, it takes too much time and is very inefficient. If I try the following:
for x in input_list:
    if x in dictionary:
        print dictionary[x]
I only get the first anagram of every word and nothing else.
If that made any sense, any suggestions would be immensely helpful.
I'm not sure if what I'm thinking of is what you're currently doing in your code, but I can't think of anything better:
from collections import defaultdict
words = 'dog god steer reste trees dog fred steer'.split() # or words from a file
unique_words = set(words)
anagram_dict = defaultdict(list)
for word in unique_words:
    key = "".join(sorted(word))
    anagram_dict[key].append(word)

for anagram_list in anagram_dict.values():
    if len(anagram_list) > 1:
        print(*anagram_list)
This will print (in arbitrary order):
god dog
steer trees reste
If you wanted to get the dictionary key as well, you could make the final loop run over the items rather than the values of anagram_dict (and you could also print out words that don't have any anagrams, like 'fred' in the example above, if you wanted). Note that thanks to the set, duplicates of words are not sorted multiple times.
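For example, something like this would reproduce the 'eerst': steer reste trees style of output:

for key, anagram_list in anagram_dict.items():
    if len(anagram_list) > 1:
        print("'{}': {}".format(key, " ".join(anagram_list)))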
Running time should be O(M + U*N*log(N)) where M is the number of words, U is the number of unique ones and N is their average length. Unless you're sorting an organic chemistry textbook or something else that has lots of long words, it should be pretty close to linear in the length of the input.
Here is another way to get anagrams using itertools.groupby
from itertools import groupby
words = list_of_words
for k, g in groupby(sorted(words, key=sorted), key=sorted):
    g = list(g)
    if len(g) > 1:
        print(g)
The big-O complexity isn't quite as good as the usual dictionary-of-lists approach, but it's still fairly efficient, and it sounds funny when you read it out loud.
I have a list of strings (word-like), and while I am parsing a text, I need to check if a word belongs to the group of words in my current list.
However, my input is pretty big (about 600 million lines), and checking if an element belongs to a list is an O(n) operation according to the Python documentation.
My code is something like:
words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)
As it takes too much time (days, actually), I wanted to improve the part which is taking most of the time. I had a look at Python collections, and more precisely at deque. However, it only gives O(1) access to the head and the tail of a list, not to the middle.
Does someone have an idea of how to do that in a better way?
You might consider a trie or a DAWG or a database. There are several Python implementations of the same.
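For illustration only (a minimal sketch, not one of those existing implementations), a dict-based trie for membership testing might look like this:

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node

trie = Trie()
for w in ['cat', 'act', 'dog']:
    trie.add(w)

print('cat' in trie, 'ca' in trie)  # True False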
Here are some relative timings of a set vs. a list for you to consider:
import timeit
import random
with open('/usr/share/dict/words','r') as di: # UNIX 250k unique word list
    all_words_set={line.strip() for line in di}

all_words_list=list(all_words_set) # slightly faster if this list is sorted...

test_list=[random.choice(all_words_list) for i in range(10000)]
test_set=set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set:
            count+=1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list:
            count+=1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count+=1
    return count
print "list:", timeit.Timer(list_f).timeit(1),"secs"
print "set:", timeit.Timer(set_f).timeit(1),"secs"
print "mixed:", timeit.Timer(mix_f).timeit(1),"secs"
Prints:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
I.e., matching a set of 10,000 words against a set of 250,000 words is 17,085× faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392× faster than an unsorted list alone.
For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.
Conclusion: Use better data structures for 600 million lines of text!
I'm not clear on why you chose a list in the first place, but here are some alternatives:
Using a set() is likely a good idea. It is very fast, though unordered; sometimes that's exactly what's needed.
If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort:
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter:
http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
Depending on what you're doing, a trie might also be very good.
This uses a list comprehension:
words_in_line = [word for word in line if word in my_list]
which would be more efficient than the code you posted, though how much more for your huge data set is hard to know.
There are two improvements you can make here.
Back your word list with a hashtable. This will afford you O(1) performance when you are checking if a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
Using a more appropriate structure for your matching-word collection.
If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to lists.
If you don't need all the matches in memory at once, consider using a generator. A generator yields matched values according to the logic you specify, without building the whole result list in memory at once. It may offer improved performance if you are experiencing I/O bottlenecks.
Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).
from itertools import chain
d = set(['a','b','c']) # Load our dictionary
f = open('c:\\input.txt','r')
# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)
# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)
for matched_word in matching_words_generator:
    # Do something with matched_word
    print matched_word
# We're reading the file during the above loop, so don't close it too early
f.close()
input.txt
a b dog cat
c dog poop
maybe b cat
dog
Output
a
b
c
b