Anagram Search Algorithm Comparison in Python [Tutorial Homework] - python

I am writing a simple anagram search in Python using defaultdict: I create a hash of each word to use as the key, then go through the dictionary afterwards and print out any key with more than a single value.
I originally created the hash as a sorted string:
from collections import defaultdict

aSorter = defaultdict(list)

def createHashFromFile(fileName):
    with open(fileName) as fileObj:
        for line in fileObj:
            line = line.lower()
            aHash = "".join(sorted(line.strip()))
            aSorter[aHash].append(line.strip())
However, because the sorted() function was said to be O(n^2), it was suggested that I create the hash through prime factorization instead. I created a dictionary that maps every lowercase letter to a prime, and then did:
def keyHash(word):
    mulValue = 1
    for letter in word:
        letter = letter.lower()
        mulValue = mulValue * primeDict[letter]
    return mulValue
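(The post doesn't show primeDict; here is a minimal sketch of what it presumably looks like. The specific prime assignment is my assumption, not from the original post.)

import string

# Hypothetical reconstruction: map each lowercase letter to a distinct prime.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
primeDict = dict(zip(string.ascii_lowercase, primes))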
On 300k words, the string hash runs in 0.75s and the prime hash runs in 1s. I've been reading up on this, but I'm unable to work out what I've missed or why the prime version runs slower.
The homework itself is already complete, but I want to understand why this happens and what I am missing.

There's a whole bunch of factors going on here:
sorted is O(n log n), not O(n^2) (Python's Timsort is O(n log n) even in the worst case, and worst-case sort behaviour is almost never relevant in real programs anyway).
multiplying primes together is a clever trick, but while it takes only O(n) multiplications, multiplying a big number N by a small factor costs O(log N) rather than O(1) (you have to walk through all O(log N) digits of the bignum). Since keyHash(s) grows to O(len(s)) digits, the i-th multiplication costs O(i), so the prime technique is not O(n) either; it works out to roughly O(n^2) digit operations, though with a small constant because many letters' worth of bits fit in one machine word.
n is small, so implementation details are going to matter a whole lot more than complexity.
sorted is built-in and written in C. The implementation has been tuned over many years. Your prime multiplication code is written in Python.
You don't say in your question how you performed the timing. It's very easy to get this wrong, by, for example, timing the whole program rather than a micro-benchmark. Given the closeness of the results, I expect you've made such an error, but it's a guess.
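For instance, a micro-benchmark along these lines (the test word and the prime table are my assumptions) would time just the key-building step, independent of file I/O and dictionary inserts:

import timeit

setup = '''
import string
word = "benchmarking"
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
primeDict = dict(zip(string.ascii_lowercase, primes))

def keyHash(word):
    mulValue = 1
    for letter in word:
        mulValue = mulValue * primeDict[letter.lower()]
    return mulValue
'''

# Time 100,000 runs of just the hashing expression for each approach.
print(timeit.timeit('"".join(sorted(word))', setup=setup, number=100000))
print(timeit.timeit('keyHash(word)', setup=setup, number=100000))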

Related

How can I mark an index as 'used' when iterating over a list?

I will iterate over a list of integers, nums, multiple times, and each time an integer has been 'used' for something (it doesn't matter what), I want to mark its index as used, so that in future iterations I do not use this integer again.
Two questions:
My idea is to simply create a separate list marker = [1]*len(nums), and each time I use a number in nums, subtract 1 from the corresponding index in marker as a way to keep track of the numbers in nums I have used.
My first question is: is there a well-known, efficient way to do this? I believe this approach makes the space complexity O(n).
My other idea is to replace each entry in nums, like this: nums = [1,2,3,4] -> nums = [(1,1),(2,1),(3,1),(4,1)], and each time I use an integer in nums, subtract 1 from the second element of the corresponding pair to mark that it has been used. My second question is: am I right in understanding that this optimises the space complexity relative to idea 1 above, making it O(1)?
For reference, I am solving the following question: https://leetcode.com/contest/weekly-contest-256/problems/minimum-number-of-work-sessions-to-finish-the-tasks/
Where each entry in tasks needs to be used once.
I don't think there is a way to do it in O(1) space, although I believe using a boolean instead of an integer, or using a set, would be a cleaner solution.
No, the space complexity is still O(n). Think about it like this: let n be the size of the list. In the first method you mention, we store n extra things separately, so the space complexity is O(n). In the second method we also store n extra things; they are just stored as part of the same array. So the space complexity remains O(n).
Firstly, in both cases the space complexity comes out to be O(n). This is because nums itself uses O(n) space whether or not you keep a separate list to track element usage, so the space complexity can never come down to O(1).
However, here is a suggestion.
If you don't want to use a used element again, why not just remove it from the list?
Or, in case you don't want to disrupt the indexing, just change the number to -1.
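A minimal sketch of both suggestions (the example values are mine):

nums = [7, 3, 5, 3]

# Suggestion 1: remove a used element outright (later indices shift).
nums.remove(5)         # nums is now [7, 3, 3]

# Suggestion 2: overwrite a used element with a sentinel such as -1,
# so the other indices stay put.
nums = [7, 3, 5, 3]
nums[2] = -1           # mark index 2 as used; nums is now [7, 3, -1, 3]

total = 0
for n in nums:
    if n == -1:        # skip anything already used
        continue
    total += n
print(total)           # 13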

Reverse string time and space complexity

I have written two different Python functions to reverse a given string, but I couldn't figure out which one of them is more efficient. Can someone point out the differences between these algorithms in terms of time and space complexity?
def reverse_1(s):
    result = ""
    for i in s:
        result = i + result
    return result

def reverse_2(s):
    return s[::-1]
There are already some solutions out there, but I couldn't find the time and space complexity for them. I would like to know how much space s[::-1] takes.
Without even trying to benchmark it (you can do that easily), reverse_1 will be dead slow because of several things:
an explicit Python-level loop over the characters
constantly prepending a character to a string, creating a new copy each time.
So: slow because of the interpreted loop, O(n^2) time complexity because of the repeated string copies, and O(n) space because it creates temporary strings (which are hopefully garbage collected as the loop runs).
On the other hand s[::-1]:
doesn't use a visible loop
returns a string without the need to convert from/to list
uses compiled code from the Python runtime
So you cannot beat it in terms of time & space complexity and speed.
If you want an alternative you can use:
''.join(reversed(s))
but that will be slower than s[::-1] (it has to create a list so that join can build a string back from it). It's useful when transformations other than reversing are required, though.
Note that unlike in C or C++ (as far as the analogy holds for strings), it is not possible to reverse a Python string in O(1) space because strings are immutable: you need memory for a second copy, since string operations cannot be done in place. (You can reverse a list of characters in place, but the str <=> list conversions use memory anyway.)
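A quick micro-benchmark (my own sketch, with an arbitrary 1000-character test string) makes the gap concrete:

import timeit

setup = '''
s = "abcdefghij" * 100  # arbitrary 1000-character test string

def reverse_1(s):
    result = ""
    for c in s:
        result = c + result
    return result
'''

# Time 10,000 calls of each approach.
print(timeit.timeit('reverse_1(s)', setup=setup, number=10000))
print(timeit.timeit('s[::-1]', setup=setup, number=10000))
print(timeit.timeit('"".join(reversed(s))', setup=setup, number=10000))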

Longest Increasing Subsequence code in O(N)?

Someone asked me a question
Find the longest alphabetically increasing or equal string
composed of those letters. Note that you are allowed to drop
unused characters.
So ghaaawxyzijbbbklccc returns aaabbbccc.
Is an O(n) solution possible?
and I implemented it in Python:
s = 'ghaaawxyzijbbbklccc'
lst = [[] for i in range(26)]
for ch in s:
    ml = 0
    for i in range(0, ord(ch) + 1 - ord('a')):
        if len(lst[i]) > len(lst[ml]):
            ml = i
    cpy = ''.join(lst[ml])
    lst[ord(ch) - ord('a')] = cpy + ch
ml = 0
for i in range(26):
    if len(lst[i]) > len(lst[ml]):
        ml = i
print lst[ml]
and the answer is 'aaabbbccc'.
I have tried some more examples and it all works!
As far as I can tell, the complexity of this code is O(N).
Let's take an example: suppose I have the string 'zzzz'.
The main loop will run 4 times and the internal loop will run 26 times per iteration, so in the worst case the code runs in
O(26*N + 26), where the trailing +26 accounts for the final pass over all 26 buckets.
So is O(N) acceptable?
Now my questions are:
1. Does my code run in O(N)? (my code at ideone)
2. If it runs in O(N), why use the O(N^2) DP solution?
3. Is it better than my friend's code?
4. What are the limitations of this code?
1. It's O(N).
2. 'Why use the O(N^2) DP?': You don't need to for this problem. Note, though, that you take advantage of the fact that your sequence tokens (letters) are finite, so you can set up a list to hold all 26 possible values, and you need only look for the longest member of that list, an O(1) operation. A more generalised solution for sequences with an arbitrary number of ordered tokens can be done in O(N log N).
3. Your friend's code is basically the same, just mapping the letters to numbers, and their list for the 26 places holds 26 numbers for letter counts; they don't need to do either of those things. Conceptually, though, it's the same thing: holding a list per letter.
"Better" is a matter of opinion. Although it has the same asymptotic complexity, the constant terms may be different, so one may execute faster than the other. Also, in terms of storage, one may use very slightly more memory than the other. With such low n - judging which is more readable may be more important than the absolute performance of either algorithm. I'm not going to make a judgement.
You might notice a slight difference where the "winning" sequence is a tie. For instance - on the test string edxeducation that you have there - your implementation returns ddin whereas your friend's returns ddio. Both seem valid to me - without a rule to break such ties.
4. The major limitation of this code is that it can only cope with sequences composed entirely of letters of a single case. You could extend it to cope with upper and lower case letters, either treating them the same, or using an ordering where all lower-case letters are "less than" all upper-case letters, or something similar. This is just extending the finite set of tokens that it can cope with.
To generalise this limitation: the code will only cope with finite sets of sequence tokens, as noted in 2. above. Also, there is no error handling, so if you pass in a string with, say, digits or punctuation, it will fail.
This is a variation of the Longest Increasing Subsequence.
The difference is that your elements are bounded, since they can only run from 'a' to 'z'. Your algorithm is indeed O(N). O(N log N) is possible, for example using the algorithm from the link above. The bound on the number of possible elements turns this into O(N).
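For reference, here is a sketch of the generalised O(N log N) approach both answers mention, using the standard patience-sorting technique with bisect. It computes only the length of the longest non-decreasing subsequence and is not the code from the question:

import bisect

def longest_nondecreasing_length(seq):
    # tails[k] holds the smallest possible last element of a
    # non-decreasing subsequence of length k + 1 seen so far.
    tails = []
    for x in seq:
        # bisect_right (rather than bisect_left) permits equal elements,
        # matching the "increasing or equal" requirement.
        pos = bisect.bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

print(longest_nondecreasing_length('ghaaawxyzijbbbklccc'))  # 9, i.e. len('aaabbbccc')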

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words and another of character combinations. What would be the fastest way to return only the combinations that don't appear in any of the words?
I've tried to make it as streamlined as possible, but it's still very slow when the combinations are 3 characters long (it goes up to 290 seconds for 4 characters; I'm not even going to try 5).
Here's some example code. Currently I'm joining all the words into one string and then searching that string for each combination in the list.
#Sample of stuff
allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWords = ["testing", "accurate"]

#Do the calculations
allWordsJoined = ",".join(allWords)
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
set.intersection seems like a very useful method if you need to match whole strings, so surely there must be something similar for searching for substrings.
The first thing that comes to mind is that you can optimize the lookup by checking the current combination against combinations that are already known to be invalid: if ab is invalid, then every longer combination starting with ab will be invalid too, and there's no point checking those (see the sketch at the end of this answer).
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but fewer memory allocations can be a small boost on a real data run.
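Here is a sketch of the prefix-pruning idea from the first paragraph (the sample data and variable names are mine):

allCombinations = ["a", "aa", "ab", "ac", "ad", "aaa", "aba"]  # sample; the real list is larger
allWords = ["testing", "accurate"]
allWordsJoined = ",".join(allWords)

invalidCombinations = set()
# Process shorter combinations first, so a combination's prefix has
# already been classified by the time we reach it.
for combo in sorted(allCombinations, key=len):
    if combo[:-1] in invalidCombinations or combo not in allWordsJoined:
        invalidCombinations.add(combo)
print(invalidCombinations)  # {'aa', 'ab', 'ad', 'aaa', 'aba'} (in some order)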
Checking whether a set contains an item is O(1). You would still have to iterate through your list of combinations to compare against your original set of words (with some exceptions: if the text doesn't contain "a", it can't contain any longer combination that includes "a"; you could use a tree-like data structure such as a trie for this).
You shouldn't convert your word list to a string, but rather to a set. Then you should get O(N), where N is the number of combinations you check.
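One way to read that suggestion (my own sketch, not the answerer's code): precompute every substring of the words up to the maximum combination length, then each combination lookup is a constant-time set membership test:

allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWords = ["testing", "accurate"]

k = max(len(c) for c in allCombinations)
# Every substring of every word, up to length k, computed once.
substrings = set()
for word in allWords:
    for i in range(len(word)):
        for j in range(i + 1, min(i + k, len(word)) + 1):
            substrings.add(word[i:j])

invalidCombinations = {c for c in allCombinations if c not in substrings}
print(invalidCombinations)  # {'aa', 'ab', 'ad'}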
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

Memory error while solving an anagram

I am trying to solve the below question:
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once; e.g., orchestra = carthorse. Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, write a program that finds the sets of words that share the same characters that contain the most words in them.
It's failing even when I read just 1000 bytes of the file. Also, a new list is created on every iteration, so why does Python keep the old list in memory? The line
l = list(map(''.join, itertools.permutations(i)))
gives me:
MemoryError
Here's my code:
import itertools

def anagram():
    f = open('unixdict.txt')
    f2 = open('result_anagram.txt', 'w')
    words = f.read(1000).split('\n')
    for i in words:
        l = []
        l = list(map(''.join, itertools.permutations(i)))
        l.remove(i)
        for anagram in l:
            if l == i:
                f2.write(i + "\n")
    return True

anagram()
I changed the above code as per the suggestion, but I'm still getting the memory error:
import itertools

def anagram():
    f = open('unixdict.txt')
    f2 = open('result_anagram.txt', 'w')
    words = set(line.rstrip('\n') for line in f)
    for i in words:
        l = map(''.join, itertools.permutations(i))
        l = (x for x in l if x != i)
        for anagram in l:
            if anagram in words:
                f2.write(i + "\n")
    return True

anagram()
MemoryError
[Finished in 22.2s]
This program is going to be horribly inefficient no matter what you do.
But you can fix this MemoryError so it'll just take forever to run instead of failing.
First, note that a 12-letter word has 479,001,600 permutations. Storing all of those in memory is going to take more than 2GB of memory. So, how do you solve that? Just don't store them all in memory. Leave the iterator as an iterator instead of making a list, and then you'll only have to fit one at a time, instead of all of them.
There's one problem here: You're actually using that list in the if l==i: line. But clearly that's a mistake. There's no way that a list of strings can ever equal a single string. You might as well replace that line with raise TypeError, at which point you can just replace the whole loop and fail a whole lot faster. :)
I think what you wanted there is if anagram in words:. In which case you have no need for l, except for in the for loop, which means you can safely leave it as a lazy iterator:
for i in words:
    l = map(''.join, itertools.permutations(i))
    l = (x for x in l if x != i)
    for anagram in l:
        if anagram in words:
            f2.write(i + "\n")
I'm assuming Python 3.x here, since otherwise the list call was completely unnecessary. If you're using 2.x, replace that map with itertools.imap.
As a side note, f.read(1000) is usually going to grab part of an extra word at the end, leaving the leftover part for the next read. Try readlines: it's pointless with no argument (just iterate over the file instead), but with an argument it's very useful:
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
So, f.readlines(1000) will let you read buffers of about 1K at a time, without getting partial lines. Of course now, instead of having to split on newlines, you have to rstrip them:
words = [line.rstrip('\n') for line in f.readlines(1000)]
However, you've got another problem. If you're only reading about 100 words at a time, the chances of finding an anagram are pretty slim. For example, orchestra is not going to be anywhere near carthorse in the dictionary, so there's no way to find it unless you remember the entire file. But that should be fine; a typical Unix dictionary like web2 has around 200K lines; you can easily read that into memory and keep it around as a set without making even a dent on your 2GB. So:
words = set(line.rstrip('\n') for line in f)
Also, note that you're trying to print out every word in the dictionary that has an anagram (multiple times, if it has multiple anagrams). Even with an efficient algorithm, that's going to take a long time—and spew out more data than you could possibly want. A more useful program might be one that takes an input word (e.g., via input or sys.argv[1]) and outputs just the anagrams of that word.
Finally:
Even after using l as a generator, it is taking too much time, though it is no longer failing with a MemoryError. Can you explain the importance of words being a set rather than a list? [Finished in 137.4s] just for 200 bytes. You mentioned it before, but how does using words as a set overcome this?
As I said at the top, "This program is going to be horribly inefficient no matter what you do."
In order to find the anagrams of a 12-letter word, you're going through 479 million permutations, and checking each one against a dictionary of about 200 thousand words, so that's 479M * 200K = 95 trillion checks, for each word. There are two ways to improve this, the first involving using the right data structures for the job, and the second involving the right algorithms for the job.
Changing the collection of things to iterate over from a list into a generator (a lazy iterable) turns something that took linear space (479M strings) into something that takes constant space (some fixed-size iterator state, plus one string at a time). Similarly, changing the collection of words to check against from a list into a set turns something that takes linear time (comparing a string against every element in the list) into something that takes constant time (hashing a string, then seeing if there's anything in the set with that hash value). So, this gets rid of the * 200K part of your problem.
But you've still got the 479M part of the problem. And you can't make that go away with a better data structure. Instead, you have to rethink the problem. How can you check whether any permutation of a word matches any other words, without trying all the permutations?
Well, some permutation of the word X matches the word Y if and only if X and Y have the same letters. It doesn't matter what order the letters in X were in; if the collection of letters is the same, there is at least one matching permutation (or exactly one, depending on how you count duplicate letters), and if not, there are exactly 0. So, instead of iterating through all the permutations of the word you're looking up, just look up its collection of letters. Duplicates do matter, though, so you can't use a plain set here. You could use some kind of multiset (collections.Counter)… or, with very little loss in efficiency and a big gain in simplicity, you could just sort the letters. After all, if two words have the same letters in some arbitrary order, they have the same letters in the same order when they're both sorted.
Of course you need to know which words are anagrams, not just that there is an anagram, so you can't just look it up in a set of letter sets, you have to look it up in a dictionary that maps letter sets to words. For example, something like this:
lettersets = collections.defaultdict(set)
for word in words:
    lettersets[''.join(sorted(word))].add(word)
So now, to look up the anagrams for a word, all you have to do is:
anagrams = lettersets[''.join(sorted(word))]
Not only is that simple and readable, it's also constant-time.
And if you really want to print out the massive list of all anagrams of all words… well, that's easy too:
for _, words in lettersets.items():
    for word in words:
        print('{} is an anagram of {}'.format(word, ', '.join(words - {word})))
Now, instead of taking 479M*200K time to find anagrams for one word, or 479M*200K*200K time to find all anagrams for all words, it takes constant time to find anagrams for one word, or 200K time to find all anagrams for all words. (Of course there is 200K setup time added to the start to create the mapping, but spending 200K time up-front to save 200K, much less 479M*200K, time for each lookup is an obvious win.)
Things get a little trickier when you want to, e.g., find partial anagrams, or sentence anagrams, but you want to follow the same basic principles: find data structures that let you do things in constant or logarithmic time instead of linear or worse, and find algorithms that don't require you to brute-force your way through an exponential or factorial number of candidates.
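Putting those pieces together, a minimal end-to-end version of the suggested look-up-one-word program might look like this (the dictionary file name and the argument handling are assumptions):

import collections
import sys

def build_lettersets(path):
    # Map each word's sorted letters to the set of words with those letters.
    lettersets = collections.defaultdict(set)
    with open(path) as f:
        for line in f:
            word = line.rstrip('\n')
            lettersets[''.join(sorted(word))].add(word)
    return lettersets

if __name__ == '__main__':
    word = sys.argv[1]
    lettersets = build_lettersets('unixdict.txt')  # assumed local dictionary file
    anagrams = lettersets[''.join(sorted(word))] - {word}
    print(', '.join(sorted(anagrams)) or 'no anagrams found')

Hypothetical usage: python anagrams.py orchestra.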
import urllib

def anagram():
    f = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt')
    words = f.read().split('\n')
    d = {''.join(sorted(x)): [] for x in words}  # create dict with an empty list for every letter set
    for x in words:
        d[''.join(sorted(x))].append(x)
    max_len = max(len(v) for k, v in d.iteritems())
    for k, v in d.iteritems():
        if len(v) >= max_len:
            print v

anagram()
Output:
['abel', 'able', 'bale', 'bela', 'elba']
['alger', 'glare', 'lager', 'large', 'regal']
['angel', 'angle', 'galen', 'glean', 'lange']
['evil', 'levi', 'live', 'veil', 'vile']
['caret', 'carte', 'cater', 'crate', 'trace']
['elan', 'lane', 'lean', 'lena', 'neal']
Finished in 5.7 secs
Here's a hint on solving the problem: two strings are anagrams of each other if they have the same collection of letters. You can sort the words (turning e.g. "orchestra" into "acehorrst"), then just check whether two words have the same sorted form. If they do, then the original words must have been anagrams of each other, since they have all the same letters (in a different order).
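For example (a tiny demonstration of the hint):

# Two words are anagrams exactly when their sorted letters match.
print(''.join(sorted('orchestra')))                 # acehorrst
print(sorted('orchestra') == sorted('carthorse'))   # True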
