Get maximum value by arranging numbers of various digits - python

I was trying to write a function largest_number() that outputs the maximum value obtainable by arranging the order of a list of given numbers (e.g. 21, 2 should give 221; 543, 5432, 1 should give 54354321). This is a homework problem that I have tested with many cases without getting a wrong answer. However, the grading system keeps telling me that my output is wrong (without showing the output on its end). I suspect a small part of the code produces wrong output in some special cases, but I couldn't find it.
#Uses python3
#%%
import functools
def greater(str1, str2):
    if str1==str2:
        return 1
    for i in range(min(len(str1),len(str2))):
        if str1[i]>str2[i]:
            return 1
        if str1[i]<str2[i]:
            return -1
    if len(str1)>len(str2):
        if str1[len(str1)-len(str2)-1]>str1[0]:
            return 1
        else:
            return -1
    if len(str1)<len(str2):
        if str2[len(str2)-len(str1)-1]>str2[0]:
            return -1
        else:
            return 1
def largest_number(a):
    a=list(map(str,a))
    a_sorted=sorted(a,key=functools.cmp_to_key(greater),reverse=True)
    largest=int(''.join(a_sorted))
    return(largest)
#test cases:
largest_number([21,2])
largest_number([543,5432,1])

The logic in greater is not entirely correct where it deals with the case that one string is a left-substring of the other:
    if len(str1)>len(str2):
        if str1[len(str1)-len(str2)-1]>str1[0]:
            return 1
        else:
            return -1
Here you don't deal correctly with the case where str1[len(str1)-len(str2)-1] is equal to str1[0]. In that case you should iterate further, increasing the index in both strings (much like you did in the first part of the algorithm). So change the above code to:
    if len(str1)>len(str2):
        for i in range(len(str2),len(str1)):
            if str1[i]>str1[i-len(str2)]:
                return 1
            if str1[i]<str1[i-len(str2)]:
                return -1
... and do a similar change for the mirrored case.
However, there is a much easier way to approach this. In the greater function you could just concatenate the two strings in the two possible ways, and compare to see which of these two is the greater one:
def cmp(a, b): # This function was included in Python 2.x, but no longer in 3.x
    return (a > b) - (a < b)
def greater(str1, str2):
    return cmp(str1 + str2, str2 + str1)
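Putting the concatenation comparison together with the sort gives a complete function. The following is a minimal sketch, assuming the input is a list of non-negative integers; the helper name `concat_cmp` is made up for illustration and does not come from the question.

```python
import functools


def concat_cmp(str1, str2):
    """Return 1, 0 or -1 depending on which concatenation order is larger."""
    ab, ba = str1 + str2, str2 + str1
    # ab and ba have the same length, so comparing them as text
    # gives the same result as comparing them as numbers.
    return (ab > ba) - (ab < ba)


def largest_number(a):
    # Assumption: `a` holds non-negative integers.
    digits = sorted(map(str, a), key=functools.cmp_to_key(concat_cmp), reverse=True)
    return int("".join(digits))


print(largest_number([21, 2]))         # 221
print(largest_number([543, 5432, 1]))  # 54354321
```

Because the comparator only looks at the two concatenations, no special handling is needed when one string is a prefix of the other.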

You can solve that problem with the itertools module:
import itertools
import functools
# "my_list" is a list of strings.
def largest_number(my_list):
    answer = 0
    for i in itertools.permutations(my_list):
        x = int(functools.reduce(lambda x,y:x+y,i))
        if x > answer:
            answer = x
    return answer
This code works because itertools.permutations returns every possible ordering of the list. Note that the number of orderings grows factorially with the length of the list, so this is only practical for short lists.


Python noob question. Looping through all to find correct combination

Sorry for noob question. Just started learning coding with Python - beginners level.
Wasn't able to find in the net what I require.
I'm writing a function that loops through the entire list to find the combination 7, 8, 9, regardless of the input length and of where the combination sits, and returns True if it is found. I haven't been able to get the function right, and I'm not sure how to approach it at all.
Your help is much appreciated.
This is what I came up with so far (not working of course):
def _11(n):
    for loop in range(len(n)):
        if n[loop]==[7,8,9]:
            return True
        else:
            return False
print(_11([1000,10,11,34,67,89,334,5567,6534,765,2,3,5,6,112,7,8,9,11111]))
It always returns False. Tried with (*n) to no avail.
The answer offered by @Carson is entirely correct.
I offer this not really as an answer to the question but as an alternative and more efficient approach.
In OP's question he is looking for an occurrence of 3 consecutive values described by way of a list. Let's call that a triplet.
If we iterate over the input list one element at a time we create lots of triplets before comparing them.
However, we can make this more efficient by searching the input list for any occurrence of the first item in the target triplet. In that way we are likely to slice the input list far less often.
Here are two implementations with timings...
from timeit import timeit
def _11(n, t):
    offset = 0
    lt = len(t)
    m = len(n) - lt  # last index at which t can still start
    while offset <= m:
        try:
            # jump straight to the next occurrence of the first target item
            offset += n[offset:].index(t[0])
            if n[offset:offset+lt] == t:
                return True
            offset += 1
        except ValueError:
            break
    return False
def _11a(n, t):
    # + 1 so that a match at the very end of n is also tested
    for index in range(len(n) - len(t) + 1):
        if n[index:index + len(t)] == t:
            return True
    return False
n = [1000,10,11,34,67,89,334,5567,6534,765,2,3,5,6,112,7,8,9,11111]
t = [7, 8, 9]
for func in _11, _11a:
    print(func.__name__, timeit(lambda: func(n, t)))
Output:
_11 0.43439731000012216
_11a 1.8685798310000337
There are two mistakes in your code.
Indexing into a list returns one element, not several. When you write n[loop], you get a single value, not a list.
You shouldn't return false that early. Your code exits after the first step in the loop, but it should go through the entire loop before returning false.
Consider the following snippet:
def has_subarr(arr, subarr):
    """Tests if `subarr` exists in arr"""
    for i in range(len(arr)):
        if arr[i:i+len(subarr)] == subarr:
            return True
    return False
This code is more general than your example: it accepts the value to check for as another argument. Notice the use of : in the list access. This slice lets you take multiple elements of the list at once. Also notice how the return False is only reached once the entire loop has completed.
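One detail worth spelling out: a slice that runs past the end of a list is silently shortened rather than raising IndexError, which is why the loop can safely visit every index. A small sketch of that behaviour, with sample data made up for illustration:

```python
def has_subarr(arr, subarr):
    """Tests if `subarr` occurs as a contiguous run inside `arr`."""
    for i in range(len(arr)):
        # Near the end of arr the slice is simply shorter than subarr,
        # so the comparison is False instead of raising an error.
        if arr[i:i + len(subarr)] == subarr:
            return True
    return False


data = [1, 2, 7, 8, 9]
print(data[3:6])                    # [8, 9]
print(has_subarr(data, [7, 8, 9]))  # True
print(has_subarr(data, [8, 9, 7]))  # False
```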
First, n[loop] returns a single element, not a sublist. You should use a slice, n[loop:loop+3], instead. Slices that start too close to the end of the list are shorter than three elements and can never match, so the loop can stop early. The solution may be:
def _11(n):
    for loop in range(len(n)-2):
        if n[loop:loop+3]==[7,8,9]:
            return True
    else:
        # for/else: reached only when the loop ends without returning
        return False
print(_11([1000,10,11,34,67,89,334,5567,6534,765,2,3,5,6,112,7,8,9,11111]))
Your actual code returns during the first iteration, so you only test once. You must modify the indentation as in:
def _11(n):
    target = [7,8,9]
    # + 1 so that a match at the very end of n is also tested
    for index in range(len(n) - len(target) + 1):
        if n[index:index + len(target)] == target:
            return True
    return False
print(_11([1000,10,11,34,67,89,334,5567,6534,765,2,3,5,6,112,7,8,9,11111]))
You can try checking the str representation of the list:
import re
def _11(n):
    if re.search("(?<![0-9-.'])7, 8, 9(?![0-9.])",str(n)):
        return True
    return False
print(_11([27,8,9]))
The Output:
False

calculates the product of all elements in a tuple. For example, for (2,3,4) the result would be 2X3X4=24

numbers = (2,3,4)
def product(n):
    m = 1
    for i in n:
        m *= i
        return print(numbers[0],'x',numbers[1],'x',numbers[2],'=',m)
product(numbers)
This is what I wrote for this problem, but I don't know how to make the result look exactly like "2x3x4=24". Another question: if I add 5 to the tuple, it only shows "2x3x4=120"; I cannot get "2x3x4x5=120". Could anyone help me fix my code? Thanks.
There are two main issues with your code. The first is that your return statement is executed in the first iteration of your loop, but you want it to happen after the entire loop has finished. This happens because you have too much whitespace to the left of the return statement. The second is that you’re attempting to return the return value of a call to print. To fix this, ditch the return or, even better, return a string and then print that. Building the string with join instead of fixed indices also makes it work for tuples of any length:
numbers = (2,3,4)
def product(n):
    m = 1
    for i in n:
        m *= i
    # join handles any number of factors, e.g. (2,3,4,5) gives "2x3x4x5=120"
    return "x".join(map(str, n)) + f"={m}"
returned = product(numbers)
print(returned)
The answer linked in the comments points out there’s an even easier way to do this:
from math import prod
returned = prod((2,3,4))
print(returned)
And here's a bit of a kick-flip for fun:
from math import prod
numbers = (2,3,4)
print(*numbers, sep="x", end=f"={prod(numbers)}")

Given an integer, add operators between digits to get n and return list of correct answers

Here is the problem I'm trying to solve:
Given an int, ops, n, create a function(int, ops, n) and slot operators between digits of int to create equations that evaluates to n. Return a list of all possible answers. Importing functions is not allowed.
For example,
function(111111, '+-%*', 11) => [1*1+11/1-1 = 11, 1*1/1-1+11 =11, ...]
The question recommended using interleave(str1, str2) where interleave('abcdef', 'ab') = 'aabbcdef' and product(str1, n) where product('ab', 3) = ['aaa','aab','abb','bbb','aba','baa','bba'].
I have written interleave(str1, str2) which is
def interleave(str1,str2):
    lsta,lstb,result= list(str1),list(str2),''
    while lsta and lstb:
        result += lsta.pop(0)
        result += lstb.pop(0)
    if lsta:
        for i in lsta:
            result+= i
    else:
        for i in lstb:
            result+=i
    return result
However, I have no idea how to code the product function. I assume it has to do something with recursion, so I'm trying to add 'a' and 'b' for every product.
def product(str1,n):
    if n ==1:
        return []
    else:
        return [product(str1,n-1)]+[str1[0]]
Please help me to understand how to solve this question. (Not only the product it self)
General solution
Assuming your implementation of interleave is correct, you can use it together with product (see my suggested implementation below) to solve the problem with something like:
def f(i, ops, n):
    int_str = str(i)
    retval = []
    for seq_len in range(1, len(int_str)):
        for op_seq in r_prod(ops, seq_len):
            eq = interleave(int_str, op_seq)
            if eval(eq) == n:
                retval.append(eq)
    return retval
The idea is that you interleave the digits of your string with your operators in a varying order. I do that with all possible operator sequences of length seq_len, which varies from 1 to the maximum, the number of digits - 1 (see assumptions below!). Then you use the built-in function eval to evaluate the expression returned by interleave for a specific sequence of operators and compare the result with the desired number, n. If the expression evaluates to n, you append it to the return list retval (initially empty). After you have evaluated the expressions for all possible operator sequences (see assumptions!), you return the list.
Assumptions
It's not clear whether you can use the same operator multiple times or if you're allowed to omit using some. I assumed you can use the same operator many times and that you're allowed to omit using an operator. Hence, the r_prod was used (as suggested by your question). In case of such restrictions, you will want to use permutations (of possibly varying length) of the group of operators.
Secondly, I assumed that your implementation of the interleave function is correct. It is not clear if, for example, interleave("112", "*") should return both "1*12" and "11*2" or just "1*12" like your implementation does. In the case both should be returned, then you should also iterate over the possible ways the same ordered sequence of operators can be interleaved with the provided digits. I omitted that, because I saw that your function always returns a single string.
Product implementation
If you look at the itertools docs you can see the equivalent code for the function itertools.product. Using that you'd have:
def product(*args, repeat=1):
    pools = [tuple(pool) for pool in args] * repeat
    result = [[]]
    for pool in pools:
        result = [x+[y] for x in result for y in pool]
    for prod in result:
        yield tuple(prod)
a = ["".join(x) for x in product('ab', repeat=3)]
print(a)
Which prints ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb'] -- what I guess is what you're after.
A more specific (assuming iterable is a string), less efficient, but hopefully more understandable solution would be:
def prod(string, r):
    if r < 1:
        return None
    retval = list(string)
    for i in range(r - 1):
        temp = []
        for l in retval:
            for c in string:
                temp.append(l + c)
        retval = temp
    return retval
The idea is simple. The second parameter r gives you the length of the strings you want to produce. The characters in the string give you the elements from which you build the string. Hence, you first generate a string of length 1 that starts with each possible character. Then for each of those strings you generate new strings by concatenating the old string with all of the possible characters.
For example, given a pool of characters "abc", you'll first generate strings "a", "b", and "c". Then you'll replace string "a" with strings "aa", "ab", and "ac". Similarly for "b" and "c". You repeat this process r - 1 times to get all possible strings of length r generated by drawing with replacement from the pool "abc".
I'd think it would be a good idea for you to try to implement the prod function recursively. You can see my ugly solution below, but I'd suggest you stop reading this now and try to do it without looking at my suggestion first.
SPOILER BELOW
def r_prod(string, r):
    if r == 1:
        return list(string)
    else:
        return [c + s for c in string for s in r_prod(string, r - 1)]
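To see how the pieces fit together, here is a self-contained sketch that needs no imports. It rests on the same assumptions as above (operators may repeat, and shorter operator sequences leave the trailing digits joined). The name `solve` is made up for illustration, `interleave` is a compact rewrite rather than the version from the question, and skipping expressions that divide by zero or contain a leading zero is a choice made here, not something the problem statement specifies.

```python
def interleave(str1, str2):
    """Alternate characters of str1 and str2, then append whatever is left."""
    n = min(len(str1), len(str2))
    paired = "".join(a + b for a, b in zip(str1, str2))
    return paired + str1[n:] + str2[n:]


def r_prod(string, r):
    """All length-r strings over the characters of `string` (r >= 1)."""
    if r == 1:
        return list(string)
    return [c + s for c in string for s in r_prod(string, r - 1)]


def solve(number, ops, target):
    digits = str(number)
    found = []
    for seq_len in range(1, len(digits)):
        for op_seq in r_prod(ops, seq_len):
            expr = interleave(digits, op_seq)
            try:
                # eval is acceptable here only because expr is built
                # from digits and the given operator characters.
                value = eval(expr)
            except (ZeroDivisionError, SyntaxError):
                continue  # e.g. "1/0" or a literal with a leading zero
            if value == target:
                found.append(expr)
    return found


print(interleave("abcdef", "ab"))  # aabbcdef
print(solve(123, "+-*", 6))        # ['1+2+3', '1*2*3']
```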

Understanding recursion in Python with if case

In solving this codewars challenge I came across a recursion example I don't understand.
The challenge is to give the next nth numbers, given an initial 3 number seed sequence, where the nth numbers are determined by adding the last three numbers of the sequence.
So for the seed sequence list of [1,2,3] and given n=5, you'd return the following:
1 + 2 + 3 = 6
2 + 3 + 6 = 11
Answer:
[1, 2, 3, 6, 11]
I solved the problem with the following:
def tribonacci(sequence,n):
    if n <=3:
        return sequence[:n]
    else:
        for i in range(n-3):
            sequence.append(sum(sequence[-3:]))
        return sequence
In reviewing the other solutions, I came across this very elegant solution that uses recursion:
def tribonacci(sequence,n):
    return sequence[:n] if n<=len(sequence) else tribonacci(sequence + [sum(sequence[-3:])],n)
My question is: why doesn't this just run infinitely? I'm having trouble understanding why this terminates with the nth iteration. It doesn't seem like the function of 'n' is stipulated anywhere, except in the if case.
As an experiment, I modified the code to ignore the less-than-or-equal-to-length-of-sequence case, like so:
def tribonacci(sequence,n):
    return tribonacci(sequence + [sum(sequence[-3:])],n)
And this does run infinitely and errors out with a runtime error of max recursion depth.
So obviously the case option is what seems to control the termination, but I can't see why. I've used recursion myself in solves (for instance in creating a factoring function), but in that example you subtract n-1 as you iterate so there's a terminating process. I don't see that happening here.
I guess I don't completely understand how the return function works. I was reading it as:
return n-item list if n is less/equal to sequence length
else
rerun the function
Perhaps I should actually be reading it as:
return n-item list if n is less/equal to sequence length
else
return n-item list after iterating through the function enough times
to fill a n-item list
At each level of recursion, the sequence becomes one element longer because of the concatenation (+), while n stays the same. Eventually the sequence becomes long enough that n is less than or equal to its length, and the recursion stops.
You can rewrite this:
a = b if p else c
as this:
if p:
    a = b
else:
    a = c
Knowing that, you can see that this:
def tribonacci(sequence,n):
    return sequence[:n] if n<=len(sequence) else tribonacci(sequence + [sum(sequence[-3:])],n)
is the same as:
def tribonacci(sequence,n):
    if n <= len(sequence):
        return sequence[:n]
    else:
        return tribonacci(sequence + [sum(sequence[-3:])],n)
Now you should be able to see that there's a condition controlling that the recursion doesn't go on for ever.
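To watch the termination happen, you can record the length of the sequence at every call. This is only an illustrative sketch: the extra `lengths` parameter is hypothetical and exists purely for tracing.

```python
def tribonacci(sequence, n, lengths=None):
    """Same recursion as above; `lengths` records len(sequence) per call."""
    if lengths is not None:
        lengths.append(len(sequence))
    if n <= len(sequence):
        return sequence[:n]
    # n never changes; the argument list grows by one element per call.
    return tribonacci(sequence + [sum(sequence[-3:])], n, lengths)


seen = []
print(tribonacci([1, 2, 3], 5, seen))  # [1, 2, 3, 6, 11]
print(seen)                            # [3, 4, 5]
```

The lengths climb 3, 4, 5 while n stays at 5, so the third call hits the base case. This plays the same role as subtracting 1 from n in a factorial-style recursion, except that here the other side of the comparison moves.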

How can I optimize this Python code to generate all words with word-distance 1?

Profiling shows this is the slowest segment of my code for a little word game I wrote:
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]
Notes:
distance() is called over 5 million times, majority of which is from getchildren, which is supposed to get all words in the wordlist that differ from word by exactly 1 letter.
wordlist is pre-filtered to only have words containing the same number of letters as word so it's guaranteed that word1 and word2 have the same number of chars.
I'm fairly new to Python (started learning it 3 days ago) so comments on naming conventions or other style things also appreciated.
for wordlist, take the 12dict word list using the "2+2lemma.txt" file
Results:
Thanks everyone, with combinations of different suggestions I got the program running twice as fast now (on top of the optimizations I did on my own before asking, so 4 times speed increase approx from my initial implementation)
I tested with 2 sets of inputs which I'll call A and B
Optimization1: iterate over indices of word1,2 ... from
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
to iterate on letter-pairs using zip(word1, word2)
    for x,y in zip (word1, word2):
        if x != y:
            difference += 1
    return difference
Got execution time from 11.92 to 9.18 for input A, and 79.30 to 74.59 for input B
Optimization2:
Added a separate method for differs-by-one in addition to the distance-method (which I still needed elsewhere for the A* heuristics)
def is_neighbors(word1,word2):
    different = False
    for c1,c2 in zip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different
Got execution time from 9.18 to 8.83 for input A, and 74.59 to 70.14 for input B
Optimization3:
Big winner here was to use izip instead of zip
Got execution time from 8.83 to 5.02 for input A, and 70.14 to 41.69 for input B
I could probably do better writing it in a lower level language, but I'm happy with this for now. Thanks everyone!
Edit again: More results
Using Mark's method of checking the case where the first letter doesn't match got it down from 5.02 -> 3.59 and 41.69 -> 29.82
Building on that and incorporating izip instead of range, I ended up with this:
def is_neighbors(word1,word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x,y in izip(word1[1:],word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different
Which squeezed a little bit more, bringing the times down from 3.59 -> 3.38 and 29.82 -> 27.88
Even more results!
Trying Sumudu's suggestion that I generate a list of all strings that are 1 letter off from "word" and then checking to see which ones were in the wordlist, instead of the is_neighbor function I ended up with this:
def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
    return dif_list
def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return ( w for w in oneoff if w in wordlist )
Which ended up being slower (3.38 -> 3.74 and 27.88 -> 34.40) but it seemed promising. At first I thought the part I'd need to optimize was "one_letter_off_strings" but profiling showed otherwise and that the slow part was in fact
( w for w in oneoff if w in wordlist )
I was wondering whether there'd be any difference if I switched "oneoff" and "wordlist" and did the comparison the other way around when it hit me that I was looking for the intersection of the two lists. I replaced that with a set intersection of the two word collections:
    return set(oneoff) & set(wordlist)
Bam! 3.74 -> 0.23 and 34.40 -> 2.25
This is truly amazing, total speed difference from my original naive implementation:
23.79 -> 0.23 and 180.07 -> 2.25, so approx 80 to 100 times faster than the original implementation.
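For readers on Python 3, the same idea can be sketched as follows. This is not the code that produced the timings above; it assumes lowercase ASCII words and a word set that is built once and reused, whereas the snippet above rebuilds set(wordlist) on every call. CPython sets are hash tables, so each membership test takes roughly constant time on average, which is a likely reason the intersection is so much faster than scanning a list.

```python
import string


def one_letter_off_strings(word):
    """Every string that differs from `word` in exactly one position."""
    return {
        word[:i] + letter + word[i + 1:]
        for i in range(len(word))
        for letter in string.ascii_lowercase
        if letter != word[i]
    }


def getchildren(word, wordset):
    # Assumption: `wordset` is a set created once, outside the search loop.
    return one_letter_off_strings(word) & wordset


words = {"cat", "cot", "cut", "bat", "dog", "cart"}
print(sorted(getchildren("cat", words)))  # ['bat', 'cot', 'cut']
```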
If anyone is interested, I made blog post describing the program and describing the optimizations made including one that isn't mentioned here (because it's in a different section of code).
The Great Debate:
Ok, Unknown and I are having a big debate, which you can read in the comments of his answer. He claims that the original method (using is_neighbor instead of the sets) would be faster if it were ported to C. I tried for 2 hours to get a C module I wrote to build and be linkable, without much success, after trying to follow this and this example, and it looks like the process is a little different on Windows? I don't know, but I gave up on that. Anyway, here's the full code of the program; the text file comes from the 12dict word list, using the "2+2lemma.txt" file. Sorry if the code's a little messy, this was just something I hacked together. Also, I forgot to strip commas out of the wordlist, so that's actually a bug that you can leave in for the sake of a like-for-like comparison, or fix by adding a comma to the list of chars in cleanentries.
from itertools import izip
def unique(seq):
    seen = {}
    result = []
    for item in seq:
        if item in seen:
            continue
        seen[item] = 1
        result.append(item)
    return result
def cleanentries(li):
    pass
    return unique( [w.strip('[]') for w in li if w != "->"] )
def distance(word1, word2):
    difference = 0
    for x,y in izip (word1, word2):
        if x != y:
            difference += 1
    return difference
def is_neighbors(word1,word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x,y in izip(word1[1:],word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different
def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
    return dif_list
def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)
def AStar(start, goal, wordlist):
    import Queue
    closedset = []
    openset = [start]
    pqueue = Queue.PriorityQueue(0)
    g_score = {start:0} #Distance from start along optimal path.
    h_score = {start:distance(start, goal)}
    f_score = {start:h_score[start]}
    pqueue.put((f_score[start], start))
    parent_dict = {}
    while len(openset) > 0:
        x = pqueue.get(False)[1]
        if x == goal:
            return reconstruct_path(parent_dict,goal)
        openset.remove(x)
        closedset.append(x)
        sortedOpen = [(f_score[w], w, g_score[w], h_score[w]) for w in openset]
        sortedOpen.sort()
        for y in getchildren(x, wordlist):
            if y in closedset:
                continue
            temp_g_score = g_score[x] + 1
            temp_is_better = False
            appended = False
            if (not y in openset):
                openset.append(y)
                appended = True
                h_score[y] = distance(y, goal)
                temp_is_better = True
            elif temp_g_score < g_score[y] :
                temp_is_better = True
            else :
                pass
            if temp_is_better:
                parent_dict[y] = x
                g_score[y] = temp_g_score
                f_score[y] = g_score[y] + h_score[y]
                if appended :
                    pqueue.put((f_score[y], y))
    return None
def reconstruct_path(parent_dict,node):
    if node in parent_dict.keys():
        p = reconstruct_path(parent_dict,parent_dict[node])
        p.append(node)
        return p
    else:
        return []
wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
    userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
    words = [w.lower() for w in userentry.split()]
    if(len(words) == 2 and len(words[0]) == len(words[1])):
        break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer != None:
    print "Minimum number of steps is %s" % (len(answer))
    reply = raw_input("Would you like the answer(y/n)? ")
    if(reply.lower() == "y"):
        answer.insert(0, words[0])
        print "\n".join(answer)
    else:
        print "Good luck!"
else:
    print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")
I left the is_neighbors method in even though it's not used. This is the method that is proposed to be ported to C. To use it, just replace getchildren with this:
def getchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w))
As for getting it to work as a C module I didn't get that far, but this is what I came up with:
#include "Python.h"
#include <string.h>
static PyObject *
py_is_neighbor(PyObject *self, PyObject *args)
{
    int length;
    const char *word1, *word2;
    /* "ss" parses exactly two strings; the length is computed below */
    if (!PyArg_ParseTuple(args, "ss", &word1, &word2))
        return NULL;
    length = (int)strlen(word1);
    int i;
    int different = 0;
    for (i = 0; i < length; i++)
    {
        if (*(word1 + i) != *(word2 + i))
        {
            if (different)
            {
                /* second difference found: not neighbors */
                return Py_BuildValue("i", 0);
            }
            different = 1;
        }
    }
    return Py_BuildValue("i", different);
}
PyMethodDef methods[] = {
    {"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
    {NULL, NULL, 0, NULL}
};
/* Python 2 looks for init<modulename>, so this must match "isneighbor" */
PyMODINIT_FUNC
initisneighbor(void)
{
    Py_InitModule("isneighbor", methods);
}
I profiled this using:
python -m cProfile "Wordgame.py"
And the time recorded was the total time of the AStar method call. The fast input set was "verse poets" and the long input set was "poets verse". Timings will obviously vary between different machines, so if anyone does end up trying this, please give a comparison of the results for the program as is, as well as with the C module.
If your wordlist is very long, might it be more efficient to generate all possible 1-letter-differences from 'word', then check which ones are in the list? I don't know any Python but there should be a suitable data structure for the wordlist allowing for log-time lookups.
I suggest this because if your words are reasonable lengths (~10 letters), then you'll only be looking for 250 potential words, which is probably faster if your wordlist is larger than a few hundred words.
Your function distance is calculating the total distance, when you really only care about distance=1. The majority of cases you'll know it's >1 within a few characters, so you could return early and save a lot of time.
Beyond that, there might be a better algorithm, but I can't think of it.
Edit: Another idea.
You can make 2 cases, depending on whether the first character matches. If it doesn't match, the rest of the word has to match exactly, and you can test for that in one shot. Otherwise, do it similarly to what you were doing. You could even do it recursively, but I don't think that would be faster.
def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    same = True
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if same:
                same = False
            else:
                return False
    return not same
Edit 2: I've deleted the check to see if the strings are the same length, since you say it's redundant. Running Ryan's tests on my own code and on the is_neighbors function provided by MizardX, I get the following:
Original distance(): 3.7 seconds
My DifferentByOne(): 1.1 seconds
MizardX's is_neighbors(): 3.7 seconds
Edit 3: (Probably getting into community wiki territory here, but...)
Trying your final definition of is_neighbors() with izip instead of zip: 2.9 seconds.
Here's my latest version, which still times at 1.1 seconds:
def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if different:
                return False
            different = True
    return different
from itertools import izip
def is_neighbors(word1,word2):
    different = False
    for c1,c2 in izip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different
Or maybe in-lining the izip code:
def is_neighbors(word1,word2):
    different = False
    next1 = iter(word1).next
    next2 = iter(word2).next
    try:
        while 1:
            if next1() != next2():
                if different:
                    return False
                different = True
    except StopIteration:
        pass
    return different
And a rewritten getchildren:
def iterchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w) )
izip(a,b) returns an iterator over pairs of values from a and b.
zip(a,b) returns a list of pairs from a and b.
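A note for readers on Python 3: itertools.izip no longer exists there, and the built-in zip is already lazy, so the izip-versus-zip distinction above applies to Python 2 only. A sketch of the same function for Python 3 (not the version that was timed in this thread):

```python
def is_neighbors(word1, word2):
    """True if the equal-length words differ in exactly one position."""
    different = False
    # In Python 3, zip yields pairs on demand, like izip did in Python 2.
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            if different:
                return False  # second mismatch: stop early
            different = True
    return different


print(is_neighbors("verse", "terse"))  # True
print(is_neighbors("verse", "poets"))  # False
```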
People are mainly going about this by trying to write a quicker function, but there might be another way..
"distance" is called over 5 million times
Why is this? Perhaps a better way to optimise is to try to reduce the number of calls to distance, rather than shaving milliseconds off distance's execution time. It's impossible to tell without seeing the full script, but optimising a specific function is generally unnecessary.
If that is impossible, perhaps you could write it as a C module?
How often is the distance function called with the same arguments? A simple to implement optimization would be to use memoization.
You could probably also create some sort of dictionary with frozensets of letters and lists of words that differ by one and look up values in that. This datastructure could either be stored and loaded through pickle or generated from scratch at startup.
Short circuiting the evaluation will only give you gains if the words you are using are very long, since the hamming distance algorithm you're using is basically O(n) where n is the word length.
I did some experiments with timeit for some alternative approaches that may be illustrative.
Timeit Results
Your Solution
import timeit
d = """\
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
"""
t1 = timeit.Timer('distance("hello", "belko")', d)
print t1.timeit() # prints 6.502113536776391
One Liner
d = """\
from itertools import izip
def hamdist(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in izip(s1,s2))
"""
t2 = timeit.Timer('hamdist("hello", "belko")', d)
print t2.timeit() # prints 10.985101179
Shortcut Evaluation
d = """\
def distance_is_one(word1, word2):
    diff = 0
    for i in xrange(len(word1)):
        if word1[i] != word2[i]:
            diff += 1
        if diff > 1:
            return False
    return diff == 1
"""
# time the function defined in this setup string, using its own timer
t3 = timeit.Timer('distance_is_one("hello", "belko")', d)
print t3.timeit() # prints 6.63337
Well you can start by having your loop break if the difference is 2 or more.
Also you can change
for i in range(len(word1)):
to
for i in xrange(len(word1)):
Because xrange generates sequences on demand instead of generating the whole range of numbers at once.
You can also try comparing word lengths first, which would be quicker. Also note that your code doesn't work if word1 is longer than word2.
There's not much else you can do algorithmically after that, which is to say you'll probably find more of a speedup by porting that section to C.
Edit 2
Attempting to explain my analysis of Sumudu's algorithm compared to verifying differences char by char.
When you have a word of length L, the number of "differs-by-one" words you will generate will be 25L. We know from implementations of sets on modern computers, that the search speed is approximately log(n) base 2, where n is the number of elements to search for.
Seeing that most of the 5 million words you test against is not in the set, most of the time, you will be traversing the entire set, which means that it really becomes log(25L) instead of only log(25L)/2. (and this is assuming best case scenario for sets that comparing string by string is equivalent to comparing char by char)
Now we take a look at the time complexity for determining a "differs-by-one". If we assume that you have to check the entire word, then the number of operations per word becomes L. We know that most words differ by 2 very quickly. And knowing that most prefixes take up a small portion of the word, we can logically assume that you will break most of the time by L/2, or half the word (and this is a conservative estimate).
So now we plot the time complexities of the two searches, L/2 and log(25L), and keeping in mind that this is even considering string matching the same speed as char matching (highly in favor of sets). You have the equation log(25*L) > L/2, which can be simplified down to log(25) > L/2 - log(L). As you can see from the graph, it should be quicker to use the char matching algorithm until you reach very large numbers of L.
Also, I don't know if you're counting breaking on a difference of 2 or more in your optimization, but following Mark's answer I already break on a difference of 2 or more; in fact, if the difference is in the first letter, it breaks after the first letter. Even with all those optimizations, changing to sets just blew them out of the water. I'm interested in trying your idea, though.
I was the first person in this question to suggest breaking on a difference of 2 or more. The thing is that Mark's idea of string slicing (if word1[0] != word2[0]: return word1[1:] == word2[1:]) simply moves what we are doing into C. How do you think word1[1:] == word2[1:] is calculated? The same way that we are doing it.
I read your explanation a few times but I didn't quite follow it. Would you mind explaining it in a little more depth? Also, I'm not terribly familiar with C, and I've been working in high-level languages for the past few years (the closest has been learning C++ in high school 6 years ago).
As for producing the C code, I am a bit busy. I am sure you will be able to do it since you have written in C before. You could also try C#, which probably has similar performance characteristics.
More Explanation
Here is a more in-depth explanation for Davy8:
def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)
Your one_letter_off_strings function will create a set of 25L strings (where L is the number of letters).
Creating a set from the wordlist will create a set of D strings (where D is the length of your dictionary). By creating an intersection from this, you MUST iterate over each oneoff and see if it exists in wordlist.
The time complexity for this operation is detailed above. This operation is less efficient than comparing the word you want with each word in wordlist. Sumudu's method is an optimization in C rather than in algorithm.
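To make the two approaches being compared concrete, here is a minimal sketch of both. The body of one_letter_off_strings is not shown in this part of the thread, so the version below is a hypothetical reconstruction (lowercase ASCII words assumed), and getchildren_by_sets, getchildren_by_scan and differs_by_one are illustrative names only.

```python
import string

def one_letter_off_strings(word):
    # Hypothetical reconstruction: every string that differs from `word`
    # in exactly one position, i.e. 25 * len(word) candidates.
    result = []
    for i, original in enumerate(word):
        for letter in string.ascii_lowercase:
            if letter != original:
                result.append(word[:i] + letter + word[i + 1:])
    return result

def getchildren_by_sets(word, wordlist):
    # The approach quoted above: intersect candidates with the dictionary.
    return set(one_letter_off_strings(word)) & set(wordlist)

def differs_by_one(a, b):
    # Char-by-char comparison that gives up at the second mismatch.
    if len(a) != len(b):
        return False
    diffs = 0
    for x, y in zip(a, b):
        if x != y:
            diffs += 1
            if diffs > 1:
                return False
    return diffs == 1

def getchildren_by_scan(word, wordlist):
    # The alternative argued for here: compare against every dictionary word.
    return set(w for w in wordlist if differs_by_one(word, w))
```

Both functions return the same children for a given word and dictionary, so which one is faster in practice is something to time on the real word list rather than assume.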
More Explanation 2
There are only 4500 total words (because the wordlist is pre-filtered for 5-letter words before even being passed to the algorithm), being intersected with 125 one-letter-off words. It seemed that you were saying intersection is log(smaller), or in other words log(125, 2). Compare this to, again assuming what you said, comparing a word and breaking after L/2 letters; I'll round this down to 2, even though for a 5-letter word it's more likely to be 3. This comparison is done 4500 times, so 9000. log(125,2) is about 6.9, and log(4500,2) is about 12. Let me know if I misinterpreted your numbers.
To create the intersection of 125 one-letter-off words with a dictionary of 4500, you need to make 125 * 4500 comparisons. This is not log(125,2). It is at best 125 * log(4500, 2) assuming that the dictionary is presorted. There is no magic shortcut to sets. You are also doing a string by string instead of char by char comparison here.
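The figures traded in this exchange can be reproduced in a few lines. This is only arithmetic on the numbers quoted above (125 candidates, 4500 dictionary words, about 2 letters per early-exit comparison); the variable names are illustrative, and the counts are rough operation estimates, not timings.

```python
import math

candidates = 125         # one-letter-off strings for a 5-letter word
dictionary_size = 4500   # 5-letter words in the pre-filtered wordlist
letters_per_compare = 2  # early-exit estimate used in the comment above

# Comparing every candidate against every dictionary word.
naive_intersection = candidates * dictionary_size

# Binary search of each candidate in a presorted dictionary.
sorted_intersection = candidates * math.log(dictionary_size, 2)

# Scanning the dictionary once with an early-exit char comparison.
linear_scan = dictionary_size * letters_per_compare

print(naive_intersection, round(sorted_intersection), linear_scan)
```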
For such a simple function that has such a large performance implication, I would probably make a C library and call it using ctypes. One of reddit's founders claims they made the website 2x as fast using this technique.
You can also use psyco on this function, but beware that it can eat up a lot of memory.
I don't know if it will significantly affect your speed, but you could start by turning the list comprehension into a generator expression. It's still iterable so it shouldn't be much different in usage:
def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]
to
def getchildren(word, wordlist):
    return ( w for w in wordlist if distance(word, w) == 1 )
The main problem would be that a list comprehension would construct itself in memory and take up quite a bit of space, whereas the generator will create your list on the fly so there is no need to store the whole thing.
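One practical difference to keep in mind when making this swap is that a generator can only be consumed once. A small sketch, assuming a simple distance() like the ones posted in this thread (getchildren_list and getchildren_gen are illustrative names):

```python
def distance(word1, word2):
    # Count positions where the two words differ.
    return sum(1 for a, b in zip(word1, word2) if a != b)

def getchildren_list(word, wordlist):
    # Builds the whole result in memory before returning.
    return [w for w in wordlist if distance(word, w) == 1]

def getchildren_gen(word, wordlist):
    # Yields matches lazily, one at a time, as the caller iterates.
    return (w for w in wordlist if distance(word, w) == 1)

words = ["hello", "jello", "cello", "world"]
children = getchildren_gen("hello", words)
first_pass = list(children)   # consumes the generator
second_pass = list(children)  # nothing left to yield
```

If the caller needs len() or wants to loop over the children more than once, the list version (or wrapping the generator in list()) is still the simpler choice.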
Also, following on unknown's answer, this may be a more "pythonic" way of writing distance():
def distance(word1, word2):
    difference = 0
    for x,y in zip (word1, word2):
        if x != y:
            difference += 1
    return difference
But it's confusing what's intended when len (word1) != len (word2), in the case of zip it will only return as many characters as the shortest word. (Which could turn out to be an optimization...)
Try this:
def distance(word1, word2):
    return sum([not c1 == c2 for c1, c2 in zip(word1,word2)])
Also, do you have a link to your game? I like being destroyed by word games.
First thing to occur to me:
from operator import ne
def distance(word1, word2):
    return sum(map(ne, word1, word2))
which has a decent chance of going faster than other functions people have posted, because it has no interpreted loops, just calls to Python primitives. And it's short enough that you could reasonably inline it into the caller.
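A short usage sketch of this version, plugged into the getchildren() shape used elsewhere in the thread. One caveat worth knowing: with words of unequal length, Python 3's map stops at the shorter word, while Python 2's map pads the shorter one with None, so the two versions can give different counts. The sample words are made up for illustration.

```python
from operator import ne

def distance(word1, word2):
    # ne(a, b) is a != b; each True counts as 1 when summed.
    return sum(map(ne, word1, word2))

def getchildren(word, wordlist):
    # Keep only the words exactly one substitution away.
    return [w for w in wordlist if distance(word, w) == 1]

print(distance("hello", "jello"))
print(getchildren("hello", ["jello", "world", "hells"]))
```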
For your higher-level problem, I'd look into the data structures developed for similarity search in metric spaces, e.g. this paper or this book, neither of which I've read (they came up in a search for a paper I have read but can't remember).
For this snippet:
for x,y in zip (word1, word2):
    if x != y:
        difference += 1
return difference
I'd use this one:
return sum(1 for i in xrange(len(word1)) if word1[i] != word2[i])
The same pattern would apply throughout the provided code...
Everyone else focused just on explicit distance-calculation without doing anything about constructing the distance-1 candidates.
You can improve on this by using a well-known data structure called a Trie to merge the implicit distance calculation with the task of generating all distance-1 neighbor words. A Trie is a tree in which each node stands for a letter, and the 'next' field is a dict with up to 26 entries, pointing to the child nodes.
Here's the pseudocode: walk the Trie iteratively for your given word; at each node add all distance-0 and distance-1 neighbors to the results; keep a counter of distance and decrement it. You don't need recursion, just a lookup function which takes an extra distance_so_far integer argument.
A minor tradeoff of extra speed for O(N) space increase can be gotten by building separate Tries for length-3, length-4, length-5 etc. words.
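The Trie idea above can be sketched as follows. This is one possible reading of the pseudocode, not the answerer's own implementation: it uses nested dicts as nodes, an explicit stack instead of recursion, and counts the distance up (a distance-so-far value) rather than decrementing a budget. The names build_trie, neighbors and the END marker are all made up for illustration, and substitutions only are assumed (no insertions or deletions).

```python
END = "$"  # marks the end of a word; assumes words never contain "$"

def build_trie(words):
    # Each node is a dict mapping a letter to the child node.
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word
    return root

def neighbors(trie, word, max_distance=1):
    # Return {found_word: distance} for stored words of the same length
    # that differ from `word` in at most max_distance positions.
    results = {}
    stack = [(trie, 0, 0)]  # (node, position in word, distance so far)
    while stack:
        node, pos, dist = stack.pop()
        if pos == len(word):
            if END in node:
                results[node[END]] = dist
            continue
        for letter, child in node.items():
            if letter == END:
                continue
            new_dist = dist + (letter != word[pos])
            if new_dist <= max_distance:  # prune branches that are too far
                stack.append((child, pos + 1, new_dist))
    return results

trie = build_trie(["hello", "jello", "cello", "hells", "help", "world"])
found = neighbors(trie, "hello")
```

Branches are abandoned as soon as the distance budget is exceeded, so most of the dictionary is never visited; words of a different length simply never reach an end marker at the right depth.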
