Comparing Python integers

I have a list of integers, each below a thousand, and a hash function that transforms the list into a single, much larger integer. The hash function code is below:
def hash_function(lst):
    hsh = 0
    for i, item in enumerate(lst):
        hsh += item * pow(10, i * 3)
    return hsh
Assume that lst has about 4-5 items.
Is comparing two integers more efficient than comparing two lists of much smaller integers? Why or why not? I have to compare a few hundred thousand hashes.

I came up with a quick test to show the difference between the built-in list comparison and your hash function.
import time
import random
import sys

def compareRegular(a, b):
    return a == b

def listHash(lst):
    hsh = 0
    for i, item in enumerate(lst):
        hsh += item * pow(10, i * 3)
    return hsh

def compareHash(a, b):
    return listHash(a) == listHash(b)

def compareLists(hugeList, comparison):
    output = []
    for i, lstA in enumerate(hugeList[:-1]):
        for j, lstB in enumerate(hugeList[i + 1:]):
            if comparison(lstA, lstB):
                output.append([i, j])
    return output

def genList(minValue, maxValue, numElements):
    output = []
    for _ in range(1000):
        smallList = []
        for _ in range(numElements):
            smallList.append(random.randint(minValue, maxValue))
        output.append(smallList)
    return output

random.seed(123)
hugeListA = genList(-sys.maxint - 1, sys.maxint, 5)
hugeListB = genList(0, 100, 5)

print "Test with huge numbers in our list"
start = time.time()
regularOut = compareLists(hugeListA, compareRegular)
end = time.time()
print "Regular compare takes:", end - start

start = time.time()
hashOut = compareLists(hugeListA, compareHash)
end = time.time()
print "Hash compare takes:", end - start
print "Are both outputs the same?", regularOut == hashOut
print

print "Test with smaller numbers in our lists"
start = time.time()
regularOut = compareLists(hugeListB, compareRegular)
end = time.time()
print "Regular compare takes:", end - start

start = time.time()
hashOut = compareLists(hugeListB, compareHash)
end = time.time()
print "Hash compare takes:", end - start
print "Are both outputs the same?", regularOut == hashOut
On my computer this outputs:
Test with huge numbers in our list
Regular compare takes: 0.0940001010895
Hash compare takes: 3.38999986649
Are both outputs the same? True

Test with smaller numbers in our lists
Regular compare takes: 0.0789999961853
Hash compare takes: 3.01400017738
Are both outputs the same? True
The people who develop Python definitely spend a lot of time thinking about this sort of thing. I don't know exactly how the built-in list comparison is implemented, but I'm fairly certain it doesn't execute inside the Python interpreter the way your hash function does. Many built-in Python functions and types are backed by native C code, and list comparison almost certainly falls into this category.
Even if you implemented your hash function the same way and had it execute natively, it would still likely be slower. You're basically trading N integer comparisons for N calls to pow plus multiplications and additions. Even with arbitrary-precision integers, a memcmp-style comparison of two values certainly won't take longer than loading the same values from memory and doing arithmetic on them.
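If the real goal is to find which of a few hundred thousand lists are equal, there is a way to avoid both the custom hash and the pairwise O(n²) loop entirely: tuples are hashable, so a dictionary can group equal lists in a single pass. This is a sketch of that idea, not part of the original answer; the function name is illustrative:

```python
# Sketch (assumption: the goal is finding equal lists among many).
# tuple(lst) is hashable, so a dict groups equal lists in one O(n) pass
# instead of comparing every pair.
from collections import defaultdict

def group_equal_lists(lists):
    groups = defaultdict(list)
    for idx, lst in enumerate(lists):
        groups[tuple(lst)].append(idx)  # the list itself is not hashable
    return [idxs for idxs in groups.values() if len(idxs) > 1]

print(group_equal_lists([[1, 2], [3, 4], [1, 2]]))  # [[0, 2]]
```

The dictionary's built-in hashing of tuples runs in C, so this is typically much faster than any hand-rolled hash computed in the interpreter.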

Related

Speeding up `any` with list comprehension

I am using any with a list comprehension. I would like to break the list comprehension when any returns True. For example,
import time

def f(x):
    time.sleep(2)
    return x

beginTime = time.time()
result = any([f(x) == 0 for x in [0,1,3,5,7]])
endTime = time.time()
print(endTime - beginTime)
The above code prints about 10 seconds, although it could stop iterating after the first True.
Use a generator expression instead of a list comprehension to avoid forming the list first:
result = any(f(x) == 0 for x in [0,1,3,5,7])
(the square brackets of the list comprehension are gone.)
Note that any short-circuits in either case; what differs is that the generator expression avoids building the whole list first.
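A minimal way to see the difference is to count how many times f actually runs under each form (the counting list below is purely illustrative, not part of the original answer):

```python
# Illustrative counter showing how often f runs under each form.
calls = []

def f(x):
    calls.append(x)
    return x

any([f(x) == 0 for x in [0, 1, 3, 5, 7]])   # list comp: f runs for every x
list_calls = len(calls)                      # 5

calls.clear()
any(f(x) == 0 for x in [0, 1, 3, 5, 7])      # genexp: stops at first match
gen_calls = len(calls)                       # 1

print(list_calls, gen_calls)  # 5 1
```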
You can use a generator expression, as Mustafa suggested, but retrieve only the first truthy element. The generator does not have to be consumed completely, and the walrus operator does the rest:
import time

def f(x):
    time.sleep(2)
    return x

beginTime = time.time()
result = next(wr for x in [0,1,3,5,7] if (wr := f(x)) == 0)
endTime = time.time()
print(endTime - beginTime)
This takes only the minimum time needed to retrieve the first occurrence. (Pass a default as the second argument to next if no match is guaranteed, since next raises StopIteration otherwise.)

How to convert an input string to a list, change it to a palindrome (if it isn't already) and turn it back into a string

A string is a palindrome if it reads the same forward and backward. Given a string that contains only lower-case English letters, you are required to create a new palindrome string from the given string following the rules given below:
1. You can reduce (but not increase) any character in a string by one; for example you can reduce the character h to g but not from g to h
2. In order to achieve your goal, if you have to then you can reduce a character of a string repeatedly until it becomes the letter a; but once it becomes a, you cannot reduce it any further.
Each reduction operation is counted as one. So you need to count as well how many reductions you make. Write a Python program that reads a string from a user input (using raw_input statement), creates a palindrome string from the given string with the minimum possible number of operations and then prints the palindrome string created and the number of operations needed to create the new palindrome string.
I tried to convert the string to a list first, then modify the list so that, for any given string, if it's not a palindrome it is automatically edited into a palindrome and the result is printed. After modifying the list, I convert it back to a string.
c = raw_input("enter a string ")
x = list(c)
y = ""
i = 0
j = len(x)-1
a = 0
while i < j:
    if x[i] < x[j]:
        a += ord(x[j]) - ord(x[i])
        x[j] = x[i]
        print x
    else:
        a += ord(x[i]) - ord(x[j])
        x[i] = x[j]
        print x
    i = i + 1
    j = (len(x)-1)-1
print "The number of operations is ", a
print "The palindrome created is", ''.join(x)
Am I approaching it the right way, or is there something I'm not adding up?
Since only reduction is allowed, it is clear that the number of reductions for each pair will be the difference between them. For example, consider the string 'abcd'.
Here the pairs to check are (a,d) and (b,c).
Now difference between 'a' and 'd' is 3, which is obtained by (ord('d')-ord('a')).
I am using absolute value to avoid checking which alphabet has higher ASCII value.
I hope this approach will help.
s = input()
l = len(s)
count = 0
m = 0
n = l - 1
while m < n:
    count += abs(ord(s[m]) - ord(s[n]))
    m += 1
    n -= 1
print(count)
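The snippet above only counts the operations. A sketch that also builds the palindrome, by keeping the smaller character of each mirrored pair (only reductions are allowed), could look like this; the function name is my own:

```python
def make_palindrome(s):
    # The number of reductions for a pair is the absolute difference of
    # their character codes; keeping the smaller character is optimal
    # because characters can only be reduced, never increased.
    chars = list(s)
    ops = 0
    i, j = 0, len(chars) - 1
    while i < j:
        ops += abs(ord(chars[i]) - ord(chars[j]))
        chars[i] = chars[j] = min(chars[i], chars[j])
        i += 1
        j -= 1
    return "".join(chars), ops

print(make_palindrome("abcd"))   # ('abba', 4)
print(make_palindrome("ideal"))  # ('iaeai', 6)
```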
This is a common "homework" or competition question. The basic concept here is that you have to find a way to get to minimum values with as few reduction operations as possible. The trick here is to utilize string manipulation to keep that number low. For this particular problem, there are two very simple things to remember: 1) you have to split the string, and 2) you have to apply a bit of symmetry.
First, split the string in half. The following function should do it.
def split_string_to_halves(string):
    half, rem = divmod(len(string), 2)
    a, b, c = '', '', ''
    a, b = string[:half], string[half:]
    if rem > 0:
        b, c = string[half + 1:], string[half]
    return (a, b, c)
The above should recreate the string if you do a + c + b. Next, convert a and b to lists and map the ord function over each half. Leave the middle remainder alone, if any.
def convert_to_ord_list(string):
    return map(ord, list(string))
Since you just have to do a one-way operation (only reduction, no need for addition), you can assume that for each pair of elements in the two converted lists, the higher value less the lower value is the number of operations needed. Easier shown than said:
def convert_to_palindrome(string):
    halfone, halftwo, rem = split_string_to_halves(string)
    if halfone == halftwo[::-1]:
        return halfone + rem + halftwo, 0
    halftwo = halftwo[::-1]
    zipped = zip(convert_to_ord_list(halfone), convert_to_ord_list(halftwo))
    counter = sum([max(x) - min(x) for x in zipped])
    floors = [min(x) for x in zipped]
    res = "".join(map(chr, floors))
    res += rem + res[::-1]
    return res, counter
Finally, some tests:
target = 'ideal'
print convert_to_palindrome(target) # ('iaeai', 6)
target = 'euler'
print convert_to_palindrome(target) # ('eelee', 29)
target = 'ohmygodthisisinsane'
print convert_to_palindrome(target) # ('ehasgidihihidigsahe', 84)
I'm not sure whether this is fully optimized or covers all edge cases, but I think it conveys the general concept of the approach needed. Compared to your code, this is clearer and actually works (yours does not). Good luck, and let us know how this works for you.

Find the smallest positive number not in list

I have a list in python like this:
myList = [1,14,2,5,3,7,8,12]
How can I easily find the first unused value? (in this case '4')
I came up with several different ways:
Iterate the first number not in set
I didn't want to get the shortest code (which might be the set-difference trickery) but something that could have a good running time.
This might be one of the best approaches proposed here; my tests show that it can be substantially faster than the set-difference approach, especially if the hole is near the beginning:
from itertools import count, filterfalse # ifilterfalse on py2
A = [1,14,2,5,3,7,8,12]
print(next(filterfalse(set(A).__contains__, count(1))))
The list is turned into a set, whose __contains__(x) method corresponds to x in A. count(1) creates a counter that counts from 1 to infinity. filterfalse consumes numbers from the counter until one is found that is not in the set; next() then yields that first number.
Timing for len(a) = 100000, randomized and the sought-after number is 8:
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
0.9200698399945395
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
3.1420603669976117
Timing for len(a) = 100000, ordered and the first free is 100001
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
1.520096342996112
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
1.987783643999137
(note that this is Python 3 and range is the py2 xrange)
Use heapq
The asymptotically good answer: heapq with enumerate
from heapq import heapify, heappop
from functools import partial

# A = [1,2,3] also works
A = [1,14,2,5,3,7,8,12]
end = 2 ** 61       # these are different and neither of them can be the
sentinel = 2 ** 62  # first gap (unless you have 2^64 bytes of memory).
heap = list(A)
heap.append(end)
heapify(heap)

print(next(n for n, v in enumerate(
    iter(partial(heappop, heap), sentinel), 1) if n != v))
Now, the one above could be the preferred solution if written in C, but heapq is written in Python and most probably slower than many other alternatives that mainly use C code.
Just sort and enumerate to find the first not matching
Or the simple answer with good constants for O(n lg n)
next(i for i, e in enumerate(sorted(A) + [ None ], 1) if i != e)
This might be fastest of all if the list is almost sorted because of how the Python Timsort works, but for randomized the set-difference and iterating the first not in set are faster.
The + [ None ] is necessary for the edge cases of there being no gaps (e.g. [1,2,3]).
This makes use of the property of sets
>>> l = [1,2,3,5,7,8,12,14]
>>> m = range(1, len(l) + 2)
>>> min(set(m) - set(l))
4
I would suggest you use a generator with enumerate to determine the missing element (note that this only works if the elements are traversed in sorted order):
>>> next(a for a, b in enumerate(sorted(myList), sorted(myList)[0]) if a != b)
4
enumerate pairs an index with each element, so the goal is to find the element that differs from its index.
Note, I am also assuming that the elements may not start with a definite value (in this case 1); if they do start at 1, you can simplify the expression further:
>>> next(a for a, b in enumerate(sorted(myList), 1) if a != b)
4
A for loop with the list will do it.
l = [1,14,2,5,3,7,8,12]
for i in range(1, max(l)):
    if i not in l:
        break
print(i)  # result 4
Don't know how efficient, but why not use an xrange as a mask and use set minus?
>>> myList = [1,14,2,5,3,7,8,12]
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
4
You're only creating a set as big as myList, so it can't be that bad :)
This won't work for "full" lists:
>>> myList = range(1, 5)
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: min() arg is an empty sequence
But the fix to return the next value is simple (add one more to the masked set):
>>> min(set(xrange(1, len(myList) + 2)) - set(myList))
5
import itertools as it
next(i for i in it.count() if i not in mylist)
I like this because it reads very closely to what you're trying to do: "start counting, keep going until you reach a number that isn't in the list, then tell me that number". However, this is quadratic since testing i not in mylist is linear.
Solutions using enumerate are linear, but rely on the list being sorted and no value being repeated. Sorting first makes it O(n log n) overall, which is still better than quadratic. However, if you can assume the values are distinct, then you could put them into a set first:
myset = set(mylist)
next(i for i in it.count() if i not in myset)
Since set containment checks are roughly constant time, this will be linear overall.
I just solved this in a probably non pythonic way
def solution(A):
# Const-ish to improve readability
MIN = 1
if not A: return MIN
# Save re-computing MAX
MAX = max(A)
# Loop over all entries with minimum of 1 starting at 1
for num in range(1, MAX):
# going for greatest missing number return optimistically (minimum)
# If order needs to switch, then use max as start and count backwards
if num not in A: return num
# In case the max is < 0 double wrap max with minimum return value
return max(MIN, MAX+1)
I think it reads quite well
My effort, no itertools. It sets "current" to be one less than the value you are expecting.
lst = [1,2,3,4,5,7,8]
current = lst[0] - 1
for i in lst:
    if i != current + 1:
        print current + 1
        break
    current = i
The naive way is to traverse the list, which is an O(n) solution. However, if the list is sorted, you can use that to perform a modified binary search: you are looking for the last occurrence of A[i] = i.
The pseudo algorithm will be something like:
binarysearch(A):
    result = 0
    start = 0
    end = len(A) - 1
    while start <= end:
        mid = (start + end) // 2
        if A[mid] == mid:
            result = A[mid]
            start = mid + 1
        else:  # A[mid] > mid, since A[mid] cannot be less than mid
            end = mid - 1
    return result + 1
This is an O(log n) solution. I assumed lists are one indexed. You can modify the indices accordingly
EDIT: if the list is not sorted, you can use the heapq python library and store the list in a min-heap and then pop the elements one by one
pseudo code:
from heapq import heapify, heappop
H = list(A)  # assuming A is the list
heapify(H)
count = 1
while H:
    if heappop(H) != count:
        break
    count += 1
# count is now the smallest missing positive
sort + reduce to the rescue!
from functools import reduce # python3
myList = [1,14,2,5,3,7,8,12]
res = 1 + reduce(lambda x, y: x if y-x>1 else y, sorted(myList), 0)
print(res)
Unfortunately, it won't stop after a match is found and will iterate the whole list.
Faster (but less fun) is to use for loop:
myList = [1,14,2,5,3,7,8,12]
res = 0
for num in sorted(myList):
    if num - res > 1:
        break
    res = num
res = res + 1
print(res)
You can try this:
for i in range(1, max(arr1) + 2):
    if i not in arr1:
        print(i)
        break
Easy to read, easy to understand, gets the job done:
def solution(A):
    smallest = 1
    unique = set(A)
    for n in sorted(unique):  # set iteration order is not guaranteed, so sort
        if n == smallest:
            smallest += 1
    return smallest
Keep incrementing a counter in a loop until you find the first positive integer that's not in the list.
def getSmallestIntNotInList(number_list):
    """Returns the smallest positive integer that is not in a given list"""
    i = 0
    while True:
        i += 1
        if i not in number_list:
            return i

print(getSmallestIntNotInList([1,14,2,5,3,7,8,12]))
# 4
I found that this had the fastest performance compared to other answers on this post. I tested using timeit in Python 3.10.8. My performance results can be seen below:
import timeit

def findSmallestIntNotInList(number_list):
    # Infinite while-loop until first number is found
    i = 0
    while True:
        i += 1
        if i not in number_list:
            return i

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.038100800011307 seconds

import timeit

def findSmallestIntNotInList(number_list):
    # Loop with a range to len(number_list)+1
    for i in range(1, len(number_list) + 1):
        if i not in number_list:
            return i

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.05068870005197823 seconds

import timeit

def findSmallestIntNotInList(number_list):
    # Loop with a range to max(number_list) (by silgon)
    # https://stackoverflow.com/a/49649558/3357935
    for i in range(1, max(number_list)):
        if i not in number_list:
            return i

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06317249999847263 seconds

import timeit
from itertools import count, filterfalse

def findSmallestIntNotInList(number_list):
    # Iterate the first number not in set (by Antti Haapala -- Слава Україні)
    # https://stackoverflow.com/a/28178803/3357935
    return next(filterfalse(set(number_list).__contains__, count(1)))

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06515420007053763 seconds

import timeit

def findSmallestIntNotInList(number_list):
    # Use property of sets (by Bhargav Rao)
    # https://stackoverflow.com/a/28176962/3357935
    m = range(1, len(number_list) + 2)
    return min(set(m) - set(number_list))

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.08586219989228994 seconds
The easiest way would be to loop through the sorted list and check whether the index equals the value; if not, return the index as the solution.
This has complexity O(n log n) because of the sorting:
for index, value in enumerate(sorted(myList), 1):
    if index != value:
        print(index)
        break
Another option is to use Python sets, which are somewhat like dictionaries without values, just keys. In a dictionary you can look up a key in constant time, which makes the whole solution look like the following, with only linear complexity O(n):
mySet = set(myList)
for i in range(1, len(mySet) + 1):  # start at 1, since 0 is not a valid answer
    if i not in mySet:
        print(i)
        break
Edit:
If the solution should also deal with lists where no number is missing (e.g. [0,1]) and output the next following number and should also correctly consider 0, then a complete solution would be:
def find_smallest_positive_number_not_in_list(myList):
    mySet = set(myList)
    for i in range(1, max(mySet) + 2):
        if i not in mySet:
            return i
A solution that returns all those values is
free_values = set(range(1, max(L))) - set(L)
it does a full scan, but those loops are implemented in C and unless the list or its maximum value are huge this will be a win over more sophisticated algorithms performing the looping in Python.
Note that if this search is needed to implement "reuse" of IDs, then keeping a free list around and maintaining it up to date (i.e. adding numbers to it when deleting entries and picking from it when reusing entries) is often a good idea.
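A minimal sketch of such a free list, reusing the smallest released id first by keeping the freed ids in a heap (the IdPool name and API are illustrative, not from the answer):

```python
import heapq

class IdPool:
    """Hand out ids starting at 1, reusing released ones (smallest first)."""
    def __init__(self):
        self.next_id = 1
        self.free = []          # min-heap of released ids

    def acquire(self):
        if self.free:
            return heapq.heappop(self.free)
        nid = self.next_id
        self.next_id += 1
        return nid

    def release(self, i):
        heapq.heappush(self.free, i)

pool = IdPool()
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()  # 1, 2, 3
pool.release(b)
print(pool.acquire())  # 2 (reused before allocating 4)
```

This sidesteps the search entirely: acquiring an id is O(log n) instead of rescanning the whole list each time.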
The following solution loops all numbers in between 1 and the length of the input list and breaks the loop whenever a number is not found inside it. Otherwise the result is the length of the list plus one.
listOfNumbers = [1,14,2,5,3,7,8,12]
for i in range(1, len(listOfNumbers) + 1):
    if i not in listOfNumbers:
        nextNumber = i
        break
else:
    nextNumber = len(listOfNumbers) + 1

Merging arrays slices in Python

So I have a string that looks like this:
data="ABCABDABDABBCBABABDBCABBDBACBBCDB"
And I am taking random 10 character slices out of it:
start = int(random.random() * 100)
end = start + 10
slice = data[start:start + 10]
But what I am trying to do now is count the number of 'gaps' or 'holes' that were not sliced out at all.
slices_indices = []
for i in xrange(0, 100):
    start = int(random.random() * 100)
    end = start + 10
    slice = data[start:end]
    ...
    slices_indices.append([start, end])
For instance, after running this a couple of times, I covered this much:
ABCAB DABD ABBCBABABDB C ABBDBACBBCDB
But it left two 'gaps' of unsliced characters. Is there a 'Pythonic' way to find the number of these gaps? Basically I am looking for a function count_gaps that, given the slice indices, counts the gaps. For the example above,
count_gaps(slices_indices)
would give me two.
Thanks in advance
There are several approaches, although all involve a bit of messing about.
You could compare the removed strings against the original, and work out which characters you didn't hit.
That's a very roundabout way of doing it, though, and won't work properly if you ever have the same 10 characters in the string twice. eg 1234123 or something.
A better solution would be to store the values of i you use, then step back through the data string comparing the current position to the values of i you used (plus 10). If it doesn't match, job done.
eg (pseudo code)
# Make an array the same length as the string
charsUsed = array(data.length)
# Do whatever
for i in xrange(0, 100):
    someStuffYouWereDoingBefore()
    # Store our "used chars" in the array
    for (char = i; char < i + 10; char++):
        if (char < data.length):  # Don't go out of bounds on the array!
            charsUsed[char] = true
Then to see which chars weren't used, just walk through charsUsed array and count whatever it is you want to count (consecutive gaps etc)
Edit in response to updated question:
I'd still use the above method to make a "which chars were used" array. Your count_gaps() function then just needs to walk through the array to "find" the gaps
eg (pseudo...something. This isn't even vaguely Python. Hopefully you get the idea though)
The idea is essentially to see whether the current position is false (i.e. not used) and the last position is true (used), meaning it's the start of a new gap. If both are false, we're in the middle of a gap; if both are true, we're in the middle of a used stretch.
function find_gaps(array charsUsed)
{
    # Count the gaps
    numGaps = 0
    # What did we look at last (to see if it's the start of a gap)?
    # Assume true if you want to count a gap at the start of the string; false if you don't.
    lastPositionUsed = true
    for (i = 0; i < charsUsed.length; i++)
    {
        if (charsUsed[i] == false && lastPositionUsed == true)
        {
            numGaps++
        }
        lastPositionUsed = charsUsed[i]
    }
    return numGaps
}
The other option would be to step through the charsUsed array again and "group" consecutive values into a smaller array, then count the value you want: essentially the same thing but with a different approach. With this example I just count only the boundaries between the groups we don't want and the groups we do.
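The pseudocode above translates into a short Python sketch; the (start, end) tuple-list input format is an assumption on my part:

```python
from itertools import groupby

def count_gaps(slices_indices, length):
    # Mark every position covered by some (start, end) slice...
    used = [False] * length
    for start, end in slices_indices:
        for k in range(start, min(end, length)):
            used[k] = True
    # ...then count runs of consecutive unused positions.
    return sum(1 for is_used, _ in groupby(used) if not is_used)

print(count_gaps([(0, 5), (8, 12)], 15))  # 2 (positions 5-7 and 12-14)
```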
It is a bit of a messy task, but I think sets are the way to go. I hope my code below is self-explanatory, but if there are parts you don't understand please let me know.
#! /usr/bin/env python

''' Count gaps.
Find and count the sections in a sequence that weren't touched by random slicing
From http://stackoverflow.com/questions/26060688/merging-arrays-slices-in-python
Written by PM 2Ring 2014.09.27
'''

import random
from string import ascii_lowercase

def main():
    def rand_slice():
        start = random.randint(0, len(data) - slice_width)
        return start, start + slice_width

    #The data to slice
    data = 5 * ascii_lowercase
    print 'Data:\n%s\nLength : %d\n' % (data, len(data))
    random.seed(42)

    #A set to capture slice ranges
    slices = set()
    slice_width = 10
    num_slices = 10

    print 'Extracting %d slices from data' % num_slices
    for i in xrange(num_slices):
        start, end = rand_slice()
        slices |= set(xrange(start, end))
        data_slice = data[start:end].upper()
        print '\n%2d, %2d : %s' % (start, end, data_slice)
        data = data[:start] + data_slice + data[end:]
        print data

    #print sorted(slices)
    print '\nSlices:\n%s\n' % sorted(slices)

    print '\nSearching for gaps missed by slicing'
    unsliced = sorted(tuple(set(xrange(len(data))) - slices))
    print 'Unsliced:\n%s\n' % (unsliced,)

    gaps = []
    if unsliced:
        last = start = unsliced[0]
        for i in unsliced[1:]:
            if i > last + 1:
                t = (start, last + 1)
                gaps.append(t)
                print t
                start = i
            last = i
        t = (start, last + 1)
        gaps.append(t)
        print t
    print '\nGaps:\n%s\nCount: %d' % (gaps, len(gaps))

if __name__ == '__main__':
    main()
I'd use some kind of bitmap. For example, extending your code:
data = "ABCABDABDABBCBABABDBCABBDBACBBCDB"
slices_indices = [0] * len(data)
for i in xrange(0, 100):
    start = int(random.random() * len(data))
    end = start + 10
    slice = data[start:end]
    slices_indices[start:end] = [1] * len(slice)
I've used a list here, but you could use any other appropriate data structure, probably something more compact, if your data is rather big.
So, we've initialized the bitmap with zeros, and marked with ones the selected chunks of data. Now we can use something from itertools, for example:
from itertools import groupby
groups = groupby(slices_indices)
groupby returns an iterator where each element is a tuple (element, iterator). To just count gaps you can do something simple, like:
gaps = len([x for x in groups if x[0] == 0])

Why are string keys in python dictionaries slower to write/read than tuples?

In trying to optimize the speed of a program that mimics a tree structure (the "tree" is stored in a dict with Cartesian (x, y) coordinate pairs as keys), I have found that storing the unique addresses in a dictionary as tuples, rather than strings, results in a substantially faster run time.
My question is, if Python is optimized for string keys in dictionaries and hashing, why is using Tuples so much faster in this example? String keys seem to take 60% longer in doing the exact same task. Am I overlooking something simple in my example?
I was referencing this thread as the basis for my question (as well as others that make the same assertion that strings are faster): Is it always faster to use string as key in a dict?
Below is the code I was using to test the methods, and time them:
import time

def writeTuples():
    k = {}
    for x in range(0,500):
        for y in range(0,x):
            k[(x,y)] = "%s,%s"%(x,y)
    return k

def readTuples(k):
    failures = 0
    for x in range(0,500):
        for y in range(0,x):
            if k.get((x,y)) is not None: pass
            else: failures += 1
    return failures

def writeStrings():
    k = {}
    for x in range(0,500):
        for y in range(0,x):
            k["%s,%s"%(x,y)] = "%s,%s"%(x,y)
    return k

def readStrings(k):
    failures = 0
    for x in range(0,500):
        for y in range(0,x):
            if k.get("%s,%s"%(x,y)) is not None: pass
            else: failures += 1
    return failures

def calcTuples():
    clockTimesWrite = []
    clockTimesRead = []
    failCounter = 0
    trials = 100
    st = time.clock()
    for x in range(0,trials):
        startLoop = time.clock()
        k = writeTuples()
        writeTime = time.clock()
        failCounter += readTuples(k)
        readTime = time.clock()
        clockTimesWrite.append(writeTime-startLoop)
        clockTimesRead.append(readTime-writeTime)
    et = time.clock()
    print("The average time to loop with tuple keys is %f, and had %i total failed records"%((et-st)/trials,failCounter))
    print("The average write time is %f, and average read time is %f"%(sum(clockTimesWrite)/trials,sum(clockTimesRead)/trials))
    return None

def calcStrings():
    clockTimesWrite = []
    clockTimesRead = []
    failCounter = 0
    trials = 100
    st = time.clock()
    for x in range(0,trials):
        startLoop = time.clock()
        k = writeStrings()
        writeTime = time.clock()
        failCounter += readStrings(k)
        readTime = time.clock()
        clockTimesWrite.append(writeTime-startLoop)
        clockTimesRead.append(readTime-writeTime)
    et = time.clock()
    print("The average time to loop with string keys is %f, and had %i total failed records"%((et-st)/trials,failCounter))
    print("The average write time is %f, and average read time is %f"%(sum(clockTimesWrite)/trials,sum(clockTimesRead)/trials))
    return None

calcTuples()
calcStrings()
Thanks!
The tests are not fairly weighted (hence the timing discrepancies). You are making twice as many calls to string formatting in your writeStrings loop as in your writeTuples loop, and readStrings calls it on every lookup while readTuples never does. For a fairer test you would need to make sure that:
Both write loops make only one call to % per inner loop.
readStrings and readTuples both make either one or zero calls to % per inner loop.
As others have said, the string formatting is the issue.
Here's a quick version that pre-calculates all the strings. On my machine, writing strings is about 27% faster than writing tuples, and writing/reading is about 22% faster.
I quickly reformatted and simplified your code into timeit; with slightly different logic you could compute the difference between reads and writes.
import timeit

samples = []
for x in range(0,360):
    for y in range(0,x):
        i = (x,y)
        samples.append((i, "%s,%s" % i))

def write_tuples():
    k = {}
    for pair in samples:
        k[pair[0]] = True
    return k

def write_strings():
    k = {}
    for pair in samples:
        k[pair[1]] = True
    return k

def read_tuples(k):
    failures = 0
    for pair in samples:
        if k.get(pair[0]) is not None: pass
        else: failures += 1
    return failures

def read_strings(k):
    failures = 0
    for pair in samples:
        if k.get(pair[1]) is not None: pass
        else: failures += 1
    return failures

stmt_t1 = """k = write_tuples()"""
stmt_t2 = """k = write_strings()"""
stmt_t3 = """k = write_tuples()
read_tuples(k)"""
stmt_t4 = """k = write_strings()
read_strings(k)"""

setup = "from __main__ import samples, read_strings, write_strings, read_tuples, write_tuples"
t1 = timeit.Timer(stmt=stmt_t1, setup=setup)
t2 = timeit.Timer(stmt=stmt_t2, setup=setup)
t3 = timeit.Timer(stmt=stmt_t3, setup=setup)
t4 = timeit.Timer(stmt=stmt_t4, setup=setup)

print "write tuples : %s" % t1.timeit(100)
print "write strings : %s" % t2.timeit(100)
print "write/read tuples : %s" % t3.timeit(100)
print "write/read strings : %s" % t4.timeit(100)
I ran your code on a Core i5 1.8 GHz machine and got the following results:
0.076752 vs. 0.085863 tuples to strings for the loop
write 0.049446 vs. 0.050731
read 0.027299 vs. 0.035125
so tuples appear to be winning, but you're doing the string conversion twice in the write function. Changing writeStrings to
def writeStrings():
    k = {}
    for x in range(0,360):
        for y in range(0,x):
            s = "%s,%s" % (x,y)
            k[s] = s
    return k
0.101689 vs. 0.092957 tuples to strings for the loop
write 0.064933 vs. 0.044578
read 0.036748 vs. 0.048371
The first thing to notice is that there's quite a bit of variation in the results, so you may want to change trials=100 to something bigger; recall that Python's timeit defaults to 1,000,000 iterations. I used trials=5000 and got:
0.081944 vs. 0.067829 tuples to strings for the loop
write 0.052264 vs. 0.032866
read 0.029673 vs. 0.034957
so the string version is faster, but as already pointed out in other posts, it's not the dict lookup but the string conversion that hurts.
I would say that the difference in speed is due to the string formatting of the accessor key.
In writeTuples you have this line:
k[(x,y)] = ...
which creates a new tuple, assigns its values (x, y), and passes it to the accessor of k.
In writeStrings you have this line:
k["%s,%s"%(x,y)] = ...
which does all the same computation as writeTuples but also has the overhead of parsing the format string "%s,%s" (this might be done at compile time, I'm not sure) and then building a new string from the numbers (for example "12,15"). I believe it's this that increases the running time.
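One way to check this claim directly (not from the original answers) is to time building the keys themselves, with no dictionary involved; the variable names below are illustrative:

```python
# Sketch isolating the cost of constructing each kind of key.
import timeit

fmt_time = timeit.timeit('"%s,%s" % (x, y)', setup='x, y = 12, 15',
                         number=1_000_000)
tup_time = timeit.timeit('(x, y)', setup='x, y = 12, 15',
                         number=1_000_000)

# Building the string key is typically several times slower than building
# the tuple key, which accounts for most of the gap observed above.
print(fmt_time, tup_time)
```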
