How to achieve the fastest random element lookup in Python?

I have a list that contains 3 lists, holding 8192, 16384, and 16384 16-byte bytes objects respectively. I want to randomly select N elements from this structure. What I have done is this:
mydata = bytes()
for i in range(N):
    sel1 = fastrand.pcg32bounded(3)
    sel2 = fastrand.pcg32bounded(len(mylist[sel1]))
    mydata += mylist[sel1][sel2]
I have used fastrand to speed up random number generation, and it worked. However, I still need more speed. Is there any way I can achieve it? I have also tried dictionaries, like this, but it turned out slower.
# convert list to dictionary
mydict = {}
for i in range(len(mylist)):
    keys = range(len(mylist[i]))
    mydict[i] = dict(zip(keys, mylist[i]))
mydata = bytes()
for i in range(N):
    sel1 = fastrand.pcg32bounded(3)
    sel2 = fastrand.pcg32bounded(len(mydict[sel1]))
    mydata += mydict[sel1][sel2]
I am not forced to have a list as the initial data structure. If any other data structure (maybe sets?) is more useful for my purpose, I can convert the list to another datatype beforehand, no problem.

bytes objects are immutable, so += is slow (each concatenation copies the whole buffer accumulated so far), especially for large N. Try bytearray instead:
mydata = bytearray()
If you really need bytes, convert once at the end:
mydata = bytes(mydata)
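For concreteness, here is the loop from the question with only that change applied (a sketch, untested since I can't run fastrand here):
mydata = bytearray()
for _ in range(N):
    sel1 = fastrand.pcg32bounded(3)  # pick one of the 3 sublists
    sel2 = fastrand.pcg32bounded(len(mylist[sel1]))  # pick an element in it
    mydata += mylist[sel1][sel2]  # appending to a bytearray is amortized O(1)
mydata = bytes(mydata)  # one final copy, if bytes is required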
Put all values into one one-dimensional pool. Duplicating mylist[0] (it holds 8192 elements, half as many as the other two) makes a uniform draw from the pool give every element the same probability as your two-step selection did. Pure Python version using random.choices for demo, as I can't test fastrand:
pool = 2 * mylist[0] + mylist[1] + mylist[2]
mydata = b''.join(random.choices(pool, k=N))
Untested with fastrand:
pool = 2 * mylist[0] + mylist[1] + mylist[2]
size = len(pool)
rand = fastrand.pcg32bounded
mydata = bytearray()
for _ in range(N):
    mydata += pool[rand(size)]
mydata = bytes(mydata)
Another variation, eliminating most Python interpretation:
from itertools import repeat
from operator import itemgetter
pool = 2 * mylist[0] + mylist[1] + mylist[2]
indices = map(fastrand.pcg32bounded,
              repeat(len(pool), N))
mydata = b''.join(itemgetter(*indices)(pool))
(I tested the last solution online with a stand-in for fastrand.)


If the input number is in the list, add its index to a new one

I want to check whether each input number is in the list, and if so, add its index in the original list to a new one; if it's not in the list, add -1.
I tried using a for loop like the one below, but it is bad for the speed of the program.
n = int(input())
k = [int(x) for x in input().split()]
z = []
m = int(input())
for i in range(m):
    a = int(input())
    if a in k: z.append(k.index(a))
    else: z.append(-1)
The input should look like this:
3
2 1 3
1
8
3
And the output should be:
1
-1
2
How can I do what I'm trying to do more efficiently/quickly?
There are many approaches to this problem. This is typical when you're first starting out in programming: the simpler the problem, the more options you have. Choosing among them depends on what you have and what you want.
In this case we're expecting user input of this form:
3
2 1 3
1
8
3
One approach is to build a dict to use for lookups instead of using list operations. Lookup in a dict gives you better performance overall. You can use enumerate to get both the index i and the value x from the list built from user input, then use int(x) as the key and associate it with the index.
The key should always be the data you have, and the value should always be the data you want. (We have a value, we want the index)
n = int(input())
k = {}
for i, x in enumerate(input().split()):
    k[int(x)] = i
z = []
for i in range(n):
    a = int(input())
    if a in k:
        z.append(k[a])
    else:
        z.append(-1)
print(z)
k looks like:
{2: 0, 1: 1, 3: 2}
This way you can call k[3] and it will give you 2 in O(1) or constant time.
(See: Python: List vs Dict for look up table.)
There is a structure known as defaultdict which allows you to specify behaviour when a key is not present in the dictionary. This is particularly helpful in this case, as we can just request from the defaultdict and it will return the desired value either way.
from collections import defaultdict
n = int(input())
k = defaultdict(lambda: -1)
for i, x in enumerate(input().split()):
    k[int(x)] = i
z = []
for i in range(n):
    a = int(input())
    z.append(k[a])
print(z)
While this does not speed up your program, it does make your second for loop easier to read. It also makes it easier to move into the comprehension in the next section.
(See: How does collections.defaultdict work?)
With these things in place, we can use, yes, list comprehensions, to very slightly speed up the construction of z and k. (See: Are list-comprehensions and functional functions faster than “for loops”?)
from collections import defaultdict
n = int(input())
k = defaultdict(lambda: -1)
for i, x in enumerate(input().split()):
    k[int(x)] = i
z = [k[int(input())] for i in range(n)]
print(z)
All code snippets print z as a list:
[1, -1, 2]
See Printing list elements on separated lines in Python if you'd like different print outs.
Note: The index function will find the index of the first occurrence of the value in a list. Because of the way the dict is built, the index of the last occurrence will be stored in k. If you need to mimic index exactly you should ensure that a later index does not overwrite a previous one.
for i, x in enumerate(input().split()):
    x = int(x)
    if x not in k:
        k[x] = i
Adapt this solution for your problem.
def test(list1, value):
    try:
        return list1.index(value)
    except ValueError as e:
        print(e)
        return -1

list1 = [2, 1, 3]
in1 = [1, 8, 3]
res = [test(list1, i) for i in in1]
print(res)
Output:
8 is not in list
[1, -1, 2]

MemoryError with too big a list

I'm writing a script in Python, and I have to create a pretty big list containing exactly 248956422 integers. Some of the "0"s in this table will be changed to 1, 2 or 3, because I have 8 lists: 4 with the beginning positions of genes, and 4 with their end positions.
The point is that I have to iterate over "anno" several times, because the numbers replacing 0 can change on later iterations.
"anno" has to be written to a file to create the annotation file.
Here's my question: how can I divide the work, or do it on the fly, so as not to get a MemoryError, while still replacing the "0"s (and later the 1s, 2s and 3s) with other values?
Maybe by rewriting the file? I'm waiting for your advice; please ask if anything I wrote is unclear. :P
whole_st_gen = [] #to make these lists more clear for example
whole_end_gen = [] # whole_st_gen has element "177"
whole_st_ex = [] # and whole_end_gen has "200" so from position 177to200
whole_end_ex = [] # i need to put "1"
whole_st_mr = [] # of course these list can have even 1kk+ elements
whole_end_mr = [] # note that every st/end of same kind have equal length
whole_st_nc = []
whole_end_nc = [] #these lists are including some values of course
length = 248956422
anno = ['0' for i in range(0,length)] # here i get the memoryerror
#then i wanted to do something like..
for j in range(0, len(whole_st_gen)):
    for y in range(whole_st_gen[j], whole_end_gen[j]):
        anno[y] = '1'
You might be better off determining the value of each element in anno on the fly:
def anno():
    for idx in xrange(248956422):
        elm = "0"
        for j in range(0, len(whole_st_gen)):
            if whole_st_gen[j] <= idx < whole_end_gen[j]:
                elm = "1"
        for j in range(0, len(whole_st_ex)):
            if whole_st_ex[j] <= idx < whole_end_ex[j]:
                elm = "2"
        for j in range(0, len(whole_st_mr)):
            if whole_st_mr[j] <= idx < whole_end_mr[j]:
                elm = "3"
        for j in range(0, len(whole_st_nc)):
            if whole_st_nc[j] <= idx < whole_end_nc[j]:
                elm = "4"
        yield elm
Then you just iterate using for elm in anno().
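Since "anno" has to end up in the annotation file anyway, you can stream the generated values straight to disk without ever materializing the full list (the filename here is just for illustration):
with open("anno.txt", "w") as fh:
    for elm in anno():
        fh.write(elm)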
I got an edit proposal from the OP suggesting one function for each of the whole_*_gen, whole_*_ex (and so on) pairs, something like this:
def anno_ex():
    for idx in xrange(248956422):
        elm = "0"
        for j in range(0, len(whole_st_ex)):
            if whole_st_ex[j] <= idx <= whole_end_ex[j]:
                elm = "2"
        yield elm
That's of course doable, but each generator will only apply the changes from its own lists (here whole_*_ex), and one would need to combine the streams afterwards when writing to the file, which may be a bit awkward:
for a, b, c, d in zip(anno_st(), anno_ex(), anno_mr(), anno_nc()):
if d != "0":
write_to_file(d)
elif c != "0":
write_to_file(c)
elif b != "0":
write_to_file(b)
else:
write_to_file(a)
However if you only want to apply some of the change sets you could write a function that takes them as parameters:
def anno(*args):
    for idx in xrange(248956422):
        elm = "0"
        for st, end, tag in args:
            for j in range(0, len(st)):
                if st[j] <= idx < end[j]:
                    elm = tag
        yield elm
And then call by supplying the lists (for example with only the two first changes):
for tag in anno((whole_st_gen, whole_end_gen, "1"),
                (whole_st_ex, whole_end_ex, "2")):
    write_to_file(tag)
You could use a bytearray object to have a much more compact memory representation than a list of integers:
anno = bytearray(b'\0' * 248956422)
print(anno[0]) # → 0
anno[0] = 2
print(anno[0]) # → 2
print(anno.__sizeof__()) # → 248956447 (on my computer)
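As a sketch of how the intervals could then be applied in bulk (assuming the same whole_st_gen/whole_end_gen lists as in the question), slice assignment on a bytearray runs in C and avoids a per-position Python loop:
anno = bytearray(248956422)  # 248956422 zero bytes
for start, end in zip(whole_st_gen, whole_end_gen):
    anno[start:end] = b'\x01' * (end - start)  # fill the whole interval with 1s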
Instead of creating a list with a list comprehension, I suggest creating an iterator using a generator expression, which produces the numbers on demand instead of keeping all of them in memory. Also, you don't need the i in your loop, since it's just a throwaway variable you never use:
anno = ('0' for _ in range(0,length)) # In python 2.X use xrange() instead of range()
But note that an iterator is a one-shot iterable: you cannot use it again after iterating over it once. If you want to use it multiple times, you can create N independent iterators from it with itertools.tee().
Also note that you cannot change it in place. If you want to change some elements based on a condition, you can create a new iterator by iterating over your iterator and applying the condition in a generator expression.
For example :
new_anno = (do_something(i) for i in anno if some_condition(i))
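A concrete (hypothetical) instance of that pattern, marking every position that falls inside some gene interval and leaving the rest untouched (slow when there are many intervals, but it shows the shape):
gene_ranges = list(zip(whole_st_gen, whole_end_gen))  # assumed pairing of starts/ends
new_anno = ('1' if any(s <= i < e for s, e in gene_ranges) else v
            for i, v in enumerate(anno))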

Most efficient way to remove entries in a list

I have a massive 4D data set, spread across 4 variables: x_list, y_list, z_list, and i_list. Each is a list of N scalars, with X, Y, and Z representing the point's position in space and I its intensity.
I already have a function that picks through and marks negligible points (those whose intensity is too low) for deletion, by setting their intensity to 0. However, when I run this on my 2-million point set, the deletion process takes hours.
Currently, I am using the .pop(index) command to remove the data points, because it does so very cleanly. Here is the code:
counter = 0
i = 0
for entry in i_list:
    if (i_list[i] == 0):
        x_list.pop(i)
        y_list.pop(i)
        z_list.pop(i)
        i_list.pop(i)
        counter += 1
        print(counter, "points removed")
    else:
        i += 1
How can I do this more efficiently?
I think it'll be faster to create new empty lists for each existing list, and append items to them if i_list[i] != 0. Look up the time complexity of the operations you're doing, and you'll see that deleting items is O(n), whereas appending is O(1). Currently you're doing a lot of O(n) deletes with a pretty large n, that will be very slow.
So something like:
new_x = []
new_y = []
new_z = []
new_i = []
for index in range(len(i_list)):
    if i_list[index] != 0:
        new_x.append(x_list[index])
        new_y.append(y_list[index])
        # Etc.
Going further, you should look into numpy arrays, where subsetting to find the set of items where i_list != 0 would be very fast.
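A minimal sketch of that numpy approach (assuming the four lists hold plain numbers):
import numpy as np

x, y, z, inten = map(np.asarray, (x_list, y_list, z_list, i_list))
mask = inten != 0  # boolean mask of the points to keep
x, y, z, inten = x[mask], y[mask], z[mask], inten[mask]  # vectorized filtering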
You should use del:
array = [1, 2, 3]
del array[0]
gives: [2, 3]
And most important: using print() while looping through a large data set is suicide. Most of the time is consumed by printing. Here's an example:
>>> from time import time
>>> def test1(n):
... for i in range(n):
... print(i)
...
>>> def test2(n):
... for i in range(n):
... i += 1
...
>>> def wraper():
... t1 = time()
... test1(1000)
... t2 = time()
... test2(1000)
... t3 = time()
... print("Test1: %s\ntest2: %s: " % (t2-t1, t3-t2))
And the output is:
(lots of numbers)
Test1: 0.46030712127685547
test2: 0.0:
This is a job for the happy list comprehension!
x_prime_list = [x for (index, x) in enumerate(x_list)
if i_list[index] != 0]
This pairs up members of x_list with their ordinal address using enumerate(). It puts each member x into a new list if and only if i_list[index] is not zero (otherwise it adds nothing to the list).
The advantage that list comprehensions have over the equivalent code you posted is that the looping and appending is handled in C rather than needing Python to do these tasks.
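Applying the same idea to all four lists in one pass might look like this (a sketch: compute the surviving indices once, then rebuild each list):
keep = [idx for idx, inten in enumerate(i_list) if inten != 0]
x_list = [x_list[idx] for idx in keep]
y_list = [y_list[idx] for idx in keep]
z_list = [z_list[idx] for idx in keep]
i_list = [i_list[idx] for idx in keep]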

Find the smallest positive number not in list

I have a list in python like this:
myList = [1,14,2,5,3,7,8,12]
How can I easily find the first unused value? (in this case '4')
I came up with several different ways:
Iterate the first number not in set
I didn't want to get the shortest code (which might be the set-difference trickery) but something that could have a good running time.
This might be one of the best proposed here; my tests show that it can be substantially faster than the set-difference approach, especially if the hole is near the beginning:
from itertools import count, filterfalse # ifilterfalse on py2
A = [1,14,2,5,3,7,8,12]
print(next(filterfalse(set(A).__contains__, count(1))))
The array is turned into a set, whose __contains__(x) method corresponds to x in A. count(1) creates a counter that starts counting from 1 to infinity. Now, filterfalse consumes numbers from the counter until one is found that is not in the set; that first missing number is then returned by next().
Timing for len(a) = 100000, randomized and the sought-after number is 8:
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
0.9200698399945395
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
3.1420603669976117
Timing for len(a) = 100000, ordered and the first free is 100001
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
1.520096342996112
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
1.987783643999137
(note that this is Python 3 and range is the py2 xrange)
Use heapq
The asymptotically good answer: heapq with enumerate
from heapq import heapify, heappop
from functools import partial
# A = [1,2,3] also works
A = [1,14,2,5,3,7,8,12]
end = 2 ** 61 # these are different and neither of them can be the
sentinel = 2 ** 62 # first gap (unless you have 2^64 bytes of memory).
heap = list(A)
heap.append(end)
heapify(heap)
print(next(n for n, v in enumerate(
    iter(partial(heappop, heap), sentinel), 1) if n != v))
Now, the one above could be the preferred solution if written in C, but heapq is written in Python and most probably slower than many other alternatives that mainly use C code.
Just sort and enumerate to find the first not matching
Or the simple answer with good constants for O(n lg n)
next(i for i, e in enumerate(sorted(A) + [ None ], 1) if i != e)
This might be fastest of all if the list is almost sorted because of how the Python Timsort works, but for randomized the set-difference and iterating the first not in set are faster.
The + [ None ] is necessary for the edge cases of there being no gaps (e.g. [1,2,3]).
This makes use of the property of sets
>>> l = [1,2,3,5,7,8,12,14]
>>> m = range(1, len(l) + 2)
>>> min(set(m)-set(l))
4
I would suggest using a generator with enumerate to determine the missing element:
>>> next(a for a, b in enumerate(myList, myList[0]) if a != b)
4
enumerate maps the index to the element, so your goal is to find the element that differs from its index.
Note, I am also assuming that the elements may not start at a definite value (in this case 1); if they do, you can simplify the expression further:
>>> next(a for a, b in enumerate(myList, 1) if a != b)
4
A for loop with the list will do it.
l = [1,14,2,5,3,7,8,12]
for i in range(1, max(l) + 2):
    if i not in l: break
print(i)  # result 4
Don't know how efficient, but why not use an xrange as a mask and use set minus?
>>> myList = [1,14,2,5,3,7,8,12]
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
4
You're only creating a set as big as myList, so it can't be that bad :)
This won't work for "full" lists:
>>> myList = range(1, 5)
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: min() arg is an empty sequence
But the fix to return the next value is simple (add one more to the masked set):
>>> min(set(xrange(1, len(myList) + 2)) - set(myList))
5
import itertools as it
next(i for i in it.count() if i not in mylist)
I like this because it reads very closely to what you're trying to do: "start counting, keep going until you reach a number that isn't in the list, then tell me that number". However, this is quadratic since testing i not in mylist is linear.
Solutions using enumerate are linear, but rely on the list being sorted and no value being repeated. Sorting first makes it O(n log n) overall, which is still better than quadratic. However, if you can assume the values are distinct, then you could put them into a set first:
myset = set(mylist)
next(i for i in it.count() if i not in myset)
Since set containment checks are roughly constant time, this will be linear overall.
I just solved this in a probably non-Pythonic way:
def solution(A):
    # Const-ish to improve readability
    MIN = 1
    if not A: return MIN
    # Save re-computing MAX
    MAX = max(A)
    # Loop over all entries with minimum of 1 starting at 1
    for num in range(1, MAX):
        # going for greatest missing number return optimistically (minimum)
        # If order needs to switch, then use max as start and count backwards
        if num not in A: return num
    # In case the max is < 0 double wrap max with minimum return value
    return max(MIN, MAX + 1)
I think it reads quite well
My effort, no itertools. It sets current to one less than the value you are expecting.
lst = [1,2,3,4,5,7,8]
current = lst[0] - 1
for i in lst:
    if i != current + 1:
        print(current + 1)
        break
    current = i
The naive way is to traverse the list, which is an O(n) solution. However, since the list is sorted, you can use this property to perform a modified binary search. Basically, you are looking for the last occurrence of A[i] == i.
The pseudo algorithm will be something like:
binarysearch(A):
    result = 0  # if no A[i] == i exists, the answer is 1
    start = 0
    end = len(A) - 1
    while start <= end:
        mid = (start + end) / 2
        if A[mid] == mid:
            result = A[mid]
            start = mid + 1
        else:  # A[mid] > mid, since there is no way A[mid] is less than mid
            end = mid - 1
    return result + 1
This is an O(log n) solution. I assumed lists are one indexed. You can modify the indices accordingly
EDIT: if the list is not sorted, you can use the heapq python library and store the list in a min-heap and then pop the elements one by one
pseudo code
H = heapify(A)  # assuming A is the list
count = 1
for i in range(len(A)):
    if H.pop() != count: return count
    count += 1
sort + reduce to the rescue!
from functools import reduce # python3
myList = [1,14,2,5,3,7,8,12]
res = 1 + reduce(lambda x, y: x if y-x>1 else y, sorted(myList), 0)
print(res)
Unfortunately it won't stop once a match is found and will iterate the whole list.
Faster (but less fun) is to use a for loop:
myList = [1,14,2,5,3,7,8,12]
res = 0
for num in sorted(myList):
    if num - res > 1:
        break
    res = num
res = res + 1
print(res)
You can try this:
for i in range(1,max(arr1)+2):
    if i not in arr1:
        print(i)
        break
Easy to read, easy to understand, gets the job done:
def solution(A):
    smallest = 1
    unique = set(A)
    for n in sorted(unique):  # sets have no guaranteed order, so sort first
        if n == smallest:
            smallest += 1
    return smallest
Keep incrementing a counter in a loop until you find the first positive integer that's not in the list.
def getSmallestIntNotInList(number_list):
"""Returns the smallest positive integer that is not in a given list"""
i = 0
while True:
i += 1
if i not in number_list:
return i
print(getSmallestIntNotInList([1,14,2,5,3,7,8,12]))
# 4
I found that this had the fastest performance compared to other answers on this post. I tested using timeit in Python 3.10.8. My performance results can be seen below:
import timeit
def findSmallestIntNotInList(number_list):
    # Infinite while-loop until first number is found
    i = 0
    while True:
        i += 1
        if i not in number_list:
            return i
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.038100800011307 seconds
import timeit
def findSmallestIntNotInList(number_list):
    # Loop with a range to len(number_list)+1
    for i in range(1, len(number_list) + 1):
        if i not in number_list:
            return i
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.05068870005197823 seconds
import timeit
def findSmallestIntNotInList(number_list):
    # Loop with a range to max(number_list) (by silgon)
    # https://stackoverflow.com/a/49649558/3357935
    for i in range(1, max(number_list)):
        if i not in number_list:
            return i
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06317249999847263 seconds
import timeit
from itertools import count, filterfalse
def findSmallestIntNotInList(number_list):
    # Iterate the first number not in set (by Antti Haapala -- Слава Україні)
    # https://stackoverflow.com/a/28178803/3357935
    return next(filterfalse(set(number_list).__contains__, count(1)))
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06515420007053763 seconds
import timeit
def findSmallestIntNotInList(number_list):
    # Use property of sets (by Bhargav Rao)
    # https://stackoverflow.com/a/28176962/3357935
    m = range(1, len(number_list))
    return min(set(m) - set(number_list))
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.08586219989228994 seconds
The easiest way would be just to loop through the sorted list and check whether the index equals the value; if not, return the index as the solution.
This has complexity O(n log n) because of the sorting:
for index, value in enumerate(sorted(myList), 1):
    if index != value:
        print(index)
        break
Another option is to use Python sets, which are somewhat like dictionaries without values, just keys. In a dictionary you can look up a key in constant time, which makes the whole solution look like the following, having only linear complexity O(n):
mySet = set(myList)
for i in range(len(mySet)):
    if i not in mySet:
        print(i)
        break
Edit:
If the solution should also deal with lists where no number is missing (e.g. [0,1]) and output the next following number and should also correctly consider 0, then a complete solution would be:
def find_smallest_positive_number_not_in_list(myList):
    mySet = set(myList)
    for i in range(1, max(mySet) + 2):
        if i not in mySet:
            return i
A solution that returns all those values is
free_values = set(range(1, max(L))) - set(L)
it does a full scan, but those loops are implemented in C and unless the list or its maximum value are huge this will be a win over more sophisticated algorithms performing the looping in Python.
Note that if this search is needed to implement "reuse" of IDs, then keeping a free list around and maintaining it up to date (i.e. adding numbers to it when deleting entries and picking from it when reusing entries) is often a good idea, as in the sketch below.
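A minimal sketch of that free-list idea (the class name and API are made up for illustration), using a heap so the smallest released ID is always reused first:
import heapq

class IdAllocator:
    def __init__(self):
        self.next_id = 1
        self.free = []  # min-heap of released IDs

    def acquire(self):
        if self.free:
            return heapq.heappop(self.free)  # reuse the smallest freed ID
        nid = self.next_id
        self.next_id += 1
        return nid

    def release(self, id_):
        heapq.heappush(self.free, id_)  # make the ID available again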
The following solution loops all numbers in between 1 and the length of the input list and breaks the loop whenever a number is not found inside it. Otherwise the result is the length of the list plus one.
listOfNumbers=[1,14,2,5,3,7,8,12]
for i in range(1, len(listOfNumbers) + 1):
    if not i in listOfNumbers:
        nextNumber = i
        break
else:
    nextNumber = len(listOfNumbers) + 1

Why are string keys in python dictionaries slower to write/read than tuples?

In trying to optimize the speed of a program that mimics a tree structure (the "tree" is stored in a dict with Cartesian (x, y) coordinate pairs as keys), I have found that storing the points' unique addresses in the dictionary as tuples, rather than strings, results in substantially faster run time.
My question is: if Python is optimized for string keys in dictionaries and hashing, why is using tuples so much faster in this example? String keys seem to take 60% longer at the exact same task. Am I overlooking something simple in my example?
I was referencing this thread as the basis for my question (as well as others that make the same assertion that strings are faster): Is it always faster to use string as key in a dict?
Below is the code I was using to test the methods, and time them:
import time
def writeTuples():
    k = {}
    for x in range(0, 500):
        for y in range(0, x):
            k[(x, y)] = "%s,%s" % (x, y)
    return k

def readTuples(k):
    failures = 0
    for x in range(0, 500):
        for y in range(0, x):
            if k.get((x, y)) is not None: pass
            else: failures += 1
    return failures

def writeStrings():
    k = {}
    for x in range(0, 500):
        for y in range(0, x):
            k["%s,%s" % (x, y)] = "%s,%s" % (x, y)
    return k

def readStrings(k):
    failures = 0
    for x in range(0, 500):
        for y in range(0, x):
            if k.get("%s,%s" % (x, y)) is not None: pass
            else: failures += 1
    return failures

def calcTuples():
    clockTimesWrite = []
    clockTimesRead = []
    failCounter = 0
    trials = 100
    st = time.clock()
    for x in range(0, trials):
        startLoop = time.clock()
        k = writeTuples()
        writeTime = time.clock()
        failCounter += readTuples(k)
        readTime = time.clock()
        clockTimesWrite.append(writeTime - startLoop)
        clockTimesRead.append(readTime - writeTime)
    et = time.clock()
    print("The average time to loop with tuple keys is %f, and had %i total failed records" % ((et - st) / trials, failCounter))
    print("The average write time is %f, and average read time is %f" % (sum(clockTimesWrite) / trials, sum(clockTimesRead) / trials))
    return None

def calcStrings():
    clockTimesWrite = []
    clockTimesRead = []
    failCounter = 0
    trials = 100
    st = time.clock()
    for x in range(0, trials):
        startLoop = time.clock()
        k = writeStrings()
        writeTime = time.clock()
        failCounter += readStrings(k)
        readTime = time.clock()
        clockTimesWrite.append(writeTime - startLoop)
        clockTimesRead.append(readTime - writeTime)
    et = time.clock()
    print("The average time to loop with string keys is %f, and had %i total failed records" % ((et - st) / trials, failCounter))
    print("The average write time is %f, and average read time is %f" % (sum(clockTimesWrite) / trials, sum(clockTimesRead) / trials))
    return None

calcTuples()
calcStrings()
Thanks!
The tests are not fairly weighted (hence the timing discrepancies). You are making twice as many calls to format in your writeStrings loop as in your writeTuples loop and you are making infinitely more calls to it in readStrings. To be a fairer test you would need to make sure that:
Both write loops only make one call to % per inner loop
That readStrings and readTuples both make either one or zero calls to % per inner loop.
As others said, the string formatting is the issue.
Here's a quick version that pre-calculates all the strings.
On my machine, writing strings is about 27% faster than writing tuples, and write/read is about 22% faster.
I just quickly reformatted and simplified your code into timeit. If the logic were a bit different, you could calculate the difference in reads vs. writes.
import timeit

samples = []
for x in range(0, 360):
    for y in range(0, x):
        i = (x, y)
        samples.append((i, "%s,%s" % i))

def write_tuples():
    k = {}
    for pair in samples:
        k[pair[0]] = True
    return k

def write_strings():
    k = {}
    for pair in samples:
        k[pair[1]] = True
    return k

def read_tuples(k):
    failures = 0
    for pair in samples:
        if k.get(pair[0]) is not None: pass
        else: failures += 1
    return failures

def read_strings(k):
    failures = 0
    for pair in samples:
        if k.get(pair[1]) is not None: pass
        else: failures += 1
    return failures

stmt_t1 = """k = write_tuples()"""
stmt_t2 = """k = write_strings()"""
stmt_t3 = """k = write_tuples()
read_tuples(k)"""
stmt_t4 = """k = write_strings()
read_strings(k)"""

setup = "from __main__ import samples, read_strings, write_strings, read_tuples, write_tuples"
t1 = timeit.Timer(stmt=stmt_t1, setup=setup)
t2 = timeit.Timer(stmt=stmt_t2, setup=setup)
t3 = timeit.Timer(stmt=stmt_t3, setup=setup)
t4 = timeit.Timer(stmt=stmt_t4, setup=setup)

print("write tuples : %s" % t1.timeit(100))
print("write strings : %s" % t2.timeit(100))
print("write/read tuples : %s" % t3.timeit(100))
print("write/read strings : %s" % t4.timeit(100))
I ran your code on a Core i5 1.8GHz machine and got the following results:
0.076752 vs. 0.085863 tuples to strings for the loop
write 0.049446 vs. 0.050731
read 0.027299 vs. 0.035125
so tuples appear to be winning, but you're doing the string conversion twice in the write function. Changing writeStrings to
def writeStrings():
    k = {}
    for x in range(0, 360):
        for y in range(0, x):
            s = "%s,%s" % (x, y)
            k[s] = s
    return k
0.101689 vs. 0.092957 tuples to strings for the loop
write 0.064933 vs. 0.044578
read 0.036748 vs. 0.048371
The first thing to notice is that there's quite a bit of variation in the results, so you may want to change trials=100 to something bigger (recall that Python's timeit module does 1,000,000 repetitions by default). I did trials=5000:
0.081944 vs. 0.067829 tuples to strings for the loop
write 0.052264 vs. 0.032866
read 0.029673 vs. 0.034957
so the string version is faster, but as already pointed out in other posts, it's not the dict lookup; it's the string conversion that's hurting.
I would say that the difference in speed is due to the string formatting of the accessor key.
In writeTuples you have this line:
k[(x,y)] = ...
This creates a new tuple and assigns its values (x, y) before passing it to the accessor of k.
In writeStrings you have this line:
k["%s,%s"%(x,y)] = ...
This does all the same computations as in writeTuples, but also has the overhead of parsing the format string "%s,%s" (this might be done at compile time, I'm not sure) and then building a new string from the numbers (for example "12,15"). I believe it's this that increases the running time.
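To separate the two costs, here is a quick (assumed, not from the original post) timeit comparison of a lookup that formats its key on the fly versus one that uses a prebuilt key; the gap between the two is roughly the formatting overhead:
import timeit

setup = "d = {'%s,%s' % (x, y): None for x in range(500) for y in range(x)}"
fmt_and_get = timeit.timeit("d.get('%s,%s' % (250, 100))", setup=setup, number=10**6)
get_only = timeit.timeit("d.get('250,100')", setup=setup, number=10**6)
print(fmt_and_get, get_only)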
