Efficient keys for dictionaries - python

I'm new to posting here, so I hope I provide enough detail. I'm trying to find out whether key selection affects the efficiency of dictionaries in Python. Some comparisons I'm thinking of are:
numbers vs strings (e.g. would my_dict[20] be faster than my_dict['twenty'])
len of strings (e.g. would my_dict['a'] be faster than my_dict['abcdefg'])
mixing key types within a dictionary, for instance using numbers, strings, and/or tuples (e.g. would my_dict = {0: 'zero', 2: 'two'} perform faster than {0: 'zero', 'two': 2})
I haven't been able to find this topic from a google search, so I thought maybe someone here might know.

First of all, I'd recommend reading How are Python's Built In Dictionaries Implemented.
Now, let's run a little experiment to test the theory (at least partially):
import timeit
import string
import random
import time


def random_str(N):
    return ''.join(
        random.choice(string.ascii_uppercase + string.digits) for _ in range(N)
    )


def experiment1(dct, keys):
    s = time.time()
    {dct[k] for k in keys}
    return time.time() - s


if __name__ == "__main__":
    N = 10000
    K = 200
    S = 50000

    dct1 = {random_str(K): None for k in range(N)}
    dct2 = {i: None for i in range(N)}

    keys1 = list(dct1.keys())
    keys2 = list(dct2.keys())

    samples1 = []
    samples2 = []
    for i in range(S):
        samples1.append(experiment1(dct1, keys1))
        samples2.append(experiment1(dct2, keys2))

    print(sum(samples1), sum(samples2))

    # print(
    #     timeit.timeit('{dct1[k] for k in keys1}', setup='from __main__ import dct1, keys1')
    # )
    # print(
    #     timeit.timeit('{dct2[k] for k in keys2}', setup='from __main__ import dct2, keys2')
    # )
The results I've got with different sampling sizes on my box were:
N=10000, K=200, S=100 => 0.08300423622131348 0.07200479507446289
N=10000, K=200, S=1000 => 0.885051965713501 0.7120392322540283
N=10000, K=200, S=10000 => 8.88549256324768 7.005417346954346
N=10000, K=200, S=50000 => 43.57453536987305 34.82594871520996
So as you can see, whether you use big random strings or integers to look up dictionaries, the performance remains almost the same. The only "real" difference to consider is the memory consumption of the two dictionaries. That may be relevant when dumping/loading huge dictionaries to/from disk; in that case a more compact form of your data structures can shave off a few seconds when caching/reading them.
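To put a rough number on that memory claim, here is a small sketch (not part of the original experiment) using sys.getsizeof; note that getsizeof only counts the dict and the key objects themselves, not anything they reference:
import sys
import random
import string

def random_str(n):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(n))

N, K = 10000, 200
dct_str = {random_str(K): None for _ in range(N)}  # long string keys
dct_int = {i: None for i in range(N)}              # small int keys

size_str = sys.getsizeof(dct_str) + sum(sys.getsizeof(k) for k in dct_str)
size_int = sys.getsizeof(dct_int) + sum(sys.getsizeof(k) for k in dct_int)
print('string keys: ~%.1f MB' % (size_str / 1e6))
print('int keys:    ~%.1f MB' % (size_int / 1e6))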
PS: If anyone can explain why I was getting really huge times when using timeit (the commented-out parts), please let me know... even with small experiment constants I'd get really high values... that's why I left it commented out. Add a comment if you know the reason ;D

I don't know the answer either but it's easy enough to test.
from timeit import default_timer as timer
import string

# Create a few dictionaries, all the values are None
nums = range(1, 27)
dict_numbers = dict.fromkeys(nums)

letters = string.ascii_lowercase
dict_singleLetter = dict.fromkeys(letters)

long_names = []
for letter in letters:
    long_names.append(''.join([letter, letters]))
dict_longNames = dict.fromkeys(long_names)


# Function to time thousands of runs to average out discrepancies
def testSpeed(dictionary, keys):
    x = None
    start = timer()
    for _ in range(1, 100000):
        for i in keys:
            x = dictionary[i]
    end = timer()
    return str(end - start)


# Time the different dictionaries
print("Number took " + testSpeed(dict_numbers, nums) + " seconds")
print("Single letters took " + testSpeed(dict_singleLetter, letters) + " seconds")
print("Long keys took " + testSpeed(dict_longNames, long_names) + " seconds")
All of these dictionaries are the same length and contain the same value for each key. When I ran this, the dictionary with the long keys was actually always the fastest, albeit by only about 5%, which could possibly be accounted for by other minor differences I am unaware of. Numbers and single letters were pretty close in speed, but numbers were generally barely faster than single letters. Hopefully this answers your question, and this code should be easy enough to expand to test mixed cases (see the sketch below), but I'm out of time at the moment.
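A possible extension for the mixed-key case (a sketch reusing testSpeed and the variables defined above; dict_mixed and mixed_keys are names I made up):
# Half integer keys, half single-letter string keys, same size as the other dictionaries
mixed_keys = list(nums[:13]) + list(letters[:13])
dict_mixed = dict.fromkeys(mixed_keys)
print("Mixed keys took " + testSpeed(dict_mixed, mixed_keys) + " seconds")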

Related

Optimizing the run time of the nested for loop

I am just getting started with competitive programming, and after writing a solution to a certain problem I got a "runtime limit exceeded" error. The expression to maximize is
max(|a[i] - a[j]| + |i - j|)
where a is a list of elements and i, j are indices; I need to get the max() of the above expression.
Here is a short but complete code snippet.
t = int(input())  # Number of test cases
for i in range(t):
    n = int(input())  # size of list
    a = list(map(int, str(input()).split()))  # getting space separated input
    res = []
    for s in range(n):  # These two loops are increasing the run-time
        for d in range(n):
            res.append(abs(a[s] - a[d]) + abs(s - d))
    print(max(res))
Input file: the link may expire (hope it works).
1<=t<=100
1<=n<=10^5
0<=a[i]<=10^5
The run-time on the leaderboard for C is 5 sec and for Python it is 35 sec, while this code takes 80 sec.
It is an online judge, so the timing is machine-independent. numpy is not available.
Please keep it simple; I am new to Python.
Thanks for reading.
For a given j<=i, |a[i]-a[j]|+|i-j| = max(a[i]-a[j]+i-j, a[j]-a[i]+i-j).
Thus for a given i, the value of j<=i that maximizes |a[i]-a[j]|+|i-j| is either the j that maximizes a[j]-j or the j that minimizes a[j]+j.
Both these values can be computed as you run along the array, giving a simple O(n) algorithm:
def maxdiff(xs):
    mp = mn = xs[0]
    best = 0
    for i, x in enumerate(xs):
        mp = max(mp, x - i)
        mn = min(mn, x + i)
        best = max(best, x + i - mn, -x + i + mp)
    return best
And here's some simple testing against a naive but obviously correct algorithm:
def maxdiff_naive(xs):
    best = 0
    for i in xrange(len(xs)):
        for j in xrange(i + 1):
            best = max(best, abs(xs[i] - xs[j]) + abs(i - j))
    return best

import random
for _ in xrange(500):
    r = [random.randrange(1000) for _ in xrange(50)]
    md1 = maxdiff(r)
    md2 = maxdiff_naive(r)
    if md1 != md2:
        print "%d != %d\n%s" % (md1, md2, r)
        break
It takes a fraction of a second to run maxdiff on an array of size 10^5, which is significantly better than your reported leaderboard scores.
"Competitive programming" is not about saving a few milliseconds by using a different kind of loop; it's about being smart about how you approach a problem, and then implementing the solution efficiently.
Still, one thing that jumps out is that you are wasting time building a list only to scan it to find the max. Your double loop can be transformed to the following (ignoring other possible improvements):
print(max(abs(a[s] - a[d]) + abs(s - d) for s in range(n) for d in range(n)))
But that's small fry. Worry about your algorithm first, and only then turn to obvious time-wasters like this. You can cut the number of comparisons in half, as @Brett showed you (see the sketch below), but I would first study the problem and ask myself: do I really need to calculate this quantity n^2 times, or even 0.5*n^2 times? That's how you get the times down, not by shaving off milliseconds.
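For illustration, here is a sketch of that halved double loop (reusing the question's a and n; it is still O(n^2), so the O(n) algorithm above is what actually matters):
# For d >= s we have |s - d| == d - s, so each unordered pair is visited only once.
print(max(abs(a[s] - a[d]) + (d - s) for s in range(n) for d in range(s, n)))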

Fastest way to get sorted unique list in python?

What is the fastest way to get a sorted, unique list in Python? (I have a list of hashable things, and want to have something I can iterate over - it doesn't matter whether the list is modified in place, or I get a new list, or an iterable. In my concrete use case, I'm doing this with a throwaway list, so in place would be more memory efficient.)
I've seen solutions like
input = [5, 4, 2, 8, 4, 2, 1]
sorted(set(input))
but it seems to me that first checking for uniqueness and then sorting is wasteful (since when you sort the list, you basically have to determine insertion points, and thus get the uniqueness test as a side effect). Maybe there is something more along the lines of unix's
cat list | sort | uniq
that just picks out consecutive duplications in an already sorted list?
Note: in the question 'Fastest way to uniqify a list in Python' the list is not sorted, and 'What is the cleanest way to do a sort plus uniq on a Python list?' asks for the cleanest / most Pythonic way, and the accepted answer suggests sorted(set(input)), which I'm trying to improve on.
I believe sorted(set(sequence)) is the fastest way of doing it.
Yes, set iterates over the sequence, but that's a C-level loop, which is a lot faster than any looping you would do at the Python level.
Note that even with groupby you still have O(n) + O(n log n) = O(n log n), and what's worse is that groupby requires a Python-level loop, which dramatically increases the constants in that O(n), so in the end you get worse results.
When speaking of CPython, the way to optimize things is to do as much as you can at the C level (see this answer for another example of counter-intuitive performance). To get a faster solution you would have to reimplement the sort as a C extension. And even then, good luck getting something as fast as Python's Timsort!
A small comparison of the "canonical solution" versus the groupby solution:
>>> import timeit
>>> sequence = list(range(500)) + list(range(700)) + list(range(1000))
>>> timeit.timeit('sorted(set(sequence))', 'from __main__ import sequence', number=1000)
0.11532402038574219
>>> import itertools
>>> def my_sort(seq):
...     return list(k for k, _ in itertools.groupby(sorted(seq)))
...
>>> timeit.timeit('my_sort(sequence)', 'from __main__ import sequence, my_sort', number=1000)
0.3162040710449219
As you can see it's 3 times slower.
The version provided by jdm is actually even worse:
>>> def make_unique(lst):
...     if len(lst) <= 1:
...         return lst
...     last = lst[-1]
...     for i in range(len(lst) - 2, -1, -1):
...         item = lst[i]
...         if item == last:
...             del lst[i]
...         else:
...             last = item
...
>>> def my_sort2(seq):
...     make_unique(sorted(seq))
...
>>> timeit.timeit('my_sort2(sequence)', 'from __main__ import sequence, my_sort2', number=1000)
0.46814608573913574
Almost 5 times slower.
Note that using seq.sort() followed by make_unique(seq), and make_unique(sorted(seq)), are actually the same thing: since Timsort uses O(n) space you always have some reallocation, so using sorted(seq) does not change the timings much.
jdm's benchmarks give different results because the inputs he is using are way too small, so all the time is taken up by the time.clock() calls.
Maybe this is not the answer you are searching for, but anyway, you should take this into consideration.
Basically, you have 2 operations on a list:
unique_list = set(your_list) # O(n) complexity
sorted_list = sorted(unique_list) # O(nlogn) complexity
Now, you say "it seems to me that first checking for uniqueness and then sorting is wasteful", and you are right. But how bad, really, is that redundant step? Take n = 1,000,000:
# sorted(set(a_list))
O(n)       => 1,000,000
O(n log n) => 1,000,000 * 20 = 20,000,000
Total      => 21,000,000
# Your fastest way
O(n log n) => 20,000,000
Total      => 20,000,000
Speed gain: (1 - 20,000,000/21,000,000) * 100 = 4.76 %
For n = 5000000, speed gain: ~1.6 %
Now, is that optimization worth it?
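As a rough sanity check of that estimate (a sketch, not part of the original answer; absolute numbers will vary by machine and input size), you can time the set() step on its own and see what fraction of the combined sorted(set(...)) call it accounts for:
import timeit

# A list with duplicates, similar to the benchmark earlier in this thread.
seq = list(range(500)) + list(range(700)) + list(range(1000))

t_set = timeit.timeit('set(seq)', globals={'seq': seq}, number=10000)
t_both = timeit.timeit('sorted(set(seq))', globals={'seq': seq}, number=10000)
print('set() alone   : %.3f s' % t_set)
print('sorted(set()) : %.3f s' % t_both)
print('set() share   : %.0f %%' % (100 * t_set / t_both))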
This is just something I whipped up in a couple minutes. The function modifies a list in place, and removes consecutive repeats:
def make_unique(lst):
    if len(lst) <= 1:
        return lst
    last = lst[-1]
    for i in range(len(lst) - 2, -1, -1):
        item = lst[i]
        if item == last:
            del lst[i]
        else:
            last = item
Some representative input data:
inp = [
    (u"Tomato", "de"), (u"Cherry", "en"), (u"Watermelon", None), (u"Apple", None),
    (u"Cucumber", "de"), (u"Lettuce", "de"), (u"Tomato", None), (u"Banana", None),
    (u"Squash", "en"), (u"Rubarb", "de"), (u"Lemon", None),
]
Make sure both variants work as wanted:
print inp
print sorted(set(inp))
# copy because we want to modify it in place
inp1 = inp[:]
inp1.sort()
make_unique(inp1)
print inp1
Now to the testing. I'm not using timeit, since I don't want to time the copying of the list, only the sorting. time1 is sorted(set(...)), time2 is list.sort() followed by make_unique, and time3 is the solution with itertools.groupby by Avinash Y.
import time

def time1(number):
    total = 0
    for i in range(number):
        start = time.clock()
        sorted(set(inp))
        total += time.clock() - start
    return total

def time2(number):
    total = 0
    for i in range(number):
        inp1 = inp[:]
        start = time.clock()
        inp1.sort()
        make_unique(inp1)
        total += time.clock() - start
    return total

import itertools

def time3(number):
    total = 0
    for i in range(number):
        start = time.clock()
        list(k for k, _ in itertools.groupby(sorted(inp)))
        total += time.clock() - start
    return total
sort + make_unique is approximately as fast as sorted(set(...)). I'd have to do a couple more iterations to see which one is potentially faster, but within the variations they are very similar. The itertools version is a bit slower.
# done each 3 times
print time1(100000)
# 2.38, 3.01, 2.59
print time2(100000)
# 2.88, 2.37, 2.6
print time3(100000)
# 4.18, 4.44, 4.67
Now with a larger list (the + str(i) is to prevent duplicates):
old_inp = inp[:]
inp = []
for i in range(100):
    for j in old_inp:
        inp.append((j[0] + str(i), j[1]))
print time1(10000)
# 40.37
print time2(10000)
# 35.09
print time3(10000)
# 40.0
Note that if there are a lot of duplicates in the list, the first version is much faster (since it does less sorting).
inp = []
for i in range(100):
    for j in old_inp:
        #inp.append((j[0] + str(i), j[1]))
        inp.append((j[0], j[1]))
print time1(10000)
# 3.52
print time2(10000)
# 26.33
print time3(10000)
# 20.5
import numpy as np
np.unique(...)
The np.unique function takes an array-like parameter and returns an ndarray that is unique and sorted. This works with any numpy types, but also with regular Python values that are orderable.
If you need a regular python list, use np.unique(...).tolist()
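A minimal illustration (not from the original answer; assumes numpy is installed):
import numpy as np

data = [5, 4, 2, 8, 4, 2, 1]
print(np.unique(data))           # array([1, 2, 4, 5, 8]) -- sorted, duplicates removed
print(np.unique(data).tolist())  # [1, 2, 4, 5, 8] -- back to a plain Python list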
>>> import itertools
>>> a=[2,3,4,1,2,7,8,3]
>>> list(k for k,_ in itertools.groupby(sorted(a)))
[1, 2, 3, 4, 7, 8]

Python append performance

I'm having some performance problems with 'append' in Python.
I'm writing an algorithm that checks if there are two overlapping circles in a (large) set of circles.
I start by putting the extreme points of the circles (x_i-R_i & x_i+R_i) in a list and then sorting the list.
class Circle:
    def __init__(self, middle, radius):
        self.m = middle
        self.r = radius
In between I generate N random circles and put them in the 'circles' list.
"""
Makes a list with all the extreme points of the circles.
Format = [Extreme, left/right ~ 0/1 extreme, index]
Separate function for performance reasons, python handles local variables faster.
Garbage collection is temporarily disabled since a bug in Python makes list.append run in O(n) time instead of O(1)
"""
def makeList():
    """gc.disable()"""
    list = []
    append = list.append
    for circle in circles:
        append([circle.m[0] - circle.r, 0, circles.index(circle)])
        append([circle.m[0] + circle.r, 1, circles.index(circle)])
    """gc.enable()"""
    return list
When running this with 50k circles it takes over 75 seconds to generate the list. As you can see in the comments, I disabled garbage collection, put the code in a separate function, and used
append = list.append
append(foo)
instead of just
list.append(foo)
I disabled gc since, after some searching, it seemed there's a bug in Python causing append to run in O(n) instead of O(1) time.
So is this way the fastest way or is there a way to make this run faster?
Any help is greatly appreciated.
Instead of
for circle in circles:
    ... circles.index(circle) ...
use
for i, circle in enumerate(circles):
    ... i ...
This could decrease your O(n^2) to O(n).
Your whole makeList could be written as:
sum([[[circle.m[0]-circle.r, 0, i], [circle.m[0]+circle.r, 1, i]] for i, circle in enumerate(circles)], [])
Your performance problem is not in the append() method, but in your use of circles.index(), which makes the whole thing O(n^2).
A further (comparatively minor) improvement is to use a list comprehension instead of list.append():
mylist = [[circle.m[0] - circle.r, 0, i]
for i, circle in enumerate(circles)]
mylist += [[circle.m[0] + circle.r, 1, i]
for i, circle in enumerate(circles)]
Note that this will give the data in a different order (which should not matter as you are planning to sort it anyway).
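If the original interleaved order mattered, a single comprehension (a sketch, not part of the answer above) can emit both entries per circle in one pass:
# One pass over circles; yields [min extreme, 0, i] followed by [max extreme, 1, i] for each circle.
mylist = [entry
          for i, circle in enumerate(circles)
          for entry in ([circle.m[0] - circle.r, 0, i],
                        [circle.m[0] + circle.r, 1, i])]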
I've just tried several tests to improve the speed of "append". They will definitely be helpful for you.
Using Python
Using list(map(lambda ...)) - known as a slightly faster approach than for+append
Using Cython
Using Numba - jit
CODE CONTENT: take the numbers from 0 to 9999999, square them, and put them into a new list using append.
Using Python
import timeit

st1 = timeit.default_timer()

def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result

f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2 - st1))
RUN TIME : 3.7 s
Using list(map(lambda ...))
import timeit
st1 = timeit.default_timer()
result = list(map(lambda x : x**2 , range(0,10000000) ))
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.6 s
Using Cython
the code in a .pyx file
def f1():
    cdef double i
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
and I compiled it and ran it from a .py file.
in the .py file
import timeit
from c1 import *
st1 = timeit.default_timer()
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 1.6 s
Using Numba - jit
import timeit
from numba import jit

st1 = timeit.default_timer()

@jit(nopython=True, cache=True)
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result

f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2 - st1))
RUN TIME : 0.57 s
CONCLUSION :
As you mentioned above, changing the simple append form boosts the speed a bit, and using Cython is much faster than plain Python. However, it turned out that Numba is the best choice in terms of speed improvement for 'for+append'!
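For comparison (not timed in the original answer), the plain list comprehension avoids the explicit append entirely and, in CPython, is usually at least as fast as the list(map(lambda ...)) version:
import timeit

# Same workload: square the numbers 0..9999999 and collect them into a new list.
t = timeit.timeit('[i ** 2 for i in range(10000000)]', number=1)
print("RUN TIME : {0}".format(t))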
Try using deque from the collections package to append large amounts of row data without degrading performance, and then convert the deque back to a DataFrame using a list comprehension.
Sample Case:
import pandas as pd
from collections import deque

d = deque()
for row in rows:
    d.append([value_x, value_y])

df = pd.DataFrame({'column_x': [item[0] for item in d],
                   'column_y': [item[1] for item in d]})
This is a real time-saver.
If performance were an issue, I would avoid using append. Instead, preallocate an array and then fill it up. I would also avoid using index to find position within the list "circles". Here's a rewrite. It's not compact, but I'll bet it's fast because of the unrolled loop.
def makeList():
    """gc.disable()"""
    mylist = 6 * len(circles) * [None]
    for i in range(len(circles)):
        j = 6 * i
        mylist[j] = circles[i].m[0] - circles[i].r
        mylist[j+1] = 0
        mylist[j+2] = i
        mylist[j+3] = circles[i].m[0] + circles[i].r
        mylist[j+4] = 1
        mylist[j+5] = i
    return mylist

Cost of list functions in Python

Based on this older thread, it looks like the cost of list functions in Python is:
Random access: O(1)
Insertion/deletion to front: O(n)
Insertion/deletion to back: O(1)
Can anyone confirm whether this is still true in Python 2.6/3.x?
Take a look here. It's a PEP for a different kind of list. The version specified is 2.6/3.0.
Append (insertion to back) is O(1), while insertion (everywhere else) is O(n). So yes, it looks like this is still true.
Operation...Complexity
Copy........O(n)
Append......O(1)
Insert......O(n)
Get Item....O(1)
Set Item....O(1)
Del Item....O(n)
Iteration...O(n)
Get Slice...O(k)
Del Slice...O(n)
Set Slice...O(n+k)
Extend......O(k)
Sort........O(n log n)
Multiply....O(nk)
Python 3 is mostly an evolutionary change; there are no big changes in the data structures and their complexities.
The canonical source for Python collections is TimeComplexity on the Wiki.
That's correct; inserting at the front forces a move of all the elements to make room.
collections.deque offers similar functionality, but is optimized for insertion on both sides.
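A minimal illustration of deque's cheap appends on both ends (a sketch, not from the original answer):
from collections import deque

d = deque([2, 3, 4])
d.appendleft(1)  # O(1), unlike list.insert(0, 1), which is O(n)
d.append(5)      # O(1), same as list.append
print(d)         # deque([1, 2, 3, 4, 5])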
I know this post is old, but I recently did a little test myself. The complexity of list.insert() appears to be O(n)
Code:
'''
Independent Study, Timing insertion list method in python
'''
import time

def make_list_of_size(n):
    ret_list = []
    for i in range(n):
        ret_list.append(n)
    return ret_list

# Estimate overhead of timing loop
def get_overhead(niters):
    '''
    Returns the time it takes to iterate a for loop niter times
    '''
    tic = time.time()
    for i in range(niters):
        pass  # Just blindly iterate, niter times
    toc = time.time()
    overhead = toc - tic
    return overhead

def tictoc_midpoint_insertion(list_size_initial, list_size_final, niters, file):
    overhead = get_overhead(niters)
    list_size = list_size_initial
    # insertion_pt = list_size//2   # <---- INSERTION POINT ASSIGNMENT LOCATION 1
    # insertion_pt = 0              # <---- INSERTION POINT ASSIGNMENT LOCATION 4 (insert at beginning)
    delta = 100
    while list_size <= list_size_final:
        # insertion_pt = list_size//2   # <---- INSERTION POINT ASSIGNMENT LOCATION 2
        x = make_list_of_size(list_size)
        tic = time.time()
        for i in range(niters):
            insertion_pt = len(x)//2    # <---- INSERTION POINT ASSIGNMENT LOCATION 3
            # insertion_pt = len(x)     # <---- INSERTION POINT ASSIGNMENT LOCATION 5 (insert at true end)
            x.insert(insertion_pt, 0)
        toc = time.time()
        cost_per_iter = (toc - tic) / niters  # overall time cost per iteration
        cost_per_iter_no_overhead = (toc - tic - overhead) / niters  # time/iteration without the overhead of pure iteration
        print("list size = {:d}, cost without overhead = {:f} sec/iter".format(list_size, cost_per_iter_no_overhead))
        file.write(str(list_size) + ',' + str(cost_per_iter_no_overhead) + '\n')
        if list_size >= 10 * delta:
            delta *= 10
        list_size += delta

def main():
    fname = input()
    file = open(fname, 'w')
    niters = 10000
    tictoc_midpoint_insertion(100, 10000000, niters, file)
    file.close()

main()
See the 5 positions where insertion can be done. The cost is of course a function of how large the list is, and of how close you are to the beginning of the list (i.e. how many memory locations have to be re-organized).
FWIW, there is a faster (for some operations; insert is O(log n)) list implementation called blist if you need it.
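A minimal sketch of blist usage (assuming the third-party blist package is installed and builds on your Python version; its API mirrors the built-in list):
from blist import blist

b = blist(range(1000000))
b.insert(500000, 'x')          # O(log n) for blist, O(n) for a built-in list
print(list(b[499999:500002]))  # [499999, 'x', 500000]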

Finding closest match in collection of strings representing numbers

I have a sorted list of datetimes in text format. The format of each entry is '2009-09-10T12:00:00'.
I want to find the entry closest to a target. There are many more entries than the number of searches I would have to do.
I could change each entry to a number and then search numerically (for example with these approaches), but that seems like excessive effort.
Is there a better way than this:
def mid(res, target):
    # res is a list of entries, sorted by dt (dateTtime)
    # each entry is a dict with a dt and some other info
    n = len(res)
    low = 0
    high = n - 1
    # find the first res greater than target
    while low < high:
        mid = (low + high) / 2
        t = res[int(mid)]['dt']
        if t < target:
            low = mid + 1
        else:
            high = mid
    # check if the prior value is closer
    i = max(0, int(low) - 1)
    a = dttosecs(res[i]['dt'])
    b = dttosecs(res[int(low)]['dt'])
    t = dttosecs(target)
    if abs(a - t) < abs(b - t):
        return int(low - 1)
    else:
        return int(low)
import time

def dttosecs(dt):
    # string to seconds since the beginning
    date, tim = dt.split('T')
    y, m, d = date.split('-')
    h, mn, s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59, int(float(s) + 0.5))  # round to nearest second
    s = int(s)
    secs = time.mktime((y, m, d, h, mn, s, 0, 0, -1))
    return secs
"Copy and paste coding" (getting bisect's sources into your code) is not recommended as it carries all sorts of costs down the road (lot of extra source code for you to test and maintain, difficulties dealing with upgrades in the upstream code you've copied, etc, etc); the best way to reuse standard library modules is simply to import them and use them.
However, doing one pass to transform the dictionaries into meaningfully comparable entries is O(N), which (even though each step of the pass is simple) will eventually swamp the O(log N) time of the search proper. Since bisect can't support a key= extractor the way sort does, what's the solution to this dilemma -- how can you reuse bisect by import and call, without a preliminary O(N) step?
As quoted here, the solution is in David Wheeler's famous saying, "All problems in computer science can be solved by another level of indirection". Consider e.g....:
import bisect

listofdicts = [
    {'dt': '2009-%2.2d-%2.2dT12:00:00' % (m, d)}
    for m in range(4, 9) for d in range(1, 30)
]

class Indexer(object):
    def __init__(self, lod, key):
        self.lod = lod
        self.key = key
    def __len__(self):
        return len(self.lod)
    def __getitem__(self, idx):
        return self.lod[idx][self.key]

lookfor = listofdicts[len(listofdicts)//2]['dt']

def mid(res=listofdicts, target=lookfor):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]

def midi(res=listofdicts, target=lookfor):
    wrap = Indexer(res, 'dt')
    return res[bisect.bisect_left(wrap, target)]

if __name__ == '__main__':
    print '%d dicts on the list' % len(listofdicts)
    print 'Looking for', lookfor
    print mid(), midi()
    assert mid() == midi()
The output (just running this indexer.py as a check, then with timeit, two ways):
$ python indexer.py
145 dicts on the list
Looking for 2009-06-15T12:00:00
{'dt': '2009-06-15T12:00:00'} {'dt': '2009-06-15T12:00:00'}
$ python -mtimeit -s'import indexer' 'indexer.mid()'
10000 loops, best of 3: 27.2 usec per loop
$ python -mtimeit -s'import indexer' 'indexer.midi()'
100000 loops, best of 3: 9.43 usec per loop
As you can see, even in a modest task with 145 entries in the list, the indirection approach can perform three times better than the "key-extraction pass" approach. Since we're comparing O(N) vs O(log N), the advantage of the indirection approach grows without bound as N increases. (For very small N, the higher multiplicative constants due to the indirection make the key-extraction approach faster, but this is soon surpassed by the big-O difference.) Admittedly, the Indexer class is extra code -- however, it's reusable over ALL tasks of binary searching a list of dicts sorted by one entry in each dict, so having it in your "container utilities" bag of tricks offers a good return on that investment.
So much for the main search loop. For the secondary task of converting two entries (the one just below and the one just above the target) and the target to a number of seconds, consider, again, a higher-reuse approach, namely:
import time

adt = '2009-09-10T12:00:00'

def dttosecs(dt=adt):
    # string to seconds since the beginning
    date, tim = dt.split('T')
    y, m, d = date.split('-')
    h, mn, s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59, int(float(s) + 0.5))  # round to nearest second
    s = int(s)
    secs = time.mktime((y, m, d, h, mn, s, 0, 0, -1))
    return secs

def simpler(dt=adt):
    return time.mktime(time.strptime(dt, '%Y-%m-%dT%H:%M:%S'))

if __name__ == '__main__':
    print adt, dttosecs(), simpler()
    assert dttosecs() == simpler()
Here, there is no performance advantage to the reuse approach (indeed, on the contrary, dttosecs is faster) -- but then, you only need to perform three conversions per search, no matter how many entries are on your list of dicts, so it's not clear whether that performance issue is germane. Meanwhile, with simpler you only have to write, test and maintain one simple line of code, while dttosecs is a dozen lines; given this ratio, in most situations (i.e., excluding absolute bottlenecks), I would prefer simpler. The important thing is to be aware of both approaches and of the tradeoffs between them, so as to ensure the choice is made wisely.
You want the bisect module from the standard library. It will do a binary search and tell you the correct insertion point for a new value into an already sorted list. Here's an example that will print the place in the list where target would be inserted:
from bisect import bisect
dates = ['2009-09-10T12:00:00', '2009-09-11T12:32:00', '2009-09-11T12:43:00']
target = '2009-09-11T12:40:00'
print bisect(dates, target)
From there you can just compare to the thing before and after your insertion point, which in this case would be dates[i-1] and dates[i] to see which one is closest to your target.
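A small sketch of that neighbour comparison (a hypothetical closest() helper, not from the answer; it parses the candidates with datetime so the distance is measured in real time rather than as strings):
from bisect import bisect
from datetime import datetime

def closest(dates, target, fmt='%Y-%m-%dT%H:%M:%S'):
    i = bisect(dates, target)
    candidates = dates[max(0, i - 1):i + 1]  # the neighbours just before and after the insertion point
    t = datetime.strptime(target, fmt)
    return min(candidates, key=lambda d: abs(datetime.strptime(d, fmt) - t))

dates = ['2009-09-10T12:00:00', '2009-09-11T12:32:00', '2009-09-11T12:43:00']
print(closest(dates, '2009-09-11T12:40:00'))  # '2009-09-11T12:43:00'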
import bisect

def mid(res, target):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]
First, change to this.
import datetime

def parse_dt(dt):
    return datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%S")
This removes much of the "effort".
Consider this as the search.
from bisect import bisect

def mid(res, target):
    """res is a list of entries, sorted by dt (dateTtime)
    each entry is a dict with a dt and some other info
    """
    times = [parse_dt(r['dt']) for r in res]
    index = bisect(times, parse_dt(target))
    return times[index]
This doesn't seem like very much "effort". And because the values are compared as datetime objects, it does not depend on your timestamp strings sorting correctly as text; you can change to any timestamp format and be assured that this will still work.
