I have a list of one million random integers, with repeats. I need to sort that list and then find the index of the first occurrence of every unique element. When I do this, the run time is over 5 minutes. Can anyone suggest how to speed up my code? An example of my process is shown below.
import random
a = []
for x in range(1000000):
    a.append(random.randint(1,10000))
unique_a = set(a)
inds=[0]
inds = [a.index(i) for i in sorted(unique_a) if i not in inds]
inds = [a.index(i) for i in sorted(unique_a) if i not in inds] is implicitly quadratic, because a.index(i) is linear. Use a dictionary to grab the indices in one pass over the sorted list:
a =sorted([0,4,3,5,21,5,6,3,1,23,4,6,1,93,34,10])
unique_a = set(a)
first_inds = {}
for i,x in enumerate(a):
    if x not in first_inds:
        first_inds[x] = i
my_inds = [first_inds[x] for x in sorted(unique_a)]
Just store the first position for every unique element:
first_position = {}
for i, value in enumerate(a):
    if value not in first_position:
        first_position[value] = i
And then replace a.index(i) with first_position[i].
Or just use:
_, indices = zip(*sorted(first_position.items()))
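As a small end-to-end check of the idea above, the sketch below builds the first-position dictionary for a short sorted list and then pulls out the indices in value order. The sample data is only illustrative and is not the asker's million-item list.

```python
# Sample data (illustrative only); the list must already be sorted.
a = sorted([0, 4, 3, 5, 21, 5, 6, 3, 1, 23, 4, 6, 1, 93, 34, 10])

first_position = {}
for i, value in enumerate(a):
    # Record only the first index at which each value appears.
    if value not in first_position:
        first_position[value] = i

# Sort the (value, index) pairs by value, then keep just the indices.
values, indices = zip(*sorted(first_position.items()))
print(indices)
```

Each value is looked up in the dictionary once, so the list is walked a single time instead of once per unique value.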
You can use the bisect_left function from the standard library's bisect module to do this. On a sorted list, a bisection search is faster than searching through the list as index does.
>>> import bisect, random, timeit
>>> L = [random.randint(0, 10) for _ in range(100)]
>>> L.sort()
>>> L.index(9)
83
>>> bisect.bisect_left(L, 9)
83
>>> timeit.timeit(setup="from __main__ import L", stmt="L.index(9)")
2.1408978551626205
>>> timeit.timeit(setup="from __main__ import L;from bisect import bisect_left", stmt="bisect_left(L, 9)")
0.5187544231303036
On my machine, using bisect.bisect_left is faster than iterating over the list and accumulating indexes on the way:
>>> L = [random.randint(0, 100) for _ in range(10000)]
>>> L.sort()
>>> def iterative_approach(list_):
...     unique = set(list_)
...     first_inds = {}
...     for i, x in enumerate(list_):
...         if x not in first_inds:
...             first_inds[x] = i
...     return [first_inds[x] for x in sorted(unique)]
...
>>> ia = iterative_approach(L)
>>> bisect_left = bisect.bisect_left
>>> def bisect_approach(list_):
...     unique = set(list_)
...     out = {}
...     for x in unique:
...         out[x] = bisect_left(list_, x)
...     return [out[x] for x in sorted(unique)]
...
>>> ba = bisect_approach(L)
>>> ia == ba
True
>>> timeit.timeit(setup="from __main__ import L, iterative_approach", stmt="iterative_approach(L)")
1488.956467495067
>>> timeit.timeit(setup="from __main__ import L, bisect_approach", stmt="bisect_approach(L)")
407.6803469741717
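One possible variation on the bisect approach, not taken from the answers above: bisect_left accepts an optional lo argument, so when the unique values are visited in ascending order each search can start where the previous one ended. The function name first_indices_sorted is made up for this sketch, and no timing claim is implied.

```python
from bisect import bisect_left

def first_indices_sorted(sorted_list):
    """Return the first index of each unique value, in ascending value order.

    Assumes sorted_list is already sorted in ascending order.
    """
    indices = []
    lo = 0
    for value in sorted(set(sorted_list)):
        # The next value cannot start before the previous value's first index,
        # so pass that index as the lower bound of the search.
        lo = bisect_left(sorted_list, value, lo)
        indices.append(lo)
    return indices

data = sorted([0, 4, 3, 5, 21, 5, 6, 3, 1, 23, 4, 6, 1, 93, 34, 10])
print(first_indices_sorted(data))
```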
I have a sorted list of numbers like:
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
I need to find the last (1-based) position of each distinct value that is divisible by 100.
Output should be like: 4,10,15
My Code:
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
idx = 1
for i in (a):
    if i%100 == 0:
        print idx
    idx = idx+1
Output of above code:
4
9
10
13
14
15
In case people are curious, I benchmarked the dict comprehension technique against the backward iteration technique. The dict comprehension is a bit under twice as fast. Switching to OrderedDict resulted in a massive slowdown: about 13x slower than the dict comprehension in the timings below.
from collections import OrderedDict  # must be visible to test3 in __main__
def test1():
    a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
    max_index = {}
    # Backward iteration: the first time an item is seen is its last position.
    for i, item in enumerate(a[::-1]):
        if item not in max_index:
            max_index[item] = len(a) - (i + 1)
    return max_index
def test2():
    a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
    # Dict comprehension: later positions overwrite earlier ones.
    return {item: index for index, item in enumerate(a, 1)}
def test3():
    a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
    return OrderedDict((item, index) for index, item in enumerate(a, 1))
if __name__ == "__main__":
    import timeit
    print(timeit.timeit("test1()", setup="from __main__ import test1"))
    print(timeit.timeit("test2()", setup="from __main__ import test2"))
    print(timeit.timeit("test3()", setup="from __main__ import test3"))
3.40622282028
1.97545695305
26.347012043
Use a simple dict comprehension or an OrderedDict with the divisible items as keys; earlier values are automatically replaced by later ones.
>>> {item: index for index, item in enumerate(lst, 1) if not item % 100}.values()
dict_values([4, 10, 15])
# if order matters
>>> from collections import OrderedDict
>>> OrderedDict((item, index) for index, item in enumerate(lst, 1) if not item % 100).values()
odict_values([4, 10, 15])
Another way is to loop over the reversed list and use a set to keep track of the items seen so far (lst[::-1] may be slightly faster than reversed(lst) for tiny lists).
>>> seen = set()
>>> [len(lst) - index for index, item in enumerate(reversed(lst))
...     if not item % 100 and item not in seen and not seen.add(item)][::-1]
[4, 10, 15]
You can see the sort-of equivalent code of the above here.
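The linked code is not reproduced here. As a rough stand-in, and only a sketch of what an equivalent plain loop might look like, the comprehension above can be unrolled as follows (lst is assumed to be the list from the question):

```python
lst = [77, 98, 99, 100, 101, 102, 198, 199, 200, 200, 278, 299, 300, 300, 300]

seen = set()
result = []
# Walk the list from the end, so the first time a value is met
# corresponds to its last position in the original list.
for index, item in enumerate(reversed(lst)):
    if item % 100 == 0 and item not in seen:
        seen.add(item)
        # Convert the reversed 0-based index into a 1-based position.
        result.append(len(lst) - index)

result.reverse()  # restore ascending order
print(result)  # [4, 10, 15]
```

The explicit seen.add(item) call replaces the "and not seen.add(item)" trick, which only works because set.add returns None.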
You could use itertools.groupby since your data is sorted:
>>> a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
>>> from itertools import groupby
>>> [list(g)[-1][0] for k,g in groupby(enumerate(a), lambda t: (t[1] % 100, t[1])) if k[0] == 0]
[3, 9, 14]
Although this is a little cryptic.
Here's a complicated approach using only a list-iterator and accumulating into a list:
>>> run, prev, idx = False, None, []
>>> for i, e in enumerate(a):
...     if not (e % 100 == 0):
...         if not run:
...             prev = e
...             continue
...         idx.append(i - 1)
...         run = False
...     else:
...         if prev != e and run:
...             idx.append(i - 1)
...         run = True
...     prev = e
...
>>> if run:
...     idx.append(i)
...
>>> idx
[3, 9, 14]
I think this is best dealt with using a dictionary approach like @AshwiniChaudhary's. It is more straightforward, and much faster:
>>> timeit.timeit("{item: index for index, item in enumerate(a, 1)}", "from __main__ import a")
1.842843743012054
>>> timeit.timeit("[list(g)[-1][0] for k,g in groupby(enumerate(a), lambda t: (t[1] % 100, t[1])) if k[0] == 0]", "from __main__ import a, groupby")
8.479677081981208
The groupby approach is pretty slow. Note that the complicated approach is faster, and not far off from the dict-comprehension approach:
>>> def complicated(a):
...     run, prev, idx = False, None, []
...     for i, e in enumerate(a):
...         if not (e % 100 == 0):
...             if not run:
...                 prev = e
...                 continue
...             idx.append(i - 1)
...             run = False
...         else:
...             if prev != e and run:
...                 idx.append(i - 1)
...             run = True
...         prev = e
...     if run:
...         idx.append(i)
...     return idx
...
>>> timeit.timeit("complicated(a)", "from __main__ import a, complicated")
2.6667005629860796
Edit: note that the performance difference narrows if we call list on the dict comprehension's .values():
>>> timeit.timeit("list({item: index for index, item in enumerate(a, 1)}.values())", "from __main__ import a")
2.3839886570058297
>>> timeit.timeit("complicated(a)", "from __main__ import a, complicated")
2.708565960987471
It seemed like a good idea at the start, but it got a bit twisty and I had to patch a couple of cases...
a = [0,77,98,99,100,101,102,198,199,200,200,278,299,300,300,300, 459, 700,700]
bz = [*zip(*((i, d//100) for i, d in enumerate(a) if d%100 == 0 and d != 0))]
[a for a, b, c in zip(*bz, bz[1][1:]) if c-b != 0] + [bz[0][-1]]
Out[78]: [4, 10, 15, 18]
enumerate and zip create bz, which pairs each index with its value's quotient by 100:
bz = [*zip(*((i, d//100) for i, d in enumerate(a) if d%100 == 0 and d != 0))]
print(*bz, sep='\n')
(4, 9, 10, 13, 14, 15, 17, 18)
(1, 2, 2, 3, 3, 3, 7, 7)
Then zip again: zip(*bz, bz[1][1:]) lags the quotient tuple by one, so the lagged difference gives the selection logic if c-b != 0 for the last index of each run except the final one.
Finally, add the last match, because it is always the end of the last run: + [bz[0][-1]]
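The same lag-and-compare idea can be spelled out more explicitly. The version below is only a sketch: the function name last_index_of_each_run and the step parameter are invented for illustration, and it compares neighbouring values directly instead of their quotients.

```python
def last_index_of_each_run(values, step=100):
    """Return the index of the last element in each run of equal multiples of step.

    Zero is skipped, matching the `d != 0` filter in the one-liner above.
    """
    # Keep (index, value) pairs for non-zero multiples of step only.
    hits = [(i, v) for i, v in enumerate(values) if v % step == 0 and v != 0]
    result = []
    # Pair every hit with the one that follows it; a change in value
    # means the current hit closes its run.
    for (i, v), (_, next_v) in zip(hits, hits[1:]):
        if v != next_v:
            result.append(i)
    # The final hit always closes the last run.
    if hits:
        result.append(hits[-1][0])
    return result

a = [0, 77, 98, 99, 100, 101, 102, 198, 199, 200, 200, 278, 299,
     300, 300, 300, 459, 700, 700]
print(last_index_of_each_run(a))  # [4, 10, 15, 18]
```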
It seems that, for some reason, a dict cannot recognize duplicate keys when the keys are bitarray() objects.
ex.:
data = {}
for _ in xrange(10):
    ba = ...generate repeatable bitarrays ...
    data[ba] = 1
print ba
{bitarray('11011'): 1, bitarray('11011'): 1, bitarray('11011'): 1, bitarray('01111'): 1, bitarray('11110'): 1, bitarray('11110'): 1, bitarray('01111'): 1, bitarray('01111'): 1, bitarray('11110'): 1, bitarray('11110'): 1}
You can clearly see that duplicates are stored as different keys (e.g. the first two elements), which is weird. What could be the reason?
My goal is simply to count the number of times a bit pattern shows up, and of course dicts are perfect for this, but it seems that bitarray() is for some reason opaque to the hashing algorithm.
By the way, I have to use bitarray() because my patterns are 10000+ bits long.
Any other ideas for an efficient way of counting occurrences of bit patterns?
This answer addresses your first confusion regarding duplicate dictionary keys. I assume you're referring to bitarray() from the bitarray module; I've not used this module myself.
In your example above, you're not actually getting duplicate dictionary keys. They look like duplicates, but only to the naked eye. For instance:
>>> class X:
...     def __repr__(self):
...         return '"X obj"'
...
>>> x1 = X()
>>> x2 = X()
>>> d = {x1:1, x2:2}
>>> d
{"X obj": 2, "X obj": 1}
But x1 isn't equal to x2, and hence they're not duplicates; they're distinct objects of class X:
>>> x1 == x2
False
>>> #same as
... id(x1) == id(x2)
False
>>> #same as
... x1 is x2
False
Moreover, because class X defines __repr__, which returns the string representation of its objects, you might think dictionary d has duplicate keys. It does not, and the keys are not of type str either: the key for value 1 is one X object and the key for value 2 is another X object. They are literally two different objects that share a single string representation returned by their class's __repr__ method:
>>> # keys are instance of X not strings
... d
{"X obj": 2, "X obj": 1}
>>> d["X obj"]
KeyError: 'X obj'
>>> d[x1]
1
>>> d[x2]
2
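To make the point concrete, here is a small sketch of when equal-looking objects do collapse into one dictionary key. Both classes are hypothetical stand-ins written for this example; neither is part of the bitarray API.

```python
class Bits:
    """Value-like key: equality and hash are both based on the pattern."""
    def __init__(self, pattern):
        self.pattern = pattern
    def __repr__(self):
        return "Bits('{}')".format(self.pattern)
    def __eq__(self, other):
        return isinstance(other, Bits) and self.pattern == other.pattern
    def __hash__(self):
        # Equal objects must return equal hashes, so hash the same data
        # that __eq__ compares.
        return hash(self.pattern)

class Opaque:
    """Keeps the default identity-based __eq__ and __hash__, like X above."""
    def __init__(self, pattern):
        self.pattern = pattern

merged = {Bits('11011'): 1, Bits('11011'): 2}
separate = {Opaque('11011'): 1, Opaque('11011'): 2}
print(len(merged), len(separate))  # 1 2
```

A dict treats two keys as the same only if their hashes match and they compare equal, which is why the Opaque objects stay separate.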
Up to bitarray 0.8.1 (and possibly later), I believe bitarray objects do not satisfy the hash invariant, i.e. equal objects are not guaranteed to hash equally.
To work around it, convert the bit array to bytes as follows.
>>> from bitarray import bitarray
>>> l = [bitarray('11111'), bitarray('11111'), bitarray('11010'), bitarray('11110'), bitarray('11111'), bitarray('11010')]
>>> ht = {}
>>> for x in l: ht[x.tobytes()] = 0
...
>>> for x in l: ht[x.tobytes()] += 1
...
>>> ht
{'\xf8': 3, '\xf0': 1, '\xd0': 2}
Remember that you can get the bitarray back from the byte format with frombytes(byte). In that case, though, you will have to keep track of the bitarray's length explicitly, because the result's size will be a multiple of 8.
If you want to keep the bitarray in the dictionary also:
>>> from bitarray import bitarray
>>> l = [bitarray('11111'), bitarray('11111'), bitarray('11010'), bitarray('11110'), bitarray('11111'), bitarray('11010')]
>>> ht = {}
>>> for x in l: ht[x.tobytes()] = (0, x)
...
>>> for x in l:
...     old_count = ht[x.tobytes()][0]
...     ht[x.tobytes()] = (old_count+1, x)
...
>>> ht
{'\xf8': (3, bitarray('11111')), '\xf0': (1, bitarray('11110')), '\xd0': (2, bitarray('11010'))}
>>> for x,y in ht.iteritems(): print(y)
...
(3, bitarray('11111'))
(1, bitarray('11110'))
(2, bitarray('11010'))
I solved it :
desc = bitarray(res).to01()
if desc in data : data[desc] += 1
else : data[desc] = 1
gosh I miss perl no-nonsense autovivification :)
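For what it's worth, the standard library gets fairly close to autovivification for counting. The sketch below uses collections.Counter and collections.defaultdict. The patterns list is made-up sample data in the '0'/'1' string form that to01() produces, so the example runs without bitarray installed.

```python
from collections import Counter, defaultdict

# Hypothetical bit patterns as '0'/'1' strings; real data would come
# from bitarray(...).to01() as in the snippet above.
patterns = ['11011', '01111', '11110', '11011', '11110', '11110']

# Counter tallies hashable keys without any explicit initialisation.
counts = Counter(patterns)

# defaultdict(int) gives the same "missing key starts at 0" behaviour.
tally = defaultdict(int)
for p in patterns:
    tally[p] += 1

print(counts.most_common())
```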
I wonder if there is a more Pythonic way to group and order a list by the order of another list.
lstNeedOrder has a number of pairs in random order. I want the output to follow the order given in lst: all pairs for a first, followed by those for b and c.
Items in lstNeedOrder only ever have a format like a/c or c/a.
input:
lstNeedOrder = ['a/b','c/b','f/d','a/e','c/d','a/c']
lst = ['a','b','c']
output:
res = ['a/b','a/c','a/e','c/b','c/d','f/d']
update
lst = ['a','b','c'] is not the actual data; it just makes the logic easy to understand. The actual data are more complex string pairs.
Using sorted with a custom key function:
>>> lstNeedOrder = ['a/b','c/d','f/d','a/e','c/d','a/c']
>>> lst = ['a','b','c']
>>> order = {ch: i for i, ch in enumerate(lst)} # {'a': 0, 'b': 1, 'c': 2}
>>> def sort_key(x):
...     # 'a/b' -> (0, 1), 'c/d' -> (2, 3), ...
...     a, b = x.split('/')
...     return order.get(a, len(lst)), order.get(b, len(lst))
...
>>> sorted(lstNeedOrder, key=sort_key)
['a/b', 'a/c', 'a/e', 'c/d', 'c/d', 'f/d']
I know you can do
print str(myList)
to get
[1, 2, 3]
and you can do
i = 0
for entry in myList:
    print str(i) + ":", entry
    i += 1
to get
0: 1
1: 2
2: 3
But is there a way similar to the first to get a result similar to the last?
With my limited knowledge of Python (and some help from the documentation), my best is:
print '\n'.join([str(n) + ": " + str(entry) for (n, entry) in zip(range(0,len(myList)), myList)])
It's not much less verbose, but at least I get a custom string in one (compound) statement.
Can you do better?
>>> lst = [1, 2, 3]
>>> print('\n'.join('{}: {}'.format(*k) for k in enumerate(lst)))
0: 1
1: 2
2: 3
Note: you just need to understand that list comprehension or iterating over a generator expression is explicit looping.
With Python 3's print function:
lst = [1, 2, 3]
print('My list:', *lst, sep='\n- ')
Output:
My list:
- 1
- 2
- 3
Con: The sep must be a string, so you can't modify it based on which element you're printing. And you need a kind of header to do this (above it was 'My list:').
Pro: You don't have to join() a list into a string object, which might be advantageous for larger lists. And the whole thing is quite concise and readable.
l = [1, 2, 3]
print '\n'.join(['%i: %s' % (n, l[n]) for n in xrange(len(l))])
Starting from this:
>>> lst = [1, 2, 3]
>>> print('\n'.join('{}: {}'.format(*k) for k in enumerate(lst)))
0: 1
1: 2
2: 3
You can get rid of the join by passing \n as a separator to print
>>> print(*('{}: {}'.format(*k) for k in enumerate(lst)), sep="\n")
0: 1
1: 2
2: 3
Now you see you could use map, but you'll need to change the format string (yuck!)
>>> print(*(map('{0[0]}: {0[1]}'.format, enumerate(lst))), sep="\n")
0: 1
1: 2
2: 3
Or pass two sequences to map: a separate counter, and lst itself rather than enumerate(lst).
>>> from itertools import count
>>> print(*(map('{}: {}'.format, count(), lst)), sep="\n")
0: 1
1: 2
2: 3
>>> from itertools import starmap
>>> lst = [1, 2, 3]
>>> print('\n'.join(starmap('{}: {}'.format, enumerate(lst))))
0: 1
1: 2
2: 3
This uses itertools.starmap, which is like map, except that it unpacks (*) each argument tuple into the function. The function in this case is '{}: {}'.format.
I would prefer the comprehension of SilentGhost, but starmap is a nice function to know about.
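A short side-by-side sketch may help show what starmap does compared with a generator expression and with plain map. The variable names are chosen for this example only.

```python
from itertools import starmap

lst = [1, 2, 3]
fmt = '{}: {}'.format

# starmap(f, pairs) calls f(*pair) for every pair ...
via_starmap = list(starmap(fmt, enumerate(lst)))
# ... which is what this comprehension spells out by hand.
via_comprehension = [fmt(*pair) for pair in enumerate(lst)]
# Plain map instead takes the arguments as separate parallel iterables.
via_map = list(map(fmt, range(len(lst)), lst))

print(via_starmap)  # ['0: 1', '1: 2', '2: 3']
```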
Another:
>>> lst=[10,11,12]
>>> fmt="%i: %i"
>>> for d in enumerate(lst):
...     print(fmt%d)
...
0: 10
1: 11
2: 12
Yet another form:
>>> for i,j in enumerate(lst): print "%i: %i"%(i,j)
That method is nice since the individual elements in tuples produced by enumerate can be modified such as:
>>> for i,j in enumerate([3,4,5],1): print "%i^%i: %i "%(i,j,i**j)
...
1^3: 1
2^4: 16
3^5: 243
Of course, don't forget you can get a slice from this like so:
>>> for i,j in list(enumerate(lst))[1:2]: print "%i: %i"%(i,j)
...
1: 11
from time import clock
from random import sample
n = 500
myList = sample(xrange(10000),n)
#print myList
A,B,C,D = [],[],[],[]
for i in xrange(100):
    t0 = clock()
    ecr =( '\n'.join('{}: {}'.format(*k) for k in enumerate(myList)) )
    A.append(clock()-t0)
    t0 = clock()
    ecr = '\n'.join(str(n) + ": " + str(entry) for (n, entry) in zip(range(0,len(myList)), myList))
    B.append(clock()-t0)
    t0 = clock()
    ecr = '\n'.join(map(lambda x: '%s: %s' % x, enumerate(myList)))
    C.append(clock()-t0)
    t0 = clock()
    ecr = '\n'.join('%s: %s' % x for x in enumerate(myList))
    D.append(clock()-t0)
print '\n'.join(('t1 = '+str(min(A))+' '+'{:.1%}.'.format(min(A)/min(D)),
                 't2 = '+str(min(B))+' '+'{:.1%}.'.format(min(B)/min(D)),
                 't3 = '+str(min(C))+' '+'{:.1%}.'.format(min(C)/min(D)),
                 't4 = '+str(min(D))+' '+'{:.1%}.'.format(min(D)/min(D))))
For n=500:
150.8%.
142.7%.
110.8%.
100.0%.
For n=5000:
153.5%.
176.2%.
109.7%.
100.0%.
Oh, I see now: only the solution 3 with map() fits with the title of the question.
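The benchmark above is Python 2 code and relies on time.clock, which was removed in Python 3.8. For anyone who wants to repeat the comparison on Python 3, a rough equivalent using timeit.repeat might look like the sketch below. The labels, the repeat counts, and the dictionary layout are choices made for this example, and the numbers will differ by machine and Python version.

```python
import timeit
from random import sample

n = 500
my_list = sample(range(10000), n)

# The four candidates from the benchmark above, wrapped as callables.
candidates = {
    'format + genexp': lambda: '\n'.join(
        '{}: {}'.format(*k) for k in enumerate(my_list)),
    'str concat + zip': lambda: '\n'.join(
        str(i) + ': ' + str(e) for i, e in zip(range(len(my_list)), my_list)),
    'map + %': lambda: '\n'.join(
        map(lambda x: '%s: %s' % x, enumerate(my_list))),
    '% + genexp': lambda: '\n'.join(
        '%s: %s' % x for x in enumerate(my_list)),
}

# All four should build the same string before timing means anything.
outputs = {name: fn() for name, fn in candidates.items()}

# Best of 5 repeats, 100 calls per repeat.
timings = {name: min(timeit.repeat(fn, number=100, repeat=5))
           for name, fn in candidates.items()}
for name, seconds in timings.items():
    print('{:<18} {:.6f}s'.format(name, seconds))
```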
Take a look at pprint. The pprint module provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter. If the formatted structures include objects which are not fundamental Python types, the representation may not be loadable. This may be the case if objects such as files, sockets or classes are included, as well as many other objects which are not representable as Python literals.
>>> import pprint
>>> stuff = ['spam', 'eggs', 'lumberjack', 'knights', 'ni']
>>> stuff.insert(0, stuff[:])
>>> pp = pprint.PrettyPrinter(indent=4)
>>> pp.pprint(stuff)
[   ['spam', 'eggs', 'lumberjack', 'knights', 'ni'],
    'spam',
    'eggs',
    'lumberjack',
    'knights',
    'ni']
>>> pp = pprint.PrettyPrinter(width=41, compact=True)
>>> pp.pprint(stuff)
[['spam', 'eggs', 'lumberjack',
  'knights', 'ni'],
 'spam', 'eggs', 'lumberjack', 'knights',
 'ni']
>>> tup = ('spam', ('eggs', ('lumberjack', ('knights', ('ni', ('dead',
... ('parrot', ('fresh fruit',))))))))
>>> pp = pprint.PrettyPrinter(depth=6)
>>> pp.pprint(tup)
('spam', ('eggs', ('lumberjack', ('knights', ('ni', ('dead', (...)))))))