Related
I have a list which is a million items long of random, repeatable integers. I need to sort that list, and then find the index of the first iteration of every unique element in the list. When I do this, I am running into run time >5 minutes long. Can anyone give me any suggestions to speed up my code? An example of my process is shown below.
import random
a = []
for x in range(1000000):
a.append(random.randint(1,10000))
unique_a = set(a)
inds=[0]
inds = [a.index(i) for i in sorted(unique_a) if i not in inds]
inds = [a.index(i) for i in sorted(unique_a) if i not in inds] is implicitly quadratic is a.index(i) is linear. Use a dictionary to grab the indices in one pass over the sorted list:
a =sorted([0,4,3,5,21,5,6,3,1,23,4,6,1,93,34,10])
unique_a = set(a)
first_inds = {}
for i,x in enumerate(a):
if not x in first_inds:
first_inds[x] = i
my_inds = [first_inds[x] for x in sorted(unique_a)]
Just store the first position for every unique element:
first_position = {}
for i, value in enumerate(a):
if value not in first_position:
first_position[value] = i
And then replace a.index(i) for first_position[i]
Or just use:
_, indices = zip(*sorted(first_position.items()))
You can use the bisect_left function from the standard library's bisect module to do this. On a sorted list, a bisection search is faster than searching through the list as index does.
>>> L = [random.randint(0, 10) for _ in range(100)]
>>> L.sort()
>>> L.index(9)
83
>>> bisect.bisect_left(L, 9)
83
>>> timeit.timeit(setup="from __main__ import L", stmt="L.index(9)")
2.1408978551626205
>>> timeit.timeit(setup="from __main__ import L;from bisect import bisect_left", stmt="bisect_left(L, 9)")
0.5187544231303036
On my machine, using bisect.bisect_left is faster than iterating over the list and accumulating indexes on the way:
>>> L = [random.randint(0, 100) for _ in range(10000)]
>>> L.sort()
>>> def iterative_approach(list_):
... unique = set(list_)
... first_inds = {}
... for i, x in enumerate(list_):
... if x not in first_inds:
... first_inds[x] = i
... return [first_inds[x] for x in sorted(unique)]
...
>>> ia = iterative_approach(L)
>>> bisect_left = bisect.bisect_left
>>> def bisect_approach(list_):
... unique = set(list_)
... out = {}
... for x in unique:
... out[x] = bisect_left(list_, x)
... return [out[x] for x in sorted(unique)]
...
>>> ba = bisect_approach(L)
>>> ia == ba
True
>>> timeit.timeit(setup="from __main__ import L, iterative_approach", stmt="iterative_approach(L)")
1488.956467495067
>>> timeit.timeit(setup="from __main__ import L, bisect_approach", stmt="bisect_approach(L)")
407.6803469741717
I have a sorted list of numbers like:
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
I need to find the max index of each values which is divisible by 100.
Output should be like: 4,10,15
My Code:
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
idx = 1
for i in (a):
if i%100 == 0:
print idx
idx = idx+1
Output of above code:
4
9
10
13
14
15
In case people are curious, I benchmarked the dict comprehension technique against the backward iteration technique. Dict comprehension is about twice the speed. Changing to OrderedDict resulted in MASSIVE slowdown. About 15x slower than the dict comprehension.
def test1():
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
max_index = {}
for i, item in enumerate(a[::-1]):
if item not in max_index:
max_index[item] = len(a) - (i + 1)
return max_index
def test2():
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
return {item: index for index, item in enumerate(a, 1)}
def test3():
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
OrderedDict((item, index) for index, item in enumerate(a, 1))
if __name__ == "__main__":
import timeit
print(timeit.timeit("test1()", setup="from __main__ import test1"))
print(timeit.timeit("test2()", setup="from __main__ import test2"))
print(timeit.timeit("test3()", setup="from __main__ import test3; from collections import OrderedDict"))
3.40622282028
1.97545695305
26.347012043
Use a simple dict-comprehension or OrderedDict with divisible items as the keys, old values will be replaced by newest values automatically.
>>> {item: index for index, item in enumerate(lst, 1) if not item % 100}.values()
dict_values([4, 10, 15])
# if order matters
>>> from collections import OrderedDict
>>> OrderedDict((item, index) for index, item in enumerate(lst, 1) if not item % 100).values()
odict_values([4, 10, 15])
Another way will be to loop over reversed list and use a set to keep track of items seen so far(lst[::-1] may be slightly faster than reversed(lst) for tiny lists).
>>> seen = set()
>>> [len(lst) - index for index, item in enumerate(reversed(lst))
if not item % 100 and item not in seen and not seen.add(item)][::-1]
[4, 10, 15]
You can see the sort-of equivalent code of the above here.
You could use itertools.groupby since your data is sorted:
>>> a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
>>> from itertools import groupby
>>> [list(g)[-1][0] for k,g in groupby(enumerate(a), lambda t: (t[1] % 100, t[1])) if k[0] == 0]
[3, 9, 14]
Although this is a little cryptic.
Here's a complicated approach using only a list-iterator and accumulating into a list:
>>> run, prev, idx = False, None, []
>>> for i, e in enumerate(a):
... if not (e % 100 == 0):
... if not run:
... prev = e
... continue
... idx.append(i - 1)
... run = False
... else:
... if prev != e and run:
... idx.append(i - 1)
... run = True
... prev = e
...
>>> if run:
... idx.append(i)
...
>>> idx
[3, 9, 14]
I think this is best dealt with a dictionary approach like #AshwiniChaudhary It is more straightforward, and much faster:
>>> timeit.timeit("{item: index for index, item in enumerate(a, 1)}", "from __main__ import a")
1.842843743012054
>>> timeit.timeit("[list(g)[-1][0] for k,g in groupby(enumerate(a), lambda t: (t[1] % 100, t[1])) if k[0] == 0]", "from __main__ import a, groupby")
8.479677081981208
The groupby approach is pretty slow, note, the complicated approach is faster, and not far-off form the dict-comprehension approach:
>>> def complicated(a):
... run, prev, idx = False, None, []
... for i, e in enumerate(a):
... if not (e % 100 == 0):
... if not run:
... prev = e
... continue
... idx.append(i - 1)
... run = False
... else:
... if prev != e and run:
... idx.append(i - 1)
... run = True
... prev = e
... if run:
... idx.append(i)
... return idx
...
>>> timeit.timeit("complicated(a)", "from __main__ import a, complicated")
2.6667005629860796
Edit Note, the performance difference narrows if we call list on the dict-comprehension .values():
>>> timeit.timeit("list({item: index for index, item in enumerate(a, 1)}.values())", "from __main__ import a")
2.3839886570058297
>>> timeit.timeit("complicated(a)", "from __main__ import a, complicated")
2.708565960987471
it seemed like a good idea at the start, got a bit twisty, had to patch a couple of cases...
a = [0,77,98,99,100,101,102,198,199,200,200,278,299,300,300,300, 459, 700,700]
bz = [*zip(*((i, d//100) for i, d in enumerate(a) if d%100 == 0 and d != 0))]
[a for a, b, c in zip(*bz, bz[1][1:]) if c-b != 0] + [bz[0][-1]]
Out[78]: [4, 10, 15, 18]
enumerate, zip to create bz which mates 100's numerator(s) with indices
bz = [*zip(*((i, d//100) for i, d in enumerate(a) if d%100 == 0 and d != 0))]
print(*bz, sep='\n')
(4, 9, 10, 13, 14, 15, 17, 18)
(1, 2, 2, 3, 3, 3, 7, 7)
then zip again, zip(*bz, bz[1][1:]) lagging the numerator tuple to allow the lagged difference to give a selection logic if c-b != 0for the last index of each run but the last
add the last 100's match because its always the end of the last run + [bz[0][-1]]
I have just started learning python and worried that if I use dict.get(key,default_value) or I define my own method for it....so do they have any differences:
[1st method]:
dict={}
for c in string:
if c in dict:
dict[c]+=1
else:
dict[c]=1
and the other dict.get() method that python provides
for c in string:
dict[c]=dict.get(c,0)+1
do they have any differences on efficiency or speed...or they are just the same and 2nd one only saves writing few more lines of code...
For this specific case, use either a collections.Counter() or a collections.defaultdict() object instead:
import collections
dct = collections.defaultdict(int)
for c in string:
dict[c] += 1
or
dct = collections.Counter(string)
Both are subclasses of the standard dict type. The Counter type adds some more helpful functionality like summing two counters or listing the most common entities that have been counted. The defaultdict class can also be given other default types; use defaultdict(list) for example to collect things into lists per key.
When you want to compare performance of two different approaches, you want to use the timeit module:
>>> import timeit
>>> def intest(dct, values):
... for c in values:
... if c in dct:
... dct[c]+=1
... else:
... dct[c]=1
...
>>> def get(dct, values):
... for c in values:
... dct[c] = dct.get(c, 0) + 1
...
>>> values = range(10) * 10
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, intest as test; dct={}')
22.210275888442993
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, get as test; dct={}')
27.442166090011597
This shows that using in is a little faster.
There is, however, a third option to consider; catching the KeyError exception:
>>> def tryexcept(dct, values):
... for c in values:
... try:
... dct[c] += 1
... except KeyError:
... dct[c] = 1
...
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, tryexcept as test; dct={}')
18.023509979248047
which happens to be the fastest, because only 1 in 10 cases are for a new key.
Last but not least, the two alternatives I proposed:
>>> def default(dct, values):
... for c in values:
... dct[c] += 1
...
>>> timeit.timeit('test(dct, values)', 'from __main__ import values, default as test; from collections import defaultdict; dct=defaultdict(int)')
15.277361154556274
>>> timeit.timeit('Counter(values)', 'from __main__ import values; from collections import Counter')
38.657804012298584
So the Counter() type is slowest, but defaultdict is very fast indeed. Counter()s do a lot more work though, and the extra functionality can bring ease of development and execution speed benefits elsewhere.
What is the fastest way to check if a string contains some characters from any items of a list?
Currently, I'm using this method:
lestring = "Text123"
lelist = ["Text", "foo", "bar"]
for x in lelist:
if lestring.count(x):
print 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
Is there any way to do it without iteration (which will make it faster I suppose.)?
You can try list comprehension with membership check
>>> lestring = "Text123"
>>> lelist = ["Text", "foo", "bar"]
>>> [e for e in lelist if e in lestring]
['Text']
Compared to your implementation, though LC has an implicit loop but its faster as there is no explicit function call as in your case with count
Compared to Joe's implementation, yours is way faster, as the filter function would require to call two functions in a loop, lambda and count
>>> def joe(lelist, lestring):
return ''.join(random.sample(x + 'b'*len(x), len(x)))
>>> def uz(lelist, lestring):
for x in lelist:
if lestring.count(x):
return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
>>> def ab(lelist, lestring):
return [e for e in lelist if e in lestring]
>>> t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
>>> t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
>>> t_joe = timeit.Timer("joe(lelist, lestring)", setup="from __main__ import lelist, lestring, joe")
>>> t_ab.timeit(100000)
0.09391469893125759
>>> t_uz.timeit(100000)
0.1528471407273173
>>> t_joe.timeit(100000)
1.4272649857800843
Jamie's commented solution is slower for shorter string's. Here is the test result
>>> def jamie(lelist, lestring):
return next(itertools.chain((e for e in lelist if e in lestring), (None,))) is not None
>>> t_jamie = timeit.Timer("jamie(lelist, lestring)", setup="from __main__ import lelist, lestring, jamie")
>>> t_jamie.timeit(100000)
0.22237164127909637
If you need Boolean values, for shorter strings, just modify the above LC expression
[e in lestring for e in lelist if e in lestring]
Or for longer strings, you can do the following
>>> next(e in lestring for e in lelist if e in lestring)
True
or
>>> any(e in lestring for e in lelist)
filter(lambda x: lestring.count(x), lelist)
That will return all the strings that you're trying to find as a list.
if the test is to see if there are any characters in common (not words or segments), create a set out of the letters in the list and then check the letters agains the string:
char_list = set(''.join(list_of_words))
test_set = set(string_to_teat)
common_chars = char_list.intersection(test_set)
However I'm assuming you're looking for as little as one character in common...
The esmre library does the trick. In your case, the simpler, esm (part of esmre) is what you want.
https://pypi.python.org/pypi/esmre/
https://code.google.com/p/esmre/
They have good documentation and examples:
Taken from their examples:
>>> import esm
>>> index = esm.Index()
>>> index.enter("he")
>>> index.enter("she")
>>> index.enter("his")
>>> index.enter("hers")
>>> index.fix()
>>> index.query("this here is history")
[((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
>>> index.query("Those are his sheep!")
[((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
>>>
I ran some performance tests:
import random, timeit, string, esm
def uz(lelist, lestring):
for x in lelist:
if lestring.count(x):
return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
def ab(lelist, lestring):
return [e for e in lelist if e in lestring]
def use_esm(index, lestring):
return index.query(lestring)
for TEXT_LEN in [5, 50, 1000]:
for SEARCH_LEN in [5, 20]:
for N in [5, 50, 1000, 10000]:
if TEXT_LEN < SEARCH_LEN:
continue
print 'TEXT_LEN:', TEXT_LEN, 'SEARCH_LEN:', SEARCH_LEN, 'N:', N
lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN)))
lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _
in range(N)]
index = esm.Index()
for i in lelist:
index.enter(i)
index.fix()
t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm")
ab_time = t_ab.timeit(1000)
uz_time = t_uz.timeit(1000)
esm_time = t_esm.timeit(1000)
min_time = min(ab_time, uz_time, esm_time)
print ' ab%s: %f' % ('*' if ab_time == min_time else '', ab_time)
print ' uz%s: %f' % ('*' if uz_time == min_time else '', uz_time)
print ' esm%s %f:' % ('*' if esm_time == min_time else '', esm_time)
And got that results depends mostly on the number of items that one is looking for (in my case, 'N'):
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5
ab*: 0.001733
uz: 0.002512
esm 0.126853:
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50
ab*: 0.017564
uz: 0.023701
esm 0.079925:
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000
ab: 0.370371
uz: 0.489523
esm* 0.133783:
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000
ab: 3.678790
uz: 4.883575
esm* 0.259605:
I would like to make a pair of two elements. I don't care about the order of the elements, so I use frozenset.
I can think of the following two methods to iterate the elements back from the frozenset. Isn't there any fancier method? Thanks in advance.
pair = frozenset([element1, element2])
pair2 = list(pair)
elem1 = pair2[0]
elem2 = pair2[1]
pair = frozenset([element1, element2])
elems = []
for elem in pair:
elems.append(elem)
elem1 = elems[0]
elem2 = elems[1]
pair = frozenset([element1, element2])
elem1, elem2 = pair
If you have a lot of those pair things, using frozenset() is NOT a good idea. Use tuples instead.
>>> import sys
>>> fs1 = frozenset([42, 666])
>>> fs2 = frozenset([666, 42])
>>> fs1 == fs2
True
>>> t1 = tuple(sorted([42, 666]))
>>> t2 = tuple(sorted([666, 42]))
>>> t1 == t2
True
>>> sys.getsizeof(fs1)
116
>>> sys.getsizeof(t1)
36
>>>
Update Bonus: sorted tuples have a predictable iteration sequence:
>>> for thing in fs1, fs2, t1, t2: print [x for x in thing]
...
[42, 666]
[666, 42]
[42, 666]
[42, 666]
>>>
Update 2 ... and their repr() is the same:
>>> repr(fs1)
'frozenset([42, 666])'
>>> repr(fs2)
'frozenset([666, 42])' # possible source of confusion
>>> repr(t1)
'(42, 666)'
>>> repr(t2)
'(42, 666)'
>>>
If it is just two elements you are de-sequence them. But I am not sure, what you are trying to do here with the frozenset
>>> s = frozenset([1,2])
>>> s
frozenset({1, 2})
>>> x,y = s
>>> x
1
>>> y
2
Just to elaborate on an above comment, and assuming your elements are easily sortable, you could make an unordered pair class from tuple using:
class Pair(tuple):
def __new__(cls, seq=()):
assert len(seq) == 2
return tuple.__new__(tuple, sorted(seq))
Then you get:
>>> Pair((0, 1))
(0, 1)
>>> Pair((1, 0))
(0, 1)
>>> Pair((0, 1)) == Pair((1, 0))
True