Improving index search speed in a large (million entry) list

Improving index search speed in a large (million entry) list - python

I have a list which is a million items long of random, repeatable integers. I need to sort that list, and then find the index of the first iteration of every unique element in the list. When I do this, I am running into run time >5 minutes long. Can anyone give me any suggestions to speed up my code? An example of my process is shown below.
import random
a = []
for x in range(1000000):
a.append(random.randint(1,10000))
unique_a = set(a)
inds=[0]
inds = [a.index(i) for i in sorted(unique_a) if i not in inds]

inds = [a.index(i) for i in sorted(unique_a) if i not in inds] is implicitly quadratic is a.index(i) is linear. Use a dictionary to grab the indices in one pass over the sorted list:
a =sorted([0,4,3,5,21,5,6,3,1,23,4,6,1,93,34,10])
unique_a = set(a)
first_inds = {}
for i,x in enumerate(a):
if not x in first_inds:
first_inds[x] = i
my_inds = [first_inds[x] for x in sorted(unique_a)]

Just store the first position for every unique element:
first_position = {}
for i, value in enumerate(a):
if value not in first_position:
first_position[value] = i
And then replace a.index(i) for first_position[i]
Or just use:
_, indices = zip(*sorted(first_position.items()))

You can use the bisect_left function from the standard library's bisect module to do this. On a sorted list, a bisection search is faster than searching through the list as index does.
>>> L = [random.randint(0, 10) for _ in range(100)]
>>> L.sort()
>>> L.index(9)
83
>>> bisect.bisect_left(L, 9)
83
>>> timeit.timeit(setup="from __main__ import L", stmt="L.index(9)")
2.1408978551626205
>>> timeit.timeit(setup="from __main__ import L;from bisect import bisect_left", stmt="bisect_left(L, 9)")
0.5187544231303036
On my machine, using bisect.bisect_left is faster than iterating over the list and accumulating indexes on the way:
>>> L = [random.randint(0, 100) for _ in range(10000)]
>>> L.sort()
>>> def iterative_approach(list_):
... unique = set(list_)
... first_inds = {}
... for i, x in enumerate(list_):
... if x not in first_inds:
... first_inds[x] = i
... return [first_inds[x] for x in sorted(unique)]
...
>>> ia = iterative_approach(L)
>>> bisect_left = bisect.bisect_left
>>> def bisect_approach(list_):
... unique = set(list_)
... out = {}
... for x in unique:
... out[x] = bisect_left(list_, x)
... return [out[x] for x in sorted(unique)]
...
>>> ba = bisect_approach(L)
>>> ia == ba
True
>>> timeit.timeit(setup="from __main__ import L, iterative_approach", stmt="iterative_approach(L)")
1488.956467495067
>>> timeit.timeit(setup="from __main__ import L, bisect_approach", stmt="bisect_approach(L)")
407.6803469741717

Related

Max index of repeating elements divisible by 'n' in a List in Python

I have a sorted list of numbers like:
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
I need to find the max index of each values which is divisible by 100.
Output should be like: 4,10,15
My Code:
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
idx = 1
for i in (a):
if i%100 == 0:
print idx
idx = idx+1
Output of above code:
4
9
10
13
14
15

In case people are curious, I benchmarked the dict comprehension technique against the backward iteration technique. Dict comprehension is about twice the speed. Changing to OrderedDict resulted in MASSIVE slowdown. About 15x slower than the dict comprehension.
def test1():
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
max_index = {}
for i, item in enumerate(a[::-1]):
if item not in max_index:
max_index[item] = len(a) - (i + 1)
return max_index
def test2():
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
return {item: index for index, item in enumerate(a, 1)}
def test3():
a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
OrderedDict((item, index) for index, item in enumerate(a, 1))
if __name__ == "__main__":
import timeit
print(timeit.timeit("test1()", setup="from __main__ import test1"))
print(timeit.timeit("test2()", setup="from __main__ import test2"))
print(timeit.timeit("test3()", setup="from __main__ import test3; from collections import OrderedDict"))
3.40622282028
1.97545695305
26.347012043

Use a simple dict-comprehension or OrderedDict with divisible items as the keys, old values will be replaced by newest values automatically.
>>> {item: index for index, item in enumerate(lst, 1) if not item % 100}.values()
dict_values([4, 10, 15])
# if order matters
>>> from collections import OrderedDict
>>> OrderedDict((item, index) for index, item in enumerate(lst, 1) if not item % 100).values()
odict_values([4, 10, 15])
Another way will be to loop over reversed list and use a set to keep track of items seen so far(lst[::-1] may be slightly faster than reversed(lst) for tiny lists).
>>> seen = set()
>>> [len(lst) - index for index, item in enumerate(reversed(lst))
if not item % 100 and item not in seen and not seen.add(item)][::-1]
[4, 10, 15]
You can see the sort-of equivalent code of the above here.

You could use itertools.groupby since your data is sorted:
>>> a = [77,98,99,100,101,102,198,199,200,200,278,299,300,300,300]
>>> from itertools import groupby
>>> [list(g)[-1][0] for k,g in groupby(enumerate(a), lambda t: (t[1] % 100, t[1])) if k[0] == 0]
[3, 9, 14]
Although this is a little cryptic.
Here's a complicated approach using only a list-iterator and accumulating into a list:
>>> run, prev, idx = False, None, []
>>> for i, e in enumerate(a):
... if not (e % 100 == 0):
... if not run:
... prev = e
... continue
... idx.append(i - 1)
... run = False
... else:
... if prev != e and run:
... idx.append(i - 1)
... run = True
... prev = e
...
>>> if run:
... idx.append(i)
...
>>> idx
[3, 9, 14]
I think this is best dealt with a dictionary approach like #AshwiniChaudhary It is more straightforward, and much faster:
>>> timeit.timeit("{item: index for index, item in enumerate(a, 1)}", "from __main__ import a")
1.842843743012054
>>> timeit.timeit("[list(g)[-1][0] for k,g in groupby(enumerate(a), lambda t: (t[1] % 100, t[1])) if k[0] == 0]", "from __main__ import a, groupby")
8.479677081981208
The groupby approach is pretty slow, note, the complicated approach is faster, and not far-off form the dict-comprehension approach:
>>> def complicated(a):
... run, prev, idx = False, None, []
... for i, e in enumerate(a):
... if not (e % 100 == 0):
... if not run:
... prev = e
... continue
... idx.append(i - 1)
... run = False
... else:
... if prev != e and run:
... idx.append(i - 1)
... run = True
... prev = e
... if run:
... idx.append(i)
... return idx
...
>>> timeit.timeit("complicated(a)", "from __main__ import a, complicated")
2.6667005629860796
Edit Note, the performance difference narrows if we call list on the dict-comprehension .values():
>>> timeit.timeit("list({item: index for index, item in enumerate(a, 1)}.values())", "from __main__ import a")
2.3839886570058297
>>> timeit.timeit("complicated(a)", "from __main__ import a, complicated")
2.708565960987471

it seemed like a good idea at the start, got a bit twisty, had to patch a couple of cases...
a = [0,77,98,99,100,101,102,198,199,200,200,278,299,300,300,300, 459, 700,700]
bz = [*zip(*((i, d//100) for i, d in enumerate(a) if d%100 == 0 and d != 0))]
[a for a, b, c in zip(*bz, bz[1][1:]) if c-b != 0] + [bz[0][-1]]
Out[78]: [4, 10, 15, 18]
enumerate, zip to create bz which mates 100's numerator(s) with indices
bz = [*zip(*((i, d//100) for i, d in enumerate(a) if d%100 == 0 and d != 0))]
print(*bz, sep='\n')
(4, 9, 10, 13, 14, 15, 17, 18)
(1, 2, 2, 3, 3, 3, 7, 7)
then zip again, zip(*bz, bz[1][1:]) lagging the numerator tuple to allow the lagged difference to give a selection logic if c-b != 0for the last index of each run but the last
add the last 100's match because its always the end of the last run + [bz[0][-1]]

Mode of a List - Python

I am trying to write my own function that finds the mode of a list but it sputters when there are multiple modes. Can someone help me in adding something to the function that deals with the case with multiple modes. Thanks in advance!
def ModeList(nums):
subscript = 0
while subscript < len(nums):
if nums.count(nums[subscript]) > nums.count(nums[subscript + 1]):
return "The mode is " + str( nums[subscript] ) + "."
else:
subscript += 1
print ModeList( [2,4,6,8,6,8] )

Easiest is to use a collections.Counter():
from collections import Counter
def ModeList(lst):
return Counter(lst).most_common(1)[0][0]
Demo:
>>> from collections import Counter
>>> def ModeList(lst):
... return Counter(lst).most_common(1)[0][0]
...
>>> ModeList( [2,4,6,8,6,8] )
8
Add in itertools.groupby() if you need all values:
from collections import Counter
from itertools import groupby
from operator import itemgetter
def ModeList(lst):
counts = Counter(lst)
grouped = groupby(counts.most_common(), itemgetter(1))
return [i[0] for i in next(grouped)[1]]
Demo:
>>> from collections import Counter
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> def ModeList(lst):
... counts = Counter(lst)
... grouped = groupby(counts.most_common(), itemgetter(1))
... return [i[0] for i in next(grouped)[1]]
...
>>> ModeList( [2,4,6,8,6,8] )
[8, 6]
Without importing, use a dictionary to track counts, then sort by value:
def ModeList(lst):
counts = {}
for item in lst:
counts[item] = counts.get(item, 0) + 1
return sorted(counts, key=counts.get, reverse=True)[0]
or for a list:
def ModeList(lst):
counts = {}
for item in lst:
counts[item] = counts.get(item, 0) + 1
bycount = sorted(counts, key=counts.get)
result = [bycount.pop()]
while counts and counts[bycount[-1]] == counts[result[0]]:
result.append(bycount.pop())
return result

Fastest way to check if a string contains specific characters in any of the items in a list

What is the fastest way to check if a string contains some characters from any items of a list?
Currently, I'm using this method:
lestring = "Text123"
lelist = ["Text", "foo", "bar"]
for x in lelist:
if lestring.count(x):
print 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
Is there any way to do it without iteration (which will make it faster I suppose.)?

You can try list comprehension with membership check
>>> lestring = "Text123"
>>> lelist = ["Text", "foo", "bar"]
>>> [e for e in lelist if e in lestring]
['Text']
Compared to your implementation, though LC has an implicit loop but its faster as there is no explicit function call as in your case with count
Compared to Joe's implementation, yours is way faster, as the filter function would require to call two functions in a loop, lambda and count
>>> def joe(lelist, lestring):
return ''.join(random.sample(x + 'b'*len(x), len(x)))
>>> def uz(lelist, lestring):
for x in lelist:
if lestring.count(x):
return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
>>> def ab(lelist, lestring):
return [e for e in lelist if e in lestring]
>>> t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
>>> t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
>>> t_joe = timeit.Timer("joe(lelist, lestring)", setup="from __main__ import lelist, lestring, joe")
>>> t_ab.timeit(100000)
0.09391469893125759
>>> t_uz.timeit(100000)
0.1528471407273173
>>> t_joe.timeit(100000)
1.4272649857800843
Jamie's commented solution is slower for shorter string's. Here is the test result
>>> def jamie(lelist, lestring):
return next(itertools.chain((e for e in lelist if e in lestring), (None,))) is not None
>>> t_jamie = timeit.Timer("jamie(lelist, lestring)", setup="from __main__ import lelist, lestring, jamie")
>>> t_jamie.timeit(100000)
0.22237164127909637
If you need Boolean values, for shorter strings, just modify the above LC expression
[e in lestring for e in lelist if e in lestring]
Or for longer strings, you can do the following
>>> next(e in lestring for e in lelist if e in lestring)
True
or
>>> any(e in lestring for e in lelist)

filter(lambda x: lestring.count(x), lelist)
That will return all the strings that you're trying to find as a list.

if the test is to see if there are any characters in common (not words or segments), create a set out of the letters in the list and then check the letters agains the string:
char_list = set(''.join(list_of_words))
test_set = set(string_to_teat)
common_chars = char_list.intersection(test_set)
However I'm assuming you're looking for as little as one character in common...

The esmre library does the trick. In your case, the simpler, esm (part of esmre) is what you want.
https://pypi.python.org/pypi/esmre/
https://code.google.com/p/esmre/
They have good documentation and examples:
Taken from their examples:
>>> import esm
>>> index = esm.Index()
>>> index.enter("he")
>>> index.enter("she")
>>> index.enter("his")
>>> index.enter("hers")
>>> index.fix()
>>> index.query("this here is history")
[((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
>>> index.query("Those are his sheep!")
[((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
>>>
I ran some performance tests:
import random, timeit, string, esm
def uz(lelist, lestring):
for x in lelist:
if lestring.count(x):
return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
def ab(lelist, lestring):
return [e for e in lelist if e in lestring]
def use_esm(index, lestring):
return index.query(lestring)
for TEXT_LEN in [5, 50, 1000]:
for SEARCH_LEN in [5, 20]:
for N in [5, 50, 1000, 10000]:
if TEXT_LEN < SEARCH_LEN:
continue
print 'TEXT_LEN:', TEXT_LEN, 'SEARCH_LEN:', SEARCH_LEN, 'N:', N
lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN)))
lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _
in range(N)]
index = esm.Index()
for i in lelist:
index.enter(i)
index.fix()
t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm")
ab_time = t_ab.timeit(1000)
uz_time = t_uz.timeit(1000)
esm_time = t_esm.timeit(1000)
min_time = min(ab_time, uz_time, esm_time)
print ' ab%s: %f' % ('*' if ab_time == min_time else '', ab_time)
print ' uz%s: %f' % ('*' if uz_time == min_time else '', uz_time)
print ' esm%s %f:' % ('*' if esm_time == min_time else '', esm_time)
And got that results depends mostly on the number of items that one is looking for (in my case, 'N'):
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5
ab*: 0.001733
uz: 0.002512
esm 0.126853:
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50
ab*: 0.017564
uz: 0.023701
esm 0.079925:
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000
ab: 0.370371
uz: 0.489523
esm* 0.133783:
TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000
ab: 3.678790
uz: 4.883575
esm* 0.259605:

Return items in a list that occur with a given minimum frequency

need help creating the function freqitems(S,p) that returns a sorted list of the items in sequence S that occur with at least p percent frequency, with no duplicates. example output should be like this:
>>> freqitems([2,2,2,3],50)
[2]
>>> freqitems(5*["alpha"]+["beta"]+3*["gamma"]+7*["delta"], 25)
['alpha', 'delta']
>>> freqitems(5*["alpha"]+["beta"]+3*["gamma"]+7*["delta"], 33)
['delta']

yourlist = 5*["alpha"]+["beta"]+3*["gamma"]+7*["delta"]
def freqitems(sequence, p):
import collections
counter = collections.Counter(sequence)
return sorted([k for k,v in counter.iteritems() if 100.*v/len(sequence) >= p])
print freqitems(yourlist, 25)
print freqitems(yourlist, 33)

Use frozenset as a pair in python

I would like to make a pair of two elements. I don't care about the order of the elements, so I use frozenset.
I can think of the following two methods to iterate the elements back from the frozenset. Isn't there any fancier method? Thanks in advance.
pair = frozenset([element1, element2])
pair2 = list(pair)
elem1 = pair2[0]
elem2 = pair2[1]
pair = frozenset([element1, element2])
elems = []
for elem in pair:
elems.append(elem)
elem1 = elems[0]
elem2 = elems[1]

pair = frozenset([element1, element2])
elem1, elem2 = pair

If you have a lot of those pair things, using frozenset() is NOT a good idea. Use tuples instead.
>>> import sys
>>> fs1 = frozenset([42, 666])
>>> fs2 = frozenset([666, 42])
>>> fs1 == fs2
True
>>> t1 = tuple(sorted([42, 666]))
>>> t2 = tuple(sorted([666, 42]))
>>> t1 == t2
True
>>> sys.getsizeof(fs1)
116
>>> sys.getsizeof(t1)
36
>>>
Update Bonus: sorted tuples have a predictable iteration sequence:
>>> for thing in fs1, fs2, t1, t2: print [x for x in thing]
...
[42, 666]
[666, 42]
[42, 666]
[42, 666]
>>>
Update 2 ... and their repr() is the same:
>>> repr(fs1)
'frozenset([42, 666])'
>>> repr(fs2)
'frozenset([666, 42])' # possible source of confusion
>>> repr(t1)
'(42, 666)'
>>> repr(t2)
'(42, 666)'
>>>

If it is just two elements you are de-sequence them. But I am not sure, what you are trying to do here with the frozenset
>>> s = frozenset([1,2])
>>> s
frozenset({1, 2})
>>> x,y = s
>>> x
1
>>> y
2

Just to elaborate on an above comment, and assuming your elements are easily sortable, you could make an unordered pair class from tuple using:
class Pair(tuple):
def __new__(cls, seq=()):
assert len(seq) == 2
return tuple.__new__(tuple, sorted(seq))
Then you get:
>>> Pair((0, 1))
(0, 1)
>>> Pair((1, 0))
(0, 1)
>>> Pair((0, 1)) == Pair((1, 0))
True

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Improving index search speed in a large (million entry) list - python

Just store the first position for every unique element: first_position = {} for i, value in enumerate(a): if value not in first_position: first_position[value] = i And then replace a.index(i) for first_position[i] Or just use: _, indices = zip(*sorted(first_position.items()))

Related

Max index of repeating elements divisible by 'n' in a List in Python

Mode of a List - Python

Fastest way to check if a string contains specific characters in any of the items in a list

Return items in a list that occur with a given minimum frequency

Use frozenset as a pair in python

Categories

Resources