The following is how I'm getting subsets of a list:
def subsets(s):
if not s: return [[]]
rest = subsets(s[1:])
return rest + map(lambda x: [s[0]] + x, rest)
x = sorted(subsets([1,2,3]), key=lambda x: (len(x), x))
# [[], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
It works, but it is a bit slow. I'm wondering what might be a better, less naive algorithm to get all subsets in Python (without just using something like itertools.combinations).
It would be much more efficient to avoid storing the list of subsets in memory unless you actually need it.
Solution 1: return a generator as output
This code accomplishes that with a recursive generator function:
def subsets(s):
if not s:
yield []
else:
for x in subsets(s[1:]):
yield x
yield [s[0]] + x
This function does its work lazily, only as required by the caller, and never holds more than the call stack and the subset currently being produced in memory.
The first yield in the else clause serves to yield all elements of subsets(s[1:]) unchanged, i.e., those subsets of s that don't include s[0], while the second one yields all the subsets of s that do include s[0].
Of course, if you want this list sorted, the efficiency is lost when you call sorted() on the results, which builds the list in memory.
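On the other hand, if you only need some of the subsets, you can consume the generator lazily. Here is a small usage sketch (illustrative, not part of the solution itself) using itertools.islice:

from itertools import islice

# Take the first 5 subsets of a 30-element input without building all 2**30 of them.
for subset in islice(subsets(range(30)), 5):
    print(subset)
# Prints: [], [0], [1], [0, 1], [2]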
Solution 2: also accept a generator or an iterator as input
Solution 1 has the weakness that it only works on indexable, sliceable types like lists and tuples, where s[0] and s[1:] are well defined.
If we want to make subsets more generic and accept any iterable (including iterators), we have to avoid that syntax and use next() instead:
def subsets(s):
iterator = iter(s)
try:
item = next(iterator)
except StopIteration:
        # s is empty, bottom out of the recursion
yield []
else:
# s is not empty, recurse and expand
for x in subsets(iterator):
yield x
yield [item] + x
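A quick usage sketch (illustrative only): because the input is consumed through iter()/next(), this version accepts any iterable, including a generator expression that has no s[0] or s[1:]:

evens = (x for x in range(10) if x % 2 == 0)  # a generator, not a list
for subset in subsets(evens):
    print(subset)
# Prints [], [0], [2], [0, 2], [4], ... and ends with [0, 2, 4, 6, 8]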
Measuring performance
With all this, I wanted to know how much difference these solutions make, so I ran some benchmarks.
I compared your initial code, which I'm calling subsets_lists; my solution 1, subsets_generator; and my solution 2, subsets_gen_iterable. For completeness, I added the powerset function suggested in the itertools documentation.
Results:
All subsets of a list of size 10, each call repeated 10 times:
Code block 'l = subsets_lists(range(10)) repeats=10' took: 5.58828 ms
Code block 'g = subsets_generator(range(10)) repeats=10' took: 0.05693 ms
Code block 'g = subsets_gen_iterable(range(10)) repeats=10' took: 0.01316 ms
Code block 'l = list(subsets_generator(range(10))) repeats=10' took: 4.86464 ms
Code block 'l = list(subsets_gen_iterable(range(10))) repeats=10' took: 3.76597 ms
Code block 'l = list(powerset(range(10))) size=10 repeats=10' took: 1.11228 ms
All subsets of a list of size 20, each call repeated 10 times:
Code block 'l = subsets_lists(range(20)) repeats=10' took: 12144.99487 ms
Code block 'g = subsets_generator(range(20)) repeats=10' took: 65.18992 ms
Code block 'g = subsets_gen_iterable(range(20)) repeats=10' took: 0.01784 ms
Code block 'l = list(subsets_generator(range(20))) repeats=10' took: 10859.75128 ms
Code block 'l = list(subsets_gen_iterable(range(20))) repeats=10' took: 10074.26618 ms
Code block 'l = list(powerset(range(20))) size=20 repeats=10' took: 2336.81373 ms
Observations:
You can see that calling the two generator functions does not do much work unless the generators are fed into something that consumes them, like the list() constructor, although subsets_generator does seem to do a lot more work up front than subsets_gen_iterable.
My two solutions yield modest speed improvements over your code.
The itertools-based solution is still much faster.
Actual benchmarking code:
import itertools
import linetimer
def subsets_lists(s):
if not s: return [[]]
rest = subsets_lists(s[1:])
return rest + list(map(lambda x: [s[0]] + x, rest))
def subsets_generator(s):
if not s:
yield []
else:
for x in subsets_generator(s[1:]):
yield x
yield [s[0]] + x
def subsets_gen_iterable(s):
iterator = iter(s)
try:
item = next(iterator)
except StopIteration:
yield []
else:
        for x in subsets_gen_iterable(iterator):
yield x
yield [item] + x
def powerset(iterable):
"Source: https://docs.python.org/3/library/itertools.html#itertools-recipes"
s = list(iterable)
return itertools.chain.from_iterable(
itertools.combinations(s, r)
for r in range(len(s)+1)
)
for repeats in 1, 10:
for size in 4, 10, 14, 20:
with linetimer.CodeTimer(f"l = subsets_lists(range({size})) repeats={repeats}"):
for _ in range(repeats):
l = subsets_lists(range(size))
with linetimer.CodeTimer(f"g = subsets_generator(range({size})) repeats={repeats}"):
for _ in range(repeats):
l = subsets_generator(range(size))
with linetimer.CodeTimer(f"g = subsets_gen_iterable(range({size})) repeats={repeats}"):
for _ in range(repeats):
                l = subsets_gen_iterable(range(size))
with linetimer.CodeTimer(f"l = list(subsets_generator(range({size}))) repeats={repeats}"):
for _ in range(repeats):
l = list(subsets_generator(range(size)))
with linetimer.CodeTimer(f"l = list(subsets_gen_iterable(range({size}))) repeats={repeats}"):
for _ in range(repeats):
                l = list(subsets_gen_iterable(range(size)))
with linetimer.CodeTimer(f"l = list(powerset(range({size}))) size={size} repeats={repeats}"):
for _ in range(repeats):
l = list(powerset(range(size)))
Related
Let's suppose I have a block of bytes like this:
block = b'0123456789AB'
I want to extract each sequence of 3 bytes from each chunk of 4 bytes and join them together. The result for the block above should be:
b'01245689A' # 3, 7 and B are missed
I could solve this issue with such script:
block = b'0123456789AB'
result = b''
for i in range(0, len(block), 4):
result += block[i:i + 3]
print(result)
But as is well known, Python is quite inefficient with such for loops and bytes concatenation, so my approach will never finish if I apply it to a really huge block of bytes. Is there a faster way to do this?
Make it mutable and delete the unwanted slice?
>>> tmp = bytearray(block)
>>> del tmp[3::4]
>>> bytes(tmp)
b'01245689A'
If your chunks are large and you want to remove almost all of the bytes, it might become faster to instead collect what you do want, similar to your approach. But since repeated concatenation potentially takes quadratic time, it's better to use join:
>>> b''.join([block[i : i+3] for i in range(0, len(block), 4)])
b'01245689A'
(Btw according to PEP 8 it should be block[i : i+3], not block[i:i + 3], and for good reason.)
That builds a lot of small objects, though, which could be a memory problem. For your stated case, it's much faster than yours but much slower than my bytearray one.
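If numpy is available, another option (a sketch I have not benchmarked below, and it assumes len(block) is a multiple of 4) is to view the buffer as a 2D array and drop the fourth column:

import numpy as np

block = b'0123456789AB'
arr = np.frombuffer(block, dtype=np.uint8).reshape(-1, 4)
print(arr[:, :3].tobytes())  # b'01245689A'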
Benchmark with block = b'0123456789AB' * 100_000 (much smaller than the 1GB you mentioned in the comments below):
0.00 ms 0.00 ms 0.00 ms baseline
15267.60 ms 14724.33 ms 14712.70 ms original
2.46 ms 2.46 ms 3.45 ms Kelly_Bundy_bytearray
83.66 ms 85.27 ms 122.88 ms Kelly_Bundy_join
Benchmark code:
import timeit
def baseline(block):
pass
def original(block):
result = b''
for i in range(0, len(block), 4):
result += block[i:i + 3]
return result
def Kelly_Bundy_bytearray(block):
tmp = bytearray(block)
del tmp[3::4]
return bytes(tmp)
def Kelly_Bundy_join(block):
return b''.join([block[i : i+3] for i in range(0, len(block), 4)])
funcs = [
baseline,
original,
Kelly_Bundy_bytearray,
Kelly_Bundy_join,
]
block = b'0123456789AB' * 100_000
args = block,
number = 10**0
expect = original(*args)
for func in funcs:
print(func(*args) == expect, func.__name__)
print()
tss = [[] for _ in funcs]
for _ in range(3):
for func, ts in zip(funcs, tss):
t = min(timeit.repeat(lambda: func(*args), number=number)) / number
ts.append(t)
print(*('%8.2f ms ' % (1e3 * t) for t in ts), func.__name__)
print()
Suppose I have a list that goes like :
'''
[1,2,3,4,9,10,11,20]
'''
I need the result to be like :
'''
[[4,9],[11,20]]
'''
I have defined a function that goes like this :
def get_range(lst):
i=0
seqrange=[]
for new in lst:
a=[]
start=new
end=new
if i==0:
i=1
old=new
else:
if new - old >1:
a.append(old)
a.append(new)
old=new
if len(a):
seqrange.append(a)
return seqrange
Is there any other easier and more efficient way to do it? I need to do this on lists with millions of elements.
You can use numpy arrays and the diff function that comes along with them. Numpy is so much more efficient than looping when you have millions of rows.
Slight aside:
Why are numpy arrays so fast? Because they are arrays of data instead of arrays of pointers to data (which is what Python lists are), because they offload a whole bunch of computations to a backend written in C, and because they leverage the SIMD paradigm to run a Single Instruction on Multiple Data simultaneously.
Now back to the problem at hand:
The diff function gives us the difference between consecutive elements of the array. Pretty convenient, given that we need to find where this difference is greater than a known threshold!
import numpy as np
threshold = 1
arr = np.array([1,2,3,4,9,10,11,20])
deltas = np.diff(arr)
# There's a gap wherever the delta is greater than our threshold
gaps = deltas > threshold
gap_indices = np.argwhere(gaps)
gap_starts = arr[gap_indices]
gap_ends = arr[gap_indices + 1]
# Finally, stack the two arrays horizontally
all_gaps = np.hstack((gap_starts, gap_ends))
print(all_gaps)
# Output:
# [[ 4 9]
# [11 20]]
You can access all_gaps like a 2D matrix: all_gaps[0, 1] would give you 9, for example. If you really need the answer as a list-of-lists, simply convert it like so:
all_gaps_list = all_gaps.tolist()
print(all_gaps_list)
# Output: [[4, 9], [11, 20]]
Comparing the runtime of the iterative method from #happydave's answer with the numpy method:
import random
import timeit
import numpy as np
def gaps1(arr, threshold):
deltas = np.diff(arr)
gaps = deltas > threshold
gap_indices = np.argwhere(gaps)
gap_starts = arr[gap_indices]
gap_ends = arr[gap_indices + 1]
all_gaps = np.hstack((gap_starts, gap_ends))
return all_gaps
def gaps2(lst, thr):
seqrange = []
for i in range(len(lst)-1):
if lst[i+1] - lst[i] > thr:
seqrange.append([lst[i], lst[i+1]])
return seqrange
test_list = [i for i in range(100000)]
for i in range(100):
test_list.remove(random.randint(0, len(test_list) - 1))
test_arr = np.array(test_list)
# Make sure both give the same answer:
assert np.all(gaps1(test_arr, 1) == gaps2(test_list, 1))
t1 = timeit.timeit('gaps1(test_arr, 1)', setup='from __main__ import gaps1, test_arr', number=100)
t2 = timeit.timeit('gaps2(test_list, 1)', setup='from __main__ import gaps2, test_list', number=100)
print(f"t1 = {t1}s; t2 = {t2}s; Numpy gives ~{t2 // t1}x speedup")
On my laptop, this gives:
t1 = 0.020834800001466647s; t2 = 1.2446780000027502s; Numpy gives ~59.0x speedup
My word that's fast!
Here is an iterator-based solution. It allows you to get the intervals one by one:
flist = [1,2,3,4,9,10,11,20]
def get_range(lst):
start_idx = lst[0]
    for current_idx in lst[1:]:
if current_idx > start_idx+1:
yield [start_idx, current_idx]
start_idx = current_idx
for interval in get_range(flist):
    print(interval)
I don't think there's anything inefficient about the solution, but you can clean up the code quite a bit:
seqrange = []
for i in range(len(lst)-1):
if lst[i+1] - lst[i] > 1:
seqrange.append([lst[i], lst[i+1]])
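On Python 3.10+, the same idea can be written without index arithmetic using itertools.pairwise (a sketch, not part of the original answer):

from itertools import pairwise  # Python 3.10+

def get_gaps(lst, threshold=1):
    # Compare each element with its successor and keep the pairs that jump.
    return [[a, b] for a, b in pairwise(lst) if b - a > threshold]

print(get_gaps([1, 2, 3, 4, 9, 10, 11, 20]))  # [[4, 9], [11, 20]]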
I think this could be more efficient and a bit cleaner.
def func(lst):
ans=0
final=[]
sol=[]
for i in range(1,lst[-1]+1):
if(i not in lst):
ans+=1
final.append(i)
elif(i in lst and ans>0):
final=[final[0]-1,i]
sol.append(final)
ans=0
final=[]
else:
final=[]
return(sol)
Disclosure: This is for homework help
I want to find the integer at a given position i while repeatedly constructing and adding to a sequence of integers, preferably with decent run time and performance.
You're rebuilding the sub-sequences from 1 at each iteration of the while loop, instead of simply keeping a sequence, adding the next number to it in the following iteration, and then extending the main list with it.
Also, you should defer the str.join until after the while and not build strings at each iteration:
from itertools import count
def give_output(digitPos):
c = count(1)
l, lst = [], []
while len(lst) <= digitPos:
l.append(next(c)) # update previous sub-sequence
lst.extend(l)
return int(''.join(map(str, lst))[digitPos-1])
Timings:
In [10]: %%timeit
...: giveOutput(500)
...:
1000 loops, best of 3: 219 µs per loop
In [11]: %%timeit
...: give_output(500)
...:
10000 loops, best of 3: 126 µs per loop
About half the time!
You can even do better if you pick the i th item using a div-mod approach instead of building a large string; I'll leave that to you.
In fact, moving the list and index outside the function might cache your results better:
list1 = []
i = 2
def giveOutput():
global list1
global i
digitPos = int(input())
while len(list1) <= digitPos:
list1.extend(list(map(int, ''.join(map(str, range(1, i))))))
i = i + 1
print(list1[digitPos -1])
This only really works well when you are given a number of test cases.
Update: (Thanks to Moses for ideas about building up strings)
In fact, your lists could just be strings:
all_digits = ''
digits_up_to_i = ''
i = 1
def giveOutput():
global all_digits
global digits_up_to_i
global i
digitPos = int(input())
while len(all_digits) <= digitPos:
digits_up_to_i += str(i)
all_digits += digits_up_to_i
i = i + 1
print(all_digits[digitPos -1])
From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using suffix-array:
example = open("iliad10.txt").read()
def comlen(p, q):
i = 0
for x in zip(p, q):
if x[0] == x[1]:
i += 1
else:
break
return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
this_len = comlen(example[idx[i]:], example[idx[i+1]:])
print this_len
if this_len > max_len:
max_len = this_len
maxi = i
I found it very slow for the idx.sort step. I think it's slow because Python needs to pass the substring by value instead of by pointer (as the C code above does).
The tested file can be downloaded from here
The C code needs only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never ends on my computer (I waited for 10 minutes and then killed it).
Does anyone have ideas on how to make the code efficient? (For example, less than 10 seconds.)
My solution is based on suffix arrays, constructed by prefix doubling together with the longest-common-prefix (LCP) array. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicated strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter
def longest_common_substring(text):
"""Get the longest common substrings and their positions.
>>> longest_common_substring('banana')
{'ana': [1, 3]}
>>> text = "not so Agamemnon, who spoke fiercely to "
>>> sorted(longest_common_substring(text).items())
[(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]
This function can be easy modified for any criteria, e.g. for searching ten
longest non overlapping repeated substrings.
"""
sa, rsa, lcp = suffix_array(text)
maxlen = max(lcp)
result = {}
for i in range(1, len(text)):
if lcp[i] == maxlen:
j1, j2, h = sa[i - 1], sa[i], lcp[i]
assert text[j1:j1 + h] == text[j2:j2 + h]
substring = text[j1:j1 + h]
if not substring in result:
result[substring] = [j1]
result[substring].append(j2)
return dict((k, sorted(v)) for k, v in result.items())
def suffix_array(text, _step=16):
"""Analyze all common strings in the text.
    Short substrings of length _step are first pre-sorted. The results are
    then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.
Arguments:
text: The text to be analyzed.
_step: Is only for optimization and testing. It is the optimal length
of substrings used for initial pre-sorting. The bigger value is
faster if there is enough memory. Memory requirements are
approximately (estimate for 32 bit Python 3.3):
len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB
Return value: (tuple)
(sa, rsa, lcp)
sa: Suffix array for i in range(1, size):
assert text[sa[i-1]:] < text[sa[i]:]
rsa: Reverse suffix array for i in range(size):
assert rsa[sa[i]] == i
lcp: Longest common prefix for i in range(1, size):
assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
if sa[i-1] + lcp[i] < len(text):
assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
>>> suffix_array(text='banana')
([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])
Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
"""
tx = text
size = len(tx)
step = min(max(_step, 1), len(tx))
sa = list(range(len(tx)))
sa.sort(key=lambda i: tx[i:i + step])
grpstart = size * [False] + [True] # a boolean map for iteration speedup.
    # It helps to skip already resolved values. The last value, True, is a sentinel.
rsa = size * [None]
stgrp, igrp = '', 0
for i, pos in enumerate(sa):
st = tx[pos:pos + step]
if st != stgrp:
grpstart[igrp] = (igrp < i - 1)
stgrp = st
igrp = i
rsa[pos] = igrp
sa[i] = pos
grpstart[igrp] = (igrp < size - 1 or size == 0)
while grpstart.index(True) < size:
# assert step <= size
nextgr = grpstart.index(True)
while nextgr < size:
igrp = nextgr
nextgr = grpstart.index(True, igrp + 1)
glist = []
for ig in range(igrp, nextgr):
pos = sa[ig]
if rsa[pos] != igrp:
break
newgr = rsa[pos + step] if pos + step < size else -1
glist.append((newgr, pos))
glist.sort()
for ig, g in groupby(glist, key=itemgetter(0)):
g = [x[1] for x in g]
sa[igrp:igrp + len(g)] = g
grpstart[igrp] = (len(g) > 1)
for pos in g:
rsa[pos] = igrp
igrp += len(g)
step *= 2
del grpstart
# create LCP array
lcp = size * [None]
h = 0
for i in range(size):
if rsa[i] > 0:
j = sa[rsa[i] - 1]
while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
h += 1
lcp[rsa[i]] = h
if h > 0:
h -= 1
if size > 0:
lcp[0] = 0
return sa, rsa, lcp
I prefer this solution over a more complicated O(n log n) one because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the strictly linear-time operations in the method from that article, which is O(n) only under very special assumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question, if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix
def pairwise(iterable): # itertools recipe
a, b = tee(iterable)
next(b, None)
return izip(a, b)
def longest_duplicate_small(data):
suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
return max(imap(commonprefix, pairwise(suffixes)), key=len)
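A quick sanity check (illustrative; Python 2, like the rest of this answer):

>>> longest_duplicate_small('banana')
'ana'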
buffer() allows getting a substring without copying:
def longest_duplicate_buffer(data):
n = len(data)
sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
def lcp_item(i, j): # find longest common prefix array item
start = i
while i < n and data[i] == data[i + j - start]:
i += 1
return i - start, start
size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
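For illustration (a sketch, not part of the original answer): given sa and lcp arrays such as those returned by the suffix_array() function shown earlier in this thread, the extraction step itself is just a linear scan; only building the arrays with that prefix-doubling code remains O(n (log n)^2):

def longest_duplicate_from_lcp(text):
    # Assumes suffix_array() from the earlier answer is available.
    sa, rsa, lcp = suffix_array(text)
    if not lcp:
        return ''
    i = lcp.index(max(lcp))            # rank with the longest common prefix
    return text[sa[i]:sa[i] + lcp[i]]  # shared with the suffix ranked just before it

# longest_duplicate_from_lcp('banana') -> 'ana'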
Note: the *_memoryview() version below is superseded by the *_buffer() version above.
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
for x, y in izip(a, b):
if x < y:
return -1
elif x > y:
return 1
return cmp(len(a), len(b))
def common_prefix_memoryview((a, b)):
for i, (x, y) in enumerate(izip(a, b)):
if x != y:
return a[:i]
return a if len(a) < len(b) else b
def longest_duplicate(data):
mv = memoryview(data)
suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces a wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).
I'm sure with a little work, you can get the rest working.
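To illustrate the no-copy point (a small sketch, not the full fix):

data = b'abcdefgh' * (10 ** 6)
mv = memoryview(data)
suffix = mv[3:]                          # a view into the same buffer; nothing is copied
assert suffix[:5].tobytes() == b'defgh'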
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>
int main() {
char* test1 = "ovided by The Internet Classics Archive";
char* test2 = "rovided by The Internet Classics Archive.";
printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but it doesn't actually affect either chunk of code. I realized this just as I shut my laptop and left a wifi zone... I really should double-check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:
#!/usr/bin/env python
ex = open("iliad.mb.txt").read()
chains = dict()
# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
s = ''.join(b)
if s not in chains :
chains[s] = list()
chains[s].append(a)
def grow_chains(chains) :
new_chains = dict()
for (string,pos) in chains :
offset = len(string)
for p in pos :
if p + offset >= len(ex) : break
# add one more character
s = string + ex[p + offset]
if s not in new_chains :
new_chains[s] = list()
new_chains[s].append(p)
return new_chains
# grow and filter, grow and filter
while len(chains) > 1 :
print 'length of chains', len(chains)
# remove chains that appear only once
chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]
print 'non-unique chains', len(chains)
print [i[0] for i in chains[:3]]
chains = grow_chains(chains)
The basic idea is to create a list of substrings together with the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by one character and a new list is created. Unique strings are removed again. And so on and so forth...
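To illustrate the idea on a tiny input (this snippet is only an illustration, not part of the program above):

ex = 'banana'
chains = {}
for pos, pair in enumerate(zip(ex, ex[1:])):
    chains.setdefault(''.join(pair), []).append(pos)
print(chains)
# {'ba': [0], 'an': [1, 3], 'na': [2, 4]}  (dict order may vary)
# Dropping the unique 'ba' and growing the rest by one character gives
# {'ana': [1, 3], 'nan': [2]}; after filtering again, only 'ana' survives,
# which is indeed the longest repeated substring of 'banana'.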
I'm parsing a file like this:
--header--
data1
data2
--header--
data3
data4
data5
--header--
--header--
...
And I want groups like this:
[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]
so I can iterate over them like this:
for grp in group(open('file.txt'), lambda line: 'header' in line):
for item in grp:
process(item)
and keep the detect-a-group logic separate from the process-a-group logic.
But I need an iterable of iterables, as the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. Seems like this would be a common task, but I can't find an efficient Pythonic implementation.
Here's the dumb append-to-a-list implementation:
def group(iterable, isstart=lambda x: x):
"""Group `iterable` into groups starting with items where `isstart(item)` is true.
Start items are included in the group. The first group may or may not have a
start item. An empty `iterable` results in an empty result (zero groups)."""
items = []
for item in iterable:
if isstart(item) and items:
yield iter(items)
items = []
items.append(item)
if items:
yield iter(items)
It feels like there's got to be a nice itertools version, but it eludes me. The 'obvious' (?!) groupby solution doesn't seem to work because there can be adjacent headers, and they need to go in separate groups. The best I can come up with is (ab)using groupby with a key function that keeps a counter:
def igroup(iterable, isstart=lambda x: x):
def keyfunc(item):
if isstart(item):
keyfunc.groupnum += 1 # Python 2's closures leave something to be desired
return keyfunc.groupnum
keyfunc.groupnum = 0
return (group for _, group in itertools.groupby(iterable, keyfunc))
But I feel like Python can do better -- and sadly, this is even slower than the dumb list version:
# ipython
%time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s
%time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s
To make it easy on you, here's some unit test code:
class Test(unittest.TestCase):
def test_group(self):
MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
isstart = lambda x: x == 0
self.assertEqual(next(igroup([], isstart), None), None)
self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
self.assertEqual(len(list(igroup([0,1,2] * 3, isstart))), 3) # Catch hangs when groups are not consumed
for _ in xrange(NUMTRIALS):
expected, items = itertools.tee(itertools.starmap(random.randint, itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
for grpnum, grp in enumerate(igroup(items, isstart)):
start = next(grp)
self.assertTrue(isstart(start) or grpnum == 0)
self.assertEqual(start, next(expected))
for item in grp:
self.assertFalse(isstart(item))
self.assertEqual(item, next(expected))
So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?
how can I subgroup an iterable by a predicate elegantly and efficiently in Python?
Here's a concise, memory-efficient implementation which is very similar to the one from your question:
from itertools import groupby, imap
from operator import itemgetter
def igroup(iterable, isstart):
def key(item, count=[False]):
if isstart(item):
count[0] = not count[0] # start new group
return count[0]
return imap(itemgetter(1), groupby(iterable, key))
It supports infinite groups.
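A usage sketch (Python 2, to match the imports above), with headers like the ones in the question:

lines = ['--header--', 'data1', 'data2', '--header--', 'data3',
         '--header--', '--header--']
for grp in igroup(lines, lambda line: 'header' in line):
    print(list(grp))
# ['--header--', 'data1', 'data2']
# ['--header--', 'data3']
# ['--header--']
# ['--header--']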
A tee-based solution is slightly faster, but it consumes memory for the current group (similar to the list-based solution from the question):
from itertools import islice, tee
def group(iterable, isstart):
it, it2 = tee(iterable)
count = 0
for item in it:
if isstart(item) and count:
gr = islice(it2, count)
yield gr
for _ in gr: # skip to the next group
pass
count = 0
count += 1
if count:
gr = islice(it2, count)
yield gr
for _ in gr: # skip to the next group
pass
The groupby solution could also be implemented in pure Python:
def igroup_inline_key(iterable, isstart):
it = iter(iterable)
def grouper():
"""Yield items from a single group."""
while not p[START]:
yield p[VALUE] # each group has at least one element (a header)
p[VALUE] = next(it)
p[START] = isstart(p[VALUE])
p = [None]*2 # workaround the absence of `nonlocal` keyword in Python 2.x
START, VALUE = 0, 1
p[VALUE] = next(it)
while True:
p[START] = False # to distinguish EOF and a start of new group
yield grouper()
while not p[START]: # skip to the next group
p[VALUE] = next(it)
p[START] = isstart(p[VALUE])
To avoid repeating the code, the while True loop could be written as:
while True:
p[START] = False # to distinguish EOF and a start of new group
g = grouper()
yield g
if not p[START]: # skip to the next group
for _ in g:
pass
if not p[START]: # EOF
break
Though the previous variant might be more explicit and readable.
I think a general memory-efficient solution in pure Python won't be significantly faster than the groupby-based one.
If process(item) is fast compared to igroup() and the header can be found efficiently in a string (e.g., a fixed static header), then you could improve performance by reading your file in large chunks and splitting on the header value, as in the sketch below. That should make your task I/O-bound.
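A rough sketch of that idea (illustrative only; it assumes the header is a fixed, whole line and that each group fits in memory once isolated):

def split_on_header(path, header='--header--\n', chunksize=1 << 20):
    # Yield groups (as lists of lines) by splitting raw chunks on a fixed header.
    buf = ''
    seen_header = False
    with open(path) as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buf += chunk
            parts = buf.split(header)
            buf = parts.pop()          # the tail may be incomplete; keep it for the next read
            for part in parts:
                if not seen_header:    # text before the very first header
                    seen_header = True
                    if part:
                        yield part.splitlines()
                else:
                    yield [header.rstrip('\n')] + part.splitlines()
    if seen_header:                    # the group after the last header
        yield [header.rstrip('\n')] + buf.splitlines()
    elif buf:                          # no header at all: one header-less group
        yield buf.splitlines()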
I didn't quite read all your code, but I think this might be of some help:
from itertools import izip, tee, chain
def pairwise(iterable):
a, b = tee(iterable)
return izip(a, chain(b, [next(b, None)]))
def group(iterable, isstart):
pairs = pairwise(iterable)
def extract(current, lookahead, pairs=pairs, isstart=isstart):
yield current
if isstart(lookahead):
return
for current, lookahead in pairs:
yield current
if isstart(lookahead):
return
for start, lookahead in pairs:
gen = extract(start, lookahead)
yield gen
for _ in gen:
pass
for gen in group(xrange(4, 16), lambda x: x % 5 == 0):
print '------------------'
for n in gen:
print n
print [list(g) for g in group([], lambda x: x % 5 == 0)]
Result:
$ python gen.py
------------------
4
------------------
5
6
7
8
9
------------------
10
11
12
13
14
------------------
15
[]
Edit:
And here's another solution, similar to the one above, but without pairwise(), using a sentinel instead. I don't know which one is faster:
def group(iterable, isstart):
sentinel = object()
def interleave(iterable=iterable, isstart=isstart, sentinel=sentinel):
for item in iterable:
if isstart(item):
yield sentinel
yield item
items = interleave()
def extract(item, items=items, isstart=isstart, sentinel=sentinel):
if item is not sentinel:
yield item
for item in items:
if item is sentinel:
return
yield item
for lookahead in items:
gen = extract(lookahead)
yield gen
for _ in gen:
pass
Both now pass the test case, thanks to J.F. Sebastian's idea of exhausting skipped subgroup generators.
The crucial thing is you have to write a generator that yields sub-generators. My solution is similar in concept to the one by #pillmuncher, but is more self-contained because it avoids using itertools machinery to make ancillary generators. The downside is I have to use a somewhat inelegant temp list. In Python 3 this could perhaps be done more nicely with nonlocal.
def grouper(iterable, isstart):
it = iter(iterable)
last = [next(it)]
def subgroup():
while True:
toYield = last[0]
try:
last.append(next(it))
except StopIteration, e:
last.pop(0)
yield toYield
raise StopIteration
else:
yield toYield
last.pop(0)
if isstart(last[0]):
raise StopIteration
while True:
sg = subgroup()
yield sg
if len(last) == 2:
# subgenerator was aborted before completion, let's finish it
for a in sg:
pass
if last:
# sub-generator left next element waiting, next sub-generator will yield it
pass
else:
# sub-generator left "last" empty because source iterable was exhausted
raise StopIteration
>>> for g in grouper([0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0], lambda x: x==0):
... print "Group",
... for i in g:
... print i,
... print
Group 0 1 1
Group 0 1
Group 0 1 1 1 1
Group 0
I don't know what this is like performance-wise; I just did it because it was an interesting thing to try.
Edit: I ran your unit test on your original two and mine. It looks like mine is a bit faster than your igroup but still slower than the list-based version. It seems natural that you'll have to make a tradeoff between speed and memory here; if you know the groups won't be too terribly large, use the list-based version for speed. If the groups could be huge, use a generator-based version to keep memory usage down.
Edit: The edited version above handles breaking in a different way. If you break out of the sub-generator but resume the outer generator, it will skip the remainder of the aborted group and begin with the next group:
>>> for g in grouper([0, 1, 2, 88, 3, 0, 1, 88, 2, 3, 4, 0, 1, 2, 3, 88, 4], lambda x: x==0):
... print "Group",
... for i in g:
... print i,
... if i==88:
... break
... print
Group 0 1 2 88
Group 0 1 88
Group 0 1 2 3 88
So here's another version that tries to stitch together pairs of subgroups from groupby with chain. It's noticeably faster for the performance test given, but much, much slower when there are many small groups (say, isstart = lambda x: x % 2 == 0). It cheats and buffers repeated headers (you could get around this with read-all-but-last iterator tricks). It's also a step backward in the elegance department, so I think I still prefer the original.
def group2(iterable, isstart=lambda x: x):
groups = itertools.groupby(iterable, isstart)
start, group = next(groups)
if not start: # Deal with initial non-start group
yield group
_, group = next(groups)
groups = (grp for _, grp in groups)
while True: # group will always be start item(s) now
group = list(group)
for item in group[0:-1]: # Back-to-back start items... and hope this doesn't get very big. :)
yield iter([item])
yield itertools.chain([group[-1]], next(groups, [])) # Start item plus subsequent non-start items
group = next(groups)
%time deque(group2(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 3.13 s, sys: 0.00 s, total: 3.13 s