I am using Python 3.6 where you can nicely zip individual generators of one kind to get a multidimensional generator of the same kind. Take the following example, where get_random_sequence is a generator that yields an infinite random number sequence to simulate one individual asset at a stock market.
import random
from typing import Iterator
def get_random_sequence(start_value: float) -> Iterator[float]:
    value = start_value
    yield value
    while True:
        r = random.uniform(-.1, .1) * value
        value = value + r
        yield value
g = get_random_sequence(1.)
a = [next(g) for _ in range(5)]
print(a)
# [1.0, 1.015821868415922, 1.051470250712725, 0.9827564500218019, 0.9001851912863]
This generator can be easily extended with Python's zip function and yield from to generate successive asset values at a simulated market with an arbitrary number of assets.
from typing import Sequence

def get_market(no_assets: int = 10) -> Iterator[Sequence[float]]:
    rg = tuple(get_random_sequence(random.uniform(10., 60.)) for _ in range(no_assets))
    yield from zip(*rg)
gm = get_market(2)
b = [next(gm) for _ in range(5)]
print(b)
# [(55.20719435959121, 54.15552382961163), (51.64409510285255, 53.6327489348457), (50.16517363363749, 52.92881727359184), (48.8692976247231, 52.7090801870517), (52.49414777987645, 49.733746261206036)]
What I like about this approach is the use of yield from to avoid a while True: loop in which a tuple of n assets would have to be constructed explicitly.
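For reference, this is a sketch of the explicit equivalent that yield from replaces (get_market_explicit is just an illustrative name):

def get_market_explicit(no_assets: int = 10) -> Iterator[Sequence[float]]:
    rg = tuple(get_random_sequence(random.uniform(10., 60.)) for _ in range(no_assets))
    while True:
        yield tuple(next(g) for g in rg)  # build each tuple by hand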
My question is: Is there a way to apply yield from in a similar manner when the zipped generators receive values over send()?
Consider the following generator that yields the ratio of successive values in an infinite sequence.
from typing import Optional, Generator
def ratio_generator() -> Generator[float, Optional[float], None]:
    value_last = yield
    value = yield
    while True:
        ratio = 0. if value_last == 0. else value / value_last
        value_last = value
        value = yield ratio
gr = ratio_generator()
next(gr) # move to the first yield
g = get_random_sequence(1.)
a = []
for _v in g:
    _r = gr.send(_v)
    if _r is None:
        # two values are required for a ratio
        continue
    a.append(_r)
    if len(a) >= 5:
        break
print(a)
# [1.009041186223442, 0.9318419861800313, 1.0607677437816718, 0.9237896996817375, 0.9759635921282439]
The best way to "zip" this generator I could come up with, unfortunately, does not involve yield from at all... but instead the ugly while True: solution mentioned above.
def ratio_generator_multiple(no_values: int) -> Generator[Sequence[float], Optional[Sequence[float]], None]:
    gs = tuple(ratio_generator() for _ in range(no_values))
    for each_g in gs:
        next(each_g)
    values = yield
    ratios = tuple(g.send(v) for g, v in zip(gs, values))
    while True:  # :(
        values = yield None if None in ratios else ratios
        ratios = tuple(g.send(v) for g, v in zip(gs, values))
rgm = ratio_generator_multiple(2)
next(rgm) # move to the first yield
gm = get_market(2)
b = []
for _v in gm:
    _r = rgm.send(_v)
    if _r is None:
        # two values are required for a ratio
        continue
    b.append(_r)
    if len(b) >= 5:
        break
print(b)
# [(1.0684036496767984, 1.0531433541856687), (1.0279604693226763, 1.0649271401851732), (1.0469406709985847, 0.9350856571355237), (0.9818403001921499, 1.0344633443394962), (1.0380945284830183, 0.9081599684720663)]
Is there a way to do something like values = yield from zip(*(g.send(v) for g, v in zip(generators, values))) so I can still use yield from on the zipped generators without a while True:?
(The given example doesn't work, because it doesn't refresh the values on the right-hand side with the values on the left-hand side.)
I realize that this is more of an aesthetic problem. It would still be nice to know though...
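For what it's worth, the repetition can at least be factored out into a reusable helper so the while True: lives in exactly one place; zip_send is a hypothetical name, and note that, unlike ratio_generator_multiple, this version yields the first tuple of ratios as-is (containing None values) instead of collapsing it to a single None:

def zip_send(generators):
    # Drive several send()-style generators in lockstep.
    values = yield
    while True:
        values = yield tuple(g.send(v) for g, v in zip(generators, values))

def ratio_generator_multiple2(no_values):
    gs = tuple(ratio_generator() for _ in range(no_values))
    for each_g in gs:
        next(each_g)
    yield from zip_send(gs)  # yield from forwards send() to the helper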
Related
TL;DR: if I were to explain the problem in short:
I have signals:
np.random.seed(42)
x = np.random.randn(1000)
y = np.random.randn(1000)
z = np.random.randn(1000)
and human readable string tuple logic like :
entry_sig_ = ((x,y,'crossup',False),)
exit_sig_ = ((x,z,'crossup',False), 'or_',(x,y,'crossdown',False))
where:
'entry_sig_' means the output will be 1 when the time series unfolds from left to right and 'entry_sig_' is hit. (x,y,'crossup',False) means: x crossed y up at a particular time i, and False means the signal doesn't have "memory"; otherwise the number of hits accumulates.
'exit_sig_' means the output will become '0' again when 'exit_sig_' is hit.
The output is generated through:
@njit
def run(x, entry_sig, exit_sig):
    '''
    x: np.array
    entry_sig, exit_sig: homogeneous tuples of tuple signals
    Returns: sequence of 0 and 1 satisfying entry and exit sigs
    '''
    L = x.shape[0]
    out = np.empty(L)
    out[0] = 0.0
    out[-1] = 0.0
    i = 1
    trade = True
    while i < L-1:
        out[i] = 0.0
        if reduce_sig(entry_sig, i) and i < L-1:
            out[i] = 1.0
            trade = True
            while trade and i < L-2:
                i += 1
                out[i] = 1.0
                if reduce_sig(exit_sig, i):
                    trade = False
        i += 1
    return out
reduce_sig(sig, i) is a function (see definition below) that parses the tuple and returns the resulting output for a given point in time.
Question:
As of now, an object of the SingleSig class is instantiated from scratch in the for loop for any given point in time; it thus has no "memory", which cancels the merits of having a class (a bare function would do). Does there exist a workaround (a different class template, a different approach, etc.) so that:
the combined tuple signal can be queried for its value at a particular point in time i;
the "memory" can be reset; i.e. e.g. MultiSig(sig_tuple).memory_field can be set to 0 at the constituent signals' level.
The following code adds a memory to the signals, which can be wiped using MultiSig.reset() to reset the count of all signals to 0. The memory can be queried using MultiSig.query_memory(key), which returns the number of hits for that signal at that time.
For the memory function to work, I had to add unique keys to the signals to identify them.
from numba import njit, int64, float64, types
from numba.types import Array, string, boolean
from numba import jitclass
import numpy as np
np.random.seed(42)
x = np.random.randn(1000000)
y = np.random.randn(1000000)
z = np.random.randn(1000000)
# Example of "human-readable" signals
entry_sig_ = ((x,y,'crossup',False),)
exit_sig_ = ((x,z,'crossup',False), 'or_',(x,y,'crossdown',False))
# Turn signals into homogeneous tuple
#entry_sig_
entry_sig = (((x,y,'crossup',False),'NOP','1'),)
#exit_sig_
exit_sig = (((x,z,'crossup',False),'or_','2'),((x,y,'crossdown',False),'NOP','3'))
@njit
def cross(x, y, i):
    '''
    x, y: np.array
    i: int - point in time
    Returns: 1 or 0 when condition is met
    '''
    if (x[i - 1] - y[i - 1]) * (x[i] - y[i]) < 0:
        out = 1
    else:
        out = 0
    return out
kv_ty = (types.string, types.int64)
spec = [
    ('memory', types.DictType(*kv_ty)),
]
@njit
def single_signal(x, y, how, acc, i):
    '''
    i: int - point in time
    Returns either signal or accumulator
    '''
    if cross(x, y, i):
        if x[i] < y[i] and how == 'crossdown':
            out = 1
        elif x[i] > y[i] and how == "crossup":
            out = 1
        else:
            out = 0
    else:
        out = 0
    return out
@jitclass(spec)
class MultiSig:
    def __init__(self, entry, exit):
        '''
        initialize memory at single signal level
        '''
        memory_dict = {}
        for i in entry:
            memory_dict[str(i[2])] = 0
        for i in exit:
            memory_dict[str(i[2])] = 0
        self.memory = memory_dict

    def reduce_sig(self, sig, i):
        '''
        Parses multisignal
        sig: homogeneous tuple of tuples ("human-readable" signal definition)
        i: int - point in time
        Returns: resulting value of multisignal
        '''
        L = len(sig)
        out = single_signal(*sig[0][0], i)
        logic = sig[0][1]
        if out:
            self.update_memory(sig[0][2])
        for cnt in range(1, L):
            s = single_signal(*sig[cnt][0], i)
            if s:
                self.update_memory(sig[cnt][2])
            out = out | s if logic == 'or_' else out & s
            logic = sig[cnt][1]
        return out

    def update_memory(self, key):
        '''
        update memory
        '''
        self.memory[str(key)] += 1

    def reset(self):
        '''
        reset memory
        '''
        dicti = {}
        for i in self.memory:
            dicti[i] = 0
        self.memory = dicti

    def query_memory(self, key):
        '''
        return number of hits on signal
        '''
        return self.memory[str(key)]
@njit
def run(x, entry_sig, exit_sig):
    '''
    x: np.array
    entry_sig, exit_sig: homogeneous tuples of tuples
    Returns: sequence of 0 and 1 satisfying entry and exit sigs
    '''
    L = x.shape[0]
    out = np.empty(L)
    out[0] = 0.0
    out[-1] = 0.0
    i = 1
    multi = MultiSig(entry_sig, exit_sig)
    while i < L-1:
        out[i] = 0.0
        if multi.reduce_sig(entry_sig, i) and i < L-1:
            out[i] = 1.0
            trade = True
            while trade and i < L-2:
                i += 1
                out[i] = 1.0
                if multi.reduce_sig(exit_sig, i):
                    trade = False
        i += 1
    return out
run(x, entry_sig, exit_sig)
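A hypothetical usage sketch of the memory API described above (the key '1' matches the unique key given to the entry signal earlier):

multi = MultiSig(entry_sig, exit_sig)
multi.reduce_sig(entry_sig, 10)   # evaluate the entry multisignal at i = 10
print(multi.query_memory('1'))    # hits recorded so far for the signal keyed '1'
multi.reset()                     # zero the counters of all constituent signals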
To reiterate what I said in the comments, | and & are bitwise operators, not logical operators. 1 & 2 outputs 0/False, which is not what I believe you want this to evaluate to, so I made sure out and s can only be 0/1 in order for this to produce the expected output.
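A quick illustration of the difference:

print(1 & 2)              # 0: bitwise AND of 0b01 and 0b10 shares no bits
print(1 and 2)            # 2: logical `and` returns the second operand
print(bool(1) & bool(2))  # True: with operands restricted to 0/1, & behaves logically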
You are aware that because of
out = out | s if logic == 'or_' else out & s
the order of the time series inside entry_sig and exit_sig matters?
Let (output, logic) be tuples where output is 0 or 1 according to how crossup and crossdown would evaluate the passed information of the tuple, and logic is or_ or and_.
tuples = ((0,'or_'), (1,'or_'), (0,'and_'))
out = tuples[0][0]
logic = tuples[0][1]
for i in range(1, len(tuples)):
    s = tuples[i][0]
    out = out | s if logic == 'or_' else out & s
    out = s
    logic = tuples[i][1]
print(out)
0
changing the order of the tuple yields the other signal:
tuples = ((0,'or_'), (0,'and_'), (1,'or_'))
out = tuples[0][0]
logic = tuples[0][1]
for i in range(1, len(tuples)):
    s = tuples[i][0]
    out = out | s if logic == 'or_' else out & s
    out = s
    logic = tuples[i][1]
print(out)
1
The performance hinges on how many times the count needs to be updated. Using n = 1,000,000 for all three time series, your code had a mean run-time of 0.6 s on my machine; my code had 0.63 s.
I then changed the crossing logic a bit to reduce the number of if/else branches, so that the nested if/else is only entered when the time series actually crossed, which can be checked with a single comparison. This further halved the difference in run-time, so the above code now sits at a 2.5% longer run-time than your original code.
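The exact version I benchmarked isn't shown here, but the restructuring might look roughly like this (a sketch; single_signal_fast is a hypothetical name, keeping the original signature):

@njit
def single_signal_fast(x, y, how, acc, i):
    # One sign-change test decides whether any crossing happened at i;
    # the direction check is only paid for when it did.
    if (x[i - 1] - y[i - 1]) * (x[i] - y[i]) < 0:
        if how == 'crossup':
            return 1 if x[i] > y[i] else 0
        return 1 if x[i] < y[i] else 0
    return 0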
I'm a beginner, trying to write code listing the most frequently overlapping ranges in a list of ranges.
So, the input is various ranges (#1 through #7 in the example figure: https://prntscr.com/kj80xl) and I would like to find the most common range (in the example 3,000-4,000, covered by 6 out of 7 ranges, i.e. 86%). Actually, I would like to find the top 5 most frequent ones.
Not all ranges overlap. Ranges are always positive and given as integers with 1 distance (standard range).
What I have now is only code comparing one sequence to another and returning the overlap, but after that I'm stuck.
def range_overlap(range_x, range_y):
    x = (range_x[0], (range_x[-1]) + 1)
    y = (range_y[0], (range_y[-1]) + 1)
    overlap = (max(x[0], y[0]), min(x[-1], (y[-1])))
    if overlap[0] <= overlap[1]:
        return range(overlap[0], overlap[1])
    else:
        return "Out of range"
I would be very grateful for any help.
Better solution
I came up with a simpler solution (at least IMHO) so here it is:
def get_abs_min(ranges):
    return min([min(r) for r in ranges])

def get_abs_max(ranges):
    return max([max(r) for r in ranges])

def count_appearances(i, ranges):
    return sum([1 for r in ranges if i in r])

def create_histogram(ranges):
    keys = [str(i) for i in range(len(ranges) + 1)]
    histogram = dict.fromkeys(keys)
    results = []
    abs_min = get_abs_min(ranges)
    abs_max = get_abs_max(ranges)
    for i in range(abs_min, abs_max):
        count = str(count_appearances(i, ranges))
        if histogram[count] is None:
            histogram[count] = dict(start=i, end=None)
        elif histogram[count]['end'] is None:
            histogram[count]['end'] = i
        elif histogram[count]['end'] == i - 1:
            histogram[count]['end'] = i
        else:
            start = histogram[count]['start']
            end = histogram[count]['end']
            results.append((range(start, end + 1), count))
            histogram[count]['start'] = i
            histogram[count]['end'] = None
    for count, d in histogram.items():
        if d is not None and d['start'] is not None and d['end'] is not None:
            results.append((range(d['start'], d['end'] + 1), count))
    return results

def main(ranges, top):
    appearances = create_histogram(ranges)
    return sorted(appearances, key=lambda t: t[1], reverse=True)[:top]
The idea here is as simple as iterating through a superposition of all the ranges and building a histogram of appearances (i.e. the number of original ranges the current i appears in).
After that, just sort and slice according to the chosen size of the results.
Just call main with the ranges and the top number you want (or None if you want to see all results).
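For example, with a made-up range_list (the asker's actual ranges come from the linked screenshot):

range_list = [range(1, 4001), range(2500, 4001), range(3000, 6001)]
print(main(range_list, 2))     # the two most frequently overlapped sub-ranges
print(main(range_list, None))  # or all of them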
OLDER EDITS BELOW
I (almost) agree with @Kasramvd's answer.
Here is my take on it:
from collections import Counter
from itertools import combinations
def range_overlap(x, y):
    common_part = sorted(set(x) & set(y))
    if common_part:
        return range(common_part[0], common_part[-1] + 1)
    else:
        return False

def get_most_common(range_list, top_frequent):
    overlaps = Counter(range_overlap(i, j) for i, j in
                       combinations(range_list, 2))
    return [(r, i) for (r, i) in overlaps.most_common(top_frequent) if r]
You need to pass in the range_list and the number of top_frequent results you want.
EDIT
The previous answer solved this question for all 2-element combinations over the range list.
This edit is tested against your input and produces the correct answer:
from collections import Counter
from itertools import combinations
def range_overlap(*args):
    sets = [set(r) for r in args]
    common_part = sorted(set(args[0]).intersection(*sets))
    if common_part:
        return range(common_part[0], common_part[-1] + 1)
    else:
        return False

def get_all_possible_combinations(range_list):
    all_combos = []
    for i in range(2, len(range_list)):
        all_combos.append(combinations(range_list, i))
    all_combos = [list(combo) for combo in all_combos]
    return all_combos

def get_most_common_for_combo(combo):
    return list(filter(None, [range_overlap(*option) for option in combo]))

def get_most_common(range_list, top_frequent):
    all_overlaps = []
    combos = get_all_possible_combinations(range_list)
    for combo in combos:
        all_overlaps.extend(get_most_common_for_combo(combo))
    return [r for (r, i) in Counter(all_overlaps).most_common(top_frequent) if r]
And to get the results, just run get_most_common(range_list, top_frequent).
Tested on my machine (Ubuntu 16.04 with Python 3.5.2) with your input range_list and top_frequent = 5, with the results:
[range(3000, 4000), range(2500, 4000), range(1500, 4000), range(3000, 6000), range(1, 4000)]
You can first change your function to return a valid range in both cases so that you can use it in a set of comparisons. Also, since Python's range objects are not pre-built iterables but smart objects that only hold the start, stop and step attributes of a range and produce the items on demand, you can make a small change to your function as well.
def range_overlap(range_x, range_y):
    rng = range(max(range_x.start, range_y.start),
                min(range_x.stop, range_y.stop) + 1)
    if rng.start < rng.stop:
        return rng.start, rng.stop
Now, if you have a set of ranges and you want to compare all the pairs you can use itertools.combinations to get all the pairs and then using range_overlap and collections.Counter you can find the number of overlapped ranges.
from collections import Counter
from itertools import combinations
overlaps = Counter(range_overlap(i,j) for i, j in
combinations(list_of_ranges, 2))
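With a hypothetical list of ranges modeled on the question's example, the most frequent overlaps then come straight from the counter:

list_of_ranges = [range(499, 506), range(100, 116), range(80, 120),
                  range(113, 141), range(500, 551)]
overlaps = Counter(range_overlap(i, j) for i, j in combinations(list_of_ranges, 2))
# Drop the None entries (non-overlapping pairs) and keep the five most common.
print([item for item in overlaps.most_common() if item[0] is not None][:5])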
I am trying to implement dijkstra's algorithm (on an undirected graph) to find the shortest path and my code is this.
Note: I am not using a heap/priority queue or anything, just an adjacency list, a dictionary to store weights, and a bool list to avoid cycling forever in the loops/recursion. Also, the algorithm works for most test cases but fails for this particular one: https://ideone.com/iBAT0q
Important: the graph can have multiple edges from v1 to v2 (or vice versa); you have to use the minimum weight.
import sys
sys.setrecursionlimit(10000)
def findMin(n):
    for i in x[n]:
        cost[n] = min(cost[n], cost[i] + w[(n, i)])

def dik(s):
    for i in x[s]:
        if done[i]:
            findMin(i)
            done[i] = False
            dik(i)
    return
q = int(input())
for _ in range(q):
    n, e = map(int, input().split())
    x = [[] for _ in range(n)]
    done = [True]*n
    w = {}
    cost = [1000000000000000000]*n
    for k in range(e):
        i, j, c = map(int, input().split())
        x[i-1].append(j-1)
        x[j-1].append(i-1)
        try:  # Avoiding multiple edges
            w[(i-1, j-1)] = min(c, w[(i-1, j-1)])
            w[(j-1, i-1)] = w[(i-1, j-1)]
        except KeyError:
            try:
                w[(i-1, j-1)] = min(c, w[(j-1, i-1)])
                w[(j-1, i-1)] = w[(i-1, j-1)]
            except KeyError:
                w[(j-1, i-1)] = c
                w[(i-1, j-1)] = c
    src = int(input()) - 1
    #for i in sorted(w.keys()):
    #    print(i, w[i])
    done[src] = False
    cost[src] = 0
    dik(src)  # First iteration assigns possible minimum to all nodes
    done = [True]*n
    dik(src)  # Second iteration to ensure they are minimum
    for val in cost:
        if val == 1000000000000000000:
            print(-1, end=' ')
            continue
        if val != 0:
            print(val, end=' ')
    print()
The optimum isn't always found in the second pass. If you add a third pass to your example, you get closer to the expected result, and after the fourth iteration you're there.
You could iterate until no more changes are made to the cost array:
done[src] = False
cost[src] = 0
dik(src)
while True:
    ocost = list(cost)  # copy for comparison
    done = [True]*n
    dik(src)
    if cost == ocost:
        break
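For reference, a standard heap-based Dijkstra finds the optimum in a single pass, avoiding the repeated relaxations entirely. This is a generic sketch (not wired to the question's input loop) that reuses the x adjacency lists and the w weight dict built above:

import heapq

def dijkstra(adj, w, src, n):
    INF = 1000000000000000000
    dist = [INF] * n
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry; a shorter path to u was already settled
        for v in adj[u]:
            nd = d + w[(u, v)]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

Calling cost = dijkstra(x, w, src, n) would then replace the repeated dik(src) passes.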
This question already has answers here: Union of multiple ranges (5 answers). Closed 7 years ago.
I'm trying to remove overlapping values from a collection of ranges.
The ranges are represented by a string like this:
499-505 100-115 80-119 113-140 500-550
I want the above to be reduced to two ranges: 80-140 499-550. That covers all the values without overlap.
Currently I have the following code.
cr = "100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250".split(" ")
ar = []
br = []
for i in cr:
(left,right) = i.split("-")
ar.append(left);
br.append(right);
inc = 0
for f in br:
i = int(f)
vac = []
jnc = 0
for g in ar:
j = int(g)
if(i >= j):
vac.append(j)
del br[jnc]
jnc += jnc
print vac
inc += inc
I split each item on - and store the range limits in ar and br. I iterate over these limits pairwise, and if i is at least as great as j, I want to delete the element. But the program doesn't work. I expect it to produce this result: 80-125 500-550 200-250 180-185
For a quick and short solution,
from operator import itemgetter
from itertools import groupby
cr = "499-505 100-115 80-119 113-140 500-550".split(" ")
fullNumbers = []
for i in cr:
a = int(i.split("-")[0])
b = int(i.split("-")[1])
fullNumbers+=range(a,b+1)
# Remove duplicates and sort it
fullNumbers = sorted(list(set(fullNumbers)))
# Taken From http://stackoverflow.com/questions/2154249
def convertToRanges(data):
result = []
for k, g in groupby(enumerate(data), lambda (i,x):i-x):
group = map(itemgetter(1), g)
result.append(str(group[0])+"-"+str(group[-1]))
return result
print convertToRanges(fullNumbers)
#Output: ['80-140', '499-550']
For the set given in your program, the output is ['80-125', '180-185', '200-250', '500-550'].
The main possible drawback of this solution: it may not be scalable!
Let me offer another solution that doesn't take time proportional to the sum of the range sizes. Apart from the initial sort, its running time is linearly proportional to the number of ranges.
def reduce(range_text):
    parts = range_text.split()
    if parts == []:
        return ''
    ranges = [tuple(map(int, part.split('-'))) for part in parts]
    ranges.sort()
    new_ranges = []
    left, right = ranges[0]
    for rng in ranges[1:]:
        next_left, next_right = rng
        if right + 1 < next_left:             # Is the next range to the right?
            new_ranges.append((left, right))  # Close the current range.
            left, right = rng                 # Start a new range.
        else:
            right = max(right, next_right)    # Extend the current range.
    new_ranges.append((left, right))          # Close the last range.
    return ' '.join(['-'.join(map(str, rng)) for rng in new_ranges])
This function works by sorting the ranges, then looking at them in order and merging consecutive ranges that intersect.
Examples:
print(reduce('499-505 100-115 80-119 113-140 500-550'))
# => 80-140 499-550
print(reduce('100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250'))
# => 80-125 180-185 200-250 500-550
I'm parsing a file like this:
--header--
data1
data2
--header--
data3
data4
data5
--header--
--header--
...
And I want groups like this:
[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]
so I can iterate over them like this:
for grp in group(open('file.txt'), lambda line: 'header' in line):
    for item in grp:
        process(item)
and keep the detect-a-group logic separate from the process-a-group logic.
But I need an iterable of iterables, as the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. Seems like this would be a common task, but I can't find an efficient Pythonic implementation.
Here's the dumb append-to-a-list implementation:
def group(iterable, isstart=lambda x: x):
    """Group `iterable` into groups starting with items where `isstart(item)` is true.

    Start items are included in the group. The first group may or may not have a
    start item. An empty `iterable` results in an empty result (zero groups)."""
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items)
It feels like there's got to be a nice itertools version, but it eludes me. The 'obvious' (?!) groupby solution doesn't seem to work because there can be adjacent headers, and they need to go in separate groups. The best I can come up with is (ab)using groupby with a key function that keeps a counter:
import itertools

def igroup(iterable, isstart=lambda x: x):
    def keyfunc(item):
        if isstart(item):
            keyfunc.groupnum += 1  # Python 2's closures leave something to be desired
        return keyfunc.groupnum
    keyfunc.groupnum = 0
    return (group for _, group in itertools.groupby(iterable, keyfunc))
But I feel like Python can do better -- and sadly, this is even slower than the dumb list version:
# ipython
%time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s
%time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s
To make it easy on you, here's some unit test code:
import itertools
import random
import unittest

class Test(unittest.TestCase):
    def test_group(self):
        MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
        isstart = lambda x: x == 0
        self.assertEqual(next(igroup([], isstart), None), None)
        self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
        self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
        self.assertEqual(len(list(igroup([0, 1, 2] * 3, isstart))), 3)  # Catch hangs when groups are not consumed
        for _ in xrange(NUMTRIALS):
            expected, items = itertools.tee(itertools.starmap(random.randint, itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
            for grpnum, grp in enumerate(igroup(items, isstart)):
                start = next(grp)
                self.assertTrue(isstart(start) or grpnum == 0)
                self.assertEqual(start, next(expected))
                for item in grp:
                    self.assertFalse(isstart(item))
                    self.assertEqual(item, next(expected))
So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?
how can I subgroup an iterable by a predicate elegantly and efficiently in Python?
Here's a concise, memory-efficient implementation which is very similar to the one from your question:
from itertools import groupby, imap
from operator import itemgetter

def igroup(iterable, isstart):
    def key(item, count=[False]):
        if isstart(item):
            count[0] = not count[0]  # start new group
        return count[0]
    return imap(itemgetter(1), groupby(iterable, key))
It supports infinite groups.
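For example, grouping the header/data stream from the question:

lines = iter(['--header--', 'data1', 'data2', '--header--', 'data3'])
for grp in igroup(lines, lambda line: 'header' in line):
    print(list(grp))
# ['--header--', 'data1', 'data2']
# ['--header--', 'data3']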
A tee-based solution is slightly faster, but it consumes memory for the current group (similar to the list-based solution from the question):
from itertools import islice, tee

def group(iterable, isstart):
    it, it2 = tee(iterable)
    count = 0
    for item in it:
        if isstart(item) and count:
            gr = islice(it2, count)
            yield gr
            for _ in gr:  # skip to the next group
                pass
            count = 0
        count += 1
    if count:
        gr = islice(it2, count)
        yield gr
        for _ in gr:  # skip to the next group
            pass
The groupby solution could also be implemented in pure Python:
def igroup_inline_key(iterable, isstart):
    it = iter(iterable)
    def grouper():
        """Yield items from a single group."""
        while not p[START]:
            yield p[VALUE]  # each group has at least one element (a header)
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])
    p = [None]*2  # workaround for the absence of the `nonlocal` keyword in Python 2.x
    START, VALUE = 0, 1
    p[VALUE] = next(it)
    while True:
        p[START] = False  # to distinguish EOF and a start of new group
        yield grouper()
        while not p[START]:  # skip to the next group
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])
To avoid repeating the code, the while True loop could be written as:

while True:
    p[START] = False  # to distinguish EOF and a start of new group
    g = grouper()
    yield g
    if not p[START]:  # skip to the next group
        for _ in g:
            pass
    if not p[START]:  # EOF
        break
Though the previous variant might be more explicit and readable.
I think a general memory-efficient solution in pure Python won't be significantly faster than the groupby-based one.
If process(item) is fast compared to igroup() and a header can be found efficiently in a string (e.g., for a fixed static header), then you could improve performance by reading your file in large chunks and splitting on the header value. It should make your task IO-bound.
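A rough sketch of that chunked approach, assuming a fixed header string (split_on_header is a hypothetical name; the fiddly part is carrying the tail of each chunk over in case a header straddles two reads):

def split_on_header(f, header='--header--', chunksize=1 << 16):
    buf = ''
    first = True
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(header)
        buf = parts.pop()  # possibly incomplete tail; completed on the next read
        for part in parts:
            if first:
                first = False
                if part:  # text before the very first header
                    yield part
            else:
                yield header + part
    if first:
        if buf:
            yield buf  # no header at all: the whole file is one block
    else:
        yield header + buf  # the final block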
I didn't quite read all your code, but I think this might be of some help:
from itertools import izip, tee, chain
def pairwise(iterable):
    a, b = tee(iterable)
    return izip(a, chain(b, [next(b, None)]))

def group(iterable, isstart):
    pairs = pairwise(iterable)
    def extract(current, lookahead, pairs=pairs, isstart=isstart):
        yield current
        if isstart(lookahead):
            return
        for current, lookahead in pairs:
            yield current
            if isstart(lookahead):
                return
    for start, lookahead in pairs:
        gen = extract(start, lookahead)
        yield gen
        for _ in gen:
            pass
for gen in group(xrange(4, 16), lambda x: x % 5 == 0):
    print '------------------'
    for n in gen:
        print n

print [list(g) for g in group([], lambda x: x % 5 == 0)]
Result:
$ python gen.py
------------------
4
------------------
5
6
7
8
9
------------------
10
11
12
13
14
------------------
15
[]
Edit:
And here's another solution, similar to the above, but without the pairwise() and with a sentinel instead. I don't know which one is faster:
def group(iterable, isstart):
    sentinel = object()
    def interleave(iterable=iterable, isstart=isstart, sentinel=sentinel):
        for item in iterable:
            if isstart(item):
                yield sentinel
            yield item
    items = interleave()
    def extract(item, items=items, isstart=isstart, sentinel=sentinel):
        if item is not sentinel:
            yield item
        for item in items:
            if item is sentinel:
                return
            yield item
    for lookahead in items:
        gen = extract(lookahead)
        yield gen
        for _ in gen:
            pass
Both now pass the test case, thanks to J.F. Sebastian's idea of exhausting skipped subgroup generators.
The crucial thing is you have to write a generator that yields sub-generators. My solution is similar in concept to the one by @pillmuncher, but is more self-contained, because it avoids using itertools machinery to make ancillary generators. The downside is that I have to use a somewhat inelegant temp list. In Python 3 this could perhaps be done more nicely with nonlocal.
def grouper(iterable, isstart):
    it = iter(iterable)
    last = [next(it)]
    def subgroup():
        while True:
            toYield = last[0]
            try:
                last.append(next(it))
            except StopIteration, e:
                last.pop(0)
                yield toYield
                raise StopIteration
            else:
                yield toYield
                last.pop(0)
                if isstart(last[0]):
                    raise StopIteration
    while True:
        sg = subgroup()
        yield sg
        if len(last) == 2:
            # subgenerator was aborted before completion, let's finish it
            for a in sg:
                pass
        if last:
            # sub-generator left next element waiting, next sub-generator will yield it
            pass
        else:
            # sub-generator left "last" empty because source iterable was exhausted
            raise StopIteration
>>> for g in grouper([0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0], lambda x: x==0):
... print "Group",
... for i in g:
... print i,
... print
Group 0 1 1
Group 0 1
Group 0 1 1 1 1
Group 0
I don't know what this is like performance-wise, I just did it because it was just an interesting thing to try to do.
Edit: I ran your unit test on your original two and mine. It looks like mine is a bit faster than your igroup but still slower than the list-based version. It seems natural that you'll have to make a tradeoff between speed and memory here; if you know the groups won't be too terribly large, use the list-based version for speed. If the groups could be huge, use a generator-based version to keep memory usage down.
Edit: The edited version above handles breaking in a different way. If you break out of the sub-generator but resume the outer generator, it will skip the remainder of the aborted group and begin with the next group:
>>> for g in grouper([0, 1, 2, 88, 3, 0, 1, 88, 2, 3, 4, 0, 1, 2, 3, 88, 4], lambda x: x==0):
... print "Group",
... for i in g:
... print i,
... if i==88:
... break
... print
Group 0 1 2 88
Group 0 1 88
Group 0 1 2 3 88
So here's another version that tries to stitch together pairs of subgroups from groupby with chain. It's noticeably faster for the performance test given, but much, much slower when there are many small groups (say isstart = lambda x: x % 2 == 0). It cheats and buffers repeated headers (you could get around this with read-all-but-last iterator tricks). It's also a step backward in the elegance department, so I think I still prefer the original.
def group2(iterable, isstart=lambda x: x):
    groups = itertools.groupby(iterable, isstart)
    start, group = next(groups)
    if not start:  # Deal with initial non-start group
        yield group
        _, group = next(groups)
    groups = (grp for _, grp in groups)
    while True:  # group will always be start item(s) now
        group = list(group)
        for item in group[0:-1]:  # Back-to-back start items... and hope this doesn't get very big. :)
            yield iter([item])
        yield itertools.chain([group[-1]], next(groups, []))  # Start item plus subsequent non-start items
        group = next(groups)
%time deque(group2(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 3.13 s, sys: 0.00 s, total: 3.13 s