python - unique set of ranges, merging when needed


Is there a data structure that will maintain a unique set of ranges, merging any contiguous or overlapping ranges that are added? I need to track which ranges have been processed, but this may occur in an arbitrary order. E.g.:
range_set = RangeSet() # doesn't exist that I know of, this is what I need help with
def process_data(start, end):
    global range_set
    range_set.add_range(start, end)
    # ...
process_data(0, 10)
process_data(20, 30)
process_data(5, 15)
process_data(50, 60)
print(range_set.missing_ranges())
# [[16,19], [31, 49]]
print(range_set.ranges())
# [[0,15], [20,30], [50, 60]]
Notice that overlapping or contiguous ranges get merged together. What is the best way to do this? I looked at using the bisect module, but its use didn't seem terribly clear.

Another approach is based on sympy.sets.
>>> import sympy as sym
>>> a = sym.Interval(1, 2, left_open=False, right_open=False)
>>> b = sym.Interval(3, 4, left_open=False, right_open=False)
>>> domain = sym.Interval(0, 10, left_open=False, right_open=False)
>>> missing = domain - a - b
>>> missing
[0, 1) U (2, 3) U (4, 10]
>>> 2 in missing
False
>>> missing.complement(domain)
[1, 2] U [3, 4]

You could get similar functionality with Python's built-in set data structure, supposing only integer values are valid for start and end.
>>> whole_domain = set(range(12))
>>> A = set(range(0,1))
>>> B = set(range(4,9))
>>> C = set(range(3,6)) # 4 and 5 are processed twice (they are also in B)
>>> done = A | B | C
>>> print done
set([0, 3, 4, 5, 6, 7, 8])
>>> missing = whole_domain - done
>>> print missing
set([1, 2, 9, 10, 11])
This still lacks many 'range'-features but might be sufficient.
A simple query if a certain range was already processed could look like this:
>>> isprocessed = [foo in done for foo in set(range(2,6))]
>>> print isprocessed
[False, True, True, True]
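To get range-like output back out of the set-based approach, one option is to group consecutive integers. The following is only a minimal sketch, assuming integer values; `to_ranges` is a hypothetical helper name, not something from the standard library:

```python
from itertools import groupby

def to_ranges(values):
    """Collapse a set of integers into sorted inclusive (start, end) pairs."""
    ordered = sorted(values)
    ranges = []
    # Consecutive integers share the same value of (value - position).
    for _, run in groupby(enumerate(ordered), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in run]
        ranges.append((run[0], run[-1]))
    return ranges

done = set(range(0, 1)) | set(range(4, 9)) | set(range(3, 6))
print(to_ranges(done))                    # [(0, 0), (3, 8)]
print(to_ranges(set(range(12)) - done))   # [(1, 2), (9, 11)]
```

Memory use grows with the number of integers covered, so this probably only suits small domains.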

I've only lightly tested it, but it sounds like you're looking for something like this. You'll need to add the methods to get the ranges and missing ranges yourself, but it should be very straightforward as RangeSet.ranges is a list of Range objects maintained in sorted order. For a more pleasant interface you could write a convenience method that converts it to a list of 2-tuples, for example.
EDIT: I've just modified it to use less-than-or-equal comparisons for merging. Note, however, that this won't merge "adjacent" entries (e.g. it won't merge (1, 5) and (6, 10)). To do this you'd need to simply modify the condition in Range.check_merge().
import bisect
class Range(object):
    # Reduces memory usage, overkill unless you're using a lot of these.
    __slots__ = ["start", "end"]
    def __init__(self, start, end):
        """Initialise this range."""
        self.start = start
        self.end = end
    def __lt__(self, other):
        """Sort ranges by their initial item (bisect only needs "<")."""
        return self.start < other.start
    def check_merge(self, other):
        """Merge in specified range and return True iff it overlaps."""
        if other.start <= self.end and other.end >= self.start:
            self.start = min(other.start, self.start)
            self.end = max(other.end, self.end)
            return True
        return False
class RangeSet(object):
    def __init__(self):
        self.ranges = []
    def add_range(self, start, end):
        """Merge or insert the specified range as appropriate."""
        new_range = Range(start, end)
        offset = bisect.bisect_left(self.ranges, new_range)
        # Check if we can merge backwards.
        if offset > 0 and self.ranges[offset - 1].check_merge(new_range):
            new_range = self.ranges[offset - 1]
            offset -= 1
        else:
            self.ranges.insert(offset, new_range)
        # Scan for forward merges; check each following entry in turn.
        check_offset = offset + 1
        while (check_offset < len(self.ranges) and
                new_range.check_merge(self.ranges[check_offset])):
            check_offset += 1
        # Remove any entries that we've just merged.
        if check_offset - offset > 1:
            self.ranges[offset+1:check_offset] = []
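For the integer case in the question, where adjacent ranges such as (1, 5) and (6, 10) should also merge, the same bisect idea can be sketched with two parallel sorted lists. This is a rough illustration rather than a drop-in replacement for the class above; `IntRangeSet` and its internals are hypothetical names, and it assumes integer bounds with start <= end:

```python
import bisect

class IntRangeSet:
    """Sorted, disjoint, inclusive integer ranges (hypothetical helper)."""

    def __init__(self):
        self._starts = []   # sorted range starts
        self._ends = []     # matching inclusive ends (also sorted)

    def add_range(self, start, end):
        # First stored range that can touch [start, end]: its end >= start - 1.
        lo = bisect.bisect_left(self._ends, start - 1)
        # One past the last stored range that can touch it: its start <= end + 1.
        hi = bisect.bisect_right(self._starts, end + 1)
        if lo < hi:  # absorb every stored range in [lo, hi)
            start = min(start, self._starts[lo])
            end = max(end, self._ends[hi - 1])
        self._starts[lo:hi] = [start]
        self._ends[lo:hi] = [end]

    def ranges(self):
        return [list(pair) for pair in zip(self._starts, self._ends)]

    def missing_ranges(self):
        # Gaps between consecutive stored ranges.
        return [[e + 1, s - 1] for e, s in zip(self._ends, self._starts[1:])]

rs = IntRangeSet()
for start, end in [(0, 10), (20, 30), (5, 15), (50, 60)]:
    rs.add_range(start, end)
print(rs.ranges())          # [[0, 15], [20, 30], [50, 60]]
print(rs.missing_ranges())  # [[16, 19], [31, 49]]
```

The two lookups are logarithmic; the slice assignment is still linear in the worst case, which is usually fine for a modest number of ranges.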

You have hit on a good solution in your example use case. Rather than try to maintain a set of the ranges that have been used, keep track of the ranges that haven't been used. This makes the problem pretty easy.
class RangeSet:
    def __init__(self, min, max):
        self.__gaps = [(min, max)]
        self.min = min
        self.max = max
    def add(self, lo, hi):
        new_gaps = []
        for g in self.__gaps:
            for ng in (g[0],min(g[1],lo)),(max(g[0],hi),g[1]):
                if ng[1] > ng[0]: new_gaps.append(ng)
        self.__gaps = new_gaps
    def missing_ranges(self):
        return self.__gaps
    def ranges(self):
        i = iter([self.min] + [x for y in self.__gaps for x in y] + [self.max])
        return [(x,y) for x,y in zip(i,i) if y > x]
The magic is in the add method, which checks each existing gap to see whether it is affected by the new range, and adjusts the list of gaps accordingly.
Note that the behaviour of the tuples used for ranges here is the same as Python's range objects, i.e. they are inclusive of the start value and exclusive of the stop value. This class will not behave in exactly the way you described in your question, where your ranges seem to be inclusive of both.
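To illustrate the half-open convention mentioned above, here is a small standalone sketch of the same gap-splitting step, fed with the inclusive integer ranges from the question by adding 1 to each end. The function name `subtract` and the domain 0..60 are assumptions made for the example:

```python
def subtract(gaps, lo, hi):
    """Remove the half-open range [lo, hi) from a list of half-open gaps."""
    result = []
    for g_lo, g_hi in gaps:
        # Keep whatever part of the gap lies left and right of [lo, hi).
        for piece in ((g_lo, min(g_hi, lo)), (max(g_lo, hi), g_hi)):
            if piece[1] > piece[0]:
                result.append(piece)
    return result

gaps = [(0, 61)]                            # covers the integers 0..60
for start, end in [(0, 10), (20, 30), (5, 15), (50, 60)]:
    gaps = subtract(gaps, start, end + 1)   # inclusive end -> exclusive stop
print(gaps)                                 # [(16, 20), (31, 50)]
print([(lo, hi - 1) for lo, hi in gaps])    # [(16, 19), (31, 49)]
```

Subtracting 1 from each stop converts the result back to the inclusive form used in the question.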

Have a look at portion (https://pypi.org/project/portion/). I'm the maintainer of this library, and it supports disjunctions of continuous intervals out of the box. It automatically simplifies adjacent and overlapping intervals.
Consider the intervals provided in your example:
>>> import portion as P
>>> i = P.closed(0, 10) | P.closed(20, 30) | P.closed(5, 15) | P.closed(50, 60)
>>> # get "used ranges"
>>> i
[0,15] | [20,30] | [50,60]
>>> # get "missing ranges"
>>> i.enclosure - i
(15,20) | (30,50)

Similar to DavidT's answer – also based on sympy's sets, but using a list of any length and addition (union) in a single operation:
import sympy
intervals = [[1,4], [6,10], [3,5], [7,8]] # pairs of left,right
print(intervals)
symintervals = [sympy.Interval(i[0],i[1], left_open=False, right_open=False) for i in intervals]
print(symintervals)
merged = sympy.Union(*symintervals) # one operation; adding to a union one by one is much slower for a large number of intervals
print(merged)
for i in merged.args: # assumes that the "merged" result is a union, not a single interval
    print(i.left, i.right) # getting bounds of merged intervals

Here's my solution:
def flatten(collection):
    subset = set()
    for elem in collection:
        to_add = elem
        to_remove = set()
        for s in subset:
            if s[0] <= to_add[0] <= s[1] or s[0] <= to_add[1] <= s[1] or (s[0] > to_add[0] and s[1] < to_add[1]):
                to_remove.add(s)
                to_add = (min(to_add[0], s[0]), max(to_add[1], s[1]))
        subset -= to_remove
        subset.add(to_add)
    return subset
range_set = {(-12, 4), (3, 20), (21, 25), (25, 30), (-13, -11), (5, 10), (-13, 20)}
print(flatten(range_set))
# {(21, 30), (-13, 20)}

Related

Finding singulars/sets of local maxima/minima in a 1D-NumPy array (once again)

I would like to have a function that can detect where the local maxima/minima are in an array (even if there is a set of local maxima/minima). Example:
Given the array
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
I would like to have an output like:
set of 2 local minima => array[0]:array[1]
set of 3 local minima => array[3]:array[5]
local minima, i = 9
set of 2 local minima => array[11]:array[12]
set of 2 local minima => array[15]:array[16]
As you can see from the example, not only are the singular values detected but, also, sets of local maxima/minima.
I know in this question there are a lot of good answers and ideas, but none of them do the job described: some of them simply ignore the extreme points of the array and all ignore the sets of local minima/maxima.
Before asking the question, I wrote a function by myself that does exactly what I described above (the function is at the end of this question: local_min(a). With the test I did, it works properly).
Question: However, I am also sure that is NOT the best way to work with Python. Are there builtin functions, APIs, libraries, etc. that I can use? Any other function suggestion? A one-line instruction? A full vectored solution?
def local_min(a):
    candidate_min=0
    for i in range(len(a)):
        # Controlling the first left element
        if i==0 and len(a)>=1:
            # If the first element is a singular local minima
            if a[0]<a[1]:
                print("local minima, i = 0")
            # If the element is a candidate to be part of a set of local minima
            elif a[0]==a[1]:
                candidate_min=1
        # Controlling the last right element
        if i == (len(a)-1) and len(a)>=1:
            if candidate_min > 0:
                if a[len(a)-1]==a[len(a)-2]:
                    print("set of " + str(candidate_min+1)+ " local minima => array["+str(i-candidate_min)+"]:array["+str(i)+"]")
            if a[len(a)-1]<a[len(a)-2]:
                print("local minima, i = " + str(len(a)-1))
        # Controlling the other values in the middle of the array
        if i>0 and i<len(a)-1 and len(a)>2:
            # If a singular local minima
            if (a[i]<a[i-1] and a[i]<a[i+1]):
                print("local minima, i = " + str(i))
                # print(str(a[i-1])+" > " + str(a[i]) + " < "+str(a[i+1])) #debug
            # If it was found a set of candidate local minima
            if candidate_min >0:
                # The candidate set IS a set of local minima
                if a[i] < a[i+1]:
                    print("set of " + str(candidate_min+1)+ " local minima => array["+str(i-candidate_min)+"]:array["+str(i)+"]")
                    candidate_min = 0
                # The candidate set IS NOT a set of local minima
                elif a[i] > a[i+1]:
                    candidate_min = 0
                # The set of local minima is growing
                elif a[i] == a[i+1]:
                    candidate_min = candidate_min + 1
                # It never should arrive in the last else
                else:
                    print("Something strange happen")
                    return -1
            # If there is a set of candidate local minima (first value found)
            if (a[i]<a[i-1] and a[i]==a[i+1]):
                candidate_min = candidate_min + 1
Note: I tried to enrich the code with some comments to explain what it does. I know that the function I propose is not clean and just prints the results, which could instead be stored and returned at the end. It was written to give an example. The algorithm I propose should be O(n).
UPDATE:
Somebody suggested using argrelextrema (from scipy.signal import argrelextrema) like this:
def local_min_scipy(a):
    minima = argrelextrema(a, np.less_equal)[0]
    return minima
def local_max_scipy(a):
    maxima = argrelextrema(a, np.greater_equal)[0]
    return maxima
To have something like that is what I am really looking for. However, it doesn't work properly when the sets of local minima/maxima have more than two values. For example:
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
print(local_max_scipy(test03))
The output is:
[ 0 2 4 8 10 13 14 16]
Of course in test03[4] I have a minimum and not a maximum. How do I fix this behavior? (I don't know if this is another question or if this is the right place where to ask it.)

A full vectored solution:
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1]) # Size 17
extended = np.empty(len(test03)+2) # Rooms to manage edges, size 19
extended[1:-1] = test03
extended[0] = extended[-1] = np.inf
flag_left = extended[:-1] <= extended[1:] # Less than successor, size 18
flag_right = extended[1:] <= extended[:-1] # Less than predecessor, size 18
flagmini = flag_left[1:] & flag_right[:-1] # Local minimum, size 17
mini = np.where(flagmini)[0] # Indices of minimums
spl = np.where(np.diff(mini)>1)[0]+1 # Places to split
result = np.split(mini, spl)
result:
[0, 1] [3, 4, 5] [9] [11, 12] [15, 16]
EDIT
Unfortunately, this also detects maxima as soon as they are at least 3 items wide, since their interior looks like a flat local minimum. Patching that in numpy along these lines would be ugly.
To solve this problem I propose 2 other solutions, first with numpy, then with numba.
With numpy, using np.diff:
import numpy as np
test03=np.array([12,13,12,4,4,4,5,6,7,2,6,5,5,7,7,17,17])
extended=np.full(len(test03)+2,np.inf)
extended[1:-1]=test03
slope = np.sign(np.diff(extended)) # 1 if ascending,0 if flat, -1 if descending
not_flat,= slope.nonzero() # Indices where data is not flat.
local_min_inds, = np.where(np.diff(slope[not_flat])==2)
#local_min_inds contains indices in not_flat of beginning of local mins.
#Indices of End of local mins are shift by +1:
start = not_flat[local_min_inds]
stop = not_flat[local_min_inds+1]-1
print(*zip(start,stop))
#(0, 1) (3, 5) (9, 9) (11, 12) (15, 16)
A direct solution compatible with numba acceleration :
# @numba.njit
def localmins(a):
    begin= np.empty(a.size//2+1,np.int32)
    end = np.empty(a.size//2+1,np.int32)
    i=k=0
    begin[k]=0
    search_end=True
    while i<a.size-1:
        if a[i]>a[i+1]:
            begin[k]=i+1
            search_end=True
        if search_end and a[i]<a[i+1]:
            end[k]=i
            k+=1
            search_end=False
        i+=1
    if search_end and i>0 : # Final plate if exists
        end[k]=i
        k+=1
    return begin[:k],end[:k]
print(*zip(*localmins(test03)))
#(0, 1) (3, 5) (9, 9) (11, 12) (15, 16)

I think another function from scipy.signal would be interesting.
from scipy.signal import find_peaks
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
find_peaks(test03)
Out[]: (array([ 2, 8, 10, 13], dtype=int64), {})
find_peaks has lots of options and might be quite useful, especially for noisy signals.
Update
The function is really powerful and versatile. You can set several parameters for peak minimal width, height, distance from each other and so on. As example:
test04 = np.array([1,1,5,5,5,5,5,5,5,5,1,1,1,1,1,5,5,5,1,5,1,5,1])
find_peaks(test04, width=1)
Out[]:
(array([ 5, 16, 19, 21], dtype=int64),
{'prominences': array([4., 4., 4., 4.]),
'left_bases': array([ 1, 14, 18, 20], dtype=int64),
'right_bases': array([10, 18, 20, 22], dtype=int64),
'widths': array([8., 3., 1., 1.]),
'width_heights': array([3., 3., 3., 3.]),
'left_ips': array([ 1.5, 14.5, 18.5, 20.5]),
'right_ips': array([ 9.5, 17.5, 19.5, 21.5])})
See documentation for more examples.

There can be multiple ways to solve this. One approach is listed here.
You can create a custom function and use the maximum value to handle edge cases while finding minima.
import numpy as np
a = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
def local_min(a):
    temp_list = list(a)
    maxval = max(a) #use max while finding minima
    temp_list = temp_list + [maxval] #handles last value edge case.
    prev = maxval #prev stores last value seen
    loc = 0 #used to store starting index of minima
    count = 0 #use to count repeated values
    #match_start = False
    matches = []
    for i in range(0, len(temp_list)): #need to check all values including the padded value
        if prev == temp_list[i]:
            if count > 0: #only increment for minima candidates
                count += 1
        elif prev > temp_list[i]:
            count = 1
            loc = i
            # match_start = True
        else: #prev < temp_list[i]
            if count > 0:
                matches.append((loc, count))
            count = 0
            loc = i
        prev = temp_list[i]
    return matches
result = local_min(a)
for match in result:
    print ("{} minima found starting at location {} and ending at location {}".format(
        match[1],
        match[0],
        match[0] + match[1] -1))
Let me know if this does the trick for you. The idea is simple, you want to iterate through the list once and keep storing minima as you see them. Handle the edges by padding with maximum values on either end. (or by padding the last end, and using the max value for initial comparison)

Here's an answer based on restriding the array into an iterable of windows:
import numpy as np
from numpy.lib.stride_tricks import as_strided
def windowstride(a, window):
    return as_strided(a, shape=(a.size - window + 1, window), strides=2*a.strides)
def local_min(a, maxwindow=None, doends=True):
    if doends: a = np.pad(a.astype(float), 1, 'constant', constant_values=np.inf)
    if maxwindow is None: maxwindow = a.size - 1
    mins = []
    for i in range(3, maxwindow + 1):
        for j,w in enumerate(windowstride(a, i)):
            if (w[0] > w[1]) and (w[-2] < w[-1]):
                if (w[1:-1]==w[1]).all():
                    mins.append((j, j + i - 2))
    mins.sort()
    return mins
Testing it out:
test03=np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
local_min(test03)
Output:
[(0, 2), (3, 6), (9, 10), (11, 13), (15, 17)]
Not the most efficient algorithm, but at least it's short. I'm pretty sure it's O(n^2), since there's roughly 1/2*(n^2 + n) windows to iterate over. This is only partially vectorized, so there may be a way to improve it.
Edit
To clarify, the output is the indices of the slices that contain the runs of local minimum values. The fact that they go one past the end of the run is intentional (someone just tried to "fix" that in an edit). You can use the output to iterate over the slices of minimum values in your input array like this:
for s in local_min(test03):
    print(test03[slice(*s)])
Output:
[2 2]
[4 4 4]
[2]
[5 5]
[1 1]

A pure numpy solution (revised answer):
import numpy as np
y = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
x = np.r_[y[0]+1, y, y[-1]+1] # pad edges, gives possibility for minima
ups, = np.where(x[:-1] < x[1:])
downs, = np.where(x[:-1] > x[1:])
minend = ups[np.unique(np.searchsorted(ups, downs))]
minbeg = downs[::-1][np.unique(np.searchsorted(-downs[::-1], -ups[::-1]))][::-1]
minlen = minend - minbeg
for line in zip(minlen, minbeg, minend-1): print("set of %d minima %d - %d" % line)
This gives
set of 2 minima 0 - 1
set of 3 minima 3 - 5
set of 1 minima 9 - 9
set of 2 minima 11 - 12
set of 2 minima 15 - 16
np.searchsorted(ups, downs) finds the first ups after every down. This is the "true" end of a minimum.
For the start of the minima, we do it similar, but now in reverse order.
It is working for the example, yet not fully tested. But I would say a good starting point.

You can use argrelmax, as long as there are no multiple consecutive equal elements, so first you need to run-length encode the array, then use argrelmax (or argrelmin):
import numpy as np
from scipy.signal import argrelmax
from itertools import groupby
def local_max_scipy(a):
    start = 0
    result = [[a[0] - 1, 0, 0]] # this is to guarantee the left edge is included
    for k, g in groupby(a):
        length = sum(1 for _ in g)
        result.append([k, start, length])
        start += length
    result.append([a[-1] - 1, 0, 0]) # this is to guarantee the right edge is included
    arr = np.array(result)
    maxima, = argrelmax(arr[:, 0])
    return arr[maxima]
test03 = np.array([2, 2, 10, 4, 4, 4, 5, 6, 7, 2, 6, 5, 5, 7, 7, 1, 1])
output = local_max_scipy(test03)
for val, start, length in output:
    print(f'set of {length} maxima start:{start} end:{start + length}')
Output
set of 1 maxima start:2 end:3
set of 1 maxima start:8 end:9
set of 1 maxima start:10 end:11
set of 2 maxima start:13 end:15
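The same run-length-encoding idea also works for minima without scipy. The following is only a sketch under that assumption: `local_min_runs` is a hypothetical name, and it treats a missing neighbour at either edge as "higher", so plateaus touching the ends of the array count as minima:

```python
from itertools import groupby

def local_min_runs(values):
    """Return inclusive (start, end) index pairs of local-minimum plateaus."""
    # Run-length encode: one (value, start, length) entry per run of equal values.
    runs, start = [], 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((value, start, length))
        start += length
    result = []
    for i, (value, run_start, length) in enumerate(runs):
        # A missing neighbour at either edge counts as "higher".
        left_ok = i == 0 or runs[i - 1][0] > value
        right_ok = i == len(runs) - 1 or runs[i + 1][0] > value
        if left_ok and right_ok:
            result.append((run_start, run_start + length - 1))
    return result

test03 = [2, 2, 10, 4, 4, 4, 5, 6, 7, 2, 6, 5, 5, 7, 7, 1, 1]
print(local_min_runs(test03))
# [(0, 1), (3, 5), (9, 9), (11, 12), (15, 16)]
```

Because adjacent runs never hold equal values, a strict comparison between neighbouring runs is enough, which avoids the plateau problem seen with np.less_equal.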

Most frequently overlapping range - Python3.x

I'm a beginner, trying to write code listing the most frequently overlapping ranges in a list of ranges.
So, input is various ranges (#1 through #7 in the example figure; https://prntscr.com/kj80xl) and I would like to find the most common range (in the example 3,000- 4,000 in 6 out of 7 - 86 %). Actually, I would like to find top 5 most frequent.
Not all ranges overlap. Ranges are always positive and given as integers with 1 distance (standard range).
What I have now is only code comparing one sequence to another and returning the overlap, but after that I'm stuck.
def range_overlap(range_x,range_y):
    x = (range_x[0], (range_x[-1])+1)
    y = (range_y[0], (range_y[-1])+1)
    overlap = (max(x[0],y[0]),min(x[-1],(y[-1])))
    if overlap[0] <= overlap[1]:
        return range(overlap[0], overlap[1])
    else:
        return "Out of range"
I would be very grateful for any help.

Better solution
I came up with a simpler solution (at least IMHO) so here it is:
def get_abs_min(ranges):
    return min([min(r) for r in ranges])
def get_abs_max(ranges):
    return max([max(r) for r in ranges])
def count_appearances(i, ranges):
    return sum([1 for r in ranges if i in r])
def create_histogram(ranges):
    keys = [str(i) for i in range(len(ranges) + 1)]
    histogram = dict.fromkeys(keys)
    results = []
    # Use the function argument (not a global) and avoid shadowing min/max.
    lo = get_abs_min(ranges)
    hi = get_abs_max(ranges)
    for i in range(lo, hi):
        count = str(count_appearances(i, ranges))
        if histogram[count] is None:
            histogram[count] = dict(start=i, end=None)
        elif histogram[count]['end'] is None:
            histogram[count]['end'] = i
        elif histogram[count]['end'] == i - 1:
            histogram[count]['end'] = i
        else:
            start = histogram[count]['start']
            end = histogram[count]['end']
            results.append((range(start, end + 1), count))
            histogram[count]['start'] = i
            histogram[count]['end'] = None
    for count, d in histogram.items():
        if d is not None and d['start'] is not None and d['end'] is not None:
            results.append((range(d['start'], d['end'] + 1), count))
    return results
def main(ranges, top):
    appearances = create_histogram(ranges)
    return sorted(appearances, key=lambda t: t[1], reverse=True)[:top]
The idea here is as simple as iterating through a superposition of all the ranges and building a histogram of appearances (e.g. the number of original ranges this current i appears in)
After that just sort and slice according to the chosen size of the results.
Just call main with the ranges and the top number you want (or None if you want to see all results).
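The histogram idea described above can also be sketched more compactly with collections.Counter: count how many ranges cover each integer, then group consecutive integers that share the same count. This is an illustrative variant, not the code from this answer; `coverage_segments` and the sample input are made up for the example:

```python
from collections import Counter
from itertools import groupby

def coverage_segments(ranges):
    """Return (range, count) pairs for maximal runs of equally covered integers."""
    coverage = Counter()
    for r in ranges:
        coverage.update(r)          # +1 for every integer the range contains
    points = sorted(coverage)
    # Consecutive integers with the same count share (value - position, count).
    key = lambda pair: (pair[1] - pair[0], coverage[pair[1]])
    segments = []
    for (_, count), run in groupby(enumerate(points), key=key):
        run = [value for _, value in run]
        segments.append((range(run[0], run[-1] + 1), count))
    # Most frequently covered segments first (stable for ties).
    return sorted(segments, key=lambda seg: seg[1], reverse=True)

sample = [range(1, 8), range(3, 6), range(4, 10)]   # made-up input
print(coverage_segments(sample)[:2])
# [(range(4, 6), 3), (range(3, 4), 2)]
```

Like the histogram version, this touches every integer in every range, so it is best suited to ranges of modest size.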
OLDER EDITS BELOW
I (almost) agree with @Kasramvd's answer.
Here is my take on it:
from collections import Counter
from itertools import combinations
def range_overlap(x, y):
    # Sort, because set iteration order is not guaranteed.
    common_part = sorted(set(x) & set(y))
    if common_part:
        return range(common_part[0], common_part[-1] +1)
    else:
        return False
def get_most_common(range_list, top_frequent):
    overlaps = Counter(range_overlap(i, j) for i, j in
                       combinations(range_list, 2))
    return [(r, i) for (r, i) in overlaps.most_common(top_frequent) if r]
You need to pass in the range_list and the number of top_frequent results you want.
EDIT
the previous answer solved this question for all 2's combinations over the range list.
This edit is tested against your input and results with the correct answer:
from collections import Counter
from itertools import combinations
def range_overlap(*args):
    sets = [set(r) for r in args]
    common_part = list(set(args[0]).intersection(*sets))
    if common_part:
        return range(common_part[0], common_part[-1] +1)
    else:
        return False
def get_all_possible_combinations(range_list):
    all_combos = []
    for i in range(2, len(range_list)):
        all_combos.append(combinations(range_list, i))
    all_combos = [list(combo) for combo in all_combos]
    return all_combos
def get_most_common_for_combo(combo):
    return list(filter(None, [range_overlap(*option) for option in combo]))
def get_most_common(range_list, top_frequent):
    all_overlaps = []
    combos = get_all_possible_combinations(range_list)
    for combo in combos:
        all_overlaps.extend(get_most_common_for_combo(combo))
    return [r for (r, i) in Counter(all_overlaps).most_common(top_frequent) if r]
And to get the results just run get_most_common(range_list, top_frequent)
Tested on my machine (Ubuntu 16.04 with Python 3.5.2) with your input range_list and top_frequent = 5, with the results:
[range(3000, 4000), range(2500, 4000), range(1500, 4000), range(3000, 6000), range(1, 4000)]

You can first change your function to return a valid range in both cases so that you can use it in a set of comparisons. Also, since Python's range objects are not pre-built sequences but lazy objects that only store the start, stop and step attributes and produce values on demand, you can simplify your function a little as well.
def range_overlap(range_x,range_y):
    rng = range(max(range_x.start, range_y.start),
                min(range_x.stop, range_y.stop)+1)
    if rng.start < rng.stop:
        return rng.start, rng.stop
Now, if you have a set of ranges and you want to compare all the pairs, you can use itertools.combinations to get all the pairs; then, using range_overlap and collections.Counter, you can count how often each overlap occurs.
from collections import Counter
from itertools import combinations
overlaps = Counter(range_overlap(i,j) for i, j in
                   combinations(list_of_ranges, 2))

How to join integers intervals in python?

I have used the module interval (http://pyinterval.readthedocs.io/en/latest/index.html)
and created an interval from a set of (start, end) pairs:
intervals = interval.interval([1,8], [7,10], [15,20])
This results in interval([1.0, 10.0], [15.0, 20.0]), as [1,8] and [7,10] overlap.
But this module interprets the values of the pairs as real numbers, so two intervals that are contiguous as integers will not be joined together.
Example:
intervals = interval.interval([1,8], [9,10], [11,20])
results in: interval([1.0, 8.0], [9.0, 10.0], [11.0, 20.0])
My question is: how can I join these intervals as integers and not as real numbers? In the last example the result would then be interval([1.0, 20.0]).

The pyinterval module is meant for real numbers, not for integers. If you want to use objects, you can create an integer interval class, or you can write a function that joins integer intervals using the interval module:
def join_int_intervals(int1, int2):
    if int(int1[-1][-1])+1 >= int(int2[-1][0]):
        return interval.interval([int1[-1][0], int2[-1][-1]])
    else:
        return interval.interval()

I believe you can use pyinterval for integer intervals too by adding interval([-0.5, 0.5]). With your example you get
In[40]: interval([1,8], [9,10], [11,20]) + interval([-0.5, 0.5])
Out[40]: interval([0.5, 20.5])

This takes a list of tuples like l = [(25,24), (17,18), (5,9), (24,16), (10,13), (15,19), (22,25)]
# Idea by Ben Voigt in https://stackoverflow.com/questions/32869247/a-container-for-integer-intervals-such-as-rangeset-for-c
def sort_condense(ivs):
    if len(ivs) == 0:
        return []
    if len(ivs) == 1:
        if ivs[0][0] > ivs[0][1]:
            return [(ivs[0][1], ivs[0][0])]
        else:
            return ivs
    # Collect endpoints; False marks a left end, True a right end.
    eps = []
    for iv in ivs:
        ivl = min(iv)
        ivr = max(iv)
        eps.append((ivl, False))
        eps.append((ivr, True))
    eps.sort()
    ret = []
    level = 0
    i = 0
    while i < len(eps)-1:
        if not eps[i][1]:
            level = level+1
            if level == 1:
                left = eps[i][0]
        else:
            if level == 1:
                # An interval starting right after this one ends is adjacent: skip both endpoints.
                if not eps[i+1][1] and eps[i+1][0] == eps[i][0]+1:
                    i = i+2
                    continue
                right = eps[i][0]
                ret.append((left, right))
            level = level-1
        i = i+1
    ret.append((left, eps[len(eps)-1][0]))
    return ret
In [1]: sort_condense(l)
Out[1]: [(5, 13), (15, 25)]
The idea is outlined in Ben Voigt's answer to A container for integer intervals, such as RangeSet, for C++
Python is not my main language, sorry.

I came up with the following program:
ls = [[1,8], [7,10], [15,20]]
ls2 = []
prevList = ls[0]
for lists in ls[1:]:
    if lists[0] <= prevList[1]+1:
        # keep the larger end so a contained interval cannot shrink the result
        prevList = [prevList[0], max(prevList[1], lists[1])]
    else:
        ls2.append(prevList)
        prevList = lists
ls2.append(prevList)
print(ls2) # prints [[1, 10], [15, 20]]
It iterates through all the lists (which must already be sorted by start) and checks whether the first element of each list is less than or equal to the end of the previous one + 1. If so, it merges the two.
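If the input is not guaranteed to be sorted, the same "end + 1" test can be applied after sorting. A minimal sketch, assuming inclusive integer intervals given as pairs; `merge_int_intervals` is a hypothetical helper name:

```python
def merge_int_intervals(intervals):
    """Merge inclusive integer intervals that overlap or are adjacent."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or touching: extend the last interval, never shrink it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

print(merge_int_intervals([[1, 8], [7, 10], [15, 20]]))   # [[1, 10], [15, 20]]
print(merge_int_intervals([[11, 20], [1, 8], [9, 10]]))   # [[1, 20]]
print(merge_int_intervals([[1, 10], [2, 3]]))             # [[1, 10]]
```

Sorting first costs O(n log n), after which a single pass is enough because any interval that can merge with the current one must come next in start order.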

How to make a random but partial shuffle in Python?

Instead of a complete shuffle, I am looking for a partial shuffle function in python.
Example : "string" must give rise to "stnrig", but not "nrsgit"
It would be better if I can define a specific "percentage" of characters that have to be rearranged.
Purpose is to test string comparison algorithms. I want to determine the "percentage of shuffle" beyond which an(my) algorithm will mark two (shuffled) strings as completely different.
Update :
Here is my code. Improvements are welcome !
import random
percent_to_shuffle = int(raw_input("Give the percent value to shuffle : "))
to_shuffle = list(raw_input("Give the string to be shuffled : "))
num_of_chars_to_shuffle = int((len(to_shuffle)*percent_to_shuffle)/100)
for i in range(0,num_of_chars_to_shuffle):
    x=random.randint(0,(len(to_shuffle)-1))
    y=random.randint(0,(len(to_shuffle)-1))
    z=to_shuffle[x]
    to_shuffle[x]=to_shuffle[y]
    to_shuffle[y]=z
print ''.join(to_shuffle)

This problem is simpler than it looks and, as usual, the language has the right tools so that nothing stands between you and the idea:
import random
def pashuffle(string, perc=10):
    data = list(string)
    for index, letter in enumerate(data):
        if random.randrange(0, 100) < perc/2:
            new_index = random.randrange(0, len(data))
            data[index], data[new_index] = data[new_index], data[index]
    return "".join(data)

Your problem is tricky, because there are some edge cases to think about:
Strings with repeated characters (i.e. how would you shuffle "aaaab"?)
How do you measure chained character swaps or re arranging blocks?
In any case, the metric defined to shuffle strings up to a certain percentage is likely to be the same you are using in your algorithm to see how close they are.
My code to shuffle n characters:
import random
def shuffle_n(s, n):
    idx = list(range(len(s)))  # a real list, so random.shuffle also works on Python 3
    random.shuffle(idx)
    idx = idx[:n]
    mapping = dict((idx[i], idx[i-1]) for i in range(n))
    return ''.join(s[mapping.get(x,x)] for x in range(len(s)))
Basically chooses n positions to swap at random, and then exchanges each of them with the next in the list... This way it ensures that no inverse swaps are generated and exactly n characters are swapped (if there are characters repeated, bad luck).
Explained run with 'string', 3 as input:
idx is [0, 1, 2, 3, 4, 5]
we shuffle it, now it is [5, 3, 1, 4, 0, 2]
we take just the first 3 elements, now it is [5, 3, 1]
those are the characters that we are going to swap
s t r i n g
^ ^ ^
t (1) will be i (3)
i (3) will be g (5)
g (5) will be t (1)
the rest will remain unchanged
so we get 'sirgnt'
The bad thing about this method is that it does not generate all the possible variations, for example, it could not make 'gnrits' from 'string'. This could be fixed by making partitions of the indices to be shuffled, like this:
import random
def randparts(l):
    n = len(l)
    s = random.randint(0, n-1) + 1
    if s >= 2 and n - s >= 2: # the split makes two valid parts
        yield l[:s]
        for p in randparts(l[s:]):
            yield p
    else: # the split would make a single cycle
        yield l
def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    # the outer loop over the parts must come first so that x is defined
    mapping = dict((x[i], x[i-1])
                   for x in randparts(idx[:n])
                   for i in range(len(x)))
    return ''.join(s[mapping.get(x,x)] for x in range(len(s)))

import random
def partial_shuffle(a, part=0.5):
    # which characters are to be shuffled:
    idx_todo = random.sample(xrange(len(a)), int(len(a) * part))
    # what are the new positions of these to-be-shuffled characters:
    idx_target = idx_todo[:]
    random.shuffle(idx_target)
    # map all "normal" character positions {0:0, 1:1, 2:2, ...}
    mapper = dict((i, i) for i in xrange(len(a)))
    # update with all shuffles in the string: {old_pos:new_pos, old_pos:new_pos, ...}
    mapper.update(zip(idx_todo, idx_target))
    # use mapper to modify the string:
    return ''.join(a[mapper[i]] for i in xrange(len(a)))
for i in xrange(5):
    print partial_shuffle('abcdefghijklmnopqrstuvwxyz', 0.2)
prints
abcdefghljkvmnopqrstuxwiyz
ajcdefghitklmnopqrsbuvwxyz
abcdefhwijklmnopqrsguvtxyz
aecdubghijklmnopqrstwvfxyz
abjdefgcitklmnopqrshuvwxyz

Evil and using a deprecated API:
import random
# adjust constant to taste
# 0 -> no effect, 0.5 -> completely shuffled, 1.0 -> reversed
# Of course this assumes your input is already sorted ;)
''.join(sorted(
'abcdefghijklmnopqrstuvwxyz',
cmp = lambda a, b: cmp(a, b) * (-1 if random.random() < 0.2 else 1)
))

maybe like so:
>>> s = 'string'
>>> shufflethis = list(s[2:])
>>> random.shuffle(shufflethis)
>>> s[:2]+''.join(shufflethis)
'stingr'
Borrowing from fortran's idea, I'm adding this to the collection. It's pretty fast:
def partial_shuffle(st, p=20):
    p = int(round(p/100.0*len(st)))
    idx = range(len(st))
    sample = random.sample(idx, p)
    res=str()
    samptrav = 1
    for i in range(len(st)):
        if i in sample:
            res += st[sample[-samptrav]]
            samptrav += 1
            continue
        res += st[i]
    return res

Counting number of values between interval

Is there an efficient way in Python to count how many values in an array fall within certain intervals? The number of intervals I will be using may get quite large.
like:
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
some function(mylist, startpoints):
# startpoints = [0,10,20]
count values in range [0,9]
count values in range [10-19]
output = [9,10]

You will have to iterate over the list at least once.
The solution below works with any sequence/interval that implements comparison (<, >, etc.) and uses the bisect algorithm to find the correct point in the interval list, so it is very fast.
It will work with floats, text, or whatever. Just pass a sequence and a list of the intervals.
from collections import defaultdict
from bisect import bisect_left
def count_intervals(sequence, intervals):
    count = defaultdict(int)
    intervals.sort()
    for item in sequence:
        pos = bisect_left(intervals, item)
        if pos == len(intervals):
            count[None] += 1
        else:
            count[intervals[pos]] += 1
    return count
data = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
print count_intervals(data, [10, 20])
Will print
defaultdict(<type 'int'>, {10: 10, 20: 9})
Meaning that 10 values are <= 10 and the other 9 values are > 10 and <= 20.
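To get a plain list of counts keyed by the question's startpoints, the same bisect idea can be applied with bisect_right. A minimal sketch, assuming each bin is half-open, [startpoints[i], startpoints[i+1]), and that values outside every bin are ignored; `count_by_startpoints` is a hypothetical name:

```python
from bisect import bisect_right

def count_by_startpoints(values, startpoints):
    """Count values in each half-open bin [startpoints[i], startpoints[i+1])."""
    edges = sorted(startpoints)
    counts = [0] * (len(edges) - 1)
    for value in values:
        # Index of the bin whose left edge is the last one <= value.
        index = bisect_right(edges, value) - 1
        if 0 <= index < len(counts):    # ignore values outside all bins
            counts[index] += 1
    return counts

mylist = [4, 4, 1, 18, 2, 15, 6, 14, 2, 16, 2, 17, 12, 3, 12, 4, 15, 5, 17]
print(count_by_startpoints(mylist, [0, 10, 20]))   # [10, 9]
```

Each lookup is O(log k) for k startpoints, so the total cost is O(n log k) even when the number of intervals is large.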

I don't know how large your list will get but here's another approach.
import numpy as np
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
np.histogram(mylist, bins=[0,9,19])

You can also use a combination of value_counts() and pd.cut() to help you get the job done.
import pandas as pd
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
split_mylist = pd.cut(mylist, [0, 9, 19]).value_counts(sort = False)
print(split_mylist)
This piece of code will return this:
(0, 9] 10
(9, 19] 9
dtype: int64
Then you can use the tolist() method to get what you want:
split_mylist = split_mylist.tolist()
print(split_mylist)
Output: [10, 9]

If the numbers are integers, as in your example, representing the intervals as frozensets may well be fastest (worth trying). I'm not sure whether the intervals are guaranteed to be mutually exclusive; if they are not, then:
intervals = [frozenset(range(10)), frozenset(range(10, 20))]
counts = [0] * len(intervals)
for n in mylist:
    for i, inter in enumerate(intervals):
        if n in inter:
            counts[i] += 1
If the intervals are mutually exclusive, this code could be sped up a bit by breaking out of the inner loop right after the increment. However, for mutually exclusive intervals of integers >= 0, there's an even more attractive option: first, prepare an auxiliary index, e.g. given your startpoints data structure that could be
indices = [sum(i >= x for x in startpoints) - 1 for i in range(max(startpoints))]
and then
counts = [0] * len(intervals)
for n in mylist:
    if 0 <= n < len(indices):
        counts[indices[n]] += 1
This can be adjusted if the intervals can be < 0 (everything needs to be offset by -min(startpoints) in that case).
If the "numbers" can be arbitrary floats (or decimal.Decimals, etc.), not just integers, the possibilities for optimization are more restricted. Is that the case...?
